← Back to work
Project
PythonBNGBiodiversityData Processing

BNG Metric Matrix Processor v0.2

Upload BNG Metric File

Upload an Excel file (.xlsx, .xls, or .xlsm) containing BNG Metric data. The tool processes the "G-2 Habitat groups" sheet and generates a pivot table summary.

Summary: Challenges with BNG Metric Excel File Processing

Business Context & Goal

The sponsor needs to calculate net unit increases by final habitat type (units available for sale), accounting for net losses from transitions. This requires:

  • Parcel-to-parcel joining across baseline, creation, and enhancement sheets
  • Tracking which habitat types were sacrificed to generate units (e.g., for Sankey flow diagrams)
  • Re-aggregation back to habitat type after parcel-level joins

Why It's Complex

  • The summary page loses information due to aggregation functions
  • Units delivered alone isn't sufficient - you can only sell the net difference, not units delivered
  • Manual parcel-to-parcel joining is labor-intensive (currently ~2 hours per transform) and highly error-prone

Data Quality Challenges

1. Inconsistent Column Naming & Location

  • Same logical column uses different names across sheets (e.g., "Baseline Ref" vs "Baseline ref")
  • Column positions vary by set (D vs E) and sheet type
  • Case and spacing differences require normalization
  • Impact: Can't rely on fixed column positions or names

2. Inconsistent Join Key Availability

  • Not all rows have values in primary join columns
  • Some rows have values in secondary columns (e.g., "Habitat reference")
  • New habitats may have no join keys (e.g., rural tree added)
  • Users can choose which columns/refs to use: either Ref (numeric) or habitat ID (if provided)
  • Impact: A single join strategy doesn't work for all rows

3. Mixed Data Quality Within Sheets

  • Some rows have complete join keys; others are missing or empty
  • Rows with no matching keys should be treated as new habitats, not matched on empty values
  • Empty join keys must never match each other
  • Impact: Requires row-by-row matching that tries multiple strategies per row

4. Complex Header Structures

  • Multi-row headers with merged cells
  • Headers span multiple rows, requiring special parsing logic
  • Impact: Standard Excel parsing fails

5. Set-Specific Variations

  • Different sets (D, E, etc.) have different structures
  • Column names, positions, and data patterns vary by set
  • Impact: One-size-fits-all approach doesn't work

6. Reference ID Conflicts

  • Same reference IDs (e.g., "1") can appear in different sets with different meanings
  • Must track set context to avoid false conflicts
  • Impact: Need set-aware conflict detection

Business Rules & Constraints

Guiding principles that must be enforced:

  • Net changes for a given habitat type should never be negative (input validation requirement; plans may be rejected if violated)
  • Need to re-aggregate back up to habitat type after parcel joins
  • A habitat cannot both be created and enhanced
  • A creation sometimes might not have/doesn't need a baseline ref (e.g., rural tree added)
  • User can choose which cols/refs to use as refs: either Ref (numeric) or habitat ID (if provided)

Current Pain Points

  • Manual process takes ~2 hours per transform
  • Highly error-prone due to data inconsistencies
  • Labor-intensive parcel-to-parcel joining
  • Information loss from summary page aggregation
  • Difficulty tracking which habitats were sacrificed

Solution Benefits

  • Automated processing: Reduces 2-hour manual work to seconds
  • Error detection: Flags inconsistencies and can fail analysis to enforce standards
  • Data quality enforcement: Helps establish a "data/metric completion standard" that projects should be aware of upfront
  • Flexible matching: Row-by-row matching handles inconsistent data automatically
  • Detailed logging: Shows which join strategies were used for troubleshooting
  • Configurable rules: JSON config defines structure and join logic per set, allowing adaptation to variations

Known Limitations & Failure Cases

The solution will fail or require manual intervention for:

  • On-site sheets (A/B/C sets) - not yet supported (future work)
  • Manually corrupted/edited headers
  • Manually added/removed/moved columns
  • Data quality issues (e.g., a row added in both created and enhanced for the same field)

Future Work

  • Add support for A/B/C sets (onsite) - probably more for developers
  • Add support for area/length change in addition to unit change

Value Proposition

This solution transforms a 2-hour, error-prone manual process into an automated, consistent workflow that:

  • Enforces data quality standards
  • Provides transparency through detailed logging
  • Handles real-world data inconsistencies
  • Establishes a "data/metric completion standard" for projects
  • Flags inconsistencies to help improve data quality over time

The flexibility to handle inconsistent data while flagging issues helps move toward standardized, high-quality data inputs that benefit both the analysis process and project planning.

How to use

  1. Upload your BNG Metric Excel file (.xlsx, .xls, or .xlsm)
  2. Click "Process File" to analyze the data
  3. Review the summary statistics showing join results
  4. Examine the habitat transitions table showing all transition records
  5. View the habitat transition matrix showing net units delivered from original to final habitats
  6. Download transitions and matrix as CSV files for further analysis

The tool automatically:

  • Finds and loads baseline, creation, and enhancement sheets
  • Performs row-by-row matching using priority keys configured in metric-config.json
  • Tries each priority key in order until a match is found for each source row
  • Treats rows with no matching keys as new habitats (right_only)
  • Never matches on empty strings - ensures data accuracy
  • Performs full outer joins to preserve all data
  • Filters out zero-unit transitions
  • Prevents duplication by ensuring each baseline row appears only once
  • Generates both row-wise transitions and aggregated matrix views
  • Logs join summaries showing which keys were used for how many rows

Technical notes

Ported from Python scripts in /scripts/bng_metric_processor/ to TypeScript for client-side browser processing. Uses the xlsx library for Excel file parsing. Processes data entirely in-memory in the browser for fast results and privacy (no data sent to server).

The core logic is implemented in biodiv-metric-core.ts and uses a JSON configuration file (metric-config.json) that mirrors the Python metric_config.py structure. The tool supports:

  • Multi-row Excel headers
  • Configurable join strategies with priority keys and fallbacks
  • Full outer joins preserving unmatched rows
  • Automatic filtering of empty and zero-unit rows
  • Habitat transition matrix computation

All processing happens client-side, ensuring data privacy and fast performance.

Process Biodiversity Statutory Metric workbooks to generate habitat transition matrices. This tool extracts habitat data from baseline, creation, and enhancement sheets, joins them by reference columns, and produces aggregated matrices showing habitat-to-habitat transitions.

What It Does

The tool processes Biodiversity Metric Excel workbooks (.xlsm files) to:

  1. Load Data: Reads baseline, creation, and enhancement sheets from the workbook
  2. Join Data: Matches rows between sheets using reference columns (defaults to "Ref")
  3. Build Transitions: Creates transition records showing habitat changes with units lost/delivered
  4. Generate Matrix: Aggregates transitions into a habitat-to-habitat matrix showing net units

Output Files

  • Habitat Transitions: Row-wise transition records with:

    • set_id: The set identifier (e.g., "D")
    • source_kind: Whether this is a 'creation' or 'enhancement' transition
    • merge_type: Join result ('both', 'left_only', 'right_only')
    • reference_id: The join key value used to match rows (e.g., Ref number)
    • original_habitat: Habitat type from baseline
    • final_habitat: Habitat type from creation/enhancement
    • units_lost: Units lost from baseline
    • units_delivered: Units delivered from creation/enhancement
    • net_units: Net change (delivered - lost)

    Note: Rows where both units_lost and units_delivered are 0 are automatically filtered out (empty/placeholder rows).

  • Habitat Matrix: Aggregated n×n matrix where rows = final habitats, columns = original habitats, values = net units

Join Logic

The tool uses row-by-row matching to intelligently join baseline sheets with creation/enhancement sheets. Each source row tries multiple join strategies in priority order until it finds a match, ensuring accurate data alignment even when columns have mixed or missing values.

Row-by-Row Matching Strategy

Unlike traditional joins that select one join key for all rows, this tool processes each source row individually:

  1. For each source row, try priority keys in order:

    • Check if the row has a non-null value in the source column
    • If require_baseline_non_null: true, verify baseline column has data
    • Attempt to match with baseline rows using that key
    • If a match is found, use that key and stop trying others
    • If no match, try the next priority key
  2. Empty strings never match: Rows with empty join key values are never joined. They become right_only (new habitats) instead of incorrectly matching on empty strings.

  3. No fallback to empty joins: If a source row has no matching keys with baseline, it's treated as a new habitat (right_only), not joined on empty values.

Configuration Structure

Join logic is configured in metric-config.json for each set (D, E, etc.):

{
  "joins": {
    "baseline_creation": {
      "priority_keys": [
        {
          "name": "baseline_ref",
          "source_col": { "name": "Baseline Ref", "letter": "AG" },
          "baseline_col": { "name": "Ref", "letter": "D" },
          "require_baseline_non_null": true
        },
        {
          "name": "habitat_ref",
          "source_col": { "name": "Habitat reference", "letter": "AE" },
          "baseline_col": { "name": "Habitat reference", "letter": "AE" },
          "require_baseline_non_null": true
        }
      ],
      "fallback": { "kind": "none" }
    }
  }
}

Configuration Fields:

  • priority_keys: Array of join strategies tried in order
    • name: Human-readable key name (for logging)
    • source_col: Column in creation/enhancement sheet to match
    • baseline_col: Column in baseline sheet to match against
    • require_baseline_non_null: If true, both source AND baseline columns must have data
  • fallback: Currently unused (legacy support)

Example: D-1 (Baseline) ↔ D-2 (Creation) Join

Sheets:

  • D-1 Off-Site Habitat Baseline: Contains original habitats and units lost
  • D-2 Off-Site Habitat Creation: Contains proposed habitats and units delivered

Priority Keys (in order):

  1. baseline_ref: D-2.Baseline Ref (column AG) → D-1.Ref (column D)

    • Used when source row has non-null "Baseline Ref" AND it matches a baseline "Ref"
    • Requires baseline to have data (require_baseline_non_null: true)
  2. habitat_ref: D-2.Habitat reference (column AE) → D-1.Habitat reference (column AE)

    • Used when source row has non-null "Habitat reference" AND it matches a baseline "Habitat reference"
    • Requires baseline to have data (require_baseline_non_null: true)

Example Row Behavior:

  • Row with "Baseline Ref" = "1" → matches using baseline_ref key
  • Row with "Baseline Ref" = empty, "Habitat reference" = "HAB-123" → matches using habitat_ref key
  • Row with all keys empty → becomes right_only (new habitat, not in baseline)

Example: D-1 (Baseline) ↔ D-3 (Enhancement) Join

Sheets:

  • D-1 Off-Site Habitat Baseline: Contains original habitats
  • D-3 Off-Site Habitat Enhancement: Contains enhanced habitats

Priority Keys (in order):

  1. baseline_ref: D-3.Baseline ref (column E) → D-1.Ref (column D)

    • Used when source row has non-null "Baseline ref" AND it matches a baseline "Ref"
    • Requires baseline to have data (require_baseline_non_null: true)
  2. habitat_ref: D-3.Habitat reference (column AT) → D-1.Habitat reference (column AE)

    • Used when source row has non-null "Habitat reference" AND it matches a baseline "Habitat reference"
    • Requires baseline to have data (require_baseline_non_null: true)

Join Summary Logging

After each join, the tool logs a summary showing which keys were used:

Row-by-row matching summary: 15 rows used 'baseline_ref', 3 rows used 'habitat_ref', 2 rows unmatched (right_only - new habitats, not in baseline).

This helps verify that joins are working as expected and identifies how many rows used each strategy.

Join Types in Output

The merge_type column in transitions indicates the join result:

  • both: Row matched between baseline and creation/enhancement
  • left_only: Baseline row with no corresponding creation/enhancement (habitat lost)
  • right_only: Creation/enhancement row with no baseline (new habitat)

Important - No Duplication Logic: Each baseline row appears in only one transition type to prevent duplication:

  • Rows matching creation (D-2) → appear only in creation transitions (even if they also match D-3)
  • Rows matching enhancement (D-3) but not creation → appear only in enhancement transitions
  • Rows matching neither → appear as left_only in creation transitions (representing lost habitats)

This ensures that each land parcel is counted exactly once in the final output.

Key Features

  • Row-by-Row Matching: Each source row tries priority keys individually for accurate matching
  • Safe Empty Handling: Empty join keys never match - prevents incorrect data alignment
  • Configurable Join Strategies: Priority keys defined in JSON config for easy customization
  • Automatic Key Selection: Each row uses the first priority key where it has matching data
  • Empty Row Filtering: Automatically removes rows with missing key data
  • Zero-Unit Filtering: Removes transitions where both units_lost and units_delivered are 0
  • No Duplication: Each baseline row appears in exactly one transition (creation, enhancement, or lost)
  • Reference Tracking: Includes reference_id column showing the join key value used
  • Full Outer Joins: Preserves all data including unmatched rows (new habitats, lost habitats)
  • Join Summary Logging: Shows which keys were used and how many rows matched/unmatched
  • Multi-row Headers: Handles complex Excel header structures