Summary: Challenges with BNG Metric Excel File Processing

Business Context & Goal

The sponsor needs to calculate net unit increases by final habitat type (units available for sale), accounting for net losses from transitions. This requires:

Parcel-to-parcel joining across baseline, creation, and enhancement sheets
Tracking which habitat types were sacrificed to generate units (e.g., for Sankey flow diagrams)
Re-aggregation back to habitat type after parcel-level joins

Why It's Complex

The summary page loses information due to aggregation functions
Units delivered alone isn't sufficient - you can only sell the net difference, not units delivered
Manual parcel-to-parcel joining is labor-intensive (currently ~2 hours per transform) and highly error-prone

Data Quality Challenges

1. Inconsistent Column Naming & Location

Same logical column uses different names across sheets (e.g., "Baseline Ref" vs "Baseline ref")
Column positions vary by set (D vs E) and sheet type
Case and spacing differences require normalization
Impact: Can't rely on fixed column positions or names

2. Inconsistent Join Key Availability

Not all rows have values in primary join columns
Some rows have values in secondary columns (e.g., "Habitat reference")
New habitats may have no join keys (e.g., rural tree added)
Users can choose which columns/refs to use: either Ref (numeric) or habitat ID (if provided)
Impact: A single join strategy doesn't work for all rows

3. Mixed Data Quality Within Sheets

Some rows have complete join keys; others are missing or empty
Rows with no matching keys should be treated as new habitats, not matched on empty values
Empty join keys must never match each other
Impact: Requires row-by-row matching that tries multiple strategies per row

4. Complex Header Structures

Multi-row headers with merged cells
Headers span multiple rows, requiring special parsing logic
Impact: Standard Excel parsing fails

5. Set-Specific Variations

Different sets (D, E, etc.) have different structures
Column names, positions, and data patterns vary by set
Impact: One-size-fits-all approach doesn't work

6. Reference ID Conflicts

Same reference IDs (e.g., "1") can appear in different sets with different meanings
Must track set context to avoid false conflicts
Impact: Need set-aware conflict detection

Business Rules & Constraints

Guiding principles that must be enforced:

Net changes for a given habitat type should never be negative (input validation requirement; plans may be rejected if violated)
Need to re-aggregate back up to habitat type after parcel joins
A habitat cannot both be created and enhanced
A creation sometimes might not have/doesn't need a baseline ref (e.g., rural tree added)
User can choose which cols/refs to use as refs: either Ref (numeric) or habitat ID (if provided)

Current Pain Points

Manual process takes ~2 hours per transform
Highly error-prone due to data inconsistencies
Labor-intensive parcel-to-parcel joining
Information loss from summary page aggregation
Difficulty tracking which habitats were sacrificed

Solution Benefits

Automated processing: Reduces 2-hour manual work to seconds
Error detection: Flags inconsistencies and can fail analysis to enforce standards
Data quality enforcement: Helps establish a "data/metric completion standard" that projects should be aware of upfront
Flexible matching: Row-by-row matching handles inconsistent data automatically
Detailed logging: Shows which join strategies were used for troubleshooting
Configurable rules: JSON config defines structure and join logic per set, allowing adaptation to variations

Known Limitations & Failure Cases

The solution will fail or require manual intervention for:

On-site sheets (A/B/C sets) - not yet supported (future work)
Manually corrupted/edited headers
Manually added/removed/moved columns
Data quality issues (e.g., a row added in both created and enhanced for the same field)

Future Work

Add support for A/B/C sets (onsite) - probably more for developers
Add support for area/length change in addition to unit change

Value Proposition

This solution transforms a 2-hour, error-prone manual process into an automated, consistent workflow that:

Enforces data quality standards
Provides transparency through detailed logging
Handles real-world data inconsistencies
Establishes a "data/metric completion standard" for projects
Flags inconsistencies to help improve data quality over time

The flexibility to handle inconsistent data while flagging issues helps move toward standardized, high-quality data inputs that benefit both the analysis process and project planning.

How to use

Upload your BNG Metric Excel file (.xlsx, .xls, or .xlsm)
Click "Process File" to analyze the data
Review the summary statistics showing join results
Examine the habitat transitions table showing all transition records
View the habitat transition matrix showing net units delivered from original to final habitats
Download transitions and matrix as CSV files for further analysis

The tool automatically:

Finds and loads baseline, creation, and enhancement sheets
Performs row-by-row matching using priority keys configured in metric-config.json
Tries each priority key in order until a match is found for each source row
Treats rows with no matching keys as new habitats (right_only)
Never matches on empty strings - ensures data accuracy
Performs full outer joins to preserve all data
Filters out zero-unit transitions
Prevents duplication by ensuring each baseline row appears only once
Generates both row-wise transitions and aggregated matrix views
Logs join summaries showing which keys were used for how many rows

Technical notes

Ported from Python scripts in /scripts/bng_metric_processor/ to TypeScript for client-side browser processing. Uses the xlsx library for Excel file parsing. Processes data entirely in-memory in the browser for fast results and privacy (no data sent to server).

The core logic is implemented in biodiv-metric-core.ts and uses a JSON configuration file (metric-config.json) that mirrors the Python metric_config.py structure. The tool supports:

Multi-row Excel headers
Configurable join strategies with priority keys and fallbacks
Full outer joins preserving unmatched rows
Automatic filtering of empty and zero-unit rows
Habitat transition matrix computation

All processing happens client-side, ensuring data privacy and fast performance.

Process Biodiversity Statutory Metric workbooks to generate habitat transition matrices. This tool extracts habitat data from baseline, creation, and enhancement sheets, joins them by reference columns, and produces aggregated matrices showing habitat-to-habitat transitions.

What It Does

The tool processes Biodiversity Metric Excel workbooks (.xlsm files) to:

Load Data: Reads baseline, creation, and enhancement sheets from the workbook
Join Data: Matches rows between sheets using reference columns (defaults to "Ref")
Build Transitions: Creates transition records showing habitat changes with units lost/delivered
Generate Matrix: Aggregates transitions into a habitat-to-habitat matrix showing net units

Output Files

Habitat Transitions: Row-wise transition records with:
- set_id: The set identifier (e.g., "D")
- source_kind: Whether this is a 'creation' or 'enhancement' transition
- merge_type: Join result ('both', 'left_only', 'right_only')
- reference_id: The join key value used to match rows (e.g., Ref number)
- original_habitat: Habitat type from baseline
- final_habitat: Habitat type from creation/enhancement
- units_lost: Units lost from baseline
- units_delivered: Units delivered from creation/enhancement
- net_units: Net change (delivered - lost)
Note: Rows where both units_lost and units_delivered are 0 are automatically filtered out (empty/placeholder rows).
Habitat Matrix: Aggregated n×n matrix where rows = final habitats, columns = original habitats, values = net units

Join Logic

The tool uses row-by-row matching to intelligently join baseline sheets with creation/enhancement sheets. Each source row tries multiple join strategies in priority order until it finds a match, ensuring accurate data alignment even when columns have mixed or missing values.

Row-by-Row Matching Strategy

Unlike traditional joins that select one join key for all rows, this tool processes each source row individually:

For each source row, try priority keys in order:
- Check if the row has a non-null value in the source column
- If require_baseline_non_null: true, verify baseline column has data
- Attempt to match with baseline rows using that key
- If a match is found, use that key and stop trying others
- If no match, try the next priority key
Empty strings never match: Rows with empty join key values are never joined. They become right_only (new habitats) instead of incorrectly matching on empty strings.
No fallback to empty joins: If a source row has no matching keys with baseline, it's treated as a new habitat (right_only), not joined on empty values.

Configuration Structure

Join logic is configured in metric-config.json for each set (D, E, etc.):

{
  "joins": {
    "baseline_creation": {
      "priority_keys": [
        {
          "name": "baseline_ref",
          "source_col": { "name": "Baseline Ref", "letter": "AG" },
          "baseline_col": { "name": "Ref", "letter": "D" },
          "require_baseline_non_null": true
        },
        {
          "name": "habitat_ref",
          "source_col": { "name": "Habitat reference", "letter": "AE" },
          "baseline_col": { "name": "Habitat reference", "letter": "AE" },
          "require_baseline_non_null": true
        }
      ],
      "fallback": { "kind": "none" }
    }
  }
}

Configuration Fields:

priority_keys: Array of join strategies tried in order
- name: Human-readable key name (for logging)
- source_col: Column in creation/enhancement sheet to match
- baseline_col: Column in baseline sheet to match against
- require_baseline_non_null: If true, both source AND baseline columns must have data
fallback: Currently unused (legacy support)

Example: D-1 (Baseline) ↔ D-2 (Creation) Join

Sheets:

D-1 Off-Site Habitat Baseline: Contains original habitats and units lost
D-2 Off-Site Habitat Creation: Contains proposed habitats and units delivered

Priority Keys (in order):

baseline_ref: D-2.Baseline Ref (column AG) → D-1.Ref (column D)
- Used when source row has non-null "Baseline Ref" AND it matches a baseline "Ref"
- Requires baseline to have data (require_baseline_non_null: true)
habitat_ref: D-2.Habitat reference (column AE) → D-1.Habitat reference (column AE)
- Used when source row has non-null "Habitat reference" AND it matches a baseline "Habitat reference"
- Requires baseline to have data (require_baseline_non_null: true)

Example Row Behavior:

Row with "Baseline Ref" = "1" → matches using baseline_ref key
Row with "Baseline Ref" = empty, "Habitat reference" = "HAB-123" → matches using habitat_ref key
Row with all keys empty → becomes right_only (new habitat, not in baseline)

Example: D-1 (Baseline) ↔ D-3 (Enhancement) Join