Changelog

All notable changes to BioMetaHarmonizer are documented in this file. Format follows Keep a Changelog.


v0.6.0 — 2025

Added

  • Fixed 51-column output schema defined in _load_final_schema() (ingestion.py). Every record is pre-initialised with all 51 columns so downstream code never needs to handle missing columns.

  • GeoEngine (geo_engine.py) — structured parsing of geo_loc_name strings into six output columns: geo_country, geo_region, geo_locality, geo_iso3166, geo_sea_ocean, geo_loc_raw. Includes ISO 3166-1 resolution via pycountry, UK sub-country handling, country alias table, historical country detection, ocean/sea lookup, and coordinate-only entry detection.

  • DateEngine (date_engine.py) — ISO 8601 truncated date parsing with seven range-detection patterns applied before dateutil to prevent silent misparsing. Populates two output columns: collection_date (point date) and collection_date_range (verbatim original for ranges/approximate).

  • OneHealthClassifier (one_health.py) — multi-layer, multi-field One Health categorization loaded from one_health_dictionaries.json. Supports classify(), classify_joint(), classify_with_confidence(), and classify_multi_field() methods. Confidence model with high, medium, low, and unresolved evidence levels. Optional rapidfuzz fuzzy fallback layer.

  • Antibiogram extraction (_parse_antibiogram() in ingestion.py) — automatic parsing of <Table class="Antibiogram.1.0"> XML tables from NCBI Pathogen BioSample packages. Ten canonical field names via _ANTIBIOGRAM_HEADER_MAP. Result stored as a native list in _extra_attributes["antibiogram"].

  • Assembly accession support (ingestion.py) — GCF_/GCA_ accessions are resolved to BioSample accessions via a two-step process: local NCBI assembly summary index (auto-downloaded, 7-day TTL) followed by an Entrez elink fallback.

  • _extra_attributes column — all BioSample attributes that do not map to a named schema column are preserved as a JSON dict. Multiple values for the same key are pipe-joined.

  • Back-fill from assembly index — bioproject_accession, assembly_accession_refseq, and assembly_accession_genbank are back-filled for all records whose BioSample accession appears in the cached assembly summary files.

  • Two-layer synonym lookup (synonyms.py) — unified.json (Layer 1) plus optional ncbi_attributes.xml (Layer 2, built by scripts/build_ncbi_attribute_cache.py). Result cached via functools.lru_cache.

  • KeyMapper (key_mapper.py) — column renaming and coalescing for custom/non-ingestion workflows using the shared synonym lookup.

  • Output module (output.py) — write() and write_summary() functions supporting CSV, TSV, Excel (openpyxl), and Parquet (pyarrow).

  • CLI (cli.py) — biometaharmonizer run subcommand with full pipeline: ingest → key-map → date/geo/One Health → output. Supports format auto-inference from file extension, comma-separated accession input, --summary fill-rate output, --refresh-cache flag.

  • build_dictionaries.py script — builds one_health_dictionaries.json from OLS4 (ENVO, FoodOn, UBERON, Plant Ontology), NCBI Taxonomy BFS walk, and optional UMLS synonym expansion. Implements base_wins merge strategy and _resolve_collisions() with ambiguous_category_terms output.

  • build_ncbi_attribute_cache.py script — fetches ncbi_attributes.xml from NCBI for Layer 2 synonym coverage.

  • generate_summary_report.py script — generates interactive HTML reports with Plotly visualizations covering data quality, geography, temporal trends, One Health distribution, host analysis, and _extra_attributes coverage.

  • refresh_cache parameter on ingest() and --refresh-cache CLI flag for forcing re-download of assembly summary files.

  • Exponential backoff retry — _MAX_RETRIES = 3, base 2 s, capped at 30 s, applied to all transient Entrez request failures.

  • Null normalization — comprehensive _NULL_PATTERNS regex covering 30+ explicit null/missing/restricted/unknown variants applied to every parsed attribute value.

  • Rate-aware inter-batch sleep — 0.12 s with API key, 0.34 s without, computed from the module-level ENTREZ_API_KEY value.
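
As an illustration of the retry and pacing entries that close this list, the following is a minimal sketch of how those parameters might combine. Only the numbers (3 retries, 2 s base, 30 s cap, 0.12 s and 0.34 s) come from this changelog. The function names, the doubling schedule, and the choice of OSError as the "transient" failure type are assumptions made for the example, not the code in ingestion.py.

```python
import time

_MAX_RETRIES = 3      # from the changelog entry
_BASE_DELAY_S = 2.0   # base delay, from the changelog entry
_MAX_DELAY_S = 30.0   # cap, from the changelog entry


def backoff_delay(attempt):
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped.

    The doubling schedule is an assumption; the changelog states only base and cap.
    """
    return min(_BASE_DELAY_S * (2 ** attempt), _MAX_DELAY_S)


def inter_batch_sleep(api_key):
    """Seconds to pause between Entrez batches: shorter when an API key is set."""
    return 0.12 if api_key else 0.34


def call_with_retry(request, sleep=time.sleep):
    """Call `request()`, retrying up to _MAX_RETRIES times on a transient failure.

    Hypothetical helper: OSError (which covers urllib's URLError and socket
    timeouts) stands in for whatever the real code treats as transient.
    """
    for attempt in range(_MAX_RETRIES + 1):
        try:
            return request()
        except OSError:
            if attempt == _MAX_RETRIES:
                raise  # retries exhausted; surface the last error
            sleep(backoff_delay(attempt))
```

With three retries and a 2 s base, the delays under this schedule would be 2 s, 4 s, and 8 s, so the 30 s cap would only come into play if the retry count were raised.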

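The DateEngine entry above describes range patterns that run before dateutil. A minimal sketch of that ordering follows. The name _YEAR_ONLY_RANGE appears in this changelog, but the regex shown for it is a guess, split_date() is a hypothetical helper, and a simple ISO 8601 pattern stands in for the dateutil fallback so the example needs only the standard library. Leaving collection_date empty for a range is also an assumption.

```python
import re

# Assumed pattern: two four-digit years joined by "-" or "/".
_YEAR_ONLY_RANGE = re.compile(r"^(\d{4})\s*[-/]\s*(\d{4})$")
# Stand-in for the dateutil step: ISO 8601 truncated point dates
# (YYYY, YYYY-MM, or YYYY-MM-DD).
_ISO_POINT = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def split_date(raw):
    """Return (collection_date, collection_date_range) for one raw value."""
    text = raw.strip()
    # Range check runs first, so "2018-2020" never reaches the point-date parser.
    if _YEAR_ONLY_RANGE.match(text):
        return None, text
    if _ISO_POINT.match(text):
        return text, None
    return None, None
```

For example, split_date("2018-2020") routes the verbatim string to the range column, while split_date("2018-05") is kept as a truncated point date.
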
Changed

  • Switched from per-module synonym tables to a single shared build_synonym_lookup() function in synonyms.py consumed by both ingestion.py and key_mapper.py.

  • KeyMapper.map_columns() no longer drops columns; all overflow attributes are preserved in _extra_attributes by the ingestion layer.

  • Assembly summary files are now read with functools.lru_cache keyed on path and mtime to avoid redundant disk reads within a session.

Fixed

  • Cross-batch WebEnv/query_key accumulation bug: each esearch batch now creates its own fresh History slot so that efetch retrieves exactly the records in that batch.

  • Year ranges such as 2018-2020 are no longer silently misparsed as 2018-01-20 by dateutil; they are caught by _YEAR_ONLY_RANGE before dateutil is invoked.

  • Country strings containing a parenthesised qualifier with an internal comma (e.g. "United Kingdom (England, Wales & N. Ireland)") are no longer incorrectly split at the internal comma.
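
The parenthesised-qualifier fix in the last entry comes down to splitting on commas only at nesting depth zero. A minimal sketch of that idea follows; split_outside_parens() is a hypothetical name, and the actual logic in geo_engine.py may differ.

```python
def split_outside_parens(text, sep=","):
    """Split `text` on `sep`, ignoring separators nested inside parentheses."""
    parts, current, depth = [], [], 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
        if char == sep and depth == 0:
            # Top-level separator: close the current part and start a new one.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts
```

Applied to "United Kingdom (England, Wales & N. Ireland), London", this keeps the qualifier attached to the country and yields two parts instead of three.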