Changelog
All notable changes to BioMetaHarmonizer are documented in this file. Format follows Keep a Changelog.
v0.6.0 — 2025
Added
- Fixed 51-column output schema defined in ``_load_final_schema()`` (``ingestion.py``). Every record is pre-initialised with all 51 columns so downstream code never needs to handle missing columns.
- GeoEngine (``geo_engine.py``): structured parsing of ``geo_loc_name`` strings into six output columns: ``geo_country``, ``geo_region``, ``geo_locality``, ``geo_iso3166``, ``geo_sea_ocean``, ``geo_loc_raw``. Includes ISO 3166-1 resolution via ``pycountry``, UK sub-country handling, a country alias table, historical country detection, ocean/sea lookup, and coordinate-only entry detection.
- DateEngine (``date_engine.py``): ISO 8601 truncated date parsing with seven range-detection patterns applied before ``dateutil`` to prevent silent misparsing. Populates two output columns: ``collection_date`` (point date) and ``collection_date_range`` (verbatim original for ranges and approximate dates).
- OneHealthClassifier (``one_health.py``): multi-layer, multi-field One Health categorization loaded from ``one_health_dictionaries.json``. Supports ``classify()``, ``classify_joint()``, ``classify_with_confidence()``, and ``classify_multi_field()`` methods. Confidence model with ``high``, ``medium``, ``low``, and ``unresolved`` evidence levels. Optional ``rapidfuzz`` fuzzy fallback layer.
- Antibiogram extraction (``_parse_antibiogram()`` in ``ingestion.py``): automatic parsing of ``<Table class="Antibiogram.1.0">`` XML tables from NCBI Pathogen BioSample packages. Ten canonical field names via ``_ANTIBIOGRAM_HEADER_MAP``. Result stored as a native list in ``_extra_attributes["antibiogram"]``.
- Assembly accession support (``ingestion.py``): GCF_/GCA_ accessions are resolved to BioSample accessions via a two-step process: local NCBI assembly summary index (auto-downloaded, 7-day TTL) followed by an Entrez elink fallback.
- ``_extra_attributes`` column: all BioSample attributes that do not map to a named schema column are preserved as a JSON dict. Multiple values for the same key are pipe-joined.
- Back-fill from assembly index: ``bioproject_accession``, ``assembly_accession_refseq``, and ``assembly_accession_genbank`` are back-filled for all records whose BioSample accession appears in the cached assembly summary files.
- Two-layer synonym lookup (``synonyms.py``): ``unified.json`` (Layer 1) plus optional ``ncbi_attributes.xml`` (Layer 2, built by ``scripts/build_ncbi_attribute_cache.py``). Result cached via ``functools.lru_cache``.
- KeyMapper (``key_mapper.py``): column renaming and coalescing for custom/non-ingestion workflows using the shared synonym lookup.
- Output module (``output.py``): ``write()`` and ``write_summary()`` functions supporting CSV, TSV, Excel (``openpyxl``), and Parquet (``pyarrow``).
- CLI (``cli.py``): ``biometaharmonizer run`` subcommand with full pipeline: ingest → key-map → date/geo/One Health → output. Supports format auto-inference from file extension, comma-separated accession input, ``--summary`` fill-rate output, and the ``--refresh-cache`` flag.
- ``build_dictionaries.py`` script: builds ``one_health_dictionaries.json`` from OLS4 (ENVO, FoodOn, UBERON, Plant Ontology), NCBI Taxonomy BFS walk, and optional UMLS synonym expansion. Implements the ``base_wins`` merge strategy and ``_resolve_collisions()`` with ``ambiguous_category_terms`` output.
- ``build_ncbi_attribute_cache.py`` script: fetches ``ncbi_attributes.xml`` from NCBI for Layer 2 synonym coverage.
- ``generate_summary_report.py`` script: generates interactive HTML reports with Plotly visualizations covering data quality, geography, temporal trends, One Health distribution, host analysis, and ``_extra_attributes`` coverage.
- ``refresh_cache`` parameter on ``ingest()`` and ``--refresh-cache`` CLI flag for forcing re-download of assembly summary files.
- Exponential backoff retry: ``_MAX_RETRIES = 3``, base 2 s, capped at 30 s, applied to all transient Entrez request failures.
- Null normalization: comprehensive ``_NULL_PATTERNS`` regex covering 30+ explicit null/missing/restricted/unknown variants applied to every parsed attribute value.
- Rate-aware inter-batch sleep: ``0.12 s`` with API key, ``0.34 s`` without, computed from the module-level ``ENTREZ_API_KEY`` value.
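
The retry and pacing figures in the Added list can be read as the following minimal sketch. It is an illustration, not the code in ``ingestion.py``: the doubling factor, the reading of ``_MAX_RETRIES`` as retries after the first attempt, the use of ``ConnectionError`` as the transient failure, and the names ``backoff_delay``, ``inter_batch_sleep``, and ``with_retries`` are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

_MAX_RETRIES = 3       # value quoted in the changelog entry
_BASE_DELAY_S = 2.0    # "base 2 s"
_MAX_DELAY_S = 30.0    # "capped at 30 s"


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (0-based); doubling is an assumption."""
    return min(_BASE_DELAY_S * (2 ** attempt), _MAX_DELAY_S)


def inter_batch_sleep(api_key: Optional[str]) -> float:
    """Pause between batches: shorter when an API key is configured."""
    return 0.12 if api_key else 0.34


def with_retries(request: Callable[[], T],
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `request`, retrying on ConnectionError (assumed transient failure)."""
    for attempt in range(_MAX_RETRIES + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == _MAX_RETRIES:
                raise  # retries exhausted, surface the last error
            sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")
```

Passing ``sleep`` as a parameter keeps the sketch testable without real waiting; with the constants above the 30 s cap is first reached on the fifth delay, so three retries never hit it.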
Changed
- Switched from per-module synonym tables to a single shared ``build_synonym_lookup()`` function in ``synonyms.py`` consumed by both ``ingestion.py`` and ``key_mapper.py``.
- ``KeyMapper.map_columns()`` no longer drops columns; all overflow attributes are preserved in ``_extra_attributes`` by the ingestion layer.
- Assembly summary files are now read with ``functools.lru_cache`` keyed on path and ``mtime`` to avoid redundant disk reads within a session.
Fixed
- Cross-batch ``WebEnv``/``query_key`` accumulation bug: each ``esearch`` batch now creates its own fresh History slot so that ``efetch`` retrieves exactly the records in that batch.
- ``2018-2020``-style year ranges are no longer silently misparsed as ``2018-01-20`` by ``dateutil``; they are caught by ``_YEAR_ONLY_RANGE`` before ``dateutil`` is invoked.
- Country strings containing a parenthesised qualifier with an internal comma (e.g. ``"United Kingdom (England, Wales & N. Ireland)"``) are no longer incorrectly split at the internal comma.
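
The date and geography fixes above can be illustrated with a small sketch. It is written under stated assumptions and is not the code in ``date_engine.py`` or ``geo_engine.py``: the regular expression assigned to ``_YEAR_ONLY_RANGE`` is a guess at what such a pattern might look like, and ``split_date`` and ``split_top_level_commas`` are hypothetical helper names.

```python
import re
from typing import List, Optional, Tuple

# Assumed pattern: two four-digit years joined by "-" or "/".
_YEAR_ONLY_RANGE = re.compile(r"^\s*(\d{4})\s*[-/]\s*(\d{4})\s*$")


def split_date(value: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (collection_date, collection_date_range) for a raw value."""
    if _YEAR_ONLY_RANGE.match(value):
        # Ranges are kept verbatim and never reach the point-date parser.
        return None, value
    # A real pipeline would hand `value` to a point-date parser here;
    # this sketch passes it through unchanged.
    return value, None


def split_top_level_commas(text: str) -> List[str]:
    """Split on commas that are not enclosed in parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts
```

Checking the range pattern first is what keeps an ambiguous string away from a permissive parser, and tracking parenthesis depth is one simple way to leave a qualifier such as ``(England, Wales & N. Ireland)`` intact.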