.. _changelog:
=========
Changelog
=========
All notable changes to BioMetaHarmonizer are documented in this file.
The format follows Keep a Changelog.
----
v0.6.0 — 2025
---------------
Added
~~~~~
- **Fixed (static) 51-column output schema** defined in ``_load_final_schema()``
(``ingestion.py``). Every record is pre-initialised with all 51 columns so
downstream code never needs to handle missing columns.
- **GeoEngine** (``geo_engine.py``) — structured parsing of ``geo_loc_name``
strings into six output columns: ``geo_country``, ``geo_region``,
``geo_locality``, ``geo_iso3166``, ``geo_sea_ocean``, ``geo_loc_raw``.
Includes ISO 3166-1 resolution via ``pycountry``, UK sub-country handling,
country alias table, historical country detection, ocean/sea lookup, and
coordinate-only entry detection.
- **DateEngine** (``date_engine.py``) — ISO 8601 truncated date parsing with
seven range-detection patterns applied before ``dateutil`` to prevent silent
misparsing. Populates two output columns: ``collection_date`` (point date)
and ``collection_date_range`` (verbatim original for ranges/approximate).
- **OneHealthClassifier** (``one_health.py``) — multi-layer, multi-field One
Health categorization loaded from ``one_health_dictionaries.json``. Supports
``classify()``, ``classify_joint()``, ``classify_with_confidence()``, and
``classify_multi_field()`` methods. Confidence model with ``high``,
``medium``, ``low``, and ``unresolved`` evidence levels. Optional
``rapidfuzz`` fuzzy fallback layer.
- **Antibiogram extraction** (``_parse_antibiogram()`` in ``ingestion.py``) —
automatic parsing of antibiogram XML tables from
NCBI Pathogen BioSample packages. Ten canonical field names via
``_ANTIBIOGRAM_HEADER_MAP``. Result stored as a native list in
``_extra_attributes["antibiogram"]``.
- **Assembly accession support** (``ingestion.py``) — GCF\_/GCA\_ accessions
are resolved to BioSample accessions via a two-step process: local NCBI
assembly summary index (auto-downloaded, 7-day TTL) followed by an Entrez
elink fallback.
- ``_extra_attributes`` **column** — all BioSample attributes that do not
map to a named schema column are preserved as a JSON dict. Multiple values
for the same key are pipe-joined.
- **Back-fill from assembly index** — ``bioproject_accession``,
``assembly_accession_refseq``, and ``assembly_accession_genbank`` are
back-filled for all records whose BioSample accession appears in the
cached assembly summary files.
- **Two-layer synonym lookup** (``synonyms.py``) — ``unified.json`` (Layer 1)
plus optional ``ncbi_attributes.xml`` (Layer 2, built by
``scripts/build_ncbi_attribute_cache.py``). Result cached via
``functools.lru_cache``.
- **KeyMapper** (``key_mapper.py``) — column renaming and coalescing for
custom/non-ingestion workflows using the shared synonym lookup.
- **Output module** (``output.py``) — ``write()`` and ``write_summary()``
functions supporting CSV, TSV, Excel (openpyxl), and Parquet (pyarrow).
- **CLI** (``cli.py``) — ``biometaharmonizer run`` subcommand with full
pipeline: ingest → key-map → date/geo/One Health → output. Supports format
auto-inference from file extension, comma-separated accession input,
``--summary`` fill-rate output, and the ``--refresh-cache`` flag.
- ``build_dictionaries.py`` **script** — builds ``one_health_dictionaries.json``
from OLS4 (ENVO, FoodOn, UBERON, Plant Ontology), NCBI Taxonomy BFS walk,
and optional UMLS synonym expansion. Implements ``base_wins`` merge strategy
and ``_resolve_collisions()`` with ``ambiguous_category_terms`` output.
- ``build_ncbi_attribute_cache.py`` **script** — fetches
``ncbi_attributes.xml`` from NCBI for Layer 2 synonym coverage.
- ``generate_summary_report.py`` **script** — generates interactive HTML
reports with Plotly visualizations covering data quality, geography, temporal
trends, One Health distribution, host analysis, and ``_extra_attributes``
coverage.
- ``refresh_cache`` parameter on ``ingest()`` and ``--refresh-cache`` CLI
flag for forcing re-download of assembly summary files.
- **Exponential backoff retry** — ``_MAX_RETRIES = 3``, base 2 s, capped at
30 s, applied to all transient Entrez request failures.
- **Null normalization** — comprehensive ``_NULL_PATTERNS`` regex covering
30+ explicit null/missing/restricted/unknown variants applied to every
parsed attribute value.
- **Rate-aware inter-batch sleep** — ``0.12 s`` with API key, ``0.34 s``
without, computed from the module-level ``ENTREZ_API_KEY`` value.
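
The retry and rate-limit entries above can be sketched roughly as follows.
This is a minimal illustration, not the code in ``ingestion.py``: the helper
names (``backoff_delay``, ``inter_batch_sleep``, ``call_with_retry``), the
doubling schedule, the use of ``OSError`` as the transient error, and the
reading of ``_MAX_RETRIES = 3`` as three retries after the first attempt are
all assumptions. Only the constants (3 retries, 2 s base, 30 s cap, and the
0.12 s / 0.34 s sleeps) come from the entries themselves.

.. code-block:: python

   import time
   from typing import Callable, Optional, TypeVar

   T = TypeVar("T")

   MAX_RETRIES = 3      # mirrors _MAX_RETRIES from the changelog entry
   BASE_DELAY_S = 2.0   # base delay in seconds
   MAX_DELAY_S = 30.0   # upper bound on any single delay


   def backoff_delay(attempt: int) -> float:
       """Delay before retry number `attempt` (0-based), doubling up to the cap."""
       return min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)


   def inter_batch_sleep(api_key: Optional[str]) -> float:
       """Pause between batches: shorter when an API key is configured."""
       return 0.12 if api_key else 0.34


   def call_with_retry(
       request: Callable[[], T],
       sleep: Callable[[float], None] = time.sleep,
   ) -> T:
       """Call `request`, retrying on OSError (a stand-in for transient failures)."""
       attempt = 0
       while True:
           try:
               return request()
           except OSError:
               if attempt >= MAX_RETRIES:
                   raise  # retries exhausted: surface the last error
               sleep(backoff_delay(attempt))
               attempt += 1

Passing ``sleep`` as a parameter keeps the sketch testable without real
waiting. The two inter-batch values presumably keep traffic just under the
NCBI E-utilities limits of 10 requests per second with an API key and 3
without.
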
Changed
~~~~~~~
- Switched from per-module synonym tables to a single shared
``build_synonym_lookup()`` function in ``synonyms.py`` consumed by both
``ingestion.py`` and ``key_mapper.py``.
- ``KeyMapper.map_columns()`` no longer drops columns; all overflow attributes
are preserved in ``_extra_attributes`` by the ingestion layer.
- Assembly summary files are now read with ``functools.lru_cache`` keyed on
path and ``mtime`` to avoid redundant disk reads within a session.
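
The ``lru_cache`` change above relies on a common pattern: the file's
modification time is passed as an extra argument so that it becomes part of
the cache key. A minimal sketch, assuming hypothetical function names
(``_read_lines``, ``read_summary``) and a plain line-based reader rather than
the project's actual parser:

.. code-block:: python

   import functools
   import os
   from typing import Tuple


   @functools.lru_cache(maxsize=8)
   def _read_lines(path: str, mtime_ns: int) -> Tuple[str, ...]:
       # mtime_ns is not used in the body; it only participates in the cache
       # key, so a file rewritten on disk produces a cache miss.
       with open(path, encoding="utf-8") as handle:
           return tuple(line.rstrip("\n") for line in handle)


   def read_summary(path: str) -> Tuple[str, ...]:
       """Return the file's lines, re-reading only when its mtime has changed."""
       return _read_lines(path, os.stat(path).st_mtime_ns)

Returning an immutable tuple avoids callers mutating the cached value. Each
call still costs one ``os.stat``, which is far cheaper than re-reading a
large summary file.
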
Fixed
~~~~~
- Cross-batch ``WebEnv``/``query_key`` accumulation bug: each ``esearch``
batch now creates its own fresh History slot so that ``efetch`` retrieves
exactly the records in that batch.
- ``2018-2020``-style year ranges are no longer silently misparsed as
``2018-01-20`` by ``dateutil``; they are caught by ``_YEAR_ONLY_RANGE``
before ``dateutil`` is invoked.
- Country strings containing a parenthesised qualifier with an internal comma
(e.g. ``"United Kingdom (England, Wales & N. Ireland)"``) are no longer
incorrectly split at the internal comma.
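
The parenthesised-qualifier fix above amounts to splitting only on commas
that sit outside parentheses. The following is a minimal sketch of that idea,
not the code in ``geo_engine.py``; the function name ``split_outside_parens``
and the depth-counting approach are assumptions made for illustration.

.. code-block:: python

   from typing import List


   def split_outside_parens(text: str, sep: str = ",") -> List[str]:
       """Split `text` on `sep`, ignoring separators nested inside parentheses."""
       parts: List[str] = []
       current: List[str] = []
       depth = 0  # current parenthesis nesting level
       for char in text:
           if char == "(":
               depth += 1
           elif char == ")" and depth > 0:
               depth -= 1
           if char == sep and depth == 0:
               # Top-level separator: close the current part.
               parts.append("".join(current).strip())
               current = []
           else:
               current.append(char)
       parts.append("".join(current).strip())
       return parts

With this approach, ``"United Kingdom (England, Wales & N. Ireland)"`` stays a
single element, while a top-level comma as in ``"a (b, c), d"`` still splits
the string into two parts.
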