.. _changelog:

=========
Changelog
=========

All notable changes to BioMetaHarmonizer are documented in this file.
Format follows Keep a Changelog.

----

v0.6.0 — 2025
---------------

Added
~~~~~

- **Fixed 51-column output schema** defined in ``_load_final_schema()``
  (``ingestion.py``). Every record is pre-initialised with all 51 columns so
  downstream code never needs to handle missing columns.
- **GeoEngine** (``geo_engine.py``) — structured parsing of ``geo_loc_name``
  strings into six output columns: ``geo_country``, ``geo_region``,
  ``geo_locality``, ``geo_iso3166``, ``geo_sea_ocean``, ``geo_loc_raw``.
  Includes ISO 3166-1 resolution via ``pycountry``, UK sub-country handling,
  country alias table, historical country detection, ocean/sea lookup, and
  coordinate-only entry detection.
- **DateEngine** (``date_engine.py``) — ISO 8601 truncated date parsing with
  seven range-detection patterns applied before ``dateutil`` to prevent silent
  misparsing. Populates two output columns: ``collection_date`` (point date)
  and ``collection_date_range`` (verbatim original for ranges/approximate).
- **OneHealthClassifier** (``one_health.py``) — multi-layer, multi-field
  One Health categorization loaded from ``one_health_dictionaries.json``.
  Supports ``classify()``, ``classify_joint()``,
  ``classify_with_confidence()``, and ``classify_multi_field()`` methods.
  Confidence model with ``high``, ``medium``, ``low``, and ``unresolved``
  evidence levels. Optional ``rapidfuzz`` fuzzy fallback layer.
- **Antibiogram extraction** (``_parse_antibiogram()`` in ``ingestion.py``) —
  automatic parsing of XML tables from NCBI Pathogen BioSample packages.
  Ten canonical field names via ``_ANTIBIOGRAM_HEADER_MAP``. Result stored as
  a native list in ``_extra_attributes["antibiogram"]``.
- **Assembly accession support** (``ingestion.py``) — GCF\_/GCA\_ accessions
  are resolved to BioSample accessions via a two-step process: local NCBI
  assembly summary index (auto-downloaded, 7-day TTL) followed by an Entrez
  elink fallback.
- ``_extra_attributes`` **column** — all BioSample attributes that do not map
  to a named schema column are preserved as a JSON dict. Multiple values for
  the same key are pipe-joined.
- **Back-fill from assembly index** — ``bioproject_accession``,
  ``assembly_accession_refseq``, and ``assembly_accession_genbank`` are
  back-filled for all records whose BioSample accession appears in the cached
  assembly summary files.
- **Two-layer synonym lookup** (``synonyms.py``) — ``unified.json`` (Layer 1)
  plus optional ``ncbi_attributes.xml`` (Layer 2, built by
  ``scripts/build_ncbi_attribute_cache.py``). Result cached via
  ``functools.lru_cache``.
- **KeyMapper** (``key_mapper.py``) — column renaming and coalescing for
  custom/non-ingestion workflows using the shared synonym lookup.
- **Output module** (``output.py``) — ``write()`` and ``write_summary()``
  functions supporting CSV, TSV, Excel (openpyxl), and Parquet (pyarrow).
- **CLI** (``cli.py``) — ``biometaharmonizer run`` subcommand with full
  pipeline: ingest → key-map → date/geo/One Health → output. Supports format
  auto-inference from file extension, comma-separated accession input,
  ``--summary`` fill-rate output, ``--refresh-cache`` flag.
- ``build_dictionaries.py`` **script** — builds
  ``one_health_dictionaries.json`` from OLS4 (ENVO, FoodOn, UBERON, Plant
  Ontology), NCBI Taxonomy BFS walk, and optional UMLS synonym expansion.
  Implements ``base_wins`` merge strategy and ``_resolve_collisions()`` with
  ``ambiguous_category_terms`` output.
- ``build_ncbi_attribute_cache.py`` **script** — fetches
  ``ncbi_attributes.xml`` from NCBI for Layer 2 synonym coverage.
- ``generate_summary_report.py`` **script** — generates interactive HTML
  reports with Plotly visualizations covering data quality, geography,
  temporal trends, One Health distribution, host analysis, and
  ``_extra_attributes`` coverage.
- ``refresh_cache`` **parameter** on ``ingest()`` and ``--refresh-cache`` CLI
  flag for forcing re-download of assembly summary files.
- **Exponential backoff retry** — ``_MAX_RETRIES = 3``, base 2 s, capped at
  30 s, applied to all transient Entrez request failures.
- **Null normalization** — comprehensive ``_NULL_PATTERNS`` regex covering 30+
  explicit null/missing/restricted/unknown variants applied to every parsed
  attribute value.
- **Rate-aware inter-batch sleep** — ``0.12 s`` with API key, ``0.34 s``
  without, computed from the module-level ``ENTREZ_API_KEY`` value.

Changed
~~~~~~~

- Switched from per-module synonym tables to a single shared
  ``build_synonym_lookup()`` function in ``synonyms.py`` consumed by both
  ``ingestion.py`` and ``key_mapper.py``.
- ``KeyMapper.map_columns()`` no longer drops columns; all overflow attributes
  are preserved in ``_extra_attributes`` by the ingestion layer.
- Assembly summary files are now read with ``functools.lru_cache`` keyed on
  path and ``mtime`` to avoid redundant disk reads within a session.

Fixed
~~~~~

- Cross-batch ``WebEnv``/``query_key`` accumulation bug: each ``esearch``
  batch now creates its own fresh History slot so that ``efetch`` retrieves
  exactly the records in that batch.
- ``2018-2020``-style year ranges are no longer silently misparsed as
  ``2018-01-20`` by ``dateutil``; they are caught by ``_YEAR_ONLY_RANGE``
  before ``dateutil`` is invoked.
- Country strings containing a parenthesised qualifier with an internal comma
  (e.g. ``"United Kingdom (England, Wales & N. Ireland)"``) are no longer
  incorrectly split at the internal comma.
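
The last two fixes above can be illustrated with a minimal standard-library
sketch. It is not the code in ``date_engine.py`` or ``geo_engine.py``: the
regex ``YEAR_ONLY_RANGE`` and the helpers ``split_date_value()`` and
``split_top_level_commas()`` are hypothetical names chosen for illustration,
and the real ``_YEAR_ONLY_RANGE`` pattern may differ.

```python
import re

# Hypothetical stand-in for _YEAR_ONLY_RANGE: exactly two four-digit years
# joined by a hyphen, with optional surrounding whitespace.
YEAR_ONLY_RANGE = re.compile(r"^\s*\d{4}\s*-\s*\d{4}\s*$")


def split_date_value(raw):
    """Return (point_date, verbatim_range) for a raw collection date string."""
    if YEAR_ONLY_RANGE.match(raw):
        # The range is kept verbatim and never reaches a general-purpose
        # date parser, so it cannot be misread as a single day.
        return None, raw.strip()
    return raw.strip(), None


def split_top_level_commas(text):
    """Split on commas that are not enclosed in parentheses."""
    parts, current, depth = [], [], 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
        if char == "," and depth == 0:
            # A comma at depth 0 closes the current part.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts
```

Under these assumptions, ``split_date_value("2018-2020")`` routes the string
to the range slot instead of the point-date slot, and
``split_top_level_commas()`` leaves the comma inside
``"United Kingdom (England, Wales & N. Ireland)"`` untouched because the
nesting depth is non-zero at that position.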