.. _harmonization:

=============
Harmonization
=============

After raw XML is fetched and parsed, four engines apply in-place
transformations to specific columns of the output DataFrame. Each engine
deduplicates its input values before processing to avoid redundant work.

Date Engine
-----------

Module: :mod:`biometaharmonizer.date_engine`

Class: :class:`~biometaharmonizer.date_engine.DateEngine`

The date engine converts date strings to a *truncated* ISO 8601
representation and populates two output columns:

- ``collection_date``: ISO 8601 point date (``YYYY``, ``YYYY-MM``, or
  ``YYYY-MM-DD``). This field is **always** ``NaN`` for any range or
  approximate input, without exception.
- ``collection_date_range``: the verbatim original string, set only for
  range or approximate inputs; ``NaN`` for all point-date inputs.

Range detection runs *before* ``dateutil.parser`` to prevent silent
misparsing. For example, ``2018-2020`` would be misparsed by dateutil as
``2018-01-20``, so it is caught by ``_YEAR_ONLY_RANGE`` first.

Range patterns are evaluated in this order (first match wins):

1. ``_INSDC_SLASH_RANGE``: numeric INSDC slash, e.g. ``2004-07/2004-12``
2. ``_YEAR_ONLY_RANGE``: year-only range where start ≠ end, e.g.
   ``2018-2020``
3. ``_NUMERIC_DASH_RANGE``: numeric dash or "to" word, e.g.
   ``2021-01-15 - 2021-03-20``
4. ``_NAMED_MONTH_SAME_YEAR``: named months in the same year, e.g.
   ``July-December 2004``
5. ``_NAMED_MONTH_CROSS_YEAR``: named months across years, e.g.
   ``Oct 2020-Feb 2021``
6. ``_SEASON_RANGE``: season strings, e.g. ``Spring 2019``,
   ``Winter 2020-2021``
7. ``_APPROX_DATE``: approximate prefixes, e.g. ``~2015``, ``circa 2010``,
   ``early March 2020``, ``late 2019``, ``mid-2018``

Bare two-digit year strings (e.g. ``"95"``) are always rejected and
produce a warning log.

The main public method is
:meth:`~biometaharmonizer.date_engine.DateEngine.parse_with_range`, which
returns a two-column DataFrame.
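
The range-before-point ordering described above can be illustrated with a
minimal sketch. It is not the engine's implementation: the three regexes are
simplified, hypothetical stand-ins for ``_INSDC_SLASH_RANGE``,
``_YEAR_ONLY_RANGE`` and ``_APPROX_DATE``, the function name
``split_point_and_range`` is invented for illustration, ``None`` stands in
for ``NaN``, and only already-ISO point dates are recognised (the real engine
delegates point dates to ``dateutil.parser``).

.. code-block:: python

   import re

   # Hypothetical, simplified stand-ins for three of the documented patterns.
   INSDC_SLASH_RANGE = re.compile(r"^\d{4}(-\d{2}){0,2}/\d{4}(-\d{2}){0,2}$")
   YEAR_ONLY_RANGE = re.compile(r"^(\d{4})\s*-\s*(\d{4})$")
   APPROX_DATE = re.compile(r"^(~|circa\s+|early\s+|late\s+|mid-)", re.IGNORECASE)

   # Assumption: point dates are only accepted here if already truncated ISO 8601.
   ISO_POINT = re.compile(r"^\d{4}(-\d{2}){0,2}$")


   def split_point_and_range(raw):
       """Return (collection_date, collection_date_range) for one string."""
       text = raw.strip()

       # Range checks run first, in the documented order (first match wins).
       if INSDC_SLASH_RANGE.match(text):
           return (None, text)
       year_range = YEAR_ONLY_RANGE.match(text)
       if year_range and year_range.group(1) != year_range.group(2):
           return (None, text)
       if APPROX_DATE.match(text):
           return (None, text)

       # Only strings that are not ranges reach the point-date step.
       if ISO_POINT.match(text):
           return (text, None)

       # Anything else (null tokens, two-digit years, ...) yields no date.
       return (None, None)


   for value in ["2018-2020", "2004-07/2004-12", "~2015", "2021-06", "missing"]:
       print(value, "->", split_point_and_range(value))

Because the range checks return early, a string such as ``2018-2020`` never
reaches the point-date step, which is the property the ordering is meant to
guarantee.
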
The legacy :meth:`~biometaharmonizer.date_engine.DateEngine.parse` method
returns only ``collection_date`` as a Series.

Date Parsing Examples
~~~~~~~~~~~~~~~~~~~~~

The following table shows representative inputs and their normalized
outputs.

.. list-table:: Date string parsing examples
   :header-rows: 1

   * - Input string
     - collection_date
     - collection_date_range
   * - ``2021``
     - ``2021``
     - NaN
   * - ``2021-06``
     - ``2021-06``
     - NaN
   * - ``2021-06-15``
     - ``2021-06-15``
     - NaN
   * - ``Jun 2019``
     - ``2019-06``
     - NaN
   * - ``15/06/2021``
     - ``2021-06-15``
     - NaN
   * - ``June 15, 2021``
     - ``2021-06-15``
     - NaN
   * - ``2018-2020``
     - NaN
     - ``2018-2020``
   * - ``2004-07/2004-12``
     - NaN
     - ``2004-07/2004-12``
   * - ``2021-01-15/2021-03-20``
     - NaN
     - ``2021-01-15/...``
   * - ``July-December 2004``
     - NaN
     - ``July-December 2004``
   * - ``Jan-Mar 2019``
     - NaN
     - ``Jan-Mar 2019``
   * - ``Oct 2020-Feb 2021``
     - NaN
     - ``Oct 2020-Feb 2021``
   * - ``Spring 2019``
     - NaN
     - ``Spring 2019``
   * - ``Winter 2020-2021``
     - NaN
     - ``Winter 2020-2021``
   * - ``~2015``
     - NaN
     - ``~2015``
   * - ``circa 2010``
     - NaN
     - ``circa 2010``
   * - ``early March 2020``
     - NaN
     - ``early March 2020``
   * - ``late 2019``
     - NaN
     - ``late 2019``
   * - ``missing``
     - NaN
     - NaN
   * - ``unknown``
     - NaN
     - NaN
   * - ``not provided``
     - NaN
     - NaN
   * - ``2015/2017``
     - NaN
     - ``2015/2017``

Geo Engine
----------

Module: :mod:`biometaharmonizer.geo_engine`

Class: :class:`~biometaharmonizer.geo_engine.GeoEngine`

The geo engine parses NCBI ``geo_loc_name`` strings into six structured
output columns. The expected input format is
``"Country: Region, Locality"``; the fallback format is
``"Country, Locality"`` (no colon).

**Output columns populated:**

- ``geo_country``: normalised country display name (e.g.
  ``"United Kingdom"``)
- ``geo_region``: sub-national region as submitted (e.g. ``"England"``)
- ``geo_locality``: locality or sub-region as submitted
- ``geo_iso3166``: ISO 3166-1 alpha-2 country code (e.g. ``"GB"``), or the
  string ``"HISTORICAL"`` for defunct countries, or NaN if not resolvable
- ``geo_sea_ocean``: ocean or sea name for marine samples (e.g.
  ``"Pacific Ocean"``)
- ``geo_loc_raw``: original submitted string, set **only** for
  coordinate-only entries (e.g. ``"45.3 N, 30.1 E"``); NaN for all
  successfully parsed records

The public method is :meth:`~biometaharmonizer.geo_engine.GeoEngine.parse`,
which accepts a :class:`pandas.Series` and returns a six-column
:class:`pandas.DataFrame`.

Special handling rules:

- **UK sub-countries:** ``"England"``, ``"Scotland"``, ``"Wales"``, and
  ``"Northern Ireland"`` are all mapped to ISO code ``"GB"`` and display
  name ``"United Kingdom"``.
- **Country aliases:** ``"Turkey"``/``"Türkiye"`` → ``"TR"``;
  ``"Namibia"`` → ``"NA"``;
  ``"DR Congo"``/``"DRC"``/``"Congo-Kinshasa"`` → ``"CD"``;
  ``"Burma"``/``"Myanmar (Burma)"`` → ``"MM"``;
  ``"Palestine"``/``"Gaza"``/``"West Bank"`` → ``"PS"``.
- **Historical countries:** ``"USSR"``, ``"Soviet Union"``,
  ``"Yugoslavia"``, ``"Czechoslovakia"``,
  ``"German Democratic Republic"``, ``"Zaire"``, and others are tagged
  ``geo_iso3166 = "HISTORICAL"`` and a WARNING is logged.
- **Coordinate-only entries:** values matching the pattern
  ``"[±]DDD.DDD [NS], [±]DDD.DDD [EW]"`` are stored in ``geo_loc_raw``.
- **Parenthetical qualifiers:** trailing parenthetical suffixes such as
  ``"United Kingdom (England, Wales & N. Ireland)"`` are stripped before
  country lookup so the comma inside the parentheses does not break
  parsing.
- **Ocean/sea lookup:** when the country token (after stripping
  parenthetical qualifiers) matches one of the 15 named ocean/sea entries,
  the value is stored in ``geo_sea_ocean`` instead of ``geo_country``.

Geo Parsing Examples
~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Geographic string parsing examples
   :header-rows: 1
   :widths: 40 16 16 16 12

   * - Input string
     - geo_country
     - geo_region
     - geo_locality
     - geo_iso3166
   * - ``Russia: Novosibirsk, Akademgorodok``
     - Russia
     - Novosibirsk
     - Akademgorodok
     - RU
   * - ``USA: California, San Diego``
     - USA
     - California
     - San Diego
     - US
   * - ``United Kingdom``
     - United Kingdom
     - NaN
     - NaN
     - GB
   * - ``England: Yorkshire``
     - United Kingdom
     - Yorkshire
     - NaN
     - GB
   * - ``Germany: Bavaria``
     - Germany
     - Bavaria
     - NaN
     - DE
   * - ``Pacific Ocean``
     - NaN
     - NaN
     - NaN
     - NaN
   * - ``USSR``
     - USSR
     - NaN
     - NaN
     - HISTORICAL
   * - ``Turkey: Istanbul``
     - Turkey
     - Istanbul
     - NaN
     - TR
   * - ``45.3 N, 30.1 E``
     - NaN
     - NaN
     - NaN
     - NaN
   * - ``China, Shanghai``
     - China
     - NaN
     - Shanghai
     - CN

The table omits the ``geo_sea_ocean`` and ``geo_loc_raw`` columns. Under the
rules above, ``Pacific Ocean`` is stored in ``geo_sea_ocean`` and
``45.3 N, 30.1 E`` in ``geo_loc_raw``, which is why every column shown for
those two rows is NaN.

One Health Classifier
---------------------

Module: :mod:`biometaharmonizer.one_health`

Class: :class:`~biometaharmonizer.one_health.OneHealthClassifier`

The One Health classifier assigns each record to a standardized category
using deterministic, multi-layer semantic analysis. All biological
knowledge is loaded from ``one_health_dictionaries.json``; no terms are
hardcoded in the Python source.

**Valid output categories for** ``one_health_category``:

- ``Human``: isolates from human clinical specimens or hosts
- ``Animal``: domestic and companion animals, livestock, veterinary samples
- ``Aquatic``: aquatic animal hosts and water-column samples
- ``Wildlife``: wild-animal isolates (birds, rodents, bats, wild ungulates)
- ``Plant``: plant material, rhizosphere, phytopathological samples
- ``Food``: food products, ingredients, food-processing environments
- ``Environmental``: soil, sediment, air, water, biofilms not otherwise
  classified
- ``Lab``: culture collections, ATCC strains, in-vitro/in-vivo laboratory
  samples
- ``Unclassified``: no category could be determined with sufficient
  confidence

The ``one_health_category`` column is always a string; it is never ``NaN``.
Unclassifiable records receive the string ``"Unclassified"``.

**Public methods:**

- :meth:`~biometaharmonizer.one_health.OneHealthClassifier.classify`:
  single-field classification from ``isolation_source``; returns a
  :class:`pandas.Series`.
- :meth:`~biometaharmonizer.one_health.OneHealthClassifier.classify_joint`:
  two-field classification from ``isolation_source`` and ``host``;
  delegates to ``classify_multi_field`` and returns the
  ``one_health_category`` Series.
- :meth:`~biometaharmonizer.one_health.OneHealthClassifier.classify_with_confidence`:
  single-field classification with confidence; returns a DataFrame with
  columns ``one_health_category``, ``one_health_term``, and
  ``one_health_confidence``.
- :meth:`~biometaharmonizer.one_health.OneHealthClassifier.classify_multi_field`:
  full multi-field evidence integration accepting named Series for any of
  ``isolation_source``, ``host``, ``env_medium``, ``env_local_scale``,
  ``env_broad_scale``, and ``sample_type``.

**Confidence model:**

.. code-block:: text

   confidence = min(1.0, term_specificity * field_weight + corroboration_bonus)

Term specificity values:

- Unambiguous list or host dict hit: **1.0**
- Tier1 phrase ≥ 8 characters: **0.90**
- Tier1 term 4–7 characters: **0.75**
- Tier1 term < 4 characters: **0.50**
- Ambiguous specimen term: **0.30**

Confidence is discretized by ``discretize_confidence()``::

   >= 0.85 → "high"
   >= 0.60 → "medium"
   >= 0.30 → "low"
   <  0.30 → "unresolved"

Category and Example Isolation Sources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: One Health categories and example source values
   :header-rows: 1

   * - Category
     - Example ``isolation_source`` or ``host`` values
   * - ``Human``
     - blood, urine, cerebrospinal fluid, rectal swab, wound, surgical
       site, sputum, Homo sapiens
   * - ``Animal``
     - bovine feces, swine nasal swab, chicken, cow, dog, pig, horse fecal,
       poultry litter, Bos taurus
   * - ``Aquatic``
     - fish, salmon, shrimp, aquaculture water, trout, tilapia, oyster,
       clam
   * - ``Wildlife``
     - wild bird, bat, rodent, deer, wild boar, migratory bird, fox,
       raccoon
   * - ``Plant``
     - plant root, rhizosphere, leaf surface, tomato, rice, wheat stem,
       Arabidopsis
   * - ``Food``
     - ground beef, raw milk, cheese, lettuce, retail chicken, food
       processing surface, ready-to-eat meat
   * - ``Environmental``
     - soil, river sediment, wastewater, biofilm, air sample, drinking
       water, estuary water
   * - ``Lab``
     - ATCC strain, in vitro culture, laboratory stock, type strain,
       passage culture

Synonym Resolution
------------------

Module: :mod:`biometaharmonizer.synonyms`

Function: :func:`~biometaharmonizer.synonyms.build_synonym_lookup`

The synonym lookup table maps every lowercased synonym to a canonical
standard key. It is built in two layers and cached via
``functools.lru_cache`` for the lifetime of the process:

**Layer 1 (unified.json):** Project-defined synonyms. Each field entry has
a ``standard_key`` and a list of ``synonyms``. All synonyms are lowercased
and mapped to the ``standard_key``.

**Layer 2 (ncbi_attributes.xml, optional):** NCBI BioSample official
``HarmonizedName`` and ``Synonym`` entries. This layer is present only
after running ``scripts/build_ncbi_attribute_cache.py``. Where the two
layers conflict, Layer 2 overrides Layer 1 with the authoritative NCBI
mapping.

Selected synonym → canonical key mappings (from ``unified.json``):

.. list-table:: Selected synonym to canonical key mappings
   :header-rows: 1

   * - Synonym (lowercased)
     - Canonical key
   * - ``collection date``
     - ``collection_date``
   * - ``collection_date``
     - ``collection_date``
   * - ``geo_loc_name``
     - ``geo_loc_name``
   * - ``geographic location``
     - ``geo_loc_name``
   * - ``geographic_location``
     - ``geo_loc_name``
   * - ``host organism``
     - ``host``
   * - ``isolation source``
     - ``isolation_source``
   * - ``isolation_source``
     - ``isolation_source``
   * - ``collected by``
     - ``collected_by``
   * - ``sample type``
     - ``sample_type``

Null Normalization in Harmonization Engines
--------------------------------------------

Both :class:`~biometaharmonizer.date_engine.DateEngine` and
:class:`~biometaharmonizer.geo_engine.GeoEngine` maintain their own
``NULL_PATTERNS`` class attribute (a compiled regex). The date engine's
pattern covers ``missing``, ``unknown``, ``n/a``, ``not provided``,
``not collected``, ``na``, ``none``, ``--``, and any ``missing:.*``,
``not applicable:.*``, or ``restricted access`` variant. The geo engine
uses the same comprehensive pattern as the ingestion module.
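
The date engine's token list above can be approximated with a single
anchored regex. This is a sketch under stated assumptions, not the actual
value of ``DateEngine.NULL_PATTERNS``: the names ``NULL_PATTERN`` and
``is_null`` are hypothetical, and case-insensitive matching and whitespace
trimming are assumptions rather than documented behaviour.

.. code-block:: python

   import re

   # Hypothetical reconstruction from the documented token list; the engines'
   # real NULL_PATTERNS regex may differ in detail.
   NULL_PATTERN = re.compile(
       r"^\s*(?:"
       r"missing(?::.*)?"        # "missing" and any "missing:..." variant
       r"|unknown|n/a|na|none|--"
       r"|not provided|not collected"
       r"|not applicable:.*"     # only the colon form is documented
       r"|restricted access"
       r")\s*$",
       re.IGNORECASE,            # assumption: matching ignores case
   )


   def is_null(value):
       """Return True when the string is one of the null placeholders."""
       return bool(NULL_PATTERN.match(value))


   for value in ["missing: control sample", "Not Provided", "--", "2021-06"]:
       print(repr(value), is_null(value))

Anchoring the pattern at both ends keeps legitimate values that merely
contain a placeholder word, such as a locality name including "none", from
being treated as null.
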