Harmonization
After raw XML is fetched and parsed, four engines apply in-place transformations to specific columns of the output DataFrame. For performance, each engine deduplicates its input values and processes each distinct value only once.
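The deduplication step can be pictured as follows. This is a minimal sketch in plain Python, not the package's implementation: apply_to_unique and normalise are hypothetical names, and the real engines operate on DataFrame columns rather than lists.

```python
def apply_to_unique(values, transform):
    """Run `transform` once per distinct value and map the result back to every row."""
    cache = {}
    results = []
    for value in values:
        if value not in cache:
            # First time this exact string is seen: do the expensive work.
            cache[value] = transform(value)
        results.append(cache[value])
    return results


calls = []  # records which inputs actually reached the transform


def normalise(value):
    # Hypothetical stand-in for an engine's per-value transformation.
    calls.append(value)
    return value.strip().lower()


column = ["Soil", "soil ", "Soil", "Blood", "Soil"]
normalised = apply_to_unique(column, normalise)
```

Five rows are produced, but the transform runs only three times, once per distinct input string.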
Date Engine
Module: biometaharmonizer.date_engine
Class: DateEngine
The date engine converts date strings to a truncated ISO 8601 representation and populates two output columns:
- ``collection_date``: ISO 8601 point date (YYYY, YYYY-MM, or YYYY-MM-DD). Always NaN for range or approximate input.
- ``collection_date_range``: the verbatim original string, set only for range or approximate input; NaN for all point-date input.
Range detection runs before dateutil.parser to prevent silent
misparsing. For example, 2018-2020 would be misparsed by dateutil as
2018-01-20, so it is caught by _YEAR_ONLY_RANGE first.
Range pattern evaluation order (first match wins):
1. _INSDC_SLASH_RANGE: numeric INSDC slash, e.g. 2004-07/2004-12
2. _YEAR_ONLY_RANGE: year-only range where start ≠ end, e.g. 2018-2020
3. _NUMERIC_DASH_RANGE: numeric dash or "to" word, e.g. 2021-01-15 - 2021-03-20
4. _NAMED_MONTH_SAME_YEAR: named-month same year, e.g. July-December 2004
5. _NAMED_MONTH_CROSS_YEAR: named-month cross-year, e.g. Oct 2020-Feb 2021
6. _SEASON_RANGE: season strings, e.g. Spring 2019, Winter 2020-2021
7. _APPROX_DATE: approximate prefixes, e.g. ~2015, circa 2010, early March 2020, late 2019, mid-2018
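The "first match wins" ordering can be sketched as below. The regular expressions are simplified assumptions written for illustration and cover only three of the seven pattern families; they are not the engine's actual _INSDC_SLASH_RANGE, _YEAR_ONLY_RANGE, or _APPROX_DATE definitions. The helper name split_point_and_range is hypothetical, None stands in for NaN, and the point-date check accepts only strings that are already ISO-formatted, whereas the real engine hands non-range input to dateutil.parser.

```python
import re

# Ordered list: earlier entries take precedence (first match wins).
RANGE_PATTERNS = [
    ("insdc_slash", re.compile(r"^\d{4}(?:-\d{2}){0,2}/\d{4}(?:-\d{2}){0,2}$")),
    ("year_only", re.compile(r"^(\d{4})\s*-\s*(\d{4})$")),
    ("approx", re.compile(r"^(?:~|circa\s+|early\s+|late\s+|mid-)", re.IGNORECASE)),
]
POINT_DATE = re.compile(r"^\d{4}(?:-\d{2}){0,2}$")


def split_point_and_range(value):
    """Return (collection_date, collection_date_range) for one input string."""
    text = value.strip()
    for name, pattern in RANGE_PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        if name == "year_only" and match.group(1) == match.group(2):
            # A year-only range requires start != end.
            continue
        # Range or approximate input: keep the verbatim string, no point date.
        return None, text
    if POINT_DATE.match(text):
        return text, None
    # Anything else, including a bare two-digit year such as "95", is rejected.
    return None, None
```

Because the range check runs first, 2018-2020 never reaches the point-date branch, which is the situation the ordering is meant to guard against.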
Bare two-digit year strings (e.g. "95") are always rejected and produce
a warning log.
The main public method is parse_with_range(),
which returns a two-column DataFrame. The legacy
parse() method returns only
collection_date as a Series.
Date Parsing Examples
The following table shows representative inputs and their normalized outputs.
| Input string | collection_date | collection_date_range |
|---|---|---|
|  |  | NaN |
|  |  | NaN |
|  |  | NaN |
|  |  | NaN |
|  |  | NaN |
|  |  | NaN |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN |  |
|  | NaN | NaN |
|  | NaN | NaN |
|  | NaN | NaN |
|  | NaN |  |
Geo Engine
Module: biometaharmonizer.geo_engine
Class: GeoEngine
The geo engine parses NCBI geo_loc_name strings into six structured output
columns. The expected input format is "Country: Region, Locality"; the
fallback format is "Country, Locality" (no colon).
Output columns populated:
- geo_country: normalised country display name (e.g. "United Kingdom")
- geo_region: sub-national region as submitted (e.g. "England")
- geo_locality: locality or sub-region as submitted
- geo_iso3166: ISO 3166-1 alpha-2 country code (e.g. "GB"), the string "HISTORICAL" for defunct countries, or NaN if not resolvable
- geo_sea_ocean: ocean or sea name for marine samples (e.g. "Pacific Ocean")
- geo_loc_raw: original submitted string, set only for coordinate-only entries (e.g. "45.3 N, 30.1 E"); NaN for all successfully parsed records
The public method is parse(),
which accepts a pandas.Series and returns a six-column
pandas.DataFrame.
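The two input formats described above could be split along these lines. This is a sketch under stated assumptions rather than the GeoEngine code: split_geo_loc_name is a hypothetical helper, None stands in for NaN, and country lookup, ISO codes, and ocean handling are left out.

```python
import re

# Trailing "(...)" qualifier, removed before splitting so that commas inside
# the parentheses are not mistaken for field separators.
_PAREN_SUFFIX = re.compile(r"\s*\([^)]*\)\s*$")


def split_geo_loc_name(raw):
    """Split one geo_loc_name string into (country, region, locality)."""
    text = raw.strip()
    if ":" in text:
        # Expected format: "Country: Region, Locality"
        country, _, rest = text.partition(":")
        country = _PAREN_SUFFIX.sub("", country).strip()
        parts = [part.strip() for part in rest.split(",", 1)]
        region = parts[0] or None
        locality = parts[1] if len(parts) > 1 and parts[1] else None
        return country, region, locality
    # Fallback format: "Country, Locality" (no colon)
    text = _PAREN_SUFFIX.sub("", text)
    country, _, locality = text.partition(",")
    return country.strip(), None, (locality.strip() or None)
```

In the fallback branch the text after the comma is treated as a locality, not a region, because the colon-free format carries no separate region field.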
Special handling rules:
- UK sub-countries: "England", "Scotland", "Wales", "Northern Ireland" are all mapped to ISO code "GB" and display name "United Kingdom".
- Country aliases: "Turkey" / "Türkiye" → "TR"; "Namibia" → "NA"; "DR Congo" / "DRC" / "Congo-Kinshasa" → "CD"; "Burma" / "Myanmar (Burma)" → "MM"; "Palestine" / "Gaza" / "West Bank" → "PS".
- Historical countries: "USSR", "Soviet Union", "Yugoslavia", "Czechoslovakia", "German Democratic Republic", "Zaire", and others are tagged geo_iso3166 = "HISTORICAL" and a WARNING is logged.
- Coordinate-only entries: values matching the pattern "[±]DDD.DDD [NS], [±]DDD.DDD [EW]" are stored in geo_loc_raw.
- Parenthetical qualifiers: trailing parenthetical suffixes such as "United Kingdom (England, Wales & N. Ireland)" are stripped before country lookup so the comma inside the parentheses does not break parsing.
- Ocean/sea lookup: when the country token (after stripping parenthetical qualifiers) matches one of the 15 named ocean/sea entries, the value is stored in geo_sea_ocean instead of geo_country.
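A few of these rules can be expressed as plain lookups. The sketch below uses only the names listed above; resolve_iso and is_coordinate_only are hypothetical helpers, the coordinate regex is one possible reading of the documented pattern, and None stands in for NaN. Ordinary country names, which the real engine resolves through its full lookup table, are out of scope here.

```python
import re

UK_PARTS = {"england", "scotland", "wales", "northern ireland"}
ALIASES = {
    "turkey": "TR", "türkiye": "TR",
    "namibia": "NA",
    "dr congo": "CD", "drc": "CD", "congo-kinshasa": "CD",
    "burma": "MM", "myanmar (burma)": "MM",
    "palestine": "PS", "gaza": "PS", "west bank": "PS",
}
HISTORICAL = {
    "ussr", "soviet union", "yugoslavia", "czechoslovakia",
    "german democratic republic", "zaire",
}
# Assumed reading of "[±]DDD.DDD [NS], [±]DDD.DDD [EW]".
COORDINATE_ONLY = re.compile(
    r"^[+-]?\d{1,3}(?:\.\d+)?\s*[NS],\s*[+-]?\d{1,3}(?:\.\d+)?\s*[EW]$"
)


def resolve_iso(country):
    """Return the geo_iso3166 value for the special cases listed above."""
    key = country.strip().lower()
    if key in UK_PARTS:
        return "GB"
    if key in HISTORICAL:
        return "HISTORICAL"
    return ALIASES.get(key)  # None when the name is not one of the special cases


def is_coordinate_only(raw):
    """True when the whole string is a latitude/longitude pair."""
    return COORDINATE_ONLY.match(raw.strip()) is not None
```

Note that the ISO code "NA" for Namibia is an output value; it is unrelated to the null token "na" that may appear in input strings.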
Geo Parsing Examples
| Input string | geo_country | geo_region | geo_locality | geo_iso3166 |
|---|---|---|---|---|
|  | Russia | Novosibirsk | Akademgorodok | RU |
|  | USA | California | San Diego | US |
|  | United Kingdom | NaN | NaN | GB |
|  | United Kingdom | Yorkshire | NaN | GB |
|  | Germany | Bavaria | NaN | DE |
|  | NaN | NaN | NaN | NaN |
|  | USSR | NaN | NaN | HISTORICAL |
|  | Turkey | Istanbul | NaN | TR |
|  | NaN | NaN | NaN | NaN |
|  | China | NaN | Shanghai | CN |
One Health Classifier
Module: biometaharmonizer.one_health
Class: OneHealthClassifier
The One Health classifier assigns each record to a standardized category
using deterministic, multi-layer semantic analysis. All biological knowledge
is loaded from one_health_dictionaries.json; no terms are hardcoded in the
Python source.
Valid output categories for one_health_category:
- Human: isolates from human clinical specimens or hosts
- Animal: domestic and companion animals, livestock, veterinary samples
- Aquatic: aquatic animal hosts and water-column samples
- Wildlife: wild-animal isolates (birds, rodents, bats, wild ungulates)
- Plant: plant material, rhizosphere, phytopathological samples
- Food: food products, ingredients, food-processing environments
- Environmental: soil, sediment, air, water, biofilms not otherwise classified
- Lab: culture collections, ATCC strains, in-vitro/in-vivo laboratory samples
- Unclassified: no category could be determined with sufficient confidence
The one_health_category column is always a string; it is never NaN.
Unclassifiable records receive the string "Unclassified".
Public methods:
- classify(): single-field classification from isolation_source; returns a pandas.Series.
- classify_joint(): two-field classification from isolation_source and host; delegates to classify_multi_field and returns the one_health_category Series.
- classify_with_confidence(): single-field with confidence; returns a DataFrame with columns one_health_category, one_health_term, one_health_confidence.
- classify_multi_field(): full multi-field evidence integration accepting named Series for any of: isolation_source, host, env_medium, env_local_scale, env_broad_scale, sample_type.
Confidence model:
confidence = min(1.0, term_specificity * field_weight + corroboration_bonus)
Term specificity values:
Unambiguous list or host dict hit: 1.0
Tier1 phrase ≥ 8 characters: 0.90
Tier1 term 4–7 characters: 0.75
Tier1 term < 4 characters: 0.50
Ambiguous specimen term: 0.3
Confidence is discretized by discretize_confidence():
>= 0.85 → "high"
>= 0.60 → "medium"
>= 0.30 → "low"
< 0.30 → "unresolved"
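The confidence formula and thresholds above translate directly into code. The sketch below is illustrative only: tier1_specificity, confidence, and discretize are hypothetical names, and because this section does not state the field weights or the corroboration bonus, they are taken as plain arguments.

```python
def tier1_specificity(term):
    """Specificity of a Tier1 term, based on its length in characters."""
    length = len(term)
    if length >= 8:
        return 0.90
    if length >= 4:
        return 0.75
    return 0.50


def confidence(term_specificity, field_weight, corroboration_bonus=0.0):
    """confidence = min(1.0, term_specificity * field_weight + corroboration_bonus)"""
    return min(1.0, term_specificity * field_weight + corroboration_bonus)


def discretize(score):
    """Map a numeric confidence to the four documented labels."""
    if score >= 0.85:
        return "high"
    if score >= 0.60:
        return "medium"
    if score >= 0.30:
        return "low"
    return "unresolved"
```

For example, a Tier1 term of ten characters with a field weight of 1.0 and no bonus scores 0.90 and is labelled "high", while the same term under a field weight of 0.5 scores 0.45 and is labelled "low".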
Category and Example Isolation Sources
| Category | Example |
|---|---|
| Human | blood, urine, cerebrospinal fluid, rectal swab, wound, surgical site, sputum, Homo sapiens |
| Animal | bovine feces, swine nasal swab, chicken, cow, dog, pig, horse fecal, poultry litter, Bos taurus |
| Aquatic | fish, salmon, shrimp, aquaculture water, trout, tilapia, oyster, clam |
| Wildlife | wild bird, bat, rodent, deer, wild boar, migratory bird, fox, raccoon |
| Plant | plant root, rhizosphere, leaf surface, tomato, rice, wheat stem, Arabidopsis |
| Food | ground beef, raw milk, cheese, lettuce, retail chicken, food processing surface, ready-to-eat meat |
| Environmental | soil, river sediment, wastewater, biofilm, air sample, drinking water, estuary water |
| Lab | ATCC strain, in vitro culture, laboratory stock, type strain, passage culture |
Synonym Resolution
Module: biometaharmonizer.synonyms
Function: build_synonym_lookup()
The synonym lookup table maps every lowercased synonym to a canonical standard
key. It is built in two layers and cached via functools.lru_cache for the
lifetime of the process:
Layer 1 — unified.json: Project-defined synonyms. Each field entry has a
standard_key and a list of synonyms. All synonyms are lowercased and
mapped to the standard_key.
Layer 2 — ncbi_attributes.xml (optional): NCBI BioSample official
HarmonizedName and Synonym entries. Present only after running
scripts/build_ncbi_attribute_cache.py. Where the two layers conflict, the
authoritative NCBI mapping from Layer 2 overwrites the Layer 1 entry.
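The two-layer build with caching might look roughly like this. Everything here is a sketch: build_lookup_sketch is a hypothetical stand-in for build_synonym_lookup(), the entries are invented for illustration and are not taken from unified.json or ncbi_attributes.xml, and the file parsing is replaced by in-memory lists.

```python
from functools import lru_cache

# Hypothetical entries standing in for the parsed unified.json (Layer 1).
LAYER1_ENTRIES = [
    {"standard_key": "collection_date", "synonyms": ["Collection Date", "sampling date"]},
    {"standard_key": "host", "synonyms": ["Host Organism", "isolated from"]},
]
# Hypothetical entries standing in for the parsed ncbi_attributes.xml (Layer 2).
LAYER2_ENTRIES = [
    {"standard_key": "isolation_source", "synonyms": ["isolated from"]},
]


@lru_cache(maxsize=1)
def build_lookup_sketch():
    """Build the lowercased synonym -> standard key table once per process."""
    lookup = {}
    # Layer 2 is applied after Layer 1, so it overwrites conflicting keys.
    for layer in (LAYER1_ENTRIES, LAYER2_ENTRIES):
        for entry in layer:
            for synonym in entry["synonyms"]:
                lookup[synonym.lower()] = entry["standard_key"]
    return lookup


lookup = build_lookup_sketch()
```

Because the function takes no arguments, lru_cache returns the same dictionary object on every later call, so callers should treat it as read-only.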
Selected synonym → canonical key mappings (from unified.json):
| Synonym (lowercased) | Canonical key |
|---|---|
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
Null Normalization in Harmonization Engines
Both DateEngine and
GeoEngine maintain their own
NULL_PATTERNS class attribute (a compiled regex). The date engine’s
pattern covers: missing, unknown, n/a, not provided,
not collected, na, none, --, and any missing:.*,
not applicable:.*, or restricted access variant. The geo engine
uses the same comprehensive pattern as the ingestion module.
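A compiled pattern covering the tokens listed for the date engine could be written as follows. This is an assumed reconstruction from the list above, not the actual NULL_PATTERNS attribute: the names NULL_PATTERN_SKETCH and is_null are hypothetical, matching is case-insensitive by assumption, and "restricted access variant" is read as any string beginning with "restricted access".

```python
import re

NULL_PATTERN_SKETCH = re.compile(
    r"^(?:missing(?::.*)?|unknown|n/a|not provided|not collected|na|none|--"
    r"|not applicable:.*|restricted access.*)$",
    re.IGNORECASE,
)


def is_null(value):
    """True when the whole (stripped) string is one of the null tokens."""
    return NULL_PATTERN_SKETCH.match(value.strip()) is not None
```

Anchoring the pattern at both ends keeps real values that merely contain a token, such as "Namibia", from being treated as null.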