Developer Scripts

The scripts/ directory contains three maintenance scripts for contributors and power users who build or refresh the data assets consumed at runtime. None of them imports from the biometaharmonizer package; they are intentionally standalone.

build_dictionaries.py

Purpose

scripts/build_dictionaries.py builds the enriched one_health_dictionaries.json file that powers the OneHealthClassifier. It queries three external sources:

  1. OLS4 API — Environmental Ontology (ENVO), FoodOn, UBERON, and Plant Ontology for Environmental, Food, _anatomy, and Plant category terms.

  2. NCBI Taxonomy local dump — builds host_to_category mappings for vertebrate animals and plants by walking the taxonomic tree from configured root taxon IDs.

  3. UMLS API (optional, requires API key) — synonym expansion for 17 clinical specimen CUIs.

The hand-curated base file is loaded first. The merge strategy is base_wins: any key already present in the base dictionary is never overwritten by ontology-derived data. This preserves expert-curated entries against automated ontology updates.
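The base_wins rule described above can be sketched as a guarded dictionary merge. This is a hypothetical helper for illustration (the real merge logic lives inside build_dictionaries.py): curated keys are never overwritten, and list-valued keys only gain terms they do not already contain.

```python
def merge_base_wins(base: dict, enriched: dict) -> dict:
    """Merge ontology-derived data into the hand-curated base (base wins).

    Keys already present in ``base`` are never overwritten; for list
    values, only terms not already known are appended.
    """
    merged = {k: list(v) if isinstance(v, list) else v for k, v in base.items()}
    for key, value in enriched.items():
        if key not in merged:
            merged[key] = value          # new ontology-derived key
        elif isinstance(merged[key], list) and isinstance(value, list):
            seen = set(merged[key])
            merged[key].extend(t for t in value if t not in seen)
        # scalar keys present in base are left untouched (base wins)
    return merged
```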

OLS4 Integration

The following ontologies are queried, mapped via OLS_ONTOLOGY_MAP:

Ontology   OLS ID   One Health category          Seed IRIs (short form)
ENVO       envo     Environmental                ENVO:00000428, ENVO:00010483, ENVO:01000254, ENVO:01001110, ENVO:00000063, ENVO:00000015, ENVO:00000873, ENVO:00000134, ENVO:00002006
FoodOn     foodon   Food                         FOODON:00001002, FOODON:03400361, FOODON:00001709, FOODON:03420194
UBERON     uberon   _anatomy (split post-fetch)  UBERON:0000465
PO         po       Plant                        PO:0025131

The OLS4 hierarchicalDescendants endpoint is used to traverse the class hierarchy. Only hasExactSynonym annotations are collected (not broad, narrow, or related) to minimise false-positive category assignments.

Each raw OLS term string is cleaned by _clean_ols_term(), which applies six rules in order:

  1. Strip OLS language/scope tags: (exact), (related), etc.

  2. Strip parenthetical scope tags that appear after a comma.

  3. Reject GS1 GPC catalogue codes (e.g. "0900000 - cereals (GS1 GPC)").

  4. Reject regulatory catalogue codes matching _RE_REGULATORY_CATALOGUE, which covers EFSA FoodEx2 codes, EC codes, EuroFIR, EFG, CIAA, CCFAC, and Codex entries. These are classification artefacts, not free-text terms that could match BioSample metadata.

  5. Reject terms of fewer than 2 characters.

  6. If any rejection rule fires, return None to signal that the term should be discarded; otherwise return the cleaned string.
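The rules above can be sketched roughly as follows. The regex patterns here are illustrative stand-ins, not the script's actual _RE_REGULATORY_CATALOGUE and friends:

```python
import re

# Hypothetical re-creations of the cleaning patterns; the real regexes
# live in build_dictionaries.py.
_RE_SCOPE_TAG = re.compile(r"\s*\((?:exact|related|broad|narrow)\)\s*$", re.I)
_RE_GS1 = re.compile(r"\(GS1 GPC\)", re.I)
_RE_REGULATORY = re.compile(r"\b(EFSA|FoodEx2|EuroFIR|EFG|CIAA|CCFAC|Codex)\b", re.I)

def clean_ols_term(raw: str):
    term = _RE_SCOPE_TAG.sub("", raw).strip()          # rule 1
    term = re.sub(r",\s*\([^)]*\)\s*$", "", term).strip()  # rule 2
    if _RE_GS1.search(term):          # rule 3: GS1 GPC catalogue code
        return None
    if _RE_REGULATORY.search(term):   # rule 4: regulatory catalogue artefact
        return None
    if len(term) < 2:                 # rule 5: too short to be useful
        return None
    return term                       # rule 6: None above means "discard"
```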

UBERON Anatomy Classification

UBERON terms under UBERON:0000465 (material anatomical entity) are split into three buckets using membership in two constant sets:

  • ``_uberon_human`` — terms in UBERON_HUMAN_EXCLUSIVE: cerebrospinal fluid, pleural fluid, peritoneal fluid, synovial fluid, amniotic fluid, dialysate, bronchoalveolar lavage, sputum, dental plaque, catheter, central venous.

  • ``_uberon_animal`` — terms in UBERON_ANIMAL_EXCLUSIVE: rumen, reticulum, omasum, abomasum, gizzard, proventriculus, crop, cloaca, swim bladder, gill, hemolymph, exoskeleton.

  • ``_uberon_ambiguous`` — all remaining UBERON anatomy terms that cannot be assigned to either set; stored in ambiguous_specimen_terms.
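The three-way split amounts to set membership tests. A minimal sketch, using abbreviated stand-ins for the full UBERON_HUMAN_EXCLUSIVE and UBERON_ANIMAL_EXCLUSIVE sets listed above:

```python
# Abbreviated stand-ins for the script's constant sets.
UBERON_HUMAN_EXCLUSIVE = {"cerebrospinal fluid", "sputum", "dental plaque"}
UBERON_ANIMAL_EXCLUSIVE = {"rumen", "gizzard", "cloaca"}

def split_uberon_terms(terms):
    """Bucket UBERON anatomy terms into human / animal / ambiguous."""
    human, animal, ambiguous = [], [], []
    for term in terms:
        if term in UBERON_HUMAN_EXCLUSIVE:
            human.append(term)
        elif term in UBERON_ANIMAL_EXCLUSIVE:
            animal.append(term)
        else:
            ambiguous.append(term)   # ends up in ambiguous_specimen_terms
    return human, animal, ambiguous
```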

NCBI Taxonomy Integration

A BFS walk of the NCBI taxonomy tree is performed from the following root taxon IDs (NCBI_TAXON_ROOTS):

Taxon ID   Name             Category
9606       Homo sapiens     Human
40674      Mammalia         Animal
8782       Aves             Animal
8504       Reptilia         Animal
8292       Amphibia         Animal
7776       Chondrichthyes   Animal
7898       Actinopterygii   Animal
6656       Arthropoda       Animal
6447       Mollusca         Animal
6231       Nematoda         Animal
6340       Annelida         Animal
7586       Echinodermata    Animal
6073       Cnidaria         Animal
6040       Porifera         Animal
33090      Viridiplantae    Plant
2763       Rhodophyta       Plant
3041       Chlorophyta      Plant
2870       Phaeophyceae     Plant

Note

Homo sapiens (txid 9606) is treated as an exact match only — its subtree is not walked, because the subtree contains only subspecies/race taxa that should not produce additional entries. Fungi (txid 4751) are intentionally excluded: their One Health category is context-dependent (Environmental pathogen, Food spoilage, Animal/Human mycosis) and cannot be determined from taxonomy alone.
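The walk, including the Homo sapiens exact-match special case, can be sketched with a toy parent→child map. This is a hypothetical simplification; the real script builds `children_of` by reading nodes.dmp:

```python
from collections import deque

def walk_taxonomy(children_of, roots):
    """BFS from each root taxid, assigning its category to all descendants.

    ``children_of`` maps taxid -> list of child taxids (from nodes.dmp);
    ``roots`` maps root taxid -> One Health category. Homo sapiens (9606)
    is matched exactly and its subtree is not walked.
    """
    category = {}
    for root, cat in roots.items():
        if root == 9606:              # exact match only, skip subtree
            category[root] = cat
            continue
        queue = deque([root])
        while queue:
            taxid = queue.popleft()
            category.setdefault(taxid, cat)
            queue.extend(children_of.get(taxid, []))
    return category

# toy tree: Mammalia (40674) has one child; Homo sapiens has a subspecies
children = {40674: [9685], 9606: [63221]}
cats = walk_taxonomy(children, {9606: "Human", 40674: "Animal"})
```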

Name strings are extracted from names.dmp using only the following name_class values (NAMES_DMP_KEEP_CLASSES):

  • scientific name

  • common name

  • genbank common name

  • equivalent name

Names of 1–3 tokens are kept; longer names are excluded to reduce noise.
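The name-class and token-count filters can be sketched as a per-line parser for the tab-pipe-delimited names.dmp format. The function name here is hypothetical:

```python
NAMES_DMP_KEEP_CLASSES = {"scientific name", "common name",
                          "genbank common name", "equivalent name"}

def parse_names_dmp_line(line: str):
    """Parse one names.dmp row (fields are '\\t|\\t'-delimited).

    Returns (taxid, name), or None if the name class is not kept or
    the name is not 1-3 tokens long.
    """
    fields = [f.strip() for f in line.split("|")]
    taxid, name, _unique, name_class = fields[:4]
    if name_class not in NAMES_DMP_KEEP_CLASSES:
        return None
    if not 1 <= len(name.split()) <= 3:   # keep 1-3 token names only
        return None
    return int(taxid), name
```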

The --taxdmp argument accepts three input forms:

  • A path to a pre-downloaded taxdmp.zip file.

  • A path to an extracted directory containing names.dmp and nodes.dmp.

  • Omitted entirely, in which case taxdmp.zip is downloaded automatically from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip (approximately 65 MB).

Collision Resolution

_resolve_collisions(base) detects terms that appear in multiple categories or conflict between dictionary sections. Two collision types are handled:

  1. Intra-ontology_map: a term string appears in two or more category lists within ontology_map (e.g. "blood" in both Food and Animal).

  2. Cross-section: a term in ontology_map also exists in host_to_category, unambiguous_human_terms, unambiguous_animal_terms, or ambiguous_specimen_terms.

In both cases the term is removed from its ontology_map category list(s) and appended to base["ambiguous_category_terms"] with a list of conflicting source labels. The base_wins exemption applies: any term that was present in the hand-curated base ontology_map before this build run is excluded from collision processing.
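The detection logic can be sketched as follows. This is a simplified, hypothetical reconstruction of _resolve_collisions(): it tallies the sections each term appears in, then moves multi-source terms out of ontology_map unless they are exempt under base_wins:

```python
def resolve_collisions(ontology_map, other_sections, curated_terms):
    """Move colliding terms into an ambiguous bucket (sketch).

    Terms in more than one ontology_map category, or in ontology_map plus
    another section, are removed from ontology_map -- unless they were in
    the hand-curated base (``curated_terms``: the base_wins exemption).
    """
    sources = {}                                  # term -> [source labels]
    for category, terms in ontology_map.items():
        for t in terms:
            sources.setdefault(t, []).append(f"ontology_map:{category}")
    for section, terms in other_sections.items():
        for t in terms:
            if t in sources:                      # cross-section collision
                sources[t].append(section)
    ambiguous = {}
    for term, labels in sources.items():
        if len(labels) > 1 and term not in curated_terms:
            ambiguous[term] = labels
            for category in ontology_map:
                if term in ontology_map[category]:
                    ontology_map[category].remove(term)
    return ambiguous
```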

UMLS Integration

When --umls-key is provided, synonym expansion is performed for 17 clinical specimen CUIs via the UMLS TGT → service-ticket authentication flow:

  1. A TGT (Ticket Granting Ticket) is obtained by POST to the UMLS UTS API with the API key.

  2. For each CUI in UMLS_SPECIMEN_CUIS, a service ticket is requested and used to query the CUI’s atom list for English synonyms.
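The two-step flow can be sketched with the HTTP layer injected as callables, so the shape of the protocol is visible without the network. The endpoint URLs follow the standard UMLS UTS pattern, but treat them as illustrative; consult the script for the exact requests it issues:

```python
def fetch_synonyms(api_key, cuis, post, get):
    """UMLS TGT -> service-ticket flow (sketch with injected HTTP).

    ``post(url, data)`` and ``get(url, params)`` stand in for the real
    HTTP calls made with the requests library.
    """
    # Step 1: obtain a Ticket Granting Ticket with the API key.
    tgt_url = post("https://utslogin.nlm.nih.gov/cas/v1/api-key",
                   {"apikey": api_key})
    synonyms = {}
    for cui in cuis:
        # Step 2: one service ticket per request, then query the atoms.
        ticket = post(tgt_url, {"service": "http://umlsks.nlm.nih.gov"})
        atoms = get("https://uts-ws.nlm.nih.gov/rest/content/current"
                    f"/CUI/{cui}/atoms", {"ticket": ticket, "language": "ENG"})
        synonyms[cui] = sorted({a["name"].lower() for a in atoms})
    return synonyms
```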

The 17 CUIs and their canonical names:

CUI        Canonical name
C0005767   blood
C0042036   urine
C0038569   sputum
C0007555   cerebrospinal fluid
C0205189   pleural fluid
C0003967   ascitic fluid
C0039981   synovial fluid
C0006252   bronchial lavage
C0444941   wound
C0000735   abscess
C0032227   pus
C0015411   feces
C0521481   rectal swab
C0029001   oral swab
C0042048   vaginal swab
C0877612   nasal swab
C0586478   throat swab

CLI Flags

All flags are derived from parse_args() in the script:

build_dictionaries.py CLI flags

Flag         Description
--base       Path to the hand-curated base JSON. Default: src/biometaharmonizer/schemas/one_health_dictionaries.json
--output     Output path for the enriched JSON. Default: same as --base (overwrites in place).
--taxdmp     Path to taxdmp.zip or an extracted directory containing names.dmp and nodes.dmp. Omit to trigger automatic download (~65 MB) from NCBI FTP.
--umls-key   UMLS API key for synonym expansion. Omit to skip UMLS.
--skip-ols   Skip all OLS4 queries.
--skip-ncbi  Skip NCBI Taxonomy processing.
--dry-run    Build the enriched dict in memory but do not write to disk.

Usage Examples

# Full run — overwrites bundled dictionary in place:
python scripts/build_dictionaries.py \
    --base    src/biometaharmonizer/schemas/one_health_dictionaries.json \
    --output  src/biometaharmonizer/schemas/one_health_dictionaries.json

# Use a pre-downloaded taxdmp.zip to skip the ~65 MB download:
python scripts/build_dictionaries.py --taxdmp /path/to/taxdmp.zip

# Skip NCBI taxonomy entirely:
python scripts/build_dictionaries.py --skip-ncbi

# Full run with UMLS synonym expansion:
python scripts/build_dictionaries.py --umls-key YOUR_UMLS_API_KEY

When to Re-run

Re-run this script when:

  • NCBI taxonomy is updated and new host names need to be incorporated into host_to_category.

  • New OLS ontology versions are released with additional terms.

  • New One Health categories are required (add them to the base JSON first, then re-run to merge ontology data).

  • After adding new hand-curated entries to the base JSON to propagate collision resolution correctly.

build_ncbi_attribute_cache.py

Purpose

scripts/build_ncbi_attribute_cache.py fetches the official NCBI BioSample attribute harmonization XML from NCBI and saves it as src/biometaharmonizer/schemas/ncbi_attributes.xml.

This file is Layer 2 of the synonym lookup used by build_synonym_lookup(). Without it, only Layer 1 (unified.json) is active and some NCBI HarmonizedName attributes may not be recognized. build_synonym_lookup() parses the file once per process, and the result is memoised via lru_cache thereafter.

Output file:

src/biometaharmonizer/schemas/ncbi_attributes.xml

This is the raw XML response from: https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml

CLI Flags:

build_ncbi_attribute_cache.py CLI flags

Flag          Description
--output-dir  Directory to write ncbi_attributes.xml into. Default: src/biometaharmonizer/schemas/
--skip-fetch  Skip the network request. Validate and report an existing ncbi_attributes.xml only.

On network failure the script makes up to three attempts, sleeping with exponential backoff between them (2 s, then 4 s; MAX_ATTEMPTS = 3, TIMEOUT = 30 seconds).
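The retry loop amounts to the following sketch. The function and parameter names are hypothetical, and the sleep function is injectable so the backoff schedule can be inspected:

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Retry ``fetch`` on network failure with exponential backoff.

    Sleeps 2 s after the first failure, 4 s after the second; the final
    failure is re-raised. ``fetch`` stands in for the HTTP request
    (the 30 s TIMEOUT would be applied inside it).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 2 s, then 4 s
```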

After saving, parse_and_report() prints the count of HarmonizedName entries and Synonym entries found in the XML.
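The reporting step reduces to counting two element types in the saved XML. A minimal sketch with a hypothetical function name, assuming the element names HarmonizedName and Synonym from the NCBI attributes XML:

```python
import xml.etree.ElementTree as ET

def count_attribute_entries(xml_text: str):
    """Count HarmonizedName and Synonym entries in the attributes XML."""
    root = ET.fromstring(xml_text)
    harmonized = root.findall(".//HarmonizedName")
    synonyms = root.findall(".//Synonym")
    return len(harmonized), len(synonyms)
```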

Usage examples:

# Fetch and save to default location:
python scripts/build_ncbi_attribute_cache.py

# Save to a custom directory:
python scripts/build_ncbi_attribute_cache.py --output-dir /tmp/schemas

# Validate an existing file without a network request:
python scripts/build_ncbi_attribute_cache.py --skip-fetch

When to re-run:

Re-run periodically (e.g. monthly) or whenever NCBI adds or renames BioSample attributes. The tool functions without this file (Layer 2 disabled), but synonym coverage will be lower for packages whose non-standard attribute names are defined only in the NCBI XML.

generate_summary_report.py

Purpose

scripts/generate_summary_report.py generates a comprehensive visual summary report from a BioMetaHarmonizer output file. It reads a CSV/TSV/Parquet file produced by write() and produces an interactive HTML report (and optionally JSON and CSV summaries) with Plotly visualizations.

Input:

A harmonized DataFrame file produced by biometaharmonizer (CSV, TSV, or Parquet). The script loads it with pandas.

CLI Flags:

generate_summary_report.py CLI flags

Flag              Description
--input, -i       Required. Path to the input harmonized data file.
--output, -o      Output file path (for single-format output).
--output-dir, -d  Output directory for multi-format output.
--formats, -f     One or more of: html, json, csv. Default: inferred from --output suffix.
--verbose, -v     Enable DEBUG-level logging.

Output sections and visualizations generated:

The HTML report is produced by generate_full_html_report() and includes:

  1. Data Quality Dashboard (generate_quality_dashboard()) — four-panel subplot: fill rates by category (bar chart), overall completeness distribution (histogram), category-wise average fill rate (bar), and top-15 most complete columns (bar).

  2. Geospatial Visualizations (generate_geo_visualizations()) — country distribution choropleth or bar chart based on geo_country and geo_iso3166 columns.

  3. Temporal Analysis (generate_temporal_analysis()) — time-series distribution of collection_date values grouped by year or year-month.

  4. One Health Chart (generate_one_health_chart()) — pie or bar chart of one_health_category distribution.

  5. Host Analysis (generate_host_analysis()) — top host values from the host column.

  6. Extra Attributes Analysis (generate_extra_attributes_analysis()) — summary of keys present in _extra_attributes across all records, including antibiogram presence rate.

Metrics computed (compute_fill_rates() and generate_json_metrics()):

  • Per-column non_null_count, null_count, fill_pct for all 51 schema columns.

  • Category-level average fill rates (using COLUMN_CATEGORIES groupings).

  • Overall dataset completeness summary.

  • One Health category distribution counts and percentages.

  • Temporal coverage statistics (min/max year, year distribution).

  • Geographic coverage (unique countries, coverage by ISO3166).

  • Top host values and their frequencies.

  • _extra_attributes key frequency table.
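The per-column fill metrics can be illustrated pandas-free over a list of row dicts. This is a hypothetical simplification; the real script computes the same numbers on a pandas DataFrame:

```python
def compute_fill_rates(records, columns):
    """Per-column non_null_count, null_count, and fill_pct (sketch).

    ``records`` is a list of row dicts; None and "" count as null.
    """
    total = len(records)
    stats = {}
    for col in columns:
        non_null = sum(1 for r in records if r.get(col) not in (None, ""))
        stats[col] = {
            "non_null_count": non_null,
            "null_count": total - non_null,
            "fill_pct": round(100.0 * non_null / total, 1) if total else 0.0,
        }
    return stats
```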

Usage examples:

# Generate HTML report only:
python scripts/generate_summary_report.py \
    --input harmonized.csv \
    --output report.html

# Generate all formats:
python scripts/generate_summary_report.py \
    --input harmonized.csv \
    --output-dir reports/ \
    --formats html json csv

# Verbose logging:
python scripts/generate_summary_report.py \
    -i harmonized.parquet -o report.html -v

Note

Plotly must be installed for HTML/PDF output (pip install plotly). PDF export additionally requires kaleido (pip install kaleido). The script imports Plotly conditionally and will still produce JSON and CSV summaries if Plotly is absent.