Developer Scripts

The scripts/ directory contains three maintenance scripts for contributors and power users who build or refresh the data assets consumed at runtime. None of them imports from the biometaharmonizer package; they are intentionally standalone.

build_dictionaries.py

Purpose

scripts/build_dictionaries.py builds the enriched one_health_dictionaries.json file that powers the OneHealthClassifier. It queries three external sources:

  1. OLS4 API — Environmental Ontology (ENVO), FoodOn, UBERON, and Plant Ontology for Environmental, Food, _anatomy, and Plant category terms.

  2. NCBI Taxonomy local dump — builds host_to_category mappings for vertebrate animals and plants by walking the taxonomic tree from configured root taxon IDs.

  3. UMLS API (optional, requires API key) — synonym expansion for 17 clinical specimen CUIs.

The hand-curated base file is loaded first. The merge strategy is base_wins: any key already present in the base dictionary is never overwritten by ontology-derived data. This preserves expert-curated entries against automated ontology updates.
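The base_wins rule described above can be sketched as a guarded dictionary merge. This is a hypothetical helper for illustration (the real merge logic lives inside build_dictionaries.py): curated keys are never overwritten, and list-valued keys only gain terms they do not already contain.

```python
def merge_base_wins(base: dict, enriched: dict) -> dict:
    """Merge ontology-derived data into the hand-curated base (base wins).

    Keys already present in ``base`` are never overwritten; for list
    values, only terms not already known are appended.
    """
    merged = {k: list(v) if isinstance(v, list) else v for k, v in base.items()}
    for key, value in enriched.items():
        if key not in merged:
            merged[key] = value          # new ontology-derived key
        elif isinstance(merged[key], list) and isinstance(value, list):
            seen = set(merged[key])
            merged[key].extend(t for t in value if t not in seen)
        # scalar keys present in base are left untouched (base wins)
    return merged
```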

OLS4 Integration

The following ontologies are queried, mapped via OLS_ONTOLOGY_MAP:

Ontology   OLS ID   One Health category          Seed IRIs (short form)
ENVO       envo     Environmental                ENVO:00000428, ENVO:00010483, ENVO:01000254, ENVO:01001110, ENVO:00000063, ENVO:00000015, ENVO:00000873, ENVO:00000134, ENVO:00002006
FoodOn     foodon   Food                         FOODON:00001002, FOODON:03400361, FOODON:00001709, FOODON:03420194
UBERON     uberon   _anatomy (split post-fetch)  UBERON:0000465
PO         po       Plant                        PO:0025131

The OLS4 hierarchicalDescendants endpoint is used to traverse the class hierarchy. Only hasExactSynonym annotations are collected (not broad, narrow, or related) to minimise false-positive category assignments.

Each raw OLS term string is cleaned by _clean_ols_term(), which applies six rules in order:

  1. Strip OLS language/scope tags: (exact), (related), etc.

  2. Strip parenthetical scope tags that appear after a comma.

  3. Reject GS1 GPC catalogue codes (e.g. "0900000 - cereals (GS1 GPC)").

  4. Reject regulatory catalogue codes matching _RE_REGULATORY_CATALOGUE, which covers EFSA FoodEx2 codes, EC codes, EuroFIR, EFG, CIAA, CCFAC, and Codex entries. These are classification artefacts, not free-text terms that could match BioSample metadata.

  5. Reject terms of fewer than 2 characters.

  6. If any rejection rule fires, return None to signal that the term should be discarded; otherwise return the cleaned string.
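The rules above can be sketched roughly as follows. The regex patterns here are illustrative stand-ins, not the script's actual _RE_REGULATORY_CATALOGUE and friends:

```python
import re

# Hypothetical re-creations of the cleaning patterns; the real regexes
# live in build_dictionaries.py.
_RE_SCOPE_TAG = re.compile(r"\s*\((?:exact|related|broad|narrow)\)\s*$", re.I)
_RE_GS1 = re.compile(r"\(GS1 GPC\)", re.I)
_RE_REGULATORY = re.compile(r"\b(EFSA|FoodEx2|EuroFIR|EFG|CIAA|CCFAC|Codex)\b", re.I)

def clean_ols_term(raw: str):
    term = _RE_SCOPE_TAG.sub("", raw).strip()          # rule 1
    term = re.sub(r",\s*\([^)]*\)\s*$", "", term).strip()  # rule 2
    if _RE_GS1.search(term):          # rule 3: GS1 GPC catalogue code
        return None
    if _RE_REGULATORY.search(term):   # rule 4: regulatory catalogue artefact
        return None
    if len(term) < 2:                 # rule 5: too short to be useful
        return None
    return term                       # rule 6: None above means "discard"
```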

UBERON Anatomy Classification

UBERON terms under UBERON:0000465 (material anatomical entity) are split into three buckets using membership in two constant sets:

  • ``_uberon_human`` — terms in UBERON_HUMAN_EXCLUSIVE: cerebrospinal fluid, pleural fluid, peritoneal fluid, synovial fluid, amniotic fluid, dialysate, bronchoalveolar lavage, sputum, dental plaque, catheter, central venous.

  • ``_uberon_animal`` — terms in UBERON_ANIMAL_EXCLUSIVE: rumen, reticulum, omasum, abomasum, gizzard, proventriculus, crop, cloaca, swim bladder, gill, hemolymph, exoskeleton.

  • ``_uberon_ambiguous`` — all remaining UBERON anatomy terms that cannot be assigned to either set; stored in ambiguous_specimen_terms.
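The three-way split amounts to set membership tests. A minimal sketch, using abbreviated stand-ins for the full UBERON_HUMAN_EXCLUSIVE and UBERON_ANIMAL_EXCLUSIVE sets listed above:

```python
# Abbreviated stand-ins for the script's constant sets.
UBERON_HUMAN_EXCLUSIVE = {"cerebrospinal fluid", "sputum", "dental plaque"}
UBERON_ANIMAL_EXCLUSIVE = {"rumen", "gizzard", "cloaca"}

def split_uberon_terms(terms):
    """Bucket UBERON anatomy terms into human / animal / ambiguous."""
    human, animal, ambiguous = [], [], []
    for term in terms:
        if term in UBERON_HUMAN_EXCLUSIVE:
            human.append(term)
        elif term in UBERON_ANIMAL_EXCLUSIVE:
            animal.append(term)
        else:
            ambiguous.append(term)   # ends up in ambiguous_specimen_terms
    return human, animal, ambiguous
```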

NCBI Taxonomy Integration

A BFS walk of the NCBI taxonomy tree is performed from the following root taxon IDs (NCBI_TAXON_ROOTS):

Taxon ID   Name             Category
9606       Homo sapiens     Human
40674      Mammalia         Animal
8782       Aves             Animal
8504       Reptilia         Animal
8292       Amphibia         Animal
7776       Chondrichthyes   Animal
7898       Actinopterygii   Animal
6656       Arthropoda       Animal
6447       Mollusca         Animal
6231       Nematoda         Animal
6340       Annelida         Animal
7586       Echinodermata    Animal
6073       Cnidaria         Animal
6040       Porifera         Animal
33090      Viridiplantae    Plant
2763       Rhodophyta       Plant
3041       Chlorophyta      Plant
2870       Phaeophyceae     Plant

Note

Homo sapiens (txid 9606) is treated as an exact match only — its subtree is not walked, because the subtree contains only subspecies/race taxa that should not produce additional entries. Fungi (txid 4751) are intentionally excluded: their One Health category is context-dependent (Environmental pathogen, Food spoilage, Animal/Human mycosis) and cannot be determined from taxonomy alone.
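The walk, including the Homo sapiens exact-match special case, can be sketched with a toy parent→child map. This is a hypothetical simplification; the real script builds `children_of` by reading nodes.dmp:

```python
from collections import deque

def walk_taxonomy(children_of, roots):
    """BFS from each root taxid, assigning its category to all descendants.

    ``children_of`` maps taxid -> list of child taxids (from nodes.dmp);
    ``roots`` maps root taxid -> One Health category. Homo sapiens (9606)
    is matched exactly and its subtree is not walked.
    """
    category = {}
    for root, cat in roots.items():
        if root == 9606:              # exact match only, skip subtree
            category[root] = cat
            continue
        queue = deque([root])
        while queue:
            taxid = queue.popleft()
            category.setdefault(taxid, cat)
            queue.extend(children_of.get(taxid, []))
    return category

# toy tree: Mammalia (40674) has one child; Homo sapiens has a subspecies
children = {40674: [9685], 9606: [63221]}
cats = walk_taxonomy(children, {9606: "Human", 40674: "Animal"})
```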

Name strings are extracted from names.dmp using only the following name_class values (NAMES_DMP_KEEP_CLASSES):

  • scientific name

  • common name

  • genbank common name

  • equivalent name

Names of 1–3 tokens are kept; longer names are excluded to reduce noise.
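The name-class and token-count filters can be sketched as a per-line parser for the tab-pipe-delimited names.dmp format. The function name here is hypothetical:

```python
NAMES_DMP_KEEP_CLASSES = {"scientific name", "common name",
                          "genbank common name", "equivalent name"}

def parse_names_dmp_line(line: str):
    """Parse one names.dmp row (fields are '\\t|\\t'-delimited).

    Returns (taxid, name), or None if the name class is not kept or
    the name is not 1-3 tokens long.
    """
    fields = [f.strip() for f in line.split("|")]
    taxid, name, _unique, name_class = fields[:4]
    if name_class not in NAMES_DMP_KEEP_CLASSES:
        return None
    if not 1 <= len(name.split()) <= 3:   # keep 1-3 token names only
        return None
    return int(taxid), name
```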

The --taxdmp argument accepts three input forms:

  • A path to a pre-downloaded taxdmp.zip file.

  • A path to an extracted directory containing names.dmp and nodes.dmp.

  • Omitted entirely, in which case taxdmp.zip is downloaded automatically from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip (approximately 65 MB).

Collision Resolution

_resolve_collisions(base) detects terms that appear in multiple categories or conflict between dictionary sections. Two collision types are handled:

  1. Intra-ontology_map: a term string appears in two or more category lists within ontology_map (e.g. "blood" in both Food and Animal).

  2. Cross-section: a term in ontology_map also exists in host_to_category, unambiguous_human_terms, unambiguous_animal_terms, or ambiguous_specimen_terms.

In both cases the term is removed from its ontology_map category list(s) and appended to base["ambiguous_category_terms"] with a list of conflicting source labels. The base_wins exemption applies: any term that was present in the hand-curated base ontology_map before this build run is excluded from collision processing.
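The detection logic can be sketched as follows. This is a simplified, hypothetical reconstruction of _resolve_collisions(): it tallies the sections each term appears in, then moves multi-source terms out of ontology_map unless they are exempt under base_wins:

```python
def resolve_collisions(ontology_map, other_sections, curated_terms):
    """Move colliding terms into an ambiguous bucket (sketch).

    Terms in more than one ontology_map category, or in ontology_map plus
    another section, are removed from ontology_map -- unless they were in
    the hand-curated base (``curated_terms``: the base_wins exemption).
    """
    sources = {}                                  # term -> [source labels]
    for category, terms in ontology_map.items():
        for t in terms:
            sources.setdefault(t, []).append(f"ontology_map:{category}")
    for section, terms in other_sections.items():
        for t in terms:
            if t in sources:                      # cross-section collision
                sources[t].append(section)
    ambiguous = {}
    for term, labels in sources.items():
        if len(labels) > 1 and term not in curated_terms:
            ambiguous[term] = labels
            for category in ontology_map:
                if term in ontology_map[category]:
                    ontology_map[category].remove(term)
    return ambiguous
```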

UMLS Integration

When --umls-key is provided, synonym expansion is performed for 17 clinical specimen CUIs via the UMLS TGT → service-ticket authentication flow:

  1. A TGT (Ticket Granting Ticket) is obtained by POST to the UMLS UTS API with the API key.

  2. For each CUI in UMLS_SPECIMEN_CUIS, a service ticket is requested and used to query the CUI’s atom list for English synonyms.
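The two-step flow can be sketched with the HTTP layer injected as callables, so the shape of the protocol is visible without the network. The endpoint URLs follow the standard UMLS UTS pattern, but treat them as illustrative; consult the script for the exact requests it issues:

```python
def fetch_synonyms(api_key, cuis, post, get):
    """UMLS TGT -> service-ticket flow (sketch with injected HTTP).

    ``post(url, data)`` and ``get(url, params)`` stand in for the real
    HTTP calls made with the requests library.
    """
    # Step 1: obtain a Ticket Granting Ticket with the API key.
    tgt_url = post("https://utslogin.nlm.nih.gov/cas/v1/api-key",
                   {"apikey": api_key})
    synonyms = {}
    for cui in cuis:
        # Step 2: one service ticket per request, then query the atoms.
        ticket = post(tgt_url, {"service": "http://umlsks.nlm.nih.gov"})
        atoms = get("https://uts-ws.nlm.nih.gov/rest/content/current"
                    f"/CUI/{cui}/atoms", {"ticket": ticket, "language": "ENG"})
        synonyms[cui] = sorted({a["name"].lower() for a in atoms})
    return synonyms
```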

The 17 CUIs and their canonical names:

CUI        Canonical name
C0005767   blood
C0042036   urine
C0038569   sputum
C0007555   cerebrospinal fluid
C0205189   pleural fluid
C0003967   ascitic fluid
C0039981   synovial fluid
C0006252   bronchial lavage
C0444941   wound
C0000735   abscess
C0032227   pus
C0015411   feces
C0521481   rectal swab
C0029001   oral swab
C0042048   vaginal swab
C0877612   nasal swab
C0586478   throat swab

CLI Flags

All flags are derived from parse_args() in the script:

build_dictionaries.py CLI flags

Flag         Description
--base       Path to the hand-curated base JSON. Default: src/biometaharmonizer/schemas/one_health_dictionaries.json
--output     Output path for the enriched JSON. Default: same as --base (overwrites in place).
--taxdmp     Path to taxdmp.zip or an extracted directory containing names.dmp and nodes.dmp. Omit to trigger automatic download (~65 MB) from NCBI FTP.
--umls-key   UMLS API key for synonym expansion. Omit to skip UMLS.
--skip-ols   Skip all OLS4 queries.
--skip-ncbi  Skip NCBI Taxonomy processing.
--dry-run    Build the enriched dict in memory but do not write to disk.

Usage Examples

# Full run — overwrites bundled dictionary in place:
python scripts/build_dictionaries.py \
    --base    src/biometaharmonizer/schemas/one_health_dictionaries.json \
    --output  src/biometaharmonizer/schemas/one_health_dictionaries.json

# Use a pre-downloaded taxdmp.zip to skip the ~65 MB download:
python scripts/build_dictionaries.py --taxdmp /path/to/taxdmp.zip

# Skip NCBI taxonomy entirely:
python scripts/build_dictionaries.py --skip-ncbi

# Full run with UMLS synonym expansion:
python scripts/build_dictionaries.py --umls-key YOUR_UMLS_API_KEY

When to Re-run

Re-run this script when:

  • NCBI taxonomy is updated and new host names need to be incorporated into host_to_category.

  • New OLS ontology versions are released with additional terms.

  • New One Health categories are required (add them to the base JSON first, then re-run to merge ontology data).

  • After adding new hand-curated entries to the base JSON to propagate collision resolution correctly.

build_ncbi_attribute_cache.py

Purpose

scripts/build_ncbi_attribute_cache.py fetches the official NCBI BioSample attribute harmonization XML from NCBI and saves it as src/biometaharmonizer/schemas/ncbi_attributes.xml.

This file is Layer 2 of the synonym lookup used by build_synonym_lookup(). Without it, only Layer 1 (unified.json) is active and some NCBI HarmonizedName attributes may not be recognized. build_synonym_lookup() parses the file once per process, and the result is memoised via lru_cache thereafter.

Output file:

src/biometaharmonizer/schemas/ncbi_attributes.xml

This is the raw XML response from: https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml

CLI Flags:

build_ncbi_attribute_cache.py CLI flags

Flag          Description
--output-dir  Directory to write ncbi_attributes.xml into. Default: src/biometaharmonizer/schemas/
--skip-fetch  Skip the network request. Validate and report an existing ncbi_attributes.xml only.

On network failure the script makes up to three attempts, sleeping with exponential backoff between them (2 s, then 4 s; MAX_ATTEMPTS = 3, TIMEOUT = 30 seconds).
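The retry loop amounts to the following sketch. The function and parameter names are hypothetical, and the sleep function is injectable so the backoff schedule can be inspected:

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Retry ``fetch`` on network failure with exponential backoff.

    Sleeps 2 s after the first failure, 4 s after the second; the final
    failure is re-raised. ``fetch`` stands in for the HTTP request
    (the 30 s TIMEOUT would be applied inside it).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 2 s, then 4 s
```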

After saving, parse_and_report() prints the count of HarmonizedName entries and Synonym entries found in the XML.
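The reporting step reduces to counting two element types in the saved XML. A minimal sketch with a hypothetical function name, assuming the element names HarmonizedName and Synonym from the NCBI attributes XML:

```python
import xml.etree.ElementTree as ET

def count_attribute_entries(xml_text: str):
    """Count HarmonizedName and Synonym entries in the attributes XML."""
    root = ET.fromstring(xml_text)
    harmonized = root.findall(".//HarmonizedName")
    synonyms = root.findall(".//Synonym")
    return len(harmonized), len(synonyms)
```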

Usage examples:

# Fetch and save to default location:
python scripts/build_ncbi_attribute_cache.py

# Save to a custom directory:
python scripts/build_ncbi_attribute_cache.py --output-dir /tmp/schemas

# Validate an existing file without a network request:
python scripts/build_ncbi_attribute_cache.py --skip-fetch

When to re-run:

Re-run periodically (e.g. monthly) or whenever NCBI adds or renames BioSample attributes. The tool functions without this file (Layer 2 disabled), but synonym coverage will be lower for packages whose non-standard attribute names are defined only in the NCBI XML.

generate_summary_report.py

Purpose

scripts/generate_summary_report.py generates a comprehensive visual summary report from a BioMetaHarmonizer output file. It reads a CSV/TSV/Parquet file produced by write() and produces an interactive HTML report (and optionally JSON and CSV summaries) with Plotly visualizations.

Input:

A harmonized DataFrame file produced by biometaharmonizer (CSV, TSV, or Parquet). The script loads it with pandas.

CLI Flags:

generate_summary_report.py CLI flags

Flag              Description
--input, -i       Required. Path to the input harmonized data file.
--output, -o      Output file path (for single-format output).
--output-dir, -d  Output directory for multi-format output.
--formats, -f     One or more of: html, json, csv. Default: inferred from --output suffix.
--verbose, -v     Enable DEBUG-level logging.

Output sections and visualizations generated:

The HTML report is produced by generate_full_html_report() and includes:

  1. Data Quality Dashboard (generate_quality_dashboard()) — four-panel subplot: fill rates by category (bar chart), overall completeness distribution (histogram), category-wise average fill rate (bar), and top-15 most complete columns (bar).

  2. Geospatial Visualizations (generate_geo_visualizations()) — country distribution choropleth or bar chart based on geo_country and geo_iso3166 columns.

  3. Temporal Analysis (generate_temporal_analysis()) — time-series distribution of collection_date values grouped by year or year-month.

  4. One Health Chart (generate_one_health_chart()) — pie or bar chart of one_health_category distribution.

  5. Host Analysis (generate_host_analysis()) — top host values from the host column.

  6. Extra Attributes Analysis (generate_extra_attributes_analysis()) — summary of keys present in _extra_attributes across all records, including antibiogram presence rate.

Metrics computed (compute_fill_rates() and generate_json_metrics()):

  • Per-column non_null_count, null_count, fill_pct for all 51 schema columns.

  • Category-level average fill rates (using COLUMN_CATEGORIES groupings).

  • Overall dataset completeness summary.

  • One Health category distribution counts and percentages.

  • Temporal coverage statistics (min/max year, year distribution).

  • Geographic coverage (unique countries, coverage by ISO3166).

  • Top host values and their frequencies.

  • _extra_attributes key frequency table.
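The per-column fill metrics can be illustrated pandas-free over a list of row dicts. This is a hypothetical simplification; the real script computes the same numbers on a pandas DataFrame:

```python
def compute_fill_rates(records, columns):
    """Per-column non_null_count, null_count, and fill_pct (sketch).

    ``records`` is a list of row dicts; None and "" count as null.
    """
    total = len(records)
    stats = {}
    for col in columns:
        non_null = sum(1 for r in records if r.get(col) not in (None, ""))
        stats[col] = {
            "non_null_count": non_null,
            "null_count": total - non_null,
            "fill_pct": round(100.0 * non_null / total, 1) if total else 0.0,
        }
    return stats
```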

Usage examples:

# Generate HTML report only:
python scripts/generate_summary_report.py \
    --input harmonized.csv \
    --output report.html

# Generate all formats:
python scripts/generate_summary_report.py \
    --input harmonized.csv \
    --output-dir reports/ \
    --formats html json csv

# Verbose logging:
python scripts/generate_summary_report.py \
    -i harmonized.parquet -o report.html -v

Note

Plotly must be installed for HTML/PDF output (pip install plotly). PDF export additionally requires kaleido (pip install kaleido). The script imports Plotly conditionally and will still produce JSON and CSV summaries if Plotly is absent.