.. _scripts:

======================
Developer Scripts
======================

The ``scripts/`` directory contains three standalone maintenance scripts for
contributors and power users who build or refresh the data assets consumed at
runtime. None of these scripts imports from the ``biometaharmonizer`` package
itself; they are deliberately self-contained.

build_dictionaries.py
----------------------

**Purpose**

``scripts/build_dictionaries.py`` builds the enriched
``one_health_dictionaries.json`` file that powers the
:class:`~biometaharmonizer.one_health.OneHealthClassifier`. It queries three
external sources:

1. **OLS4 API** — Environmental Ontology (ENVO), FoodOn, UBERON, and Plant
   Ontology for ``Environmental``, ``Food``, ``_anatomy``, and ``Plant``
   category terms.
2. **NCBI Taxonomy local dump** — builds ``host_to_category`` mappings for
   vertebrate animals and plants by walking the taxonomic tree from configured
   root taxon IDs.
3. **UMLS API** (optional, requires an API key) — synonym expansion for 17
   clinical specimen CUIs.

The hand-curated base file is loaded first. The merge strategy is
**base_wins**: any key already present in the base dictionary is never
overwritten by ontology-derived data. This protects expert-curated entries
from automated ontology updates.

OLS4 Integration
~~~~~~~~~~~~~~~~~

The following ontologies are queried, mapped via ``OLS_ONTOLOGY_MAP``:

.. list-table:: OLS4 ontologies queried (``OLS_ONTOLOGY_MAP``)
   :header-rows: 1

   * - Ontology
     - OLS ID
     - One Health category
     - Seed IRIs (short form)
   * - ENVO
     - ``envo``
     - Environmental
     - ENVO:00000428, ENVO:00010483, ENVO:01000254, ENVO:01001110,
       ENVO:00000063, ENVO:00000015, ENVO:00000873, ENVO:00000134,
       ENVO:00002006
   * - FoodOn
     - ``foodon``
     - Food
     - FOODON:00001002, FOODON:03400361, FOODON:00001709, FOODON:03420194
   * - UBERON
     - ``uberon``
     - _anatomy (split post-fetch)
     - UBERON:0000465
   * - PO
     - ``po``
     - Plant
     - PO:0025131

The OLS4 ``hierarchicalDescendants`` endpoint is used to traverse the class
hierarchy.
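The traversal amounts to a breadth-first walk from the seed IRIs. A minimal
sketch with the paginated HTTP call factored out (the function names here are
illustrative, not the script's actual API):

```python
from collections import deque

def collect_descendants(seed_iris, fetch_children):
    """Breadth-first walk of a class hierarchy from a set of seed IRIs.

    ``fetch_children(iri)`` must yield ``(child_iri, label)`` pairs; in the
    real script this role is played by the paginated OLS4
    ``hierarchicalDescendants`` endpoint.
    """
    seen = set(seed_iris)
    terms = {}
    queue = deque(seed_iris)
    while queue:
        iri = queue.popleft()
        for child_iri, label in fetch_children(iri):
            if child_iri not in seen:      # avoid revisiting shared subclasses
                seen.add(child_iri)
                terms[child_iri] = label
                queue.append(child_iri)
    return terms
```

Deduplication via the ``seen`` set matters here because ontology class graphs
are not strict trees: a class can be a descendant of two seed IRIs at once.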
Only ``hasExactSynonym`` annotations are collected (not broad, narrow, or
related synonyms) to minimise false-positive category assignments.

Each raw OLS term string is cleaned by ``_clean_ols_term()``, which applies
six rules in order:

1. Strip OLS language/scope tags: ``(exact)``, ``(related)``, etc.
2. Strip parenthetical scope tags that appear after a comma.
3. Reject GS1 GPC catalogue codes (e.g. ``"0900000 - cereals (GS1 GPC)"``).
4. Reject regulatory catalogue codes matching ``_RE_REGULATORY_CATALOGUE``,
   which covers EFSA FoodEx2 codes, EC codes, EuroFIR, EFG, CIAA, CCFAC, and
   Codex entries. These are classification artefacts, not free-text terms
   that could match BioSample metadata.
5. Reject terms of fewer than 2 characters.
6. Return the cleaned term, or ``None`` to signal that the term should be
   discarded.

UBERON Anatomy Classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

UBERON terms under ``UBERON:0000465`` (material anatomical entity) are split
into three buckets using membership in two constant sets:

- ``_uberon_human`` — terms in ``UBERON_HUMAN_EXCLUSIVE``: cerebrospinal
  fluid, pleural fluid, peritoneal fluid, synovial fluid, amniotic fluid,
  dialysate, bronchoalveolar lavage, sputum, dental plaque, catheter,
  central venous.
- ``_uberon_animal`` — terms in ``UBERON_ANIMAL_EXCLUSIVE``: rumen,
  reticulum, omasum, abomasum, gizzard, proventriculus, crop, cloaca, swim
  bladder, gill, hemolymph, exoskeleton.
- ``_uberon_ambiguous`` — all remaining UBERON anatomy terms that cannot be
  assigned to either set; stored in ``ambiguous_specimen_terms``.

NCBI Taxonomy Integration
~~~~~~~~~~~~~~~~~~~~~~~~~~

A BFS walk of the NCBI taxonomy tree is performed from the following root
taxon IDs (``NCBI_TAXON_ROOTS``):
.. list-table:: NCBI taxonomy root taxa (``NCBI_TAXON_ROOTS``)
   :header-rows: 1

   * - Taxon ID
     - Name
     - Category
   * - 9606
     - Homo sapiens
     - Human
   * - 40674
     - Mammalia
     - Animal
   * - 8782
     - Aves
     - Animal
   * - 8504
     - Reptilia
     - Animal
   * - 8292
     - Amphibia
     - Animal
   * - 7776
     - Chondrichthyes
     - Animal
   * - 7898
     - Actinopterygii
     - Animal
   * - 6656
     - Arthropoda
     - Animal
   * - 6447
     - Mollusca
     - Animal
   * - 6231
     - Nematoda
     - Animal
   * - 6340
     - Annelida
     - Animal
   * - 7586
     - Echinodermata
     - Animal
   * - 6073
     - Cnidaria
     - Animal
   * - 6040
     - Porifera
     - Animal
   * - 33090
     - Viridiplantae
     - Plant
   * - 2763
     - Rhodophyta
     - Plant
   * - 3041
     - Chlorophyta
     - Plant
   * - 2870
     - Phaeophyceae
     - Plant

.. note::

   **Homo sapiens (txid 9606)** is treated as an exact match only — its
   subtree is not walked, because the subtree contains only subspecies/race
   taxa that should not produce additional entries.

   **Fungi (txid 4751)** are intentionally excluded: their One Health
   category is context-dependent (Environmental pathogen, Food spoilage,
   Animal/Human mycosis) and cannot be determined from taxonomy alone.

Name strings are extracted from ``names.dmp`` using only the following
``name_class`` values (``NAMES_DMP_KEEP_CLASSES``):

- ``scientific name``
- ``common name``
- ``genbank common name``
- ``equivalent name``

Names of 1–3 tokens are kept; longer names are excluded to reduce noise.

The ``--taxdmp`` argument accepts three input forms:

- A path to a pre-downloaded ``taxdmp.zip`` file.
- A path to an extracted directory containing ``names.dmp`` and
  ``nodes.dmp``.
- Omitted entirely, in which case ``taxdmp.zip`` is downloaded automatically
  from ``https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip``
  (approximately **65 MB**).

Collision Resolution
~~~~~~~~~~~~~~~~~~~~~

``_resolve_collisions(base)`` detects terms that appear in multiple
categories or conflict between dictionary sections. Two collision types are
handled:

1. **Intra-ontology_map:** a term string appears in two or more category
   lists within ``ontology_map`` (e.g.
   ``"blood"`` in both ``Food`` and ``Animal``).
2. **Cross-section:** a term in ``ontology_map`` also exists in
   ``host_to_category``, ``unambiguous_human_terms``,
   ``unambiguous_animal_terms``, or ``ambiguous_specimen_terms``.

In both cases the term is removed from its ``ontology_map`` category list(s)
and appended to ``base["ambiguous_category_terms"]`` together with a list of
the conflicting source labels.

The ``base_wins`` exemption applies: any term that was present in the
hand-curated base ``ontology_map`` before this build run is excluded from
collision processing.

UMLS Integration
~~~~~~~~~~~~~~~~~

When ``--umls-key`` is provided, synonym expansion is performed for 17
clinical specimen CUIs via the UMLS TGT → service-ticket authentication flow:

1. A TGT (Ticket-Granting Ticket) is obtained by a POST to the UMLS UTS API
   with the API key.
2. For each CUI in ``UMLS_SPECIMEN_CUIS``, a service ticket is requested and
   used to query the CUI's atom list for English synonyms.

The 17 CUIs and their canonical names:

.. list-table:: UMLS specimen CUIs (``UMLS_SPECIMEN_CUIS``)
   :header-rows: 1

   * - CUI
     - Canonical name
   * - C0005767
     - blood
   * - C0042036
     - urine
   * - C0038569
     - sputum
   * - C0007555
     - cerebrospinal fluid
   * - C0205189
     - pleural fluid
   * - C0003967
     - ascitic fluid
   * - C0039981
     - synovial fluid
   * - C0006252
     - bronchial lavage
   * - C0444941
     - wound
   * - C0000735
     - abscess
   * - C0032227
     - pus
   * - C0015411
     - feces
   * - C0521481
     - rectal swab
   * - C0029001
     - oral swab
   * - C0042048
     - vaginal swab
   * - C0877612
     - nasal swab
   * - C0586478
     - throat swab

CLI Flags
~~~~~~~~~~

All flags are derived from ``parse_args()`` in the script:

.. list-table:: ``build_dictionaries.py`` CLI flags
   :header-rows: 1
   :widths: 20 80

   * - Flag
     - Description
   * - ``--base``
     - Path to the hand-curated base JSON. Default:
       ``src/biometaharmonizer/schemas/one_health_dictionaries.json``
   * - ``--output``
     - Output path for the enriched JSON. Default: same as ``--base``
       (overwrites in place).
   * - ``--taxdmp``
     - Path to ``taxdmp.zip`` or an extracted directory containing
       ``names.dmp`` and ``nodes.dmp``. Omit to trigger automatic download
       (~65 MB) from NCBI FTP.
   * - ``--umls-key``
     - UMLS API key for synonym expansion. Omit to skip UMLS.
   * - ``--skip-ols``
     - Skip all OLS4 queries.
   * - ``--skip-ncbi``
     - Skip NCBI Taxonomy processing.
   * - ``--dry-run``
     - Build the enriched dict in memory but do not write to disk.

Usage Examples
~~~~~~~~~~~~~~

.. code-block:: bash

   # Full run — overwrites bundled dictionary in place:
   python scripts/build_dictionaries.py \
       --base src/biometaharmonizer/schemas/one_health_dictionaries.json \
       --output src/biometaharmonizer/schemas/one_health_dictionaries.json

   # Use a pre-downloaded taxdmp.zip to skip the ~65 MB download:
   python scripts/build_dictionaries.py --taxdmp /path/to/taxdmp.zip

   # Skip NCBI taxonomy entirely:
   python scripts/build_dictionaries.py --skip-ncbi

   # Full run with UMLS synonym expansion:
   python scripts/build_dictionaries.py --umls-key YOUR_UMLS_API_KEY

When to Re-run
~~~~~~~~~~~~~~

Re-run this script when:

- NCBI taxonomy is updated and new host names need to be incorporated into
  ``host_to_category``.
- New OLS ontology versions are released with additional terms.
- New One Health categories are required (add them to the base JSON first,
  then re-run to merge ontology data).
- New hand-curated entries have been added to the base JSON, so that
  collision resolution is propagated correctly.

build_ncbi_attribute_cache.py
-------------------------------

**Purpose**

``scripts/build_ncbi_attribute_cache.py`` fetches the official NCBI BioSample
attribute harmonization XML and saves it as
``src/biometaharmonizer/schemas/ncbi_attributes.xml``. This file is
**Layer 2** of the synonym lookup used by
:func:`~biometaharmonizer.synonyms.build_synonym_lookup`. Without it, only
Layer 1 (``unified.json``) is active and some NCBI ``HarmonizedName``
attributes may not be recognized.
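The fetch itself is a single HTTP GET wrapped in a retry loop. A minimal
stdlib sketch of that pattern, using the script's documented constants
(3 attempts, 2 s then 4 s backoff); the helper name and the injectable
``sleep`` parameter are illustrative, not the script's actual code:

```python
import time

MAX_ATTEMPTS = 3           # matches the script's documented constant
BACKOFF_DELAYS = (2, 4)    # seconds slept after the 1st and 2nd failures

def fetch_with_retry(fetch, sleep=time.sleep):
    """Call ``fetch()`` up to MAX_ATTEMPTS times, backing off between failures."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fetch()
        except OSError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of attempts: propagate the last error
            sleep(BACKOFF_DELAYS[attempt])
```

Injecting ``sleep`` keeps the backoff testable without real delays; in
production the default ``time.sleep`` is used.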
The file is parsed at runtime by
:func:`~biometaharmonizer.synonyms.build_synonym_lookup` every time the
process starts (and cached via ``lru_cache`` thereafter).

**Output file:** ``src/biometaharmonizer/schemas/ncbi_attributes.xml``

This is the raw XML response from:
``https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml``

**CLI Flags:**

.. list-table:: ``build_ncbi_attribute_cache.py`` CLI flags
   :header-rows: 1

   * - Flag
     - Description
   * - ``--output-dir``
     - Directory to write ``ncbi_attributes.xml`` into. Default:
       ``src/biometaharmonizer/schemas/``
   * - ``--skip-fetch``
     - Skip the network request; validate and report on an existing
       ``ncbi_attributes.xml`` only.

The script retries up to 3 times with exponential backoff (2 s, then 4 s) on
network failures (``MAX_ATTEMPTS = 3``, ``TIMEOUT = 30`` seconds). After
saving, ``parse_and_report()`` prints the number of ``HarmonizedName`` and
``Synonym`` entries found in the XML.

**Usage examples:**

.. code-block:: bash

   # Fetch and save to default location:
   python scripts/build_ncbi_attribute_cache.py

   # Save to a custom directory:
   python scripts/build_ncbi_attribute_cache.py --output-dir /tmp/schemas

   # Validate an existing file without a network request:
   python scripts/build_ncbi_attribute_cache.py --skip-fetch

**When to re-run:**

Re-run periodically (e.g. monthly) or whenever NCBI adds or renames BioSample
attributes. The tool functions without this file (Layer 2 disabled), but
synonym coverage will be lower for packages that use non-standard attribute
names defined only in the NCBI XML.

generate_summary_report.py
---------------------------

**Purpose**

``scripts/generate_summary_report.py`` generates a comprehensive visual
summary report from a BioMetaHarmonizer output file. It reads a
CSV/TSV/Parquet file produced by :func:`~biometaharmonizer.output.write` and
produces an interactive HTML report (and optionally JSON and CSV summaries)
with Plotly visualizations.
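Loading the input reduces to dispatching on the file suffix onto the
corresponding pandas reader. A stdlib sketch of that dispatch (the table,
helper name, and keyword arguments are illustrative; the script's actual
loader is not shown in this document):

```python
from pathlib import Path

# Suffix -> (pandas reader name, keyword arguments). Illustrative only.
READERS = {
    ".csv": ("read_csv", {}),
    ".tsv": ("read_csv", {"sep": "\t"}),
    ".parquet": ("read_parquet", {}),
}

def pick_reader(path):
    """Return the pandas reader name and kwargs for a harmonized output file."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported input format: {suffix!r}")
    return READERS[suffix]
```

Normalising the suffix with ``lower()`` lets ``harmonized.TSV`` and
``harmonized.tsv`` be treated identically.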
**Input:** A harmonized DataFrame file produced by ``biometaharmonizer``
(CSV, TSV, or Parquet). The script loads it with ``pandas``.

**CLI Flags:**

.. list-table:: ``generate_summary_report.py`` CLI flags
   :header-rows: 1

   * - Flag
     - Description
   * - ``--input, -i``
     - **Required.** Path to the input harmonized data file.
   * - ``--output, -o``
     - Output file path (for single-format output).
   * - ``--output-dir, -d``
     - Output directory for multi-format output.
   * - ``--formats, -f``
     - One or more of: ``html``, ``json``, ``csv``. Default: inferred from
       the ``--output`` suffix.
   * - ``--verbose, -v``
     - Enable DEBUG-level logging.

**Output sections and visualizations generated:**

The HTML report is produced by ``generate_full_html_report()`` and includes:

1. **Data Quality Dashboard** (``generate_quality_dashboard()``) — a
   four-panel subplot: fill rates by category (bar chart), overall
   completeness distribution (histogram), category-wise average fill rate
   (bar), and the top 15 most complete columns (bar).
2. **Geospatial Visualizations** (``generate_geo_visualizations()``) —
   country distribution choropleth or bar chart based on the ``geo_country``
   and ``geo_iso3166`` columns.
3. **Temporal Analysis** (``generate_temporal_analysis()``) — time-series
   distribution of ``collection_date`` values grouped by year or year-month.
4. **One Health Chart** (``generate_one_health_chart()``) — pie or bar chart
   of the ``one_health_category`` distribution.
5. **Host Analysis** (``generate_host_analysis()``) — top host values from
   the ``host`` column.
6. **Extra Attributes Analysis** (``generate_extra_attributes_analysis()``)
   — summary of keys present in ``_extra_attributes`` across all records,
   including the antibiogram presence rate.

**Metrics computed** (``compute_fill_rates()`` and
``generate_json_metrics()``):

- Per-column ``non_null_count``, ``null_count``, and ``fill_pct`` for all 51
  schema columns.
- Category-level average fill rates (using the ``COLUMN_CATEGORIES``
  groupings).
- Overall dataset completeness summary.
- One Health category distribution counts and percentages.
- Temporal coverage statistics (min/max year, year distribution).
- Geographic coverage (unique countries, coverage by ISO 3166 code).
- Top host values and their frequencies.
- ``_extra_attributes`` key frequency table.

**Usage examples:**

.. code-block:: bash

   # Generate HTML report only:
   python scripts/generate_summary_report.py \
       --input harmonized.csv \
       --output report.html

   # Generate all formats:
   python scripts/generate_summary_report.py \
       --input harmonized.csv \
       --output-dir reports/ \
       --formats html json csv

   # Verbose logging:
   python scripts/generate_summary_report.py \
       -i harmonized.parquet -o report.html -v

.. note::

   Plotly must be installed for HTML/PDF output (``pip install plotly``).
   PDF export additionally requires ``kaleido`` (``pip install kaleido``).
   The script imports Plotly conditionally and will still produce JSON and
   CSV summaries if Plotly is absent.
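The conditional import typically follows the standard try/except pattern.
A sketch of it; ``HAVE_PLOTLY`` and ``available_formats`` are illustrative
names, not the script's actual variables:

```python
# Conditional-import pattern: HTML rendering is skipped, but JSON/CSV
# summaries still work, when Plotly is not installed.
try:
    import plotly.graph_objects as go
    HAVE_PLOTLY = True
except ImportError:
    go = None
    HAVE_PLOTLY = False

def available_formats():
    """Return the output formats that can be produced in this environment."""
    formats = ["json", "csv"]       # always available (no Plotly needed)
    if HAVE_PLOTLY:
        formats.append("html")      # requires Plotly
    return formats
```

Degrading to a reduced feature set, rather than failing at import time, is
what lets the script run in minimal environments without Plotly.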