.. _faq:

===
FAQ
===

.. rubric:: 1. How do I register and use an NCBI API key?

Create a free NCBI account at https://www.ncbi.nlm.nih.gov/account/, navigate to **Settings → API Key Management**, and click **Create an API Key**. Copy the generated string and pass it to :func:`~biometaharmonizer.ingestion.ingest` via the ``api_key`` argument, or register it globally with :func:`~biometaharmonizer.ingestion.set_api_key`:

.. code-block:: python

   import biometaharmonizer as bmh

   bmh.set_api_key("YOUR_KEY")
   df = bmh.ingest(source="ids.txt", email="your@email.com")

Without a key, NCBI allows 3 requests/second (``inter_batch_sleep = 0.34 s``); with a key the limit rises to 10 requests/second (``inter_batch_sleep = 0.12 s``).

.. rubric:: 2. What happens when an accession is suppressed or withdrawn?

NCBI returns no corresponding XML element for suppressed, withdrawn, or otherwise invalid accessions, so the record is silently absent from the output DataFrame. When assembly accessions cannot be resolved through either the local index or the Entrez elink fallback, a WARNING is logged:

.. code-block:: text

   WARNING biometaharmonizer.ingestion: NOT resolved (suppressed/invalid after both passes): 2
   Unresolved: ['GCF_000000001.1', 'SAMN00000000']

The output DataFrame will therefore contain fewer rows than you supplied input IDs; compare ``len(df)`` with your input size to detect suppressed accessions.

.. rubric:: 3. How do I handle the rate-limit error HTTP 429?

An HTTP 429 response from NCBI means too many requests were sent in the last second. BioMetaHarmonizer's retry logic (``_MAX_RETRIES = 3``) handles transient 429s automatically using exponential backoff: wait = min(2^n, 30) seconds per attempt. If 429 errors persist:

- Ensure you have registered and passed an NCBI API key.
- Reduce ``fetch_batch_size`` (e.g. to 100) to send smaller requests.
- Avoid running multiple concurrent instances against the same API key.

.. rubric:: 4. What does ``refresh_cache=True`` do and when should I use it?
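In short, the flag bypasses the cache's time-to-live check and forces a fresh download. The decision logic can be pictured roughly as follows (a sketch only; ``cache_is_fresh`` and ``should_download`` are illustrative names, not BioMetaHarmonizer internals):

```python
import os
import time

CACHE_TTL_DAYS = 7  # mirrors the documented 7-day cache TTL


def cache_is_fresh(path: str, ttl_days: int = CACHE_TTL_DAYS) -> bool:
    """True if the cached file exists and is younger than the TTL."""
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < ttl_days * 86400


def should_download(path: str, refresh_cache: bool) -> bool:
    # refresh_cache=True wins unconditionally, regardless of file age.
    return refresh_cache or not cache_is_fresh(path)
```

With ``refresh_cache=False`` the file's modification time decides; with ``True`` the download always happens.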
When ``refresh_cache=True`` is set, the assembly summary flat files in ``CACHE_DIR`` (``assembly_summary_refseq.txt`` and ``assembly_summary_genbank.txt``) are deleted and re-downloaded unconditionally, regardless of their age, before the run begins. Use this flag when NCBI has added new assemblies since your last run and you need the latest assembly→BioSample mappings:

.. code-block:: python

   df = bmh.ingest(
       source="new_assemblies.txt",
       email="your@email.com",
       refresh_cache=True,
   )

Without this flag, cached files are reused for up to 7 days (``_CACHE_TTL_DAYS``).

.. rubric:: 5. Why do some records have ``None`` in most columns?

Records with many ``None`` values fall into two categories:

1. **Minimal BioSample records:** the submitter only provided the mandatory fields. This is common for Pathogen packages; for environmental packages MIxS fields dominate while clinical fields are absent.
2. **Null normalization:** BioMetaHarmonizer applies ``_normalize_null()`` to every attribute value. Any value matching ``_NULL_PATTERNS`` (e.g. ``"missing"``, ``"N/A"``, ``"not provided"``, ``"unknown"``) is stored as ``None``.

This is intentional: downstream tools can use ``df.notna()`` to compute genuine fill rates. Use :func:`~biometaharmonizer.output.write_summary` to generate a fill-rate report and identify which columns are sparsely populated for your dataset.

.. rubric:: 6. How do I use BioMetaHarmonizer in Google Colab?

Override the cache directory to a location inside ``/content`` so that the ~200 MB assembly index files are stored on the Colab writable filesystem (or in a mounted Google Drive):

.. code-block:: python

   !pip install biometaharmonizer

   import biometaharmonizer as bmh

   bmh.set_cache_dir("/content/bmh_cache")  # or "/content/drive/MyDrive/bmh_cache"
   bmh.set_email("your@email.com")
   bmh.set_api_key("YOUR_KEY")

   df = bmh.ingest(source=["SAMN02436525", "SAMN02434874"])
   df.to_csv("harmonized.csv", index=False)

The module docstring of ``ingestion.py`` contains this exact note under *Working directory note (Colab)*.

.. rubric:: 7. What is the difference between ``fetch_batch_size`` and ``esearch_batch_size``?

- ``fetch_batch_size`` (default 200) controls how many BioSample records are retrieved per ``efetch`` HTTP request. Each ``efetch`` call returns raw XML for ``N`` records; larger batches mean fewer round trips but larger per-response payloads.
- ``esearch_batch_size`` (module constant ``_ESEARCH_BATCH`` = 100) controls how many accession strings are assembled into a single ``esearch`` term (e.g. ``"SAMN01[Accession] OR SAMN02[Accession] OR ..."``) before a server-side History slot is created. It is also used in the Entrez elink fallback path for resolving GCF/GCA accessions that are not found in the local index.

In the CLI, ``--esearch-batch-size`` defaults to 200, overriding the module constant; the Python API default is 100.

.. rubric:: 8. How do I access antibiogram data from the output DataFrame?

Antibiogram data is stored as a JSON list inside the ``_extra_attributes`` column. To extract it:

.. code-block:: python

   import json

   import pandas as pd

   def get_antibiogram(row):
       if pd.isna(row["_extra_attributes"]):
           return []
       ea = json.loads(row["_extra_attributes"])
       return ea.get("antibiogram", [])

   abg_rows = []
   for _, row in df.iterrows():
       for entry in get_antibiogram(row):
           entry["biosample_accession"] = row["biosample_accession"]
           abg_rows.append(entry)

   abg_df = pd.DataFrame(abg_rows)

See :ref:`antibiogram` for a complete walkthrough.

.. rubric:: 9. How do I filter only records with a known ``one_health_category``?
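Before filtering, it often helps to survey which categories are present; here with a small made-up DataFrame standing in for ``ingest()`` output:

```python
import pandas as pd

# Made-up stand-in for the DataFrame returned by ingest().
df = pd.DataFrame(
    {"one_health_category": ["Human", "Food", "Unclassified", "Human"]}
)

# "Unclassified" is an ordinary string value (never NaN), so
# value_counts() tallies it alongside the real categories.
counts = df["one_health_category"].value_counts()
```
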
The ``one_health_category`` column is always a string (never NaN); records that could not be classified receive the string ``"Unclassified"``. To filter for classified records:

.. code-block:: python

   classified = df[df["one_health_category"] != "Unclassified"]

To filter for a specific category:

.. code-block:: python

   human = df[df["one_health_category"] == "Human"]
   food = df[df["one_health_category"] == "Food"]

.. rubric:: 10. Why does ``_extra_attributes`` contain pipe-separated values for some keys?

NCBI BioSample allows submitters to include multiple ``<Attribute>`` elements with the same ``attribute_name`` on a single record (a repeatable-attribute pattern used for AMR fields, multiple culture collection numbers, etc.). When this occurs, the ingestion parser joins the values with ``|`` before storing them in ``extras[key]``. This ensures no data is lost while keeping the JSON representation compact. To split them back:

.. code-block:: python

   import json

   ea = json.loads(row["_extra_attributes"])
   for key, val in ea.items():
       if isinstance(val, str) and "|" in val:
           parts = val.split("|")

.. rubric:: 11. How do I input assembly accessions vs. BioSample accessions?

BioMetaHarmonizer automatically detects the accession type by prefix:

- ``SAMN``, ``SAME``, ``SAMD`` → BioSample accessions (direct fetch)
- ``GCF_``, ``GCA_`` → Assembly accessions (two-step resolution)

You can mix both types in the same input file or list. Unrecognized IDs (wrong prefix) are logged at WARNING level and skipped. See :ref:`ingestion` for the full two-step resolution process for assembly accessions.

.. rubric:: 12. How do I rebuild the assembly index cache after NCBI adds new genomes?

Pass ``refresh_cache=True`` to ``ingest()`` or use ``--refresh-cache`` on the CLI. This forces deletion and re-download of both assembly summary files from NCBI FTP before processing begins, regardless of how recently they were last downloaded:

.. code-block:: bash

   biometaharmonizer run \
       --input assemblies.txt \
       --email your@email.com \
       --output harmonized.csv \
       --refresh-cache

The files (~100 MB each) are stored in ``CACHE_DIR`` (``~/.biometaharmonizer/cache/`` by default).

.. rubric:: 13. When and how do I re-run ``build_dictionaries.py`` to refresh the One Health term dictionary?

Re-run ``scripts/build_dictionaries.py`` when:

- NCBI taxonomy is updated and new host/organism names are needed.
- New OLS ontology versions are released.
- You add new hand-curated entries to the base ``one_health_dictionaries.json`` and need to propagate collision detection.
- New One Health categories are required.

To rebuild in place (overwrites the bundled file):

.. code-block:: bash

   python scripts/build_dictionaries.py \
       --base src/biometaharmonizer/schemas/one_health_dictionaries.json \
       --output src/biometaharmonizer/schemas/one_health_dictionaries.json

The base dictionary is loaded first; hand-curated entries are never overwritten (``base_wins`` strategy).

.. rubric:: 14. How do I use ``build_dictionaries.py`` with a pre-downloaded ``taxdmp.zip``?

To avoid the ~65 MB automatic download from NCBI FTP, download ``taxdmp.zip`` once and pass its local path via ``--taxdmp``:

.. code-block:: bash

   # Download once:
   wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip

   # Use the local file on subsequent runs:
   python scripts/build_dictionaries.py \
       --taxdmp /path/to/taxdmp.zip \
       --output src/biometaharmonizer/schemas/one_health_dictionaries.json

You may also pass the path to an already-extracted directory that contains ``names.dmp`` and ``nodes.dmp``:

.. code-block:: bash

   python scripts/build_dictionaries.py \
       --taxdmp /path/to/extracted_taxdmp/ \
       --output src/biometaharmonizer/schemas/one_health_dictionaries.json
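Whichever form you pass, a quick pre-flight check can catch a wrong path before the build starts its potentially long run. This helper is ours, not part of ``build_dictionaries.py``, and it only inspects top-level file names:

```python
import os
import zipfile

REQUIRED = {"names.dmp", "nodes.dmp"}


def taxdmp_ok(path: str) -> bool:
    """Accept either taxdmp.zip or an extracted taxdmp directory."""
    if os.path.isdir(path):
        # Extracted directory: both .dmp files must be present.
        return REQUIRED.issubset(os.listdir(path))
    if os.path.isfile(path) and zipfile.is_zipfile(path):
        # Zip archive: the official taxdmp.zip stores files at its root.
        with zipfile.ZipFile(path) as zf:
            return REQUIRED.issubset(zf.namelist())
    return False
```
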