FAQ

1. How do I register and use an NCBI API key?

Create a free NCBI account at https://www.ncbi.nlm.nih.gov/account/, navigate to Settings → API Key Management, and click Create an API Key. Copy the generated string and pass it to ingest() via the api_key argument, or register it globally with set_api_key():

import biometaharmonizer as bmh
bmh.set_api_key("YOUR_KEY")
df = bmh.ingest(source="ids.txt", email="your@email.com")

Without a key, NCBI allows 3 requests/second (inter_batch_sleep = 0.34 s); with a key the limit rises to 10 requests/second (inter_batch_sleep = 0.12 s).
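As a sanity check on those numbers, each sleep interval sits slightly above one second divided by the request limit, leaving a small safety margin under the cap:

```python
# (requests/second, inter_batch_sleep) pairs quoted above
limits = {3: 0.34, 10: 0.12}
for rps, sleep in limits.items():
    print(rps, sleep, round(sleep - 1 / rps, 3))  # margin above the theoretical minimum
```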

2. What happens when an accession is suppressed or withdrawn?

NCBI returns no <BioSample> XML element for suppressed, withdrawn, or otherwise invalid accessions. The record is silently absent from the output DataFrame. When assembly accessions cannot be resolved through either the local index or the Entrez elink fallback, a WARNING is logged:

WARNING biometaharmonizer.ingestion:
  NOT resolved (suppressed/invalid after both passes): 2
  Unresolved: ['GCF_000000001.1', 'SAMN00000000']

The output therefore contains fewer rows than there were input IDs; compare len(df) with your input size to detect suppressed accessions.
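A quick way to see which inputs were dropped, sketched on synthetic data (the accessions and the stand-in DataFrame are placeholders for your own run):

```python
import pandas as pd

# Placeholder inputs; in practice, read these from your ids.txt.
input_ids = ["SAMN02436525", "SAMN02434874", "SAMN00000000"]

# Stand-in for the DataFrame returned by ingest().
df = pd.DataFrame({"biosample_accession": ["SAMN02436525", "SAMN02434874"]})

# Set difference of inputs vs. returned accessions = suppressed/invalid IDs.
missing = sorted(set(input_ids) - set(df["biosample_accession"]))
print(len(df), missing)  # 2 ['SAMN00000000']
```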

3. How do I handle the rate-limit error HTTP 429?

An HTTP 429 response from NCBI means too many requests were sent in the last second. BioMetaHarmonizer’s retry logic (_MAX_RETRIES = 3) handles transient 429s automatically using exponential backoff: wait = min(2^n, 30) seconds per attempt. If 429 errors persist:

  • Ensure you have registered and passed an NCBI API key.

  • Reduce fetch_batch_size (e.g. to 100) to send smaller requests.

  • Avoid running multiple concurrent instances against the same API key.
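The backoff schedule above can be tabulated directly (assuming attempts are numbered from 1, which is an assumption, not a documented detail):

```python
_MAX_RETRIES = 3  # matches the constant cited above

# wait = min(2**n, 30) seconds for attempt n
waits = [min(2**n, 30) for n in range(1, _MAX_RETRIES + 1)]
print(waits)  # [2, 4, 8]
```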

4. What does refresh_cache=True do and when should I use it?

When refresh_cache=True is set, the assembly summary flat files in CACHE_DIR (assembly_summary_refseq.txt and assembly_summary_genbank.txt) are deleted and re-downloaded unconditionally — regardless of their age — before the run begins. Use this flag when NCBI has added new assemblies since your last run and you need the latest assembly→BioSample mappings:

df = bmh.ingest(source="new_assemblies.txt",
                email="your@email.com",
                refresh_cache=True)

Without this flag, cached files are used for up to 7 days (_CACHE_TTL_DAYS).
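The TTL check itself reduces to a file-age comparison; here is a minimal stand-in (cache_is_stale is illustrative, not the library's internal name):

```python
import os
import tempfile
import time

_CACHE_TTL_DAYS = 7  # matches the constant cited above

def cache_is_stale(path, now=None, ttl_days=_CACHE_TTL_DAYS):
    """Return True if the file is missing or older than the TTL (illustrative)."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 86400 > ttl_days

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

fresh = not cache_is_stale(path)                           # just created -> not stale
stale = cache_is_stale(path, now=time.time() + 8 * 86400)  # pretend 8 days passed
print(fresh, stale)  # True True
```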

5. Why do some records have None in most columns?

Records with many None values fall into two categories:

  1. Minimal BioSample records: the submitter only provided the mandatory fields. For Pathogen packages this is common; for environmental packages MIxS fields dominate while clinical fields are absent.

  2. Null normalization: BioMetaHarmonizer applies _normalize_null() to every attribute value. Any value matching _NULL_PATTERNS (e.g. "missing", "N/A", "not provided", "unknown") is stored as None. This is intentional: downstream tools can use df.notna() to compute genuine fill rates.

Use write_summary() to generate a fill-rate report and identify which columns are sparsely populated for your dataset.
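Because nulls are stored as None, a fill rate is simply the per-column fraction of non-null cells; a sketch on synthetic data:

```python
import pandas as pd

# Synthetic harmonized output: None marks normalized null values.
df = pd.DataFrame({
    "host": ["Homo sapiens", None, None, "Bos taurus"],
    "isolation_source": ["blood", "soil", None, None],
})

fill_rates = df.notna().mean()  # fraction of genuinely filled cells per column
print(fill_rates.to_dict())  # {'host': 0.5, 'isolation_source': 0.5}
```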

6. How do I use BioMetaHarmonizer in Google Colab?

Override the cache directory to a location inside /content so that the ~200 MB assembly index files are stored on the Colab writable filesystem (or in a mounted Google Drive):

!pip install biometaharmonizer

import biometaharmonizer as bmh

bmh.set_cache_dir("/content/bmh_cache")       # or "/content/drive/MyDrive/bmh_cache"
bmh.set_email("your@email.com")
bmh.set_api_key("YOUR_KEY")

df = bmh.ingest(source=["SAMN02436525", "SAMN02434874"])
df.to_csv("harmonized.csv", index=False)

The module docstring of ingestion.py contains the same note under "Working directory note (Colab)".

7. What is the difference between fetch_batch_size and esearch_batch_size?

  • ``fetch_batch_size`` (default 200) — controls how many BioSample records are retrieved per efetch HTTP request. Each efetch call returns raw XML for N records; larger batches mean fewer round trips but larger per-response payloads.

  • ``esearch_batch_size`` (module constant _ESEARCH_BATCH = 100) — controls how many accession strings are assembled into a single esearch term (e.g. "SAMN01[Accession] OR SAMN02[Accession] OR ...") before a server-side History slot is created. It is also used in the Entrez elink fallback path for resolving GCF/GCA accessions that are not found in the local index.

In the CLI, --esearch-batch-size defaults to 200 (overriding the module constant). The Python API default is 100.
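Both parameters boil down to plain list chunking; a sketch with invented accessions (batched is a hypothetical helper, not the library's internal function):

```python
def batched(ids, size):
    """Split a list of accessions into consecutive chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

ids = [f"SAMN{i:08d}" for i in range(250)]  # 250 invented accessions

# fetch_batch_size-style chunking: one efetch request per chunk.
print([len(b) for b in batched(ids, 200)])  # [200, 50]

# esearch-style term built from one chunk of 100 accessions.
term = " OR ".join(f"{a}[Accession]" for a in batched(ids, 100)[0])
print(term.count("[Accession]"))  # 100
```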

8. How do I access antibiogram data from the output DataFrame?

Antibiogram data is stored as a JSON list inside the _extra_attributes column. To extract it:

import json, pandas as pd

def get_antibiogram(row):
    """Return the list of antibiogram entries for one DataFrame row."""
    if pd.isna(row["_extra_attributes"]):
        return []
    ea = json.loads(row["_extra_attributes"])
    return ea.get("antibiogram", [])

abg_rows = []
for _, row in df.iterrows():
    for entry in get_antibiogram(row):
        entry["biosample_accession"] = row["biosample_accession"]
        abg_rows.append(entry)

abg_df = pd.DataFrame(abg_rows)

See Antibiogram for a complete walkthrough.
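For a self-contained illustration, the same flattening can be run on a synthetic two-row frame (the accessions, drug names, and the resistance_phenotype key are made up for the demo):

```python
import json
import pandas as pd

# Synthetic stand-in for the harmonized output; all values are invented.
df = pd.DataFrame({
    "biosample_accession": ["SAMN00000001", "SAMN00000002"],
    "_extra_attributes": [
        json.dumps({"antibiogram": [
            {"antibiotic": "ciprofloxacin", "resistance_phenotype": "resistant"},
            {"antibiotic": "ampicillin", "resistance_phenotype": "susceptible"},
        ]}),
        None,  # record with no extra attributes
    ],
})

# Flatten one antibiogram entry per output row, tagged with its accession.
abg_rows = []
for _, row in df.iterrows():
    if pd.isna(row["_extra_attributes"]):
        continue
    for entry in json.loads(row["_extra_attributes"]).get("antibiogram", []):
        abg_rows.append({**entry, "biosample_accession": row["biosample_accession"]})

abg_df = pd.DataFrame(abg_rows)
print(abg_df.shape)  # (2, 3)
```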

9. How do I filter only records with a known one_health_category?

The one_health_category column is always a string (never NaN). Records that could not be classified receive the string "Unclassified". To filter for classified records:

classified = df[df["one_health_category"] != "Unclassified"]

To filter for a specific category:

human = df[df["one_health_category"] == "Human"]
food  = df[df["one_health_category"] == "Food"]
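To tally how many records fall into each category after filtering (the column values below are synthetic):

```python
import pandas as pd

# Synthetic one_health_category column; always a string, never NaN.
df = pd.DataFrame({"one_health_category": ["Human", "Food", "Unclassified", "Human"]})

classified = df[df["one_health_category"] != "Unclassified"]
print(classified["one_health_category"].value_counts().to_dict())  # {'Human': 2, 'Food': 1}
```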

10. Why does _extra_attributes contain pipe-separated values for some keys?

NCBI BioSample allows submitters to include multiple <Attribute> elements with the same attribute_name on a single record (a repeatable-attribute pattern used for AMR fields, multiple culture collection numbers, etc.). When this occurs, the ingestion parser joins them with | before storing in extras[key]. This ensures no data is lost while keeping the JSON representation compact. To split them back:

import json

ea = json.loads(row["_extra_attributes"])
for key, val in ea.items():
    if isinstance(val, str) and "|" in val:
        ea[key] = val.split("|")  # restore the individual values as a list

11. How do I input assembly accessions vs. BioSample accessions?

BioMetaHarmonizer automatically detects the accession type by prefix:

  • SAMN, SAME, SAMD → BioSample accessions (direct fetch)

  • GCF_, GCA_ → Assembly accessions (two-step resolution)

You can mix both types in the same input file or list. Unrecognized IDs (wrong prefix) are logged at WARNING level and skipped. See Ingestion for the full two-step resolution process for assembly accessions.
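The dispatch amounts to a simple prefix check; the helper below is illustrative, not the library's actual function:

```python
def accession_type(acc):
    """Classify an input ID by its prefix (illustrative sketch)."""
    if acc.startswith(("SAMN", "SAME", "SAMD")):
        return "biosample"      # direct fetch
    if acc.startswith(("GCF_", "GCA_")):
        return "assembly"       # two-step resolution
    return "unrecognized"       # logged at WARNING level and skipped

ids = ["SAMN02436525", "GCA_000005845.2", "XYZ123"]
print([accession_type(a) for a in ids])  # ['biosample', 'assembly', 'unrecognized']
```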

12. How do I rebuild the assembly index cache after NCBI adds new genomes?

Pass refresh_cache=True to ingest() or use --refresh-cache on the CLI. This forces deletion and re-download of both assembly summary files from NCBI FTP before processing begins, regardless of how recently they were last downloaded:

biometaharmonizer run \
    --input assemblies.txt \
    --email your@email.com \
    --output harmonized.csv \
    --refresh-cache

The files (~100 MB each) are stored in CACHE_DIR (~/.biometaharmonizer/cache/ by default).

13. When and how do I re-run build_dictionaries.py to refresh the One Health term dictionary?

Re-run scripts/build_dictionaries.py when:

  • NCBI taxonomy is updated and new host/organism names are needed.

  • New OLS ontology versions are released.

  • You add new hand-curated entries to the base one_health_dictionaries.json and need to propagate collision detection.

  • New One Health categories are required.

To rebuild in place (overwrites the bundled file):

python scripts/build_dictionaries.py \
    --base   src/biometaharmonizer/schemas/one_health_dictionaries.json \
    --output src/biometaharmonizer/schemas/one_health_dictionaries.json

The base dictionary is loaded first; hand-curated entries are never overwritten (base_wins strategy).
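The base_wins strategy amounts to letting the base dictionary override generated entries on key collisions; a minimal sketch (merge_base_wins and the term-to-category mappings are invented for the demo):

```python
def merge_base_wins(base, generated):
    """Merge two term dictionaries; hand-curated base entries always win."""
    merged = dict(generated)
    merged.update(base)  # base entries override on collision
    return merged

base = {"homo sapiens": "Human"}                              # hand-curated
generated = {"homo sapiens": "Animal", "bos taurus": "Animal"}  # auto-generated
print(merge_base_wins(base, generated))  # {'homo sapiens': 'Human', 'bos taurus': 'Animal'}
```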

14. How do I use build_dictionaries.py with a pre-downloaded taxdmp.zip?

To avoid the ~65 MB automatic download from NCBI FTP, download taxdmp.zip once and pass its local path via --taxdmp:

# Download once:
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip

# Use the local file on subsequent runs:
python scripts/build_dictionaries.py \
    --taxdmp /path/to/taxdmp.zip \
    --output src/biometaharmonizer/schemas/one_health_dictionaries.json

You may also pass the path to an already-extracted directory that contains names.dmp and nodes.dmp:

python scripts/build_dictionaries.py \
    --taxdmp /path/to/extracted_taxdmp/ \
    --output src/biometaharmonizer/schemas/one_health_dictionaries.json