FAQ
1. How do I register and use an NCBI API key?
Create a free NCBI account at https://www.ncbi.nlm.nih.gov/account/, navigate
to Settings → API Key Management, and click Create an API Key. Copy
the generated string and pass it to ingest()
via the api_key argument, or register it globally with
set_api_key():
import biometaharmonizer as bmh
bmh.set_api_key("YOUR_KEY")
df = bmh.ingest(source="ids.txt", email="your@email.com")
Without a key, NCBI allows 3 requests/second (inter_batch_sleep = 0.34 s);
with a key the limit rises to 10 requests/second (inter_batch_sleep = 0.12 s).
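If you prefer not to set the key globally, it can also be supplied per call through the api_key argument described above (a minimal sketch; only the keyword names mentioned in this answer are assumed):
import biometaharmonizer as bmh
df = bmh.ingest(source="ids.txt", email="your@email.com", api_key="YOUR_KEY")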
2. What happens when an accession is suppressed or withdrawn?
NCBI returns no <BioSample> XML element for suppressed, withdrawn, or
otherwise invalid accessions. The record is silently absent from the output
DataFrame. When assembly accessions cannot be resolved through either the
local index or the Entrez elink fallback, a WARNING is logged:
WARNING biometaharmonizer.ingestion:
NOT resolved (suppressed/invalid after both passes): 2
Unresolved: ['GCF_000000001.1', 'SAMN00000000']
When this happens, the output DataFrame has fewer rows than the input had IDs;
compare len(df) with your input size to detect suppressed accessions, as in the
sketch below.
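To pinpoint which accessions were dropped, diff the input IDs against the output (a sketch assuming BioSample-accession inputs, a one-ID-per-line input file, and the biosample_accession column used in the antibiogram example later in this FAQ):
with open("ids.txt") as fh:
    input_ids = {line.strip() for line in fh if line.strip()}
# IDs that produced no row in the output DataFrame
missing = input_ids - set(df["biosample_accession"])
print(f"{len(missing)} unresolved: {sorted(missing)}")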
3. How do I handle the rate-limit error HTTP 429?
An HTTP 429 response from NCBI means too many requests were sent in the
last second. BioMetaHarmonizer’s retry logic (_MAX_RETRIES = 3) handles
transient 429s automatically using exponential backoff:
wait = min(2^n, 30) seconds per attempt. If 429 errors persist:
Ensure you have registered and passed an NCBI API key.
Reduce fetch_batch_size (e.g. to 100) to send smaller requests (see the sketch below).
Avoid running multiple concurrent instances against the same API key.
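A minimal sketch combining the first two remedies (argument names as described elsewhere in this FAQ):
import biometaharmonizer as bmh
df = bmh.ingest(
    source="ids.txt",
    email="your@email.com",
    api_key="YOUR_KEY",       # lifts the limit to 10 requests/second
    fetch_batch_size=100,     # smaller efetch payloads
)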
4. What does refresh_cache=True do and when should I use it?
When refresh_cache=True is set, the assembly summary flat files in
CACHE_DIR (assembly_summary_refseq.txt and
assembly_summary_genbank.txt) are deleted and re-downloaded
unconditionally — regardless of their age — before the run begins. Use this
flag when NCBI has added new assemblies since your last run and you need the
latest assembly→BioSample mappings:
df = bmh.ingest(source="new_assemblies.txt",
                email="your@email.com",
                refresh_cache=True)
Without this flag, cached files are used for up to 7 days (_CACHE_TTL_DAYS).
5. Why do some records have None in most columns?
Records with many None values fall into two categories:
Minimal BioSample records: the submitter only provided the mandatory fields. For Pathogen packages this is common; for environmental packages MIxS fields dominate while clinical fields are absent.
Null normalization: BioMetaHarmonizer applies _normalize_null() to every attribute value. Any value matching _NULL_PATTERNS (e.g. "missing", "N/A", "not provided", "unknown") is stored as None. This is intentional: downstream tools can use df.notna() to compute genuine fill rates.
Use write_summary() to generate a fill-rate
report and identify which columns are sparsely populated for your dataset.
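For a quick ad-hoc check without write_summary(), plain pandas works too (nothing package-specific is assumed here):
fill_rates = df.notna().mean().sort_values()
print(fill_rates.head(10))  # the ten sparsest columns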
6. How do I use BioMetaHarmonizer in Google Colab?
Override the cache directory to a location inside /content so that the
~200 MB assembly index files are stored on the Colab writable filesystem
(or in a mounted Google Drive):
!pip install biometaharmonizer
import biometaharmonizer as bmh
bmh.set_cache_dir("/content/bmh_cache") # or "/content/drive/MyDrive/bmh_cache"
bmh.set_email("your@email.com")
bmh.set_api_key("YOUR_KEY")
df = bmh.ingest(source=["SAMN02436525", "SAMN02434874"])
df.to_csv("harmonized.csv", index=False)
The module docstring of ingestion.py contains this exact note under
Working directory note (Colab).
7. What is the difference between fetch_batch_size and esearch_batch_size?
fetch_batch_size (default 200) controls how many BioSample records are retrieved per efetch HTTP request. Each efetch call returns raw XML for N records; larger batches mean fewer round trips but larger per-response payloads.
esearch_batch_size (module constant _ESEARCH_BATCH = 100) controls how many accession strings are assembled into a single esearch term (e.g. "SAMN01[Accession] OR SAMN02[Accession] OR ...") before a server-side History slot is created. It is also used in the Entrez elink fallback path for resolving GCF/GCA accessions that are not found in the local index.
In the CLI, --esearch-batch-size defaults to 200 (overriding the module
constant). The Python API default is 100.
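Both knobs can be tuned in one call (a sketch; that esearch_batch_size is accepted as an ingest() keyword is inferred from the Python API default mentioned above, not confirmed elsewhere in this FAQ):
import biometaharmonizer as bmh
df = bmh.ingest(
    source="ids.txt",
    email="your@email.com",
    fetch_batch_size=200,     # records per efetch request
    esearch_batch_size=200,   # accessions per esearch term
)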
8. How do I access antibiogram data from the output DataFrame?
Antibiogram data is stored as a JSON list inside the _extra_attributes
column. To extract it:
import json
import pandas as pd

def get_antibiogram(row):
    # Records without extra attributes carry NaN in _extra_attributes.
    if pd.isna(row["_extra_attributes"]):
        return []
    ea = json.loads(row["_extra_attributes"])
    return ea.get("antibiogram", [])

# Flatten into one row per (sample, antibiogram entry) pair.
abg_rows = []
for _, row in df.iterrows():
    for entry in get_antibiogram(row):
        entry["biosample_accession"] = row["biosample_accession"]
        abg_rows.append(entry)

abg_df = pd.DataFrame(abg_rows)
See Antibiogram for a complete walkthrough.
9. How do I filter only records with a known one_health_category?
The one_health_category column is always a string (never NaN). Records
that could not be classified receive the string "Unclassified". To filter
for classified records:
classified = df[df["one_health_category"] != "Unclassified"]
To filter for a specific category:
human = df[df["one_health_category"] == "Human"]
food = df[df["one_health_category"] == "Food"]
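To see how the categories are distributed before filtering (standard pandas, nothing package-specific):
print(df["one_health_category"].value_counts())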
10. Why does _extra_attributes contain pipe-separated values for some keys?
NCBI BioSample allows submitters to include multiple <Attribute> elements
with the same attribute_name on a single record (a repeatable-attribute
pattern used for AMR fields, multiple culture collection numbers, etc.).
When this occurs, the ingestion parser joins them with | before storing
in extras[key]. This ensures no data is lost while keeping the JSON
representation compact. To split them back:
import json

ea = json.loads(row["_extra_attributes"])
for key, val in ea.items():
    if isinstance(val, str) and "|" in val:
        # Recover the individual submitted values for this key.
        ea[key] = val.split("|")
11. How do I input assembly accessions vs. BioSample accessions?
BioMetaHarmonizer automatically detects the accession type by prefix:
SAMN, SAME, SAMD → BioSample accessions (direct fetch)
GCF_, GCA_ → Assembly accessions (two-step resolution)
You can mix both types in the same input file or list. Unrecognized IDs (wrong prefix) are logged at WARNING level and skipped. See Ingestion for the full two-step resolution process for assembly accessions.
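A mixed call looks like this (a sketch; SAMN02436525 is the BioSample used in the Colab example above, and the GCF_ accession stands in for any valid RefSeq assembly):
df = bmh.ingest(source=["SAMN02436525", "GCF_000005845.2"],
                email="your@email.com")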
12. How do I rebuild the assembly index cache after NCBI adds new genomes?
Pass refresh_cache=True to ingest() or use --refresh-cache on the
CLI. This forces deletion and re-download of both assembly summary files
from NCBI FTP before processing begins, regardless of how recently they were
last downloaded:
biometaharmonizer run \
--input assemblies.txt \
--email your@email.com \
--output harmonized.csv \
--refresh-cache
The files (~100 MB each) are stored in CACHE_DIR
(~/.biometaharmonizer/cache/ by default).
13. When and how do I re-run build_dictionaries.py to refresh the One Health term dictionary?
Re-run scripts/build_dictionaries.py when:
NCBI taxonomy is updated and new host/organism names are needed.
New OLS ontology versions are released.
You add new hand-curated entries to the base one_health_dictionaries.json and need to propagate collision detection.
New One Health categories are required.
To rebuild in place (overwrites the bundled file):
python scripts/build_dictionaries.py \
--base src/biometaharmonizer/schemas/one_health_dictionaries.json \
--output src/biometaharmonizer/schemas/one_health_dictionaries.json
The base dictionary is loaded first; hand-curated entries are never overwritten
(base_wins strategy).
14. How do I use build_dictionaries.py with a pre-downloaded taxdmp.zip?
To avoid the ~65 MB automatic download from NCBI FTP, download
taxdmp.zip once and pass its local path via --taxdmp:
# Download once:
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
# Use the local file on subsequent runs:
python scripts/build_dictionaries.py \
--taxdmp /path/to/taxdmp.zip \
--output src/biometaharmonizer/schemas/one_health_dictionaries.json
You may also pass the path to an already-extracted directory that contains
names.dmp and nodes.dmp:
python scripts/build_dictionaries.py \
--taxdmp /path/to/extracted_taxdmp/ \
--output src/biometaharmonizer/schemas/one_health_dictionaries.json