Ingestion
This page documents the behavior of biometaharmonizer.ingestion in
full, derived entirely from its source code.
Overview
ingest() is the single entry point for all
data retrieval. It accepts a list of NCBI accessions (or a path to a text file
containing them), fetches the corresponding BioSample XML records from NCBI
Entrez, parses and harmonizes each record, and returns a
pandas.DataFrame conforming to the fixed 51-column output schema
defined in _load_final_schema().
ID Classification
Before any network request, _classify_ids() partitions the input into three
buckets by prefix:
- Assembly accessions (GCF_, GCA_): require two-step resolution to BioSample accessions.
- BioSample accessions (SAMN, SAME, SAMD): fetched directly.
- Unrecognized IDs: logged at WARNING level and skipped entirely; they produce no rows in the output.
The classification is case-insensitive and checked via startswith.
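The prefix check described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `classify_ids`, the two prefix tuples, and the choice to return the stripped original token are assumptions made for the example.

```python
from typing import Iterable

# Prefixes taken from the description above; the constant names are hypothetical.
ASSEMBLY_PREFIXES = ("GCF_", "GCA_")
BIOSAMPLE_PREFIXES = ("SAMN", "SAME", "SAMD")


def classify_ids(ids: Iterable[str]) -> tuple[list[str], list[str], list[str]]:
    """Split IDs into (assembly, biosample, unrecognized) by prefix."""
    assembly, biosample, unrecognized = [], [], []
    for raw in ids:
        token = raw.strip()
        upper = token.upper()  # upper-case copy makes the check case-insensitive
        if upper.startswith(ASSEMBLY_PREFIXES):
            assembly.append(token)
        elif upper.startswith(BIOSAMPLE_PREFIXES):
            biosample.append(token)
        else:
            unrecognized.append(token)  # the real code logs these at WARNING
    return assembly, biosample, unrecognized
```

Passing a tuple to `str.startswith` tests all prefixes in one call, which keeps each bucket to a single condition.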
Assembly Accession Resolution
Assembly accessions cannot be fetched directly from the biosample Entrez
database. They are resolved to BioSample accessions via two sequential passes:
Pass 1 — Local assembly index:
The function _ensure_assembly_summaries() downloads and caches two NCBI
assembly summary flat files on first use:
- assembly_summary_refseq.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
- assembly_summary_genbank.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
Both are stored in CACHE_DIR (default: ~/.biometaharmonizer/cache/).
The TTL is 7 days (_CACHE_TTL_DAYS). Files older than this are
deleted and re-downloaded automatically. Setting refresh_cache=True
forces deletion and re-download regardless of age.
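The staleness rule can be illustrated with a small helper. This is a sketch under assumptions: the name `cache_is_stale` is hypothetical, and it assumes file age is judged by modification time, which the text above does not state.

```python
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_DAYS = 7  # mirrors the documented _CACHE_TTL_DAYS value


def cache_is_stale(
    path: Path,
    ttl_days: int = CACHE_TTL_DAYS,
    refresh_cache: bool = False,
    now: Optional[float] = None,
) -> bool:
    """Return True when the cached file should be (re-)downloaded."""
    if refresh_cache or not path.exists():
        return True  # forced refresh, or nothing cached yet
    now = time.time() if now is None else now
    age_seconds = now - path.stat().st_mtime  # assumption: age from mtime
    return age_seconds > ttl_days * 86400
```

The `now` parameter exists only so the rule can be exercised without waiting seven days.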
Once cached, _resolve_assembly_to_biosample() searches both files for
matching assembly_accession values and extracts the corresponding
biosample column value.
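A lookup of this kind might look like the sketch below. It is not the body of _resolve_assembly_to_biosample(): the function name `resolve_from_summary` is hypothetical, the sample rows are fabricated for illustration, and exact matching on the versioned accession is an assumption. The only file-format detail relied on is that the summary files are tab-separated and name their columns on a comment line beginning with `#`.

```python
import csv
import io

# Fabricated three-column excerpt; real files carry many more columns.
SAMPLE = (
    "## See README_assembly_summary.txt\n"
    "#assembly_accession\tbioproject\tbiosample\n"
    "GCF_000000001.1\tPRJNA1\tSAMN00000001\n"
    "GCF_000000002.1\tPRJNA2\tSAMN00000002\n"
)


def resolve_from_summary(text: str, wanted: set[str]) -> dict[str, str]:
    """Map each wanted assembly accession to its biosample column value."""
    header = None
    resolved: dict[str, str] = {}
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        if row[0].startswith("#"):
            # Comment line: adopt it as the header if it names the key column.
            candidate = [cell.lstrip("# ").strip() for cell in row]
            if "assembly_accession" in candidate:
                header = candidate
            continue
        if header is None:
            continue
        record = dict(zip(header, row))
        accession = record.get("assembly_accession")
        if accession in wanted and record.get("biosample"):
            resolved[accession] = record["biosample"]
    return resolved
```

Accessions absent from the returned dict are the ones that would be handed to the second pass.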
Pass 2 — Entrez elink fallback:
Any assembly accession not found in the local index is forwarded to
_resolve_gcx_via_entrez(), which performs:
esearch(db="biosample",
term="GCF_XXXXXX.1[Accession] OR ...",
usehistory="n")
esummary(db="biosample", id=<UIDs from esearch>)
The Accession field from each DocumentSummary element is extracted
and mapped back to the original GCF/GCA input.
esearch + efetch Pipeline for BioSample Accessions
For native BioSample accessions the pipeline is:
esearch(db="biosample",
term="SAMN...[Accession] OR ...",
usehistory="y") <- per batch
efetch(db="biosample",
WebEnv=<env>,
query_key=<key>,
retstart=0,
retmax=fetch_batch_size,
rettype="full",
retmode="xml") <- one call per page
The usehistory="y" flag instructs NCBI to store the result set in a
named server-side History slot (WebEnv + query_key). Each
ingest() call creates a fresh History
slot per batch; this avoids the cross-batch accumulation bug where a single
query_key would incorrectly index only the final batch.
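The per-batch structure can be sketched without touching the network. The helpers below only plan the esearch terms, one per batch; each term would then get its own esearch call with usehistory="y" and its own WebEnv/query_key pair. The names `chunked`, `build_term`, and `plan_esearch_batches` are hypothetical.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_term(accessions: Sequence[str]) -> str:
    """Join accessions into one Entrez query term."""
    return " OR ".join(f"{acc}[Accession]" for acc in accessions)


def plan_esearch_batches(
    accessions: Sequence[str], esearch_batch_size: int = 100
) -> list[str]:
    """Return one query term per esearch batch."""
    return [build_term(batch) for batch in chunked(accessions, esearch_batch_size)]
```

Because every term is searched separately, each batch's efetch pages read from the History slot created for that batch alone.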
Batch Size Constants
| Constant | Default | Controls |
|---|---|---|
| fetch_batch_size | 200 | Records per efetch call |
| esearch_batch_size | 100 | Accessions per esearch call |
Higher fetch_batch_size values reduce HTTP round trips but produce larger
XML payloads per response and may increase per-request latency.
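The trade-off is easy to quantify: the number of efetch calls for one batch is the number of retstart offsets needed to cover its result set. The helper name `page_offsets` is hypothetical.

```python
def page_offsets(total_records: int, fetch_batch_size: int = 200) -> list[int]:
    """Return the retstart value for each efetch page."""
    return list(range(0, total_records, fetch_batch_size))
```

For 450 records the default of 200 gives three calls (offsets 0, 200, 400), while a page size of 500 would need only one.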
Rate Limits and Inter-Batch Sleep
NCBI enforces the following limits:
- Without API key: 3 requests/second (inter_batch_sleep = 0.34s)
- With API key: 10 requests/second (inter_batch_sleep = 0.12s)
The inter_batch_sleep variable is set inside _fetch_biosample_metadata
by checking the module-level ENTREZ_API_KEY. The sleep is applied after
every esearch batch and after every efetch call.
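The selection amounts to a single conditional. In this sketch the function name `choose_inter_batch_sleep` is hypothetical, and the key is passed as an argument rather than read from the module-level ENTREZ_API_KEY.

```python
from typing import Optional


def choose_inter_batch_sleep(api_key: Optional[str]) -> float:
    """Pick the pause between Entrez requests from the documented values."""
    return 0.12 if api_key else 0.34
```

Both values sit slightly above the theoretical minimum (1/3 s and 1/10 s), so a strictly sequential caller stays under the respective limit.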
Retry Logic
Every Entrez call is wrapped in a retry loop using the following constants:
| Constant | Value | Meaning |
|---|---|---|
| | 3 | Maximum number of attempts per call. |
| _RETRY_BASE_S | 2 | Base of the exponential backoff (seconds). |
| _RETRY_MAX_S | 30 | Maximum wait between retries (seconds). |
The wait duration for attempt n (1-indexed) is:
wait = min(_RETRY_BASE_S ** n, _RETRY_MAX_S)
= min(2^n, 30) seconds
So the sequence is: 2 s, 4 s, 8 s (capped at 30 s for large n).
Retries are triggered by instances of _TRANSIENT_EXCEPTIONS:
urllib.error.URLError, http.client.HTTPException, ConnectionError,
TimeoutError, and OSError.
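A retry loop with this backoff can be sketched as below. The names `MAX_ATTEMPTS`, `backoff_wait`, and `call_with_retry` are hypothetical, and re-raising after the final attempt without sleeping is an assumption; the text above does not say what happens once attempts are exhausted.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

RETRY_BASE_S = 2    # documented value of _RETRY_BASE_S
RETRY_MAX_S = 30    # documented value of _RETRY_MAX_S
MAX_ATTEMPTS = 3    # documented attempt limit; constant name is hypothetical


def backoff_wait(attempt: int) -> int:
    """Wait in seconds after a failed attempt (1-indexed)."""
    return min(RETRY_BASE_S ** attempt, RETRY_MAX_S)


def call_with_retry(
    func: Callable[[], T],
    transient: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on transient errors with capped exponential backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return func()
        except transient:
            if attempt == MAX_ATTEMPTS:
                raise  # assumption: give up after the last attempt
            sleep(backoff_wait(attempt))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

The sketch catches only OSError because urllib.error.URLError, ConnectionError, and TimeoutError are all subclasses of it; http.client.HTTPException is not and would have to be added to the tuple.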
Null Normalization
Every attribute value parsed from NCBI XML is passed through
_normalize_null(value). A value is normalized to None if it is:
- Literally None or a pandas NA.
- Empty string or whitespace-only.
- Matched by the _NULL_PATTERNS regex (case-insensitive):
  - Dash/dot placeholders: -, --, ., ...
  - Explicit nulls: n/a, na, nd, nr, ns, nt, none, null, nil
  - Missing variants: missing, misssing, missng, mising
  - Unknown variants: unknown, unkown, unknwon, unknow
  - “Not X” phrases: not provided, not collected, not applicable, not available, not determined, not recorded, not reported, not known, not given, not stated, not specified, not done, not tested, not sequenced, not typed
  - Other descriptors: unavailable, unspecified, undetermined, unidentified
  - Restricted/access terms: restricted, restricted access, withheld, confidential
  - Placeholder abbreviations: tbd, tba
  - Prefixed null phrases: any string matching missing:.*, not applicable:.*, or data agreement established pre-2023
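The normalization can be approximated with a single anchored regex. The sketch below covers only a subset of the patterns listed above, omits the pandas NA check to stay dependency-free, and returns non-null values unchanged; all three choices are simplifications for illustration, and the names `_NULL_RE` and `normalize_null` are hypothetical.

```python
import re
from typing import Any

# Subset of the documented patterns, anchored so only whole values match.
_NULL_RE = re.compile(
    r"^(?:"
    r"-{1,2}|\.|\.\.\."                              # dash/dot placeholders
    r"|n/?a|none|null|nil"                           # explicit nulls
    r"|missing|unknown"                              # common words
    r"|not (?:provided|collected|applicable|available)"
    r"|tbd|tba"                                      # placeholder abbreviations
    r"|missing:.*|not applicable:.*"                 # prefixed null phrases
    r")$",
    re.IGNORECASE,
)


def normalize_null(value: Any) -> Any:
    """Return None for null-like values, otherwise the value unchanged."""
    if value is None:
        return None
    if not isinstance(value, str):
        return value
    text = value.strip()
    if not text or _NULL_RE.match(text):
        return None
    return value
```

Anchoring with `^` and `$` matters: without it, a legitimate value such as "unknown soil isolate" would be discarded because it merely contains a null word.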
Suppressed and Invalid Accessions
When a BioSample accession is suppressed, withdrawn, or otherwise invalid,
NCBI returns no <BioSample> element for it in the efetch XML response. The
accession simply produces no row in the output. At the end of ingest(),
the number of records returned is logged; discrepancies between input count
and output row count indicate suppressed or invalid accessions:
WARNING biometaharmonizer.ingestion:
NOT resolved (suppressed/invalid after both passes): 2
Unresolved: ['GCF_000000001.1', 'SAMN00000000']
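The same discrepancy can be checked by the caller with an order-preserving difference between what was requested and what came back. The helper name `find_unresolved` is hypothetical, and the sketch assumes both sides are BioSample accessions (that is, assembly accessions have already been resolved).

```python
from typing import Iterable


def find_unresolved(requested: Iterable[str], returned: Iterable[str]) -> list[str]:
    """List requested accessions with no matching output row, in input order."""
    returned_upper = {acc.upper() for acc in returned}
    return [acc for acc in requested if acc.upper() not in returned_upper]
```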
Ingest Summary Log
At the end of every successful ingest() call the following lines are
emitted at INFO level:
============================================================
INGEST SUMMARY
Input IDs provided : <N>
Assembly accessions : <n_gcx> (only if GCF/GCA input)
Resolved via local index : <n>
Resolved via Entrez elink (fallback) : <n>
fetch_batch_size : <N>
esearch_batch_size : <N>
Records in output : <N>
bioproject_accession filled : <n> / <N>
assembly_accession_refseq filled : <n> / <N>
assembly_accession_genbank filled : <n> / <N>
============================================================
BioProject and Assembly Back-Fill
After fetching BioSample XML, ingest() performs a second pass against the
local assembly index to back-fill bioproject_accession,
assembly_accession_refseq, and assembly_accession_genbank columns for
records whose BioSample accession appears in the index. This is logged at INFO
level with the count of records updated.
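The back-fill can be pictured as a keyed lookup over the fetched records. The sketch uses plain dicts instead of a DataFrame; the key column name `biosample_accession`, the function name `back_fill`, and the rule that only empty cells are filled (existing values are never overwritten) are assumptions.

```python
from typing import Optional

FILL_COLUMNS = (
    "bioproject_accession",
    "assembly_accession_refseq",
    "assembly_accession_genbank",
)


def back_fill(
    records: list[dict[str, Optional[str]]],
    index: dict[str, dict[str, str]],
) -> int:
    """Fill empty columns from the index in place; return records updated."""
    updated = 0
    for record in records:
        hit = index.get(record.get("biosample_accession") or "")
        if not hit:
            continue
        changed = False
        for column in FILL_COLUMNS:
            # Assumption: only missing values are filled.
            if record.get(column) is None and hit.get(column):
                record[column] = hit[column]
                changed = True
        updated += int(changed)
    return updated
```

The return value corresponds to the count of updated records that the text says is logged at INFO level.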
_extra_attributes
Any BioSample attribute whose harmonized_name (or attribute_name) does
not resolve to a known final output column via the synonym lookup is stored in
the _extra_attributes column as a JSON-serialized dict. When the same key
appears multiple times on a single record (a valid NCBI pattern for repeatable
attributes), the values are joined with a | pipe separator. The antibiogram
list (if present) is stored under the "antibiogram" key as a native Python
list, not a JSON string, before the single json.dumps(extras) call
serializes the whole dict.
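The collection rules above can be sketched as follows. The function name `build_extras`, the `(name, value)` pair representation of attributes, and the `known_columns` set are hypothetical stand-ins for the synonym lookup.

```python
import json
from typing import Iterable, Optional


def build_extras(
    attributes: Iterable[tuple[str, str]],
    known_columns: set[str],
    antibiogram: Optional[list] = None,
) -> str:
    """Collect unmapped attributes and serialize them with one json.dumps call."""
    extras: dict = {}
    for name, value in attributes:
        if name in known_columns:
            continue  # mapped attributes go to their own output column
        if name in extras:
            extras[name] = f"{extras[name]}|{value}"  # repeated key: pipe-join
        else:
            extras[name] = value
    if antibiogram:
        extras["antibiogram"] = antibiogram  # stays a list until serialization
    return json.dumps(extras)
```

Keeping the antibiogram as a list until the final `json.dumps` avoids double encoding: after `json.loads`, the key yields a list rather than a string that needs a second parse.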