Schema Reference
Every call to ingest() returns a DataFrame
with exactly 51 columns in the order defined by _load_final_schema().
Columns are initialized to None/NaN for records that do not carry the
corresponding attribute; the One Health tier column is the exception and is
always a string.
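
As a quick sanity check on this contract, a caller could compare the columns of a returned DataFrame against the expected schema. The sketch below is only an illustration: `check_schema` and the `expected` list are hypothetical names that are not part of the library, and the helper works on plain column names so that it does not depend on how `ingest()` is called.

```python
from typing import Iterable, List, Sequence


def check_schema(columns: Iterable[str], expected: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the columns match
    the expected schema exactly, including order.

    Hypothetical helper: `expected` must be supplied by the caller.
    """
    actual = list(columns)
    problems: List[str] = []

    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]

    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    # Same set of names but a different sequence (reordered or duplicated).
    if not missing and not extra and actual != list(expected):
        problems.append("column sequence differs from the expected order")
    return problems
```

With a DataFrame `df` in hand, the call would look like `check_schema(df.columns, expected)`; an empty result indicates the column names and their order match.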
Output Columns
| Column | Type | Source | Description | Example |
|---|---|---|---|---|
| | str/NaN | BioSample XML | Primary INSDC BioSample accession. | |
| | str/NaN | BioSample XML | NCBI internal numeric BioSample ID. | |
| | str/NaN | | Linked SRA experiment/run accession. | |
| | str/NaN | | BioProject accession; back-filled from assembly index. | |
| | str/NaN | Assembly index (RefSeq) | RefSeq assembly accession for this BioSample. | |
| | str/NaN | Assembly index (GenBank) | GenBank assembly accession for this BioSample. | |
| | str/NaN | | Submitter-assigned sample name/ID. | |
| | str/NaN | | NCBI Taxonomy ID. | |
| | str/NaN | | NCBI Taxonomy name (species-level label). | |
| | str/NaN | | Organism name as submitted; falls back to | |
| | str/NaN | Attribute + DateEngine | ISO 8601 point collection date (YYYY, YYYY-MM, YYYY-MM-DD). | |
| | str/NaN | Attribute + DateEngine | Verbatim original date string for range/approximate inputs. | |
| | str/NaN | Attribute | Original | |
| | str/NaN | Attribute | Latitude/longitude as submitted (free-text string). | |
| | str/NaN | GeoEngine | Normalised country display name. | |
| | str/NaN | GeoEngine | Sub-national region as submitted. | |
| | str/NaN | GeoEngine | Locality or sub-region as submitted. | |
| | str/NaN | GeoEngine + pycountry | ISO 3166-1 alpha-2 code; | |
| | str/NaN | GeoEngine | Ocean or sea name for marine samples. | |
| | str/NaN | GeoEngine | Original string for coordinate-only entries; NaN otherwise. | |
| | str/NaN | Attribute | Host organism as submitted. | |
| | str/NaN | Attribute | Disease of the host. | |
| | str/NaN | Attribute | Age of the host at time of sampling. | |
| | str/NaN | Attribute | Biological sex of the host. | |
| | str/NaN | Attribute | Tissue or body site sampled. | |
| | str/NaN | Attribute | Physical, chemical, or biological material of sample. | |
| | str/NaN | Attribute | Type of sample (e.g. clinical, environmental). | |
| | str | OneHealthClassifier | One Health tier. Always a string; never NaN. Possible values: Human, Animal, Aquatic, Wildlife, Plant, Food, Environmental, Lab, Unclassified. | |
| | str/NaN | Attribute | Isolate identifier. | |
| | str/NaN | Attribute | Strain designation. | |
| | str/NaN | Attribute | Sub-strain designation. | |
| | str/NaN | Attribute | Serotype (antigen type). | |
| | str/NaN | Attribute | Serovar designation. | |
| | str/NaN | Attribute | Genotype classification. | |
| | str/NaN | Attribute | Culture collection number/ID. | |
| | str/NaN | Attribute | Outbreak identifier or name. | |
| | str/NaN | Attribute (MIxS) | Broad-scale environmental context (MIxS field). | |
| | str/NaN | Attribute (MIxS) | Local environmental context (MIxS field). | |
| | str/NaN | Attribute (MIxS) | Environmental medium (MIxS field). | |
| | str/NaN | Attribute | Sequencing platform or technology. | |
| | str/NaN | Attribute | Assembly software/method. | |
| | str/NaN | Attribute | Name of person/institution that collected the sample. | |
| | str/NaN | | NCBI BioSample package name. | |
| | str/NaN | BioSample XML | ISO 8601 date when the record was submitted. | |
| | str/NaN | BioSample XML | ISO 8601 date of last record update. | |
| | str/NaN | BioSample XML | ISO 8601 date when the record was made public. | |
| | str/NaN | BioSample XML | Access level (e.g. | |
| | str/NaN | | Record status (e.g. | |
| | str/NaN | | ISO 8601 date of the most recent status change. | |
| | str/NaN | | BioSample title as submitted. | |
| | str/NaN | | Free-text comment paragraph from the BioSample record. | |
| _extra_attributes | str/NaN | Overflow attributes | JSON string containing all attributes that did not resolve to a known final output column. | See below |
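
The collection date column in the table above is described as an ISO 8601 point date at one of three precisions (YYYY, YYYY-MM, YYYY-MM-DD). A consumer that needs to know which precision a given value carries could classify it roughly as follows. This is a sketch under assumptions: `date_precision` is a hypothetical helper, not part of the library, and it treats anything outside those three forms (including None and NaN) as having no precision.

```python
import re
from datetime import date
from typing import Optional

# The three documented shapes of a point collection date.
_PATTERNS = {
    "year": re.compile(r"\d{4}"),
    "month": re.compile(r"\d{4}-\d{2}"),
    "day": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def date_precision(value: object) -> Optional[str]:
    """Return 'year', 'month' or 'day' for a valid point date, else None."""
    if not isinstance(value, str):
        return None  # covers None and float NaN
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(value):
            # Pad missing month/day with 1 and let datetime.date reject
            # impossible values such as month 13 or 30 February.
            parts = [int(p) for p in value.split("-")]
            parts += [1] * (3 - len(parts))
            try:
                date(*parts)
            except ValueError:
                return None
            return name
    return None
```

Keeping the value as a string and deriving the precision separately avoids silently promoting a year-only date to 1 January of that year.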
_extra_attributes
_extra_attributes is a JSON-serialized dict. It captures all attribute
key–value pairs from the BioSample XML that do not map to any of the 50 named
schema columns via the synonym lookup.
JSON structure:

    {
      "antibiogram": [
        {
          "antibiotic_name": "ampicillin",
          "resistance_phenotype": "susceptible",
          "measurement_sign": "<=",
          "measurement": "8",
          "measurement_units": "mg/L",
          "laboratory_typing_method": "MIC",
          "testing_standard": "CLSI"
        }
      ],
      "panel_id": "TREKAMRO",
      "submission_contact": "John Smith",
      "submission_owner": "University Hospital Lab"
    }

Known sub-keys:
| Sub-key | Value type | Description |
|---|---|---|
| antibiogram | list of dicts | Antibiogram rows; one dict per antibiotic. See Antibiogram for full details. |
| panel_id | str | AMR panel identifier from NCBI Pathogen records. |
| submission_contact | str | Submitter contact name/email. |
| submission_owner | str | Submitting organization name. |
| (other attribute keys) | str | Any other attribute that did not resolve to a named schema column. Multiple values for the same key are joined with a pipe separator. |

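
Because the column holds a JSON string rather than a dict, each cell has to be decoded before its sub-keys can be read. The following is a minimal sketch using only the standard library; `parse_extra_attributes` and `antibiogram_phenotypes` are hypothetical helper names, and the sketch assumes the antibiogram rows use the keys shown in the JSON structure above.

```python
import json
from typing import Any, Dict


def parse_extra_attributes(raw: Any) -> Dict[str, Any]:
    """Decode one _extra_attributes cell; return {} for None, NaN or ''."""
    if not isinstance(raw, str) or not raw.strip():
        return {}
    parsed = json.loads(raw)
    # Guard against a cell that decodes to something other than a dict.
    return parsed if isinstance(parsed, dict) else {}


def antibiogram_phenotypes(extra: Dict[str, Any]) -> Dict[str, str]:
    """Map antibiotic name -> resistance phenotype for one record."""
    result: Dict[str, str] = {}
    for row in extra.get("antibiogram", []):
        name = row.get("antibiotic_name")
        phenotype = row.get("resistance_phenotype")
        if name is not None and phenotype is not None:
            result[name] = phenotype
    return result
```

Returning an empty dict for missing cells lets callers chain `.get(...)` without special-casing NaN.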
Pipe-separated values: When NCBI XML contains multiple <Attribute>
elements with the same key on a single BioSample record, the values are
concatenated with a | pipe separator inside the JSON string. This is
an intentional design decision to preserve all submitted values without
data loss.
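
To recover the individual values from such a pipe-joined string, a consumer can split on the separator. The helper below is a hypothetical sketch, not part of the library. It assumes the separator is a bare `|` with no escaping, so a submitted value that itself contains a pipe cannot be told apart from two separate values, and it strips surrounding whitespace because the exact spacing around the separator is not specified here.

```python
from typing import List


def split_multi_value(value: str, sep: str = "|") -> List[str]:
    """Split a pipe-joined attribute value into its individual parts.

    Whitespace around each part is trimmed and empty parts are dropped.
    """
    return [part.strip() for part in value.split(sep) if part.strip()]
```

A value without any separator comes back as a single-element list, so callers can treat single and multiple values uniformly.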