Schema Reference

Every call to ingest() returns a DataFrame with exactly 51 columns in the order defined by _load_final_schema(). Columns are initialized to None/NaN for records that do not carry the corresponding attribute; the one exception is one_health_category, which is always a string.

Output Columns

Output column schema

| Column | Type | Source | Description | Example |
|---|---|---|---|---|
| biosample_accession | str/NaN | BioSample XML @accession | Primary INSDC BioSample accession. | SAMN02436525 |
| biosample_id | str/NaN | BioSample XML @id | NCBI internal numeric BioSample ID. | 2436525 |
| sra_accession | str/NaN | <Id db="SRA"> | Linked SRA experiment/run accession. | SRR1234567 |
| bioproject_accession | str/NaN | <Id db="BioProject"> / assembly index | BioProject accession; back-filled from assembly index. | PRJNA123456 |
| assembly_accession_refseq | str/NaN | Assembly index (RefSeq) | RefSeq assembly accession for this BioSample. | GCF_000009045.1 |
| assembly_accession_genbank | str/NaN | Assembly index (GenBank) | GenBank assembly accession for this BioSample. | GCA_000009045.1 |
| sample_name_id | str/NaN | <Id db_label="Sample name"> | Submitter-assigned sample name/ID. | KP-2021-001 |
| taxonomy_id | str/NaN | <Organism @taxonomy_id> | NCBI Taxonomy ID. | 573 |
| taxonomy_name | str/NaN | <Organism @taxonomy_name> | NCBI Taxonomy name (species-level label). | Klebsiella pneumoniae |
| organism_name | str/NaN | <OrganismName> or fallback | Organism name as submitted; falls back to taxonomy_name. | K. pneumoniae subsp. |
| collection_date | str/NaN | Attribute + DateEngine | ISO 8601 point collection date (YYYY, YYYY-MM, YYYY-MM-DD). | 2021-06-15 |
| collection_date_range | str/NaN | Attribute + DateEngine | Verbatim original date string for range/approximate inputs. | 2020-01/2020-06 |
| geo_loc_name | str/NaN | Attribute | Original geo_loc_name as submitted to NCBI. | Russia: Novosibirsk |
| lat_lon | str/NaN | Attribute | Latitude/longitude as submitted (free-text string). | 56.0153 N 92.8932 E |
| geo_country | str/NaN | GeoEngine | Normalised country display name. | Russia |
| geo_region | str/NaN | GeoEngine | Sub-national region as submitted. | Novosibirsk Oblast |
| geo_locality | str/NaN | GeoEngine | Locality or sub-region as submitted. | Akademgorodok |
| geo_iso3166 | str/NaN | GeoEngine + pycountry | ISO 3166-1 alpha-2 code; "HISTORICAL" for defunct countries. | RU |
| geo_sea_ocean | str/NaN | GeoEngine | Ocean or sea name for marine samples. | Pacific Ocean |
| geo_loc_raw | str/NaN | GeoEngine | Original string for coordinate-only entries; NaN otherwise. | 45.3 N, 30.1 E |
| host | str/NaN | Attribute | Host organism as submitted. | Homo sapiens |
| host_disease | str/NaN | Attribute | Disease of the host. | pneumonia |
| host_age | str/NaN | Attribute | Age of the host at time of sampling. | 45 |
| host_sex | str/NaN | Attribute | Biological sex of the host. | male |
| host_tissue_sampled | str/NaN | Attribute | Tissue or body site sampled. | lung |
| isolation_source | str/NaN | Attribute | Physical, chemical, or biological material of sample. | blood |
| sample_type | str/NaN | Attribute | Type of sample (e.g. clinical, environmental). | clinical |
| one_health_category | str | OneHealthClassifier | One Health tier. Always a string; never NaN. Possible values: Human, Animal, Aquatic, Wildlife, Plant, Food, Environmental, Lab, Unclassified. | Human |
| isolate | str/NaN | Attribute | Isolate identifier. | KP-2021-001 |
| strain | str/NaN | Attribute | Strain designation. | ATCC 700603 |
| sub_strain | str/NaN | Attribute | Sub-strain designation. | variant-A |
| serotype | str/NaN | Attribute | Serotype (antigen type). | O1:K1 |
| serovar | str/NaN | Attribute | Serovar designation. | Typhimurium |
| genotype | str/NaN | Attribute | Genotype classification. | ST258 |
| culture_collection | str/NaN | Attribute | Culture collection number/ID. | ATCC:700603 |
| outbreak | str/NaN | Attribute | Outbreak identifier or name. | 2011 Germany HUS |
| env_broad_scale | str/NaN | Attribute (MIxS) | Broad-scale environmental context (MIxS field). | grassland biome |
| env_local_scale | str/NaN | Attribute (MIxS) | Local environmental context (MIxS field). | pasture |
| env_medium | str/NaN | Attribute (MIxS) | Environmental medium (MIxS field). | soil |
| sequencing_method | str/NaN | Attribute | Sequencing platform or technology. | Illumina HiSeq 2500 |
| assembly_method | str/NaN | Attribute | Assembly software/method. | SPAdes v3.15 |
| collected_by | str/NaN | Attribute | Name of person/institution that collected the sample. | CDC |
| ncbi_package | str/NaN | <Package> element | NCBI BioSample package name. | Pathogen.cl.1.0 |
| submission_date | str/NaN | BioSample XML @submission_date | ISO 8601 date when the record was submitted. | 2021-09-01 |
| last_update | str/NaN | BioSample XML @last_update | ISO 8601 date of last record update. | 2022-03-15 |
| publication_date | str/NaN | BioSample XML @publication_date | ISO 8601 date when the record was made public. | 2021-09-05 |
| access | str/NaN | BioSample XML @access | Access level (e.g. "public"). | public |
| status | str/NaN | <Status @status> | Record status (e.g. "live", "suppressed"). | live |
| status_date | str/NaN | <Status @when> | ISO 8601 date of the most recent status change. | 2021-09-05 |
| title | str/NaN | <Description/Title> | BioSample title as submitted. | K. pneumoniae isolate |
| description_comment | str/NaN | <Description/Comment/Paragraph> | Free-text comment paragraph from the BioSample record. | Hospital-acquired... |
| _extra_attributes | str/NaN | Overflow attributes | JSON string containing all attributes that did not resolve to a known final output column. | See below |
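Because collection_date can carry three precisions (YYYY, YYYY-MM, YYYY-MM-DD), downstream code often needs to know which one a given cell holds. The following is a minimal sketch, not part of the library: the helper name date_precision is hypothetical, and it assumes missing values arrive as None or float NaN, as the Type column suggests.

```python
import re

# One pattern per point-date precision listed for collection_date.
_PRECISIONS = (
    ("year", re.compile(r"\d{4}")),
    ("month", re.compile(r"\d{4}-\d{2}")),
    ("day", re.compile(r"\d{4}-\d{2}-\d{2}")),
)


def date_precision(value):
    """Return 'year', 'month', 'day', or None for a collection_date cell.

    Hypothetical helper: it only inspects the shape of the string and does
    not check that the month or day is a valid calendar value.
    """
    if not isinstance(value, str):
        return None  # None or float NaN: the record carries no date
    for name, pattern in _PRECISIONS:
        if pattern.fullmatch(value):
            return name
    return None  # e.g. a range string, which belongs in collection_date_range


if __name__ == "__main__":
    for cell in ("2021", "2021-06", "2021-06-15", float("nan")):
        print(repr(cell), date_precision(cell))
```

Range or approximate strings such as 2020-01/2020-06 match none of the patterns, which is consistent with the table placing them in collection_date_range rather than collection_date.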

_extra_attributes

_extra_attributes is a JSON-serialized dict. It captures all attribute key–value pairs from the BioSample XML that do not map to any of the 50 named schema columns via the synonym lookup.

JSON structure:

{
  "antibiogram": [
    {
      "antibiotic_name": "ampicillin",
      "resistance_phenotype": "susceptible",
      "measurement_sign": "<=",
      "measurement": "8",
      "measurement_units": "mg/L",
      "laboratory_typing_method": "MIC",
      "testing_standard": "CLSI"
    }
  ],
  "panel_id": "TREKAMRO",
  "submission_contact": "John Smith",
  "submission_owner": "University Hospital Lab"
}
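A cell of this column can be decoded with the standard json module. The sketch below is illustrative only: the helper names parse_extra_attributes and antibiogram_rows are hypothetical, and it assumes that a missing cell is None or float NaN and that a populated cell holds a JSON object shaped like the example above.

```python
import json


def parse_extra_attributes(cell):
    """Decode one _extra_attributes cell into a dict (empty if missing).

    Hypothetical helper, not part of the library.
    """
    if not isinstance(cell, str) or not cell.strip():
        return {}  # None, float NaN, or an empty string
    data = json.loads(cell)
    # Assumption: the payload is a JSON object; anything else is ignored.
    return data if isinstance(data, dict) else {}


def antibiogram_rows(cell):
    """Return the list of antibiogram dicts, or an empty list if absent."""
    return parse_extra_attributes(cell).get("antibiogram", [])


if __name__ == "__main__":
    cell = json.dumps({
        "antibiogram": [
            {"antibiotic_name": "ampicillin",
             "resistance_phenotype": "susceptible"}
        ],
        "panel_id": "TREKAMRO",
    })
    for row in antibiogram_rows(cell):
        print(row["antibiotic_name"], row["resistance_phenotype"])
```

Returning an empty dict for missing cells keeps callers free of NaN checks; whether that is the right default depends on how the surrounding pipeline treats absent data.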

Known sub-keys:

| Sub-key | Value type | Description |
|---|---|---|
| antibiogram | list of dicts | Antibiogram rows; one dict per antibiotic. See Antibiogram for full details. |
| panel_id | str | AMR panel identifier from NCBI Pathogen records. |
| submission_contact | str | Submitter contact name/email. |
| submission_owner | str | Submitting organization name. |
| (other attribute keys) | str | Any other attribute that did not resolve to a named schema column. Multiple values for the same key are joined with a pipe separator. |

Pipe-separated values: When the NCBI XML contains multiple <Attribute> elements with the same key on a single BioSample record, their values are concatenated with a | separator inside the JSON string. This is intentional: it preserves every submitted value.
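To recover the individual values, a consumer can split on the separator. This is a minimal sketch with a hypothetical helper name, split_values; it assumes the separator may or may not be surrounded by whitespace, and it cannot tell a joined value apart from a single submitted value that happens to contain a | character.

```python
def split_values(value):
    """Split a pipe-joined attribute value into its parts.

    Hypothetical helper. Non-string input (None, float NaN) yields an
    empty list; surrounding whitespace and empty parts are dropped.
    """
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


if __name__ == "__main__":
    print(split_values("blood|urine"))  # two values joined by the pipeline
    print(split_values("soil"))         # a single value passes through
```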