Quickstart
All three examples below use the actual public API of BioMetaHarmonizer v0.6.0.
Every parameter name matches the signature of biometaharmonizer.ingestion.ingest()
exactly as it appears in the source code.
Example 1 — Minimal: Three BioSample Accessions
Fetch three NCBI BioSample records and save the result to a CSV file.
import biometaharmonizer as bmh
df = bmh.ingest(
    source=["SAMN02436525", "SAMN02434874", "SAMN02429261"],
    email="your@email.com",
)
bmh.write(df, "output.csv", fmt="csv")
print(df.shape) # (3, 51)
print(df.columns.tolist())
The returned DataFrame always has exactly 51 columns in the order listed
in Schema Reference. Every column is pre-initialised to None/NaN
for records that do not carry that attribute.
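The pre-initialisation described above can be pictured with plain dictionaries. This is a minimal sketch, not the library's implementation: `SCHEMA_SUBSET` and `align_to_schema` are hypothetical names, only four column names are borrowed from the schema, and the record values are made up.

```python
# Hypothetical illustration; not BioMetaHarmonizer's actual code.
SCHEMA_SUBSET = ["biosample_accession", "organism_name", "host", "collection_date"]


def align_to_schema(record, schema=SCHEMA_SUBSET):
    """Return a dict holding every schema column, in schema order.

    Columns the record does not carry are set to None.
    """
    return {column: record.get(column) for column in schema}


# Made-up record that carries only two of the four columns.
raw = {"biosample_accession": "SAMN00000001", "host": "Homo sapiens"}
row = align_to_schema(raw)

print(list(row))               # column order follows the schema, not the record
print(row["collection_date"])  # missing attribute is present but empty
```

Because every row shares the same keys in the same order, downstream code can select columns without first checking whether they exist.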
Example 2 — Large-Scale Run with API Key and Custom Cache
Read accessions from a plain-text file (one per line), enable API key authentication, override the cache directory, and tune batch sizes.
import biometaharmonizer as bmh
df = bmh.ingest(
    source="accessions.txt",      # Path or str to a file; one accession per line
    email="your@email.com",
    api_key="YOUR_NCBI_API_KEY",
    cache_dir="/data/bmh_cache",  # Custom cache for assembly summary files
    fetch_batch_size=500,         # Records per efetch request (default: 200)
    esearch_batch_size=200,       # Accessions per esearch term (default: 100)
    refresh_cache=False,          # Set True to force re-download of assembly index
)
bmh.write(df, "harmonized.parquet", fmt="parquet")
bmh.write_summary(df, "fill_rates.csv")
accessions.txt may contain BioSample accessions (SAMN/SAME/SAMD),
assembly accessions (GCF_/GCA_), or a mix of both — the tool classifies
them automatically via the internal _classify_ids() function.
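To check an accession file before a long run, prefix-based classification can be approximated as follows. This is a sketch under stated assumptions: the regular expressions are inferred only from the prefixes named above, `classify_accessions` is a hypothetical helper, and the real `_classify_ids()` may apply different or stricter rules.

```python
import re

# Assumed patterns, inferred from the SAMN/SAME/SAMD and GCF_/GCA_ prefixes.
BIOSAMPLE_RE = re.compile(r"SAM[NED][A-Z]?\d+")
ASSEMBLY_RE = re.compile(r"GC[AF]_\d{9}(\.\d+)?")


def classify_accessions(lines):
    """Split accessions into BioSample, assembly, and unrecognised groups."""
    groups = {"biosample": [], "assembly": [], "unknown": []}
    for raw in lines:
        acc = raw.strip()
        if not acc:
            continue  # skip blank lines in a one-per-line file
        if BIOSAMPLE_RE.fullmatch(acc):
            groups["biosample"].append(acc)
        elif ASSEMBLY_RE.fullmatch(acc):
            groups["assembly"].append(acc)
        else:
            groups["unknown"].append(acc)
    return groups


# Lines as they might be read from accessions.txt, including a blank line
# and one entry that matches neither pattern.
lines = ["SAMN02436525\n", "GCF_000009045.1\n", "\n", "GCA_000005845.2\n", "not_an_id\n"]
groups = classify_accessions(lines)
print(groups)
```

Anything that lands in the `unknown` group is worth inspecting by hand before passing the file to `ingest()`.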
Example 3 — Assembly Accession Input
Use GCF_/GCA_ accessions directly. The tool resolves them to BioSample accessions in a two-step process:
1. Local index lookup: both assembly_summary_refseq.txt and assembly_summary_genbank.txt are searched for matching assembly_accession values. This step is instant for cached files.
2. Entrez elink fallback: any accessions not found in the local index are resolved via a live Entrez.esearch + Entrez.esummary call against the biosample database.
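The two-step resolution can be sketched without network access. Everything here is illustrative: `load_index`, `resolve`, and `fake_remote_lookup` are hypothetical names, the index rows are made up, and the sketch assumes the summary files are tab-separated with `assembly_accession` and `biosample` columns. The library's own lookup may differ.

```python
import csv
import io

# Tiny stand-in for a cached assembly summary file (made-up rows).
INDEX_TEXT = (
    "assembly_accession\tbioproject\tbiosample\n"
    "GCF_000000001.1\tPRJNA000001\tSAMN00000001\n"
    "GCA_000000002.1\tPRJNA000002\tSAMN00000002\n"
)


def load_index(text):
    """Map assembly_accession -> biosample from tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["assembly_accession"]: row["biosample"] for row in reader}


def resolve(accessions, index, remote_lookup):
    """Step 1: local index. Step 2: remote_lookup for whatever is left."""
    resolved = {acc: index[acc] for acc in accessions if acc in index}
    missing = [acc for acc in accessions if acc not in index]
    if missing:
        resolved.update(remote_lookup(missing))
    return resolved


def fake_remote_lookup(missing):
    """Stub standing in for the live Entrez call; it returns no matches."""
    return {acc: None for acc in missing}


index = load_index(INDEX_TEXT)
result = resolve(["GCF_000000001.1", "GCA_000000009.1"], index, fake_remote_lookup)
print(result)
```

Passing the fallback in as a callable keeps the local step testable offline; only accessions absent from the index ever reach it.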
import biometaharmonizer as bmh
# Mixed list: RefSeq and GenBank assembly accessions
accessions = [
    "GCF_000009045.1",  # Bacillus subtilis 168
    "GCA_000005845.2",  # E. coli K-12 MG1655
    "GCF_000210835.1",  # Klebsiella pneumoniae NTUH-K2044
]
df = bmh.ingest(
    source=accessions,
    email="your@email.com",
    refresh_cache=False,
)
# The resolved assembly accessions are back-filled from the index:
print(df[["biosample_accession",
          "assembly_accession_refseq",
          "assembly_accession_genbank"]])
Output Columns
The fixed output schema (returned by every ingest() call) contains
exactly 51 columns in the following order:
[
    "biosample_accession", "biosample_id", "sra_accession",
    "bioproject_accession", "assembly_accession_refseq",
    "assembly_accession_genbank", "sample_name_id",
    "taxonomy_id", "taxonomy_name", "organism_name",
    "collection_date", "collection_date_range",
    "geo_loc_name", "lat_lon",
    "geo_country", "geo_region", "geo_locality",
    "geo_iso3166", "geo_sea_ocean", "geo_loc_raw",
    "host", "host_disease", "host_age", "host_sex",
    "host_tissue_sampled", "isolation_source", "sample_type",
    "one_health_category", "isolate", "strain", "sub_strain",
    "serotype", "serovar", "genotype", "culture_collection",
    "outbreak", "env_broad_scale", "env_local_scale", "env_medium",
    "sequencing_method", "assembly_method", "collected_by",
    "ncbi_package", "submission_date", "last_update",
    "publication_date", "access", "status", "status_date",
    "title", "description_comment", "_extra_attributes",
]
See Schema Reference for the full description of every column.
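Since the column order is fixed, a written CSV can be sanity-checked by comparing its header against the list above. The sketch below is an illustration only: `header_matches` and `EXPECTED_PREFIX` are hypothetical names, and it assumes the CSV output begins with a header row of column names. It uses in-memory text instead of a real output file so that it runs on its own.

```python
import csv
import io

# First three columns of the schema; use the full list for a complete check.
EXPECTED_PREFIX = ["biosample_accession", "biosample_id", "sra_accession"]


def header_matches(csv_text, expected=EXPECTED_PREFIX):
    """Return True if the CSV header starts with the expected columns, in order."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return header[: len(expected)] == expected


# Two made-up headers: one in schema order, one with two columns swapped.
good = "biosample_accession,biosample_id,sra_accession,bioproject_accession\n"
bad = "biosample_id,biosample_accession,sra_accession\n"

print(header_matches(good))
print(header_matches(bad))
```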