API Reference
This page documents every public function and class exported by BioMetaHarmonizer v0.6.0. Parameter names, defaults, and types are derived directly from the source code.
biometaharmonizer.ingestion
- biometaharmonizer.ingestion.ingest(source, email=None, api_key=None, cache_dir=None, fetch_batch_size=None, esearch_batch_size=None, refresh_cache=False) → pd.DataFrame
Fetch and return harmonized BioSample metadata as a DataFrame.
Accepts BioSample accessions (SAMN/SAME/SAMD), assembly accessions (GCF_/GCA_), or a mix. Assembly accessions are resolved to BioSample accessions via a local assembly index (downloaded once to cache_dir) with an Entrez elink fallback. BioSample records are fetched using the esearch(usehistory="y") + efetch pipeline in batches of fetch_batch_size.
- Parameters:
source (str, Path, or list[str]) – Path to a file of accessions (one per line), a list of accession strings, or a single accession string. Accepted prefixes: SAMN, SAME, SAMD (BioSample) or GCF_, GCA_ (assembly).
email (str or None) – Contact email for NCBI Entrez. Required if not previously set via set_email(). Must match ^[^@\s]+@[^@\s]+\.[^@\s]+$.
api_key (str or None) – NCBI API key. Raises the rate limit from 3 to 10 req/s. Optional; if not provided, the module-level ENTREZ_API_KEY is used.
cache_dir (str, Path, or None) – Directory for the assembly summary flat-file cache. Defaults to ~/.biometaharmonizer/cache/.
fetch_batch_size (int or None) – Records per efetch request. Defaults to 200. Higher values reduce round trips but increase per-request payload.
esearch_batch_size (int or None) – Accessions per esearch term (BioSample and elink paths). Defaults to 100.
refresh_cache (bool) – When True, delete and re-download the assembly summary files unconditionally. Defaults to False.
- Returns:
DataFrame conforming to the 51-column fixed schema defined by _load_final_schema(). Every column is initialized to None/NaN for records that lack the attribute.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If no valid email is provided (neither via the email parameter nor via a prior set_email() call).
    import biometaharmonizer as bmh

    df = bmh.ingest(
        source=["SAMN02436525", "SAMN02434874"],
        email="your@email.com",
        api_key="YOUR_KEY",
        fetch_batch_size=500,
    )
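The prefix rules and batching described for ingest() can be illustrated without touching the network. The following is a minimal sketch, not the library's code: split_accessions and batched are hypothetical helper names, and raising ValueError on an unrecognized prefix is an assumption (the documentation above does not say how such input is handled).

```python
from typing import Iterable, Iterator

# Prefixes listed in the ingest() documentation above.
BIOSAMPLE_PREFIXES = ("SAMN", "SAME", "SAMD")
ASSEMBLY_PREFIXES = ("GCF_", "GCA_")


def split_accessions(accessions: Iterable[str]) -> tuple[list[str], list[str]]:
    """Split accessions into (biosample, assembly) lists by prefix."""
    biosample, assembly = [], []
    for raw in accessions:
        acc = raw.strip()
        if not acc:
            continue  # skip blank lines, e.g. from an accession file
        if acc.startswith(BIOSAMPLE_PREFIXES):
            biosample.append(acc)
        elif acc.startswith(ASSEMBLY_PREFIXES):
            assembly.append(acc)
        else:
            # Assumption: unknown prefixes are rejected rather than ignored.
            raise ValueError(f"Unrecognized accession prefix: {acc!r}")
    return biosample, assembly


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


samples, assemblies = split_accessions(
    ["SAMN02436525", "GCF_000005845.2", "SAMN02434874"]
)
batches = list(batched(samples, 1))  # one accession per request
```

With fetch_batch_size=200, 1,000 records would be split into five chunks in this sketch; a larger size means fewer but heavier requests.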
- biometaharmonizer.ingestion.set_email(email: str) → None
Set the module-level Entrez contact email used by all subsequent ingest() calls.
- Parameters:
email (str) – Valid e-mail address. Validated against the pattern ^[^@\s]+@[^@\s]+\.[^@\s]+$.
- Raises:
ValueError – If email does not match the pattern.
    import biometaharmonizer as bmh

    bmh.set_email("researcher@institution.edu")
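To see what the documented pattern accepts and rejects, it can be tried directly with the standard re module. This is a sketch only: is_valid_email and validate_email are hypothetical names, and using fullmatch (rather than match or search) is an assumption about how the pattern is applied.

```python
import re

# Pattern quoted in the set_email() documentation above.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_email(email: str) -> bool:
    """Return True if `email` is a string that matches the documented pattern."""
    return isinstance(email, str) and EMAIL_PATTERN.fullmatch(email) is not None


def validate_email(email: str) -> str:
    """Return `email` unchanged if valid, otherwise raise ValueError."""
    if not is_valid_email(email):
        raise ValueError(f"Invalid email address: {email!r}")
    return email


checks = {
    "researcher@institution.edu": is_valid_email("researcher@institution.edu"),
    "no-at-sign": is_valid_email("no-at-sign"),
    "a@b": is_valid_email("a@b"),          # no dot after the @
    "a b@c.d": is_valid_email("a b@c.d"),  # whitespace is not allowed
}
```

The pattern is a shape check only: it requires exactly one @, no whitespace, and at least one dot in the domain part. It does not verify that the address exists.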
- biometaharmonizer.ingestion.set_api_key(key: str) → None
Set the module-level NCBI API key used by all subsequent ingest() calls. Also sets Bio.Entrez.api_key.
- Parameters:
key (str) – NCBI API key string from https://www.ncbi.nlm.nih.gov/account/
    import biometaharmonizer as bmh

    bmh.set_api_key("abc123def456...")
- biometaharmonizer.ingestion.set_cache_dir(path) → None
Override the directory used for assembly summary flat-file caching. Clears the internal _read_assembly_summary_cached LRU cache so that subsequent calls use the new path.
- Parameters:
path (str or Path) – New cache directory path.
    import biometaharmonizer as bmh

    bmh.set_cache_dir("/content/bmh_cache")  # Google Colab example
biometaharmonizer.output
- biometaharmonizer.output.write(df: pd.DataFrame, path, fmt: str = 'csv') → Path
Write a harmonized DataFrame to disk in the specified format.
- Parameters:
df (pandas.DataFrame) – Harmonized DataFrame to write.
path (str or Path) – Destination file path. Parent directories are created automatically.
fmt (str) – Output format. One of "csv", "tsv", "excel", "parquet". Case-insensitive. Defaults to "csv".
- Returns:
Resolved absolute path to the written file.
- Return type:
Path
- Raises:
ValueError – If fmt is not one of the four supported formats.
    from biometaharmonizer.output import write

    write(df, "harmonized.parquet", fmt="parquet")
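The documented behavior of write() (case-insensitive format name, automatic parent directories, resolved return path, ValueError for unknown formats) could be approximated with plain pandas as below. This is an illustrative stand-in, not the library's implementation: write_sketch is a hypothetical name, and index=False is an assumption about the output layout.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_sketch(df: pd.DataFrame, path, fmt: str = "csv") -> Path:
    """Write `df` in one of four formats and return the resolved path."""
    writers = {
        "csv": lambda p: df.to_csv(p, index=False),
        "tsv": lambda p: df.to_csv(p, sep="\t", index=False),
        "excel": lambda p: df.to_excel(p, index=False),      # needs openpyxl
        "parquet": lambda p: df.to_parquet(p, index=False),  # needs pyarrow or fastparquet
    }
    key = fmt.lower()  # format names are matched case-insensitively
    if key not in writers:
        raise ValueError(f"Unsupported format {fmt!r}; expected one of {sorted(writers)}")
    target = Path(path).expanduser().resolve()
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing parent directories
    writers[key](target)
    return target


# Demo: write a TSV into a nested directory that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    out = write_sketch(
        pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}),
        Path(tmp) / "nested" / "demo.tsv",
        fmt="TSV",
    )
    header = out.read_text().splitlines()[0]
```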
- biometaharmonizer.output.write_summary(df: pd.DataFrame, path) → Path
Write a fill-rate summary CSV for each column in df.
The output CSV has three columns: column_name, non_null_count, fill_pct. fill_pct is rounded to one decimal place.
- Parameters:
df (pandas.DataFrame) – Source DataFrame to summarize.
path (str or Path) – Destination file path for the summary CSV.
- Returns:
Resolved absolute path to the written summary file.
- Return type:
Path
    from biometaharmonizer.output import write_summary

    write_summary(df, "fill_rates.csv")
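The three summary columns can be reproduced in memory to check what a fill rate means for a given DataFrame. The sketch below assumes fill_pct is the percentage of non-null rows per column; fill_rate_summary is a hypothetical helper, not part of the package.

```python
import pandas as pd


def fill_rate_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its non-null count and fill percentage."""
    non_null = df.notna().sum()
    total = len(df)
    # Guard against division by zero for an empty DataFrame.
    fill_pct = (non_null / total * 100).round(1) if total else non_null * 0.0
    return pd.DataFrame({
        "column_name": list(df.columns),
        "non_null_count": non_null.to_numpy(),
        "fill_pct": fill_pct.to_numpy(),
    })


demo = pd.DataFrame({
    "host": ["Homo sapiens", None, "Gallus gallus"],
    "collection_date": ["2019", None, None],
})
summary = fill_rate_summary(demo)
```

Here host is filled in two of three rows and collection_date in one of three, so their fill percentages round to 66.7 and 33.3.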
biometaharmonizer.date_engine
- class biometaharmonizer.date_engine.DateEngine
Temporal parsing engine. Converts date strings to a truncated ISO 8601 representation.
- parse(series) → pd.Series
Parse a Series of date strings to ISO 8601 point dates. For performance, each unique value is parsed only once.
- Parameters:
series (pandas.Series or pandas.DataFrame) – Series of raw date strings.
- Returns:
Series of ISO 8601 strings (YYYY, YYYY-MM, or YYYY-MM-DD) with NaN for unparseable or null values.
- Return type:
pandas.Series
- parse_with_range(series) → pd.DataFrame
Parse dates and return a two-column DataFrame.
- Parameters:
series (pandas.Series) – Series of raw date strings.
- Returns:
DataFrame with columns collection_date (ISO 8601 point date or NaN) and collection_date_range (verbatim original string for any range/approximate input, else NaN).
- Return type:
pandas.DataFrame
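The three truncated ISO 8601 shapes that DateEngine emits (YYYY, YYYY-MM, YYYY-MM-DD) can be recognized with a short standard-library check, which may be handy when validating harmonized output downstream. This sketch only inspects already-normalized strings; it is not the engine's parser, and iso_precision is a hypothetical name.

```python
import re
from datetime import date
from typing import Optional

# YYYY, optionally followed by -MM, optionally followed by -DD.
ISO_TRUNCATED = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")


def iso_precision(value: str) -> Optional[str]:
    """Return 'year', 'month', or 'day' for a valid truncated ISO date, else None."""
    m = ISO_TRUNCATED.fullmatch(value.strip())
    if m is None:
        return None
    year, month, day = m.groups()
    try:
        # Check calendar validity, filling missing parts with 1.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return None
    if day:
        return "day"
    return "month" if month else "year"


precisions = [iso_precision(v) for v in ("2019", "2019-05", "2019-05-17", "2019-02-30")]
```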
biometaharmonizer.geo_engine
- class biometaharmonizer.geo_engine.GeoEngine
Geospatial resolution engine. Parses geo_loc_name strings into structured geographic fields.
- parse(series) → pd.DataFrame
Parse a Series of geo_loc_name strings into a six-column DataFrame.
- Parameters:
series (pandas.Series) – Series of geo_loc_name strings.
- Returns:
DataFrame with columns: geo_country, geo_region, geo_locality, geo_iso3166, geo_sea_ocean, geo_loc_raw.
- Return type:
pandas.DataFrame
    from biometaharmonizer.geo_engine import GeoEngine
    import pandas as pd

    geo = GeoEngine()
    series = pd.Series(["Russia: Novosibirsk", "USA: California, San Diego"])
    result = geo.parse(series)
    print(result[["geo_country", "geo_region", "geo_iso3166"]])
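The input strings in the example above follow the "country: region, locality" shape. A naive split of that shape is sketched below to make the structure concrete. It is an assumption-laden illustration, not GeoEngine's logic: split_geo_loc_name and GeoParts are hypothetical, the first comma-separated item is assumed to be the region, and ISO 3166 codes and sea/ocean names are not handled at all.

```python
from typing import NamedTuple, Optional


class GeoParts(NamedTuple):
    country: Optional[str]
    region: Optional[str]
    locality: Optional[str]


def split_geo_loc_name(value: str) -> GeoParts:
    """Split 'country: region, locality' on the first colon, then on commas."""
    country, _, rest = value.partition(":")
    parts = [p.strip() for p in rest.split(",") if p.strip()]
    region = parts[0] if parts else None
    # Anything after the first comma-separated item is treated as locality.
    locality = ", ".join(parts[1:]) or None
    return GeoParts(country.strip() or None, region, locality)


russia = split_geo_loc_name("Russia: Novosibirsk")
usa = split_geo_loc_name("USA: California, San Diego")
```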
biometaharmonizer.one_health
- class biometaharmonizer.one_health.OneHealthClassifier
One Health categorization classifier. Loads biological knowledge exclusively from one_health_dictionaries.json.
- classify(series) → pd.Series
Single-field classification from a Series of isolation_source values.
- Parameters:
series (pandas.Series) – Series of isolation source strings.
- Returns:
Series of One Health category strings.
- Return type:
pandas.Series
- classify_joint(isolation_source_series, host_series) → pd.Series
Two-field classification using both isolation_source and host. Both Series must share the same index.
- Parameters:
isolation_source_series (pandas.Series) – Series of isolation source strings.
host_series (pandas.Series) – Series of host strings.
- Returns:
Series of One Health category strings.
- Return type:
pandas.Series
- Raises:
ValueError – If the two Series do not share the same index.
- classify_with_confidence(series) → pd.DataFrame
Single-field classification with confidence scores.
- Parameters:
series (pandas.Series) – Series of isolation source strings.
- Returns:
DataFrame with columns: one_health_category, one_health_term, one_health_confidence.
- Return type:
pandas.DataFrame
- classify_multi_field(**fields) → pd.DataFrame
Multi-field evidence integration. Accepts named pd.Series for any of: isolation_source, host, env_medium, env_local_scale, env_broad_scale, sample_type.
- Returns:
DataFrame with columns: one_health_category, one_health_term, one_health_confidence, one_health_evidence_level, one_health_processing, one_health_setting, one_health_source_field.
- Return type:
pandas.DataFrame
    from biometaharmonizer.one_health import OneHealthClassifier
    import pandas as pd

    clf = OneHealthClassifier()
    sources = pd.Series(["blood", "chicken feces", "river sediment"])
    print(clf.classify(sources))
    # 0    Human
    # 1    Animal
    # 2    Environmental
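classify_joint() requires both Series to share the same index, which is easy to violate after filtering or resetting one of them. A pre-flight check along the following lines can catch the mismatch early. It is a sketch: check_same_index is a hypothetical helper, and interpreting "same index" as pandas Index.equals (same labels in the same order) is an assumption.

```python
import pandas as pd


def check_same_index(isolation_source: pd.Series, host: pd.Series) -> None:
    """Raise ValueError unless both Series carry identical index labels in order."""
    if not isolation_source.index.equals(host.index):
        raise ValueError("isolation_source and host must share the same index")


source = pd.Series(["blood", "chicken feces"], index=[10, 11])
host = pd.Series(["Homo sapiens", "Gallus gallus"], index=[10, 11])
check_same_index(source, host)  # passes silently

shifted = host.reset_index(drop=True)  # index becomes 0, 1
try:
    check_same_index(source, shifted)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Selecting both columns from one DataFrame (for example df["isolation_source"] and df["host"]) keeps the indexes aligned by construction.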
biometaharmonizer.key_mapper
- class biometaharmonizer.key_mapper.KeyMapper
Harmonizes column names for custom or non-ingestion workflows by applying the two-layer synonym lookup from synonyms.py.
- map_columns(df) → pd.DataFrame
Rename raw columns to canonical standard keys, coalesce duplicate columns (using combine_first, with priority given to the leftmost column), and reindex to the fixed output schema.
- Parameters:
df (pandas.DataFrame) – Input DataFrame with potentially non-standard column names.
- Returns:
DataFrame reindexed to BIOSAMPLE_SCHEMA.
- Return type:
pandas.DataFrame
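The rename, coalesce, and reindex steps of map_columns() can be pictured with a small pandas example. Everything below is a stand-in: SYNONYMS and SCHEMA are tiny hypothetical tables (the real synonym lookup and BIOSAMPLE_SCHEMA are larger), map_columns_sketch is not the library's method, and dropping unmapped columns is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the synonym lookup and the fixed output schema.
SYNONYMS = {"host": "host", "specific_host": "host", "isolation source": "isolation_source"}
SCHEMA = ["host", "isolation_source", "collection_date"]


def map_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Rename via SYNONYMS, coalesce duplicates left to right, reindex to SCHEMA."""
    out = pd.DataFrame(index=df.index)
    for position, raw_name in enumerate(df.columns):
        key = SYNONYMS.get(str(raw_name).strip().lower())
        if key is None:
            continue  # assumption: columns without a known synonym are dropped
        column = df.iloc[:, position]
        # Leftmost column wins; later duplicates only fill its gaps.
        out[key] = out[key].combine_first(column) if key in out else column
    # Columns missing from the input appear as all-NaN.
    return out.reindex(columns=SCHEMA)


raw = pd.DataFrame({
    "Host": ["Homo sapiens", None],
    "specific_host": ["ignored", "Bos taurus"],
    "Isolation Source": ["blood", "milk"],
})
mapped = map_columns_sketch(raw)
```

In the first row the value from the leftmost column (Host) is kept and the specific_host value is discarded; in the second row, where Host is null, specific_host fills the gap.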
biometaharmonizer.synonyms
- biometaharmonizer.synonyms.build_synonym_lookup() → dict
Build and return a {lowercased_synonym: standard_key} dict from two sources: unified.json (Layer 1) and ncbi_attributes.xml (Layer 2, optional). The result is cached via functools.lru_cache after the first call.
- Returns:
Dictionary mapping lowercased synonym strings to canonical standard key strings. Returns an empty dict if neither schema file is present.
- Return type:
dict
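The shape of the returned mapping and the effect of functools.lru_cache can be shown with an in-memory stand-in. The sketch below does not read unified.json or ncbi_attributes.xml; _SCHEMA is a hypothetical structure (the real file layout may differ) and build_lookup_sketch is not the library function.

```python
from functools import lru_cache

# Hypothetical stand-in for schema file contents: standard key -> known synonyms.
_SCHEMA = {
    "collection_date": ["Collection Date", "sampling date"],
    "host": ["Host", "specific_host"],
}


@lru_cache(maxsize=1)
def build_lookup_sketch() -> dict:
    """Build {lowercased_synonym: standard_key}; computed once, then cached."""
    lookup = {}
    for standard_key, synonyms in _SCHEMA.items():
        lookup[standard_key.lower()] = standard_key  # a key maps to itself
        for synonym in synonyms:
            lookup[synonym.strip().lower()] = standard_key
    return lookup


first = build_lookup_sketch()
second = build_lookup_sketch()  # served from the cache: the same dict object
```

Because a cached function hands back the same dict object on every call, callers should treat the result as read-only; mutating it would change what every later caller sees.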