.. _api_reference:

=============
API Reference
=============

This page documents every public function and class exported by
BioMetaHarmonizer v0.6.0. Parameter names, defaults, and types are derived
directly from the source code.

biometaharmonizer.ingestion
-----------------------------

.. function:: biometaharmonizer.ingestion.ingest(source, email=None, api_key=None, cache_dir=None, fetch_batch_size=None, esearch_batch_size=None, refresh_cache=False) -> pd.DataFrame

   Fetch and return harmonized BioSample metadata as a DataFrame.

   Accepts BioSample accessions (``SAMN``/``SAME``/``SAMD``), assembly
   accessions (``GCF_``/``GCA_``), or a mix. Assembly accessions are resolved
   to BioSample accessions via a local assembly index (downloaded once to
   ``cache_dir``) with an Entrez elink fallback. BioSample records are fetched
   using the ``esearch(usehistory="y")`` + ``efetch`` pipeline in batches of
   ``fetch_batch_size``.

   :param source: Path to a file of accessions (one per line), a list of
      accession strings, or a single accession string. Accepted prefixes:
      ``SAMN``, ``SAME``, ``SAMD`` (BioSample) or ``GCF_``, ``GCA_``
      (assembly).
   :type source: str, Path, or list[str]
   :param email: Contact email for NCBI Entrez. Required if not previously
      set via :func:`set_email`. Must match ``^[^@\s]+@[^@\s]+\.[^@\s]+$``.
   :type email: str or None
   :param api_key: Optional NCBI API key, which raises the rate limit from 3
      to 10 requests per second. If omitted, the module-level
      ``ENTREZ_API_KEY`` is used.
   :type api_key: str or None
   :param cache_dir: Directory for the assembly summary flat-file cache.
      Defaults to ``~/.biometaharmonizer/cache/``.
   :type cache_dir: str, Path, or None
   :param fetch_batch_size: Records per ``efetch`` request. Defaults to 200.
      Higher values reduce round trips but increase the per-request payload.
   :type fetch_batch_size: int or None
   :param esearch_batch_size: Accessions per ``esearch`` term (BioSample and
      elink paths). Defaults to 100.
   :type esearch_batch_size: int or None
   :param refresh_cache: When ``True``, delete and re-download the assembly
      summary files unconditionally. Defaults to ``False``.
   :type refresh_cache: bool
   :returns: DataFrame conforming to the 51-column fixed schema defined by
      ``_load_final_schema()``. Every column is initialized to ``None``/NaN
      for records that lack the corresponding attribute.
   :rtype: pandas.DataFrame
   :raises ValueError: If no valid email is provided, either via the
      ``email`` parameter or via a prior :func:`set_email` call.

   .. code-block:: python

      import biometaharmonizer as bmh

      df = bmh.ingest(
          source=["SAMN02436525", "SAMN02434874"],
          email="your@email.com",
          api_key="YOUR_KEY",
          fetch_batch_size=500,
      )

.. function:: biometaharmonizer.ingestion.set_email(email: str) -> None

   Set the module-level Entrez contact email used by all subsequent
   ``ingest()`` calls.

   :param email: Valid email address. Validated against the pattern
      ``^[^@\s]+@[^@\s]+\.[^@\s]+$``.
   :type email: str
   :raises ValueError: If ``email`` does not match the pattern.

   .. code-block:: python

      import biometaharmonizer as bmh

      bmh.set_email("researcher@institution.edu")

.. function:: biometaharmonizer.ingestion.set_api_key(key: str) -> None

   Set the module-level NCBI API key used by all subsequent ``ingest()``
   calls. Also sets ``Bio.Entrez.api_key``.

   :param key: NCBI API key string from https://www.ncbi.nlm.nih.gov/account/
   :type key: str

   .. code-block:: python

      import biometaharmonizer as bmh

      bmh.set_api_key("abc123def456...")

.. function:: biometaharmonizer.ingestion.set_cache_dir(path) -> None

   Override the directory used for assembly summary flat-file caching. Clears
   the internal ``_read_assembly_summary_cached`` LRU cache so that subsequent
   calls use the new path.

   :param path: New cache directory path.
   :type path: str or Path

   .. code-block:: python

      import biometaharmonizer as bmh

      bmh.set_cache_dir("/content/bmh_cache")  # Google Colab example

biometaharmonizer.output
--------------------------

.. function:: biometaharmonizer.output.write(df: pd.DataFrame, path, fmt: str = "csv") -> Path

   Write a harmonized DataFrame to disk in the specified format.

   :param df: Harmonized DataFrame to write.
   :type df: pandas.DataFrame
   :param path: Destination file path. Parent directories are created
      automatically.
   :type path: str or Path
   :param fmt: Output format. One of ``"csv"``, ``"tsv"``, ``"excel"``,
      ``"parquet"``. Case-insensitive. Defaults to ``"csv"``.
   :type fmt: str
   :returns: Resolved absolute path to the written file.
   :rtype: pathlib.Path
   :raises ValueError: If ``fmt`` is not one of the four supported formats.

   .. code-block:: python

      from biometaharmonizer.output import write

      write(df, "harmonized.parquet", fmt="parquet")

.. function:: biometaharmonizer.output.write_summary(df: pd.DataFrame, path) -> Path

   Write a fill-rate summary CSV with one row per column in ``df``. The
   output CSV has three columns: ``column_name``, ``non_null_count``, and
   ``fill_pct``. ``fill_pct`` is rounded to one decimal place.

   :param df: Source DataFrame to summarize.
   :type df: pandas.DataFrame
   :param path: Destination file path for the summary CSV.
   :type path: str or Path
   :returns: Resolved absolute path to the written summary file.
   :rtype: pathlib.Path

   .. code-block:: python

      from biometaharmonizer.output import write_summary

      write_summary(df, "fill_rates.csv")

biometaharmonizer.date_engine
-------------------------------

.. class:: biometaharmonizer.date_engine.DateEngine

   Temporal parsing engine. Converts date strings to ISO 8601 truncated
   representation.

   .. method:: parse(series) -> pd.Series

      Parse a Series of date strings to ISO 8601 point dates. Deduplicates
      values before parsing so that each unique string is parsed only once.

      :param series: Series of raw date strings.
      :type series: pandas.Series or pandas.DataFrame
      :returns: Series of ISO 8601 strings (``YYYY``, ``YYYY-MM``, or
         ``YYYY-MM-DD``) with NaN for unparseable or null values.
      :rtype: pandas.Series

   .. method:: parse_with_range(series) -> pd.DataFrame

      Parse dates and return a two-column DataFrame.

      :param series: Series of raw date strings.
      :type series: pandas.Series
      :returns: DataFrame with columns ``collection_date`` (ISO 8601 point
         date or NaN) and ``collection_date_range`` (verbatim original string
         for any range/approximate input, else NaN).
      :rtype: pandas.DataFrame

   .. staticmethod:: _detect_range(value) -> bool

      Return ``True`` if *value* represents a date range or approximate date
      in any supported format. Must be called before the value is passed to
      ``dateutil`` to prevent silent misparsing.

biometaharmonizer.geo_engine
------------------------------

.. class:: biometaharmonizer.geo_engine.GeoEngine

   Geospatial resolution engine. Parses ``geo_loc_name`` strings into
   structured geographic fields.

   .. method:: parse(series) -> pd.DataFrame

      Parse a Series of ``geo_loc_name`` strings into a six-column DataFrame.

      :param series: Series of ``geo_loc_name`` strings.
      :type series: pandas.Series
      :returns: DataFrame with columns: ``geo_country``, ``geo_region``,
         ``geo_locality``, ``geo_iso3166``, ``geo_sea_ocean``,
         ``geo_loc_raw``.
      :rtype: pandas.DataFrame

   .. code-block:: python

      from biometaharmonizer.geo_engine import GeoEngine
      import pandas as pd

      geo = GeoEngine()
      series = pd.Series(["Russia: Novosibirsk", "USA: California, San Diego"])
      result = geo.parse(series)
      print(result[["geo_country", "geo_region", "geo_iso3166"]])

biometaharmonizer.one_health
------------------------------

.. class:: biometaharmonizer.one_health.OneHealthClassifier

   One Health categorization classifier. Loads biological knowledge
   exclusively from ``one_health_dictionaries.json``.

   .. method:: classify(series) -> pd.Series

      Single-field classification from a Series of ``isolation_source``
      values.

      :param series: Series of isolation source strings.
      :type series: pandas.Series
      :returns: Series of One Health category strings.
      :rtype: pandas.Series

   .. method:: classify_joint(isolation_source_series, host_series) -> pd.Series

      Two-field classification using both ``isolation_source`` and ``host``.
      Both Series must share the same index.

      :param isolation_source_series: Series of isolation source strings.
      :type isolation_source_series: pandas.Series
      :param host_series: Series of host strings.
      :type host_series: pandas.Series
      :returns: Series of One Health category strings.
      :rtype: pandas.Series
      :raises ValueError: If the two Series do not share the same index.

   .. method:: classify_with_confidence(series) -> pd.DataFrame

      Single-field classification with confidence scores.

      :param series: Series of isolation source strings.
      :type series: pandas.Series
      :returns: DataFrame with columns: ``one_health_category``,
         ``one_health_term``, ``one_health_confidence``.
      :rtype: pandas.DataFrame

   .. method:: classify_multi_field(**fields) -> pd.DataFrame

      Multi-field evidence integration. Accepts named ``pd.Series`` for any
      of: ``isolation_source``, ``host``, ``env_medium``,
      ``env_local_scale``, ``env_broad_scale``, ``sample_type``.

      :returns: DataFrame with columns: ``one_health_category``,
         ``one_health_term``, ``one_health_confidence``,
         ``one_health_evidence_level``, ``one_health_processing``,
         ``one_health_setting``, ``one_health_source_field``.
      :rtype: pandas.DataFrame

   .. code-block:: python

      from biometaharmonizer.one_health import OneHealthClassifier
      import pandas as pd

      clf = OneHealthClassifier()
      sources = pd.Series(["blood", "chicken feces", "river sediment"])
      print(clf.classify(sources))
      # 0    Human
      # 1    Animal
      # 2    Environmental

biometaharmonizer.key_mapper
------------------------------

.. class:: biometaharmonizer.key_mapper.KeyMapper

   Harmonizes column names for custom or non-ingestion workflows by applying
   the two-layer synonym lookup from ``synonyms.py``.

   .. method:: map_columns(df) -> pd.DataFrame

      Rename raw columns to canonical standard keys, coalesce duplicate
      columns with ``combine_first`` (the leftmost column takes priority),
      and reindex to the fixed output schema.

      :param df: Input DataFrame with potentially non-standard column names.
      :type df: pandas.DataFrame
      :returns: DataFrame reindexed to ``BIOSAMPLE_SCHEMA``.
      :rtype: pandas.DataFrame

biometaharmonizer.synonyms
----------------------------

.. function:: biometaharmonizer.synonyms.build_synonym_lookup() -> dict

   Build and return a ``{lowercased_synonym: standard_key}`` dict from two
   sources: ``unified.json`` (Layer 1) and ``ncbi_attributes.xml`` (Layer 2,
   optional). The result is cached via ``functools.lru_cache`` after the
   first call.

   :returns: Dictionary mapping lowercased synonym strings to canonical
      standard key strings. Returns an empty dict if neither schema file is
      present.
   :rtype: dict
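
The cached two-layer lookup described above can be illustrated with a minimal
sketch. It is not the library's implementation: the in-memory ``LAYER1`` and
``LAYER2`` mappings and the helper ``build_lookup_sketch`` are hypothetical
stand-ins for ``unified.json`` and ``ncbi_attributes.xml``, whose actual
layout is not documented on this page. The sketch also assumes that Layer 1
entries take precedence when both layers define the same synonym.

.. code-block:: python

   from functools import lru_cache

   # Hypothetical stand-ins for the two schema sources (real layouts may differ).
   LAYER1 = {
       "collection_date": ["Collection Date", "collection-date"],
       "geo_loc_name": ["Geographic Location", "country"],
   }
   LAYER2 = {
       "host": ["Host", "specific_host"],
       "geo_loc_name": ["Country"],
   }


   @lru_cache(maxsize=None)
   def build_lookup_sketch() -> dict:
       """Return a {lowercased_synonym: standard_key} dict, built once and cached."""
       lookup = {}
       # Assumption: Layer 1 is applied first and is never overwritten by Layer 2.
       for layer in (LAYER1, LAYER2):
           for standard_key, synonyms in layer.items():
               for synonym in (standard_key, *synonyms):
                   lookup.setdefault(synonym.strip().lower(), standard_key)
       return lookup


   lookup = build_lookup_sketch()
   print(lookup["collection date"])
   # The second call is served from the cache and returns the same object.
   print(build_lookup_sketch() is lookup)

Because ``lru_cache`` hands back the same dict object on every call, callers
of a function cached this way should treat the result as read-only.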