Installation

Requirements

BioMetaHarmonizer requires Python 3.9 or later.
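
If you want to confirm the interpreter version before installing, a check along the following lines may help. This is only a sketch: the helper name python_is_supported is hypothetical and not part of BioMetaHarmonizer.

```python
import sys

# Minimum interpreter version stated in the requirements above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_is_supported() else "too old"
    print(f"Python {current}: {status}")
```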

PyPI Installation

Install the latest stable release with pip:

pip install biometaharmonizer

Development Install from Source

Clone the repository and install in editable mode:

git clone https://github.com/rustam-bioinfo/BioMetaHarmonizer.git
cd BioMetaHarmonizer
pip install -e .

Both installation methods automatically install the following dependencies, which are declared in pyproject.toml:

Package	Minimum version
pandas	>=1.5
numpy	>=1.24
biopython	>=1.80
requests	>=2.28
pycountry	>=22.3
python-dateutil	>=2.8
openpyxl	>=3.0
pyarrow	>=12.0
rapidfuzz	>=3.0.0

rapidfuzz is declared as a runtime dependency and powers the fuzzy-matching fallback layer in biometaharmonizer.one_health.OneHealthClassifier. If it cannot be imported, the classifier logs a warning and disables the fuzzy layer; all other functionality remains available.
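
The graceful-degradation behaviour described above follows a common optional-import pattern. The sketch below illustrates that general pattern only; it is not the classifier's actual code, and the names load_optional and FUZZY_ENABLED are hypothetical.

```python
import importlib
import logging

logger = logging.getLogger("optional_import_sketch")


def load_optional(module_name):
    """Import a module by name, or log a warning and return None if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s is not installed; the feature that needs it is disabled",
                       module_name)
        return None


# Hypothetical flag: code that wants fuzzy matching would check it first.
_rapidfuzz = load_optional("rapidfuzz")
FUZZY_ENABLED = _rapidfuzz is not None
```

Code that relies on the optional module then branches on the flag instead of failing with an ImportError.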

openpyxl is required only when writing Excel output (fmt="excel"). pyarrow is required only when writing Parquet output (fmt="parquet"). kaleido is an optional dependency of scripts/generate_summary_report.py for PDF export; install it separately with pip install kaleido if needed.

NCBI API Key

All Entrez requests require a contact e-mail address. Without an API key, NCBI enforces a rate limit of 3 requests per second. With a free API key the limit rises to 10 requests per second, which roughly triples throughput on large jobs.
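
To make the two limits concrete, the sketch below shows one simple way a client could space out its requests: at most one request per 1/3 s without a key and one per 1/10 s with a key. It illustrates the idea only; the Throttle class and throttle_for helper are hypothetical and do not describe how BioMetaHarmonizer paces its Entrez calls.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def throttle_for(api_key=None):
    """Pick the limit quoted above: 10 requests/s with a key, 3 without."""
    return Throttle(10 if api_key else 3)
```

Calling wait() immediately before each request keeps the client under the applicable limit.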

To register a free API key:

1. Create or log in to your NCBI account at https://www.ncbi.nlm.nih.gov/account/
2. Navigate to Settings → API Key Management.
3. Click Create an API Key and copy the generated string.

Pass the key to biometaharmonizer.ingestion.set_api_key() before calling biometaharmonizer.ingestion.ingest(), or supply it directly as the api_key argument:

import biometaharmonizer as bmh
bmh.set_api_key("YOUR_API_KEY_HERE")
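
To avoid hard-coding the key in scripts or notebooks, it can be read from an environment variable instead. The sketch below assumes a variable named NCBI_API_KEY; both that name and the helper functions are illustrative choices, not something BioMetaHarmonizer is documented to read itself.

```python
import os


def api_key_from_env(var_name="NCBI_API_KEY", environ=os.environ):
    """Return the key stored in the environment, or None if unset or blank."""
    value = environ.get(var_name, "").strip()
    return value or None


def configure_api_key(set_key, environ=os.environ):
    """Call set_key(key) if a key is available; return True when one was set."""
    key = api_key_from_env(environ=environ)
    if key is not None:
        set_key(key)
    return key is not None
```

With the package imported as bmh, this would be used as configure_api_key(bmh.set_api_key).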

Cache Directory

BioMetaHarmonizer downloads two NCBI assembly summary flat files, one for RefSeq and one for GenBank (approximately 100 MB each), and caches them locally.

Default location: ~/.biometaharmonizer/cache/

This is the value of the module-level constant biometaharmonizer.ingestion.CACHE_DIR.

Files stored in the cache:

assembly_summary_refseq.txt — NCBI RefSeq assembly summary
assembly_summary_genbank.txt — NCBI GenBank assembly summary

Time-to-live (TTL): Cache files older than 7 days (the value of _CACHE_TTL_DAYS) are automatically re-downloaded on the next ingest() call. You can also force an immediate refresh with refresh_cache=True.
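
The TTL rule amounts to comparing a file's modification time with the current time. The following sketch shows that check in isolation; it assumes staleness is judged by file modification time, and the is_stale helper is hypothetical rather than the package's own implementation.

```python
import time
from pathlib import Path

TTL_DAYS = 7  # mirrors the 7-day value quoted above
SECONDS_PER_DAY = 86400


def is_stale(path, ttl_days=TTL_DAYS, now=None):
    """Return True if the file is missing or was last modified over ttl_days ago."""
    path = Path(path)
    if not path.exists():
        return True
    now = time.time() if now is None else now
    age_seconds = now - path.stat().st_mtime
    return age_seconds > ttl_days * SECONDS_PER_DAY
```

A file for which is_stale() returns True would be downloaded again; otherwise the cached copy is reused.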

To override the default cache location, call biometaharmonizer.ingestion.set_cache_dir() before ingestion:

import biometaharmonizer as bmh
# In Google Colab, keep the cache under the /content working directory
bmh.set_cache_dir("/content/bmh_cache")

Note

In Google Colab the home directory ~ resolves to /root, which lies outside the /content working directory. Setting the cache to /content or a subdirectory keeps the files in the session's writable working directory. To keep them across sessions, point the cache at a folder inside your mounted Google Drive instead.
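
A script that runs both locally and in Colab could pick the cache location at run time. The sketch below uses a common heuristic, checking whether the google.colab module has been loaded, to detect Colab; the helper names are hypothetical and the heuristic is an assumption, not part of BioMetaHarmonizer.

```python
import sys
from pathlib import Path


def in_colab(modules=None):
    """Heuristic: the google.colab module is loaded inside Colab kernels."""
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


def pick_cache_dir(colab=None):
    """Return /content/bmh_cache in Colab, else the default cache location."""
    if colab is None:
        colab = in_colab()
    if colab:
        return Path("/content/bmh_cache")
    return Path.home() / ".biometaharmonizer" / "cache"
```

The result can then be passed on, for example as bmh.set_cache_dir(str(pick_cache_dir())).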