Installation
Requirements
BioMetaHarmonizer requires Python 3.9 or later.
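The interpreter version can be checked before installing. The following is a minimal sketch; the helper name `meets_minimum` is illustrative and is not part of BioMetaHarmonizer.

```python
import sys

MINIMUM_PYTHON = (3, 9)  # from the requirement stated above


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```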
PyPI Installation
Install the latest stable release with pip:
pip install biometaharmonizer
Development Install from Source
Clone the repository and install in editable mode:
git clone https://github.com/rustam-bioinfo/BioMetaHarmonizer.git
cd BioMetaHarmonizer
pip install -e .
The following dependencies are declared in pyproject.toml and are
installed automatically:
| Package | Minimum version |
|---|---|
| pandas | >=1.5 |
| numpy | >=1.24 |
| biopython | >=1.80 |
| requests | >=2.28 |
| pycountry | >=22.3 |
| python-dateutil | >=2.8 |
| openpyxl | >=3.0 |
| pyarrow | >=12.0 |
| rapidfuzz | >=3.0.0 |
rapidfuzz is declared as a runtime dependency and enables the fuzzy-matching
fallback layer in biometaharmonizer.one_health.OneHealthClassifier.
If it is nevertheless missing at import time, the classifier logs a warning and
disables the fuzzy layer; all other functionality remains available.
openpyxl is required only when writing Excel output (fmt="excel").
pyarrow is required only when writing Parquet output (fmt="parquet").
kaleido is an optional dependency of scripts/generate_summary_report.py
for PDF export; install it separately with pip install kaleido if needed.
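To confirm that an environment satisfies the minimum versions listed above, the installed versions can be read with the standard library's `importlib.metadata`. This is a sketch only: the helper names are hypothetical, and the comparison looks at leading numeric components and ignores pre-release suffixes. `packaging.version.Version` is the more rigorous choice if that package is available.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUMS = {
    "pandas": "1.5",
    "numpy": "1.24",
    "biopython": "1.80",
    "requests": "2.28",
    "pycountry": "22.3",
    "python-dateutil": "2.8",
    "openpyxl": "3.0",
    "pyarrow": "12.0",
    "rapidfuzz": "3.0.0",
}


def version_tuple(text):
    """Convert '3.0.0' or '1.24.0rc1' to a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings numerically, zero-padding the shorter one."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimums=MINIMUMS, get_version=metadata.version):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (installed, at_least(installed, minimum))
    return report
```

Calling `check_dependencies()` with no arguments inspects the current environment; a package that is not installed is reported as `(None, False)`.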
NCBI API Key
All Entrez requests require a contact e-mail address. Without an API key, NCBI enforces a rate limit of 3 requests per second. With a free API key the limit rises to 10 requests per second, which roughly triples throughput on large jobs.
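The effect of the two rate limits on a large job can be estimated with simple arithmetic. The sketch below computes only the lower bound imposed by the rate limit itself; the job size is a hypothetical figure, and real runs will also include network latency and processing time.

```python
RATE_WITHOUT_KEY = 3   # requests per second, as stated above
RATE_WITH_KEY = 10     # requests per second with an API key


def minimum_seconds(n_requests, requests_per_second):
    """Lower bound on wall-clock time imposed by the rate limit alone."""
    return n_requests / requests_per_second


if __name__ == "__main__":
    n = 30_000  # hypothetical number of Entrez requests in one job
    for label, rate in (("no key", RATE_WITHOUT_KEY), ("with key", RATE_WITH_KEY)):
        print(f"{label}: at least {minimum_seconds(n, rate) / 60:.1f} minutes")
```

For 30,000 requests this gives a floor of roughly 167 minutes without a key and 50 minutes with one, a ratio of 10/3.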
To register a free API key:
1. Create or log in to your NCBI account at https://www.ncbi.nlm.nih.gov/account/
2. Navigate to Settings → API Key Management.
3. Click Create an API Key and copy the generated string.
Pass the key to biometaharmonizer.ingestion.set_api_key() before calling
biometaharmonizer.ingestion.ingest(), or supply it directly as the
api_key argument:
import biometaharmonizer as bmh
bmh.set_api_key("YOUR_API_KEY_HERE")
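To keep the key out of scripts and notebooks, it can be read from an environment variable and then passed to `set_api_key()`. This is a sketch under assumptions: the variable name `NCBI_API_KEY` and the helper `read_api_key` are illustrative choices, not something BioMetaHarmonizer is documented to read on its own.

```python
import os


def read_api_key(env=None, variable="NCBI_API_KEY"):
    """Return the key stored in `variable`, or None if it is unset or blank.

    The variable name is an assumption made for this example.
    """
    env = os.environ if env is None else env
    value = env.get(variable, "").strip()
    return value or None


# Usage (requires biometaharmonizer to be installed):
# import biometaharmonizer as bmh
# key = read_api_key()
# if key is not None:
#     bmh.set_api_key(key)
```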
Cache Directory
Assembly summary flat files (approximately 100 MB each; two files are downloaded — one for RefSeq, one for GenBank) are cached locally.
Default location: ~/.biometaharmonizer/cache/
This is the value of the module-level constant
biometaharmonizer.ingestion.CACHE_DIR.
Files stored in the cache:
assembly_summary_refseq.txt: NCBI RefSeq assembly summary
assembly_summary_genbank.txt: NCBI GenBank assembly summary
Time-to-live (TTL): Cache files older than 7 days (the value of
_CACHE_TTL_DAYS) are automatically re-downloaded on the next
ingest() call. You can also force
an immediate refresh with refresh_cache=True.
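The TTL rule can be illustrated with a small staleness check. This is a sketch of the general idea, not the library's implementation: it assumes that a file's age is measured from its modification time, and the function name `is_stale` is hypothetical.

```python
import os
import time

SECONDS_PER_DAY = 86_400


def is_stale(path, ttl_days=7, now=None):
    """Return True if `path` is missing or was last modified more than
    `ttl_days` days before `now` (defaults to the current time)."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / SECONDS_PER_DAY
    return age_days > ttl_days
```

A file for which `is_stale()` returns True corresponds to one that would be downloaded again on the next `ingest()` call under the rule described above.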
To override the default cache location, call
biometaharmonizer.ingestion.set_cache_dir() before ingestion:
import biometaharmonizer as bmh
# For Google Colab — use the Colab working directory
bmh.set_cache_dir("/content/bmh_cache")
Note
In Google Colab the home directory ~/ resolves to /root on the VM.
Setting the cache to /content or a subdirectory keeps the files in the
session's writable working directory, or in your Google Drive if the
path lies under a mounted Drive folder.
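A script that runs both locally and in Colab can choose the cache path at start-up. The sketch below relies on the common idiom of checking whether the `google.colab` module has been loaded; both that detection method and the helper name `choose_cache_dir` are assumptions made for this example.

```python
import sys
from pathlib import Path


def choose_cache_dir(in_colab=None):
    """Return a cache path: /content/bmh_cache in Colab, else the default.

    If `in_colab` is None, Colab is detected by looking for the
    google.colab module among the already-imported modules (an assumption).
    """
    if in_colab is None:
        in_colab = "google.colab" in sys.modules
    if in_colab:
        return Path("/content/bmh_cache")
    return Path.home() / ".biometaharmonizer" / "cache"


# Usage (requires biometaharmonizer to be installed):
# import biometaharmonizer as bmh
# bmh.set_cache_dir(str(choose_cache_dir()))
```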