.. _installation:

============
Installation
============

Requirements
------------

BioMetaHarmonizer requires **Python 3.9 or later**.

PyPI Installation
-----------------

Install the latest stable release with pip:

.. code-block:: bash

    pip install biometaharmonizer

Development Install from Source
--------------------------------

Clone the repository and install in editable mode:

.. code-block:: bash

    git clone https://github.com/rustam-bioinfo/BioMetaHarmonizer.git
    cd BioMetaHarmonizer
    pip install -e .

The following dependencies are declared in ``pyproject.toml`` and are
installed automatically:

+----------------------+-----------------+
| Package              | Minimum version |
+======================+=================+
| pandas               | >=1.5           |
+----------------------+-----------------+
| numpy                | >=1.24          |
+----------------------+-----------------+
| biopython            | >=1.80          |
+----------------------+-----------------+
| requests             | >=2.28          |
+----------------------+-----------------+
| pycountry            | >=22.3          |
+----------------------+-----------------+
| python-dateutil      | >=2.8           |
+----------------------+-----------------+
| openpyxl             | >=3.0           |
+----------------------+-----------------+
| pyarrow              | >=12.0          |
+----------------------+-----------------+
| rapidfuzz            | >=3.0.0         |
+----------------------+-----------------+

``rapidfuzz`` is a required runtime dependency that enables the
fuzzy-matching fallback layer in
:class:`biometaharmonizer.one_health.OneHealthClassifier`. If it is absent at
import time the classifier logs a warning and disables the fuzzy layer; all
other functionality remains available.

``openpyxl`` is required only when writing Excel output (``fmt="excel"``).
``pyarrow`` is required only when writing Parquet output (``fmt="parquet"``).

``kaleido`` is an optional dependency of
``scripts/generate_summary_report.py`` for PDF export; install it separately
with ``pip install kaleido`` if needed.

NCBI API Key
------------

All Entrez requests require a contact e-mail address.
Without an API key, NCBI enforces a rate limit of **3 requests per second**.
With a free API key the limit rises to **10 requests per second**, which
roughly triples throughput on large jobs.

To register a free API key:

1. Create or log in to your NCBI account at
   https://www.ncbi.nlm.nih.gov/account/
2. Navigate to **Settings → API Key Management**.
3. Click **Create an API Key** and copy the generated string.

Pass the key to :func:`biometaharmonizer.ingestion.set_api_key` before calling
:func:`biometaharmonizer.ingestion.ingest`, or supply it directly as the
``api_key`` argument:

.. code-block:: python

    import biometaharmonizer as bmh

    bmh.set_api_key("YOUR_API_KEY_HERE")

Cache Directory
---------------

Assembly summary flat files are cached locally. Two files are downloaded, one
for RefSeq and one for GenBank, and each is approximately 100 MB.

**Default location:** ``~/.biometaharmonizer/cache/``

This is the value of the module-level constant
:attr:`biometaharmonizer.ingestion.CACHE_DIR`.

**Files stored in the cache:**

- ``assembly_summary_refseq.txt``: NCBI RefSeq assembly summary
- ``assembly_summary_genbank.txt``: NCBI GenBank assembly summary

**Time-to-live (TTL):** Cache files older than **7 days** (the value of
``_CACHE_TTL_DAYS``) are automatically re-downloaded on the next
:func:`~biometaharmonizer.ingestion.ingest` call. You can also force an
immediate refresh with ``refresh_cache=True``.

To override the default cache location, call
:func:`biometaharmonizer.ingestion.set_cache_dir` before ingestion:

.. code-block:: python

    import biometaharmonizer as bmh

    # For Google Colab: use the Colab working directory
    bmh.set_cache_dir("/content/bmh_cache")

.. note::

    In Google Colab the home directory ``~/`` resolves to ``/root``, which is
    outside the ``/content`` working directory. Setting the cache to
    ``/content`` or a subdirectory keeps the files in the session's writable
    working directory, or in your mounted Google Drive if the path points
    there.
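
The 7-day TTL rule described in this section can be illustrated with a short
sketch. It is not BioMetaHarmonizer's actual implementation: the helper name
``is_stale`` and the constant ``CACHE_TTL_DAYS`` are hypothetical, and the
sketch assumes freshness is judged by the file's modification time, which the
library may or may not do.

.. code-block:: python

    import time
    from pathlib import Path

    # Assumption: mirrors the documented 7-day TTL; not the library's constant.
    CACHE_TTL_DAYS = 7


    def is_stale(path, ttl_days=CACHE_TTL_DAYS, now=None):
        """Return True if ``path`` is missing or older than ``ttl_days``.

        Hypothetical helper for illustration only. Age is measured from the
        file's modification time.
        """
        path = Path(path)
        if not path.exists():
            # A file that was never downloaded always needs fetching.
            return True
        now = time.time() if now is None else now
        age_seconds = now - path.stat().st_mtime
        return age_seconds > ttl_days * 86400

Under these assumptions, calling ``is_stale`` on
``~/.biometaharmonizer/cache/assembly_summary_refseq.txt`` (after expanding
``~`` with ``Path.expanduser``) would indicate whether the next ingestion is
likely to trigger a fresh download.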