CLI Reference

BioMetaHarmonizer installs a biometaharmonizer entry point that is registered in pyproject.toml as:

biometaharmonizer = "biometaharmonizer.cli:main"

The CLI is built with argparse and exposes a single subcommand: run.

biometaharmonizer --version
biometaharmonizer run --help

run subcommand

Runs the full harmonization pipeline: ingest → key-map → date/geo/One Health → output.

Usage:

biometaharmonizer run \
    --input   <FILE_OR_ACCESSIONS> \
    --email   <EMAIL> \
    --output  <FILE> \
    [--api-key <KEY>] \
    [--cache-dir <DIR>] \
    [--format <FORMAT>] \
    [--summary <FILE>] \
    [--fetch-batch-size <N>] \
    [--esearch-batch-size <N>] \
    [--refresh-cache] \
    [--verbose]

Input flexibility:

The --input argument accepts:

  • A path to a plain-text file containing one accession per line.

  • A comma-separated list of accessions passed directly as a string.

Accepted accession prefixes: SAMN, SAME, SAMD (BioSample) or GCF_, GCA_ (assembly). Inputs that mix BioSample and assembly accessions are handled automatically.
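The input handling described above can be sketched as follows. This is an illustration only, not the package's actual implementation: the helper names parse_input and classify_accession are hypothetical, and both the rule that an existing file path wins over an inline list and the skipping of blank lines are assumptions.

```python
import os

# Prefixes listed in this section; the two-way grouping is illustrative.
BIOSAMPLE_PREFIXES = ("SAMN", "SAME", "SAMD")
ASSEMBLY_PREFIXES = ("GCF_", "GCA_")


def parse_input(value: str) -> list[str]:
    """Return accessions from a file path or a comma-separated string."""
    if os.path.isfile(value):
        # File input: one accession per line, blank lines skipped (assumption).
        with open(value, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    # Inline input: split on commas and drop surrounding whitespace.
    return [item.strip() for item in value.split(",") if item.strip()]


def classify_accession(accession: str) -> str:
    """Classify an accession as 'biosample', 'assembly', or 'unknown'."""
    if accession.startswith(BIOSAMPLE_PREFIXES):
        return "biosample"
    if accession.startswith(ASSEMBLY_PREFIXES):
        return "assembly"
    return "unknown"
```

With these helpers, a mixed inline string such as "SAMN02436525,GCF_000005845.2" would yield one BioSample and one assembly accession.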

Output format inference:

If --format is not specified, the output format is inferred from the file extension of --output:

Extension    Inferred format
.csv         csv
.tsv         tsv
.txt         tsv
.xlsx        excel
.xls         excel
.parquet     parquet
(other)      csv
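The inference rule can be expressed as a small lookup. The sketch below copies the extension mapping from the table above; the function name infer_format is hypothetical, and matching the extension case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extension-to-format mapping taken from the table in this section.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".txt": "tsv",
    ".xlsx": "excel",
    ".xls": "excel",
    ".parquet": "parquet",
}


def infer_format(output_path, explicit_format=None):
    """Return the explicit format if given, else infer it from the extension."""
    if explicit_format is not None:
        return explicit_format
    # Lower-casing the suffix is an assumption made for this sketch.
    suffix = Path(output_path).suffix.lower()
    # Unrecognized or missing extensions fall back to csv.
    return EXTENSION_TO_FORMAT.get(suffix, "csv")
```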

Flags

CLI flags

Long flag             Short  Type    Default  Description
--input               -i     str              Required. Input file or comma-separated accession list.
--email               -e     str              Required. NCBI contact email.
--output              -o     str              Required. Output file path.
--api-key                    str     None     NCBI API key.
--cache-dir                  str     None     Assembly summary cache directory.
--format              -f     choice  None     Output format: csv, tsv, excel, parquet.
--summary                    str     None     Write fill-rate summary CSV to this path.
--fetch-batch-size           int     200      Records per efetch request.
--esearch-batch-size         int     200      Accessions per esearch term.
--refresh-cache              flag    False    Force re-download of assembly index.
--verbose             -v     flag    False    Enable DEBUG-level logging.
--version                    flag             Print version string and exit.
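For orientation, an argparse parser with these flags, types, and defaults could be declared roughly as below. This is a reconstruction from the table, not the code in biometaharmonizer.cli: the function name build_parser, the help strings, and the "0.0.0" version placeholder are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flag table (illustrative reconstruction)."""
    parser = argparse.ArgumentParser(prog="biometaharmonizer")
    # The real version string is not stated here; "0.0.0" is a placeholder.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    subparsers = parser.add_subparsers(dest="command", required=True)

    run = subparsers.add_parser("run", help="Run the harmonization pipeline.")
    run.add_argument("-i", "--input", required=True)
    run.add_argument("-e", "--email", required=True)
    run.add_argument("-o", "--output", required=True)
    run.add_argument("--api-key", default=None)
    run.add_argument("--cache-dir", default=None)
    run.add_argument("-f", "--format", default=None,
                     choices=["csv", "tsv", "excel", "parquet"])
    run.add_argument("--summary", default=None)
    run.add_argument("--fetch-batch-size", type=int, default=200)
    run.add_argument("--esearch-batch-size", type=int, default=200)
    run.add_argument("--refresh-cache", action="store_true")
    run.add_argument("-v", "--verbose", action="store_true")
    return parser
```

Parsing ["run", "-i", "ids.txt", "-e", "your@email.com", "-o", "out.csv"] with this sketch leaves every optional flag at the default listed in the table.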

CLI Flag ↔ Python API Mapping

CLI flag              ingest() parameter   Default
--input               source
--email               email
--api-key             api_key              None
--cache-dir           cache_dir            None
--fetch-batch-size    fetch_batch_size     200
--esearch-batch-size  esearch_batch_size   100
--refresh-cache       refresh_cache        False

Note

The CLI default for --esearch-batch-size is 200 (as declared in add_argument), while the Python API module-level constant _ESEARCH_BATCH defaults to 100. The effective value is whichever is passed to ingest().
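The difference between the two defaults can be made concrete with a stand-in. In the sketch below, ingest is a dummy that only takes the parameter names from the mapping table (its parameter order and body are assumptions, not the real function), and ingest_kwargs_from_args is a hypothetical helper showing how parsed flags might be forwarded.

```python
from argparse import Namespace

_ESEARCH_BATCH = 100  # Python API default named in the note above


def ingest(source, email, api_key=None, cache_dir=None,
           fetch_batch_size=200, esearch_batch_size=_ESEARCH_BATCH,
           refresh_cache=False):
    """Stand-in for the real ingest(); returns the effective esearch batch size."""
    return esearch_batch_size


def ingest_kwargs_from_args(args: Namespace) -> dict:
    """Translate parsed CLI flags into ingest() keyword arguments."""
    return {
        "source": args.input,
        "email": args.email,
        "api_key": args.api_key,
        "cache_dir": args.cache_dir,
        "fetch_batch_size": args.fetch_batch_size,
        "esearch_batch_size": args.esearch_batch_size,
        "refresh_cache": args.refresh_cache,
    }


# A namespace as the CLI would produce it with no batch-size overrides.
cli_args = Namespace(input="ids.txt", email="your@email.com", api_key=None,
                     cache_dir=None, fetch_batch_size=200,
                     esearch_batch_size=200, refresh_cache=False)

via_cli = ingest(**ingest_kwargs_from_args(cli_args))         # CLI default applies
via_api = ingest(source="ids.txt", email="your@email.com")    # API default applies
```

Under these assumptions, the forwarded CLI value overrides the module-level constant, while a direct call that omits esearch_batch_size falls back to it.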

Complete Invocation Examples

Example 1 — BioSample file, CSV output with summary:

biometaharmonizer run \
    --input    biosample_ids.txt \
    --email    your@email.com \
    --api-key  abc123def456 \
    --output   harmonized.csv \
    --summary  fill_rates.csv \
    --verbose

Example 2 — Assembly accessions, Parquet output, custom cache:

biometaharmonizer run \
    --input           assemblies.txt \
    --email           your@email.com \
    --output          harmonized.parquet \
    --cache-dir       /data/bmh_cache \
    --fetch-batch-size 500 \
    --refresh-cache

Example 3 — Inline accessions (no file required):

biometaharmonizer run \
    -i "SAMN02436525,SAMN02434874,SAMN02429261" \
    -e your@email.com \
    -o out.csv

Log Output Format

Log messages are written to stderr using the format:

HH:MM:SS  LEVEL    logger_name: message

For example:

14:32:01  INFO     biometaharmonizer.ingestion: Fetching NCBI assembly index (refseq) ...
14:32:45  INFO     biometaharmonizer.ingestion: Fetching metadata for 1500 BioSample accessions...
14:35:12  INFO     biometaharmonizer.ingestion: ============================================================
14:35:12  INFO     biometaharmonizer.ingestion: INGEST SUMMARY
14:35:12  INFO     biometaharmonizer.ingestion:   Input IDs provided  : 1500
14:35:12  INFO     biometaharmonizer.ingestion:   fetch_batch_size    : 200
14:35:12  INFO     biometaharmonizer.ingestion:   esearch_batch_size  : 100
14:35:12  INFO     biometaharmonizer.ingestion:   Records in output   : 1498
14:35:12  INFO     biometaharmonizer.ingestion:   bioproject_accession filled : 1350 / 1498
14:35:12  INFO     biometaharmonizer.ingestion:   assembly_accession_refseq   filled : 1200 / 1498
14:35:12  INFO     biometaharmonizer.ingestion:   assembly_accession_genbank  filled : 1100 / 1498
14:35:12  INFO     biometaharmonizer.ingestion: ============================================================
14:35:14  INFO     biometaharmonizer.cli: Writing output to harmonized.csv (format=csv)
Done. 1498 records x 51 columns -> harmonized.csv

The final Done. line is printed to stdout.
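A logging setup that produces lines in this shape could look like the sketch below. It uses only the standard library; the function name configure_logging and the choice of INFO as the non-verbose level are assumptions, and the package's own configuration may differ.

```python
import logging
import sys

# Time, two spaces, level name padded to 8 characters, logger name, message.
LOG_FORMAT = "%(asctime)s  %(levelname)-8s %(name)s: %(message)s"
DATE_FORMAT = "%H:%M:%S"


def configure_logging(verbose: bool = False) -> None:
    """Send log records to stderr; DEBUG level when verbose is set."""
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,  # INFO is assumed
        format=LOG_FORMAT,
        datefmt=DATE_FORMAT,
        stream=sys.stderr,
        force=True,  # replace any handlers installed earlier
    )
```

Because the log goes to stderr and only the final Done. line goes to stdout, a shell redirect such as 2> run.log captures the log without the completion message.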