Output

Module: biometaharmonizer.output

The output module provides two public functions for persisting the harmonized DataFrame: write() for the main data file and write_summary() for a fill-rate report.

Supported Formats

The following format identifiers are accepted by write() (case-insensitive). The full list is defined in the module-level constant _VALID_FORMATS = ("csv", "tsv", "excel", "parquet").

| Format  | Description                                                                   | Engine used     |
| ------- | ----------------------------------------------------------------------------- | --------------- |
| csv     | Comma-separated values, UTF-8 encoded.                                        | pandas to_csv   |
| tsv     | Tab-separated values, UTF-8 encoded. Useful for downstream shell processing.  | pandas to_csv   |
| excel   | Excel .xlsx workbook.                                                         | openpyxl engine |
| parquet | Apache Parquet columnar format.                                               | pyarrow engine  |

The write() function creates any missing parent directories automatically via Path.mkdir(parents=True, exist_ok=True). It logs the output path, record count, and column count at INFO level.
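The directory-creation behavior can be illustrated with a stdlib-only sketch (this paraphrases what write() does internally; it is not the library's actual code):

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "nested" / "runs" / "harmonized.csv"
    # What write() does before writing: create any missing parents.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("biosample_accession\n")
    created = out.exists()  # True: the nested directories now exist
```

Because exist_ok=True is passed, calling write() repeatedly with the same destination does not raise even when the directories already exist.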

write() Signature

from biometaharmonizer.output import write

def write(df: pd.DataFrame, path, fmt: str = "csv") -> Path:
    ...
  • df — the harmonized DataFrame returned by ingest().

  • path — destination file path as a str or Path object.

  • fmt — output format; default "csv". Case-insensitive.

  • Returns the resolved absolute Path of the written file.

  • Raises ValueError if fmt is not one of the four valid formats.

write_summary() Signature

from biometaharmonizer.output import write_summary

def write_summary(df: pd.DataFrame, path) -> Path:
    ...

Writes a fill-rate summary CSV to path. The output has three columns:

| Column         | Type  | Description                             |
| -------------- | ----- | --------------------------------------- |
| column_name    | str   | Name of the source DataFrame column.    |
| non_null_count | int   | Count of non-null values in the column. |
| fill_pct       | float | Percentage of non-null rows (0–100.0).  |

Expanding _extra_attributes

The _extra_attributes column contains a JSON string representing a dict of overflow attributes. To expand it into separate columns:

import json
import pandas as pd

# Parse the JSON strings
ea = df["_extra_attributes"].dropna().apply(json.loads)

# Normalize to a wide DataFrame
ea_wide = pd.json_normalize(ea)
ea_wide.index = df["_extra_attributes"].dropna().index

# Join back to the main DataFrame (drop the original column)
df_expanded = df.drop(columns=["_extra_attributes"]).join(ea_wide)
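As a self-contained check of the pattern above, the same steps run on a toy frame (the payload keys here are hypothetical; real keys depend on the input packages):

```python
import json
import pandas as pd

df = pd.DataFrame({
    "biosample_accession": ["SAMN01", "SAMN02"],
    "_extra_attributes": [json.dumps({"panel_id": "P1"}), None],
})

# Same expansion as above: parse, normalize, realign, join.
ea = df["_extra_attributes"].dropna().apply(json.loads)
ea_wide = pd.json_normalize(ea)
ea_wide.index = df["_extra_attributes"].dropna().index

df_expanded = df.drop(columns=["_extra_attributes"]).join(ea_wide)
# Row SAMN01 gets panel_id "P1"; row SAMN02 (null payload) gets NaN.
```

Restoring ea_wide's index before the join is what keeps rows with a null _extra_attributes aligned: json_normalize returns a fresh RangeIndex, so without that step the join would mispair rows whenever any payload is missing.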

Note

Columns from _extra_attributes are not part of the fixed schema and their names depend on the NCBI packages present in the input. Common keys include panel_id, submission_contact, submission_owner, and antibiogram.

Unnesting the Antibiogram

To convert the nested antibiogram list into a long-format table with one row per antibiotic entry:

import json
import pandas as pd

abg_rows = []
for _, row in df.iterrows():
    if pd.isna(row["_extra_attributes"]):
        continue
    ea = json.loads(row["_extra_attributes"])
    for entry in ea.get("antibiogram", []):
        entry["biosample_accession"] = row["biosample_accession"]
        abg_rows.append(entry)

abg_df = pd.DataFrame(abg_rows)
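For larger tables, an equivalent vectorized form avoids iterrows(). A sketch using the same shape of payload (toy, hypothetical values):

```python
import json
import pandas as pd

df = pd.DataFrame({
    "biosample_accession": ["SAMN01", "SAMN02"],
    "_extra_attributes": [
        json.dumps({"antibiogram": [
            {"antibiotic": "ampicillin", "phenotype": "resistant"},
            {"antibiotic": "tetracycline", "phenotype": "susceptible"},
        ]}),
        None,
    ],
})

# Parse payloads, pull out the antibiogram lists, then explode to long format.
ea = df["_extra_attributes"].dropna().apply(json.loads)
abg_lists = ea.apply(lambda d: d.get("antibiogram", []))

long = (
    df.loc[abg_lists.index, ["biosample_accession"]]
    .assign(entry=abg_lists)
    .explode("entry")
    .dropna(subset=["entry"])   # rows whose list was empty explode to NaN
    .reset_index(drop=True)
)
abg_df = pd.concat(
    [long[["biosample_accession"]], pd.json_normalize(long["entry"].tolist())],
    axis=1,
)
```

The result matches the loop above: one row per antibiotic entry, with biosample_accession carried alongside the flattened entry fields.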

Parquet Output and Downstream Pipelines

Parquet is the recommended format for downstream Snakemake or Nextflow pipelines because:

  • Column types are preserved (no implicit string coercion).

  • The _extra_attributes column stores the JSON string compactly.

  • Files can be read by pandas, polars, dask, and Apache Spark.

To write Parquet (using the import shown earlier):

from biometaharmonizer.output import write

write(df, "harmonized.parquet", fmt="parquet")

To read in a downstream rule:

import pandas as pd
df = pd.read_parquet("harmonized.parquet")