# Output

Module: `biometaharmonizer.output`

The output module provides two public functions for persisting the harmonized DataFrame: `write()` for the main data file and `write_summary()` for a fill-rate report.
## Supported Formats

The following format identifiers are accepted by `write()` (case-insensitive). The full list is defined in the module-level constant `_VALID_FORMATS = ("csv", "tsv", "excel", "parquet")`.
| Format | Description | Engine used |
|---|---|---|
| `csv` | Comma-separated values, UTF-8 encoded. | pandas |
| `tsv` | Tab-separated values, UTF-8 encoded. Useful for downstream shell processing. | pandas |
| `excel` | Excel workbook. | openpyxl engine |
| `parquet` | Apache Parquet columnar format. | pyarrow engine |
The `write()` function creates any missing parent directories automatically via `Path.mkdir(parents=True, exist_ok=True)`. It logs the output path, record count, and column count at INFO level.
## `write()` Signature

```python
from biometaharmonizer.output import write

def write(df: pd.DataFrame, path, fmt: str = "csv") -> Path:
    ...
```
- `df` — the harmonized DataFrame returned by `ingest()`.
- `path` — destination file path as a `str` or `Path` object.
- `fmt` — output format; default `"csv"`. Case-insensitive.

Returns the resolved absolute `Path` of the written file. Raises `ValueError` if `fmt` is not one of the four valid formats.
## `write_summary()` Signature

```python
from biometaharmonizer.output import write_summary

def write_summary(df: pd.DataFrame, path) -> Path:
    ...
```
Writes a fill-rate summary CSV to `path`. The output has three columns, in order:

1. the name of the source DataFrame column (`str`),
2. the count of non-null values in that column (`int`),
3. the percentage of non-null rows, 0–100.0 (`float`).
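The fill-rate computation itself is straightforward; here is a sketch (the output column names are illustrative, since the actual names are defined by the library):

```python
import pandas as pd

def fill_rate_summary(df: pd.DataFrame) -> pd.DataFrame:
    # One row per source column: name, non-null count, and percent filled.
    return pd.DataFrame({
        "column": list(df.columns),
        "non_null_count": [int(df[c].notna().sum()) for c in df.columns],
        "fill_rate_pct": [100.0 * df[c].notna().mean() for c in df.columns],
    })
```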
## Expanding `_extra_attributes`

The `_extra_attributes` column contains a JSON string representing a dict of overflow attributes. To expand it into separate columns:
```python
import json

import pandas as pd

# Parse the JSON strings
ea = df["_extra_attributes"].dropna().apply(json.loads)

# Normalize to a wide DataFrame, keeping the original row index
ea_wide = pd.json_normalize(ea)
ea_wide.index = ea.index

# Join back to the main DataFrame (drop the original column)
df_expanded = df.drop(columns=["_extra_attributes"]).join(ea_wide)
```
> **Note:** Columns from `_extra_attributes` are not part of the fixed schema, and their names depend on the NCBI packages present in the input. Common keys include `panel_id`, `submission_contact`, `submission_owner`, and `antibiogram`.
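As a self-contained illustration, the expansion steps applied to a two-row toy DataFrame (the accessions and attribute values here are invented for demonstration):

```python
import json

import pandas as pd

df = pd.DataFrame({
    "biosample_accession": ["SAMN00000001", "SAMN00000002"],
    "_extra_attributes": ['{"panel_id": "P1"}', None],
})

ea = df["_extra_attributes"].dropna().apply(json.loads)
ea_wide = pd.json_normalize(ea)
ea_wide.index = ea.index
df_expanded = df.drop(columns=["_extra_attributes"]).join(ea_wide)
# df_expanded gains a panel_id column: "P1" for the first row, NaN for the second.
```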
## Unnesting the Antibiogram

To convert the nested `antibiogram` list into a long-format table with one row per antibiotic entry:

```python
import json

import pandas as pd

abg_rows = []
for _, row in df.iterrows():
    if pd.isna(row["_extra_attributes"]):
        continue
    ea = json.loads(row["_extra_attributes"])
    for entry in ea.get("antibiogram", []):
        entry["biosample_accession"] = row["biosample_accession"]
        abg_rows.append(entry)

abg_df = pd.DataFrame(abg_rows)
```
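An equivalent, vectorized alternative to `iterrows()` uses `Series.explode`. This is a sketch under the same `_extra_attributes` layout; the function name is illustrative:

```python
import json

import pandas as pd

def unnest_antibiogram(df: pd.DataFrame) -> pd.DataFrame:
    # Parse JSON, pull out each antibiogram list, and explode to one row per entry.
    ea = df["_extra_attributes"].dropna().apply(json.loads)
    entries = ea.apply(lambda d: d.get("antibiogram", [])).explode().dropna()
    abg_df = pd.json_normalize(entries)
    # Re-attach the parent accession; entries.index repeats once per exploded row.
    abg_df["biosample_accession"] = df.loc[entries.index, "biosample_accession"].to_numpy()
    return abg_df
```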
## Parquet Output and Downstream Pipelines

Parquet is the recommended format for downstream Snakemake or Nextflow pipelines because:

- Column types are preserved (no implicit string coercion).
- The `_extra_attributes` column stores the JSON string compactly.
- Files can be read by pandas, polars, dask, and Apache Spark.
To write Parquet:

```python
bmh.write(df, "harmonized.parquet", fmt="parquet")
```

To read in a downstream rule:

```python
import pandas as pd

df = pd.read_parquet("harmonized.parquet")
```