.. _output:

======
Output
======

Module: :mod:`biometaharmonizer.output`

The output module provides two public functions for persisting the
harmonized DataFrame: :func:`~biometaharmonizer.output.write` for the
main data file and :func:`~biometaharmonizer.output.write_summary` for
a fill-rate report.

Supported Formats
-----------------

The following format identifiers are accepted by
:func:`~biometaharmonizer.output.write` (case-insensitive). The full
list is defined in the module-level constant
``_VALID_FORMATS = ("csv", "tsv", "excel", "parquet")``.

.. list-table::
   :header-rows: 1

   * - Format
     - Description
     - Engine used
   * - ``csv``
     - Comma-separated values, UTF-8 encoded.
     - pandas ``to_csv``
   * - ``tsv``
     - Tab-separated values, UTF-8 encoded. Useful for downstream
       shell processing.
     - pandas ``to_csv``
   * - ``excel``
     - Excel ``.xlsx`` workbook.
     - openpyxl engine
   * - ``parquet``
     - Apache Parquet columnar format.
     - pyarrow engine

The ``write()`` function creates any missing parent directories
automatically via ``Path.mkdir(parents=True, exist_ok=True)``. It logs
the output path, record count, and column count at INFO level.

write() Signature
-----------------

.. code-block:: python

   from biometaharmonizer.output import write

   def write(df: pd.DataFrame, path, fmt: str = "csv") -> Path:
       ...

- **df** — the harmonized DataFrame returned by ``ingest()``.
- **path** — destination file path as a ``str`` or ``Path`` object.
- **fmt** — output format; default ``"csv"``. Case-insensitive.
- **Returns** the resolved absolute ``Path`` of the written file.
- **Raises** ``ValueError`` if ``fmt`` is not one of the four valid
  formats.

write_summary() Signature
-------------------------

.. code-block:: python

   from biometaharmonizer.output import write_summary

   def write_summary(df: pd.DataFrame, path) -> Path:
       ...

Writes a fill-rate summary CSV to ``path``. The output has three
columns:

.. list-table::
   :header-rows: 1

   * - Column
     - Type
     - Description
   * - ``column_name``
     - str
     - Name of the source DataFrame column.
   * - ``non_null_count``
     - int
     - Count of non-null values in the column.
   * - ``fill_pct``
     - float
     - Percentage of non-null rows (0–100.0).

Expanding ``_extra_attributes``
-------------------------------

The ``_extra_attributes`` column contains a JSON string representing a
dict of overflow attributes. To expand it into separate columns:

.. code-block:: python

   import json

   import pandas as pd

   # Parse the JSON strings
   ea = df["_extra_attributes"].dropna().apply(json.loads)

   # Normalize to a wide DataFrame
   ea_wide = pd.json_normalize(ea)
   ea_wide.index = df["_extra_attributes"].dropna().index

   # Join back to the main DataFrame (drop the original column)
   df_expanded = df.drop(columns=["_extra_attributes"]).join(ea_wide)

.. note::

   Columns from ``_extra_attributes`` are not part of the fixed schema,
   and their names depend on the NCBI packages present in the input.
   Common keys include ``panel_id``, ``submission_contact``,
   ``submission_owner``, and ``antibiogram``.

Unnesting the Antibiogram
-------------------------

To convert the nested ``antibiogram`` list into a long-format table
with one row per antibiotic entry:

.. code-block:: python

   import json

   import pandas as pd

   abg_rows = []
   for _, row in df.iterrows():
       if pd.isna(row["_extra_attributes"]):
           continue
       ea = json.loads(row["_extra_attributes"])
       for entry in ea.get("antibiogram", []):
           entry["biosample_accession"] = row["biosample_accession"]
           abg_rows.append(entry)

   abg_df = pd.DataFrame(abg_rows)

Parquet Output and Downstream Pipelines
---------------------------------------

Parquet is the recommended format for downstream Snakemake or Nextflow
pipelines because:

- Column types are preserved (no implicit string coercion).
- The ``_extra_attributes`` column stores the JSON string compactly.
- Files can be read by ``pandas``, ``polars``, ``dask``, and Apache
  Spark.

To write Parquet:

.. code-block:: python

   bmh.write(df, "harmonized.parquet", fmt="parquet")

To read in a downstream rule:

.. code-block:: python

   import pandas as pd

   df = pd.read_parquet("harmonized.parquet")
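Once the harmonized table is loaded in a downstream rule, the fill-rate
figures that :func:`~biometaharmonizer.output.write_summary` reports can
be recomputed from the DataFrame itself as a quick sanity check. The
sketch below mirrors the three-column layout described above
(``column_name``, ``non_null_count``, ``fill_pct``) using plain pandas;
the ``collection_date`` sample data is illustrative, and the exact
implementation inside ``write_summary()`` may differ.

```python
import pandas as pd

# Illustrative stand-in for a harmonized DataFrame; only
# "biosample_accession" is a documented schema column here.
df = pd.DataFrame({
    "biosample_accession": ["SAMN01", "SAMN02", "SAMN03"],
    "collection_date": ["2021-01-01", None, "2021-03-05"],
})

# One row per source column: name, non-null count, and fill percentage,
# matching the report layout described for write_summary().
summary = pd.DataFrame({
    "column_name": df.columns,
    "non_null_count": df.notna().sum().to_numpy(),
})
summary["fill_pct"] = 100.0 * summary["non_null_count"] / len(df)
```

Comparing this recomputed table against the CSV produced by
``write_summary()`` is a cheap way to confirm that a pipeline step has
not silently dropped or truncated rows.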