Write a new handler

How to add a new data provider to the marisco pipeline.

A handler is a Jupyter notebook in nbs/handlers/ that encodes one data provider’s raw files into a MARIS standard NetCDF4 file. The notebook is the authoritative description of every curation decision made for that dataset. The Python module in marisco/handlers/ is auto-generated from it via nbdev.

This guide walks through the steps needed to create a handler, from prerequisites to verification.

For the nomenclature reconciliation process used in several steps, see the nomenclature reconciliation how-to.

Literate programming in practice

Each handler notebook follows a literate programming approach: code and prose live together in the same document, interspersed with explanations of why each transformation exists. This makes the pipeline self-documenting: anyone reading the notebook can follow the logic without jumping between source code and separate documentation.

The approach is inspired by Donald Knuth’s original vision of literate programming and resonates with the fastai/fastcore/solveit principle of making software more human-centred, a view captured by this quote:

“I think it would be better to humanise software than softwarise humans.” – Daniele Procida

From the MARIS application perspective, literate programming brings four concrete benefits:

Full traceability: every ingestion, reconciliation, and normalisation step is documented alongside the code that implements it, making it possible to trace exactly how any value in the output was derived.
Data-provider feedback: errors, inconsistencies, and opportunities for improvement spotted in the source data are captured in the notebook as callout boxes that can be shared directly with data providers for future releases.
Easier maintenance: the MARIS team can revisit a handler years later and understand not just what the code does, but why it does it.
Reusability: handlers developed and documented in this style compound over time. Patterns from existing handlers can be leveraged for new datasets, where the narrative structure helps both human developers and LLMs work with the codebase.

nbdev directives

The #| exports nbdev directive marks functions, classes, and symbols for export to the corresponding module under marisco/handlers/, while also rendering their implementation body in the generated documentation. Use this for smaller constructs where the reader benefits from seeing the code inline: configuration dicts, lookup tables, and short callbacks whose logic tells part of the story.

The #| export nbdev directive (without the s) also exports to the module but only shows the signature in the rendered doc, not the body. This is appropriate for larger functions where readers primarily need to know the interface (the implementation details are a distraction at the documentation layer).

Never edit the .py files under marisco/handlers/ directly: they are regenerated by nbdev_export after every notebook change. Other useful directives include #| eval: false (the cell stays in the notebook for manual runs, is skipped by automated test and documentation runs, and is not exported to the module) and #| hide (hide from both module and rendered output). See the nbdev directives documentation for the full list.
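As a rough illustration of the difference between the two export directives, the two cells below show how each might be used in a handler notebook. The names unit_factors and to_bq_per_m3 are hypothetical and only serve the example; the directive lines are ordinary comments when the code runs as plain Python.

```python
# --- notebook cell 1 ---
#| exports
# Short lookup table: the body is worth showing in the rendered docs.
unit_factors = {'Bq/l': 1000.0, 'Bq/m3': 1.0}  # hypothetical provider unit labels

# --- notebook cell 2 ---
#| export
def to_bq_per_m3(value: float, unit: str) -> float:
    "Convert an activity concentration to Bq per cubic metre."
    # Longer logic would live here; doc readers mainly need the signature.
    return value * unit_factors[unit]
```

After nbdev_export, both symbols would be available in the generated module; only the first cell's source would appear in the rendered page.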

Before you start

You will need:

The raw source data from the provider and its format documentation.
A Zotero record key for the dataset (8-character alphanumeric string). Create a record in the MARIS Zotero library if one does not already exist.
The ZOTERO_API_KEY environment variable set.

Start with the HELCOM handler as your reference. It covers all four sample type groups and every curation rule. GEOTRACES, OSPAR and TEPCO are additional examples that show different structural challenges.

Getting oriented with a new dataset

Before writing any code, spend time understanding the source data and planning the column mapping. Work through these three steps.

1. Identify sample types

Determine which of the four MARIS sample type groups the data covers: SEAWATER, BIOTA, SEDIMENT, SUSPENDED_MATTER. A single provider dataset may span several; each becomes a separate DataFrame key in the dict returned by load_data.

2. Map provider columns to MARIS columns

Open the field definitions reference and compare the provider’s column names against the MARIS standard columns defined as uppercase keys in NC_CSV in marisco/configs.py (e.g. LAT, LON, VALUE, NUCLIDE, UNIT, TIME). For each provider column:

Direct match or trivial rename: note the MARIS uppercase key and add it to the renaming rules.
Needs normalisation: flag it; a callback will handle it (nuclide name reconciliation, unit conversion, detection limit encoding, coordinate sanitisation, time encoding, etc.).
No MARIS equivalent: leave it as-is; columns not in NC_CSV are silently ignored by the encoder.
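One way to record the outcome of this comparison is a small mapping plan. The sketch below is only an illustration: the provider column names, the maris_keys subset, and the helper plan_columns are all hypothetical, and the real key list lives in NC_CSV.

```python
# Hypothetical subset of the MARIS uppercase keys (the real list is NC_CSV).
maris_keys = {'LAT', 'LON', 'VALUE', 'NUCLIDE', 'UNIT', 'TIME'}

# Hypothetical provider columns, each tagged with the decision taken for it.
renaming_rules = {'latitude': 'LAT', 'longitude': 'LON', 'activity': 'VALUE'}
needs_callback = {'nuclide_name': 'NUCLIDE', 'sampling_date': 'TIME'}

def plan_columns(provider_cols):
    "Split provider columns into renamed, callback-handled and ignored."
    plan = {'rename': {}, 'callback': {}, 'ignored': []}
    for col in provider_cols:
        if col in renaming_rules:
            plan['rename'][col] = renaming_rules[col]
        elif col in needs_callback:
            plan['callback'][col] = needs_callback[col]
        else:
            plan['ignored'].append(col)  # no MARIS equivalent: left as-is
    # Every target must be a known MARIS key, otherwise the plan has a typo.
    targets = set(plan['rename'].values()) | set(plan['callback'].values())
    assert targets <= maris_keys
    return plan

plan = plan_columns(['latitude', 'longitude', 'activity',
                     'nuclide_name', 'sampling_date', 'cruise_id'])
print(plan['ignored'])  # ['cruise_id']
```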

3. Identify the callbacks you need

Based on the mapping gaps, decide which transformations are required. Check nbs/api/callbacks.ipynb for generic callbacks (SanitizeLonLatCB, EncodeTimeCB, RemapCB, LowerStripNameCB, RenameColumnsCB, etc.) and scan existing handlers for ad-hoc callbacks you can reuse or adapt. If you are working with an LLM assistant, share the callbacks notebook and the existing handler notebooks and ask it to suggest which callbacks apply to each unmapped column.

Step 1: Write the module header and constants

Every handler notebook starts with the #| default_exp directive that tells nbdev which Python module to generate. Import the standard marisco classes you will need.

#| default_exp handlers.your_dataset
from fastcore.all import *
import pandas as pd
import numpy as np
from marisco.callbacks import (
    Callback, PerGroupCB, Transformer,
    EncodeTimeCB, LowerStripNameCB, SanitizeLonLatCB,
    CompareDfsAndTfmCB, RemapCB)
from marisco.metadata import GlobAttrsFeeder, BboxCB, DepthRangeCB, TimeRangeCB, ZoteroCB
from marisco.encoders import NetCDFEncoder
from marisco.match import make_lut, make_lut_from, fuzzy_merge, fix_lut, lut_from
from marisco.configs import NC_DTYPES, get_lut, lut_path, cache_path

Then define the three constants that locate the data, define the output file, and identify the dataset in Zotero.

#| exports
src_dir = '...'        # path or URL to the raw data
fname_out = '...'      # default output filename
zotero_key = 'XXXXXXXX'  # 8-character Zotero record key

Step 2: Write the data loader

The loader function reads raw provider files and returns a dictionary of DataFrames keyed by sample type group. A handler supports any subset of the four groups: SEAWATER, BIOTA, SEDIMENT, SUSPENDED_MATTER.

#| exports
def load_data(
        fname_in # Path to raw data provider's data
        ):
    "Load provider data; returns dict of DataFrames keyed by sample type."
    res = {}
    # Read each sample type group from the provider's files and store it
    # under the correct MARIS group key
    return res
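A minimal sketch of what a filled-in loader could look like, assuming the provider ships one CSV file per sample type. The file names in fnames and the function name load_data_sketch are hypothetical, and the demo writes a tiny CSV file to a temporary directory so the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from MARIS group key to the provider's file name.
fnames = {'SEAWATER': 'seawater.csv', 'BIOTA': 'biota.csv'}

def load_data_sketch(src_dir):
    "Read one CSV per sample type; skip groups the provider does not supply."
    res = {}
    for grp, fname in fnames.items():
        path = Path(src_dir) / fname
        if path.exists():
            res[grp] = pd.read_csv(path)
    return res

# Demo with a throwaway file standing in for the raw provider data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'seawater.csv').write_text('nuclide,value\ncs137,1.5\n')
    dfs_demo = load_data_sketch(tmp)

print(list(dfs_demo))  # ['SEAWATER']
```

Only the groups that were actually found appear as keys, which matches the idea that a handler supports any subset of the four groups.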

Column naming conventions

Column names in the pipeline are case-significant, and the case decides whether a column reaches the output file:

Uppercase columns (e.g. NUCLIDE, VALUE, LAT, LON, TIME, UNIT) are defined as keys in the NC_CSV dict in marisco/configs.py. The NetCDFEncoder writes only these columns to the output file. Each key maps to a (netcdf_variable_name, csv_column_name) tuple: for example 'VALUE': ('value', 'activity') means the VALUE column becomes the value variable in the NetCDF file and the activity column in CSV export.
Lowercase or mixed-case columns carry provider-specific data through the pipeline. They are used during transformation but are silently ignored by the encoder.

The consequence: every provider column that should appear in the output must eventually be renamed or remapped to its MARIS uppercase key. Columns that never reach an uppercase key are dropped at encoding time. This is intentional, not an error.

For the full list of MARIS columns, their NetCDF variable names, and their CSV column names, see the field definitions reference.
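The effect of this convention can be sketched in a few lines. Only the 'VALUE' entry below comes from the description above; the other entries of nc_csv_excerpt and the helper columns_kept are assumptions for illustration, not the real NC_CSV content or the encoder's logic.

```python
# Excerpt in the shape described above: key -> (netcdf_variable, csv_column).
nc_csv_excerpt = {
    'VALUE': ('value', 'activity'),   # from the description above
    'LAT': ('lat', 'latitude'),       # assumed for illustration
    'LON': ('lon', 'longitude'),      # assumed for illustration
}

def columns_kept(df_columns):
    "Return the NetCDF variable name for each column an encoder would write."
    return {c: nc_csv_excerpt[c][0] for c in df_columns if c in nc_csv_excerpt}

# 'station' and 'Value_raw' are working columns and never reach the output.
kept = columns_kept(['LAT', 'LON', 'VALUE', 'station', 'Value_raw'])
print(kept)  # {'LAT': 'lat', 'LON': 'lon', 'VALUE': 'value'}
```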

Step 3: Reconcile nomenclatures

Every enumerated field (nuclide name, species, body part, sediment type, unit, etc.) needs to be mapped from the provider’s names to MARIS standard identifiers. The nomenclature reconciliation how-to explains the full workflow: derive unique values from the data (or load a provider-supplied lookup table), let fuzzy matching handle the bulk, inspect borderline scores, apply expert overrides, and verify the result.

The how-to covers three lookup table (LUT) formats:

make_lut: derive from measurement data (most common)
make_lut_from: build from a provider-supplied lookup table
Plain dict: for trivial mappings (e.g. filtered/unfiltered flags)

Then package the result into a RemapCB callback for the Transformer pipeline. See the HELCOM handler for a complete example.

#| exports
# Provider nuclide names that need manual correction after fuzzy matching
fixes_nuclide_names = {
    # 'provider_name': 'correct_maris_name',
}

# Lookup table for nuclide names — derived from the data, fuzzy-matched against MARIS
nuclide_lut = make_lut('NUCLIDE', fixes=fixes_nuclide_names)
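To make the reconciliation workflow concrete, the sketch below reproduces its stages with the standard library's difflib rather than marisco's own matching helpers, whose exact behaviour is not shown here. The standard names, provider values, the 0.8 threshold, and the function build_lut are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

maris_names = ['cs137', 'sr90', 'pu239_240_tot', 'h3']    # assumed standard names
provider_names = ['cs137', 'sr 90', 'pu239240', 'tritium']  # unique values from the data
fixes = {'pu239240': 'pu239_240_tot'}                     # expert overrides

def best_match(name):
    "Return the closest standard name and its similarity score in [0, 1]."
    scored = [(SequenceMatcher(None, name, m).ratio(), m) for m in maris_names]
    score, match = max(scored)
    return match, score

def build_lut(names, fixes, threshold=0.8):
    "Fuzzy-match each name, apply overrides, and flag borderline scores."
    lut, to_review = {}, []
    for name in names:
        match, score = best_match(name)
        if name in fixes:
            match = fixes[name]        # an override always wins
        elif score < threshold:
            to_review.append(name)     # inspect these by hand
        lut[name] = match
    return lut, to_review

lut, to_review = build_lut(provider_names, fixes)
print(to_review)  # ['tritium']
```

Here 'tritium' has no close textual match, so it is flagged for review; the next iteration would add it to fixes, which is the role fixes_nuclide_names plays above.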

For small, stable enumerations with only two or three entries (for example a filtered or unfiltered flag), skip the fuzzy matching and write a plain dict directly.
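For instance, a filtered flag could be handled with a dict like the one below. The provider codes and target IDs are hypothetical; take the real ones from the provider documentation and the MARIS nomenclature.

```python
# Hypothetical provider codes mapped to hypothetical target IDs.
lut_filtered = {'N': 2, 'n': 2, 'F': 1}

def remap_filtered(code, default=0):
    "Look up a provider flag; unresolved codes fall back to 0 rather than NaN."
    return lut_filtered.get(code, default)

print([remap_filtered(c) for c in ['F', 'n', '?']])  # [1, 2, 0]
```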

Step 4: Write callbacks

Each data quality issue in the raw input becomes one callback class. The callback docstring serves double duty: it appears as the description in the rendered documentation and is appended to the publisher_postprocess_logs NetCDF4 global attribute at runtime.

Write the docstring as a single sentence that describes what was done and, if not obvious, why. It must stand alone as an audit trail entry.

# Good: informative in both contexts
"Convert longitudes from the [0, 360] convention to MARIS [-180, 180] by subtracting 360 from values above 180."

# Bad: too vague to serve as an audit trail entry
"Unshift longitudes."

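Use #| export for larger callback classes, where the signature and docstring are enough for the reader, and #| exports for short callbacks such as the one below, lookup tables, and constants passed to callbacks.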

#| exports
class SanitizeNuclideNamesCB(PerGroupCB):
    "Lowercase and strip whitespace from provider nuclide names."
    def __init__(self, 
                 col_src, # Source column name e.g. 'Nuclide'
                 col_dst # Destination column name
                 ):
        store_attr()
    def each_grp(self, grp, df, tfm):
        df[self.col_dst] = df[self.col_src].str.lower().str.strip()
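The double duty of the docstring can be sketched without marisco. The block below is an assumption about how such logging could work, not the library's actual mechanism: NormalizeLongitudeCB and run_and_log are hypothetical, and plain dicts stand in for DataFrame rows.

```python
class NormalizeLongitudeCB:
    "Convert longitudes from [0, 360] to [-180, 180] by subtracting 360 from values above 180."
    def __call__(self, rows):
        for row in rows:
            if row['LON'] > 180:
                row['LON'] -= 360

def run_and_log(rows, cbs):
    "Apply each callback in order and collect its docstring as a log entry."
    logs = []
    for cb in cbs:
        cb(rows)
        logs.append(type(cb).__doc__)
    return logs

rows = [{'LON': 350.0}, {'LON': 10.0}]
logs = run_and_log(rows, [NormalizeLongitudeCB()])
print(rows)  # [{'LON': -10.0}, {'LON': 10.0}]
print(logs)
```

Because the log entry is the docstring verbatim, a vague docstring produces a vague audit trail.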

Step 5: Build the Transformer pipeline

Instantiate a Transformer with the ordered list of callbacks. The order matters: sanitisation callbacks run first, then nomenclature remapping, then unit conversion and detection limit encoding, then coordinate and time validation. Each callback sees the output of the previous one.

dfs = load_data(src_dir)
tfm = Transformer(dfs, cbs=[
    # Sanitisation
    SanitizeNuclideNamesCB(col_src='nuclide', col_dst='NUCLIDE'),
    # Nomenclature remapping
    RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE'),
    # Validation
    SanitizeLonLatCB(),
    EncodeTimeCB(col_src='date')
])
tfm()
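Why the order matters can be shown with two plain functions standing in for callbacks. This is a simplified sketch, not the Transformer implementation; the lookup table lut_demo and its IDs are hypothetical.

```python
lut_demo = {'cs137': 33, 'sr90': 12}  # hypothetical name -> ID table

def sanitize(names):
    "Lowercase and strip whitespace."
    return [n.lower().strip() for n in names]

def remap(names):
    "Replace each name by its ID, or 0 when the name is unknown."
    return [lut_demo.get(n, 0) for n in names]

def run(data, steps):
    "Feed the output of each step into the next one."
    for step in steps:
        data = step(data)
    return data

raw = [' Cs137', 'SR90 ']
print(run(raw, [sanitize, remap]))  # [33, 12]
print(run(raw, [remap]))            # [0, 0]
```

Without the sanitisation step in front, the remapping sees the raw provider spelling and resolves nothing.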

Step 6: Build global attributes

The NetCDF4 file needs global attributes that describe the dataset: geographic bounding box, time range, depth range, and bibliographic metadata from Zotero.

#| exports
def get_attrs(
        tfm, # Transformer object
        zotero_key, # Zotero dataset record key
        kw:list=['oceanography', 'Earth Science > Oceans > ...'] # Dataset keywords (not used in this minimal example)
        ):
    "Build NetCDF global attributes for the dataset."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key),
    ])()
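To give a feel for what the range attributes contain, the sketch below derives them from a couple of hypothetical records. It does not reproduce BboxCB, DepthRangeCB or TimeRangeCB; the attribute names follow common NetCDF metadata conventions and are assumptions, and the Zotero lookup is left out because it needs network access.

```python
from datetime import datetime, timezone

# Hypothetical curated records; TIME is integer seconds since 1970-01-01 UTC.
records = [
    {'LAT': 54.5, 'LON': 12.0, 'TIME': 1577836800, 'SMP_DEPTH': 5.0},
    {'LAT': 60.1, 'LON': 20.3, 'TIME': 1580515200, 'SMP_DEPTH': 50.0},
]

def iso(t):
    "Format epoch seconds as an ISO 8601 UTC timestamp."
    return datetime.fromtimestamp(t, tz=timezone.utc).isoformat()

def build_attrs(records):
    "Summarise spatial, temporal and depth extent as a flat dict."
    lats = [r['LAT'] for r in records]
    lons = [r['LON'] for r in records]
    times = [r['TIME'] for r in records]
    depths = [r['SMP_DEPTH'] for r in records]
    return {
        'geospatial_lat_min': min(lats), 'geospatial_lat_max': max(lats),
        'geospatial_lon_min': min(lons), 'geospatial_lon_max': max(lons),
        'time_coverage_start': iso(min(times)),
        'time_coverage_end': iso(max(times)),
        'geospatial_vertical_min': min(depths),
        'geospatial_vertical_max': max(depths),
    }

attrs = build_attrs(records)
print(attrs['time_coverage_start'])  # 2020-01-01T00:00:00+00:00
```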

Step 7: Write the entry point

encode is the main public interface of every handler. The maris_to_nc CLI command (defined in marisco/cli/to_nc.py) dynamically imports and calls this function by name, so the signature must be kept stable: fname_out as the first positional argument, **kwargs for optional flags such as verbose.

Once installed, the handler is invoked from the command line as:

maris_to_nc your_dataset /path/to/output.nc
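The dynamic dispatch described above can be sketched with importlib. This is not the code in marisco/cli/to_nc.py: run_handler is a hypothetical helper, and the demo registers a stand-in package in sys.modules so the block runs without marisco installed.

```python
import importlib
import sys
import types

def run_handler(handler_name, fname_out, package='marisco.handlers', **kwargs):
    "Import `<package>.<handler_name>` and call its `encode` function."
    module = importlib.import_module(f'{package}.{handler_name}')
    return module.encode(fname_out, **kwargs)

# Stand-in package and handler module, used only for this demonstration.
calls = []
fake_pkg = types.ModuleType('fake_handlers')
fake_mod = types.ModuleType('fake_handlers.your_dataset')
fake_mod.encode = lambda fname_out, **kw: calls.append((fname_out, kw))
sys.modules['fake_handlers'] = fake_pkg
sys.modules['fake_handlers.your_dataset'] = fake_mod

run_handler('your_dataset', '/tmp/output.nc', package='fake_handlers', verbose=True)
print(calls)  # [('/tmp/output.nc', {'verbose': True})]
```

Because the function is looked up by name and called with a fixed argument pattern, renaming encode or reordering its parameters would break the command for that handler.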

The body mirrors the development pipeline built in the preceding steps: load, transform, build global attributes, encode.

#| exports
def encode(
        fname_out: str, # Output file name
        **kwargs
        ) -> None:
    "Encode provider data into MARIS standard NetCDF4 file."
    dfs = load_data(src_dir)
    tfm = Transformer(dfs, cbs=[
        SanitizeNuclideNamesCB(col_src='nuclide', col_dst='NUCLIDE'),
        RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE'),
        SanitizeLonLatCB(),
        EncodeTimeCB(col_src='date'),
        # ... add remaining callbacks in pipeline order
    ])
    tfm()
    encoder = NetCDFEncoder(
        tfm.dfs,
        dest_fname=fname_out,
        global_attrs=get_attrs(tfm, zotero_key),
        verbose=kwargs.get('verbose', False),
    )
    encoder.encode()

#| eval: false
encode(fname_out, verbose=False)

Step 8: Write tests

Each handler includes two levels of testing:

Mock-data unit tests: For every callback, define a small dfs_mock dictionary with representative inputs, apply the callback via Transformer, and assert the expected outputs. This verifies the transformation logic in isolation, independent of the full dataset.
End-to-end smoke test: After assembling the full pipeline, include a code cell (marked #| eval: false) that calls encode() on the real data. Run this during development to confirm every step works together. The output is a NetCDF file that can be inspected or decoded back to CSV for manual review.
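A mock-data unit test might look like the sketch below. To keep the block runnable without marisco, a plain function stands in for the callback and the Transformer; in a handler notebook the same assertions would follow a Transformer(dfs_mock, cbs=[...]) call. The function name and mock values are hypothetical.

```python
import pandas as pd

def sanitize_nuclide_names(dfs, col_src, col_dst):
    "Lowercase and strip whitespace from nuclide names in every group."
    for df in dfs.values():
        df[col_dst] = df[col_src].str.lower().str.strip()
    return dfs

# Small, representative inputs: stray whitespace and mixed case.
dfs_mock = {
    'SEAWATER': pd.DataFrame({'nuclide': [' Cs137', 'SR90 ']}),
    'BIOTA': pd.DataFrame({'nuclide': ['K40']}),
}

out = sanitize_nuclide_names(dfs_mock, col_src='nuclide', col_dst='NUCLIDE')

# The new uppercase column holds the cleaned names...
assert out['SEAWATER']['NUCLIDE'].tolist() == ['cs137', 'sr90']
assert out['BIOTA']['NUCLIDE'].tolist() == ['k40']
# ...and the provider column is left untouched.
assert out['SEAWATER']['nuclide'].tolist() == [' Cs137', 'SR90 ']
```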

Curation rules summary

Rule	Detail
Units — seawater	Must be Bq m⁻³. Multiply VALUE, UNC, and any detection limit value by 1000 if the provider reports Bq L⁻¹
Units — biota	Prefer wet weight (unit ID 5). Also compute PERCENTWT when both wet and dry weights are available
Uncertainty	1σ absolute, same unit as VALUE. Convert from relative: UNC = VALUE × UNC_pct / 100
Detection limit	Map to MARIS IDs: 1 for measured, 2 for below MDA, 3 for ND no limit, 4 for derived or aggregated
Coordinates	Decimal degrees. Longitude in the range -180 to 180. Drop rows where latitude and longitude are both 0 (unknown-position sentinel)
Time	Parse to pd.Timestamp first. Encode to integer seconds since 1970-01-01 UTC via EncodeTimeCB. Drop rows with unparseable dates
Depth	SMP_DEPTH in metres. Use -1 as sentinel for not available
Unknown IDs	Set unresolved enumerated fields to 0, not NaN. Never silently drop a row because a name could not be remapped
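Several of these rules are simple arithmetic, sketched below with the standard library only. The helper names are hypothetical, and datetime stands in for the pd.Timestamp parsing and EncodeTimeCB encoding that a real handler would use.

```python
from datetime import datetime, timezone

def bq_per_l_to_bq_per_m3(x):
    "1 m3 = 1000 L, so a per-litre activity is multiplied by 1000."
    return x * 1000

def relative_to_absolute_unc(value, unc_pct):
    "UNC = VALUE x UNC_pct / 100, in the same unit as VALUE."
    return value * unc_pct / 100

def to_epoch_seconds(date_str, fmt='%Y-%m-%d'):
    "Parse a date as UTC; return None when it cannot be parsed."
    try:
        dt = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None
    return int(dt.timestamp())

def has_known_position(lat, lon):
    "False for the (0, 0) sentinel that marks an unknown position."
    return not (lat == 0 and lon == 0)

print(bq_per_l_to_bq_per_m3(0.5))          # 500.0
print(relative_to_absolute_unc(200.0, 5))  # 10.0
print(to_epoch_seconds('2020-01-01'))      # 1577836800
print(to_epoch_seconds('not a date'))      # None
print(has_known_position(0.0, 0.0))        # False
```

Rows for which to_epoch_seconds returns None or has_known_position returns False are the ones the table says to drop.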