Write a new handler

How to add a new data provider to the marisco pipeline.

A handler is a Jupyter notebook in nbs/handlers/ that encodes one data provider’s raw files into a MARIS standard NetCDF4 file. The notebook is the authoritative description of every curation decision made for that dataset. The Python module in marisco/handlers/ is auto-generated from it via nbdev.

This guide walks through the steps needed to create a handler, from prerequisites to verification.

For the nomenclature reconciliation process used in several steps, see the nomenclature reconciliation how-to.

Literate programming in practice

Each handler notebook follows a literate programming approach: code and prose live together in the same document, and each transformation is accompanied by an explanation of why it exists. This makes the pipeline self-documenting, so anyone reading the notebook can follow the logic without jumping between source code and separate documentation.

The approach is inspired by Donald Knuth’s original vision of literate programming and resonates with the fastai/fastcore/solveit principle of making software more human-centred, which this quote captures nicely:

“I think it would be better to humanise software than softwarise humans.” – Daniele Procida

From the MARIS application perspective, literate programming brings four concrete benefits:

  1. Full traceability: every ingestion, reconciliation, and normalisation step is documented alongside the code that implements it, making it possible to trace exactly how any value in the output was derived.
  2. Data-provider feedback: when errors, inconsistencies, or opportunities for improvement are spotted in the source data, the notebook records them as callout boxes that can be shared directly with data providers for future releases.
  3. Easier maintenance: the MARIS team can revisit a handler years later and understand not just what the code does, but why it does it.
  4. Reusability: handlers developed and documented in this style compound over time. Patterns from existing handlers can be leveraged for new datasets, where the narrative structure helps both human developers and LLMs work with the codebase.

nbdev directives

The #| exports nbdev directive marks functions, classes, and symbols for export to the corresponding module under marisco/handlers/, while also rendering their implementation body in the generated documentation. Use this for smaller constructs where the reader benefits from seeing the code inline: configuration dicts, lookup tables, and short callbacks whose logic tells part of the story.

The #| export nbdev directive (without the s) also exports to the module but only shows the signature in the rendered doc, not the body. This is appropriate for larger functions where readers primarily need to know the interface (the implementation details are a distraction at the documentation layer).

Never edit the .py files under marisco/handlers/ directly: they are regenerated by nbdev_export after every notebook change. Other useful directives include #| eval: false (the cell is skipped when nbdev runs the notebook for tests or documentation, but still runs when you execute it by hand) and #| hide (the cell is omitted from the rendered documentation). See the nbdev directives documentation for the full list.
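As a rough illustration of where these directives go, the sketch below lays out three notebook cells in a single listing. The `# --- cell N ---` separators only stand in for cell boundaries (in a real notebook the directive is the first line of its cell), and `FILTER_FLAGS` and `normalise_flag` are hypothetical names invented for this example.

```python
# --- cell 1 ---
#| exports
# Exported to the module, and the full body is shown in the rendered docs.
FILTER_FLAGS = {'Y': 'filtered', 'N': 'unfiltered'}  # hypothetical lookup table

# --- cell 2 ---
#| export
# Exported to the module; the rendered docs show the signature and docstring only.
def normalise_flag(raw):
    "Map a raw provider flag to a label, or None when the flag is unknown."
    return FILTER_FLAGS.get(str(raw).strip().upper())

# --- cell 3 ---
#| eval: false
# Skipped by automated nbdev runs; execute it by hand while developing.
print(normalise_flag(' y '))
```

A small dict like `FILTER_FLAGS` tells part of the curation story, so showing its body is useful; for a longer function the signature is usually enough.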

Before you start

You will need:

  • The raw source data from the provider and its format documentation.
  • A Zotero record key for the dataset (an 8-character alphanumeric string). Create a record in the MARIS Zotero library if one does not already exist.
  • The ZOTERO_API_KEY environment variable set.
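A quick pre-flight check can catch a malformed key or a missing environment variable before any data is loaded. The helper below is a minimal sketch and is not part of marisco; `check_prerequisites` is a hypothetical name, and the key pattern simply encodes the "8-character alphanumeric" description given above.

```python
import os
import re

def check_prerequisites(zotero_key, env=os.environ):
    "Return a list of human-readable problems; an empty list means ready to go."
    problems = []
    # The guide describes the key as an 8-character alphanumeric string
    if not re.fullmatch(r'[A-Za-z0-9]{8}', zotero_key):
        problems.append(f'Zotero key {zotero_key!r} is not 8 alphanumeric characters')
    # An unset or empty variable both count as missing
    if not env.get('ZOTERO_API_KEY'):
        problems.append('ZOTERO_API_KEY environment variable is not set')
    return problems

# Passing a dict as `env` keeps the example independent of the real environment
print(check_prerequisites('ABCD1234', env={'ZOTERO_API_KEY': 'secret'}))
```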

Start with the HELCOM handler as your reference. It covers all four sample type groups and every curation rule. GEOTRACES, OSPAR and TEPCO are additional examples that show different structural challenges.

Step 1: Write the module header and constants

Every handler notebook starts with the #| default_exp directive that tells nbdev which Python module to generate. Import the standard marisco classes you will need.

Then define the three constants that locate the data, define the output file, and identify the dataset in Zotero.

Exported source
src_dir = '...'        # path or URL to the raw data
fname_out = '...'      # default output filename
zotero_key = 'XXXXXXXX'  # 8 character Zotero record key

Step 2: Write the data loader

The loader function reads raw provider files and returns a dictionary of DataFrames keyed by sample type group. A handler supports any subset of the four groups: SEAWATER, BIOTA, SEDIMENT, SUSPENDED_MATTER.


load_data


def load_data(
    fname_in, # Path to raw data provider's data
):

Load provider data; returns dict of DataFrames keyed by sample type.

Exported source
def load_data(
        fname_in # Path to raw data provider's data
        ):
    "Load provider data; returns dict of DataFrames keyed by sample type."
    res = {}
    # Read each sample type group from the provider's files and store it
    # under the correct MARIS group key
    return res
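Because the rest of the pipeline relies on the group keys, it can help to check the loader output early. The following is a minimal sketch under the assumption that the group names are exactly the four strings listed above; `MARIS_GROUPS` and `check_loader_output` are hypothetical names, not marisco API.

```python
MARIS_GROUPS = {'SEAWATER', 'BIOTA', 'SEDIMENT', 'SUSPENDED_MATTER'}

def check_loader_output(dfs):
    "Raise ValueError if the loader result is empty or uses unknown group keys."
    if not isinstance(dfs, dict) or not dfs:
        raise ValueError('load_data should return a non-empty dict')
    unknown = set(dfs) - MARIS_GROUPS
    if unknown:
        raise ValueError(f'Unknown sample type groups: {sorted(unknown)}')
    return dfs

# Any subset of the four groups is acceptable; values would be DataFrames
check_loader_output({'SEAWATER': object(), 'BIOTA': object()})
```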

Step 3: Reconcile nomenclatures

Every enumerated field (nuclide name, species, body part, sediment type, unit, etc.) needs to be mapped from the provider’s names to MARIS standard identifiers. The nomenclature reconciliation how-to explains the full workflow: derive unique values from the data (or load a provider-supplied lookup table), let fuzzy matching handle the bulk, inspect borderline scores, apply expert overrides, and verify the result.
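To make the workflow concrete, here is a minimal sketch that uses the standard library's difflib as a stand-in for the matcher. It is not marisco's implementation: `fuzzy_lut`, its parameters, and the 0.8 threshold are assumptions made for illustration only, and the real scoring may differ.

```python
from difflib import SequenceMatcher

def fuzzy_lut(provider_names, maris_names, fixes=None, threshold=0.8):
    """Map each unique provider name to its closest MARIS name.

    Returns (lut, borderline): `lut` maps provider name -> MARIS name, and
    `borderline` lists (provider, match, score) entries scoring below `threshold`.
    """
    fixes = fixes or {}
    lut, borderline = {}, []
    for name in sorted(set(provider_names)):      # derive unique values
        if name in fixes:                         # expert override wins
            lut[name] = fixes[name]
            continue
        # Score every candidate and keep the best one
        scored = [(SequenceMatcher(None, name.lower(), m.lower()).ratio(), m)
                  for m in maris_names]
        score, best = max(scored)
        lut[name] = best
        if score < threshold:                     # flag for manual inspection
            borderline.append((name, best, round(score, 2)))
    return lut, borderline

lut, review = fuzzy_lut(['CS-137', 'cs137', 'K 40'],
                        ['cs137', 'k40', 'sr90'],
                        fixes={'K 40': 'k40'})
```

Anything that ends up in `review` is a candidate for a manual entry in the fixes dictionary.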

The how-to covers three ways to build a lookup table (LUT):

  • make_lut: derive from measurement data (most common)
  • make_lut_from: build from a provider-supplied lookup table
  • Plain dict: for trivial mappings (e.g. filtered/unfiltered flags)

Then package the result into a RemapCB callback for the Transformer pipeline. See the HELCOM handler for a complete example.

Exported source
# Provider nuclide names that need manual correction after fuzzy matching
fixes_nuclide_names = {
    # 'provider_name': 'correct_maris_name',
}

# Lookup table for nuclide names — derived from the data, fuzzy-matched against MARIS
nuclide_lut = make_lut('NUCLIDE', fixes=fixes_nuclide_names)

For small, stable enumerations with only two or three entries (for example a filtered or unfiltered flag), skip the fuzzy matching and write a plain dict directly.

Step 4: Write callbacks

Each data quality issue in the raw input becomes one callback class. The callback docstring serves double duty: it appears as the description in the rendered documentation and is appended to the publisher_postprocess_logs NetCDF4 global attribute at runtime.

Write the docstring as a single sentence that describes what was done and, if not obvious, why. It must stand alone as an audit trail entry.

# Good: informative in both contexts
"Convert longitudes from the [0, 360] convention to MARIS [-180, 180] by subtracting 360 from values above 180."

# Bad: too vague to serve as an audit trail entry
"Unshift longitudes."

Use #| export for callback classes and #| exports for lookup tables and constants passed to callbacks.
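The double duty of the docstring can be pictured with a small sketch that gathers the docstrings of the callbacks, in pipeline order, into one log string. How marisco actually assembles publisher_postprocess_logs is not shown here; `postprocess_log`, the two callback classes, and the newline separator are assumptions for illustration.

```python
class LowercaseNuclidesCB:
    "Lowercase and strip whitespace from provider nuclide names."

class DropBadDatesCB:
    "Drop rows whose sampling date cannot be parsed."

def postprocess_log(cbs):
    "Join the docstring of each callback, in pipeline order, one entry per line."
    return '\n'.join(type(cb).__doc__.strip() for cb in cbs)

log = postprocess_log([LowercaseNuclidesCB(), DropBadDatesCB()])
print(log)
```

Read on its own, each line of `log` should still make sense, which is why a vague docstring is a poor audit trail entry.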


SanitizeNuclideNamesCB


def SanitizeNuclideNamesCB(
    col_src, # Source column name e.g. 'Nuclide'
    col_dst, # Destination column name
):

Lowercase and strip whitespace from provider nuclide names.

Exported source
class SanitizeNuclideNamesCB(PerGroupCB):
    "Lowercase and strip whitespace from provider nuclide names."
    def __init__(self, 
                 col_src, # Source column name e.g. 'Nuclide'
                 col_dst # Destination column name
                 ):
        store_attr()
    def each_grp(self, grp, df, tfm):
        df[self.col_dst] = df[self.col_src].str.lower().str.strip()

Step 5: Build the Transformer pipeline

Instantiate a Transformer with the ordered list of callbacks. The order matters: sanitisation callbacks run first, then nomenclature remapping, then unit conversion and detection limit encoding, then coordinate and time validation. Each callback sees the output of the previous one.

dfs = load_data(src_dir)
tfm = Transformer(dfs, cbs=[
    # Sanitisation
    SanitizeNuclideNamesCB(col_src='nuclide', col_dst='NUCLIDE'),
    # Nomenclature remapping
    RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE'),
    # Validation
    SanitizeLonLatCB(),
    EncodeTimeCB(col_src='date')
])
tfm()
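Why the order matters can be seen with a toy pipeline. The sketch below is a stand-in written with plain lists and dicts, not marisco's Transformer; `MiniTransformer`, `sanitise`, `remap`, and the identifier 33 are placeholders chosen for illustration.

```python
LUT = {'cs137': 33}  # placeholder identifier, for illustration only

class MiniTransformer:
    "Toy callback pipeline: applies callbacks in list order to a copy of the rows."
    def __init__(self, rows, cbs):
        self.rows = [dict(r) for r in rows]
        self.cbs = cbs
    def __call__(self):
        for cb in self.cbs:
            cb(self.rows)        # each callback sees the previous one's output
        return self.rows

def sanitise(rows):
    "Lowercase and strip names that are still strings."
    for r in rows:
        if isinstance(r['NUCLIDE'], str):
            r['NUCLIDE'] = r['NUCLIDE'].lower().strip()

def remap(rows):
    "Replace names by identifiers; unknown names become 0."
    for r in rows:
        r['NUCLIDE'] = LUT.get(r['NUCLIDE'], 0)

raw = [{'NUCLIDE': ' Cs137 '}]
right = MiniTransformer(raw, [sanitise, remap])()   # [{'NUCLIDE': 33}]
wrong = MiniTransformer(raw, [remap, sanitise])()   # [{'NUCLIDE': 0}]
```

With remapping first, the unsanitised name is not found in the lookup table and falls back to 0, which is exactly the kind of silent degradation the ordering is meant to prevent.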

Step 6: Build global attributes

The NetCDF4 file needs global attributes that describe the dataset: geographic bounding box, time range, depth range, and bibliographic metadata from Zotero.


get_attrs


def get_attrs(
    tfm, # Transformer object
    zotero_key, # Zotero dataset record key
    kw:list=['oceanography', 'Earth Science > Oceans > ...']
):

Build NetCDF global attributes for the dataset.

Exported source
def get_attrs(
        tfm, # Transformer object
        zotero_key, # Zotero dataset record key
        kw:list=['oceanography', 'Earth Science > Oceans > ...']
        ):
    "Build NetCDF global attributes for the dataset."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key),
    ])()
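What the range callbacks compute can be approximated in a few lines. This sketch works on a list of plain dicts rather than DataFrames, and both the column names (`LON`, `LAT`, `TIME`) and the ACDD-style attribute names are assumptions; the names that marisco actually writes may differ.

```python
from datetime import datetime, timezone

def _iso(seconds):
    "Format seconds since 1970-01-01 UTC as an ISO 8601 string."
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

def range_attrs(records):
    "Derive bounding box, depth range and time range from sample records."
    lons = [r['LON'] for r in records]
    lats = [r['LAT'] for r in records]
    times = [r['TIME'] for r in records]          # seconds since 1970-01-01 UTC
    # Ignore the -1 sentinel used for unavailable depths
    depths = [r['SMP_DEPTH'] for r in records if r['SMP_DEPTH'] != -1]
    return {
        'geospatial_lon_min': min(lons), 'geospatial_lon_max': max(lons),
        'geospatial_lat_min': min(lats), 'geospatial_lat_max': max(lats),
        'geospatial_vertical_min': min(depths, default=None),
        'geospatial_vertical_max': max(depths, default=None),
        'time_coverage_start': _iso(min(times)),
        'time_coverage_end': _iso(max(times)),
    }

attrs = range_attrs([
    {'LON': -10.0, 'LAT': 50.0, 'SMP_DEPTH': 5.0, 'TIME': 0},
    {'LON': 20.0, 'LAT': 60.0, 'SMP_DEPTH': -1, 'TIME': 946684800},
])
```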

Step 7: Write the entry point

encode is the main public interface of every handler. The maris_to_nc CLI command (defined in marisco/cli/to_nc.py) dynamically imports and calls this function by name, so the signature must be kept stable: fname_out as the first positional argument, **kwargs for optional flags such as verbose.

Once installed, the handler is invoked from the command line as:

maris_to_nc your_dataset /path/to/output.nc
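The reason for the stable signature becomes clearer with a sketch of dynamic dispatch. This is not the code in marisco/cli/to_nc.py; `run_handler` and the `fake_handler` module are hypothetical and exist only so the example runs without marisco installed.

```python
import importlib
import sys
import types

def run_handler(module_path, fname_out, **kwargs):
    "Import a handler module by dotted path and call its `encode` function."
    module = importlib.import_module(module_path)
    return module.encode(fname_out, **kwargs)

# Stand-in handler module registered under a made-up name
calls = []

def _fake_encode(fname_out, **kwargs):
    "Record the call instead of writing a NetCDF file."
    calls.append((fname_out, kwargs))

fake = types.ModuleType('fake_handler')
fake.encode = _fake_encode
sys.modules['fake_handler'] = fake

run_handler('fake_handler', '/path/to/output.nc', verbose=True)
```

Because the caller only knows the name `encode` and the position of `fname_out`, renaming either in a handler would break the command for that dataset.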

The body mirrors the development pipeline built in the preceding steps: load, transform, build global attributes, encode.


encode


def encode(
    fname_out:str, # Output file name
    **kwargs
)->None:

Encode provider data into MARIS standard NetCDF4 file.

Exported source
def encode(
        fname_out: str, # Output file name
        **kwargs
        ) -> None:
    "Encode provider data into MARIS standard NetCDF4 file."
    dfs = load_data(src_dir)
    tfm = Transformer(dfs, cbs=[
        SanitizeNuclideNamesCB(col_src='nuclide', col_dst='NUCLIDE'),
        RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE'),
        EncodeTimeCB(col_src='date'),
        SanitizeLonLatCB(),
        # ... add remaining callbacks in pipeline order
    ])
    tfm()
    encoder = NetCDFEncoder(
        tfm.dfs,
        dest_fname=fname_out,
        global_attrs=get_attrs(tfm, zotero_key),  # get_attrs does not accept verbose, so **kwargs is not forwarded
        verbose=kwargs.get('verbose', False),
    )
    encoder.encode()

Called from a development cell (marked #| eval: false):

encode(fname_out, verbose=False)

Step 8: Write tests

Each handler includes two levels of testing:

  • Mock-data unit tests: For every callback, define a small dfs_mock dictionary with representative inputs, apply the callback via Transformer, and assert the expected outputs. This verifies the transformation logic in isolation, independent of the full dataset.

  • End-to-end smoke test: After assembling the full pipeline, include a code cell (marked #| eval: false) that calls encode() on the real data. Run this during development to confirm every step works together. The output is a NetCDF file that can be inspected or decoded back to CSV for manual review.
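A mock-data unit test can be as small as the sketch below. To stay self-contained it uses lists of dicts in place of DataFrames and a plain function in place of a callback run through Transformer, so `sanitize_nuclide_names` here is a simplified stand-in for the real callback.

```python
def sanitize_nuclide_names(dfs, col_src, col_dst):
    "Lowercase and strip whitespace from provider nuclide names, group by group."
    for rows in dfs.values():
        for row in rows:
            row[col_dst] = row[col_src].lower().strip()
    return dfs

def test_sanitize_nuclide_names():
    # Small, representative mock input: one group, messy names
    dfs_mock = {'SEAWATER': [{'nuclide': ' Cs137'}, {'nuclide': 'SR90 '}]}
    out = sanitize_nuclide_names(dfs_mock, col_src='nuclide', col_dst='NUCLIDE')
    assert [r['NUCLIDE'] for r in out['SEAWATER']] == ['cs137', 'sr90']
    # The source column is left untouched
    assert out['SEAWATER'][0]['nuclide'] == ' Cs137'

test_sanitize_nuclide_names()
```

Keeping the mock input tiny makes the expected output easy to verify by eye, which is the point of testing each transformation in isolation.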

Curation rules summary

| Rule | Detail |
|---|---|
| Units (seawater) | Must be Bq m⁻³. Multiply VALUE, UNC, and any detection limit value by 1000 if the provider reports Bq L⁻¹ |
| Units (biota) | Prefer wet weight (unit ID 5). Also compute PERCENTWT when both wet and dry weights are available |
| Uncertainty | 1σ absolute, same unit as VALUE. Convert from relative: UNC = VALUE × UNC_pct / 100 |
| Detection limit | Map to MARIS IDs: 1 for measured, 2 for below MDA, 3 for ND no limit, 4 for derived or aggregated |
| Coordinates | Decimal degrees. Longitude in range -180 to 180. Drop rows where both latitude and longitude equal 0 (unknown position sentinel) |
| Time | Parse to pd.Timestamp first. Encode to integer seconds since 1970-01-01 UTC via EncodeTimeCB. Drop rows with unparseable dates |
| Depth | SMP_DEPTH in metres. Use -1 as sentinel for not available |
| Unknown IDs | Set to 0, not NaN, for unresolved enumerated fields. Never silently drop a row because a name could not be remapped |
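Several of these rules reduce to one-line arithmetic. The functions below are a minimal sketch of that arithmetic on scalar values; the function names are hypothetical and this is not how the marisco callbacks are implemented.

```python
from datetime import datetime, timezone

def bq_per_l_to_bq_per_m3(value):
    "1 m^3 = 1000 L, so Bq/L values are multiplied by 1000."
    return value * 1000

def relative_to_absolute_unc(value, unc_pct):
    "Convert a relative uncertainty in percent to an absolute one."
    return value * unc_pct / 100

def wrap_longitude(lon):
    "Map a longitude from the [0, 360] convention into [-180, 180); 180 maps to -180."
    return (lon + 180) % 360 - 180

def is_unknown_position(lat, lon):
    "True for the (0, 0) sentinel that marks an unknown position."
    return lat == 0 and lon == 0

def encode_time(iso_date):
    "Integer seconds since 1970-01-01 UTC for an ISO date string, treated as UTC."
    dt = datetime.fromisoformat(iso_date).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(bq_per_l_to_bq_per_m3(0.5), relative_to_absolute_unc(200.0, 5),
      wrap_longitude(190), encode_time('2000-01-01'))
```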

Further reading