src_dir = '...' # path or URL to the raw data
fname_out = '...' # default output filename
zotero_key = 'XXXXXXXX' # 8 character Zotero record key

A handler is a Jupyter notebook in nbs/handlers/ that encodes one data provider’s raw files into a MARIS standard NetCDF4 file. The notebook is the authoritative description of every curation decision made for that dataset. The Python module in marisco/handlers/ is auto-generated from it via nbdev.
This guide walks through the steps needed to create a handler, from prerequisites to verification.
For the nomenclature reconciliation process used in several steps, see the nomenclature reconciliation how-to.
Each handler notebook follows a literate programming approach: code and prose live together in the same document, interspersed with explanations of why each transformation exists. This makes the pipeline self-documenting: anyone reading the notebook can follow the logic without jumping between source code and separate documentation.
The approach is inspired by Donald Knuth’s original vision of literate programming and resonates with the fastai/fastcore/solveit principle of making software more human-centred, a sentiment nicely captured by this quote:
“I think it would be better to humanise software than softwarise humans.” – Daniele Procida
From the MARIS application perspective, literate programming brings four concrete benefits:
The #| exports nbdev directive marks functions, classes, and symbols for export to the corresponding module under marisco/handlers/, while also rendering their implementation body in the generated documentation. Use this for smaller constructs where the reader benefits from seeing the code inline: configuration dicts, lookup tables, and short callbacks whose logic tells part of the story.
The #| export nbdev directive (without the s) also exports to the module but only shows the signature in the rendered doc, not the body. This is appropriate for larger functions where readers primarily need to know the interface (the implementation details are a distraction at the documentation layer).
Never edit the .py files under marisco/handlers/ directly: they are regenerated by nbdev-export after every notebook change. Other useful directives include #| eval: false (the cell is not executed when nbdev tests the notebook or renders the docs, so it only runs when you execute it by hand) and #| hide (the cell and its output are left out of the rendered documentation). See the nbdev directives documentation for the full list.
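As a minimal sketch of how the two export directives might sit side by side, the listing below shows two notebook cells one after the other (in a real notebook each directive sits at the top of its own cell). The names COLUMN_RENAMES and rename_columns are hypothetical and are not part of marisco.

```python
#| exports
# Small lookup table: `exports` renders the full body in the documentation,
# because the table itself is part of the curation story.
# COLUMN_RENAMES is a hypothetical example, not a marisco constant.
COLUMN_RENAMES = {'Nuclide': 'NUCLIDE', 'Latitude': 'LAT', 'Longitude': 'LON'}

#| export
# Larger helper: `export` renders only the signature in the documentation.
def rename_columns(row: dict, renames: dict = COLUMN_RENAMES) -> dict:
    "Rename provider column names to their standard counterparts."
    return {renames.get(key, key): value for key, value in row.items()}
```

Outside a notebook the directive lines are ordinary comments, so the listing also runs as plain Python.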
You will need:
- ZOTERO_API_KEY environment variable set.

Start with the HELCOM handler as your reference. It covers all four sample type groups and every curation rule. GEOTRACES, OSPAR and TEPCO are additional examples that show different structural challenges.
Every handler notebook starts with the #| default_exp directive that tells nbdev which Python module to generate. Import the standard marisco classes you will need.
Then define the three constants that locate the data, define the output file, and identify the dataset in Zotero.
The loader function reads raw provider files and returns a dictionary of DataFrames keyed by sample type group. A handler supports any subset of the four groups: SEAWATER, BIOTA, SEDIMENT, SUSPENDED_MATTER.
Load provider data; returns dict of DataFrames keyed by sample type.
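A loader along these lines could look like the sketch below. It assumes a local directory holding one CSV file per sample type group; the file names in GROUP_FILES are hypothetical, and a real provider may ship a single workbook, a database dump, or a URL instead.

```python
from pathlib import Path
import pandas as pd

# Hypothetical file layout: one CSV per sample type group.
GROUP_FILES = {
    'SEAWATER': 'seawater.csv',
    'BIOTA': 'biota.csv',
    'SEDIMENT': 'sediment.csv',
    'SUSPENDED_MATTER': 'suspended_matter.csv',
}

def load_data(src_dir, group_files=GROUP_FILES):
    "Load provider data; returns dict of DataFrames keyed by sample type."
    src = Path(src_dir)
    dfs = {}
    for grp, fname in group_files.items():
        path = src / fname
        if path.exists():  # a handler may cover only a subset of the groups
            dfs[grp] = pd.read_csv(path)
    return dfs
```

Groups without a file are simply absent from the returned dictionary, which matches the idea that a handler supports any subset of the four groups.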
Every enumerated field (nuclide name, species, body part, sediment type, unit, etc.) needs to be mapped from the provider’s names to MARIS standard identifiers. The nomenclature reconciliation how-to explains the full workflow: derive unique values from the data (or load a provider-supplied lookup table), let fuzzy matching handle the bulk, inspect borderline scores, apply expert overrides, and verify the result.
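The workflow can be illustrated with a small stand-in that uses Python's difflib instead of marisco's own matching code. Everything here is an assumption made for illustration: the function names, the 0.8 threshold, and the standard names and IDs in STANDARD are hypothetical, not real MARIS identifiers.

```python
from difflib import SequenceMatcher

# Hypothetical standard names and IDs, for illustration only.
STANDARD = {'cs137': 1, 'sr90': 2, 'pu239_240_tot': 3}

def best_match(name, candidates):
    "Return the closest candidate and its similarity score in [0, 1]."
    scored = [(SequenceMatcher(None, name, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match, score

def make_lut_sketch(provider_names, standard, overrides=None, threshold=0.8):
    "Map provider names to standard IDs; unresolved names get ID 0."
    overrides = overrides or {}
    lut, to_review = {}, []
    for raw in sorted(set(provider_names)):      # unique values from the data
        if raw in overrides:                     # expert override wins
            lut[raw] = standard[overrides[raw]]
            continue
        match, score = best_match(raw.lower().strip(), list(standard))
        if score >= threshold:                   # fuzzy match handles the bulk
            lut[raw] = standard[match]
        else:                                    # borderline: flag for inspection
            lut[raw] = 0
            to_review.append((raw, match, round(score, 2)))
    return lut, to_review

names = ['Cs-137', 'SR90', 'Pu239240', 'Cs-137']
lut, to_review = make_lut_sketch(names, STANDARD)
fixed, _ = make_lut_sketch(names, STANDARD, overrides={'Pu239240': 'pu239_240_tot'})
```

In this sketch 'Pu239240' scores below the threshold on the first pass, so it lands in to_review with ID 0 and is then resolved by the override on the second pass.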
The how-to covers three lut formats:
- make_lut: derive from measurement data (most common)
- make_lut_from: build from a provider-supplied lookup table

Then package the result into a RemapCB callback for the Transformer pipeline. See the HELCOM handler for a complete example.
For small, stable enumerations with only two or three entries (for example a filtered or unfiltered flag), skip the fuzzy matching and write a plain dict directly.
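For instance, a two-entry flag might be handled as in this sketch. The provider codes, the target IDs, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical provider codes and target IDs, for illustration only.
FILTERED_LUT = {'F': 1, 'N': 2}

df = pd.DataFrame({'filtered': ['F', 'N', 'F', None]})

# Codes missing from the dict become 0 rather than NaN.
df['FILT'] = df['filtered'].map(FILTERED_LUT).fillna(0).astype(int)
```

A dict this small is easier to review at a glance than a fuzzy-matched table, and any change to it shows up clearly in version control.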
Each data quality issue in the raw input becomes one callback class. The callback docstring serves double duty: it appears as the description in the rendered documentation and is appended to the publisher_postprocess_logs NetCDF4 global attribute at runtime.
Write the docstring as a single sentence that describes what was done and, if not obvious, why. It must stand alone as an audit trail entry.
Use #| export for callback classes and #| exports for lookup tables and constants passed to callbacks.
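To show why the docstring must stand alone, here is a rough sketch of how docstrings could be gathered into a log in pipeline order. It does not reproduce marisco's actual mechanism: the two callback classes are empty, hypothetical stubs, and collect_logs is an illustrative helper.

```python
class DropZeroCoordinatesCB:
    "Drop rows where both latitude and longitude equal 0."

class ConvertSeawaterUnitsCB:
    "Convert seawater activity from Bq L⁻¹ to Bq m⁻³ by multiplying by 1000."

def collect_logs(cbs):
    "Gather the docstring of each callback, in pipeline order."
    return [type(cb).__doc__ for cb in cbs]

logs = collect_logs([DropZeroCoordinatesCB(), ConvertSeawaterUnitsCB()])
```

Each entry in logs is read without the surrounding code, which is why a one-sentence docstring that says what was done works better than a terse label.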
Lowercase and strip whitespace from provider nuclide names.
class SanitizeNuclideNamesCB(PerGroupCB):
    "Lowercase and strip whitespace from provider nuclide names."
    def __init__(self,
                 col_src, # Source column name e.g. 'Nuclide'
                 col_dst # Destination column name
                 ):
        store_attr()

    def each_grp(self, grp, df, tfm):
        df[self.col_dst] = df[self.col_src].str.lower().str.strip()

Instantiate a Transformer with the ordered list of callbacks. The order matters: sanitisation callbacks run first, then nomenclature remapping, then unit conversion and detection limit encoding, then coordinate and time validation. Each callback sees the output of the previous one.
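The effect of ordering can be sketched with a tiny stand-in for the pipeline. MiniTransformer below is hypothetical and far simpler than marisco's Transformer; the callbacks are plain functions and the ID in NUCLIDE_LUT is invented for illustration.

```python
import pandas as pd

NUCLIDE_LUT = {'cs137': 1}  # hypothetical ID

class MiniTransformer:
    "Hypothetical stand-in for a Transformer: applies callbacks in order."
    def __init__(self, dfs, cbs):
        self.dfs = {grp: df.copy() for grp, df in dfs.items()}
        self.cbs = cbs

    def __call__(self):
        for cb in self.cbs:                  # outer loop fixes the order
            for grp, df in self.dfs.items():
                cb(grp, df)                  # each callback mutates df in place
        return self.dfs

def sanitize(grp, df):
    "Lowercase and strip whitespace from provider nuclide names."
    df['NUCLIDE'] = df['nuclide'].str.lower().str.strip()

def remap(grp, df):
    "Replace sanitised nuclide names by their IDs (0 when unresolved)."
    df['NUCLIDE'] = df['NUCLIDE'].map(NUCLIDE_LUT).fillna(0).astype(int)

dfs = {'SEAWATER': pd.DataFrame({'nuclide': [' Cs137 ', 'CS137']})}
result = MiniTransformer(dfs, [sanitize, remap])()
```

Here remap only finds its keys because sanitize ran first; with the list reversed, remap would fail because the NUCLIDE column does not exist yet.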
The NetCDF4 file needs global attributes that describe the dataset: geographic bounding box, time range, depth range, and bibliographic metadata from Zotero.
Build NetCDF global attributes for the dataset.
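The spatial, vertical, and temporal ranges could be derived roughly as below. This is a sketch under assumptions: the attribute names follow the ACDD convention and may differ from those marisco writes, the column names LAT, LON, SMP_DEPTH and TIME are assumed to exist in every group, and the Zotero lookup is left out entirely.

```python
import pandas as pd

def get_attrs_sketch(dfs):
    "Build a dict of global attributes spanning every sample type group."
    rows = pd.concat(list(dfs.values()), ignore_index=True)
    # Skip the -1 'not available' sentinel when computing the depth range.
    depth = rows.loc[rows['SMP_DEPTH'] != -1, 'SMP_DEPTH']

    def to_iso(seconds):
        "Format integer seconds since 1970-01-01 UTC as an ISO 8601 string."
        return pd.to_datetime(int(seconds), unit='s', utc=True).isoformat()

    return {
        'geospatial_lat_min': float(rows['LAT'].min()),
        'geospatial_lat_max': float(rows['LAT'].max()),
        'geospatial_lon_min': float(rows['LON'].min()),
        'geospatial_lon_max': float(rows['LON'].max()),
        'geospatial_vertical_min': float(depth.min()),
        'geospatial_vertical_max': float(depth.max()),
        'time_coverage_start': to_iso(rows['TIME'].min()),
        'time_coverage_end': to_iso(rows['TIME'].max()),
    }

dfs = {
    'SEAWATER': pd.DataFrame({'LAT': [57.3, 60.1], 'LON': [20.1, 25.4],
                              'SMP_DEPTH': [5.0, -1.0], 'TIME': [0, 86400]}),
    'BIOTA': pd.DataFrame({'LAT': [55.0], 'LON': [18.2],
                           'SMP_DEPTH': [30.0], 'TIME': [172800]}),
}
attrs = get_attrs_sketch(dfs)
```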
encode is the main public interface of every handler. The maris_to_nc CLI command (defined in marisco/cli/to_nc.py) dynamically imports and calls this function by name, so the signature must be kept stable: fname_out as the first positional argument, **kwargs for optional flags such as verbose.
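Calling a function by name after a dynamic import can be sketched with importlib. This is not the code in marisco/cli/to_nc.py: the module name demo_handler, the stand-in encode, and the helper run_handler are all hypothetical, and the module is registered in memory only so the example is self-contained.

```python
import importlib
import sys
import types

def _encode(fname_out, **kwargs):
    "Stand-in for a handler's encode: just reports what it was asked to do."
    return f"encoding to {fname_out} (verbose={kwargs.get('verbose', False)})"

# Register a hypothetical in-memory handler module.
demo = types.ModuleType('demo_handler')
demo.encode = _encode
sys.modules['demo_handler'] = demo

def run_handler(module_name, fname_out, **kwargs):
    "Import a handler module by name and call its encode function."
    handler = importlib.import_module(module_name)
    return getattr(handler, 'encode')(fname_out, **kwargs)

result = run_handler('demo_handler', 'out.nc', verbose=True)
```

Because the caller only knows the name encode and the position of fname_out, renaming the function or reordering its arguments would break every invocation at once.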
Once installed, the handler is invoked from the command line as:
The body mirrors the development pipeline built in the preceding steps: load, transform, build global attributes, encode.
Encode provider data into MARIS standard NetCDF4 file.
def encode(
    fname_out: str, # Output file name
    **kwargs
    ) -> None:
    "Encode provider data into MARIS standard NetCDF4 file."
    dfs = load_data(src_dir)
    tfm = Transformer(dfs, cbs=[
        SanitizeNuclideNamesCB(col_src='nuclide', col_dst='NUCLIDE'),
        RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE'),
        EncodeTimeCB(col_src='date'),
        SanitizeLonLatCB(),
        # ... add remaining callbacks in pipeline order
    ])
    tfm()
    encoder = NetCDFEncoder(
        tfm.dfs,
        dest_fname=fname_out,
        global_attrs=get_attrs(tfm, zotero_key, **kwargs),
        verbose=kwargs.get('verbose', False),
    )
    encoder.encode()

Each handler includes two levels of testing:
Mock-data unit tests: For every callback, define a small dfs_mock dictionary with representative inputs, apply the callback via Transformer, and assert the expected outputs. This verifies the transformation logic in isolation, independent of the full dataset.
End-to-end smoke test: After assembling the full pipeline, include a code cell (marked #| eval: false) that calls encode() on the real data. Run this during development to confirm every step works together. The output is a NetCDF file that can be inspected or decoded back to CSV for manual review.
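A mock-data test might look like the sketch below. To stay self-contained it applies a plain function rather than a callback run through Transformer; sanitize_nuclide_names is a hypothetical stand-in for the callback under test.

```python
import pandas as pd

def sanitize_nuclide_names(df, col_src='nuclide', col_dst='NUCLIDE'):
    "Lowercase and strip whitespace from provider nuclide names."
    df[col_dst] = df[col_src].str.lower().str.strip()
    return df

# Small, representative inputs: one frame per sample type group.
dfs_mock = {
    'SEAWATER': pd.DataFrame({'nuclide': [' Cs137', 'SR90 ']}),
    'BIOTA': pd.DataFrame({'nuclide': ['Pu238']}),
}

out = {grp: sanitize_nuclide_names(df.copy()) for grp, df in dfs_mock.items()}

assert out['SEAWATER']['NUCLIDE'].tolist() == ['cs137', 'sr90']
assert out['BIOTA']['NUCLIDE'].tolist() == ['pu238']
```

Keeping the mock frames to a handful of rows makes a failing assertion easy to trace back to one input value.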
| Rule | Detail |
|---|---|
| Units — seawater | Must be Bq m⁻³. Multiply VALUE, UNC, and any detection limit value by 1000 if the provider reports Bq L⁻¹ |
| Units — biota | Prefer wet weight (unit ID 5). Also compute PERCENTWT when both wet and dry weights are available |
| Uncertainty | 1σ absolute, same unit as VALUE. Convert from relative: UNC = VALUE × UNC_pct / 100 |
| Detection limit | Map to MARIS IDs: 1 for measured, 2 for below MDA, 3 for ND no limit, 4 for derived or aggregated |
| Coordinates | Decimal degrees. Longitude in range -180 to 180. Drop rows where both latitude and longitude equal 0 (unknown position sentinel) |
| Time | Parse to pd.Timestamp first. Encode to integer seconds since 1970-01-01 UTC via EncodeTimeCB. Drop rows with unparseable dates |
| Depth | SMP_DEPTH in metres. Use -1 as sentinel for not available |
| Unknown IDs | Set to 0 not NaN for unresolved enumerated fields. Never silently drop a row because a name could not be remapped |
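Several of these rules can be worked through on a three-row example. The sketch uses plain pandas in place of the marisco callbacks (EncodeTimeCB in particular is not called here), and the input column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'value_bq_l': [0.002, 0.010, 0.005],   # provider activity in Bq L⁻¹
    'unc_pct':    [10.0, 5.0, 20.0],       # relative uncertainty in percent
    'lat':        [57.3, 0.0, 60.1],
    'lon':        [20.1, 0.0, 25.4],
    'date':       ['2015-06-01', '2015-06-02', 'not a date'],
})

# Units (seawater): Bq L⁻¹ to Bq m⁻³, since 1 m³ holds 1000 L.
df['VALUE'] = df['value_bq_l'] * 1000

# Uncertainty: relative percent to absolute, same unit as VALUE.
df['UNC'] = df['VALUE'] * df['unc_pct'] / 100

# Coordinates: drop the (0, 0) unknown position sentinel.
df = df[~((df['lat'] == 0) & (df['lon'] == 0))].copy()

# Time: parse first, drop unparseable dates, then encode as integer seconds.
df['parsed'] = pd.to_datetime(df['date'], errors='coerce', utc=True)
df = df.dropna(subset=['parsed']).copy()
epoch = pd.Timestamp('1970-01-01', tz='UTC')
df['TIME'] = (df['parsed'] - epoch) // pd.Timedelta(seconds=1)
```

Only the first row survives: the second is removed by the coordinate rule and the third by the time rule.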