The BODC GEOTRACES Intermediate Data Product 2021 is one of the most comprehensive compilations of ocean radionuclide measurements to date, assembling water-column and suspended-particulate data from international oceanographic cruises worldwide.

This notebook documents the full curation workflow applied to bring that dataset into alignment with MARIS data standards: selecting the radionuclide variables within MARIS scope, reshaping the wide-format source, extracting metadata encoded in column names (unit, filtering status, sampling method), standardising nuclide nomenclature, coordinates, and units, and splitting measurements into SEAWATER and SUSPENDED_MATTER groups before encoding as a self-contained NetCDF4 file. The same workflow can be run end-to-end without inspecting the notebook via the maris_to_nc CLI tool.

Our approach is inspired by Literate Programming: code and explanation live side by side so data providers can follow the reasoning behind every curation decision and data users can understand exactly what was done to the data and why. Where the raw data contains inconsistencies or opportunities for improvement, they are flagged directly in the relevant section as feedback for future releases.

Configuration & file paths

  • fname_in: path to the GEOTRACES seawater discrete-sample CSV file. A relative path is accepted.

  • fname_out: path and filename for the NetCDF output. A relative path is accepted.

  • zotero_key: key used to retrieve dataset attributes from Zotero. The MARIS datasets are catalogued in a library available on Zotero.

Exported source
fname_in = '../../_data/geotraces/GEOTRACES_IDP2021_v2/seawater/ascii/GEOTRACES_IDP2021_Seawater_Discrete_Sample_Data_v2.csv'
fname_out = '../../_data/output/190-geotraces-2021.nc'
zotero_key = '97UIMEXN'

Load data

Exported source
load_data = lambda fname: pd.read_csv(fname)
df = load_data(fname_in)
print(f'df shape: {df.shape}')
df.columns
df shape: (105417, 1188)
Index(['Cruise', 'Station:METAVAR:INDEXED_TEXT', 'Type',
       'yyyy-mm-ddThh:mm:ss.sss', 'Longitude [degrees_east]',
       'Latitude [degrees_north]', 'Bot. Depth [m]',
       'Operator's Cruise Name:METAVAR:INDEXED_TEXT',
       'Ship Name:METAVAR:INDEXED_TEXT', 'Period:METAVAR:INDEXED_TEXT',
       ...
       'QV:SEADATANET.581', 'Co_CELL_CONC_BOTTLE [amol/cell]',
       'QV:SEADATANET.582', 'Ni_CELL_CONC_BOTTLE [amol/cell]',
       'QV:SEADATANET.583', 'Cu_CELL_CONC_BOTTLE [amol/cell]',
       'QV:SEADATANET.584', 'Zn_CELL_CONC_BOTTLE [amol/cell]',
       'QV:SEADATANET.585', 'QV:ODV:SAMPLE'],
      dtype='str', length=1188)

Select columns of interest

The raw Geotraces CSV arrives in wide format with 1,188 columns, most of which are non-radionuclide parameters (nutrients, trace metals, quality flags) outside MARIS scope. The first step is to select only the radionuclide columns: common_coi lists the 6 metadata columns always kept as identifiers, and nuclides_pattern matches 80 measurement columns, reducing the table to 86 columns. Because the regex patterns are anchored to the start of measurement column names, the companion quality-flag (QV:) columns are excluded automatically. The wide structure is then reshaped to long form in a later step.



SelectColsOfInterestCB


def SelectColsOfInterestCB(
    common_coi:list, # Non-nuclide columns always kept as id_vars
    nuclides_pattern:list, # Regex patterns matching nuclide column names
):

Select columns of interest from the wide Geotraces dataframe.

Exported source
# Metadata columns always kept as identifiers when reshaping wide → long
common_coi = ['yyyy-mm-ddThh:mm:ss.sss', 'Longitude [degrees_east]',
              'Latitude [degrees_north]', 'Bot. Depth [m]', 'DEPTH [m]', 'BODC Bottle Number:INTEGER']

# Regex patterns identifying radionuclide measurement columns
nuclides_pattern = ['^TRITI', '^Th_228', '^Th_23[024]', '^Pa_231', 
                    '^U_236_[DT]', '^Be_', '^Cs_137', '^Pb_210', '^Po_210',
                    '^Ra_22[3468]', '^Np_237', '^Pu_239_[D]', '^Pu_240', '^Pu_239_Pu_240',
                    '^I_129', '^Ac_227']

For instance:

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern)
])

df_test = tfm()
print(f'First ten cols: {df_test.columns[:10]}')

# All metadata columns preserved
for col in common_coi: test_eq(col in df_test.columns, True)

# Quality flag columns stripped
test_eq(any('QV:' in c for c in df_test.columns), False)
First ten cols: Index(['yyyy-mm-ddThh:mm:ss.sss', 'Longitude [degrees_east]',
       'Latitude [degrees_north]', 'Bot. Depth [m]', 'DEPTH [m]',
       'BODC Bottle Number:INTEGER', 'TRITIUM_D_CONC_BOTTLE [TU]',
       'Cs_137_D_CONC_BOTTLE [uBq/kg]', 'I_129_D_CONC_BOTTLE [atoms/kg]',
       'Np_237_D_CONC_BOTTLE [uBq/kg]'],
      dtype='str')

Columns matched by the nuclides patterns
From the 1,188 raw columns, the patterns above select these 80 radionuclide measurement columns:

  ^TRITI               →  1 cols  e.g. ['TRITIUM_D_CONC_BOTTLE [TU]']
  ^Th_228              →  4 cols  e.g. ['Th_228_D_CONC_PUMP [uBq/kg]', 'Th_228_D_CONC_UWAY [uBq/kg]', 'Th_228_SPT_CONC_PUMP [uBq/kg]']
  ^Th_23[024]          → 25 cols  e.g. ['Th_230_T_CONC_BOTTLE [uBq/kg]', 'Th_230_D_CONC_BOTTLE [uBq/kg]', 'Th_232_T_CONC_BOTTLE [pmol/kg]']
  ^Pa_231              →  8 cols  e.g. ['Pa_231_D_CONC_BOTTLE [uBq/kg]', 'Pa_231_D_CONC_FISH [uBq/kg]', 'Pa_231_D_CONC_UWAY [uBq/kg]']
  ^U_236_[DT]          →  3 cols  e.g. ['U_236_D_CONC_BOTTLE [atoms/kg]', 'U_236_T_CONC_BOTTLE [atoms/kg]', 'U_236_D_CONC_FISH [atoms/kg]']
  ^Be_                 →  2 cols  e.g. ['Be_7_T_CONC_PUMP [uBq/kg]', 'Be_7_D_CONC_PUMP [uBq/kg]']
  ^Cs_137              →  2 cols  e.g. ['Cs_137_D_CONC_BOTTLE [uBq/kg]', 'Cs_137_D_CONC_UWAY [uBq/kg]']
  ^Pb_210              →  7 cols  e.g. ['Pb_210_D_CONC_BOTTLE [mBq/kg]', 'Pb_210_D_CONC_FISH [mBq/kg]', 'Pb_210_D_CONC_UWAY [mBq/kg]']
  ^Po_210              →  7 cols  e.g. ['Po_210_D_CONC_BOTTLE [mBq/kg]', 'Po_210_D_CONC_FISH [mBq/kg]', 'Po_210_D_CONC_UWAY [mBq/kg]']
  ^Ra_22[3468]         → 14 cols  e.g. ['Ra_224_D_CONC_BOTTLE [mBq/kg]', 'Ra_226_D_CONC_BOTTLE [mBq/kg]', 'Ra_228_T_CONC_BOTTLE [mBq/kg]']
  ^Np_237              →  1 cols  e.g. ['Np_237_D_CONC_BOTTLE [uBq/kg]']
  ^Pu_239_[D]          →  1 cols  e.g. ['Pu_239_D_CONC_BOTTLE [uBq/kg]']
  ^Pu_240              →  1 cols  e.g. ['Pu_240_D_CONC_BOTTLE [uBq/kg]']
  ^Pu_239_Pu_240       →  2 cols  e.g. ['Pu_239_Pu_240_D_CONC_BOTTLE [uBq/kg]', 'Pu_239_Pu_240_D_CONC_UWAY [uBq/kg]']
  ^I_129               →  1 cols  e.g. ['I_129_D_CONC_BOTTLE [atoms/kg]']
  ^Ac_227              →  1 cols  e.g. ['Ac_227_D_CONC_PUMP [uBq/kg]']

Total nuclide columns selected: 80 of 1188
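
The selection step described above can be sketched with plain pandas and `re`. This is a minimal illustration on a toy frame, not the source of SelectColsOfInterestCB: the helper name `select_cols` and the toy columns are assumptions made for the example.

```python
import re
import pandas as pd

# Toy wide frame standing in for the 1,188-column GEOTRACES CSV (illustrative only)
wide = pd.DataFrame({
    'DEPTH [m]': [10.0, 50.0],
    'Cs_137_D_CONC_BOTTLE [uBq/kg]': [1500.0, None],
    'QV:SEADATANET.1': [1, 9],
    'Fe_D_CONC_BOTTLE [nmol/kg]': [0.2, 0.3],
    'TRITIUM_D_CONC_BOTTLE [TU]': [0.7, 0.6],
})

id_cols = ['DEPTH [m]']
patterns = ['^TRITI', '^Cs_137']

def select_cols(df, id_cols, patterns):
    """Keep the id columns plus every column whose name matches one of the patterns."""
    combined = re.compile('|'.join(patterns))
    nuclide_cols = [c for c in df.columns if combined.match(c)]
    return df[id_cols + nuclide_cols]

selected = select_cols(wide, id_cols, patterns)
print(list(selected.columns))
```

Because each pattern is anchored with `^`, a column such as `QV:SEADATANET.1` can never match, which is why no explicit quality-flag filter is needed.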

Reshape: wide to long

The raw Geotraces CSV is in wide format: each row holds up to 80 radionuclide measurements in separate columns, and metadata such as unit, sampling method, and filter status is embedded in the column names themselves (e.g. Th_230_D_CONC_BOTTLE [uBq/kg]). This layout is impractical for curation. Melting to long format folds all measurements into a single VALUE column alongside a NUCLIDE column that carries the full column-name string, which the next step parses to extract unit, method, and filter status.



WideToLongCB


def WideToLongCB(
    common_coi:list, # Non-nuclide columns kept as id_vars in melt
    nuclides_pattern:list, # Regex patterns identifying nuclide columns
    var_name:str='NUCLIDE', # Output column name for nuclide identifiers
    value_name:str='VALUE', # Output column name for measurement values
):

Reshape wide nuclide columns to long format so unit, method, and filter status can be extracted from column names.

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern)
])
df_test = tfm()
print(f'Long format: {df_test.shape[0]} rows × {df_test.shape[1]} cols')
# id columns all preserved
for col in common_coi: test_eq(col in df_test.columns, True)
# nuclide name and value columns created
test_eq('NUCLIDE' in df_test.columns, True)
test_eq('VALUE' in df_test.columns, True)
# no original wide nuclide columns remain
test_eq(any(re.match(p, c) for p in nuclides_pattern for c in df_test.columns), False)
Long format: 26745 rows × 8 cols
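
A minimal sketch of the reshape using `DataFrame.melt` on a toy frame. The 26,745-row result above suggests that empty cells are discarded after melting; that is an assumption here, as is everything else about how WideToLongCB is implemented.

```python
import pandas as pd

# Toy wide frame: one id column and two measurement columns (illustrative only)
wide = pd.DataFrame({
    'DEPTH [m]': [10.0, 50.0],
    'Cs_137_D_CONC_BOTTLE [uBq/kg]': [1500.0, None],
    'TRITIUM_D_CONC_BOTTLE [TU]': [0.7, 0.6],
})

long = (
    wide
    .melt(id_vars=['DEPTH [m]'], var_name='NUCLIDE', value_name='VALUE')
    .dropna(subset=['VALUE'])   # assumed: cells with no measurement are dropped
    .reset_index(drop=True)
)
print(long)
```

The NUCLIDE column still holds the full original column name at this point, so no metadata has been lost by the reshape.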

Extract

Geotraces encodes unit, filtering status, and sampling method inside the column names themselves: Th_230_D_CONC_BOTTLE [uBq/kg], for example, holds all three. These need to be parsed out into dedicated columns before they can drive unit conversion, MARIS nomenclature mapping, and quality checks.

Unit

Units appear in square brackets at the end of every nuclide column name. The five distinct units found in this dataset are uBq/kg, mBq/kg, TU, atoms/kg, and pmol/kg; each needs to be mapped to the corresponding MARIS unit code in a later step.



ExtractUnitCB


def ExtractUnitCB(
    var_name:str='NUCLIDE', # Column containing nuclide names with embedded units in brackets
):

Extract measurement unit from nuclide column names (e.g. ‘Cs_137_D_CONC_BOTTLE [uBq/kg]’ → ‘uBq/kg’).

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB()
])
df_test = tfm()
print(f'Units found: {sorted(df_test.UNIT.unique())}')
test_eq('UNIT' in df_test.columns, True)
test_eq(set(df_test.UNIT.unique()), {'TU', 'uBq/kg', 'atoms/kg', 'mBq/kg', 'pmol/kg'})
Units found: ['TU', 'atoms/kg', 'mBq/kg', 'pmol/kg', 'uBq/kg']
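
One way to pull the bracketed unit out of each name is a vectorised regex extract. This is a sketch under the assumption that the unit is always the last bracketed token, as in the column names shown above; it is not the ExtractUnitCB source.

```python
import pandas as pd

names = pd.Series([
    'Cs_137_D_CONC_BOTTLE [uBq/kg]',
    'TRITIUM_D_CONC_BOTTLE [TU]',
    'Th_232_T_CONC_BOTTLE [pmol/kg]',
])

# Capture whatever sits between the square brackets at the end of the name
units = names.str.extract(r'\[(.+)\]\s*$', expand=False)
print(units.tolist())
```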

Filtering status

Phase codes embedded in nuclide column names encode both filtering status and sample type group. The underscore-delimited component that immediately follows the nuclide name indicates the phase: D (dissolved, FILT=1, SEAWATER), T (total, FILT=2, SEAWATER), and TP / LPT / SPT (suspended particulate matter fractions, all FILT=1, SUSPENDED_MATTER). These are parsed into dedicated FILT and GROUP columns that drive sample type classification and downstream quality checks.



ExtractFilteringStatusCB


def ExtractFilteringStatusCB(
    phase:dict, # Phase code → {FILT, group} mapping (e.g. {'D': {'FILT': 1, 'group': 'SEAWATER'}})
    var_name:str='NUCLIDE', # Column containing nuclide names with embedded phase codes
):

Extract filtering status and sample-type group from nuclide column names using phase code (e.g. _D_, _T_, _TP_).

Exported source
# Phase code embedded in column names → FILT status and sample type group
phase = {
    'D': {'FILT': 1, 'group': 'SEAWATER'},
    'T': {'FILT': 2, 'group': 'SEAWATER'},
    'TP': {'FILT': 1, 'group': 'SUSPENDED_MATTER'}, 
    'LPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
    'SPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'}}
tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase)
])
df_test = tfm()
print(f'Groups found: {sorted(df_test.GROUP.dropna().unique())}')
print(f'Filtering values: {sorted(df_test.FILT.dropna().unique())}')
test_eq('FILT' in df_test.columns, True)
test_eq('GROUP' in df_test.columns, True)
test_eq(set(df_test.GROUP.dropna().unique()), {'SEAWATER', 'SUSPENDED_MATTER'})
test_eq(set(df_test.FILT.dropna().unique()).issubset({1, 2}), True)
Groups found: ['SEAWATER', 'SUSPENDED_MATTER']
Filtering values: [np.int64(1), np.int64(2)]
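
The phase lookup can be sketched as a regex search followed by a dictionary lookup. The sketch assumes that the phase code is always followed by `_CONC_`, which holds for every column name shown in this notebook but is not confirmed for the full dataset; `parse_phase` is a hypothetical helper, not the callback itself.

```python
import re

phase = {
    'D': {'FILT': 1, 'group': 'SEAWATER'},
    'T': {'FILT': 2, 'group': 'SEAWATER'},
    'TP': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
    'LPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
    'SPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
}

# Phase code delimited by underscores and followed by CONC (assumed naming convention)
phase_re = re.compile(r'_(' + '|'.join(phase) + r')_CONC_')

def parse_phase(name):
    """Return (FILT, group) for a column name, or (None, None) if no phase code is found."""
    m = phase_re.search(name)
    if m is None:
        return None, None
    entry = phase[m.group(1)]
    return entry['FILT'], entry['group']

print(parse_phase('Th_230_D_CONC_BOTTLE [uBq/kg]'))
print(parse_phase('Th_228_SPT_CONC_PUMP [uBq/kg]'))
```

Requiring underscores on both sides of the code keeps `T` from matching inside `SPT` or `LPT`.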

Sampling method

Sampling method codes appear as the last underscore component before the unit brackets in nuclide column names, for example BOTTLE in Th_230_D_CONC_BOTTLE [uBq/kg]. The four distinct methods found in this dataset are BOTTLE (rosette/CTD bottle, code 1), FISH (continuous towfish, code 18), PUMP (in situ pump, code 14), and UWAY (underway uncontaminated seawater supply, code 24). These are mapped to MARIS sampling method codes and recorded in the SAMP_MET column, enabling sample classification and cross-dataset comparisons between collection techniques.



ExtractSamplingMethodCB


def ExtractSamplingMethodCB(
    smp_method:dict={'BOTTLE': 1, 'FISH': 18, 'PUMP': 14, 'UWAY': 24}, # Sampling method lookup table
    var_name:str='NUCLIDE', # Column name containing nuclide names
    smp_method_col_name:str='SAMP_MET', # Column name for sampling method in output df
):

Extract sampling method from nuclide names.

Exported source
# Sampling method code → MARIS method ID mapping (to be validated)
smp_method = {
    'BOTTLE': 1,
    'FISH': 18,
    'PUMP': 14,
    'UWAY': 24}
tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method)
])
df_test = tfm()
print(f'Sampling methods found: {sorted(df_test.SAMP_MET.dropna().unique())}')
test_eq('SAMP_MET' in df_test.columns, True)
test_eq(set(df_test.SAMP_MET.dropna().unique()).issubset(set(smp_method.values())), True)
Sampling methods found: [np.int64(1), np.int64(14), np.int64(18), np.int64(24)]
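
A short sketch of the same idea with pandas string methods: capture the uppercase token that sits directly before the opening bracket, then map it through the lookup table. The regex is an assumption based on the column names shown here, not the ExtractSamplingMethodCB source.

```python
import pandas as pd

smp_method = {'BOTTLE': 1, 'FISH': 18, 'PUMP': 14, 'UWAY': 24}

names = pd.Series([
    'Th_230_D_CONC_BOTTLE [uBq/kg]',
    'Ac_227_D_CONC_PUMP [uBq/kg]',
    'Pa_231_D_CONC_FISH [uBq/kg]',
    'Cs_137_D_CONC_UWAY [uBq/kg]',
])

# Last underscore-delimited token before the unit brackets
tokens = names.str.extract(r'_([A-Z]+)\s*\[', expand=False)
codes = tokens.map(smp_method)
print(dict(zip(tokens, codes)))
```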

Remap to MARIS nuclide names

Geotraces nuclide column names begin with provider-specific strings (e.g. TRITIUM, Pu_239_Pu_240, Th_230, U_236) that must be remapped to MARIS standard nomenclature before any lookup tables can be applied. Most names follow a regular pattern: strip the phase-code suffix (_D, _T, _TP, etc.), then lowercase and remove underscores — Th_230 becomes th230, U_236 becomes u236. Two exceptions need explicit overrides: TRITIUM maps to h3 (the standard nuclide symbol for tritium), and Pu_239_Pu_240 is a combined total activity that MARIS records as pu239_240_tot. The RenameNuclideCB applies the override dictionary first, then falls back to the general lowercasing rule for everything else.



RenameNuclideCB


def RenameNuclideCB(
    nuclides_name:dict, # Provider-specific name overrides e.g. {'TRITIUM': 'h3'}
    var_name:str='NUCLIDE', # Column containing nuclide names to standardize
):

Remap nuclides name to MARIS standard.

Exported source
# Provider-specific nuclide name overrides for MARIS standardisation
nuclides_name = {'TRITIUM': 'h3', 'Pu_239_Pu_240': 'pu239_240_tot'}
tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name)
])
df_test = tfm()
nuclides = set(df_test.NUCLIDE.unique())
print(f'Nuclides after rename: {sorted(nuclides)}')
test_eq('h3' in nuclides, True)              # TRITIUM → h3 override
test_eq('pu239_240_tot' in nuclides, True)   # Pu_239_Pu_240 special case
test_eq(all(n == n.lower() for n in nuclides), True)  # all names lowercased
Nuclides after rename: ['ac227', 'be7', 'cs137', 'h3', 'i129', 'np237', 'pa231', 'pb210', 'po210', 'pu239', 'pu239_240_tot', 'pu240', 'ra223', 'ra224', 'ra226', 'ra228', 'th228', 'th230', 'th232', 'th234', 'u236']
df_test.NUCLIDE.unique()
<ArrowStringArray>
[           'h3',         'cs137',          'i129',         'np237',
         'pu239', 'pu239_240_tot',         'pu240',          'u236',
         'pa231',         'pb210',         'po210',         'ra224',
         'ra226',         'ra228',         'th230',         'th232',
         'th234',         'ac227',           'be7',         'ra223',
         'th228']
Length: 21, dtype: str
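
The renaming rule (overrides first, then the general lowercase rule) can be sketched as follows. `to_maris_name` is a hypothetical helper, and the suffix regex assumes the phase code is followed by `_CONC_` as in the column names shown above; the actual RenameNuclideCB may differ.

```python
import re

nuclides_name = {'TRITIUM': 'h3', 'Pu_239_Pu_240': 'pu239_240_tot'}

# Everything from the phase code onwards is dropped (assumed naming convention)
phase_suffix = re.compile(r'_(D|T|TP|LPT|SPT)_CONC_.*$')

def to_maris_name(col):
    """Map a GEOTRACES column name to a MARIS-style nuclide name."""
    base = phase_suffix.sub('', col)
    if base in nuclides_name:          # explicit overrides take precedence
        return nuclides_name[base]
    return base.replace('_', '').lower()

for col in ['Th_230_D_CONC_BOTTLE [uBq/kg]',
            'TRITIUM_D_CONC_BOTTLE [TU]',
            'Pu_239_Pu_240_D_CONC_BOTTLE [uBq/kg]']:
    print(col, '->', to_maris_name(col))
```

Checking the override dictionary before applying the general rule matters for Pu_239_Pu_240, which would otherwise collapse to pu239pu240.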
Important: FEEDBACK TO DATA PROVIDER

Several measurements are negative (see grouped counts below). Please review these values and provide detection-limit flags or handling guidance in future data releases.

df_test[df_test.VALUE < 0].groupby('NUCLIDE').size()
NUCLIDE
h3       71
pa231     3
th228    22
th230     1
th232     6
dtype: int64

Standardize unit

Geotraces encodes units inside nuclide column names, and five distinct units appear across the dataset: TU, uBq/kg, mBq/kg, atoms/kg, and pmol/kg. Some of these share a common MARIS unit ID despite different magnitudes — uBq/kg and mBq/kg both map to unit ID 3 but differ by a factor of 1000, which must be accounted for in the conversion factor. Similarly, pmol/kg must be converted via Avogadro’s number before it matches the atoms/kg unit ID. The mapping below handles both the unit remapping and the value rescaling:



StandardizeUnitCB


def StandardizeUnitCB(
    units_lut:dict, # Unit string → {id, factor} conversion mapping
    unit_col_name:str='UNIT', # Column containing unit strings to remap
    var_name:str='VALUE', # Column containing measurement values to rescale
):

Remap Geotraces unit strings to MARIS unit IDs, rescaling measurement values by the appropriate conversion factor where units share a common MARIS unit ID (e.g. uBq/kg and mBq/kg both map to ID 3 but differ 1000x).

Exported source
# Geotraces unit → MARIS unit ID and conversion factor mapping
units_lut = {
    'TU': {'id': 7, 'factor': 1},
    'uBq/kg': {'id': 3, 'factor': 1e-6},
    'atoms/kg': {'id': 9, 'factor': 1},
    'mBq/kg': {'id': 3, 'factor': 1e-3},
    'pmol/kg': {'id': 9, 'factor': 1e-12 * AVOGADRO}
    }
tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut)
])
df_test = tfm()
print(f'Unit IDs after standardization: {sorted(df_test.UNIT.unique())}')
test_eq(set(df_test.UNIT.unique()), {3, 7, 9})  # TU→7, uBq/kg+mBq/kg→3, atoms/kg+pmol/kg→9
Unit IDs after standardization: [np.int64(3), np.int64(7), np.int64(9)]
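
A worked example of the rescaling on three toy values. The factors in units_lut imply a target of Bq/kg for unit ID 3 and atoms/kg for unit ID 9. The value of AVOGADRO is not shown in this notebook; the SI-defined constant is assumed here.

```python
import pandas as pd

AVOGADRO = 6.02214076e23  # mol^-1, SI-defined value (assumed to match the pipeline's constant)

units_lut = {
    'TU': {'id': 7, 'factor': 1},
    'uBq/kg': {'id': 3, 'factor': 1e-6},
    'atoms/kg': {'id': 9, 'factor': 1},
    'mBq/kg': {'id': 3, 'factor': 1e-3},
    'pmol/kg': {'id': 9, 'factor': 1e-12 * AVOGADRO},
}

toy = pd.DataFrame({
    'VALUE': [1500.0, 2.5, 1.0],
    'UNIT': ['uBq/kg', 'mBq/kg', 'pmol/kg'],
})

# Rescale first (while UNIT still holds the string), then replace the string by its ID
toy['VALUE'] = toy['VALUE'] * toy['UNIT'].map(lambda u: units_lut[u]['factor'])
toy['UNIT'] = toy['UNIT'].map(lambda u: units_lut[u]['id'])
print(toy)
```

So 1500 uBq/kg becomes 0.0015 and 2.5 mBq/kg becomes 0.0025 under unit ID 3, while 1 pmol/kg becomes about 6.022e11 atoms/kg under unit ID 9.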

Rename common columns

Geotraces uses provider-specific column names for coordinates, depth, and sample identifiers — with units and metadata embedded as suffixes in brackets — that don’t match MARIS standard nomenclature (TIME, LON, LAT, TOT_DEPTH, SMP_DEPTH, SMP_ID_PROVIDER). These are remapped via RenameColumnCB before NetCDF encoding.



RenameColumnCB


def RenameColumnCB(
    lut:dict={'yyyy-mm-ddThh:mm:ss.sss': 'TIME', 'Longitude [degrees_east]': 'LON', 'Latitude [degrees_north]': 'LAT', 'DEPTH [m]': 'SMP_DEPTH', 'Bot. Depth [m]': 'TOT_DEPTH', 'BODC Bottle Number:INTEGER': 'SMP_ID_PROVIDER'}, # Provider column name → MARIS standard name mapping
):

Remap Geotraces-specific coordinate, depth, and sample-ID column names to MARIS standard nomenclature.

Exported source
# Geotraces column name → MARIS standard name mapping
renaming_rules = {
    'yyyy-mm-ddThh:mm:ss.sss': 'TIME',
    'Longitude [degrees_east]': 'LON',
    'Latitude [degrees_north]': 'LAT',
    'DEPTH [m]': 'SMP_DEPTH',
    'Bot. Depth [m]': 'TOT_DEPTH',
    'BODC Bottle Number:INTEGER': 'SMP_ID_PROVIDER'
}
tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules)
])
df_test = tfm()
print(f'Columns after rename: {list(df_test.columns)}')
# MARIS standard names present
for new_name in renaming_rules.values(): test_eq(new_name in df_test.columns, True)
# Provider names removed
for old_name in renaming_rules.keys(): test_eq(old_name in df_test.columns, False)
Columns after rename: ['TIME', 'LON', 'LAT', 'TOT_DEPTH', 'SMP_DEPTH', 'SMP_ID_PROVIDER', 'NUCLIDE', 'VALUE', 'UNIT', 'FILT', 'GROUP', 'SAMP_MET']

Unshift longitudes

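Geotraces encodes longitudes in the [0, 360] range (e.g. 230°E instead of −130°), which is incompatible with the MARIS [-180, 180] convention. UnshiftLongitudeCB is documented as subtracting 180 from every longitude. Note that the example above corresponds to a 360° wrap (230 − 360 = −130), whereas a uniform subtraction of 180 would map 230°E to 50°, so the callback's output is worth checking against known station positions.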



UnshiftLongitudeCB


def UnshiftLongitudeCB(
    lon_col_name:str='LON', # Column containing longitudes in [0, 360] to shift
):

Shift longitudes from Geotraces [0, 360] convention to MARIS [-180, 180] by subtracting 180.

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB()
])
df_test = tfm()
print(f'LON range: [{df_test.LON.min():.4f}, {df_test.LON.max():.4f}]')
test_eq(df_test.LON.between(-180, 180).all(), True)
LON range: [-180.0000, 179.9986]
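
For reference, the usual way to move longitudes from [0, 360] to [-180, 180] is a modular wrap, which leaves values below 180 untouched and maps 230°E to −130°. The sketch below illustrates that convention only; `wrap_longitude` is a hypothetical helper and not the source of UnshiftLongitudeCB.

```python
import numpy as np

def wrap_longitude(lon):
    """Wrap longitudes given in [0, 360] into [-180, 180)."""
    lon = np.asarray(lon, dtype=float)
    return ((lon + 180.0) % 360.0) - 180.0

wrapped = wrap_longitude([0.0, 170.0, 230.0, 350.0])
print(wrapped)
```

A range check such as `between(-180, 180)` passes for both a wrap and a flat shift, so comparing a few stations against their known positions is a more discriminating test.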

Dispatch to groups

The pipeline so far produces a single flat dataframe containing both seawater and suspended-particulate-matter measurements side by side. These two sample types belong to separate NetCDF4 groups (and use different units, different detection-limit conventions, etc.), so they need to be split into per-group dataframes before encoding. The DispatchToGroupCB partitions the flat result by the GROUP column and drops the column — the group label becomes the dict key rather than persisting as a data column.



DispatchToGroupCB


def DispatchToGroupCB(
    group_name:str='GROUP', # Column whose distinct values become the output dict keys
):

Split flat dataframe into per-group dict keyed by sample type (SEAWATER, SUSPENDED_MATTER, …).

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB(),
    DispatchToGroupCB()
])
dfs_test = tfm()
print(f'Groups: {list(dfs_test.keys())}')
test_eq(set(dfs_test.keys()), {'SEAWATER', 'SUSPENDED_MATTER'})
# GROUP column consumed as dict key, not passed through
test_eq('GROUP' in dfs_test['SEAWATER'].columns, False)
test_eq('GROUP' in dfs_test['SUSPENDED_MATTER'].columns, False)
Groups: ['SEAWATER', 'SUSPENDED_MATTER']
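
The partitioning can be sketched with a `groupby` and a dict comprehension. This is an illustration on a toy frame, not the DispatchToGroupCB source.

```python
import pandas as pd

flat = pd.DataFrame({
    'NUCLIDE': ['h3', 'th228', 'cs137'],
    'VALUE': [0.733, 0.0004, 0.0015],
    'GROUP': ['SEAWATER', 'SUSPENDED_MATTER', 'SEAWATER'],
})

# One dataframe per distinct GROUP value; the label moves from column to dict key
dfs = {name: grp.drop(columns='GROUP') for name, grp in flat.groupby('GROUP')}
print({name: len(grp) for name, grp in dfs.items()})
```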

Add sample ID

After wide-to-long melting, each BODC Bottle Number (renamed to SMP_ID_PROVIDER) appears once per measured nuclide — 8,779 distinct provider IDs across 19,139 seawater rows, and 1,849 across 7,606 suspended-matter rows. The provider ID is not a row-level identifier, so a sequential SMP_ID is generated per group to serve as the NetCDF dimension index. For traceability the provider’s stable bottle number is preserved as SMP_ID_PROVIDER and cast to str for NetCDF VLEN compatibility.

for grp, gdf in dfs_test.items():
    print(f'{grp}: {len(gdf)} rows, {gdf.SMP_ID_PROVIDER.nunique()} unique provider IDs')
SEAWATER: 19139 rows, 8779 unique provider IDs
SUSPENDED_MATTER: 7606 rows, 1849 unique provider IDs


AddSampleIDCB


def AddSampleIDCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Assign a sequential SMP_ID per sample-type group; cast SMP_ID_PROVIDER (BODC Bottle Number) to string for NetCDF VLEN compatibility.

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB(),
    DispatchToGroupCB(),
    AddSampleIDCB()
])
dfs_test = tfm()
for grp, gdf in dfs_test.items():
    print(f'{grp}: SMP_ID range 1–{gdf.SMP_ID.max()}, SMP_ID_PROVIDER dtype={gdf.SMP_ID_PROVIDER.dtype}')
# SMP_ID is sequential from 1
test_eq(dfs_test['SEAWATER']['SMP_ID'].iloc[0], 1)
test_eq(dfs_test['SUSPENDED_MATTER']['SMP_ID'].iloc[0], 1)
# SMP_ID_PROVIDER cast to string for NetCDF VLEN compatibility
test_eq(dfs_test['SEAWATER']['SMP_ID_PROVIDER'].dtype, 'str')
SEAWATER: SMP_ID range 1–19139, SMP_ID_PROVIDER dtype=str
SUSPENDED_MATTER: SMP_ID range 1–7606, SMP_ID_PROVIDER dtype=str
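
A minimal sketch of the per-group ID assignment, assuming a simple 1-based row counter as the output above suggests. `add_sample_id` is a hypothetical helper, not the AddSampleIDCB source.

```python
import pandas as pd

dfs = {
    'SEAWATER': pd.DataFrame({'SMP_ID_PROVIDER': [842525, 842525, 842528]}),
    'SUSPENDED_MATTER': pd.DataFrame({'SMP_ID_PROVIDER': [900001, 900002]}),
}

def add_sample_id(gdf):
    """Add a 1-based sequential SMP_ID and store the provider ID as a string."""
    out = gdf.reset_index(drop=True).copy()
    out['SMP_ID'] = list(range(1, len(out) + 1))
    out['SMP_ID_PROVIDER'] = out['SMP_ID_PROVIDER'].astype(str)
    return out

dfs = {grp: add_sample_id(gdf) for grp, gdf in dfs.items()}
print(dfs['SEAWATER'])
```

The repeated provider ID 842525 in the toy data mirrors the situation described above: one bottle, several nuclides, hence several rows.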

Parse time

Geotraces timestamps arrive as ISO 8601 strings in a single column (yyyy-mm-ddThh:mm:ss.sss). ParseTimeCB converts these to pandas datetime objects, enabling temporal filtering and NetCDF-compatible time encoding downstream.

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB(),
    DispatchToGroupCB(),
    ParseTimeCB()
])
dfs_test = tfm()
print('TIME dtype:', dfs_test['SEAWATER']['TIME'].dtype)
test_eq(dfs_test['SEAWATER']['TIME'].dtype, 'datetime64[us]')
TIME dtype: datetime64[us]

Encode time

Geotraces timestamps arrive as ISO 8601 strings. After ParseTimeCB converts them to datetime64[us] they are still not in a NetCDF-compatible format. The MARIS NetCDF CDL template stores time as seconds since 1970-01-01, so EncodeTimeCB converts each datetime to its Unix timestamp integer. Downstream the NetCDF file declares units: seconds since 1970-01-01T00:00:00Z on the TIME variable, which client software can decode back to calendar dates on read.

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB(),
    DispatchToGroupCB(),    
    ParseTimeCB(),
    EncodeTimeCB()
])

dfs_test = tfm()
print('TIME sample (epoch seconds):', dfs_test['SEAWATER']['TIME'].iloc[:5].values)
TIME sample (epoch seconds): [1287274409 1287274409 1287274409 1287274409 1287274409]
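
The two time steps can be reproduced with pandas alone. The sketch assumes the timestamps are UTC and timezone-naive; it illustrates the arithmetic rather than the ParseTimeCB and EncodeTimeCB implementations.

```python
import pandas as pd

raw = pd.Series(['2010-10-17T00:13:29', '1970-01-02T00:00:00'])

# Step 1: ISO 8601 strings to datetimes
parsed = pd.to_datetime(raw)

# Step 2: datetimes to whole seconds since 1970-01-01 (independent of datetime resolution)
epoch_seconds = (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
print(epoch_seconds.tolist())
```

The first value, 1287274409, is the epoch timestamp shown in the output above, which decodes to 2010-10-17T00:13:29.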

Sanitize coordinates

SanitizeLonLatCB normalises comma decimal separators to dots for longitude and latitude values, and drops rows whose coordinates are exactly (0, 0) or fall outside the valid ranges (lon ∉ [-180, 180], lat ∉ [-90, 90]).

df = pd.read_csv(fname_in)

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB(),
    DispatchToGroupCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
])
dfs_test = tfm()
dfs_test['SEAWATER'].head()
            TIME        LON      LAT  TOT_DEPTH  SMP_DEPTH  SMP_ID_PROVIDER NUCLIDE  VALUE  UNIT  FILT  SAMP_MET
9223  1287274409  170.33792  38.3271     2827.0       17.8           842525      h3  0.733     7     1         1
9231  1287274409  170.33792  38.3271     2827.0       34.7           842528      h3  0.696     7     1         1
9237  1287274409  170.33792  38.3271     2827.0       67.5           842531      h3  0.718     7     1         1
9244  1287274409  170.33792  38.3271     2827.0       91.9           842534      h3  0.709     7     1         1
9256  1287274409  170.33792  38.3271     2827.0      136.6           842540      h3  0.692     7     1         1
for grp, gdf in dfs_test.items():
    print(f'{grp}: {len(gdf)} rows after sanitize')
test_eq(all(dfs_test['SEAWATER']['LON'].between(-180, 180)), True)
test_eq(all(dfs_test['SEAWATER']['LAT'].between(-90, 90)), True)
SEAWATER: 19139 rows after sanitize
SUSPENDED_MATTER: 7606 rows after sanitize
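
The three rules described for SanitizeLonLatCB (decimal separator, origin, valid range) can be sketched as below. `sanitize_lon_lat` is a hypothetical helper written for illustration on toy data, not the library implementation.

```python
import pandas as pd

def sanitize_lon_lat(df, lon='LON', lat='LAT'):
    """Normalise comma decimals, then drop (0, 0) and out-of-range coordinates."""
    out = df.copy()
    for col in (lon, lat):
        as_text = out[col].astype(str).str.replace(',', '.', regex=False)
        out[col] = pd.to_numeric(as_text, errors='coerce')
    at_origin = (out[lon] == 0) & (out[lat] == 0)
    in_range = out[lon].between(-180, 180) & out[lat].between(-90, 90)
    return out[in_range & ~at_origin]

toy = pd.DataFrame({
    'LON': ['170,33792', '0', '190.0', '-9.66'],
    'LAT': ['38,3271', '0', '10.0', '38.3'],
})
clean = sanitize_lon_lat(toy)
print(clean)
```

In the toy frame the second row is dropped for sitting at (0, 0) and the third for a longitude of 190, leaving two rows.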

Remap nuclides name to id

At this point the pipeline holds nuclides as human-readable strings (h3, cs137, …) but the NetCDF file stores them as integer enumeration types for space efficiency. The mapping from standardised name to MARIS nuclide ID is defined by the lookup table below, which RemapCB applies before encoding. For example, h3 → 1 and cs137 → 33.

Exported source
# Lookup table: MARIS nc_name → nuclide_id
lut_nuclides = lambda: get_lut('NUCLIDE', reverse=False)
df = pd.read_csv(fname_in)

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB(),
    DispatchToGroupCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    RemapCB(fn_lut=lut_nuclides, col_remap='NUCLIDE', col_src='NUCLIDE')
])

dfs_test = tfm()
dfs_test['SEAWATER'].NUCLIDE.unique()
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.
array([  1,  33,  28,  65,  68,  77,  69, 108, 107,  41,  47,  51,  53,
        54, 106,  59,  60, 144,   2,  50,  57])
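
The remapping itself amounts to a dictionary lookup per row. The sketch below uses only the two name and ID pairs visible in this section (h3 and cs137) as an illustrative subset; the real table comes from get_lut and is not reproduced here.

```python
import pandas as pd

# Illustrative subset of the name -> ID lookup (the full table is loaded by the pipeline)
lut = {'h3': 1, 'cs137': 33}

names = pd.Series(['h3', 'cs137', 'h3'])
ids = names.map(lut)

# Names missing from the lookup would surface as NaN, so they are worth checking explicitly
unmapped = names[ids.isna()].unique().tolist()
print(ids.tolist(), unmapped)
```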

NetCDF encoder

Example change logs

Each callback’s docstring is recorded in tfm.logs during pipeline execution — an ordered audit trail of every transformation applied. These logs are serialised into the NetCDF output’s global attribute publisher_postprocess_logs, providing traceability for downstream users.
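
The docstring-as-log pattern can be sketched in a few lines. The classes below are hypothetical stand-ins written for this illustration; they are not the Transformer or callback classes used by the pipeline.

```python
class DoubleCB:
    "Double every value."
    def __call__(self, values):
        return [v * 2 for v in values]

class AddOneCB:
    "Add one to every value."
    def __call__(self, values):
        return [v + 1 for v in values]

class MiniTransformer:
    """Apply callbacks in order and record each callback's docstring."""
    def __init__(self, data, cbs):
        self.data, self.cbs, self.logs = data, cbs, []

    def __call__(self):
        for cb in self.cbs:
            self.data = cb(self.data)
            self.logs.append(type(cb).__doc__)  # class docstring doubles as the log entry
        return self.data

tfm_demo = MiniTransformer([1, 2], cbs=[DoubleCB(), AddOneCB()])
result = tfm_demo()
print(result, tfm_demo.logs)
```

Since the log entries are the docstrings themselves, keeping each docstring accurate is what keeps the audit trail accurate.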

The two “not found” messages for BIOTA and SEDIMENT are expected: this dataset (GEOTRACES IDP2021 seawater) only contains SEAWATER and SUSPENDED_MATTER sample types.

df = pd.read_csv(fname_in)

tfm = Transformer(df, cbs=[
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(common_coi, nuclides_pattern),
    ExtractUnitCB(),
    ExtractFilteringStatusCB(phase),
    ExtractSamplingMethodCB(smp_method),
    RenameNuclideCB(nuclides_name),
    StandardizeUnitCB(units_lut),
    RenameColumnCB(renaming_rules),
    UnshiftLongitudeCB(),
    DispatchToGroupCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    RemapCB(fn_lut=lut_nuclides, col_remap='NUCLIDE', col_src='NUCLIDE')
])

tfm();
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.
tfm.logs
['Select columns of interest from the wide Geotraces dataframe.',
 'Reshape wide nuclide columns to long format so unit, method, and filter status can be extracted from column names.',
 "Extract measurement unit from nuclide column names (e.g. 'Cs_137_D_CONC_BOTTLE [uBq/kg]' → 'uBq/kg').",
 'Extract filtering status and sample-type group from nuclide column names using phase code (e.g. _D_, _T_, _TP_).',
 'Extract sampling method from nuclide names.',
 'Remap nuclides name to MARIS standard.',
 'Remap Geotraces unit strings to MARIS unit IDs, rescaling measurement values by the appropriate conversion factor where units share a common MARIS unit ID (e.g. uBq/kg and mBq/kg both map to ID 3 but differ 1000x).',
 'Remap Geotraces-specific coordinate, depth, and sample-ID column names to MARIS standard nomenclature.',
 'Shift longitudes from Geotraces [0, 360] convention to MARIS [-180, 180] by subtracting 180.',
 'Split flat dataframe into per-group dict keyed by sample type (SEAWATER, SUSPENDED_MATTER, …).',
 'Parse time column from ISO8601 string to datetime.',
 'Encode time as seconds since epoch.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
 "Remap values from 'NUCLIDE' to 'NUCLIDE' for groups: dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT', 'SUSPENDED_MATTER'])."]

Feed global attributes

The global attributes that end up in the NetCDF output come from three sources:

  1. Computed from the data itself: BboxCB, DepthRangeCB, and TimeRangeCB derive spatial extent (geospatial_lat_min/max, geospatial_lon_min/max, geospatial_bounds), depth range (geospatial_vertical_min/max), and temporal coverage (time_coverage_start/end) from the columns in each sample-type group’s dataframe.
  2. Pulled from an external repository: ZoteroCB fetches bibliographic metadata (id, title, summary, creator_name) from the MARIS Zotero library using the dataset’s zotero_key, so citation details stay synchronised with the library rather than being hardcoded.
  3. Supplied as literals: KeyValuePairCB injects ad-hoc attributes like keywords (a controlled-vocabulary string for data discovery) and publisher_postprocess_logs (the transformation audit trail from tfm.logs).

get_attrs


def get_attrs(
    tfm, zotero_key,
    kw:list=['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides', 'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments', 'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality > Ocean Contaminants', 'Earth Science > Biological Classification > Animals/Vertebrates > Fish', 'Earth Science > Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks', 'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans', 'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
):

Retrieve global attributes from Geotraces dataset.

Exported source
def get_attrs(
        tfm, 
        zotero_key, 
        kw=kw
        ):
    "Retrieve global attributes from Geotraces dataset."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
zotero_metadata = get_attrs(tfm, zotero_key=zotero_key, kw=kw)
print('Keys: ', zotero_metadata.keys())
print('Title: ', zotero_metadata['title'])
Keys:  dict_keys(['geospatial_lat_min', 'geospatial_lat_max', 'geospatial_lon_min', 'geospatial_lon_max', 'geospatial_bounds', 'geospatial_vertical_max', 'geospatial_vertical_min', 'time_coverage_start', 'time_coverage_end', 'id', 'title', 'summary', 'creator_name', 'keywords', 'publisher_postprocess_logs'])
Title:  The GEOTRACES Intermediate Data Product 2017

Encoding

The encode() function below is the entry point called by the CLI tool maris_to_nc. When a user runs:

maris_to_nc geotraces --dest path/to/output.nc --src path/to/input.csv

The CLI (nbs/cli/to_nc.ipynb) resolves geotraces to this handler module (marisco.handlers.geotraces), imports its encode function, and calls it with the provided paths:

encode = import_handler('marisco.handlers.geotraces')
encode(fname_in=src, fname_out=dest)

Two conventions make this dispatch work:

  1. Each handler exposes encode() with the same signature (fname_in, fname_out, **kwargs) — the CLI doesn’t need to know what transformations happen inside.
  2. The src parameter is optional: handlers with built-in data paths (HELCOM, OSPAR, TEPCO) can be called with only fname_out, while Geotraces requires the explicit path because the raw CSV is too large to bundle.

The full orchestration is laid out below — each callback in the pipeline is documented in its own section above, and tfm.logs captures every step as an audit trail serialised into the output NetCDF.


encode


def encode(
    fname_in:str, # Path to the raw Geotraces input CSV (the IDP2021 discrete sample data)
    fname_out:str, # Destination path for the NetCDF4 output file
    **kwargs, # Extra keyword arguments (e.g. verbose)
):

Orchestrate the full Geotraces curation pipeline: load, transform, and encode to MARIS NetCDF4 format.

encode(fname_in, fname_out, verbose=False)
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.

NetCDF → CSV (MARIS DB import)

The NetCDF file is the archival format, but the MARIS master database requires input in a specific CSV layout compatible with the legacy OpenRefine import pipeline. The decode function reads the just-encoded NetCDF, reverses enum encoding back to human-readable strings (nuclide names, unit labels, etc.), appends SAMPLE_TYPE and REF_ID columns from the Zotero metadata, and saves per-group CSV files alongside the NetCDF.

The CSV step is optional — the NetCDF is the canonical output — but without it the data cannot be ingested into the central MARIS database.

#| eval: false
decode(fname_in=fname_out, verbose=True)

The resulting files (*_SEAWATER.csv, *_SUSPENDED_MATTER.csv) are then ready for verification and SQL import.
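
The enum reversal described above amounts to inverting a label-to-code lookup. The following is a minimal sketch under stated assumptions: `decode_column` is a hypothetical helper, and the lookup values are made-up placeholders rather than real MARIS IDs.

```python
def decode_column(codes, lookup):
    """Replace integer codes with labels using the inverse of a label -> code lookup."""
    inverse = {code: label for label, code in lookup.items()}
    return [inverse[code] for code in codes]


# Placeholder lookup: these IDs are illustrative only, not MARIS values.
nuclide_lut = {"nuclide_a": 1, "nuclide_b": 2}

print(decode_column([2, 1, 2], nuclide_lut))  # ['nuclide_b', 'nuclide_a', 'nuclide_b']
```

A code missing from the lookup raises a `KeyError` here, which surfaces encoding mismatches immediately instead of writing blank labels to the CSV.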