fname_in = '../../_data/geotraces/GEOTRACES_IDP2021_v2/seawater/ascii/GEOTRACES_IDP2021_Seawater_Discrete_Sample_Data_v2.csv'
fname_out = '../../_data/output/190-geotraces-2021.nc'
zotero_key = '97UIMEXN'
The BODC GEOTRACES Intermediate Data Product 2021 is one of the most comprehensive compilations of ocean radionuclide measurements to date, assembling water-column and suspended-particulate data from international oceanographic cruises worldwide.
This notebook documents the full curation workflow applied to bring that dataset into alignment with MARIS data standards: selecting the radionuclide variables within MARIS scope, reshaping the wide-format source, extracting metadata encoded in column names (unit, filtering status, sampling method), standardising nuclide nomenclature, coordinates, and units, and splitting measurements into SEAWATER and SUSPENDED_MATTER groups before encoding as a self-contained NetCDF4 file. The same workflow can be run end-to-end without inspecting the notebook via the maris_to_nc CLI tool.
Our approach is inspired by Literate Programming: code and explanation live side by side so data providers can follow the reasoning behind every curation decision and data users can understand exactly what was done to the data and why. Where the raw data contains inconsistencies or opportunities for improvement, they are flagged directly in the relevant section as feedback for future releases.
fname_in: path to the GEOTRACES seawater discrete sample data file in CSV format. The path can be defined as a relative path.
fname_out: path and filename for the NetCDF output. The path can be defined as a relative path.
Zotero key: used to retrieve attributes related to the dataset from Zotero. The MARIS datasets include a library available on Zotero.
df shape: (105417, 1188)
Index(['Cruise', 'Station:METAVAR:INDEXED_TEXT', 'Type',
'yyyy-mm-ddThh:mm:ss.sss', 'Longitude [degrees_east]',
'Latitude [degrees_north]', 'Bot. Depth [m]',
'Operator's Cruise Name:METAVAR:INDEXED_TEXT',
'Ship Name:METAVAR:INDEXED_TEXT', 'Period:METAVAR:INDEXED_TEXT',
...
'QV:SEADATANET.581', 'Co_CELL_CONC_BOTTLE [amol/cell]',
'QV:SEADATANET.582', 'Ni_CELL_CONC_BOTTLE [amol/cell]',
'QV:SEADATANET.583', 'Cu_CELL_CONC_BOTTLE [amol/cell]',
'QV:SEADATANET.584', 'Zn_CELL_CONC_BOTTLE [amol/cell]',
'QV:SEADATANET.585', 'QV:ODV:SAMPLE'],
dtype='str', length=1188)
The raw Geotraces CSV arrives in wide format with 1,188 columns, most of them non-radionuclide parameters (nutrients, trace metals, quality flags) outside MARIS scope. The first step is to select only the radionuclide columns: common_coi lists the 6 metadata columns always kept as identifiers, and nuclides_pattern matches 80 measurement columns, reducing the table to 86. The regex patterns match on measurement column names, so companion quality-flag (QV:) columns are naturally excluded. The wide structure is then reshaped to long form in a later step.
Select columns of interest from the wide Geotraces dataframe.
# Metadata columns always kept as identifiers when reshaping wide → long
common_coi = ['yyyy-mm-ddThh:mm:ss.sss', 'Longitude [degrees_east]',
'Latitude [degrees_north]', 'Bot. Depth [m]', 'DEPTH [m]', 'BODC Bottle Number:INTEGER']
# Regex patterns identifying radionuclide measurement columns
nuclides_pattern = ['^TRITI', '^Th_228', '^Th_23[024]', '^Pa_231',
'^U_236_[DT]', '^Be_', '^Cs_137', '^Pb_210', '^Po_210',
'^Ra_22[3468]', '^Np_237', '^Pu_239_[D]', '^Pu_240', '^Pu_239_Pu_240',
'^I_129', '^Ac_227']
For instance:
First ten cols: Index(['yyyy-mm-ddThh:mm:ss.sss', 'Longitude [degrees_east]',
'Latitude [degrees_north]', 'Bot. Depth [m]', 'DEPTH [m]',
'BODC Bottle Number:INTEGER', 'TRITIUM_D_CONC_BOTTLE [TU]',
'Cs_137_D_CONC_BOTTLE [uBq/kg]', 'I_129_D_CONC_BOTTLE [atoms/kg]',
'Np_237_D_CONC_BOTTLE [uBq/kg]'],
dtype='str')
Columns matched by the nuclides patterns
From the 1,188 raw columns, the patterns above select these 80 radionuclide measurement columns:
^TRITI → 1 cols e.g. ['TRITIUM_D_CONC_BOTTLE [TU]']
^Th_228 → 4 cols e.g. ['Th_228_D_CONC_PUMP [uBq/kg]', 'Th_228_D_CONC_UWAY [uBq/kg]', 'Th_228_SPT_CONC_PUMP [uBq/kg]']
^Th_23[024] → 25 cols e.g. ['Th_230_T_CONC_BOTTLE [uBq/kg]', 'Th_230_D_CONC_BOTTLE [uBq/kg]', 'Th_232_T_CONC_BOTTLE [pmol/kg]']
^Pa_231 → 8 cols e.g. ['Pa_231_D_CONC_BOTTLE [uBq/kg]', 'Pa_231_D_CONC_FISH [uBq/kg]', 'Pa_231_D_CONC_UWAY [uBq/kg]']
^U_236_[DT] → 3 cols e.g. ['U_236_D_CONC_BOTTLE [atoms/kg]', 'U_236_T_CONC_BOTTLE [atoms/kg]', 'U_236_D_CONC_FISH [atoms/kg]']
^Be_ → 2 cols e.g. ['Be_7_T_CONC_PUMP [uBq/kg]', 'Be_7_D_CONC_PUMP [uBq/kg]']
^Cs_137 → 2 cols e.g. ['Cs_137_D_CONC_BOTTLE [uBq/kg]', 'Cs_137_D_CONC_UWAY [uBq/kg]']
^Pb_210 → 7 cols e.g. ['Pb_210_D_CONC_BOTTLE [mBq/kg]', 'Pb_210_D_CONC_FISH [mBq/kg]', 'Pb_210_D_CONC_UWAY [mBq/kg]']
^Po_210 → 7 cols e.g. ['Po_210_D_CONC_BOTTLE [mBq/kg]', 'Po_210_D_CONC_FISH [mBq/kg]', 'Po_210_D_CONC_UWAY [mBq/kg]']
^Ra_22[3468] → 14 cols e.g. ['Ra_224_D_CONC_BOTTLE [mBq/kg]', 'Ra_226_D_CONC_BOTTLE [mBq/kg]', 'Ra_228_T_CONC_BOTTLE [mBq/kg]']
^Np_237 → 1 cols e.g. ['Np_237_D_CONC_BOTTLE [uBq/kg]']
^Pu_239_[D] → 1 cols e.g. ['Pu_239_D_CONC_BOTTLE [uBq/kg]']
^Pu_240 → 1 cols e.g. ['Pu_240_D_CONC_BOTTLE [uBq/kg]']
^Pu_239_Pu_240 → 2 cols e.g. ['Pu_239_Pu_240_D_CONC_BOTTLE [uBq/kg]', 'Pu_239_Pu_240_D_CONC_UWAY [uBq/kg]']
^I_129 → 1 cols e.g. ['I_129_D_CONC_BOTTLE [atoms/kg]']
^Ac_227 → 1 cols e.g. ['Ac_227_D_CONC_PUMP [uBq/kg]']
Total nuclide columns selected: 80 of 1188
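The selection logic described above can be sketched on a toy table. This is a minimal illustration, not the code of SelectColsOfInterestCB: the dataframe, the `id_cols` list, and the `Fe_D_CONC_BOTTLE [nmol/kg]` column are invented for the example, and only two of the patterns are used.

```python
import re
import pandas as pd

# Toy wide table; names imitate the Geotraces convention (values are made up).
df = pd.DataFrame({
    'DEPTH [m]': [10.0, 50.0],
    'Cs_137_D_CONC_BOTTLE [uBq/kg]': [1500.0, 1400.0],
    'QV:SEADATANET.1': [1, 1],
    'Fe_D_CONC_BOTTLE [nmol/kg]': [0.2, 0.3],
    'Th_230_T_CONC_BOTTLE [uBq/kg]': [5.0, 6.0],
})

id_cols = ['DEPTH [m]']                # hypothetical stand-in for common_coi
patterns = ['^Cs_137', '^Th_23[024]']  # two entries taken from nuclides_pattern

# A column is kept when its name matches at least one nuclide pattern.
nuclide_cols = [c for c in df.columns if any(re.match(p, c) for p in patterns)]
selected = df[id_cols + nuclide_cols]
print(list(selected.columns))
```

Because the patterns are anchored at the start of the measurement name, the quality-flag column and the trace-metal column fall out without any explicit exclusion rule.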
The raw Geotraces CSV is in wide format: each row holds up to 80 radionuclide measurements crammed into separate columns, and metadata like unit, sampling methodology, and filter status is embedded in the column names themselves (e.g. Th_230_D_CONC_BOTTLE [uBq/kg]). This is unworkable for curation. Melting to long format folds all measurements into a single VALUE column and a NUCLIDE column that carries the full column-name string, which we can then parse to extract unit, method, and filter status in the next step.
Reshape wide nuclide columns to long format so unit, method, and filter status can be extracted from column names.
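As a rough sketch of the reshape, `pandas.DataFrame.melt` does the folding. The toy data is invented, and dropping rows with no measurement is an assumption made here for illustration; it is not taken from the WideToLongCB source.

```python
import pandas as pd

# Toy wide table: one identifier column and two nuclide columns (made-up values).
wide = pd.DataFrame({
    'DEPTH [m]': [10.0, 50.0],
    'Cs_137_D_CONC_BOTTLE [uBq/kg]': [1500.0, None],
    'Th_230_D_CONC_BOTTLE [uBq/kg]': [None, 6.0],
})

long = (
    wide
    .melt(id_vars=['DEPTH [m]'], var_name='NUCLIDE', value_name='VALUE')
    .dropna(subset=['VALUE'])   # assumption: empty cells carry no measurement
    .reset_index(drop=True)
)
print(long)
```

Each surviving row now holds exactly one measurement, with the original column name kept verbatim in NUCLIDE for later parsing.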
print(f'Long format: {df_test.shape[0]} rows × {df_test.shape[1]} cols')
# id columns all preserved
for col in common_coi: test_eq(col in df_test.columns, True)
# nuclide name and value columns created
test_eq('NUCLIDE' in df_test.columns, True)
test_eq('VALUE' in df_test.columns, True)
# no original wide nuclide columns remain
test_eq(any(re.match(p, c) for p in nuclides_pattern for c in df_test.columns), False)
Long format: 26745 rows × 8 cols
Geotraces encodes unit, filtering status, and sampling method inside the column names themselves, for example Th_230_D_CONC_BOTTLE [uBq/kg] holds all three. These need to be parsed out into dedicated columns before they can drive unit conversion, MARIS nomenclature mapping, and quality checks.
Units appear in square brackets at the end of every nuclide column name. The five distinct units found in this dataset are uBq/kg, mBq/kg, TU, atoms/kg, and pmol/kg; each needs to be mapped to the corresponding MARIS unit code in a later step.
Extract measurement unit from nuclide column names (e.g. ‘Cs_137_D_CONC_BOTTLE [uBq/kg]’ → ‘uBq/kg’).
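One way to pull the unit out of a column name is a regular expression anchored on the trailing square brackets. The helper name `extract_unit` is hypothetical, and this is a sketch rather than the ExtractUnitCB implementation.

```python
import re

def extract_unit(col_name):
    "Return the text inside the trailing square brackets, or None if there is none."
    m = re.search(r'\[([^\]]+)\]\s*$', col_name)
    return m.group(1) if m else None

print(extract_unit('Cs_137_D_CONC_BOTTLE [uBq/kg]'))
print(extract_unit('TRITIUM_D_CONC_BOTTLE [TU]'))
```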
Phase codes embedded in nuclide column names encode both filtering status and sample type group. The underscore-delimited component immediately after the nuclide name indicates the phase: D (dissolved, FILT=1, SEAWATER), T (total, FILT=2, SEAWATER), and TP / LPT / SPT (suspended particulate matter fractions, all FILT=1, SUSPENDED_MATTER). These are parsed into dedicated FILT and GROUP columns that drive sample type classification and downstream quality checks.
Extract filtering status and sample-type group from nuclide column names using phase code (e.g. D, T, TP).
# Phase code embedded in column names → FILT status and sample type group
phase = {
'D': {'FILT': 1, 'group': 'SEAWATER'},
'T': {'FILT': 2, 'group': 'SEAWATER'},
'TP': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
'LPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
'SPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'}}
print(f'Groups found: {sorted(df_test.GROUP.dropna().unique())}')
print(f'Filtering values: {sorted(df_test.FILT.dropna().unique())}')
test_eq('FILT' in df_test.columns, True)
test_eq('GROUP' in df_test.columns, True)
test_eq(set(df_test.GROUP.dropna().unique()), {'SEAWATER', 'SUSPENDED_MATTER'})
test_eq(set(df_test.FILT.dropna().unique()).issubset({1, 2}), True)
Groups found: ['SEAWATER', 'SUSPENDED_MATTER']
Filtering values: [np.int64(1), np.int64(2)]
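The phase lookup can be illustrated with a small regular expression. The sketch assumes every measurement name contains `_CONC_` directly after the phase code, which holds for the column names listed on this page but is not verified against the full dataset; `extract_phase` is a hypothetical helper, not the ExtractFilteringStatusCB code.

```python
import re

# Same mapping as the phase dictionary defined above.
phase = {
    'D':   {'FILT': 1, 'group': 'SEAWATER'},
    'T':   {'FILT': 2, 'group': 'SEAWATER'},
    'TP':  {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
    'LPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
    'SPT': {'FILT': 1, 'group': 'SUSPENDED_MATTER'},
}

# Assumption: the phase code is the component right before '_CONC_'.
PHASE_RE = re.compile(r'_(LPT|SPT|TP|D|T)_CONC_')

def extract_phase(col_name):
    "Return the FILT/group entry for a nuclide column name, or None if no phase code is found."
    m = PHASE_RE.search(col_name)
    return phase[m.group(1)] if m else None

print(extract_phase('Th_228_SPT_CONC_PUMP [uBq/kg]'))
```

Anchoring on `_CONC_` rather than on a fixed position keeps the rule valid for compound names such as Pu_239_Pu_240, where the nuclide part itself contains several underscores.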
Sampling method codes appear as the last underscore component before the unit brackets in nuclide column names; for example, BOTTLE in Th_230_D_CONC_BOTTLE [uBq/kg]. The four distinct methods found in this dataset are BOTTLE (rosette/CTD bottle, code 1), FISH (continuous towfish, code 18), PUMP (in situ pump, code 14), and UWAY (underway uncontaminated seawater supply, code 24). These are mapped to MARIS sampling method codes and recorded in the SAMP_MET column, enabling sample classification and cross-dataset comparisons between different collection techniques.
Extract sampling method from nuclide names.
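A minimal sketch of the method lookup follows. The codes are the ones quoted above, but the dictionary layout and the helper name `extract_sampling_method` are assumptions; the actual structure of smp_method is not shown on this page.

```python
import re

# Codes as listed in the text; the dict shape is illustrative only.
method_codes = {'BOTTLE': 1, 'FISH': 18, 'PUMP': 14, 'UWAY': 24}

# Last underscore component sitting directly before the bracketed unit.
METHOD_RE = re.compile(r'_([A-Za-z]+)\s*\[[^\]]*\]\s*$')

def extract_sampling_method(col_name):
    "Return the sampling method code for a nuclide column name, or None if unknown."
    m = METHOD_RE.search(col_name)
    return method_codes.get(m.group(1)) if m else None

print(extract_sampling_method('Pa_231_D_CONC_FISH [uBq/kg]'))
```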
Geotraces nuclide column names begin with provider-specific strings (e.g. TRITIUM, Pu_239_Pu_240, Th_230, U_236) that must be remapped to MARIS standard nomenclature before any lookup tables can be applied. Most names follow a regular pattern: strip the phase-code suffix (_D, _T, _TP, etc.), then lowercase and remove underscores — Th_230 becomes th230, U_236 becomes u236. Two exceptions need explicit overrides: TRITIUM maps to h3 (the standard nuclide symbol for tritium), and Pu_239_Pu_240 is a combined total activity that MARIS records as pu239_240_tot. The RenameNuclideCB applies the override dictionary first, then falls back to the general lowercasing rule for everything else.
Remap nuclides name to MARIS standard.
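The override-then-fallback rule can be sketched as follows. The two overrides come from the text above; the helper name, the regular expression, and the assumption that `_CONC_` follows the phase code are illustrative and not taken from RenameNuclideCB.

```python
import re

# Explicit exceptions described in the text.
overrides = {'TRITIUM': 'h3', 'Pu_239_Pu_240': 'pu239_240_tot'}

# Everything from the phase code onwards is dropped (assumes '_CONC_' follows it).
SUFFIX_RE = re.compile(r'_(LPT|SPT|TP|D|T)_CONC_.*$')

def rename_nuclide(col_name):
    "Map a Geotraces column name to a MARIS-style nuclide name."
    stem = SUFFIX_RE.sub('', col_name)
    if stem in overrides:
        return overrides[stem]
    # General rule: lowercase and remove underscores.
    return stem.lower().replace('_', '')

for name in ['Th_230_D_CONC_BOTTLE [uBq/kg]', 'TRITIUM_D_CONC_BOTTLE [TU]']:
    print(name, '->', rename_nuclide(name))
```

Checking the overrides before the general rule matters for Pu_239_Pu_240, which would otherwise collapse to pu239pu240.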
Nuclides after rename: ['ac227', 'be7', 'cs137', 'h3', 'i129', 'np237', 'pa231', 'pb210', 'po210', 'pu239', 'pu239_240_tot', 'pu240', 'ra223', 'ra224', 'ra226', 'ra228', 'th228', 'th230', 'th232', 'th234', 'u236']
<ArrowStringArray>
[ 'h3', 'cs137', 'i129', 'np237',
'pu239', 'pu239_240_tot', 'pu240', 'u236',
'pa231', 'pb210', 'po210', 'ra224',
'ra226', 'ra228', 'th230', 'th232',
'th234', 'ac227', 'be7', 'ra223',
'th228']
Length: 21, dtype: str
Several measurements are negative (see grouped counts below). Please review these values and provide detection-limit flags or handling guidance in future data releases.
Geotraces encodes units inside nuclide column names, and five distinct units appear across the dataset: TU, uBq/kg, mBq/kg, atoms/kg, and pmol/kg. Some of these share a common MARIS unit ID despite different magnitudes — uBq/kg and mBq/kg both map to unit ID 3 but differ by a factor of 1000, which must be accounted for in the conversion factor. Similarly, pmol/kg must be converted via Avogadro’s number before it matches the atoms/kg unit ID. The mapping below handles both the unit remapping and the value rescaling:
Remap Geotraces unit strings to MARIS unit IDs, rescaling measurement values by the appropriate conversion factor where units share a common MARIS unit ID (e.g. uBq/kg and mBq/kg both map to ID 3 but differ 1000x).
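The rescaling arithmetic can be illustrated on a toy table. The target units (Bq/kg, atoms/kg, TU) and the `to_target` layout are assumptions for this sketch; the real units_lut and its MARIS unit IDs are not reproduced on this page. One picomole corresponds to 1e-12 × 6.02214076e23 ≈ 6.022e11 atoms.

```python
import pandas as pd

AVOGADRO = 6.02214076e23  # mol^-1, exact by SI definition

# Hypothetical lookup: source unit -> (assumed target unit, multiplicative factor).
to_target = {
    'uBq/kg':   ('Bq/kg', 1e-6),
    'mBq/kg':   ('Bq/kg', 1e-3),
    'pmol/kg':  ('atoms/kg', 1e-12 * AVOGADRO),
    'atoms/kg': ('atoms/kg', 1.0),
    'TU':       ('TU', 1.0),
}

df = pd.DataFrame({
    'VALUE': [1500.0, 2.0, 1.0],
    'UNIT':  ['uBq/kg', 'mBq/kg', 'pmol/kg'],
})

# Rescale first, then overwrite the unit label, so the factor uses the source unit.
df['VALUE'] = df['VALUE'] * df['UNIT'].map(lambda u: to_target[u][1])
df['UNIT'] = df['UNIT'].map(lambda u: to_target[u][0])
print(df)
```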
Geotraces uses provider-specific column names for coordinates, depth, and sample identifiers — with units and metadata embedded as suffixes in brackets — that don’t match MARIS standard nomenclature (TIME, LON, LAT, TOT_DEPTH, SMP_DEPTH, SMP_ID_PROVIDER). These are remapped via RenameColumnCB before NetCDF encoding.
Remap Geotraces-specific coordinate, depth, and sample-ID column names to MARIS standard nomenclature.
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules)
])
df_test = tfm()
Columns after rename: ['TIME', 'LON', 'LAT', 'TOT_DEPTH', 'SMP_DEPTH', 'SMP_ID_PROVIDER', 'NUCLIDE', 'VALUE', 'UNIT', 'FILT', 'GROUP', 'SAMP_MET']
Geotraces encodes longitudes in the [0, 360] range (e.g. 230°E instead of −130°), which is incompatible with the MARIS [-180, 180] convention. The callback subtracts 180 to realign all longitudes to the standard range.
Shift longitudes from Geotraces [0, 360] convention to MARIS [-180, 180] by subtracting 180.
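For reference, the general mapping from the [0, 360) convention to [-180, 180) is a modular wrap: values up to 180 are left unchanged and larger values are reduced by 360, so 230°E becomes −130°. A uniform shift of 180 would instead map 230 to 50. The sketch below shows the wrap only; `wrap_longitude` is a hypothetical helper and is not the code of UnshiftLongitudeCB, which is not reproduced in this section.

```python
import numpy as np

def wrap_longitude(lon):
    "Map longitudes from [0, 360) to [-180, 180) with a modular wrap."
    lon = np.asarray(lon, dtype=float)
    return (lon + 180.0) % 360.0 - 180.0

print(wrap_longitude([0.0, 10.0, 230.0, 350.0]))
```

Comparing a few known station positions before and after the transformation is a quick way to confirm which of the two conventions a pipeline actually applies.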
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB()
])
df_test = tfm()
The pipeline so far produces a single flat dataframe containing both seawater and suspended-particulate-matter measurements side by side. These two sample types belong to separate NetCDF4 groups (and use different units, different detection-limit conventions, etc.), so they need to be split into per-group dataframes before encoding. The DispatchToGroupCB partitions the flat result by the GROUP column and drops the column; the group label becomes the dict key rather than persisting as a data column.
Split flat dataframe into per-group dict keyed by sample type (SEAWATER, SUSPENDED_MATTER, …).
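The partitioning step amounts to a groupby turned into a dictionary. A minimal sketch with invented data, not the DispatchToGroupCB source:

```python
import pandas as pd

df = pd.DataFrame({
    'NUCLIDE': ['h3', 'th234', 'cs137'],
    'VALUE':   [0.7, 12.0, 1.5],
    'GROUP':   ['SEAWATER', 'SUSPENDED_MATTER', 'SEAWATER'],
})

# One dataframe per sample type; GROUP becomes the dict key and is dropped as a column.
dfs = {
    grp: gdf.drop(columns='GROUP').reset_index(drop=True)
    for grp, gdf in df.groupby('GROUP')
}
print({grp: len(gdf) for grp, gdf in dfs.items()})
```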
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB()
])
dfs_test = tfm()
Groups: ['SEAWATER', 'SUSPENDED_MATTER']
After wide-to-long melting, each BODC Bottle Number (renamed to SMP_ID_PROVIDER) appears once per measured nuclide — 8,779 distinct provider IDs across 19,139 seawater rows, and 1,849 across 7,606 suspended-matter rows. The provider ID is not a row-level identifier, so a sequential SMP_ID is generated per group to serve as the NetCDF dimension index. For traceability the provider’s stable bottle number is preserved as SMP_ID_PROVIDER and cast to str for NetCDF VLEN compatibility.
SEAWATER: 19139 rows, 8779 unique provider IDs
SUSPENDED_MATTER: 7606 rows, 1849 unique provider IDs
Assign a sequential SMP_ID per sample-type group; cast SMP_ID_PROVIDER (BODC Bottle Number) to string for NetCDF VLEN compatibility.
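A sketch of the ID assignment on toy per-group dataframes. The bottle numbers are made up, and the loop is an illustration rather than the AddSampleIDCB implementation.

```python
import pandas as pd

dfs = {
    'SEAWATER': pd.DataFrame({'SMP_ID_PROVIDER': [842525, 842525, 842528]}),
    'SUSPENDED_MATTER': pd.DataFrame({'SMP_ID_PROVIDER': [900001]}),
}

for grp, gdf in dfs.items():
    gdf = gdf.reset_index(drop=True)
    # Row-level identifier, restarting at 1 in every group.
    gdf['SMP_ID'] = range(1, len(gdf) + 1)
    # Provider bottle number kept for traceability, stored as text.
    gdf['SMP_ID_PROVIDER'] = gdf['SMP_ID_PROVIDER'].astype(str)
    dfs[grp] = gdf

print(dfs['SEAWATER'])
```

The repeated bottle number in the seawater group shows why the provider ID cannot serve as the row index: one bottle yields one row per measured nuclide.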
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
AddSampleIDCB()
])
dfs_test = tfm()
for grp, gdf in dfs_test.items():
    print(f'{grp}: SMP_ID range 1–{gdf.SMP_ID.max()}, SMP_ID_PROVIDER dtype={gdf.SMP_ID_PROVIDER.dtype}')
# SMP_ID is sequential from 1
test_eq(dfs_test['SEAWATER']['SMP_ID'].iloc[0], 1)
test_eq(dfs_test['SUSPENDED_MATTER']['SMP_ID'].iloc[0], 1)
# SMP_ID_PROVIDER cast to string for NetCDF VLEN compatibility
test_eq(dfs_test['SEAWATER']['SMP_ID_PROVIDER'].dtype, 'str')
SEAWATER: SMP_ID range 1–19139, SMP_ID_PROVIDER dtype=str
SUSPENDED_MATTER: SMP_ID range 1–7606, SMP_ID_PROVIDER dtype=str
Geotraces timestamps arrive as ISO 8601 strings in a single column (yyyy-mm-ddThh:mm:ss.sss). ParseTimeCB converts these to pandas datetime objects, enabling temporal filtering and NetCDF-compatible time encoding downstream.
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB()
])
dfs_test = tfm()
Geotraces timestamps arrive as ISO 8601 strings. After ParseTimeCB converts them to datetime64[us] they are still not in a NetCDF-compatible format. The MARIS NetCDF CDL template stores time as seconds since 1970-01-01, so EncodeTimeCB converts each datetime to its Unix timestamp integer. Downstream the NetCDF file declares units: seconds since 1970-01-01T00:00:00Z on the TIME variable, which client software can decode back to calendar dates on read.
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB()
])
dfs_test = tfm()
print('TIME sample (epoch seconds):', dfs_test['SEAWATER']['TIME'].iloc[:5].values)
TIME sample (epoch seconds): [1287274409 1287274409 1287274409 1287274409 1287274409]
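The two time steps can be reproduced with plain pandas. The input strings below are constructed so that the first one corresponds to the epoch value printed above; they are not copied from the raw file, and the code is a sketch rather than the ParseTimeCB and EncodeTimeCB source.

```python
import pandas as pd

raw = pd.Series(['2010-10-17T00:13:29.000', '2010-10-17T00:13:30.500'])

# Step 1: ISO 8601 strings -> datetime values.
parsed = pd.to_datetime(raw)

# Step 2: datetime -> whole seconds since 1970-01-01 (independent of the datetime resolution).
epoch = (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
print(epoch.tolist())
```

Dividing a timedelta by one second avoids hard-coding the nanosecond or microsecond resolution of the underlying dtype.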
SanitizeLonLatCB normalises comma decimal separators to dots for longitude and latitude values, and drops rows whose coordinates are exactly (0, 0) or fall outside the valid ranges (lon ∉ [-180, 180], lat ∉ [-90, 90]).
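The sanitising rules can be sketched on a toy table with one valid row and three rows that each break one rule. The data is invented and the code is an illustration, not the SanitizeLonLatCB source.

```python
import pandas as pd

df = pd.DataFrame({
    'LON': ['170,5', '0', '200.0', '-9.66'],
    'LAT': ['38,3',  '0', '10.0',  '95.0'],
})

# Normalise comma decimal separators, then convert to float.
for col in ['LON', 'LAT']:
    df[col] = df[col].astype(str).str.replace(',', '.', regex=False).astype(float)

in_range = df['LON'].between(-180, 180) & df['LAT'].between(-90, 90)
null_island = (df['LON'] == 0) & (df['LAT'] == 0)

clean = df[in_range & ~null_island].reset_index(drop=True)
print(clean)
```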
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB()
])
dfs_test = tfm()
dfs_test['SEAWATER'].head()
| | TIME | LON | LAT | TOT_DEPTH | SMP_DEPTH | SMP_ID_PROVIDER | NUCLIDE | VALUE | UNIT | FILT | SAMP_MET |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9223 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | h3 | 0.733 | 7 | 1 | 1 |
| 9231 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | h3 | 0.696 | 7 | 1 | 1 |
| 9237 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | h3 | 0.718 | 7 | 1 | 1 |
| 9244 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | h3 | 0.709 | 7 | 1 | 1 |
| 9256 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | h3 | 0.692 | 7 | 1 | 1 |
At this point the pipeline holds nuclides as human-readable strings (h3, cs137, …) but the NetCDF file stores them as integer enumeration types for space efficiency. The mapping from standardised name to MARIS nuclide ID is defined by the lookup table below, which RemapCB applies before encoding. For example h3 → 1 and cs137 → 33.
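In its simplest form, the remapping is a dictionary lookup applied to the NUCLIDE column. The sketch uses only the two pairs quoted above; `nuclide_ids` is a hypothetical stand-in for the table returned by lut_nuclides.

```python
import pandas as pd

# Only the two pairs quoted in the text; the real lookup covers every MARIS nuclide.
nuclide_ids = {'h3': 1, 'cs137': 33}

df = pd.DataFrame({'NUCLIDE': ['h3', 'cs137', 'h3']})

# Fail loudly on names missing from the lookup instead of silently writing NaN.
unmapped = set(df['NUCLIDE']) - set(nuclide_ids)
if unmapped:
    raise ValueError(f'No MARIS ID for: {sorted(unmapped)}')

df['NUCLIDE'] = df['NUCLIDE'].map(nuclide_ids)
print(df['NUCLIDE'].tolist())
```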
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
RemapCB(fn_lut=lut_nuclides, col_remap='NUCLIDE', col_src='NUCLIDE')
])
dfs_test = tfm()
dfs_test['SEAWATER'].NUCLIDE.unique()
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.
array([ 1, 33, 28, 65, 68, 77, 69, 108, 107, 41, 47, 51, 53,
54, 106, 59, 60, 144, 2, 50, 57])
Each callback’s docstring is recorded in tfm.logs during pipeline execution — an ordered audit trail of every transformation applied. These logs are serialised into the NetCDF output’s global attribute publisher_postprocess_logs, providing traceability for downstream users.
The two “not found” messages for BIOTA and SEDIMENT are expected: this dataset (GEOTRACES IDP2021 seawater) only contains SEAWATER and SUSPENDED_MATTER sample types.
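The docstring-as-log idea can be illustrated with a stripped-down pipeline runner. The classes below (`MiniTransformer`, `DoubleCB`, `AddOneCB`) are hypothetical and much simpler than the Transformer used in this notebook; they only show how running a callback and recording its docstring can go together.

```python
class MiniTransformer:
    "Run callbacks in order and record each callback's docstring."
    def __init__(self, data, cbs):
        self.data, self.cbs, self.logs = data, cbs, []

    def __call__(self):
        for cb in self.cbs:
            cb(self)                      # the callback mutates self.data
            self.logs.append(cb.__doc__)  # its docstring becomes the audit entry
        return self.data


class DoubleCB:
    "Multiply every value by two."
    def __call__(self, tfm):
        tfm.data = [x * 2 for x in tfm.data]


class AddOneCB:
    "Add one to every value."
    def __call__(self, tfm):
        tfm.data = [x + 1 for x in tfm.data]


tfm_demo = MiniTransformer([1, 2, 3], cbs=[DoubleCB(), AddOneCB()])
result = tfm_demo()
print(result, tfm_demo.logs)
```

Because the log is derived from the docstrings, keeping each docstring accurate is what keeps the audit trail accurate.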
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
RemapCB(fn_lut=lut_nuclides, col_remap='NUCLIDE', col_src='NUCLIDE')
])
tfm();
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.
['Select columns of interest from the wide Geotraces dataframe.',
'Reshape wide nuclide columns to long format so unit, method, and filter status can be extracted from column names.',
"Extract measurement unit from nuclide column names (e.g. 'Cs_137_D_CONC_BOTTLE [uBq/kg]' → 'uBq/kg').",
'Extract filtering status and sample-type group from nuclide column names using phase code (e.g. _D_, _T_, _TP_).',
'Extract sampling method from nuclide names.',
'Remap nuclides name to MARIS standard.',
'Remap Geotraces unit strings to MARIS unit IDs, rescaling measurement values by the appropriate conversion factor where units share a common MARIS unit ID (e.g. uBq/kg and mBq/kg both map to ID 3 but differ 1000x).',
'Remap Geotraces-specific coordinate, depth, and sample-ID column names to MARIS standard nomenclature.',
'Shift longitudes from Geotraces [0, 360] convention to MARIS [-180, 180] by subtracting 180.',
'Split flat dataframe into per-group dict keyed by sample type (SEAWATER, SUSPENDED_MATTER, …).',
'Parse time column from ISO8601 string to datetime.',
'Encode time as seconds since epoch.',
'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
"Remap values from 'NUCLIDE' to 'NUCLIDE' for groups: dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT', 'SUSPENDED_MATTER'])."]
The global attributes that end up in the NetCDF output come from three sources:
BboxCB, DepthRangeCB, and TimeRangeCB derive spatial extent (geospatial_lat_min/max, geospatial_lon_min/max, geospatial_bounds), depth range (geospatial_vertical_min/max), and temporal coverage (time_coverage_start/end) from the columns in each sample-type group’s dataframe.
ZoteroCB fetches bibliographic metadata (id, title, summary, creator_name) from the MARIS Zotero library using the dataset’s zotero_key, so citation details stay synchronised with the library rather than being hardcoded.
KeyValuePairCB injects ad-hoc attributes like keywords (a controlled-vocabulary string for data discovery) and publisher_postprocess_logs (the transformation audit trail from tfm.logs).
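As an illustration of the first source, the latitude and longitude extents can be derived by pooling coordinates across the per-group dataframes. The data is invented, the attribute values are stored as plain floats for simplicity, and the sketch does not reproduce BboxCB (in particular it omits geospatial_bounds).

```python
import pandas as pd

dfs = {
    'SEAWATER': pd.DataFrame({'LON': [-9.7, 10.0], 'LAT': [38.3, 40.0]}),
    'SUSPENDED_MATTER': pd.DataFrame({'LON': [-30.0], 'LAT': [-5.0]}),
}

# Pool coordinates from every group before taking the extremes.
coords = pd.concat([gdf[['LON', 'LAT']] for gdf in dfs.values()], ignore_index=True)

attrs = {
    'geospatial_lat_min': float(coords['LAT'].min()),
    'geospatial_lat_max': float(coords['LAT'].max()),
    'geospatial_lon_min': float(coords['LON'].min()),
    'geospatial_lon_max': float(coords['LON'].max()),
}
print(attrs)
```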
def get_attrs(
tfm, zotero_key,
kw:list=['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides', 'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments', 'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality > Ocean Contaminants', 'Earth Science > Biological Classification > Animals/Vertebrates > Fish', 'Earth Science > Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks', 'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans', 'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
):
Retrieve global attributes from Geotraces dataset.
def get_attrs(
    tfm,
    zotero_key,
    kw=kw
    ):
    "Retrieve global attributes from Geotraces dataset."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
Keys: dict_keys(['geospatial_lat_min', 'geospatial_lat_max', 'geospatial_lon_min', 'geospatial_lon_max', 'geospatial_bounds', 'geospatial_vertical_max', 'geospatial_vertical_min', 'time_coverage_start', 'time_coverage_end', 'id', 'title', 'summary', 'creator_name', 'keywords', 'publisher_postprocess_logs'])
Title: The GEOTRACES Intermediate Data Product 2017
The encode() function below is the entry point called by the CLI tool maris_to_nc. When a user runs:
The CLI (nbs/cli/to_nc.ipynb) resolves geotraces to this handler module (marisco.handlers.geotraces), imports its encode function, and calls it with the provided paths:
Two conventions make this dispatch work:
Every handler exposes encode() with the same signature (fname_in, fname_out, **kwargs), so the CLI doesn’t need to know what transformations happen inside.
The src parameter is optional: handlers with built-in data paths (HELCOM, OSPAR, TEPCO) can be called with only fname_out, while Geotraces requires the explicit path because the raw CSV is too large to bundle.
The full orchestration is laid out below: each callback in the pipeline is documented in its own section above, and tfm.logs captures every step as an audit trail serialised into the output NetCDF.
Orchestrate the full Geotraces curation pipeline: load, transform, and encode to MARIS NetCDF4 format.
The NetCDF file is the archival format, but the MARIS master database requires input in a specific CSV layout compatible with the legacy OpenRefine import pipeline. The decode function reads the just-encoded NetCDF, reverses enum encoding back to human-readable strings (nuclide names, unit labels, etc.), appends SAMPLE_TYPE and REF_ID columns from the Zotero metadata, and saves per-group CSV files alongside the NetCDF.
The CSV step is optional — the NetCDF is the canonical output — but without it the data cannot be ingested into the central MARIS database.
The resulting files (*_SEAWATER.csv, *_SUSPENDED_MATTER.csv) are then ready for verification and SQL import.