This handler ingests raw HELCOM (Helsinki Commission — Baltic Marine Environment Protection Commission) Monitoring of Radioactive Substances (MORS) data and transforms it into the MARIS NetCDF format through a pipeline that standardises nomenclatures, parses time and coordinates, melts dual-value sediment rows into long format, and computes uncertainties, detection-limit flags, and weight variables.

For detailed guidance on the reconciliation workflow used throughout this handler, see the writing-a-handler and reconcile-nomenclature how-to guides.

For the MARIS data model and field conventions, see the reference guide and field definitions.

The pipeline processes the data through these main stages:

Configuration & file paths

  • src_dir: path to the maris-crawlers folder containing the HELCOM data in CSV format.

  • fname_out: path and filename for the NetCDF output. The path can be relative.

  • Zotero key: used to retrieve attributes related to the dataset from Zotero. The MARIS datasets include a library available on Zotero.

Exported source
src_dir = 'https://raw.githubusercontent.com/franckalbinet/maris-crawlers/refs/heads/main/data/processed/HELCOM%20MORS'
fname_out = '../../_data/output/100-HELCOM-MORS-2024.nc'
zotero_key = '26VMZZ2Q'  # HELCOM MORS Zotero key

Load data

HELCOM MORS (Monitoring of Radioactive Substances in the Baltic Sea) data is provided as a zipped Microsoft Access database. A GitHub Action in the maris-crawlers repository automatically fetches this database and exports its tables as .csv files.

The dataset is then available in a format that the marisco data pipeline can read directly.



load_data


def load_data(
    fname_in, # Path to raw HELCOM csv dataset
):

Load HELCOM data; returns dict of DataFrames keyed by sample type.

Exported source
default_smp_types = {  
    'BIO': 'BIOTA', 
    'SEA': 'SEAWATER', 
    'SED': 'SEDIMENT'
}

dfs is a dictionary of dataframes built from the HELCOM dataset located at src_dir. Each dataframe holds the records of one sample type and is stored under the key for that sample type (BIOTA, SEAWATER, SEDIMENT).

dfs = load_data(src_dir)
test_eq(list(dfs.keys()), ['BIOTA', 'SEAWATER', 'SEDIMENT'])
for k,v in dfs.items():
    print(f"{k}: {v.shape[0]} rows, {v.shape[1]} cols")
    print(v.columns.tolist(), '\n')
BIOTA: 16124 rows, 33 cols
['key', 'country', 'laboratory', 'sequence', 'date', 'year', 'month', 'day', 'station', 'latitude ddmmmm', 'latitude dddddd', 'longitude ddmmmm', 'longitude dddddd', 'sdepth', 'rubin', 'biotatype', 'tissue', 'no', 'length', 'weight', 'dw%', 'loi%', 'mors_subbasin', 'helcom_subbasin', 'date_of_entry_x', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'basis', 'error%', 'number', 'date_of_entry_y'] 

SEAWATER: 21626 rows, 27 cols
['key', 'country', 'laboratory', 'sequence', 'date', 'year', 'month', 'day', 'station', 'latitude (ddmmmm)', 'latitude (dddddd)', 'longitude (ddmmmm)', 'longitude (dddddd)', 'tdepth', 'sdepth', 'salin', 'ttemp', 'filt', 'mors_subbasin', 'helcom_subbasin', 'date_of_entry_x', 'nuclide', 'method', '< value_bq/m³', 'value_bq/m³', 'error%_m³', 'date_of_entry_y'] 

SEDIMENT: 40743 rows, 35 cols
['key', 'country', 'laboratory', 'sequence', 'date', 'year', 'month', 'day', 'station', 'latitude (ddmmmm)', 'latitude (dddddd)', 'longitude (ddmmmm)', 'longitude (dddddd)', 'device', 'tdepth', 'uppsli', 'lowsli', 'area', 'sedi', 'oxic', 'dw%', 'loi%', 'mors_subbasin', 'helcom_subbasin', 'sum_link', 'date_of_entry_x', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'error%_kg', '< value_bq/m²', 'value_bq/m²', 'error%_m²', 'date_of_entry_y'] 

Normalize nuclide names

Fix trailing spaces

Important: FEEDBACK TO DATA PROVIDER

Trailing whitespace in nuclide names: ~325 rows across the dataset contain nuclide values with one or more trailing spaces (e.g. ‘PU238’, ‘CS137’, ‘SR90’). These should be trimmed at source.

For instance, rows where the raw nuclide name has trailing whitespace:

bad = pd.concat(dfs.values(), ignore_index=True).query('nuclide != nuclide.str.strip()')
print(f"{len(bad)} rows with trailing spaces. Examples:\n")
print(bad.drop_duplicates('nuclide')['nuclide'].to_list()[:8])
325 rows with trailing spaces. Examples:

['PU238   ', 'AM241   ', 'CS137    ', 'CS137   ', 'CS134   ', 'CO60    ', 'K40     ', 'SR90    ']

LowerStripNameCB lowercases and strips them into a standardised NUCLIDE column.

tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE')])
tfm()

for df in tfm.dfs.values():
    test_eq(df['NUCLIDE'], df['NUCLIDE'].str.lower().str.strip())
print(f"All nuclide names normalised across {len(tfm.dfs)} sample groups.")
All nuclide names normalised across 3 sample groups.
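
As a rough illustration of what this normalisation amounts to, here is a minimal sketch in plain Python. It does not use marisco; the helper name lower_strip is hypothetical, and the raw values are taken from the examples printed above.

```python
def lower_strip(name: str) -> str:
    """Lowercase a provider nuclide name and trim surrounding whitespace."""
    return name.strip().lower()

# Raw values copied from the trailing-space examples above
raw = ['PU238   ', 'CS137    ', 'CS137   ', 'K40     ']

normalised = [lower_strip(n) for n in raw]
unique_after = sorted(set(normalised))

# The two differently padded CS137 variants collapse into a single name
print(normalised)
print(unique_after)
```

Without this step, 'CS137    ' and 'CS137   ' would be treated as two distinct nuclides when building the lookup table.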

Align nuclide names with MARIS

HELCOM nuclide names are lowercased and stripped by LowerStripNameCB above, but some names need expert overrides: combined totals such as cs134137 (caesium-134+137 sum, which maps to cs134_137_tot) and cm243244 (curium-243+244), and apparent typos such as cs143 (likely cs137). We reconcile these by following the same semi-automated workflow used across marisco handlers:

  1. get familiar with the provider’s codes,
  2. try an automatic mapping,
  3. fix what it got wrong,
  4. and check the result.

We derive the unique nuclide values from the data after lowercase/normalisation, then fuzzy‑match them against the MARIS nuclide reference table.

Try an automatic mapping

Derive unique provider values and fuzzy-match against MARIS reference.

provider_lut = lut_from(tfm(), 'NUCLIDE')
maris_ref = get_lut('NUCLIDE', as_df=True)

print("provider_lut:", provider_lut.columns.tolist())
print("maris_ref:   ", maris_ref.columns.tolist())

merged = fuzzy_merge(provider_lut, maris_ref, left_on='value', right_on='nc_name')
provider_lut: ['value']
maris_ref:    ['nuclide_id', 'nc_name']

Inspect the borderline matches

Review non-exact matches to identify cases the fuzzy matcher could not resolve.

# Entries with score > 0 need human review
non_exact = merged[merged.score > 0].sort_values('score', ascending=False)
print(non_exact)
       value  nuclide_id nc_name  score
12  cm243244          73   cm242      3
18  cs134137          31   cs134      3
49  pu239240          68   pu239      3
47  pu238240          67   pu238      3
25     cs143           3     c14      2
24     cs142           3     c14      2
27     cs145           3     c14      2
22     cs140         129   ce140      1
21     cs139          31   cs134      1
20     cs138          31   cs134      1
23     cs141          36   ce141      1
26     cs144          31   cs134      1
36      k-40           4     k40      1
28     cs146         102   cs136      1

The table above shows the borderline cases. Some are legitimate combined-total nuclides (cs134137, cm243244, pu239240, pu238240) that should map to their MARIS _tot counterparts. Others look like typos or historical artefacts: cs138 to cs146 (cs143, cs145, cs142, cs141, cs144, cs140, cs146, cs139, cs138) are treated as variants of cs137, and k-40 is simply k40 with a hyphen. These overrides are captured below.
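
The scores in the table are consistent with a character-level edit distance (for example k-40 is one deletion away from k40). Assuming that is roughly what the fuzzy matcher computes (the internals of fuzzy_merge are not shown here, so this is an assumption), the sketch below implements a plain Levenshtein distance and scores a few provider values against a toy reference list. The function name and the reduced reference list are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

# Toy reference list: a tiny subset of the names appearing in the table above
reference = ['c14', 'cs134', 'cs137', 'k40']

for value in ['k-40', 'cs143', 'cs134137']:
    scores = {ref: levenshtein(value, ref) for ref in reference}
    print(value, scores)
```

Under this measure cs143 is equally close to c14 and to cs137 (distance 2 in both cases), which illustrates why a purely string-based match cannot settle such cases and an expert override is needed.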

Fix what it got wrong

Apply expert overrides for cases the fuzzy match could not resolve correctly.

Important: FEEDBACK TO DATA PROVIDER

Inconsistent nuclide naming conventions: most nuclide names follow the standard alphanumeric format (e.g. cs137, k40), but a few entries are inconsistent:

  • k-40 uses a hyphen, unlike other entries (should be k40).

  • A cluster of entries (cs138 to cs146) appear to be typos for cs137.

A standardised nuclide pick-list at the point of entry would prevent these issues.

Exported source
fixes_nuclide_names = {
    'cs134137': 'cs134_137_tot',
    'cm243244': 'cm243_244_tot',
    'pu239240': 'pu239_240_tot',
    'pu238240': 'pu238_240_tot',
    'cs143': 'cs137',
    'cs145': 'cs137',
    'cs142': 'cs137',
    'cs141': 'cs137',
    'cs144': 'cs137',
    'k-40': 'k40',
    'cs140': 'cs137',
    'cs146': 'cs137',
    'cs139': 'cs137',
    'cs138': 'cs137'
    }

The dictionary above records our expert decisions for every non-exact match, whether the fuzzy matcher got it wrong or merely scored it above 0. Each entry maps a provider nuclide value to its correct MARIS nc_name. The fix_lut function applies these overrides and resets the score to 0.

fixed = fix_lut(merged, fixes_nuclide_names, maris_ref,
                left_on='value', right_on='nc_name', id_col='nuclide_id')

# Verify: no unresolved matches remain
unresolved = fixed[fixed['score'] > 0]
print(unresolved if len(unresolved) else "All nuclide entries resolved. ✓")
All nuclide entries resolved. ✓
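
To make the override step concrete, here is a minimal sketch of the idea in plain Python. It is not the marisco fix_lut implementation: apply_fixes, the auto dictionary and the reduced set of valid names are hypothetical, and the (match, score) pairs are copied from the borderline table above.

```python
# Hypothetical result of the automatic step: provider value -> (matched name, score)
auto = {
    'cs137':    ('cs137', 0),
    'k-40':     ('k40', 1),
    'cs143':    ('c14', 2),
    'cs134137': ('cs134', 3),
}

# A deliberately incomplete set of expert fixes, to show what "unresolved" means
fixes = {'cs143': 'cs137', 'cs134137': 'cs134_137_tot'}

# Toy stand-in for the names available in the reference table
valid_names = {'c14', 'cs134', 'cs137', 'k40', 'cs134_137_tot'}

def apply_fixes(auto, fixes, valid_names):
    """Overwrite matches with expert fixes and reset their score to 0."""
    fixed = dict(auto)
    for value, target in fixes.items():
        if target not in valid_names:
            raise ValueError(f"{target!r} is not a known reference name")
        fixed[value] = (target, 0)
    return fixed

fixed = apply_fixes(auto, fixes, valid_names)
unresolved = {v: m for v, m in fixed.items() if m[1] > 0}
print(unresolved)
```

Here k-40 stays unresolved even though its automatic match was already right, because its score is still above 0. This is why the handler's dictionary also lists 'k-40': 'k40' explicitly.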

Assemble the final mapping

The four steps above (unique values, fuzzy match, expert overrides, verification) told us what the correct MARIS translations are. The make_lut function packages that knowledge, the expert fixes and the MARIS reference table, into a single function that the Transformer can call later, when it is processing the data through the pipeline.

Exported source
# Resolved nuclide lookup table (provider → MARIS nuclide_id); lazy, resolves at Transformer time
nuclide_lut = make_lut('NUCLIDE', fixes=fixes_nuclide_names)

The nuclide_lut lookup table is passed to the generic RemapCB callback, which looks up the MARIS nuclide reference table behind the scenes when the Transformer runs. The mapping overwrites NUCLIDE in place: the provider string (after lowercasing and stripping) is replaced by the MARIS integer nuclide_id in every sample-type group.

Let’s verify the full pipeline works:

tfm = Transformer(dfs, cbs=[
    LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
    RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE')
    ])
    
dfs_out = tfm()
print(f"NUCLIDE is integer MARIS IDs in all {len(dfs_out)} groups. ✓")
for key in dfs_out.keys():
    test_eq(dfs_out[key]['NUCLIDE'].dtype, 'int64')
NUCLIDE is integer MARIS IDs in all 3 groups. ✓

A quick sanity check confirms that the expert-mapped nuclides like cs134_137_tot and cs137 are properly assigned to actual rows in the output:

for name, ncid in [('cs134_137_tot', 76), ('cs137', 33)]:
    n = (dfs_out['BIOTA'].NUCLIDE == ncid).sum()
    print(f"{name} (id={ncid}): {n} rows")
    test_ne(n, 0)
cs134_137_tot (id=76): 91 rows
cs137 (id=33): 4610 rows

Standardize time

HELCOM provides dates in a DATE column (format MM/DD/YY HH:MM:SS), but ~1,500 rows across the dataset have missing DATE values. The raw data also includes separate YEAR, MONTH, DAY columns as fallback, though some rows have MONTH=0 or DAY=0 (unknown), which we set to 1.

ParseTimeCB handles this in three steps: it parses the DATE column using pandas, replaces MONTH=0 / DAY=0 with 1, and fills any remaining missing TIME values by constructing dates from the YEAR/MONTH/DAY columns.

Important: FEEDBACK TO DATA PROVIDER

Time/date is provided in DATE, YEAR, MONTH, and DAY columns. The DATE column contains ~1,500 missing values across the dataset. These should ideally be populated at source. Additionally, MONTH=0 or DAY=0 occurs when the day or month is unknown; we set these to 1 as a convention, but a standardised sentinel value for unknown components would be clearer.

# Show rows with missing or zero date components in SEAWATER
df = dfs['SEAWATER']
bad_dates = df[df['date'].isna()]
bad_parts  = df[(df['day'] == 0) | (df['month'] == 0)]

print(f"Missing DATE values: {len(bad_dates)} rows")
print(f"Zero day or month:  {len(bad_parts)} rows")
print(bad_dates[['date','year','month','day']].head(3))
Missing DATE values: 546 rows
Zero day or month:  107 rows
      date  year  month  day
11332  NaN  2013      6  3.0
11333  NaN  2013      6  3.0
11334  NaN  2013      6  3.0


ParseTimeCB


def ParseTimeCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Parse HELCOM DATE (MM/DD/YY HH:MM:SS) with fallback to YEAR/MONTH/DAY.

Applying ParseTimeCB across all sample-type groups:

tfm = Transformer(dfs, cbs=[ParseTimeCB()])
dfs_out = tfm()

print(f"TIME column added across {len(dfs_out)} groups. ✓")
print(dfs_out['SEAWATER'][['TIME']].head(3))
test_eq('TIME' in dfs_out['SEAWATER'].columns, True)
test_eq(dfs_out['SEAWATER']['TIME'].isna().sum(), 0)
TIME column added across 3 groups. ✓
        TIME
0 1984-06-13
1 1984-06-13
2 1984-06-13
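
The three-step logic described for ParseTimeCB can be sketched for a single record using only the standard library. This is an illustration under stated assumptions, not the callback's actual code: parse_time is a hypothetical helper, missing dates are represented as None, and the format string follows the MM/DD/YY HH:MM:SS pattern given above.

```python
from datetime import datetime

def parse_time(date, year, month, day):
    """Use the DATE string when present, otherwise rebuild the date from its parts."""
    if date:                                  # step 1: parse DATE when available
        return datetime.strptime(date, '%m/%d/%y %H:%M:%S')
    month = int(month) or 1                   # step 2: unknown month (0) becomes 1
    day = int(day) or 1                       #         unknown day (0) becomes 1
    return datetime(int(year), month, day)    # step 3: fall back to YEAR/MONTH/DAY

print(parse_time('06/13/84 00:00:00', 1984, 6, 13))  # DATE present
print(parse_time(None, 2013, 6, 3.0))                 # DATE missing, parts usable
print(parse_time(None, 1990, 0, 0))                   # DATE missing, month and day unknown
```

The second call mirrors the SEAWATER rows shown earlier, where the date is missing but year, month and day are filled in (with day stored as a float).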

NetCDF stores time as milliseconds since an origin (here 1970-01-01, as defined in the template’s CDL). EncodeTimeCB converts the parsed TIME column to this integer format; rows with unresolvable dates are dropped (8 in SEAWATER, 1 in SEDIMENT).

Applying ParseTimeCB and EncodeTimeCB together:

tfm = Transformer(dfs, cbs=[ParseTimeCB(), EncodeTimeCB()])
dfs_out = tfm()

print(f"TIME encoded as int64 in all {len(dfs_out)} groups. ✓")
test_eq(dfs_out['SEAWATER']['TIME'].dtype, 'int64')
test_eq(dfs_out['SEAWATER']['TIME'].isna().sum(), 0)
TIME encoded as int64 in all 3 groups. ✓
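
For intuition, the encoding can be sketched as an integer count of milliseconds since the origin, taking the unit and the 1970-01-01 origin as stated above. The helper names encode_ms and decode_ms are hypothetical, and EncodeTimeCB may be implemented differently.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # origin, as stated in the text above

def encode_ms(dt: datetime) -> int:
    """Whole milliseconds elapsed between the origin and dt."""
    return (dt - EPOCH) // timedelta(milliseconds=1)

def decode_ms(ms: int) -> datetime:
    """Inverse of encode_ms."""
    return EPOCH + timedelta(milliseconds=ms)

t = datetime(1984, 6, 13)
encoded = encode_ms(t)
print(encoded, decode_ms(encoded))
```

Dividing one timedelta by another with // yields a plain int, so no floating-point rounding is involved.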

Melt sediment values

HELCOM sediment records are in wide format: each row carries two parallel measurement columns (value_bq/kg and value_bq/m², plus their associated uncertainty and detection-limit columns). MARIS expects tidy (long) format: one measurement per row, with a UNIT code identifying the original column. We therefore unpivot (melt) the sediment data, creating a separate row for each measurement type.

To make the transformation explicit: the melt copies values from each measurement-type group into columns prefixed with _ (_VALUE, _UNC, _DL, _UNIT). The underscore marks these as intermediate; they will be renamed to their final MARIS-standard column names in a later step.

Note: Feedback to data provider

Tidy/long format would simplify ingestion. HELCOM supplies sediment measurements in wide format (Bq/kg and Bq/m² columns on the same row). MARIS expects one measurement per row with a unit identifier. This means every sediment row with data in both columns must be split into two rows during ingestion, an extra transformation step that a long-format delivery would avoid.

# Let's see what the sediment data looks like and why we need to split
sed = dfs['SEDIMENT']

cols = ['key','nuclide','value_bq/kg','< value_bq/kg','error%_kg','value_bq/m²','< value_bq/m²','error%_m²']
print("Random sample of 3 sediment rows:")
print(sed[cols].sample(3).to_string(index=False), '\n')

# How many rows have data in BOTH columns?
both = sed[sed['value_bq/kg'].notna() & sed['value_bq/m²'].notna()]
print(f"Rows with values in BOTH Bq/kg and Bq/m²: {len(both):,} out of {len(sed):,} ({100*len(both)/len(sed):.0f}%)")
Random sample of 3 sediment rows:
         key nuclide  value_bq/kg < value_bq/kg  error%_kg  value_bq/m² < value_bq/m²  error%_m²
SSTUK1995004     K40        750.0           NaN        5.0          NaN           NaN        NaN
SCLOR2021031     K40        744.7           NaN        4.4       2284.0           NaN        5.1
SDHIG2012024   CS137          5.6           NaN        2.0         84.0           NaN        NaN 

Rows with values in BOTH Bq/kg and Bq/m²: 29,926 out of 40,743 (73%)

The mapping below defines which raw columns correspond to VALUE, uncertainty (UNC), and detection limit (DL) for each measurement type, together with the MARIS unit ID to assign.

Exported source
# Column mappings per sediment measurement type: MARIS-standard column name → raw HELCOM column name
coi_sediment = {
    'kg_type': {
        'VALUE': 'value_bq/kg',  # Activity concentration per unit mass
        'UNC': 'error%_kg',      # Relative uncertainty (percent)
        'DL': '< value_bq/kg',   # Detection limit flag/level
        'UNIT': 3,               # Unit ID for Bq/kg
    },
    'm2_type': {
        'VALUE': 'value_bq/m²',  # Activity per unit area
        'UNC': 'error%_m²',      # Relative uncertainty (percent)
        'DL': '< value_bq/m²',   # Detection limit flag/level
        'UNIT': 2,               # Unit ID for Bq/m²
    }
}

MeltSedimentValuesCB reads each measurement-type group from the mapping above, checks which rows have data in that group’s VALUE/UNC/DL columns, and copies those values into a standard set of temporary columns prefixed with _. It then concatenates all measurement-type subsets into a single sediment dataframe. The underscore prefix marks these columns as intermediate; they are renamed to their final names in a later step.



MeltSedimentValuesCB


def MeltSedimentValuesCB(
    coi:dict, # Column-of-interest mapping, keyed by unit variant (kg, m²)
):

Melt HELCOM dual-value sediment rows into separate rows per measurement type (Bq/kg, Bq/m²).

tfm = Transformer(dfs, cbs=[MeltSedimentValuesCB(coi_sediment)])
dfs_out = tfm()

print(f"SEDIMENT rows: {dfs['SEDIMENT'].shape[0]} → {dfs_out['SEDIMENT'].shape[0]} after melt")
test_eq('_VALUE' in dfs_out['SEDIMENT'].columns, True)
test_eq('_UNIT' in dfs_out['SEDIMENT'].columns, True)
test_eq(dfs_out['SEDIMENT']['_UNIT'].isin([2, 3]).all(), True)
SEDIMENT rows: 40743 → 70695 after melt
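
The melt can be sketched on a handful of rows with plain dictionaries. This is only an approximation of the behaviour described above, not the callback's code: the melt function and the id_cols parameter are hypothetical, missing values are represented as None, and the three input rows are the ones printed in the random sample earlier.

```python
rows = [
    {'key': 'SSTUK1995004', 'nuclide': 'K40',
     'value_bq/kg': 750.0, 'error%_kg': 5.0, '< value_bq/kg': None,
     'value_bq/m²': None, 'error%_m²': None, '< value_bq/m²': None},
    {'key': 'SCLOR2021031', 'nuclide': 'K40',
     'value_bq/kg': 744.7, 'error%_kg': 4.4, '< value_bq/kg': None,
     'value_bq/m²': 2284.0, 'error%_m²': 5.1, '< value_bq/m²': None},
    {'key': 'SDHIG2012024', 'nuclide': 'CS137',
     'value_bq/kg': 5.6, 'error%_kg': 2.0, '< value_bq/kg': None,
     'value_bq/m²': 84.0, 'error%_m²': None, '< value_bq/m²': None},
]

# Same shape as coi_sediment above
coi = {
    'kg_type': {'VALUE': 'value_bq/kg', 'UNC': 'error%_kg', 'DL': '< value_bq/kg', 'UNIT': 3},
    'm2_type': {'VALUE': 'value_bq/m²', 'UNC': 'error%_m²', 'DL': '< value_bq/m²', 'UNIT': 2},
}

def melt(rows, coi, id_cols=('key', 'nuclide')):
    """Emit one output row per (input row, measurement type) that carries any data."""
    out = []
    for spec in coi.values():
        for row in rows:
            # Keep the row for this type if any of its VALUE/UNC/DL columns is filled
            if all(row[spec[k]] is None for k in ('VALUE', 'UNC', 'DL')):
                continue
            out.append({**{c: row[c] for c in id_cols},
                        '_VALUE': row[spec['VALUE']],
                        '_UNC': row[spec['UNC']],
                        '_DL': row[spec['DL']],
                        '_UNIT': spec['UNIT']})
    return out

melted = melt(rows, coi)
print(len(rows), '->', len(melted))
for r in melted:
    print(r)
```

The first input row has no Bq/m² data and yields a single output row, while the other two yield one row per unit.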

Sanitize value

Important: FEEDBACK TO DATA PROVIDER

Some of the HELCOM datasets contain missing values in the measurement value column; these rows are dropped by the SanitizeValueCB callback.

HELCOM measurement values live in differently named columns depending on the sample type — value_bq/m³ for seawater, value_bq/kg for biota, and _VALUE for sediment (created by the previous melt step). SanitizeValueCB collects these into a single VALUE column and drops rows that lack a measurement value, since MARIS requires a non-null measurement for every record.



SanitizeValueCB


def SanitizeValueCB(
    coi:Dict, # Columns of interest. Format: {group_name: {'VALUE': 'column_name'}}
):

Sanitize measurement values by removing blanks and standardizing to use the VALUE column.

Exported source
coi_val = {'SEAWATER' : {'VALUE': 'value_bq/m³'},
           'BIOTA':  {'VALUE': 'value_bq/kg'},
           'SEDIMENT': {'VALUE': '_VALUE'}}
tfm = Transformer(dfs, cbs=[MeltSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),
                            ])
dfs_out = tfm()

print(f"VALUE column created across all {len(dfs_out)} groups.")
for key in dfs_out.keys():
    test_eq('VALUE' in dfs_out[key].columns, True)
    test_eq(dfs_out[key]['VALUE'].isna().sum(), 0)
VALUE column created across all 3 groups.
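
In essence, the sanitising step copies the group's measurement column into VALUE and drops records where it is blank. A minimal sketch with plain dictionaries follows; sanitize and is_missing are hypothetical helpers, and whether the real callback keeps the original column is not shown here.

```python
import math

def is_missing(x):
    """True for None or a float NaN, the two ways a blank value may show up."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def sanitize(rows, value_col):
    """Copy value_col into 'VALUE' and drop rows where it is missing."""
    return [{**row, 'VALUE': row[value_col]}
            for row in rows if not is_missing(row[value_col])]

seawater = [
    {'value_bq/m³': 6.7},
    {'value_bq/m³': None},
    {'value_bq/m³': float('nan')},
]
clean = sanitize(seawater, 'value_bq/m³')
print(len(seawater), '->', len(clean))
```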

Normalize uncertainty

HELCOM provides measurement uncertainty as a relative percentage, but MARIS requires absolute (standard) uncertainty. The percentage column also has a different name in each sample group (error%_m³ for seawater, error% for biota, _UNC for sediment). NormalizeUncCB converts each group’s percentage column to an absolute UNC column: the measured value multiplied by the percentage, divided by 100.
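
Written as a formula, with x the measured value, p the relative uncertainty in percent and u the absolute uncertainty (symbols introduced here for illustration only), and a worked value taken from the mock seawater data used in this section:

```latex
u = \frac{p}{100}\, x
\qquad\text{e.g.}\qquad
x = 5.3,\; p = 32\,\% \;\Rightarrow\; u = \frac{32}{100} \times 5.3 = 1.696
```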

coi_units_unc = {
    'SEAWATER': ('VALUE', 'error%_m³'),
    'BIOTA':    ('VALUE', 'error%'),
    'SEDIMENT': ('VALUE', '_UNC'),
}
class NormalizeUncCB(PerGroupCB):
    "Convert relative uncertainty (percent) to absolute (standard) uncertainty per group."
    def __init__(self,
                 coi: dict=coi_units_unc,  # {group: (meas_col, unc_col)}
                ):
        store_attr()

    def each_grp(self, grp, df, tfm):
        if grp not in self.coi: return
        meas_col, unc_col = self.coi[grp]
        df['UNC'] = df[unc_col] * df[meas_col] / 100

NormalizeUncCB is first run on the real data to inspect the sediment output, then on mock data from all three groups, where inline assertions verify the arithmetic.

tfm = Transformer(dfs, cbs=[
    MeltSedimentValuesCB(coi_sediment), 
    SanitizeValueCB(coi_val),
    NormalizeUncCB()])

tfm()['SEDIMENT'][['VALUE', 'UNC', '_UNIT']].head()
VALUE UNC _UNIT
0 32.0 0.960 3
1 32.0 3.200 3
2 29.0 3.190 3
3 5.5 1.760 3
4 6.4 1.792 3
# Verify NormalizeUncCB computes correctly on mock data
dfs_mock = {
    'SEAWATER': pd.DataFrame({'VALUE': [5.3, 19.9], 'error%_m³': [32.0, 20.0]}),
    'BIOTA':    pd.DataFrame({'VALUE': [135.3],     'error%':    [3.57]}),
    'SEDIMENT': pd.DataFrame({'VALUE': [1200.0, 250.0], '_UNC': [20.0, 20.0]}),
}
tfm = Transformer(dfs_mock, cbs=[NormalizeUncCB()])
tfm()
{'SEAWATER':    VALUE  error%_m³    UNC
 0    5.3       32.0  1.696
 1   19.9       20.0  3.980,
 'BIOTA':    VALUE  error%      UNC
 0  135.3    3.57  4.83021,
 'SEDIMENT':     VALUE  _UNC    UNC
 0  1200.0  20.0  240.0
 1   250.0  20.0   50.0}
test_eq(tfm.dfs['SEAWATER']['UNC'].to_list(), [5.3*32/100, 19.9*20/100])
test_eq(tfm.dfs['BIOTA']['UNC'].to_list(),    [135.3*3.57/100])
test_eq(tfm.dfs['SEDIMENT']['UNC'].to_list(), [1200.0*20/100, 250.0*20/100])
print("NormalizeUncCB on mock data: all assertions passed. ✓")
NormalizeUncCB on mock data: all assertions passed. ✓

Remap units

HELCOM encodes units differently per sample type. SEAWATER uses Bq/m³ (implied by the column name value_bq/m³). BIOTA uses Bq/kg with a basis column distinguishing wet weight (W), dry weight (D), or fresh weight (F). SEDIMENT gets its unit from the melt step’s _UNIT column (Bq/kg or Bq/m²). RemapUnitCB collects these into a single MARIS-standard UNIT column.

For the BIOTA sample type, the base unit is Bq/kg, as indicated in the value_bq/kg column. The distinction between wet (W) and dry weight (D) is specified in the basis column.

dfs['BIOTA'][['value_bq/kg', 'basis']].head(1)
value_bq/kg basis
0 0.00816 W

For the SEAWATER sample type, the unit is Bq/m³ as indicated in the value_bq/m³ column.

dfs['SEAWATER'][['value_bq/m³']].head(1)
value_bq/m³
0 6.7

We can now review the units that are available in MARIS:

print(get_lut('UNIT', as_df=True))
    unit_id  unit_sanitized
0        -1  Not applicable
1         0   NOT AVAILABLE
2         1       Bq per m3
3         2       Bq per m2
4         3       Bq per kg
5         4      Bq per kgd
6         5      Bq per kgw
7         6       kg per kg
8         7              TU
9         8  DELTA per mill
10        9     atom per kg
11       10    atom per kgd
12       11    atom per kgw
13       12      atom per l
14       13      Bq per kgC

We define unit renaming rules for HELCOM in an ad hoc way:

Exported source
lut_units = {
    'SEAWATER': 1,  # 'Bq/m3'
    'SEDIMENT': '_UNIT',  # Accounted for in MeltSedimentValuesCB
    'BIOTA': {
        'D': 4,  # 'Bq/kgd'
        'W': 5,  # 'Bq/kgw'
        'F': 5   # 'Bq/kgw' (fresh assumed = wet)
    }
}

We define the RemapUnitCB callback to set the UNIT column in the DataFrames based on the lookup table lut_units.



RemapUnitCB


def RemapUnitCB(
    lut_units:dict={'SEAWATER': 1, 'SEDIMENT': '_UNIT', 'BIOTA': {'D': 4, 'W': 5, 'F': 5}}, # Per-group unit mapping: group -> literal ID or {basis_code -> ID}
):

Set the MARIS-standard UNIT column from per-sample-type conventions (column name, basis column, or melt result).

A quick sanity check on mock data confirms the callback assigns the correct UNIT IDs for every sample type. SEAWATER always gets unit 1 (Bq/m³). BIOTA rows get 4 for dry weight, 5 for wet/fresh weight, and 0 for unknown basis codes. SEDIMENT picks up whatever _UNIT the melt step assigned, either 2 (Bq/m²) or 3 (Bq/kg):

# Verify RemapUnitCB assigns correct UNIT IDs on mock data
dfs_mock = {
    'SEAWATER': pd.DataFrame({'dummy': [1, 2]}),
    'BIOTA':    pd.DataFrame({'basis': ['D', 'W', 'F', 'X']}),
    'SEDIMENT': pd.DataFrame({'_UNIT': [2, 3, 2]}),
}
tfm = Transformer(dfs_mock, cbs=[RemapUnitCB()])
tfm()

test_eq(tfm.dfs['SEAWATER']['UNIT'].to_list(), [1, 1])
test_eq(tfm.dfs['BIOTA']['UNIT'].to_list(),   [4, 5, 5, 0])
test_eq(tfm.dfs['SEDIMENT']['UNIT'].to_list(), [2, 3, 2])
print("RemapUnitCB on mock data: all assertions passed. ✓")
RemapUnitCB on mock data: all assertions passed. ✓
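
The three kinds of entries in lut_units (a literal ID, a column name, or a dictionary keyed by basis code) suggest a simple dispatch on the entry's type. The sketch below shows one way to read them for a single record. It is not the RemapUnitCB implementation: unit_for, the basis_col parameter and the default of 0 for unknown codes are assumptions drawn from the description and mock test above.

```python
lut_units = {
    'SEAWATER': 1,                         # literal unit ID
    'SEDIMENT': '_UNIT',                   # name of a column that already holds the ID
    'BIOTA': {'D': 4, 'W': 5, 'F': 5},     # basis code -> unit ID
}

def unit_for(group, row, lut=lut_units, basis_col='basis', default=0):
    """Resolve the unit ID of one record according to its group's rule."""
    rule = lut[group]
    if isinstance(rule, int):      # same unit for every record in the group
        return rule
    if isinstance(rule, str):      # copy the ID from an existing column
        return row[rule]
    # dict: look up the basis code, falling back to 0 (NOT AVAILABLE)
    return rule.get(row.get(basis_col), default)

print(unit_for('SEAWATER', {}))
print(unit_for('SEDIMENT', {'_UNIT': 2}))
print([unit_for('BIOTA', {'basis': b}) for b in ['D', 'W', 'F', 'X']])
```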

Running the full pipeline up to this point on the real HELCOM data confirms the units are assigned correctly across all sample-type groups:

tfm = Transformer(dfs, cbs=[
    MeltSedimentValuesCB(coi_sediment),
    SanitizeValueCB(coi_val),
    NormalizeUncCB(),
    RemapUnitCB(),
])
dfs_out = tfm()

for grp in ['SEAWATER', 'BIOTA', 'SEDIMENT']:
    print(f"{grp}: UNIT values = {dfs_out[grp]['UNIT'].unique()}")

test_eq(set(dfs_out['SEAWATER']['UNIT'].unique()), {1})
test_eq(set(dfs_out['SEDIMENT']['UNIT'].unique()), {2, 3})
test_eq(set(dfs_out['BIOTA']['UNIT'].unique()), {0, 4, 5})
SEAWATER: UNIT values = [1]
BIOTA: UNIT values = [5 4 0]
SEDIMENT: UNIT values = [3 2]

Remap detection limit

HELCOM encodes detection limits in provider-specific columns: < value_bq/m³ for seawater, < value_bq/kg for biota, and _DL for sediment after the melt step (see Melt sediment values). When the raw column contains <, the measurement is a detection limit; otherwise it is a detected value. MARIS uses the following integer codes for this distinction:

print(get_lut('DL', as_df=True))
   id   name_sanitized
0  -1   Not applicable
1   0    Not available
2   1   Detected value
3   2  Detection limit
4   3     Not detected
5   4          Derived

The coi_dl mapping below specifies which raw column holds the detection-limit information for each sample group. RemapDetectionLimitCB converts these to the MARIS-standard DL column, assigning code 2 for detection limits (where the raw value is <) and 1 for detected values.



RemapDetectionLimitCB


def RemapDetectionLimitCB(
    coi:dict, # Dict of column hosting the detection limit info for each sample type
):

Map HELCOM < / detected-value conventions to MARIS detection-limit integer codes (2 for DL, 1 for detected).

Exported source
coi_dl = {'SEAWATER' : {'DL' : '< value_bq/m³'},
          'BIOTA':  {'DL' : '< value_bq/kg'},
          'SEDIMENT': {'DL' : '_DL'}}
# Verify RemapDetectionLimitCB assigns correct DL codes on mock data
dfs_mock = {
    'SEAWATER': pd.DataFrame({'< value_bq/m³': ['<', '=', '<', None]}),
    'BIOTA':    pd.DataFrame({'< value_bq/kg': ['<', None, '=', '<']}),
    'SEDIMENT': pd.DataFrame({'_DL': ['=', None, '<', '<']}),
}
tfm = Transformer(dfs_mock, cbs=[RemapDetectionLimitCB(coi_dl)])
tfm()

test_eq(tfm.dfs['SEAWATER']['DL'].to_list(), [2, 1, 2, 1])
test_eq(tfm.dfs['BIOTA']['DL'].to_list(),    [2, 1, 1, 2])
test_eq(tfm.dfs['SEDIMENT']['DL'].to_list(), [1, 1, 2, 2])
print("RemapDetectionLimitCB on mock data: all assertions passed. ✓")
RemapDetectionLimitCB on mock data: all assertions passed. ✓
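
The rule being tested is small enough to state as a single function. The sketch below is a hypothetical stand-in (dl_code is not part of marisco) that reproduces the behaviour checked by the mock test above: a flag containing < gives code 2, and anything else, including a missing flag, gives code 1.

```python
def dl_code(flag):
    """MARIS DL code: 2 (detection limit) if the flag contains '<', else 1 (detected value)."""
    return 2 if isinstance(flag, str) and '<' in flag else 1

flags = ['<', '=', '<', None]   # same flags as the SEAWATER mock data above
codes = [dl_code(f) for f in flags]
print(codes)
```

Note the design consequence: a blank flag is read as a detected value rather than as "not available" (code 0).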

Running the full pipeline up to this point on the real HELCOM data confirms that every sample-type group gets the correct detection-limit codes:

tfm = Transformer(dfs, cbs=[
    MeltSedimentValuesCB(coi_sediment),
    SanitizeValueCB(coi_val),
    NormalizeUncCB(),
    RemapUnitCB(),
    RemapDetectionLimitCB(coi_dl),
])
dfs_out = tfm()

for grp in ['SEAWATER', 'BIOTA', 'SEDIMENT']:
    print(f"{grp}: DL values = {dfs_out[grp]['DL'].unique()}")

test_eq(set(dfs_out['SEAWATER']['DL'].unique()), {1, 2})
test_eq(set(dfs_out['BIOTA']['DL'].unique()),   {1, 2})
test_eq(set(dfs_out['SEDIMENT']['DL'].unique()), {1, 2})
SEAWATER: DL values = [1 2]
BIOTA: DL values = [2 1]
SEDIMENT: DL values = [1 2]

Remap Biota species

The HELCOM Biota dataset records species using HELCOM’s RUBIN code system, which is documented in the accompanying RUBIN_NAME.csv lookup table. We align these scientific names with the MARIS species nomenclature following the same inspect-match-fix workflow used for nuclide names above. The mapping involves two steps: each RUBIN code is first looked up against the provider’s nomenclature to get a scientific name, then that scientific name is mapped to a MARIS species_id via fuzzy matching and expert overrides.

Try an automatic mapping

Read the provider’s RUBIN_NAME.csv and derive unique scientific names, then fuzzy-match them against the MARIS species reference.

Inspect the borderline matches

Review non-exact matches to identify cases the fuzzy matcher could not resolve.

Fix what it got wrong

Apply expert overrides for cases the fuzzy match could not resolve correctly.

Assemble the final mapping

Package the results into a lookup function the Transformer can call later.

Exported source
provider_lut_species = pd.read_csv(f'{src_dir}/RUBIN_NAME.csv')
print(provider_lut_species.head())
   RUBIN_ID     RUBIN    SCIENTIFIC NAME     ENGLISH NAME
0        11  ABRA BRA      ABRAMIS BRAMA            BREAM
1        12  ANGU ANG  ANGUILLA ANGUILLA              EEL
2        13  ARCT ISL  ARCTICA ISLANDICA   ISLAND CYPRINE
3        14  ASTE RUB    ASTERIAS RUBENS  COMMON STARFISH
4        15  CARD EDU      CARDIUM EDULE           COCKLE
Important: FEEDBACK TO DATA PROVIDER

Some rubin codes in the HELCOM Biota dataset do not appear in the RUBIN_NAME.csv lookup table. These include entries with trailing spaces (FUCU VES, GADU MOR) and codes that are apparently missing from the table (CHAR BALT, FUCU SPP, FURC LUMB, STUC PECT). Trailing spaces should be trimmed at source, and any valid RUBIN codes missing from the lookup table should be added.

set(dfs['BIOTA']['rubin']) - set(provider_lut_species['RUBIN'])
{'CHAR BALT', 'FUCU SPP', 'FUCU VES ', 'FURC LUMB', 'GADU MOR  ', 'STUC PECT'}
maris_ref = get_lut('SPECIES', as_df=True)
print(maris_ref.head())
   species_id                             species
0           0                       NOT AVAILABLE
1           1                 Aristeus antennatus
2           2                        Apostichopus
3           3  Saccharina japonica var. religiosa
4           4                  Siganus fuscescens
# Fuzzy-merge provider scientific names against MARIS species names
merged = fuzzy_merge(provider_lut_species, maris_ref,
                     left_on='SCIENTIFIC NAME', right_on='species')
# Inspect non-exact matches
non_exact = merged[merged.score > 0].sort_values('score', ascending=False)
print(non_exact[['SCIENTIFIC NAME', 'species', 'score']].to_string())
            SCIENTIFIC NAME              species  score
40  STIZOSTEDION LUCIOPERCA    Sander lucioperca     10
20     LAMINARIA SACCHARINA   Laminaria japonica      7
6             CHARA BALTICA      Macoma balthica      6
4             CARDIUM EDULE            Cardiidae      6
11       ENCHINODERMATA CIM        Echinodermata      5
33            PSETTA MAXIMA      Pinctada maxima      5
22           MACOMA BALTICA      Macoma balthica      1
41      STUCKENIA PECTINATE  Stuckenia pectinata      1
Exported source
fixes_species = {
    'LAMINARIA SACCHARINA': 'Saccharina latissima',
    'CARDIUM EDULE': 'Cerastoderma edule',
    'CHARA BALTICA': 'NOT AVAILABLE',
    'PSETTA MAXIMA': 'Scophthalmus maximus'
    }
fixed = fix_lut(merged, fixes_species, maris_ref,
                left_on='SCIENTIFIC NAME', right_on='species', id_col='species_id')

unresolved = fixed[fixed['score'] > 0]
print(unresolved[['SCIENTIFIC NAME', 'species']] if len(unresolved) else "All species entries resolved. ✓")
            SCIENTIFIC NAME              species
11       ENCHINODERMATA CIM        Echinodermata
22           MACOMA BALTICA      Macoma balthica
40  STIZOSTEDION LUCIOPERCA    Sander lucioperca
41      STUCKENIA PECTINATE  Stuckenia pectinata

Four entries (ENCHINODERMATA CIM, MACOMA BALTICA, STIZOSTEDION LUCIOPERCA, STUCKENIA PECTINATE) return non-zero fuzzy-match scores, but the matches are semantically correct: Echinodermata, Macoma balthica, Sander lucioperca, and Stuckenia pectinata are the right MARIS equivalents. No further overrides needed.

Exported source
species_lut = make_lut_from(provider_lut_species, 'RUBIN', 'SCIENTIFIC NAME', 'SPECIES', fixes=fixes_species)

Verify species lookup on mock data:

dfs_mock = {'BIOTA': pd.DataFrame({'rubin': ['ABRA BRA', 'CARD EDU', 'CHAR BALT']})}
tfm = Transformer(dfs_mock, cbs=[RemapCB(lut=species_lut, col_remap='SPECIES', col_src='rubin')])
tfm()
test_eq(tfm.dfs['BIOTA']['SPECIES'].to_list(), [271, 274, 0])
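
The two-step lookup (RUBIN code to scientific name, then scientific name to MARIS species_id) can be sketched with plain dictionaries. This is an illustration, not what make_lut_from does internally. The tables are tiny stand-ins: the RUBIN entries come from RUBIN_NAME.csv above, while the IDs 271 and 274 are read off the mock test and are assumed here to belong to Abramis brama and Cerastoderma edule.

```python
# Step 1 table: RUBIN code -> provider scientific name (from RUBIN_NAME.csv above)
rubin_to_name = {'ABRA BRA': 'ABRAMIS BRAMA', 'CARD EDU': 'CARDIUM EDULE'}

# Expert override, as in fixes_species
fixes = {'CARDIUM EDULE': 'Cerastoderma edule'}

# Step 2 table: toy stand-in for the MARIS reference (IDs assumed from the mock test)
name_to_id = {'abramis brama': 271, 'cerastoderma edule': 274}

def species_id(rubin, not_available=0):
    """Resolve a RUBIN code to a species ID, or 0 (NOT AVAILABLE) if any step fails."""
    sci_name = rubin_to_name.get(rubin)
    if sci_name is None:                      # code absent from the provider table
        return not_available
    target = fixes.get(sci_name, sci_name)    # apply the expert override if there is one
    return name_to_id.get(target.lower(), not_available)

ids = [species_id(r) for r in ['ABRA BRA', 'CARD EDU', 'CHAR BALT']]
print(ids)
```

A code that is missing from the provider table, such as CHAR BALT, falls through to 0 instead of raising an error, which matches the result of the mock test.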

Map species on real HELCOM Biota data:

tfm = Transformer({'BIOTA': dfs['BIOTA'].copy()}, cbs=[
    LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
    RemapCB(lut=species_lut, col_remap='SPECIES', col_src='rubin')
])
dfs_out = tfm()
test_eq(dfs_out['BIOTA']['SPECIES'].isna().sum(), 0)
test_eq(dfs_out['BIOTA']['SPECIES'].dtype, 'int64')
print(f"SPECIES mapped to integer MARIS IDs across {len(dfs_out['BIOTA'])} rows. ✓")
SPECIES mapped to integer MARIS IDs across 16124 rows. ✓

Remap Body Part

Biota tissue is recorded as HELCOM TISSUE codes documented in the accompanying TISSUE.csv lookup table. We reconcile these with the MARIS body-part nomenclature following the same inspect-match-fix workflow used for species above.

Exported source
provider_lut_tissues = pd.read_csv(f'{src_dir}/TISSUE.csv')
print(provider_lut_tissues.head())
   TISSUE                    TISSUE_DESCRIPTION
0       1                            WHOLE FISH
1       2           WHOLE FISH WITHOUT ENTRAILS
2       3  WHOLE FISH WITHOUT HEAD AND ENTRAILS
3       4                      FLESH WITH BONES
4       5          FLESH WITHOUT BONES (FILETS)
maris_ref = get_lut('BODY_PART', as_df=True)
print(maris_ref.head())

# Fuzzy-merge provider tissue descriptions against MARIS body-part names
merged = fuzzy_merge(provider_lut_tissues, maris_ref,
                     left_on='TISSUE_DESCRIPTION', right_on='bodypar')
   bodypar_id                                bodypar
0          -1                         Not applicable
1           0                        (Not available)
2           1                           Whole animal
3           2               Whole animal eviscerated
4           3  Whole animal eviscerated without head
# Inspect non-exact matches
#| eval: false
non_exact = merged[merged.score > 0].sort_values('score', ascending=False)
print(non_exact[['TISSUE_DESCRIPTION', 'bodypar', 'score']].to_string())
                      TISSUE_DESCRIPTION                bodypar  score
2   WHOLE FISH WITHOUT HEAD AND ENTRAILS    Flesh without bones     20
1            WHOLE FISH WITHOUT ENTRAILS    Flesh without bones     13
7                         SKIN/EPIDERMIS               Skeleton     10
4           FLESH WITHOUT BONES (FILETS)    Flesh without bones      9
0                             WHOLE FISH           Whole animal      5
11                              ENTRAILS                  Brain      5
14                   STOMACH + INTESTINE  Stomach and intestine      3
21                         WHOLE ANIMALS           Whole animal      1

Four entries were matched incorrectly (WHOLE FISH WITHOUT HEAD AND ENTRAILS, WHOLE FISH WITHOUT ENTRAILS, SKIN/EPIDERMIS, and ENTRAILS). We override them as detailed below:

Exported source
fixes_biota_tissues = {
    'WHOLE FISH WITHOUT HEAD AND ENTRAILS': 'Whole animal eviscerated without head',
    'WHOLE FISH WITHOUT ENTRAILS': 'Whole animal eviscerated',
    'SKIN/EPIDERMIS': 'Skin',
    'ENTRAILS': 'Viscera'
    }
maris_ref = get_lut('BODY_PART', as_df=True)
fixed = fix_lut(merged, fixes_biota_tissues, maris_ref,
                left_on='TISSUE_DESCRIPTION', right_on='bodypar', id_col='bodypar_id')

unresolved = fixed[fixed['score'] > 0]
print(unresolved[['TISSUE_DESCRIPTION', 'bodypar']] if len(unresolved) else "All body-part entries resolved. \u2713")
              TISSUE_DESCRIPTION                bodypar
0                     WHOLE FISH           Whole animal
4   FLESH WITHOUT BONES (FILETS)    Flesh without bones
14           STOMACH + INTESTINE  Stomach and intestine
21                 WHOLE ANIMALS           Whole animal

The four remaining entries still return non-zero scores, but their matches are semantically correct, so no further overrides are needed.

Assemble the final mapping

The steps above (unique values, fuzzy match, expert overrides, verification) told us what the correct MARIS translations are. The make_lut_from function packages that knowledge into a callable that the Transformer can use later.
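
To make the idea concrete, here is a minimal sketch of how provider codes, accepted matches, and expert overrides can be combined into a code-to-id dictionary. The helper `build_lut` is hypothetical and `make_lut_from` may work differently internally. The ids come from the BODY_PART reference table printed above, and the `auto_matches` entries mirror the fuzzy-match output.

```python
# MARIS body-part ids, as printed by get_lut('BODY_PART') above
maris_ids = {
    "(Not available)": 0,
    "Whole animal": 1,
    "Whole animal eviscerated": 2,
    "Whole animal eviscerated without head": 3,
}
# Provider codes and descriptions, as in TISSUE.csv
provider = {
    1: "WHOLE FISH",
    2: "WHOLE FISH WITHOUT ENTRAILS",
    3: "WHOLE FISH WITHOUT HEAD AND ENTRAILS",
}
# Automatic matches (two of them wrong) and the expert overrides that correct them
auto_matches = {
    "WHOLE FISH": "Whole animal",
    "WHOLE FISH WITHOUT ENTRAILS": "Flesh without bones",
    "WHOLE FISH WITHOUT HEAD AND ENTRAILS": "Flesh without bones",
}
fixes = {
    "WHOLE FISH WITHOUT HEAD AND ENTRAILS": "Whole animal eviscerated without head",
    "WHOLE FISH WITHOUT ENTRAILS": "Whole animal eviscerated",
}


def build_lut(provider, auto_matches, fixes, maris_ids, default=0):
    """Map provider code -> MARIS id; expert overrides take precedence over automatic matches."""
    lut = {}
    for code, description in provider.items():
        maris_name = fixes.get(description, auto_matches.get(description))
        lut[code] = maris_ids.get(maris_name, default)  # unknown names fall back to `default`
    return lut


lut = build_lut(provider, auto_matches, fixes, maris_ids)
print(lut)  # {1: 1, 2: 2, 3: 3}
```

Keeping the overrides in a separate dictionary means the automatic matches can be regenerated at any time without losing the expert decisions.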

Exported source
lut_tissues = make_lut_from(provider_lut_tissues,
                             'TISSUE', 'TISSUE_DESCRIPTION', 'BODY_PART',
                             fixes=fixes_biota_tissues)

Verify body part lookup on mock data:

# Verify body part lookup on mock data
dfs_mock = {'BIOTA': pd.DataFrame({'tissue': [1, 5, 12]})}
tfm = Transformer(dfs_mock, cbs=[RemapCB(lut=lut_tissues, col_remap='BODY_PART', col_src='tissue')])
tfm()
test_eq(tfm.dfs['BIOTA']['BODY_PART'].dtype, 'int64')
print("BODY_PART mapped as integer on mock data. \u2713")
BODY_PART mapped as integer on mock data. ✓

Map body part on real HELCOM Biota data:

tfm = Transformer({'BIOTA': dfs['BIOTA'].copy()}, cbs=[
    LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
    RemapCB(lut=lut_tissues, col_remap='BODY_PART', col_src='tissue')])
dfs_out = tfm()
test_eq(dfs_out['BIOTA']['BODY_PART'].isna().sum(), 0)
test_eq(dfs_out['BIOTA']['BODY_PART'].dtype, 'int64')
print(f'BODY_PART mapped to integer MARIS IDs across {len(dfs_out["BIOTA"])} rows. \u2713')
BODY_PART mapped to integer MARIS IDs across 16124 rows. ✓

Remap Biological Group

Unlike nuclide names, species, and body parts, which required fuzzy matching against the MARIS nomenclature followed by expert overrides, the biological group assignment is straightforward. The MARIS SPECIES lookup table already includes a biogroup_id column. Since each HELCOM Biota row now has SPECIES as an integer MARIS ID (mapped in the previous step), we only need to look up the corresponding biological group.

Exported source
lut_biogroup = get_lut('SPECIES', key='species_id', value='biogroup_id')

Let’s verify this works on mock data. We assign SPECIES IDs (as if the species-remap step already ran), then look up BIO_GROUP:

# Verify biogroup lookup on mock species IDs
dfs_mock = {'BIOTA': pd.DataFrame({'SPECIES': [271, 274, 0]})}
tfm = Transformer(dfs_mock, cbs=[
    RemapCB(lut=lut_biogroup, col_remap='BIO_GROUP', col_src='SPECIES', grps=['BIOTA'])
])
tfm()
test_eq(tfm.dfs['BIOTA']['BIO_GROUP'].to_list(), [4, 14, 0])
print('BIO_GROUP mapped correctly on mock data. ✓')
BIO_GROUP mapped correctly on mock data. ✓

Now apply to real HELCOM Biota data, chained after the species remap:

tfm = Transformer({'BIOTA': dfs['BIOTA'].copy()}, cbs=[
    LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
    RemapCB(lut=species_lut, col_remap='SPECIES', col_src='rubin'),
    RemapCB(lut=lut_biogroup, col_remap='BIO_GROUP', col_src='SPECIES'),
])
dfs_out = tfm()
test_eq(dfs_out['BIOTA']['BIO_GROUP'].isna().sum(), 0)
test_eq(dfs_out['BIOTA']['BIO_GROUP'].dtype, 'int64')
print(f'BIO_GROUP mapped to integer MARIS IDs across {len(dfs_out["BIOTA"])} rows. ✓')
BIO_GROUP mapped to integer MARIS IDs across 16124 rows. ✓

Remap Sediment Types

HELCOM sediment types are recorded as integer SEDI codes documented in the accompanying SEDIMENT_TYPE.csv lookup table. We reconcile these with the MARIS sediment-type nomenclature following the same inspect-match-fix workflow used for nuclide names and species above.

Exported source
provider_lut_sed = pd.read_csv(f'{src_dir}/SEDIMENT_TYPE.csv')
print(provider_lut_sed.head())
   SEDI SEDIMENT TYPE RECOMMENDED TO BE USED
0   -99       NO DATA                    NaN
1     0        GRAVEL                    YES
2     1          SAND                    YES
3     2     FINE SAND                     NO
4     3          SILT                    YES
Important: FEEDBACK TO DATA PROVIDER

The SEDI values 56 and 73 do not appear in the provided SEDIMENT_TYPE.csv lookup table, and many rows have NaN values. We reassign all of them to -99 for now, but the provider should clarify or fix these codes. The check below demonstrates the issue.

set(dfs['SEDIMENT']['sedi'].unique()) - set(provider_lut_sed['SEDI'])
{np.float64(56.0), np.float64(73.0), np.float64(nan)}

Try an automatic mapping

Derive provider sediment types from the lookup table and fuzzy-match against MARIS reference.

maris_ref = get_lut('SED_TYPE', as_df=True)

print("provider_lut_sed:", provider_lut_sed.columns.tolist())
print("maris_ref:   ", maris_ref.columns.tolist())

merged = fuzzy_merge(provider_lut_sed, maris_ref, left_on='SEDIMENT TYPE', right_on='sedtype')
provider_lut_sed: ['SEDI', 'SEDIMENT TYPE', 'RECOMMENDED TO BE USED']
maris_ref:    ['sedtype_id', 'sedtype']

Fix what it got wrong

Apply expert overrides for cases the fuzzy match could not resolve correctly. Two are simple typos in the provider lookup table (MUD AND GARVEL → Mud and gravel, CLACIAL CLAY → Glacial clay). NO DATA maps to (Not available).

Important: FEEDBACK TO MARIS DATA TEAM

The MARIS SED_TYPE lookup table uses the parenthesised (Not available) for its sentinel entry, while other MARIS reference tables use the bare Not available (e.g. SPECIES, UNIT). Note that BODY_PART, as printed above, also uses the parenthesised form. This inconsistency should be resolved so that all LUTs use the same sentinel form.

Exported source
# Expert overrides for sediment type names
# 'NO DATA' maps to '(Not available)' rather than 'Not available' due to 
# an inconsistency in the MARIS SED_TYPE reference table — the sentinel entry 
# uses parentheses while other LUTs use the bare form. This should be aligned.
fixes_sediments = {
    'NO DATA': '(Not available)',
    'MUD AND GARVEL': 'Mud and gravel',
    'CLACIAL CLAY': 'Glacial clay',
}
fixed = fix_lut(merged, fixes_sediments, maris_ref,
                left_on='SEDIMENT TYPE', right_on='sedtype', id_col='sedtype_id')
unresolved = fixed[fixed['score'] > 0]
print(unresolved[['SEDIMENT TYPE', 'sedtype']] if len(unresolved) else "All sediment type entries resolved. \u2713")
All sediment type entries resolved. ✓

The steps above (unique values, fuzzy match, expert overrides, verification) told us what the correct MARIS translations are. The make_lut_from function packages that knowledge, the expert fixes and the MARIS reference table, into a single function that the Transformer can call later, when it is processing data through the pipeline.

A dedicated CleanSedimentCodesCB replaces the invalid SEDI codes (56, 73, NaN) with -99 before the nomenclature lookup, making it explicit which step handles data-cleaning vs. nomenclature mapping. When the data provider fixes these codes, simply drop this callback from the pipeline.


source

CleanSedimentCodesCB


def CleanSedimentCodesCB(
    replace_lut
):

Replace invalid HELCOM SEDI codes with -99 sentinel before nomenclature lookup.
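
The cleaning step can be sketched in plain pandas as follows. The helper `clean_sedi` is a hypothetical stand-in, not the callback's actual source. In particular, replacing NaN with the sentinel via `fillna` is an assumption based on the description above.

```python
import numpy as np
import pandas as pd

sed_replace_lut = {56: -99, 73: -99}  # same mapping as used in the handler


def clean_sedi(sedi: pd.Series, replace_lut: dict, sentinel: int = -99) -> pd.Series:
    """Replace undocumented codes and NaN with the sentinel, then cast to integer codes."""
    cleaned = sedi.replace(replace_lut)          # undocumented codes -> sentinel
    return cleaned.fillna(sentinel).astype(int)  # NaN -> sentinel (assumed behaviour)


raw = pd.Series([0.0, 1.0, 56.0, 73.0, np.nan])
cleaned = clean_sedi(raw, sed_replace_lut)
print(cleaned.tolist())  # [0, 1, -99, -99, -99]
```

After this step every code is present in the provider lookup table, so the nomenclature lookup never sees an unknown key.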

Exported source
sed_replace_lut = {56: -99, 73: -99}
Exported source
sediment_lut = make_lut_from(provider_lut_sed, 'SEDI', 'SEDIMENT TYPE', 'SED_TYPE', fixes=fixes_sediments)

Verify sediment type lookup on mock data:

dfs_mock = {'SEDIMENT': pd.DataFrame({'sedi': [0, 1, -99, 56, 73]})}
tfm = Transformer(dfs_mock, cbs=[
    CleanSedimentCodesCB(replace_lut=sed_replace_lut),
    RemapCB(lut=sediment_lut, col_remap='SED_TYPE', col_src='sedi'),
])
tfm()
test_eq(tfm.dfs['SEDIMENT']['SED_TYPE'].dtype, 'int64')
print('SED_TYPE mapped as integer on mock data.')
SED_TYPE mapped as integer on mock data.

Apply sediment type lookup to real HELCOM SEDIMENT data:

tfm = Transformer(dfs, cbs=[
    CleanSedimentCodesCB(replace_lut=sed_replace_lut),
    RemapCB(lut=sediment_lut, col_remap='SED_TYPE', col_src='sedi', grps=['SEDIMENT']),
])
dfs_out = tfm()
test_eq(dfs_out['SEDIMENT']['SED_TYPE'].isna().sum(), 0)
test_eq(dfs_out['SEDIMENT']['SED_TYPE'].dtype, 'int64')
print(f'SED_TYPE mapped to integer MARIS IDs across {len(dfs_out["SEDIMENT"])} rows.')
SED_TYPE mapped to integer MARIS IDs across 40743 rows.

Remap Filtering Status

Unlike nuclide names, species, and body parts, which each had a dedicated provider lookup table, HELCOM filtering status has no provider-side LUT. The filt column appears only in the seawater data. We inspect the unique values directly from the data and then map them to the MARIS FILT nomenclature via a plain dictionary.

Inspect unique filt values across the data:

uniq_across_dfs(dfs, 'filt')
['F', 'n', 'N', nan]

What the MARIS filtering nomenclature looks like:

maris_ref = get_lut('FILT', as_df=True)
print(maris_ref.head())
   id            name
0  -1  Not applicable
1   0   Not available
2   1             Yes
3   2              No

With only four distinct values to handle (F, N, n, and NaN), the generic RemapCB callback does the job directly: it accepts a plain dict as its lut parameter, so no custom callback is needed.

Exported source
lut_filtered = {
    'N': 2, # No
    'n': 2, # No
    'F': 1 # Yes
}

RemapCB(lut=lut_filtered, col_remap='FILT', col_src='filt') converts the HELCOM filt codes to MARIS-standard FILT identifiers.
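
The effect of such a dictionary lookup can be reproduced in plain pandas. This is a sketch of the mapping only, not of how RemapCB is implemented. The fallback to 0 (Not available) for unmapped or missing codes is inferred from the expected values in the mock test on this page.

```python
import pandas as pd

lut_filtered = {"N": 2, "n": 2, "F": 1}  # same dictionary as above

filt = pd.Series(["N", "F", "n", "Y", None])
# Codes absent from the dictionary (here 'Y' and None) become NaN, then 0 (Not available)
FILT = filt.map(lut_filtered).fillna(0).astype(int)
print(FILT.tolist())  # [2, 1, 2, 0, 0]
```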

Let us verify on mock data:

# Verify on mock data
dfs_mock = {'SEAWATER': pd.DataFrame({'filt': ['N', 'F', 'n', 'Y', None]})}
tfm = Transformer(dfs_mock, cbs=[RemapCB(lut=lut_filtered, col_remap='FILT', col_src='filt')])
tfm()
test_eq(tfm.dfs['SEAWATER']['FILT'].to_list(), [2, 1, 2, 0, 0])
print("FILT mapped correctly on mock data. \u2713")
FILT mapped correctly on mock data. ✓
#| eval: false
tfm = Transformer(dfs, cbs=[RemapCB(lut=lut_filtered, col_remap='FILT', col_src='filt', grps=['SEAWATER'])])
tfm()

print(tfm.dfs['SEAWATER'][['filt', 'FILT']].head())
  filt  FILT
0    N     2
1    N     2
2    N     2
3    N     2
4    N     2

Add sample ID

HELCOM identifies each record with a KEY column. MARIS requires two identifier columns: SMP_ID (an internal sequential id) and SMP_ID_PROVIDER (the provider’s original key). We generate the sequential id and copy the KEY column as the provider identifier.

  • SMP_ID is an internal unique identifier for each sample
  • SMP_ID_PROVIDER is provided by the data provider

source

AddSampleIDCB


def AddSampleIDCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Assign internal sequential SMP_ID and preserve provider KEY as SMP_ID_PROVIDER.
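
A minimal pandas sketch of the same idea, using a hypothetical `add_sample_ids` helper (the callback's actual source is not shown here, and the lowercase `key` column name is taken from the mock data on this page):

```python
import pandas as pd


def add_sample_ids(df: pd.DataFrame, key_col: str = "key") -> pd.DataFrame:
    """Add a 1-based sequential SMP_ID and copy the provider key to SMP_ID_PROVIDER."""
    out = df.copy()
    out["SMP_ID"] = range(1, len(out) + 1)  # one id per row
    out["SMP_ID_PROVIDER"] = out[key_col]   # provider key kept verbatim
    return out


df = add_sample_ids(pd.DataFrame({"key": ["A1", "A1", "A2"]}))
print(df[["SMP_ID", "SMP_ID_PROVIDER"]])
```

Because SMP_ID is assigned per row, rows that share a provider key still receive distinct internal ids.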

# Verify sample IDs on mock data
dfs_mock = {'BIOTA': pd.DataFrame({'key': ['A1', 'A2']})}
tfm = Transformer(dfs_mock, cbs=[AddSampleIDCB()])
tfm()
test_eq(tfm.dfs['BIOTA']['SMP_ID'].to_list(), [1, 2])
test_eq(tfm.dfs['BIOTA']['SMP_ID_PROVIDER'].to_list(), ['A1', 'A2'])
tfm = Transformer(dfs, cbs=[AddSampleIDCB()])
print(tfm()['SEAWATER'][['SMP_ID', 'SMP_ID_PROVIDER']].head())
   SMP_ID SMP_ID_PROVIDER
0       1    WCLOR1984001
1       2    WCLOR1984001
2       3    WCLOR1984001
3       4    WCLOR1984002
4       5    WCLOR1984002

Add depths

HELCOM stores sampling depth as sdepth (seawater, biota) and total depth as tdepth (seawater, sediment). The raw CSV may contain these as strings. AddDepthCB renames them to the MARIS-standard SMP_DEPTH and TOT_DEPTH columns and casts them as float.


source

AddDepthCB


def AddDepthCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Rename HELCOM sdepth/tdepth columns to MARIS-standard SMP_DEPTH/TOT_DEPTH and cast as float.
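
The per-group logic can be sketched as below. `add_depths` is a hypothetical helper, and two details are assumptions: the source column is copied rather than renamed in place, and unparseable strings are coerced to NaN.

```python
import pandas as pd

DEPTH_COLS = {"sdepth": "SMP_DEPTH", "tdepth": "TOT_DEPTH"}


def add_depths(dfs: dict) -> dict:
    """For each group, copy whichever depth columns exist into MARIS names as floats."""
    out = {}
    for grp, df in dfs.items():
        df = df.copy()
        for src, dst in DEPTH_COLS.items():
            if src in df.columns:  # e.g. BIOTA has no tdepth
                df[dst] = pd.to_numeric(df[src], errors="coerce").astype(float)
        out[grp] = df
    return out


result = add_depths({
    "SEAWATER": pd.DataFrame({"sdepth": ["5.0"], "tdepth": ["42.0"]}),
    "BIOTA": pd.DataFrame({"sdepth": ["3.5"]}),
})
print(result["SEAWATER"][["SMP_DEPTH", "TOT_DEPTH"]])
```

Checking for column presence per group lets one callback serve all sample types without creating empty depth columns.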

Verify AddDepthCB renames and casts correctly on mock data:

dfs_mock = {
    "SEAWATER": pd.DataFrame({"key": ["S1"], "sdepth": ["5.0"], "tdepth": ["42.0"]}),
    "BIOTA":    pd.DataFrame({"key": ["B1"], "sdepth": ["3.5"]}),
    "SEDIMENT": pd.DataFrame({"key": ["D1"], "tdepth": ["120.0"]}),
}
tfm = Transformer(dfs_mock, cbs=[AddDepthCB()])
tfm()

test_eq(tfm.dfs["SEAWATER"]["SMP_DEPTH"].to_list(), [5.0])
test_eq(tfm.dfs["SEAWATER"]["TOT_DEPTH"].to_list(), [42.0])
test_eq(tfm.dfs["BIOTA"]["SMP_DEPTH"].to_list(), [3.5])
test_eq("TOT_DEPTH" not in tfm.dfs["BIOTA"].columns, True)
test_eq(tfm.dfs["SEDIMENT"]["TOT_DEPTH"].to_list(), [120.0])
test_eq("SMP_DEPTH" not in tfm.dfs["SEDIMENT"].columns, True)
print("AddDepthCB on mock data: all assertions passed. \u2713")
AddDepthCB on mock data: all assertions passed. ✓

Using real data:

tfm = Transformer(dfs, cbs=[AddDepthCB()])
dfs_out = tfm()
print(dfs_out['BIOTA'][['SMP_DEPTH']].head())
print(dfs_out['SEAWATER'][['TOT_DEPTH']].head())
   SMP_DEPTH
0       20.5
1       20.5
2       20.5
3       20.5
4       20.5
   TOT_DEPTH
0       16.0
1       16.0
2       16.0
3       16.0
4       16.0

Add Salinity

HELCOM stores water salinity in a salin column (PSU units, present only in seawater data). AddSalinityCB renames it to the MARIS-standard SAL column and casts to float.

Important: FEEDBACK TO DATA PROVIDER

The HELCOM dataset includes a column for the salinity of the water (salin). According to the HELCOM documentation, the salin column represents “Salinity of water in PSU units”.

In the SEAWATER dataset, three entries have salinity values greater than 50 PSU. Such values are physically possible, but all three entries read exactly 99.99 PSU, which suggests a placeholder or data entry error rather than a real measurement. These entries should be verified at source.
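
A plausibility check of this kind can be sketched in a few lines of pandas. `flag_suspect` is a hypothetical helper and the sample values are made up for illustration. The threshold of 50 follows the feedback above.

```python
import pandas as pd


def flag_suspect(values: pd.Series, upper: float) -> pd.Series:
    """Boolean mask of entries above a plausibility threshold; missing values are not flagged."""
    return pd.to_numeric(values, errors="coerce") > upper


sal = pd.Series(["7.5", "6.9", "99.99", None])  # illustrative values, not HELCOM data
mask = flag_suspect(sal, upper=50)
print(sal[mask].tolist())  # ['99.99']
```

The same check works for other physical variables by changing the threshold.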


source

AddSalinityCB


def AddSalinityCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Add salinity (SAL) from HELCOM salin column where present.

# Verify AddSalinityCB on mock data
dfs_mock = {
    "SEAWATER": pd.DataFrame({"key": ["S1"], "salin": ["7.5"]}),
    "BIOTA":    pd.DataFrame({"key": ["B1"]}),
}
tfm = Transformer(dfs_mock, cbs=[AddSalinityCB()])
tfm()

test_eq(tfm.dfs["SEAWATER"]["SAL"].to_list(), [7.5])
test_eq("SAL" not in tfm.dfs["BIOTA"].columns, True)
print("AddSalinityCB on mock data: all assertions passed. ✓")
AddSalinityCB on mock data: all assertions passed. ✓
tfm = Transformer(dfs, cbs=[AddSalinityCB()])
dfs_out = tfm()
print(dfs_out['SEAWATER'][['SAL']].drop_duplicates().head())
    SAL
0   4.6
3   7.8
5   6.9
8   8.4
10  8.0

Add Station

HELCOM identifies each sampling location with a station column present in all sample-type groups. AddStationCB copies the provider’s station column to the MARIS-standard STATION column, filling missing values with an empty string.



source

AddStationCB


def AddStationCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Add station to all DataFrames.

# Verify AddStationCB on mock data
dfs_mock = {
    "SEAWATER": pd.DataFrame({"station": ["SD24", None]}),
    "BIOTA":    pd.DataFrame({"station": ["SD24"]}),
    "SEDIMENT": pd.DataFrame({"station": ["BY1", "BY2"]}),
}
tfm = Transformer(dfs_mock, cbs=[AddStationCB()])
tfm()

test_eq(tfm.dfs["SEAWATER"]["STATION"].to_list(), ["SD24", ""])
test_eq(tfm.dfs["BIOTA"]["STATION"].to_list(),    ["SD24"])
test_eq(tfm.dfs["SEDIMENT"]["STATION"].to_list(), ["BY1", "BY2"])
print("AddStationCB on mock data: all assertions passed.")
AddStationCB on mock data: all assertions passed.
tfm = Transformer(dfs, cbs=[AddStationCB()])
print(tfm()['SEAWATER'][['STATION']].head())
  STATION
0     ZN2
1     ZN2
2     ZN2
3     ZN2
4     ZN2

Add Temperature

HELCOM stores water temperature in a ttemp column (degrees Celsius, present only in seawater data). AddTemperatureCB renames it to the MARIS-standard TEMP column and casts to float.

Important: FEEDBACK TO DATA PROVIDER

The HELCOM dataset includes a column for the temperature of the water (ttemp). According to the HELCOM documentation, the ttemp column represents “Water temperature in Celsius (°C) degrees of sampled water”.

In the SEAWATER dataset, 92 entries have temperature values greater than 50°C (all reading 99.9°C, concentrated in DHIG samples). These appear to be data entry errors and should be verified at source.


source

AddTemperatureCB


def AddTemperatureCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Add temperature (TEMP) from HELCOM ttemp column.

tfm = Transformer(dfs, cbs=[AddTemperatureCB()])
dfs_out = tfm()
print(dfs_out['SEAWATER']['TEMP'].dropna().head())
890    7.8
891    7.8
892    7.8
893    6.5
894    6.5
Name: TEMP, dtype: float64

Add slice position (TOP and BOTTOM)

HELCOM sediment cores record slice positions in uppsli (top of slice, cm) and lowsli (bottom of slice, cm) columns. RemapSedSliceTopBottomCB renames these to the MARIS-standard TOP and BOTTOM columns.


source

RemapSedSliceTopBottomCB


def RemapSedSliceTopBottomCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Remap Sediment slice top and bottom to MARIS format.

# Verify RemapSedSliceTopBottomCB assigns TOP and BOTTOM correctly on mock data
dfs_mock = {'SEDIMENT': pd.DataFrame({'uppsli': [0.0, 5.0, 10.0], 'lowsli': [5.0, 10.0, 15.0]})}
tfm = Transformer(dfs_mock, cbs=[RemapSedSliceTopBottomCB()])
tfm()
test_eq(tfm.dfs['SEDIMENT']['TOP'].to_list(), [0.0, 5.0, 10.0])
test_eq(tfm.dfs['SEDIMENT']['BOTTOM'].to_list(), [5.0, 10.0, 15.0])
print("RemapSedSliceTopBottomCB on mock data: all assertions passed.")
RemapSedSliceTopBottomCB on mock data: all assertions passed.
tfm = Transformer(dfs, cbs=[RemapSedSliceTopBottomCB()])
tfm()
print(tfm.dfs['SEDIMENT'][['TOP','BOTTOM']].head())
    TOP  BOTTOM
0   0.0     5.0
1   5.0    10.0
2  10.0    15.0
3  15.0    20.0
4  20.0    25.0

Compute weights

Clean basis codes

HELCOM BIOTA samples record a basis column with values D (dry weight), W (wet weight), and F. The HELCOM documentation only defines D and W.

Important: FEEDBACK TO DATA PROVIDER

The BIOTA dataset reports F in the basis column for 25 rows. The HELCOM guidelines only define D (dry weight) and W (wet weight). The F values appear to be fresh weight, which we treat as wet weight, but this should be confirmed or corrected at source.

We convert F to W with the dedicated CleanBasisCB callback defined below, leaving D, W, and NaN unchanged.

Exported source
basis_fix = {'F': 'W'}

Compute weight variables

MARIS stores three weight-related variables:

  • PERCENTWT: dry weight as a decimal fraction of fresh weight (HELCOM dw% divided by 100)
  • DRYWT: dry weight in grams
  • WETWT: fresh weight in grams

HELCOM provides dw% for both BIOTA and SEDIMENT. BIOTA also has a weight column whose interpretation depends on the basis column: if basis is D, weight is dry weight; if W, weight is wet weight. We derive the complementary weight using PERCENTWT.

Important: FEEDBACK TO DATA PROVIDER

dw% > 100%: 20 BIOTA rows and 625 SEDIMENT rows have a dry-weight percentage greater than 100%, which would imply that the dry weight exceeds the fresh weight. These should be verified.

Important: FEEDBACK TO DATA PROVIDER

dw% = 0%: 6 BIOTA rows and 302 SEDIMENT rows have a dry-weight percentage of zero, which is physically impossible. We treat these as missing.

We define a dedicated callback rather than using the generic RemapCB because the basis column also feeds into the weight calculations below. Making the correction explicit keeps each step auditable.


source

CleanBasisCB


def CleanBasisCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Map basis F to W (BIOTA).

For SEDIMENT, the dw% column is the only weight information available. We divide by 100 to get a decimal fraction and set zero values (physically impossible) to NaN, treating them as missing.


source

PercentWeightCB


def PercentWeightCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Compute PERCENTWT = dw% / 100 (SEDIMENT).

For BIOTA, we have both the percentage and the actual weight. The basis column tells us which weight (dry or wet) was recorded, so we can derive the other from PERCENTWT.


source

WeightCB


def WeightCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Compute DRYWT / WETWT from weight + basis (BIOTA).
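
The arithmetic behind the three weight callbacks can be sketched in plain Python. The helper names `percent_to_fraction` and `derive_weights` are hypothetical, and the formulas are inferred from the description above: a recorded dry weight is divided by the fraction to obtain the wet weight, and a recorded wet weight is multiplied by it to obtain the dry weight.

```python
import math
from typing import Optional, Tuple


def percent_to_fraction(dw_percent: Optional[float]) -> float:
    """dw% -> decimal fraction; missing or zero percentages become NaN."""
    if dw_percent is None or math.isnan(dw_percent) or dw_percent == 0:
        return math.nan
    return dw_percent / 100


def derive_weights(weight: float, basis: str, dw_percent: Optional[float]) -> Tuple[float, float]:
    """Return (DRYWT, WETWT) from a recorded weight, its basis, and dw%."""
    frac = percent_to_fraction(dw_percent)
    if basis == "F":  # fresh weight is treated as wet weight
        basis = "W"
    if basis == "D":  # recorded weight is dry: wet = dry / fraction
        return weight, weight / frac
    if basis == "W":  # recorded weight is wet: dry = wet * fraction
        return weight * frac, weight
    return math.nan, math.nan  # unknown or missing basis


print(derive_weights(100.0, "D", 25.0))  # (100.0, 400.0)
```

When dw% is missing or zero, the fraction is NaN and the derived weight is NaN as well, while the recorded weight is kept.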

The mock test below checks all three callbacks work together: basis F→W correction, PERCENTWT computation (including zero → NaN), and the dry/wet derivation.

dfs_mock = {
    'BIOTA': pd.DataFrame({
        'basis': ['D', 'W', 'F'],
        'weight': [100.0, 200.0, 150.0],
        'dw%': [25.0, 30.0, 40.0],
    }),
    'SEDIMENT': pd.DataFrame({
        'dw%': [80.0, 0.0, 110.0],
    }),
}

tfm = Transformer(dfs_mock, cbs=[
    CleanBasisCB(),
    PercentWeightCB(),
    WeightCB(),
])
tfm()

b = tfm.dfs['BIOTA']
assert b['basis'].to_list() == ['D', 'W', 'W']
assert b['PERCENTWT'].to_list() == [0.25, 0.30, 0.40]
assert b['DRYWT'].to_list() == [100.0, 60.0, 60.0]
assert b['WETWT'].to_list() == [400.0, 200.0, 150.0]

s = tfm.dfs['SEDIMENT']
assert s['PERCENTWT'].to_list()[0] == 0.80
assert np.isnan(s['PERCENTWT'].to_list()[1])
assert s['PERCENTWT'].to_list()[2] == 1.10

print("All assertions passed. ✓")
All assertions passed. ✓

Usage on real data:

tfm = Transformer(dfs, cbs=[
    CleanBasisCB(),
    PercentWeightCB(),
    WeightCB(),
])
dfs_out = tfm()

cols = ['basis', 'dw%', 'PERCENTWT', 'weight', 'DRYWT', 'WETWT']
print(dfs_out['BIOTA'][cols].sample(5).to_string(index=False))
basis    dw%  PERCENTWT  weight     DRYWT  WETWT
    W    NaN        NaN     NaN       NaN    NaN
    W 21.240    0.21240   265.5  56.39220  265.5
    W 20.702    0.20702   771.0 159.61242  771.0
    D 14.000    0.14000     NaN       NaN    NaN
    W 18.000    0.18000   517.0  93.06000  517.0

Running the pipeline on real HELCOM data shows the weight columns populated correctly, with NaN for rows where dw% was missing or zero.

cols = ['dw%', 'PERCENTWT']
print(dfs_out['SEDIMENT'].dropna(subset=['dw%'])[cols].head(3).to_string(index=False))
 dw%  PERCENTWT
10.0        0.1
10.0        0.1
10.0        0.1

Standardize Coordinates

HELCOM provides geographical coordinates in two formats per lat/lon: decimal degrees (dddddd) and degrees+minutes (ddmmmm). Column names vary by sample type: BIOTA uses 'latitude dddddd' while SEAWATER and SEDIMENT use 'latitude (dddddd)' (with parentheses). ParseCoordinatesCB finds the columns by substring matching, prefers decimal degrees, and falls back to the ddmmmm format when the decimal value is missing or zero.

Important: FEEDBACK TO DATA PROVIDER

Coordinate column names are inconsistent: BIOTA omits parentheses (latitude dddddd), SEAWATER and SEDIMENT include them (latitude (dddddd)). This should be standardised at source.

Eight SEAWATER rows have zero or NaN values for both latitude and longitude; these are dropped.

ParseCoordinatesCB works in two steps. First, it finds the four coordinate columns by scanning for names containing lat/lon and dddddd/ddmmmm. Second, for each row it reads the decimal-degree column; if the value is missing or zero, it falls back to the degree-minute column (converting via ddmm_to_dd). Rows where both formats are missing or zero are dropped.


source

ParseCoordinatesCB


def ParseCoordinatesCB(
    fn_convert_cor
):

Parse lat/lon from decimal-degree or degree-minute columns, preferring decimal.
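
The two steps can be sketched in plain Python. All three helpers (`find_col`, `ddmm_to_decimal`, `pick_coordinate`) are hypothetical and are not the marisco implementations. The sketch assumes that the ddmmmm format stores degrees in the integer part and minutes in the fractional part (so 54.20 means 54° 20'), which matches the mock values used on this page.

```python
import math
from typing import Iterable, Optional


def find_col(columns: Iterable[str], axis: str, fmt: str) -> str:
    """First column whose name contains both `axis` ('lat'/'lon') and `fmt` ('dddddd'/'ddmmmm')."""
    return next(c for c in columns if axis in c.lower() and fmt in c.lower())


def ddmm_to_decimal(value: float) -> float:
    """Convert degrees.minutes (assumed encoding) to decimal degrees, rounded to 6 places."""
    degrees = math.trunc(value)
    minutes = (value - degrees) * 100
    return round(degrees + minutes / 60, 6)


def _usable(v: Optional[float]) -> bool:
    """A coordinate is usable when it is neither missing nor zero."""
    return v is not None and not math.isnan(v) and v != 0


def pick_coordinate(decimal: Optional[float], ddmm: Optional[float]) -> Optional[float]:
    """Prefer decimal degrees, fall back to degrees.minutes, else None (row to be dropped)."""
    if _usable(decimal):
        return decimal
    if _usable(ddmm):
        return ddmm_to_decimal(ddmm)
    return None


# Works with and without parentheses in the column names
print(find_col(["latitude dddddd", "latitude ddmmmm"], "lat", "dddddd"))
print(find_col(["longitude (ddmmmm)", "longitude (dddddd)"], "lon", "dddddd"))
print(pick_coordinate(0, 12.15))  # 12.25
```

Matching on substrings rather than exact names is what lets one callback handle the differing BIOTA and SEAWATER/SEDIMENT column names.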

Verify ParseCoordinatesCB on mock data. Rows 0 and 3 have valid decimal degrees, which are used as-is. Rows 1 and 2 have zero or NaN decimal degrees and fall back to ddmmmm (54.20 → 54.3333). Row 4 has zero decimal degrees and no ddmmmm value, so it is dropped.

dfs_mock = {'SEAWATER': pd.DataFrame({
    'latitude (dddddd)': [54.28, 0, np.nan, 61.50, 0],
    'latitude (ddmmmm)': [np.nan, 54.20, 54.20, np.nan, np.nan],
    'longitude (dddddd)': [12.32, 0, np.nan, 21.40, 0],
    'longitude (ddmmmm)': [np.nan, 12.15, 12.15, np.nan, np.nan],
})}
tfm = Transformer(dfs_mock, cbs=[ParseCoordinatesCB(ddmm_to_dd)])
tfm()

# Row 0: valid decimal → used as-is
# Row 1: zero decimal → fallback to ddmmmm (54.20 → 54.3333, 12.15 → 12.25)
# Row 2: NaN decimal → fallback to ddmmmm
# Row 3: valid decimal, ddmmmm missing → decimal used as-is
# Row 4: zero decimal + NaN minute → stays 0 / 0 → dropped
test_eq(tfm.dfs['SEAWATER']['LAT'].to_list(), [54.28, 54.333333, 54.333333, 61.5])
test_eq(tfm.dfs['SEAWATER']['LON'].to_list(), [12.32, 12.25, 12.25, 21.4])
test_eq(len(tfm.dfs['SEAWATER']), 4)
print("ParseCoordinatesCB on mock data: all assertions passed. ✓")
ParseCoordinatesCB on mock data: all assertions passed. ✓
tfm = Transformer(dfs, cbs=[ParseCoordinatesCB(ddmm_to_dd)])
dfs_out = tfm()
print(dfs_out['BIOTA'][['LAT', 'LON']].head())
     LAT    LON
0  54.53  10.13
1  54.53  10.13
2  54.53  10.13
3  54.53  10.13
4  54.53  10.13

NetCDF encoder

tfm = Transformer(dfs, cbs=[
    # Nuclide normalisation and mapping
    LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
    RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE'),

    # Time
    ParseTimeCB(),
    EncodeTimeCB(),

    # Value columns (sediment melt, value, uncertainty)
    MeltSedimentValuesCB(coi_sediment),
    SanitizeValueCB(coi_val),
    NormalizeUncCB(),

    # Unit and detection limit
    RemapUnitCB(),
    RemapDetectionLimitCB(coi_dl),

    # BIOTA lookups: species, body part, biological group
    RemapCB(lut=species_lut, col_remap='SPECIES', col_src='rubin', grps=['BIOTA']),
    RemapCB(lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', grps=['BIOTA']),
    RemapCB(lut=lut_biogroup, col_remap='BIO_GROUP', col_src='SPECIES', grps=['BIOTA']),

    # Sediment type
    CleanSedimentCodesCB(replace_lut=sed_replace_lut),
    RemapCB(lut=sediment_lut, col_remap='SED_TYPE', col_src='sedi', grps=['SEDIMENT']),

    # Filtering status (seawater)
    RemapCB(lut=lut_filtered, col_remap='FILT', col_src='filt', grps=['SEAWATER']),

    # Sample identifiers
    AddSampleIDCB(),

    # Depth, salinity, temperature
    AddDepthCB(),
    AddSalinityCB(),
    AddTemperatureCB(),

    # Sediment slice positions
    RemapSedSliceTopBottomCB(),

    # Weights (BIOTA and SEDIMENT)
    CleanBasisCB(),
    PercentWeightCB(),
    WeightCB(),

    # Coordinates
    ParseCoordinatesCB(ddmm_to_dd),
    SanitizeLonLatCB(),

    # Station
    AddStationCB()
])

dfs_out = tfm()
print(dfs_out['BIOTA'].head())
            key  country laboratory  sequence               date  year  month  \
0  BBFFG1986001      6.0       BFFG   1986001  12/06/86 00:00:00  1986   12.0   
1  BBFFG1986001      6.0       BFFG   1986001  12/06/86 00:00:00  1986   12.0   
2  BBFFG1986001      6.0       BFFG   1986001  12/06/86 00:00:00  1986   12.0   
3  BBFFG1986001      6.0       BFFG   1986001  12/06/86 00:00:00  1986   12.0   
4  BBFFG1986001      6.0       BFFG   1986001  12/06/86 00:00:00  1986   12.0   

   day station  latitude ddmmmm  ...  BIO_GROUP  SMP_ID  SMP_ID_PROVIDER  \
0  6.0  BKIBU1            54.32  ...          4       1     BBFFG1986001   
1  6.0  BKIBU1            54.32  ...          4       2     BBFFG1986001   
2  6.0  BKIBU1            54.32  ...          4       3     BBFFG1986001   
3  6.0  BKIBU1            54.32  ...          4       4     BBFFG1986001   
4  6.0  BKIBU1            54.32  ...          4       5     BBFFG1986001   

   SMP_DEPTH PERCENTWT DRYWT  WETWT    LON    LAT  STATION  
0       20.5    0.2083   NaN    NaN  10.13  54.53   BKIBU1  
1       20.5    0.2083   NaN    NaN  10.13  54.53   BKIBU1  
2       20.5    0.2083   NaN    NaN  10.13  54.53   BKIBU1  
3       20.5    0.2083   NaN    NaN  10.13  54.53   BKIBU1  
4       20.5    0.2083   NaN    NaN  10.13  54.53   BKIBU1  

[5 rows x 51 columns]

Example change logs

tfm.logs
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column.",
 "Remap values from 'NUCLIDE' to 'NUCLIDE' for groups: all.",
 'Parse HELCOM DATE (MM/DD/YY HH:MM:SS) with fallback to YEAR/MONTH/DAY.',
 'Encode time as seconds since epoch.',
 'Melt HELCOM dual-value sediment rows into separate rows per measurement type (Bq/kg, Bq/m²).',
 'Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column.',
 'Convert relative uncertainty (percent) to absolute (standard) uncertainty per group.',
 'Set the MARIS-standard UNIT column from per-sample-type conventions (column name, basis column, or melt result).',
 'Map HELCOM `<` / detected-value conventions to MARIS detection-limit integer codes (2 for DL, 1 for detected).',
 "Remap values from 'rubin' to 'SPECIES' for groups: BIOTA.",
 "Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA.",
 "Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA.",
 'Replace invalid HELCOM SEDI codes with -99 sentinel before nomenclature lookup.',
 "Remap values from 'sedi' to 'SED_TYPE' for groups: SEDIMENT.",
 "Remap values from 'filt' to 'FILT' for groups: SEAWATER.",
 'Assign internal sequential SMP_ID and preserve provider KEY as SMP_ID_PROVIDER.',
 'Rename HELCOM sdepth/tdepth columns to MARIS-standard SMP_DEPTH/TOT_DEPTH and cast as float.',
 'Add salinity (SAL) from HELCOM salin column where present.',
 'Add temperature (TEMP) from HELCOM ttemp column.',
 'Remap Sediment slice top and bottom to MARIS format.',
 'Map basis F to W (BIOTA).',
 'Compute PERCENTWT = dw% / 100 (SEDIMENT).',
 'Compute DRYWT / WETWT from weight + basis (BIOTA).',
 'Parse lat/lon from decimal-degree or degree-minute columns, preferring decimal.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
 'Add station to all DataFrames.']

Feed global attributes


source

get_attrs


def get_attrs(
    tfm:Transformer, # Transformer object
    zotero_key:str, # Zotero dataset record key
    kw:list=['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides', 'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments', 'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality > Ocean Contaminants', 'Earth Science > Biological Classification > Animals/Vertebrates > Fish', 'Earth Science > Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks', 'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans', 'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'], # List of keywords
)->dict: # Global attributes

Retrieve all global attributes.

Exported source
def get_attrs(
    tfm: Transformer, # Transformer object
    zotero_key: str, # Zotero dataset record key
    kw: list = kw # List of keywords
    ) -> dict: # Global attributes
    "Retrieve all global attributes."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
get_attrs(tfm, zotero_key=zotero_key, kw=kw)
{'geospatial_lat_min': '31.17',
 'geospatial_lat_max': '65.75',
 'geospatial_lon_min': '9.6333',
 'geospatial_lon_max': '53.5',
 'geospatial_bounds': 'POLYGON ((9.6333 53.5, 31.17 53.5, 31.17 65.75, 9.6333 65.75, 9.6333 53.5))',
 'geospatial_vertical_max': '437.0',
 'geospatial_vertical_min': '0.0',
 'time_coverage_start': '1984-01-10T00:00:00',
 'time_coverage_end': '2023-11-30T00:00:00',
 'id': '26VMZZ2Q',
 'title': 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances',
 'summary': 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.',
 'creator_name': '[{"creatorType": "author", "name": "HELCOM MORS"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column., Remap values from 'NUCLIDE' to 'NUCLIDE' for groups: all., Parse HELCOM DATE (MM/DD/YY HH:MM:SS) with fallback to YEAR/MONTH/DAY., Encode time as seconds since epoch., Melt HELCOM dual-value sediment rows into separate rows per measurement type (Bq/kg, Bq/m²)., Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column., Convert relative uncertainty (percent) to absolute (standard) uncertainty per group., Set the MARIS-standard UNIT column from per-sample-type conventions (column name, basis column, or melt result)., Map HELCOM `<` / detected-value conventions to MARIS detection-limit integer codes (2 for DL, 1 for detected)., Remap values from 'rubin' to 'SPECIES' for groups: BIOTA., Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA., Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA., Replace invalid HELCOM SEDI codes with -99 sentinel before nomenclature lookup., Remap values from 'sedi' to 'SED_TYPE' for groups: SEDIMENT., Remap values from 'filt' to 'FILT' for groups: SEAWATER., Assign internal sequential SMP_ID and preserve provider KEY as SMP_ID_PROVIDER., Rename HELCOM sdepth/tdepth columns to MARIS-standard SMP_DEPTH/TOT_DEPTH and cast as float., Add salinity (SAL) from HELCOM salin column where present., Add temperature (TEMP) from HELCOM ttemp column., Remap Sediment slice top and bottom to MARIS format., Map basis F to W (BIOTA)., Compute PERCENTWT = dw% / 100 (SEDIMENT)., Compute DRYWT / WETWT from weight + basis (BIOTA)., Parse lat/lon from decimal-degree or degree-minute columns, preferring decimal., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator., Add station to all DataFrames."}
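
As a reading aid for the `geospatial_bounds` attribute above, the polygon can be thought of as the axis-aligned rectangle around all sample positions, written as WKT with longitude first and latitude second. The following is a minimal sketch in plain Python, not the `BboxCB` implementation: `bbox_wkt` is a hypothetical helper, and the input coordinates are simply the extents that appear in the polygon above.

```python
def bbox_wkt(lons, lats):
    """Return a WKT polygon for the rectangle enclosing all (lon, lat) positions."""
    lon_min, lon_max = min(lons), max(lons)
    lat_min, lat_max = min(lats), max(lats)
    # Closed ring: the first corner is repeated at the end.
    corners = [
        (lon_min, lat_min),
        (lon_max, lat_min),
        (lon_max, lat_max),
        (lon_min, lat_max),
        (lon_min, lat_min),
    ]
    ring = ', '.join(f'{lon} {lat}' for lon, lat in corners)
    return f'POLYGON (({ring}))'

# Extents taken from the polygon shown in the output above.
wkt = bbox_wkt(lons=[9.6333, 31.17], lats=[53.5, 65.75])
print(wkt)
```

With these extents the sketch reproduces the `geospatial_bounds` string shown in the output.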

Encoding


source

encode


def encode(
    fname_out:str, # Output file name
    **kwargs, # Additional arguments
)->None:

Encode data to NetCDF.

Exported source
def encode(
    fname_out: str, # Output file name
    **kwargs # Additional arguments
    ) -> None:
    "Encode data to NetCDF."
    dfs = load_data(src_dir)
    tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
                            RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            MeltSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),
                            NormalizeUncCB(),
                            RemapUnitCB(),
                            RemapDetectionLimitCB(coi_dl),
                            RemapCB(lut=species_lut, col_remap='SPECIES', col_src='rubin', grps=['BIOTA']),
                            RemapCB(lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', grps=['BIOTA']),
                            RemapCB(lut=lut_biogroup, col_remap='BIO_GROUP', col_src='SPECIES', grps=['BIOTA']),
                            CleanSedimentCodesCB(replace_lut=sed_replace_lut),
                            RemapCB(lut=sediment_lut, col_remap='SED_TYPE', col_src='sedi', grps=['SEDIMENT']),
                            RemapCB(lut=lut_filtered, col_remap='FILT', col_src='filt', grps=['SEAWATER']),
                            AddSampleIDCB(),
                            AddDepthCB(),
                            AddSalinityCB(),
                            AddTemperatureCB(),
                            RemapSedSliceTopBottomCB(),
                            CleanBasisCB(),
                            PercentWeightCB(),
                            WeightCB(),
                            ParseCoordinatesCB(ddmm_to_dd),
                            SanitizeLonLatCB(),
                            AddStationCB()
                            ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            # custom_maps=tfm.custom_maps,
                            verbose=kwargs.get('verbose', False),
                           )
    encoder.encode()
encode(fname_out, verbose=False)
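
The `encode` function relies on callbacks being applied in the order they are listed: for example, the `BIO_GROUP` remap reads the `SPECIES` column that an earlier remap creates. The sketch below illustrates that pattern with toy classes in plain Python. It is not the marisco `Transformer`: `ToyTransformer`, `ToyLowerStripCB` and `ToyRemapCB` are hypothetical names, rows are plain dicts rather than DataFrames, the lookup table is illustrative, and recording each callback's docstring as a log entry is an assumption made for the example.

```python
class ToyTransformer:
    """Toy stand-in: runs callbacks in order and records each one's docstring."""

    def __init__(self, dfs, cbs):
        self.dfs = dfs    # dict: group name -> list of row dicts
        self.cbs = cbs    # callbacks, applied in list order
        self.logs = []

    def __call__(self):
        for cb in self.cbs:
            cb(self)                             # each callback mutates self.dfs
            self.logs.append(type(cb).__doc__)   # assumption: docstring used as log
        return self.dfs


class ToyLowerStripCB:
    "Lowercase and strip 'nuclide' values into 'NUCLIDE'."

    def __call__(self, tfm):
        for rows in tfm.dfs.values():
            for row in rows:
                row['NUCLIDE'] = row['nuclide'].strip().lower()


class ToyRemapCB:
    "Remap 'NUCLIDE' values through a lookup table."

    def __init__(self, lut):
        self.lut = lut

    def __call__(self, tfm):
        for rows in tfm.dfs.values():
            for row in rows:
                # Unknown values are left unchanged in this sketch.
                row['NUCLIDE'] = self.lut.get(row['NUCLIDE'], row['NUCLIDE'])


# Illustrative input and lookup table (not taken from the HELCOM data).
dfs = {'SEAWATER': [{'nuclide': ' CS-137 '}]}
tfm = ToyTransformer(dfs, cbs=[ToyLowerStripCB(), ToyRemapCB({'cs-137': 'cs137'})])
tfm()
print(tfm.dfs['SEAWATER'][0]['NUCLIDE'])
print(tfm.logs)
```

Swapping the two callbacks would raise a `KeyError`, because the remap would look for a `NUCLIDE` key that does not exist yet.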

NetCDF → CSV (MARIS DB import)

The MARIS data processing workflow involves two key steps:

  1. NetCDF to standardized CSV compatible with the OpenRefine pipeline
    • Convert standardized NetCDF files to CSV files compatible with OpenRefine using the NetCDFDecoder.
    • Preserve data integrity and variable relationships.
    • Maintain standardized nomenclature and units.
  2. Database Integration
    • Process the converted CSV files using OpenRefine.
    • Apply data cleaning and standardization rules.
    • Export validated data to the MARIS master database.

This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the NetCDFDecoder class.

#decode(fname_in=fname_out, verbose=True)
Saved BIOTA to ../../_data/output/100-HELCOM-MORS-2024_BIOTA.csv
Saved SEAWATER to ../../_data/output/100-HELCOM-MORS-2024_SEAWATER.csv
Saved SEDIMENT to ../../_data/output/100-HELCOM-MORS-2024_SEDIMENT.csv
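
The output above suggests one CSV per sample-type group, named after the NetCDF file with the group name appended. The following is a minimal sketch of that naming and writing step using only the standard library. It is not the `NetCDFDecoder` implementation: `group_csv_path` and `save_groups` are hypothetical helpers, groups are assumed to be non-empty lists of row dicts, and the example row is illustrative.

```python
import csv
import tempfile
from pathlib import Path


def group_csv_path(fname_nc, group):
    """Build '<stem>_<GROUP>.csv' in the same folder as the NetCDF file."""
    p = Path(fname_nc)
    return p.with_name(f'{p.stem}_{group}.csv')


def save_groups(fname_nc, groups):
    """Write one CSV per group; `groups` maps group name -> non-empty list of row dicts."""
    paths = []
    for name, rows in groups.items():
        out = group_csv_path(fname_nc, name)
        with open(out, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        paths.append(out)
    return paths


# Write into a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    nc = Path(tmp) / '100-HELCOM-MORS-2024.nc'
    paths = save_groups(nc, {'BIOTA': [{'LON': 10.13, 'LAT': 54.53}]})
    written = paths[0].read_text()

print([p.name for p in paths])
```

Deriving the CSV name from the NetCDF stem keeps each group's file next to the source file and makes the pairing visible from the filename alone.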