Sample ID Coverage

Status of SMP_ID and SMP_ID_PROVIDER capture across all marisco handlers

Background

Every measurement row in a MARIS NetCDF4 file requires two identity fields:

| MARISCO column | NetCDF variable | CSV variable | Purpose |
|---|---|---|---|
| SMP_ID | id (dimension) | | Marisco-generated sequential integer; unique within a sample-type group in a given file |
| SMP_ID_PROVIDER | id_provider | samplabcode | Original sample identifier supplied by the data provider; stored verbatim; empty when the provider does not supply one |

SMP_ID is the NetCDF4 dimension: every group must have it, and its values must be unique within that group.
SMP_ID_PROVIDER is optional from the data-model perspective, but should always be populated when the provider ships an identifier, as it enables traceability back to the source dataset.
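
The two rules above can be checked mechanically. The following is a minimal sketch, assuming each handler's output is available as a dict mapping group names to pandas DataFrames; `check_sample_ids` is a hypothetical helper, not part of marisco, and the data is made up.

```python
import pandas as pd

def check_sample_ids(dfs):
    """Return a list of findings about the identity columns (hypothetical helper)."""
    findings = []
    for group, df in dfs.items():
        # SMP_ID backs the NetCDF4 dimension: it must exist and be unique per group.
        if 'SMP_ID' not in df.columns:
            findings.append(f"{group}: SMP_ID is missing")
        elif df['SMP_ID'].duplicated().any():
            findings.append(f"{group}: SMP_ID is not unique")
        # SMP_ID_PROVIDER is optional in the data model, so this is only a warning.
        if 'SMP_ID_PROVIDER' not in df.columns:
            findings.append(f"{group}: SMP_ID_PROVIDER is missing")
    return findings

# Toy data; the group names mirror those used elsewhere on this page.
dfs = {
    'SEAWATER': pd.DataFrame({'SMP_ID': [1, 2, 3], 'SMP_ID_PROVIDER': ['a', 'b', 'c']}),
    'SUSPENDED_MATTER': pd.DataFrame({'SMP_ID': [1, 1]}),
}
findings = check_sample_ids(dfs)
print(findings)
```

In this toy input only the SUSPENDED_MATTER group is reported, once for the duplicated SMP_ID and once for the absent SMP_ID_PROVIDER column.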

Status summary

| Handler | SMP_ID | SMP_ID_PROVIDER | Provider has IDs? | Notes |
|---|---|---|---|---|
| helcom | ✅ auto-incremented per group | ✅ from key column | Yes | Reference implementation |
| ospar | ✅ auto-incremented per group | ✅ from sample id column | Yes | |
| geotraces | ✅ auto-incremented per group | ✅ from BODC Bottle Number:INTEGER | Yes | Fixed: was incorrectly placed in SMP_ID |
| tepco | ✅ auto-incremented (SEAWATER only) | ⚠️ empty string "" | No | |
| maris_legacy | ✅ from legacy MARIS DB sample_id | ❌ missing | Rarely | samplabcode not captured; nearly always null in legacy data |

Per-handler details

helcom — ✅ Complete (reference pattern)

HELCOM exposes a key column that uniquely identifies each measurement in the MORS database.
The handler creates both columns inside a dedicated callback:

df['SMP_ID'] = range(1, len(df) + 1)          # marisco internal id
df['SMP_ID_PROVIDER'] = df['key'].astype(str)  # provider id verbatim

This is the canonical pattern all handlers should follow when the provider supplies an identifier.
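
As a rough illustration of how this pattern generalises to any provider column, here is a minimal sketch. The function `add_sample_ids`, the `provider_col` parameter, and the sample keys are all hypothetical and do not come from the marisco code base.

```python
import pandas as pd

def add_sample_ids(df, provider_col):
    """Return a copy of df with SMP_ID and SMP_ID_PROVIDER added (hypothetical helper)."""
    out = df.copy()
    out['SMP_ID'] = range(1, len(out) + 1)                   # marisco internal id, 1..n
    out['SMP_ID_PROVIDER'] = out[provider_col].astype(str)   # provider id, kept verbatim
    return out

# Made-up rows standing in for a provider table with a 'key' column.
helcom_like = pd.DataFrame({'key': ['K-001', 'K-002'], 'value': [5.3, 19.9]})
result = add_sample_ids(helcom_like, 'key')
print(result[['SMP_ID', 'SMP_ID_PROVIDER']])
```

Returning a copy keeps the input frame untouched, which makes the helper easier to test in isolation; the handlers themselves assign in place inside a callback.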


ospar — ✅ Complete

OSPAR supplies a sample id column. The handler mirrors the HELCOM pattern:

df['SMP_ID'] = range(1, len(df) + 1)
df['SMP_ID_PROVIDER'] = df['sample id'].astype(str)

geotraces — ✅ Fixed

The BODC Bottle Number:INTEGER field is renamed to SMP_ID_PROVIDER via renaming_rules in RenameColumnCB, then AddSampleIDCB generates a sequential SMP_ID and casts SMP_ID_PROVIDER to string:

renaming_rules = {
    ...
    'BODC Bottle Number:INTEGER': 'SMP_ID_PROVIDER'
}

class AddSampleIDCB(Callback):
    def __call__(self, tfm):
        for _, df in tfm.dfs.items():
            df['SMP_ID'] = range(1, len(df) + 1)
            df['SMP_ID_PROVIDER'] = df['SMP_ID_PROVIDER'].astype(str)

AddSampleIDCB runs after DispatchToGroupCB, so it iterates over both SEAWATER and SUSPENDED_MATTER groups.

Previously: the BODC Bottle Number was mapped directly to SMP_ID via renaming_rules, leaving SMP_ID_PROVIDER absent and the NetCDF dimension populated with raw provider integers.
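
The effect of the fix can be reproduced on toy data without marisco itself. This sketch assumes the dispatched groups are plain pandas DataFrames held in a dict; the bottle numbers are made up, and the loop body copies the two assignments from AddSampleIDCB above.

```python
import pandas as pd

renaming_rules = {'BODC Bottle Number:INTEGER': 'SMP_ID_PROVIDER'}

# Made-up bottle numbers, for illustration only.
raw = pd.DataFrame({'BODC Bottle Number:INTEGER': [101, 102, 205]})
renamed = raw.rename(columns=renaming_rules)

# Stand-in for the per-group frames that exist after dispatching.
dfs = {
    'SEAWATER': renamed.iloc[:2].copy(),
    'SUSPENDED_MATTER': renamed.iloc[2:].copy(),
}

for _, df in dfs.items():
    df['SMP_ID'] = range(1, len(df) + 1)                       # restarts at 1 in each group
    df['SMP_ID_PROVIDER'] = df['SMP_ID_PROVIDER'].astype(str)  # raw bottle number as text

for group, df in dfs.items():
    print(group, df[['SMP_ID', 'SMP_ID_PROVIDER']].values.tolist())
```

Each group gets its own sequence starting at 1, while the provider's bottle numbers survive unchanged as strings in SMP_ID_PROVIDER.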


tepco — ⚠️ SMP_ID_PROVIDER present but empty

TEPCO monitoring data does not carry sample identifiers. The AddSampleIdCB callback handles both columns for the SEAWATER group (the only group present in TEPCO data):

tfm.dfs['SEAWATER']['SMP_ID'] = range(1, len(tfm.dfs['SEAWATER']) + 1)
tfm.dfs['SEAWATER']['SMP_ID_PROVIDER'] = ""

Minor concern: SMP_ID_PROVIDER is set to an empty string "" rather than a null/None. Depending on encoder behaviour this may surface as an empty-string column rather than a missing-value column. Using None or pd.NA would be more consistent.
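
The difference is visible at the pandas level, before any encoder is involved. The sketch below uses a made-up SEAWATER frame and compares an empty-string column with one holding pd.NA in a nullable string dtype; how the marisco encoder treats either case is not shown here and remains an open question.

```python
import pandas as pd

# Made-up SEAWATER rows, for illustration only.
seawater = pd.DataFrame({'value': [0.12, 0.08]})
seawater['SMP_ID'] = range(1, len(seawater) + 1)

# Current behaviour: empty strings.
empty = seawater.copy()
empty['SMP_ID_PROVIDER'] = ""

# Alternative: explicit missing values in a nullable string column.
missing = seawater.copy()
missing['SMP_ID_PROVIDER'] = pd.array([pd.NA] * len(missing), dtype='string')

n_empty_flagged = int(empty['SMP_ID_PROVIDER'].isna().sum())
n_missing_flagged = int(missing['SMP_ID_PROVIDER'].isna().sum())
print(n_empty_flagged, n_missing_flagged)
```

pandas does not treat "" as missing, so a downstream null check sees the empty-string column as fully populated, whereas the pd.NA column is reported as entirely missing.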


maris_legacy — ⚠️ SMP_ID_PROVIDER not captured

The legacy MARIS CSV format contains two ID-related fields:

| CSV column | Meaning |
|---|---|
| sample_id | MARIS central-database internal integer ID |
| samplabcode | Provider lab code (maps to SMP_ID_PROVIDER in CSV_VARS) |

The handler renames sample_id → SMP_ID (correct: this is already the MARIS internal ID) but does not capture samplabcode as SMP_ID_PROVIDER.

In practice samplabcode is nearly always null in the legacy export (~1 non-null value out of ~420 000 records), so the impact is low. The column should still be mapped for correctness and forward-compatibility.
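
One possible shape for that mapping is sketched below on made-up rows. It assumes the legacy export is loaded as a pandas DataFrame and that nulls should stay null rather than become text; the rename dict and the choice of the nullable string dtype are illustrative, not the handler's actual code.

```python
import pandas as pd

# Made-up legacy rows: samplabcode is null in almost every record.
legacy = pd.DataFrame({
    'sample_id': [1001, 1002, 1003],
    'samplabcode': [None, 'LAB-7', None],
})

mapped = legacy.rename(columns={
    'sample_id': 'SMP_ID',             # already the MARIS internal id
    'samplabcode': 'SMP_ID_PROVIDER',  # provider lab code, when present
})

# The nullable string dtype keeps missing codes as <NA> instead of text.
mapped['SMP_ID_PROVIDER'] = mapped['SMP_ID_PROVIDER'].astype('string')
print(mapped)
```

The nullable `string` dtype is used here instead of `astype(str)` because, depending on the pandas version, a plain string cast can turn nulls into literal text such as 'None' or 'nan', which would then look like a real provider identifier.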

UniqueIndexCB() is also used in the legacy pipeline, but it creates a separate ID column (a plain DataFrame reset-index, unrelated to SMP_ID).