Matching provider codes to MARIS identifiers sounds like a string comparison problem, but in practice it is a data curation problem.
There are many reasons the automatic matching can fail:
Spelling variations and typos in the source data
Inconsistent use of hyphens, spaces, or underscores (Cs-137, cs137, CS 137)
Composite or derived values that do not exist as a single MARIS reference (cs134_137_tot, pu239+240)
Provider-specific codes with no direct MARIS equivalent
Scientific names that have been updated (synonyms, reclassifications)
Units written in different notations (Bq/m3 vs Bq.m-3)
No fuzzy-matching algorithm can resolve these on its own. The distance score can flag the borderline cases, but deciding the correct mapping requires domain expertise.
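To illustrate why string handling alone is not enough, here is a minimal sketch of a normalization step. The `normalize` helper is hypothetical and not part of marisco; it only assumes that case, hyphens, spaces, and underscores carry no meaning in nuclide codes.

```python
import re


def normalize(code: str) -> str:
    """Lowercase a code and drop hyphens, underscores, and whitespace."""
    return re.sub(r"[-_\s]+", "", code.strip().lower())


# The three notations from the list above collapse to a single form.
variants = ["Cs-137", "cs137", "CS 137"]
normalized = {normalize(v) for v in variants}
print(normalized)

# A composite code is cleaned up too, but it still has no obvious single target.
composite = normalize("cs134_137_tot")
print(composite)
```

Normalization removes the notational noise, but deciding what a composite or provider-specific code should map to remains a judgment call.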
An LLM could take a first pass, but it would need the same careful human review afterwards. And some nomenclatures are large: the MARIS species reference has over 1500 entries and gets updated over time. Running an LLM on that scale for every handler would be disproportionate, especially when fuzzy matching handles the bulk cases well and the remaining ambiguous cases still need a domain expert to evaluate them anyway.
Our experience has shown that the most reliable approach is: let the computer do brute-force fuzzy matching on the bulk cases, then bring a domain expert into the loop for the ambiguous ones.
How it works
The reconciliation workflow breaks down into four steps, each with a clear outcome:
Try an automatic mapping: Let the computer do brute-force fuzzy matching between provider values and MARIS references. The distance score tells you how far apart two strings are. A score of 0 means an exact match. The higher the score, the more the strings differ. This score is your signal for which cases need human attention.
Fix what it got wrong: Apply expert overrides for every case the fuzzy match could not resolve correctly. This is where domain knowledge enters.
Check the result: Verify the final mapping before using it in the pipeline.
Use the mapping in a Transformer: Package the resolved lookup table and pass it into a callback.
Let’s see it in action.
First, fuzzy-match the data provider values against the MARIS nomenclature.
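As a rough illustration of what such a match computes, here is a minimal sketch based on a plain Levenshtein distance. The helpers `levenshtein` and `best_match` and the short reference list are hypothetical stand-ins, not marisco's API, and the metric marisco actually uses may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_match(value, references):
    """Return (closest reference, distance) for one provider value."""
    return min(((ref, levenshtein(value, ref)) for ref in references),
               key=lambda pair: pair[1])


# Illustrative subset only, not the real MARIS reference.
maris_names = ["cs137", "k40", "sr90", "cs134_137_tot"]

for value in ["cs137", "k-40", "sr 90"]:
    print(value, best_match(value, maris_names))
```

A distance of 0 is an exact match. Small positive distances, such as the one for `k-40`, are the borderline cases worth a human look.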
Apply expert overrides for every case the fuzzy match got wrong. You write a dictionary that maps each problematic provider value to the correct MARIS name. The fix_lut function applies these overrides and resets the score to 0 for the fixed entries, so you can verify that no issues remain.
There are two kinds of problems you will encounter:
False matches: The algorithm picked a MARIS name, but it is the wrong one. For example, cs134_137_tot was matched to k40 because of some accidental string similarity. These need an override telling the system the correct target.
Missing matches: The provider uses a value that does not exist in the MARIS reference at all. For example, cs134 and cs144 are not standard MARIS nuclide names. You need to research what they actually represent and map them to the closest MARIS equivalent.
    from marisco.match import fix_lut

    # Expert overrides: provider value -> correct MARIS name
    overrides = {
        'cs134_137_tot': 'cs134_137_tot',  # correct match exists
        'cs144': 'cs137',                  # typo for cs137
        'cs134': 'cs134_137_tot',          # combined measurement
    }

    fixed = fix_lut(merged, overrides, maris_ref,
                    left_on='value', right_on='name', id_col='maris_id')
    print(fixed[fixed['score'] > 0])
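To make the override step concrete, here is a minimal sketch of the idea using plain dictionaries instead of DataFrames. The `apply_overrides` helper, the row layout, and the ids are all made up for illustration; this is not how `fix_lut` is implemented.

```python
def apply_overrides(rows, overrides, name_to_id):
    """Return new rows with expert overrides applied and their score reset to 0."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        target = overrides.get(row["value"])
        if target is not None:
            row["name"] = target
            # A KeyError here exposes an override that targets an unknown name.
            row["maris_id"] = name_to_id[target]
            row["score"] = 0
        fixed.append(row)
    return fixed


# Made-up ids, for illustration only.
name_to_id = {"cs137": 1, "k40": 2, "cs134_137_tot": 3}

merged = [
    {"value": "cs137", "name": "cs137", "maris_id": 1, "score": 0},
    {"value": "cs134_137_tot", "name": "k40", "maris_id": 2, "score": 9},
    {"value": "cs144", "name": "cs137", "maris_id": 1, "score": 1},
]

# Only one of the two problematic rows is overridden here.
overrides = {"cs134_137_tot": "cs134_137_tot"}
fixed = apply_overrides(merged, overrides, name_to_id)

# Rows with a positive score are the ones not yet resolved by an override.
unresolved = [row for row in fixed if row["score"] > 0]
print(unresolved)
```

Because overridden rows get a score of 0, filtering on a positive score afterwards lists exactly the entries that still need a decision.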
When the match is correct but the score is not zero
A non-zero score does not always mean something is wrong. The algorithm scores every pair by string distance, and sometimes a valid mapping produces a small positive score. For example, the HELCOM RUBIN code ENCHINODERMATA CIM fuzzy-matches to the MARIS species Echinodermata with a non-zero score, because the provider uses a more specific category name than the MARIS reference entry. The match is semantically correct, so you leave it as is and move on.
What matters is not whether the score is zero but whether the best match is clearly the same thing. If it is, no override is needed, even with a non-zero score. You only need an override when the fuzzy match picked the wrong MARIS entry, or when you decide a value must map to a specific target that the algorithm did not rank first.
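One way to organize that review is sketched below. The `triage` helper, the row layout, and the scores are hypothetical and chosen only to illustrate the idea: exact matches are set aside, and the rest are sorted so the most distant matches are reviewed first.

```python
def triage(rows):
    """Split matches into exact ones and those that need an expert look."""
    exact = [row for row in rows if row["score"] == 0]
    review = sorted((row for row in rows if row["score"] > 0),
                    key=lambda row: row["score"], reverse=True)
    return exact, review


# Scores are illustrative, not actual matcher output.
rows = [
    {"value": "cs137", "name": "cs137", "score": 0},
    {"value": "ENCHINODERMATA CIM", "name": "Echinodermata", "score": 5},
    {"value": "cs134_137_tot", "name": "k40", "score": 9},
]

exact, review = triage(rows)

# Matches the expert has confirmed as correct despite a non-zero score.
confirmed = {"ENCHINODERMATA CIM"}
needs_override = [row["value"] for row in review if row["value"] not in confirmed]
print(needs_override)
```

Keeping the confirmed set explicit records which non-zero scores were checked and accepted, rather than silently ignored.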
When the provider supplies a lookup table
Some providers include a separate file that maps their codes to scientific names. For example, HELCOM ships a RUBIN_NAME.csv that maps RUBIN species codes to their full scientific names. In this case you load that file directly and run the workflow on it.
Many providers, however, only share measurement data without a mapping table. In that case you derive the unique values directly from the data columns using lut_from, which collects all unique entries across every sample group (SEAWATER, BIOTA, SEDIMENT) into a single DataFrame.
Once you have this derived lookup table, the workflow proceeds exactly as above: try an automatic match against the MARIS reference, inspect the borderline scores, fix with overrides, and check the result.
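The derivation step can be pictured with the following minimal sketch. It uses lists of dictionaries as a stand-in for DataFrames, and `unique_values` is a hypothetical helper, not the `lut_from` function itself.

```python
def unique_values(groups, column):
    """Collect the sorted unique values of one column across all sample groups."""
    seen = set()
    for records in groups.values():
        for record in records:
            value = record.get(column)
            if value is not None:
                seen.add(value)
    return sorted(seen)


# Mock measurement data, one list of records per sample group.
groups = {
    "SEAWATER": [{"NUCLIDE": "cs137"}, {"NUCLIDE": "k-40"}, {"NUCLIDE": "sr90"}],
    "BIOTA": [{"NUCLIDE": "cs137"}, {"NUCLIDE": "cs144"}],
    "SEDIMENT": [{"NUCLIDE": "sr90"}],
}

values = unique_values(groups, "NUCLIDE")
print(values)
```

Each value appears once regardless of how many groups use it, so the expert reviews every provider code only one time.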
Once the check passes, wrap the lookup table into a Transformer-ready callable.
This example creates a mock provider LUT, fuzzy-matches it against the MARIS NUCLIDE reference (loaded with get_lut), applies expert overrides, and packages the result into a RemapCB callback. This is the most explicit workflow, mapping directly to what you would do for a new provider.
When the provider only delivers measurement data without a separate lookup table, make_lut handles the entire workflow in one call: it derives unique values from the data, fuzzy-matches against the MARIS reference, applies your overrides, and returns a ready-to-use callable.
    import pandas as pd
    from marisco.configs import get_lut
    from marisco.match import make_lut
    from marisco.callbacks import RemapCB

    # What the provider actually delivers: measurement data in each group
    dfs = {
        'SEAWATER': pd.DataFrame({'NUCLIDE': ['cs137', 'cs134', 'k-40', 'sr90']}),
        'BIOTA': pd.DataFrame({'NUCLIDE': ['cs137', 'cs144', 'cs134_137_tot']}),
    }

    # MARIS reference table for the NUCLIDE nomenclature
    maris_ref = get_lut('NUCLIDE', as_df=True)

    # Expert overrides: provider value -> correct MARIS name
    fixes = {'cs134': 'cs134_137_tot', 'cs144': 'cs137', 'k-40': 'k40'}

    # make_lut does the derive + fuzzy-match + fix in one call
    nuclide_lut = make_lut('NUCLIDE', fixes=fixes)

    # Use in Transformer
    cb = RemapCB(lut=nuclide_lut, col_remap='NUCLIDE', col_src='NUCLIDE')
Running the Transformer end-to-end
Once the LUT is built, you feed it into a Transformer along with your data. The RemapCB callback remaps every value in the specified column to its MARIS identifier during encoding. The output shows the remapped numeric IDs.
For simple mappings where the provider values already match MARIS codes exactly, you can pass a plain Python dict directly as the lut. No fuzzy matching or overrides needed.
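The sketch below shows one way a remapping step could accept either form of lookup table. It is an assumption about the general pattern, not RemapCB's actual implementation: `remap_column` is a hypothetical helper, and the ids are made up.

```python
def remap_column(records, column, lut):
    """Replace each value in `column` by its id from `lut`.

    `lut` is either a plain dict or a zero-argument callable returning one.
    """
    mapping = lut() if callable(lut) else lut
    return [{**record, column: mapping[record[column]]} for record in records]


records = [{"NUCLIDE": "cs137"}, {"NUCLIDE": "sr90"}]

# Made-up ids, for illustration only.
plain_lut = {"cs137": 1, "sr90": 2}

from_dict = remap_column(records, "NUCLIDE", plain_lut)
from_callable = remap_column(records, "NUCLIDE", lambda: plain_lut)
print(from_dict)
```

Accepting a callable lets an expensive lookup table be built only when the remapping actually runs, while a plain dict keeps the simple case simple.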
Each handler should document the overrides dictionary and any data-quality issues discovered during inspection. The HELCOM handler shows this pattern with FEEDBACK TO DATA PROVIDER callout boxes.
For instance, the following markdown cell:
:::{.callout-important}
## FEEDBACK TO DATA PROVIDER
Some `rubin` codes in the HELCOM Biota dataset do not appear in the `RUBIN_NAME.csv` lookup table. This includes entries with trailing spaces (`FUCU VES `, `GADU MOR `) and apparently missing codes (`FUCU SPP`, `FURC LUMB`, `STUC PECT`). Trailing spaces should be trimmed at source, and any valid RUBIN codes missing from the lookup table should be added.
:::
would be rendered as:
Important: FEEDBACK TO DATA PROVIDER
Some rubin codes in the HELCOM Biota dataset do not appear in the RUBIN_NAME.csv lookup table. This includes entries with trailing spaces (FUCU VES, GADU MOR) and apparently missing codes (FUCU SPP, FURC LUMB, STUC PECT). Trailing spaces should be trimmed at source, and any valid RUBIN codes missing from the lookup table should be added.
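Issues like these can be surfaced during inspection with a small check such as the one sketched here. The `audit_codes` helper is hypothetical, and the two lists only reuse the codes quoted in the callout above rather than the full HELCOM files.

```python
def audit_codes(data_codes, lookup_codes):
    """Classify data codes that are absent from the provider lookup table."""
    lookup = set(lookup_codes)
    whitespace, missing = [], []
    for code in sorted(set(data_codes) - lookup):
        if code.strip() in lookup:
            whitespace.append(code)  # matches once surrounding spaces are trimmed
        else:
            missing.append(code)     # not in the lookup table at all
    return whitespace, missing


lookup_codes = ["FUCU VES", "GADU MOR"]
data_codes = ["FUCU VES ", "GADU MOR ", "FUCU VES",
              "FUCU SPP", "FURC LUMB", "STUC PECT"]

whitespace, missing = audit_codes(data_codes, lookup_codes)
print("trailing or leading spaces:", whitespace)
print("missing from lookup table:", missing)
```

Separating the two categories keeps the feedback actionable: the first list is a formatting fix at source, the second needs the provider to extend the lookup table.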