This data pipeline, known as a “handler” in Marisco terminology, is designed to clean, standardize, and encode HELCOM data into NetCDF format. The handler processes raw HELCOM data, applying various transformations and lookups to align it with MARIS data standards.

Key functions of this handler:

  • Load and merge the HELCOM sample and measurement tables for each sample type (seawater, sediment, biota).

  • Normalize nuclide names and remap nuclides, units, detection limits, biota species and tissues, and sediment types to MARIS standards.

  • Standardize time, measurement values, and uncertainties.

  • Encode the harmonized data into the MARIS NetCDF format.

This handler is a crucial component in the Marisco data processing workflow, ensuring HELCOM data is properly integrated into the MARIS database.

Tip

For new MARIS users, please refer to Understanding MARIS Data Formats (NetCDF and Open Refine) for detailed information.

The present notebook is an instance of Literate Programming in the sense that it is a narrative interspersed with code snippets and explanations. When a function or a class needs to be exported to a dedicated Python module (in our case marisco/handlers/helcom.py), the code snippet is added to the module using the #| exports directive provided by the wonderful nbdev library.

Configuration & file paths

  • src_dir: path to the maris-crawlers folder containing the HELCOM data in CSV format.

  • fname_out_nc: path and filename for the NetCDF output. The path can be defined as a relative path.

  • Zotero key: used to retrieve attributes related to the dataset from Zotero. The MARIS datasets are catalogued in a library available on Zotero.

Tip

FEEDBACK FOR NEXT VERSION: Review the NetCDF file naming convention as discussed here. I think we should include ‘MARISCO’ in the filename and attributes to acknowledge the contributions and branding of the project.

Exported source
src_dir = 'https://raw.githubusercontent.com/franckalbinet/maris-crawlers/refs/heads/main/data/processed/HELCOM%20MORS'
fname_out_nc = '../../_data/output/100-HELCOM-MORS-2024.nc'
zotero_key = '26VMZZ2Q' # HELCOM MORS zotero key

Load data

HELCOM MORS (Monitoring of Radioactive Substances in the Baltic Sea) data is provided as a zipped Microsoft Access database. We automatically fetch this dataset and export its database tables as .csv files using a GitHub Action in the maris-crawlers repository.

The dataset is then accessible in a format amenable to the marisco data pipeline.


source

read_csv

 read_csv (file_name,
           dir='https://raw.githubusercontent.com/franckalbinet/maris-
           crawlers/refs/heads/main/data/processed/HELCOM%20MORS')
Exported source
default_smp_types = {  
    'BIO': 'BIOTA', 
    'SEA': 'SEAWATER', 
    'SED': 'SEDIMENT'
}
Exported source
def read_csv(file_name, dir=src_dir):
    file_path = f'{dir}/{file_name}'
    return pd.read_csv(file_path)

source

load_data

 load_data (src_url:str, smp_types:dict={'BIO': 'BIOTA', 'SEA':
            'SEAWATER', 'SED': 'SEDIMENT'}, use_cache:bool=False,
            save_to_cache:bool=False, verbose:bool=False)

Load HELCOM data and return the data in a dictionary of dataframes with the dictionary key as the sample type.

Exported source
def load_data(src_url: str, 
              smp_types: dict = default_smp_types, 
              use_cache: bool = False,
              save_to_cache: bool = False,
              verbose: bool = False) -> Dict[str, pd.DataFrame]:
    "Load HELCOM data and return the data in a dictionary of dataframes with the dictionary key as the sample type."

    
    def load_and_merge(file_prefix: str) -> pd.DataFrame:
        
        if use_cache:
            dir=cache_path()
        else:
            dir = src_url
            
        file_smp_path = f'{dir}/{file_prefix}01.csv'
        file_meas_path = f'{dir}/{file_prefix}02.csv'

        if use_cache:
            if not Path(file_smp_path).exists():
                print(f'{file_smp_path} not found.')            
            if not Path(file_meas_path).exists():
                print(f'{file_meas_path} not found.')
        
        if verbose:
            start_time = time.time()
        df_meas = read_csv(f'{file_prefix}02.csv', dir)
        df_smp = read_csv(f'{file_prefix}01.csv', dir)
        
        df_meas.columns = df_meas.columns.str.lower()
        df_smp.columns = df_smp.columns.str.lower()
        
        merged_df = pd.merge(df_meas, df_smp, on='key', how='left')
        
        if verbose:
            print(f"Downloaded data for {file_prefix}01.csv and {file_prefix}02.csv in {time.time() - start_time:.2f} seconds.")
            
        if save_to_cache:
            dir = cache_path()
            df_smp.to_csv(f'{dir}/{file_prefix}01.csv', index=False)
            df_meas.to_csv(f'{dir}/{file_prefix}02.csv', index=False)
            if verbose:
                print(f"Saved downloaded data to cache at {dir}/{file_prefix}01.csv and {dir}/{file_prefix}02.csv")

        return merged_df
    return {smp_type: load_and_merge(file_prefix) for file_prefix, smp_type in smp_types.items()}

dfs is a dictionary of dataframes created from the HELCOM dataset located at src_dir. The data is split by sample type, and each dictionary entry is keyed by its sample type.

dfs = load_data(src_dir, save_to_cache=True, verbose=True)
print('keys/sample types: ', dfs.keys())

for key in dfs.keys():
    print(f'{key} columns: ', dfs[key].columns)
    print(f'{key} shape: ', dfs[key].shape)
Downloaded data for BIO01.csv and BIO02.csv in 1.54 seconds.
Saved downloaded data to cache at /home/niallmurphy93/.marisco/cache/BIO01.csv and /home/niallmurphy93/.marisco/cache/BIO02.csv
Downloaded data for SEA01.csv and SEA02.csv in 1.30 seconds.
Saved downloaded data to cache at /home/niallmurphy93/.marisco/cache/SEA01.csv and /home/niallmurphy93/.marisco/cache/SEA02.csv
Downloaded data for SED01.csv and SED02.csv in 1.92 seconds.
Saved downloaded data to cache at /home/niallmurphy93/.marisco/cache/SED01.csv and /home/niallmurphy93/.marisco/cache/SED02.csv
keys/sample types:  dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])
BIOTA columns:  Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'basis',
       'error%', 'number', 'date_of_entry_x', 'country', 'laboratory',
       'sequence', 'date', 'year', 'month', 'day', 'station',
       'latitude ddmmmm', 'latitude dddddd', 'longitude ddmmmm',
       'longitude dddddd', 'sdepth', 'rubin', 'biotatype', 'tissue', 'no',
       'length', 'weight', 'dw%', 'loi%', 'mors_subbasin', 'helcom_subbasin',
       'date_of_entry_y'],
      dtype='object')
BIOTA shape:  (16124, 33)
SEAWATER columns:  Index(['key', 'nuclide', 'method', '< value_bq/m³', 'value_bq/m³', 'error%_m³',
       'date_of_entry_x', 'country', 'laboratory', 'sequence', 'date', 'year',
       'month', 'day', 'station', 'latitude (ddmmmm)', 'latitude (dddddd)',
       'longitude (ddmmmm)', 'longitude (dddddd)', 'tdepth', 'sdepth', 'salin',
       'ttemp', 'filt', 'mors_subbasin', 'helcom_subbasin', 'date_of_entry_y'],
      dtype='object')
SEAWATER shape:  (21634, 27)
SEDIMENT columns:  Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'error%_kg',
       '< value_bq/m²', 'value_bq/m²', 'error%_m²', 'date_of_entry_x',
       'country', 'laboratory', 'sequence', 'date', 'year', 'month', 'day',
       'station', 'latitude (ddmmmm)', 'latitude (dddddd)',
       'longitude (ddmmmm)', 'longitude (dddddd)', 'device', 'tdepth',
       'uppsli', 'lowsli', 'area', 'sedi', 'oxic', 'dw%', 'loi%',
       'mors_subbasin', 'helcom_subbasin', 'sum_link', 'date_of_entry_y'],
      dtype='object')
SEDIMENT shape:  (40744, 35)

Normalize nuclide names

Lower & strip nuclide names

Tip

FEEDBACK TO DATA PROVIDER: Some nuclide names contain one or multiple trailing spaces.

This is demonstrated below for the NUCLIDE column:

df = get_unique_across_dfs(load_data(src_dir, use_cache=True, verbose=True), 'nuclide', as_df=True, include_nchars=True)
df['stripped_chars'] = df['value'].str.strip().str.replace(' ', '').str.len()
print(df[df['n_chars'] != df['stripped_chars']])
Downloaded data for BIO01.csv and BIO02.csv in 0.03 seconds.
Downloaded data for SEA01.csv and SEA02.csv in 0.05 seconds.
Downloaded data for SED01.csv and SED02.csv in 0.09 seconds.
    index      value  n_chars  stripped_chars
0       0      SR90         5               4
19     19   CS137           8               5
27     27  CS137            9               5
34     34   PU238           8               5
40     40     CS137         6               5
42     42   CO60            8               4
45     45    TC99           7               4
68     68    SR90           7               4
80     80   K40             8               3
82     82   CS134           8               5
88     88   SR90            8               4
91     91     SR90          6               4
95     95   AM241           8               5

To fix this issue, we use the LowerStripNameCB callback. For each dataframe in the dictionary of dataframes, it corrects the nuclide name by converting it to lowercase and stripping any leading or trailing whitespace.
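
In plain pandas terms, the callback does the equivalent of the following (a minimal sketch on a hypothetical series; the real callback, provided by marisco, is applied next):

import pandas as pd

# Hypothetical nuclide strings with trailing/leading spaces, as reported above.
s = pd.Series(['CS137   ', ' SR90', 'K40'])
print(s.str.strip().str.lower().tolist())
# ['cs137', 'sr90', 'k40']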

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE')])
tfm()
for key, df in tfm.dfs.items():
    print(f'{key} Nuclides: ')
    print(df['NUCLIDE'].unique())
    
print(tfm.logs)
BIOTA Nuclides: 
['cs134' 'k40' 'co60' 'cs137' 'sr90' 'ag108m' 'mn54' 'co58' 'ag110m'
 'zn65' 'sb125' 'pu239240' 'ru106' 'be7' 'ce144' 'pb210' 'po210' 'sb124'
 'sr89' 'zr95' 'te129m' 'ru103' 'nb95' 'ce141' 'la140' 'i131' 'ba140'
 'pu238' 'u235' 'bi214' 'pb214' 'pb212' 'tl208' 'ac228' 'ra223' 'eu155'
 'ra226' 'gd153' 'sn113' 'fe59' 'tc99' 'co57' 'sn117m' 'eu152' 'sc46'
 'rb86' 'ra224' 'th232' 'cs134137' 'am241' 'ra228' 'th228' 'k-40' 'cs138'
 'cs139' 'cs140' 'cs141' 'cs142' 'cs143' 'cs144' 'cs145' 'cs146']
SEAWATER Nuclides: 
['cs137' 'sr90' 'h3' 'cs134' 'pu238' 'pu239240' 'am241' 'cm242' 'cm244'
 'tc99' 'k40' 'ru103' 'sr89' 'sb125' 'nb95' 'ru106' 'zr95' 'ag110m'
 'cm243244' 'ba140' 'ce144' 'u234' 'u238' 'co60' 'pu239' 'pb210' 'po210'
 'np237' 'pu240' 'mn54']
SEDIMENT Nuclides: 
['cs137' 'ra226' 'ra228' 'k40' 'sr90' 'cs134137' 'cs134' 'pu239240'
 'pu238' 'co60' 'ru103' 'ru106' 'sb125' 'ag110m' 'ce144' 'am241' 'be7'
 'th228' 'pb210' 'co58' 'mn54' 'zr95' 'ba140' 'po210' 'ra224' 'nb95'
 'pu238240' 'pu241' 'pu239' 'eu155' 'ir192' 'th232' 'cd109' 'sb124' 'zn65'
 'th234' 'tl208' 'pb212' 'pb214' 'bi214' 'ac228' 'ra223' 'u235' 'bi212']
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column."]

Remap nuclide names to MARIS data formats

Below, we map nuclide names used by HELCOM to the MARIS standard nuclide names.

Remapping data provider nomenclatures to MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:

  1. Inspect the data provider nomenclature;
  2. Match automatically against MARIS nomenclature (using a fuzzy matching algorithm);
  3. Fix potential mismatches;
  4. Apply the lookup table to the dataframe.

We will refer to this process as IMFA (Inspect, Match, Fix, Apply).

The get_unique_across_dfs function is a utility in MARISCO that retrieves unique values from a specified column across all DataFrames. Note that there is one DataFrame for each sample type, such as biota, sediment, etc.
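
Conceptually, the utility does something like the following rough plain-pandas sketch (toy dataframes, not the marisco implementation); the actual call on the HELCOM data follows below:

import pandas as pd

# Toy dataframes keyed by sample type, mimicking the structure of dfs.
toy = {'BIOTA': pd.DataFrame({'NUCLIDE': ['cs137', 'k40']}),
       'SEAWATER': pd.DataFrame({'NUCLIDE': ['cs137', 'sr90']})}
unique_vals = pd.concat([df['NUCLIDE'] for df in toy.values()]).unique()
print(sorted(unique_vals))
# ['cs137', 'k40', 'sr90']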

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE')])

dfs_output = tfm()

# Transpose to display the dataframe horizontally
get_unique_across_dfs(dfs_output, col_name='NUCLIDE', as_df=True).T
0 1 2 3 4 5 6 7 8 9 ... 67 68 69 70 71 72 73 74 75 76
index 0 1 2 3 4 5 6 7 8 9 ... 67 68 69 70 71 72 73 74 75 76
value nb95 cs140 bi212 ru106 pb214 cs134137 po210 pb210 bi214 th228 ... cs144 h3 pu238240 th232 pu238 cs138 u238 cs141 pb212 co57

2 rows × 77 columns

Let’s now create an instance of the Remapper fuzzy matching class. This instance will match the nuclide names of the HELCOM dataset to the MARIS standard nuclide names.

remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_output, col_name='NUCLIDE', as_df=True),
                    maris_lut_fn=nuc_lut_path,
                    maris_col_id='nuclide_id',
                    maris_col_name='nc_name',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='nuclides_helcom.pkl')

Let’s try to match HELCOM nuclide names to MARIS standard nuclide names as automatically as possible. The match_score column allows us to assess the results:

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing:   0%|          | 0/77 [00:00<?, ?it/s]Processing: 100%|██████████| 77/77 [00:01<00:00, 41.86it/s]
63 entries matched the criteria, while 14 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
cs134137 cs137 cs134137 3
cm243244 cm242 cm243244 3
pu239240 pu239 pu239240 3
pu238240 pu240 pu238240 3
cs143 ce140 cs143 2
cs145 ce140 cs145 2
cs142 ce140 cs142 2
cs140 ce140 cs140 1
cs139 ce139 cs139 1
k-40 k40 k-40 1
cs146 cs136 cs146 1
cs144 cs134 cs144 1
cs138 cs134 cs138 1
cs141 ce141 cs141 1

We can now manually inspect the unmatched nuclide names and create a table to correct them to the MARIS standard:

Exported source
fixes_nuclide_names = {
    'cs134137': 'cs134_137_tot',
    'cm243244': 'cm243_244_tot',
    'pu239240': 'pu239_240_tot',
    'pu238240': 'pu238_240_tot',
    'cs143': 'cs137',
    'cs145': 'cs137',
    'cs142': 'cs137',
    'cs141': 'cs137',
    'cs144': 'cs137',
    'k-40': 'k40',
    'cs140': 'cs137',
    'cs146': 'cs137',
    'cs139': 'cs137',
    'cs138': 'cs137'
    }

We now include the table fixes_nuclide_names, which applies manual corrections to the nuclide names before the remapping process. The generate_lookup_table function has an overwrite parameter (default is True), which, when set to True, creates a pickle file cache of the lookup table. We can now test the remapping process:

remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
fc.test_eq(len(remapper.select_match(match_score_threshold=1, verbose=True)), 0)
Processing:   0%|          | 0/77 [00:00<?, ?it/s]Processing: 100%|██████████| 77/77 [00:01<00:00, 48.43it/s]
77 entries matched the criteria, while 0 entries had a match score of 1 or higher.

Test passes! We can now create a callback RemapNuclideNameCB to remap the nuclide names. Note that we pass overwrite=False to generate_lookup_table so that the cached version is used.

Exported source
# Create a lookup table for nuclide names
lut_nuclides = lambda df: Remapper(provider_lut_df=df,
                                   maris_lut_fn=nuc_lut_path,
                                   maris_col_id='nuclide_id',
                                   maris_col_name='nc_name',
                                   provider_col_to_match='value',
                                   provider_col_key='value',
                                   fname_cache='nuclides_helcom.pkl').generate_lookup_table(fixes=fixes_nuclide_names, 
                                                                                            as_df=False, overwrite=False)

We now create the callback RemapNuclideNameCB, which will remap the nuclide names using the lut_nuclides lookup table.


source

RemapNuclideNameCB

 RemapNuclideNameCB (fn_lut:Callable, col_name:str)

Remap data provider nuclide names to standardized MARIS nuclide names.

Type Details
fn_lut Callable Function that returns the lookup table dictionary
col_name str Column name to remap
Exported source
class RemapNuclideNameCB(Callback):
    "Remap data provider nuclide names to standardized MARIS nuclide names."
    def __init__(self, 
                 fn_lut: Callable, # Function that returns the lookup table dictionary
                 col_name: str # Column name to remap
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        df_uniques = get_unique_across_dfs(tfm.dfs, col_name=self.col_name, as_df=True)
        #lut = {k: v.matched_maris_name for k, v in self.fn_lut(df_uniques).items()}    
        lut = {k: v.matched_id for k, v in self.fn_lut(df_uniques).items()}    
        for k in tfm.dfs.keys():
            tfm.dfs[k]['NUCLIDE'] = tfm.dfs[k][self.col_name].replace(lut)

Let’s see it in action, along with the LowerStripNameCB callback:

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
                            RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
                            CompareDfsAndTfmCB(dfs)
                            ])
dfs_out = tfm()

# For instance
for key in dfs_out.keys():
    print(f'{key} NUCLIDE unique: ', dfs_out[key]['NUCLIDE'].unique())
    
print(tfm.logs)
BIOTA NUCLIDE unique:  [31  4  9 33 12 21  6  8 22 10 24 77 17  2 37 41 47 23 11 13 25 16 14 36
 35 29 34 67 63 46 43 42 94 55 50 40 53 87 92 86 15  7 93 85 91 90 51 59
 76 72 54 57]
SEAWATER NUCLIDE unique:  [33 12  1 31 67 77 72 73 75 15  4 16 11 24 14 17 13 22 80 34 37 62 64  9
 68 41 47 65 69  6]
SEDIMENT NUCLIDE unique:  [ 33  53  54   4  12  76  31  77  67   9  16  17  24  22  37  72   2  57
  41   8   6  13  34  47  51  14  89  70  68  40  88  59  84  23  10  60
  94  42  43  46  55  50  63 130]
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column.", 'Remap data provider nuclide names to standardized MARIS nuclide names.', 'Create a dataframe of removed data. Data included in the `tfm` not in the `dfs`.']

Standardize Time

Tip

FEEDBACK TO DATA PROVIDER: Time/date is provided in the DATE, YEAR, MONTH, DAY columns. Note that the DATE column contains missing values, as shown below. When missing, we fall back on the YEAR, MONTH, DAY columns. Note also that sometimes DAY and MONTH contain 0; in this case we systematically set them to 1.

dfs = load_data(src_dir, use_cache=True)
for key in dfs.keys():
    print(f'{key} DATE null values: ', dfs[key]['date'].isna().sum())
BIOTA DATE null values:  88
SEAWATER DATE null values:  554
SEDIMENT DATE null values:  830

source

ParseTimeCB

 ParseTimeCB ()

Standardize time format across all dataframes.

Exported source
class ParseTimeCB(Callback):
    "Standardize time format across all dataframes."
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            self._process_dates(df)

    def _process_dates(self, df: pd.DataFrame) -> None:
        "Process and correct date and time information in the DataFrame."
        df['TIME'] = self._parse_date(df)
        self._handle_missing_dates(df)
        self._fill_missing_time(df)

    def _parse_date(self, df: pd.DataFrame) -> pd.Series:
        "Parse the DATE column if present."
        return pd.to_datetime(df['date'], format='%m/%d/%y %H:%M:%S', errors='coerce')

    def _handle_missing_dates(self, df: pd.DataFrame):
        "Handle cases where DAY or MONTH is 0 or missing."
        df.loc[df["day"] == 0, "day"] = 1
        df.loc[df["month"] == 0, "month"] = 1
        
        missing_day_month = (df["day"].isna()) & (df["month"].isna()) & (df["year"].notna())
        df.loc[missing_day_month, ["day", "month"]] = 1

    def _fill_missing_time(self, df: pd.DataFrame) -> None:
        "Fill missing time values using year, month, and day columns."
        missing_time = df['TIME'].isna()
        df.loc[missing_time, 'TIME'] = pd.to_datetime(
            df.loc[missing_time, ['year', 'month', 'day']], 
            format='%Y%m%d', 
            errors='coerce'
        )

Apply the transformer with the ParseTimeCB callback. Then, print the TIME data for seawater. Adding the CompareDfsAndTfmCB callback allows us to compare the original dataframes with the transformed dataframes using the compare_stats attribute.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['SEAWATER'][['TIME']])
                                                     BIOTA  SEAWATER  SEDIMENT
Number of rows in original dataframes (dfs):         16124     21634     40744
Number of rows in transformed dataframes (tfm.dfs):  16124     21634     40744
Number of rows removed (tfm.dfs_removed):                0         0         0 

            TIME
0     2012-05-23
1     2012-05-23
2     2012-06-17
3     2012-05-24
4     2012-05-24
...          ...
21629 2023-06-11
21630 2023-06-11
21631 2023-06-13
21632 2023-06-13
21633 2023-06-13

[21634 rows x 1 columns]

The NetCDF time format requires that time be encoded as the number of seconds since a specified origin. In our case, the origin is 1970-01-01, as indicated in the cdl.toml file under the [vars.defaults.time.attrs] section.
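
As a quick sanity check of this encoding (a minimal sketch in plain pandas; the actual conversion is handled by EncodeTimeCB):

import pandas as pd

# Seconds elapsed since the 1970-01-01 origin for the first SEAWATER timestamp shown above.
origin = pd.Timestamp('1970-01-01')
print(int((pd.Timestamp('2012-05-23') - origin).total_seconds()))
# 1337731200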

EncodeTimeCB converts the HELCOM time format to the MARIS NetCDF time format.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.logs)
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
                                                     BIOTA  SEAWATER  SEDIMENT
Number of rows in original dataframes (dfs):         16124     21634     40744
Number of rows in transformed dataframes (tfm.dfs):  16124     21626     40743
Number of rows removed (tfm.dfs_removed):                0         8         1 

['Standardize time format across all dataframes.', 'Encode time as seconds since epoch.', 'Create a dataframe of removed data. Data included in the `tfm` not in the `dfs`.']
with pd.option_context('display.max_columns', None):
    display(tfm.dfs['SEAWATER'].head())
key nuclide method < value_bq/m³ value_bq/m³ error%_m³ date_of_entry_x country laboratory sequence date year month day station latitude (ddmmmm) latitude (dddddd) longitude (ddmmmm) longitude (dddddd) tdepth sdepth salin ttemp filt mors_subbasin helcom_subbasin date_of_entry_y
0 WKRIL2012003 CS137 NaN NaN 5.3 32.0 08/20/14 00:00:00 90.0 KRIL 2012003.0 05/23/12 00:00:00 2012.0 5.0 23.0 RU10 60.05 60.0833 29.20 29.3333 NaN 0.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00
1 WKRIL2012004 CS137 NaN NaN 19.9 20.0 08/20/14 00:00:00 90.0 KRIL 2012004.0 05/23/12 00:00:00 2012.0 5.0 23.0 RU10 60.05 60.0833 29.20 29.3333 NaN 29.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00
2 WKRIL2012005 CS137 NaN NaN 25.5 20.0 08/20/14 00:00:00 90.0 KRIL 2012005.0 06/17/12 00:00:00 2012.0 6.0 17.0 RU11 59.26 59.4333 23.09 23.1500 NaN 0.0 NaN NaN NaN 11.0 3.0 08/20/14 00:00:00
3 WKRIL2012006 CS137 NaN NaN 17.0 29.0 08/20/14 00:00:00 90.0 KRIL 2012006.0 05/24/12 00:00:00 2012.0 5.0 24.0 RU19 60.15 60.2500 27.59 27.9833 NaN 0.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00
4 WKRIL2012007 CS137 NaN NaN 22.2 18.0 08/20/14 00:00:00 90.0 KRIL 2012007.0 05/24/12 00:00:00 2012.0 5.0 24.0 RU19 60.15 60.2500 27.59 27.9833 NaN 39.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00

Split Sediment Values

HELCOM reports two values for the SEDIMENT sample type: VALUE_Bq/kg and VALUE_Bq/m². We need to split these into separate rows and use a single VALUE column for the MARIS standard. We will use the UNIT column to identify the reported values.

Let’s take a look at the MARIS unit lookup table:

pd.read_excel(unit_lut_path())
unit_id unit unit_sanitized ordlist Unnamed: 4 Unnamed: 5 Unnamed: 6
0 -1 Not applicable Not applicable NaN NaN NaN NaN
1 0 NOT AVAILABLE NOT AVAILABLE 0.0 NaN NaN NaN
2 1 Bq/m3 Bq per m3 1.0 Bq/m3 NaN Bq/m<sup>3</sup>
3 2 Bq/m2 Bq per m2 2.0 NaN NaN NaN
4 3 Bq/kg Bq per kg 3.0 NaN NaN NaN
5 4 Bq/kgd Bq per kgd 4.0 NaN NaN NaN
6 5 Bq/kgw Bq per kgw 5.0 NaN NaN NaN
7 6 kg/kg kg per kg 6.0 NaN NaN NaN
8 7 TU TU 7.0 NaN NaN NaN
9 8 DELTA/mill DELTA per mill 8.0 NaN NaN NaN
10 9 atom/kg atom per kg 9.0 NaN NaN NaN
11 10 atom/kgd atom per kgd 10.0 NaN NaN NaN
12 11 atom/kgw atom per kgw 11.0 NaN NaN NaN
13 12 atom/l atom per l 12.0 NaN NaN NaN
14 13 Bq/kgC Bq per kgC 13.0 NaN NaN NaN

We will define the columns of interest for the SEDIMENT measurement types:

Exported source
coi_sediment = {
    'kg_type': {
        'VALUE': 'value_bq/kg',
        'UNC': 'error%_kg',
        'DL': '< value_bq/kg',
        'UNIT': 3,  # Unit ID for Bq/kg
    },
    'm2_type': {
        'VALUE': 'value_bq/m²',
        'UNC': 'error%_m²',
        'DL': '< value_bq/m²',
        'UNIT': 2,  # Unit ID for Bq/m²
    }
}

We define the SplitSedimentValuesCB callback to split the sediment entries into separate rows for Bq/kg and Bq/m² values. We use a leading underscore to denote temporary columns created during the splitting process.


source

SplitSedimentValuesCB

 SplitSedimentValuesCB (coi:Dict[str,Dict[str,Any]])

Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements.

Type Details
coi Dict Columns of interest with value, uncertainty, DL columns and units
Exported source
class SplitSedimentValuesCB(Callback):
    "Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements."
    def __init__(self, 
                 coi: Dict[str, Dict[str, Any]] # Columns of interest with value, uncertainty, DL columns and units
                ):
        fc.store_attr()
        
    def __call__(self, tfm: Transformer):
        if 'SEDIMENT' not in tfm.dfs:
            return
            
        df = tfm.dfs['SEDIMENT']
        dfs_to_concat = []
        
        # For each measurement type (kg and m2)
        for measure_type, cols in self.coi.items():
            # If any of value/uncertainty/DL exists, keep the row
            has_data = (
                df[cols['VALUE']].notna() | 
                df[cols['UNC']].notna() | 
                df[cols['DL']].notna()
            )
            
            if has_data.any():
                df_measure = df[has_data].copy()
                
                # Copy columns to standardized names
                df_measure['_VALUE'] = df_measure[cols['VALUE']]
                df_measure['_UNC'] = df_measure[cols['UNC']]
                df_measure['_DL'] = df_measure[cols['DL']]
                df_measure['_UNIT'] = cols['UNIT']
                
                dfs_to_concat.append(df_measure)
        
        # Combine all measurement type dataframes
        if dfs_to_concat:
            tfm.dfs['SEDIMENT'] = pd.concat(dfs_to_concat, ignore_index=True)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SplitSedimentValuesCB(coi_sediment),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

tfm.dfs['SEDIMENT'].head()
                                                     BIOTA  SEAWATER  SEDIMENT
Number of rows in original dataframes (dfs):         16124     21634     40744
Number of rows in transformed dataframes (tfm.dfs):  16124     21634     70697
Number of rows removed (tfm.dfs_removed):                0         0         0 
key nuclide method < value_bq/kg value_bq/kg error%_kg < value_bq/m² value_bq/m² error%_m² date_of_entry_x ... dw% loi% mors_subbasin helcom_subbasin sum_link date_of_entry_y _VALUE _UNC _DL _UNIT
0 SKRIL2012116 CS137 NaN NaN 1200.0 20.0 NaN NaN NaN 08/20/14 00:00:00 ... NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00 1200.0 20.0 NaN 3
1 SKRIL2012117 CS137 NaN NaN 250.0 20.0 NaN NaN NaN 08/20/14 00:00:00 ... NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00 250.0 20.0 NaN 3
2 SKRIL2012118 CS137 NaN NaN 140.0 21.0 NaN NaN NaN 08/20/14 00:00:00 ... NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00 140.0 21.0 NaN 3
3 SKRIL2012119 CS137 NaN NaN 79.0 20.0 NaN NaN NaN 08/20/14 00:00:00 ... NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00 79.0 20.0 NaN 3
4 SKRIL2012120 CS137 NaN NaN 29.0 24.0 NaN NaN NaN 08/20/14 00:00:00 ... NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00 29.0 24.0 NaN 3

5 rows × 39 columns
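
To make the splitting logic concrete, here is a rough plain-pandas sketch of what happens to a single sediment row reporting both units (hypothetical values; column names follow the coi_sediment dictionary above, not the marisco implementation):

import numpy as np
import pandas as pd

# One hypothetical sediment row carrying both a Bq/kg and a Bq/m² measurement.
row = pd.DataFrame({'key': ['S1'],
                    'value_bq/kg': [140.0], 'error%_kg': [21.0], '< value_bq/kg': [np.nan],
                    'value_bq/m²': [5600.0], 'error%_m²': [21.0], '< value_bq/m²': [np.nan]})
parts = []
for cols in coi_sediment.values():
    part = row.copy()
    part['_VALUE'] = row[cols['VALUE']]
    part['_UNC'] = row[cols['UNC']]
    part['_DL'] = row[cols['DL']]
    part['_UNIT'] = cols['UNIT']
    parts.append(part)
print(pd.concat(parts, ignore_index=True)[['key', '_VALUE', '_UNC', '_UNIT']])
# One row per unit: _UNIT 3 (Bq/kg) and _UNIT 2 (Bq/m²)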

Sanitize value

Tip

FEEDBACK TO DATA PROVIDER: Parts of the HELCOM dataset contain missing values in the VALUE column; see the output after applying the SanitizeValueCB callback.

We consolidate the measurement value columns (named differently across sample types) into a single VALUE column and remove missing values where needed.


source

SanitizeValueCB

 SanitizeValueCB (coi:Dict[str,Dict[str,str]])

Sanitize measurement values by removing blanks and standardizing to use the VALUE column.

Type Details
coi Dict Columns of interest. Format: {group_name: {‘VALUE’: ‘column_name’}}
Exported source
coi_val = {'SEAWATER' : {'VALUE': 'value_bq/m³'},
           'BIOTA':  {'VALUE': 'value_bq/kg'},
           'SEDIMENT': {'VALUE': '_VALUE'}}
Exported source
class SanitizeValueCB(Callback):
    "Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column."
    def __init__(self, 
                 coi: Dict[str, Dict[str, str]] # Columns of interest. Format: {group_name: {'VALUE': 'column_name'}}
                 ): 
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        for grp, df in tfm.dfs.items():
            value_col = self.coi[grp]['VALUE']
            # Count NaN values before dropping
            initial_nan_count = df[value_col].isna().sum()
            
            df.dropna(subset=[value_col], inplace=True)
            
            # Count NaN values after dropping
            final_nan_count = df[value_col].isna().sum()
            dropped_nan_count = initial_nan_count - final_nan_count
            
            # Print the number of dropped NaN values
            if dropped_nan_count > 0:
                print(f"Warning: {dropped_nan_count} missing value(s) in {value_col} for group {grp}.")
            
            df['VALUE'] = df[value_col]
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SplitSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
                                                     BIOTA  SEAWATER  SEDIMENT
Number of rows in original dataframes (dfs):         16124     21634     40744
Number of rows in transformed dataframes (tfm.dfs):  16094     21481     70451
Number of rows removed (tfm.dfs_removed):               30       153       144 

Normalize uncertainty

Function unc_rel2stan converts uncertainty from relative uncertainty to standard uncertainty.


source

unc_rel2stan

 unc_rel2stan (df:pandas.core.frame.DataFrame, meas_col:str, unc_col:str)

Convert relative uncertainty to absolute uncertainty.

Type Details
df DataFrame DataFrame containing measurement and uncertainty columns
meas_col str Name of the column with measurement values
unc_col str Name of the column with relative uncertainty values (percentages)
Returns Series Series with calculated absolute uncertainties
Exported source
def unc_rel2stan(
    df: pd.DataFrame, # DataFrame containing measurement and uncertainty columns
    meas_col: str, # Name of the column with measurement values
    unc_col: str # Name of the column with relative uncertainty values (percentages)
) -> pd.Series: # Series with calculated absolute uncertainties
    "Convert relative uncertainty to absolute uncertainty."
    return df.apply(lambda row: row[unc_col] * row[meas_col] / 100, axis=1)
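
As a quick numeric check (toy dataframe; the values mirror the first two SEAWATER rows shown later, where 32% of 5.3 Bq/m³ gives 1.696 Bq/m³):

import pandas as pd

# Assumes the unc_rel2stan function defined above is in scope.
toy = pd.DataFrame({'value_bq/m³': [5.3, 19.9], 'error%_m³': [32.0, 20.0]})
print(unc_rel2stan(toy, 'value_bq/m³', 'error%_m³').tolist())
# expected: approximately [1.696, 3.98]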

For each sample type in the Helcom dataset, the UNC is provided as a relative uncertainty. The column names for both the VALUE and the UNC vary by sample type. The coi_units_unc dictionary defines the column names for the VALUE and UNC for each sample type.

Exported source
# Columns of interest
coi_units_unc = [('SEAWATER', 'value_bq/m³', 'error%_m³'),
                 ('BIOTA', 'value_bq/kg', 'error%'),
                 ('SEDIMENT', '_VALUE', '_UNC')]

The NormalizeUncCB callback normalizes the UNC column by converting relative uncertainty to standard uncertainty.


source

NormalizeUncCB

 NormalizeUncCB (fn_convert_unc:Callable=<function unc_rel2stan>,
                 coi:List[Tuple[str,str,str]]=[('SEAWATER', 'value_bq/m³',
                 'error%_m³'), ('BIOTA', 'value_bq/kg', 'error%'),
                 ('SEDIMENT', '_VALUE', '_UNC')])

Convert from relative error to standard uncertainty.

Type Default Details
fn_convert_unc Callable unc_rel2stan Function converting relative uncertainty to absolute uncertainty
coi List [(‘SEAWATER’, ‘value_bq/m³’, ’error%_m³’), (‘BIOTA’, ‘value_bq/kg’, ‘error%’), (‘SEDIMENT’, ’_VALUE’, ’_UNC’)] List of columns of interest
Exported source
class NormalizeUncCB(Callback):
    "Convert from relative error to standard uncertainty."
    def __init__(self, 
                 fn_convert_unc: Callable=unc_rel2stan, # Function converting relative uncertainty to absolute uncertainty
                 coi: List[Tuple[str, str, str]]=coi_units_unc # List of columns of interest
                ):
        fc.store_attr()
    
    def __call__(self, tfm: Transformer):
        for grp, val, unc in self.coi:
            if grp in tfm.dfs:
                df = tfm.dfs[grp]
                df['UNC'] = self.fn_convert_unc(df, val, unc)

Apply the transformer with the NormalizeUncCB callback. Then, print the value (i.e. activity per unit volume or mass) and standard uncertainty for each sample type.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SplitSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),
                            NormalizeUncCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

print(tfm.dfs['SEAWATER'][['VALUE', 'UNC']][:2])
print(tfm.dfs['BIOTA'][['VALUE', 'UNC']][:2])
print(tfm.dfs['SEDIMENT'][['VALUE', 'UNC']][:2])
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
   VALUE    UNC
0    5.3  1.696
1   19.9  3.980
       VALUE      UNC
0    0.01014      NaN
1  135.30000  4.83021
    VALUE    UNC
0  1200.0  240.0
1   250.0   50.0
                                                     BIOTA  SEAWATER  SEDIMENT
Number of rows in original dataframes (dfs):         16124     21634     40744
Number of rows in transformed dataframes (tfm.dfs):  16094     21481     70451
Number of rows removed (tfm.dfs_removed):               30       153       144 

Remap units

Tip

FEEDBACK TO DATA PROVIDER: The handling of unit types varies between biota and sediment sample types. For consistency and ease of use, it would be beneficial to have dedicated unit columns for all sample types.

Given the inconsistent handling of units across sample types, we need to define custom mapping rules for standardizing the units. The units available in MARIS are:

pd.read_excel(unit_lut_path())[['unit_id', 'unit', 'unit_sanitized']]
unit_id unit unit_sanitized
0 -1 Not applicable Not applicable
1 0 NOT AVAILABLE NOT AVAILABLE
2 1 Bq/m3 Bq per m3
3 2 Bq/m2 Bq per m2
4 3 Bq/kg Bq per kg
5 4 Bq/kgd Bq per kgd
6 5 Bq/kgw Bq per kgw
7 6 kg/kg kg per kg
8 7 TU TU
9 8 DELTA/mill DELTA per mill
10 9 atom/kg atom per kg
11 10 atom/kgd atom per kgd
12 11 atom/kgw atom per kgw
13 12 atom/l atom per l
14 13 Bq/kgC Bq per kgC

We define unit renaming rules for HELCOM in an ad hoc way:


source

RemapUnitCB

 RemapUnitCB (lut_units:dict={'SEAWATER': 1, 'SEDIMENT': 4, 'BIOTA': {'D':
              4, 'W': 5, 'F': 5}})

Set the unit id column in the DataFrames based on a lookup table.

Type Default Details
lut_units dict {‘SEAWATER’: 1, ‘SEDIMENT’: 4, ‘BIOTA’: {‘D’: 4, ‘W’: 5, ‘F’: 5}} Dictionary containing renaming rules for different unit categories
Exported source
lut_units = {
    'SEAWATER': 1,  # 'Bq/m3'
    'SEDIMENT': 4,  # 'Bq/kgd' for sediment
    'BIOTA': {
        'D': 4,  # 'Bq/kgd'
        'W': 5,  # 'Bq/kgw'
        'F': 5   # 'Bq/kgw' (assumed to be 'Fresh', so set to wet)
    }
}
Exported source
class RemapUnitCB(Callback):
    "Set the `unit` id column in the DataFrames based on a lookup table."
    def __init__(self, 
                 lut_units: dict=lut_units # Dictionary containing renaming rules for different unit categories
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        for grp in tfm.dfs.keys():
            if grp in ['SEAWATER', 'SEDIMENT']:
                tfm.dfs[grp]['UNIT'] = self.lut_units[grp]
            else:
                tfm.dfs[grp]['UNIT'] = tfm.dfs[grp]['basis'].apply(lambda x: self.lut_units[grp].get(x, 0))

Apply the transformer with the RemapUnitCB callback. Then, print the unique UNIT values for each dataframe.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[RemapUnitCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

for grp in ['BIOTA', 'SEDIMENT', 'SEAWATER']:
    print(f"{grp}: {tfm()[grp]['UNIT'].unique()}")
                                                     BIOTA  SEAWATER  SEDIMENT
Number of rows in original dataframes (dfs):         16124     21634     40744
Number of rows in transformed dataframes (tfm.dfs):  16124     21634     40744
Number of rows removed (tfm.dfs_removed):                0         0         0 

BIOTA: [5 0 4]
SEDIMENT: [4]
SEAWATER: [1]
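
For biota, the mapping from the basis column works as follows (a toy illustration using the lut_units dictionary above; the 'X' entry is hypothetical and falls back to 0, i.e. NOT AVAILABLE):

import pandas as pd

basis = pd.Series(['D', 'W', 'F', 'X', None])
print(basis.apply(lambda x: lut_units['BIOTA'].get(x, 0)).tolist())
# [4, 5, 5, 0, 0]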

Remap detection limit

Detection limits are encoded as follows in MARIS:

pd.read_excel(detection_limit_lut_path())
id name name_sanitized
0 -1 Not applicable Not applicable
1 0 Not Available Not available
2 1 = Detected value
3 2 < Detection limit
4 3 ND Not detected
5 4 DE Derived
Exported source
lut_dl = lambda: pd.read_excel(detection_limit_lut_path(), usecols=['name','id']).set_index('name').to_dict()['id']
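
Calling lut_dl() is thus expected to return a name-to-id dictionary (values inferred from the table above):

lut_dl()
# Expected (inferred from the table above):
# {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}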

Based on columns of interest for each sample type:

Exported source
coi_dl = {'SEAWATER' : {'VALUE' : 'value_bq/m³',
                       'UNC' : 'error%_m³',
                       'DL' : '< value_bq/m³'},
          'BIOTA':  {'VALUE' : 'value_bq/kg',
                     'UNC' : 'error%',
                     'DL' : '< value_bq/kg'},
          'SEDIMENT': {
              'VALUE' : '_VALUE',
              'UNC' : '_UNC',
              'DL' : '_DL'}}

We apply the following business logic to encode the detection limit:

RemapDetectionLimitCB creates a DL column with values determined as follows:

  1. Perform a lookup of the data provider detection limit columns (< value_bq/m³, < value_bq/kg, or the temporary _DL column for sediment) against the lookup table returned by lut_dl.

  2. If the detection limit is NaN but both an activity value (value_bq/m³ or value_bq/kg) and an uncertainty (error%_m³, error%, or error%_kg) are provided, assign the ID 1 (i.e. “Detected value”).

  3. Set any remaining NaN or unrecognized values in the DL column to 0 (i.e. “Not Available”).

These rules are illustrated on a toy example below.
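
A minimal pandas sketch of these rules on hypothetical seawater-like rows (the id values come from the MARIS detection limit table above; this is not the marisco implementation):

import numpy as np
import pandas as pd

lut = {'=': 1, '<': 2, 'ND': 3, 'DE': 4, 'Not Available': 0}
df = pd.DataFrame({'value_bq/m³':   [5.3, 19.9, np.nan],
                   'error%_m³':     [32.0, np.nan, np.nan],
                   '< value_bq/m³': [np.nan, '<', np.nan]})
df['DL'] = df['< value_bq/m³']
# Rule 2: value and uncertainty present, no recognised DL symbol -> '='
detected = df['value_bq/m³'].notna() & df['error%_m³'].notna() & ~df['DL'].isin(list(lut))
df.loc[detected, 'DL'] = '='
# Rule 3: anything still unrecognised -> 'Not Available'
df.loc[~df['DL'].isin(list(lut)), 'DL'] = 'Not Available'
# Rule 1: map symbols to MARIS ids
print(df['DL'].map(lut).tolist())
# [1, 2, 0]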


source

RemapDetectionLimitCB

 RemapDetectionLimitCB (coi:dict, fn_lut:Callable)

Remap value type to MARIS format.

Type Details
coi dict Configuration options for column names
fn_lut Callable Function that returns a lookup table
Exported source
class RemapDetectionLimitCB(Callback):
    "Remap value type to MARIS format."
    
    def __init__(self, 
                 coi: dict,  # Configuration options for column names
                 fn_lut: Callable  # Function that returns a lookup table
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        "Remap detection limits in the DataFrames using the lookup table."
        lut = self.fn_lut()
        
        for grp in tfm.dfs:
            df = tfm.dfs[grp]
            self._update_detection_limit(df, grp, lut)

    def _update_detection_limit(self, 
                                df: pd.DataFrame,  # The DataFrame to modify
                                grp: str,  # The group name to get the column configuration
                                lut: dict  # The lookup table dictionary
                               ) -> None:
        "Update detection limit column in the DataFrame based on lookup table and rules."
        
        # Check if the group exists in coi_dl
        if grp not in coi_dl:
            raise ValueError(f"Group '{grp}' not found in coi_dl configuration.")
        
        # Access column names from coi_dl
        value_col = coi_dl[grp]['VALUE']
        uncertainty_col = coi_dl[grp]['UNC']
        detection_col = coi_dl[grp]['DL']

        # Initialize detection limit column
        df['DL'] = df[detection_col]
        
        # Set detection limits based on conditions
        self._set_detection_limits(df, value_col, uncertainty_col, lut)

    def _set_detection_limits(self, df: pd.DataFrame, value_col: str, uncertainty_col: str, lut: dict) -> None:
        "Set detection limits based on value and uncertainty columns."
        
        # Condition for setting '='
        # 'DL' defaults to equal (i.e. '=') if there is a value and uncertainty and 'DL' value is not 
        # in the lookup table.
        
        condition_eq =(df[value_col].notna() & 
                       df[uncertainty_col].notna() & 
                       ~df['DL'].isin(lut.keys())
        )
        
        df.loc[condition_eq, 'DL'] = '='

        # Set 'Not Available' for unmatched detection limits
        df.loc[~df['DL'].isin(lut.keys()), 'DL'] = 'Not Available'
        
        # Perform lookup to map detection limits
        df['DL'] = df['DL'].map(lut)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            SplitSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),
                            NormalizeUncCB(),                  
                            RemapUnitCB(),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

for grp in ['BIOTA', 'SEDIMENT', 'SEAWATER']:
    print(f'Unique DL values for {grp}: {tfm.dfs[grp]["DL"].unique()}')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
                                                     BIOTA  SEAWATER  SEDIMENT
Number of rows in original dataframes (dfs):         16124     21634     40744
Number of rows in transformed dataframes (tfm.dfs):  16094     21481     70451
Number of rows removed (tfm.dfs_removed):               30       153       144 

Unique DL values for BIOTA: [2 1 0]
Unique DL values for SEDIMENT: [1 2 0]
Unique DL values for SEAWATER: [1 2 0]

Remap Biota species

Tip

FEEDBACK TO DATA PROVIDER: Some of the rubin codes in the HELCOM Biota dataset differ from the RUBIN_NAME lookup table, as shown below. Some are mistyped, others contain trailing spaces:

set(dfs['BIOTA']['rubin']) - set(read_csv('RUBIN_NAME.csv')['RUBIN'])
{'CHAR BALT', 'FUCU SPP', 'FUCU VES ', 'FURC LUMB', 'GADU MOR  ', 'STUC PECT'}

We will remap the HELCOM RUBIN column to the MARIS SPECIES column using the IMFA (Inspect, Match, Fix, Apply) pattern. First, let’s inspect the RUBIN_NAME.csv file provided by HELCOM, which describes the nomenclature of biota species.

read_csv('RUBIN_NAME.csv').head()
RUBIN_ID RUBIN SCIENTIFIC NAME ENGLISH NAME
0 11 ABRA BRA ABRAMIS BRAMA BREAM
1 12 ANGU ANG ANGUILLA ANGUILLA EEL
2 13 ARCT ISL ARCTICA ISLANDICA ISLAND CYPRINE
3 14 ASTE RUB ASTERIAS RUBENS COMMON STARFISH
4 15 CARD EDU CARDIUM EDULE COCKLE

Now we try to MATCH the SCIENTIFIC NAME column of the HELCOM BIOTA dataset to the species column of the MARIS species lookup table, again using a Remapper object:

remapper = Remapper(provider_lut_df=read_csv('RUBIN_NAME.csv'),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='SCIENTIFIC NAME',
                    provider_col_key='RUBIN',
                    fname_cache='species_helcom.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 46/46 [00:06<00:00,  7.43it/s]
38 entries matched the criteria, while 8 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
STIZ LUC Sander lucioperca STIZOSTEDION LUCIOPERCA 10
LAMI SAC Laminaria japonica LAMINARIA SACCHARINA 7
CARD EDU Cardiidae CARDIUM EDULE 6
CH HI;BA Macoma balthica CHARA BALTICA 6
ENCH CIM Echinodermata ENCHINODERMATA CIM 5
PSET MAX Pinctada maxima PSETTA MAXIMA 5
MACO BAL Macoma balthica MACOMA BALTICA 1
STUC PEC Stuckenia pectinata STUCKENIA PECTINATE 1

Below, we will correct the entries that were not properly matched by the Remapper object:

Exported source
fixes_biota_species = {
    'STIZOSTEDION LUCIOPERCA': 'Sander luciopercas',
    'LAMINARIA SACCHARINA': 'Saccharina latissima',
    'CARDIUM EDULE': 'Cerastoderma edule',
    'CHARA BALTICA': NA,
    'PSETTA MAXIMA': 'Scophthalmus maximus'
    }

And give the remapper another try:

remapper.generate_lookup_table(fixes=fixes_biota_species)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 46/46 [00:06<00:00,  6.78it/s]
42 entries matched the criteria, while 4 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
ENCH CIM Echinodermata ENCHINODERMATA CIM 5
MACO BAL Macoma balthica MACOMA BALTICA 1
STIZ LUC Sander lucioperca STIZOSTEDION LUCIOPERCA 1
STUC PEC Stuckenia pectinata STUCKENIA PECTINATE 1

Visual inspection of the remaining imperfectly matched entries shows they are acceptable, so we can proceed.

We can now use the generic RemapCB callback to perform the remapping of the RUBIN column to the species column after having defined the lookup table lut_biota.

Exported source
lut_biota = lambda: Remapper(provider_lut_df=read_csv('RUBIN_NAME.csv'),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='SCIENTIFIC NAME',
                             provider_col_key='RUBIN',
                             fname_cache='species_helcom.pkl'
                             ).generate_lookup_table(fixes=fixes_biota_species, as_df=False, overwrite=False)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA')
    ])
tfm()
tfm.dfs['BIOTA'].columns
tfm.dfs['BIOTA']['SPECIES'].unique()
array([  99,  243,   50,  139,  270,  192,  191,  284,   84,  269,  122,
         96,  287,  279,  278,  288,  286,  244,  129,  275,  271,  285,
        283,  247,  120,   59,  280,  274,  273,  290,  289,  272,  277,
        276,   21,  282,  110,  281,  245,  704, 1524,  703,    0,  621,
         60])

Remap Biota tissues

Let’s inspect the TISSUE.csv file provided by HELCOM describing the tissue nomenclature. Biota tissue is known as body part in the MARIS dataset.

remapper = Remapper(provider_lut_df=read_csv('TISSUE.csv'),
                    maris_lut_fn=bodyparts_lut_path,
                    maris_col_id='bodypar_id',
                    maris_col_name='bodypar',
                    provider_col_to_match='TISSUE_DESCRIPTION',
                    provider_col_key='TISSUE',
                    fname_cache='tissues_helcom.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 29/29 [00:00<00:00, 74.81it/s]
21 entries matched the criteria, while 8 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
3 Flesh without bones WHOLE FISH WITHOUT HEAD AND ENTRAILS 20
2 Flesh without bones WHOLE FISH WITHOUT ENTRAILS 13
8 Soft parts SKIN/EPIDERMIS 10
5 Flesh without bones FLESH WITHOUT BONES (FILETS) 9
1 Whole animal WHOLE FISH 5
12 Brain ENTRAILS 5
15 Stomach and intestine STOMACH + INTESTINE 3
41 Whole animal WHOLE ANIMALS 1

We address several entries that were not correctly matched by the Remapper object, as detailed below:

Exported source
fixes_biota_tissues = {
    'WHOLE FISH WITHOUT HEAD AND ENTRAILS': 'Whole animal eviscerated without head',
    'WHOLE FISH WITHOUT ENTRAILS': 'Whole animal eviscerated',
    'SKIN/EPIDERMIS': 'Skin',
    'ENTRAILS': 'Viscera'
    }
remapper.generate_lookup_table(as_df=True, fixes=fixes_biota_tissues)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing:   0%|          | 0/29 [00:00<?, ?it/s]Processing: 100%|██████████| 29/29 [00:00<00:00, 99.87it/s]
25 entries matched the criteria, while 4 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
5 Flesh without bones FLESH WITHOUT BONES (FILETS) 9
1 Whole animal WHOLE FISH 5
15 Stomach and intestine STOMACH + INTESTINE 3
41 Whole animal WHOLE ANIMALS 1

Visual inspection of the remaining imperfectly matched entries shows they are acceptable, so we can proceed.

We can now use the generic RemapCB callback to perform the remapping of the TISSUE column to the body_part column after having defined the lookup table lut_tissues.

Exported source
lut_tissues = lambda: Remapper(provider_lut_df=read_csv('TISSUE.csv'),
                               maris_lut_fn=bodyparts_lut_path,
                               maris_col_id='bodypar_id',
                               maris_col_name='bodypar',
                               provider_col_to_match='TISSUE_DESCRIPTION',
                               provider_col_key='TISSUE',
                               fname_cache='tissues_helcom.pkl'
                               ).generate_lookup_table(fixes=fixes_biota_tissues, as_df=False, overwrite=False)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
    ])

print(tfm()['BIOTA'][['tissue', 'BODY_PART']][:5])
   tissue  BODY_PART
0       5         52
1       5         52
2       5         52
3       5         52
4       5         52

Remap Biogroup

lut_biogroup_from_biota reads the file at species_lut_path() and creates a dictionary mapping species_id to biogroup_id.

Exported source
lut_biogroup_from_biota = lambda: get_lut(src_dir=species_lut_path().parent, fname=species_lut_path().name, 
                               key='species_id', value='biogroup_id')
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA')
    ])

print(tfm()['BIOTA']['BIO_GROUP'].unique())
[ 4  2 14 11  8  3  0]
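
For reference, a rough plain-pandas equivalent of that lookup construction (a sketch assuming the MARIS species lookup is an Excel file with species_id and biogroup_id columns, which is how it is read elsewhere in this handler):

import pandas as pd

species_df = pd.read_excel(species_lut_path())  # species_lut_path as used above
biogroup_lut = species_df.set_index('species_id')['biogroup_id'].to_dict()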

Remap Sediment Types

Tip

FEEDBACK TO DATA PROVIDER: The SEDI values 56 and 73 are not found in the SEDIMENT_TYPE.csv lookup table provided. Note also there are many nan values in the SEDIMENT_TYPE.csv file.

We reassign them to -99 for now, but this should be clarified/fixed with the data provider. This is demonstrated below.

df_sed_lut = read_csv('SEDIMENT_TYPE.csv')
dfs = load_data(src_dir, use_cache=True)
sediment_sedi = set(dfs['SEDIMENT']['sedi'].unique())
lookup_sedi = set(df_sed_lut['SEDI'])
missing = sediment_sedi - lookup_sedi
print(f"Missing sediment type values in HELCOM lookup table: {missing if missing else 'None'}")
Missing sediment type values in HELCOM lookup table: {56.0, 73.0, nan}
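
A hypothetical replacement table for these unmatched codes, suitable for the replace_lut parameter of RemapSedimentCB defined at the end of this section, could look like this (the exact handling of missing values is an assumption left to the callback):

# Hypothetical: send the unknown SEDI codes noted above to -99 ('NO DATA').
sedi_replace_lut = {56: -99, 73: -99}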

Once again, we employ the IMFA (Inspect, Match, Fix, Apply) pattern to remap the HELCOM sediment types. Let’s inspect the SEDIMENT_TYPE.csv file provided by HELCOM describing the sediment type nomenclature:

read_csv('SEDIMENT_TYPE.csv').head()
SEDI SEDIMENT TYPE RECOMMENDED TO BE USED
0 -99 NO DATA NaN
1 0 GRAVEL YES
2 1 SAND YES
3 2 FINE SAND NO
4 3 SILT YES

Let’s try to match as many as possible:

remapper = Remapper(provider_lut_df=read_csv('SEDIMENT_TYPE.csv'),
                    maris_lut_fn=sediments_lut_path,
                    maris_col_id='sedtype_id',
                    maris_col_name='sedtype',
                    provider_col_to_match='SEDIMENT TYPE',
                    provider_col_key='SEDI',
                    fname_cache='sediments_helcom.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)

We address the remaining unmatched values by adding fixes_sediments:

Exported source
fixes_sediments = {
    'NO DATA': NA
}
remapper.generate_lookup_table(as_df=True, fixes=fixes_sediments)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing:   0%|          | 0/47 [00:00<?, ?it/s]Processing: 100%|██████████| 47/47 [00:00<00:00, 102.08it/s]
44 entries matched the criteria, while 3 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
-99 (Not available) NO DATA 2
50 Mud and gravel MUD AND GARVEL 2
46 Glacial clay CLACIAL CLAY 1

A visual inspection of the remaining values shows that they are acceptable, so we can proceed.


source

RemapSedimentCB

 RemapSedimentCB (fn_lut:Callable, sed_grp_name:str='SEDIMENT',
                  sed_col_name:str='sedi', replace_lut:dict=None)

Lookup sediment id using lookup table.

Type Default Details
fn_lut Callable Function that returns the lookup table dictionary
sed_grp_name str SEDIMENT The name of the sediment group
sed_col_name str sedi The name of the sediment column
replace_lut dict None Dictionary for replacing SEDI values
Exported source
class RemapSedimentCB(Callback):
    "Lookup sediment id using lookup table."
    def __init__(self, 
                 fn_lut: Callable,  # Function that returns the lookup table dictionary
                 sed_grp_name: str = 'SEDIMENT',  # The name of the sediment group
                 sed_col_name: str = 'sedi',  # The name of the sediment column
                 replace_lut: dict = None  # Dictionary for replacing SEDI values
                 ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        "Remap sediment types using lookup table."
        df = tfm.dfs[self.sed_grp_name]
        
        # Fix inconsistent values and get lookup table
        if self.replace_lut: df[self.sed_col_name] = df[self.sed_col_name].replace(self.replace_lut)
        lut = self.fn_lut()
        
        # Map sediment types, defaulting to 0 (Not available) for unmatched values
        df['SED_TYPE'] = df[self.sed_col_name].map(
            lambda x: lut.get(x, Match(0, None, None, None)).matched_id
        )
Exported source
lut_sediments = lambda: Remapper(provider_lut_df=read_csv('SEDIMENT_TYPE.csv'),
                                 maris_lut_fn=sediments_lut_path,
                                 maris_col_id='sedtype_id',
                                 maris_col_name='sedtype',
                                 provider_col_to_match='SEDIMENT TYPE',
                                 provider_col_key='SEDI',
                                 fname_cache='sediments_helcom.pkl'
                                 ).generate_lookup_table(fixes=fixes_sediments, as_df=False, overwrite=False)

Reassign the SEDI values of 56, 73, and NaN to -99, which is further remapped to 0 (Not available) in compliance with the MARIS nomenclature:

Exported source
sed_replace_lut = {
    56: -99,
    73: -99,
    np.nan: -99
}
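To make the two-step logic of RemapSedimentCB concrete, here is a minimal, self-contained sketch on toy data; the SEDI codes and sediment-type ids in toy_lut are hypothetical, and the real mapping comes from lut_sediments() above.

import numpy as np
import pandas as pd

# Toy SEDI column containing a valid code, two problematic codes and a NaN
toy = pd.DataFrame({'sedi': [1.0, 56.0, 73.0, np.nan]})

# Step 1: reassign problematic SEDI codes to -99 (cf. sed_replace_lut above)
toy['sedi'] = toy['sedi'].replace({56: -99, 73: -99, np.nan: -99})

# Step 2: map each SEDI code to a sediment type id, defaulting to 0 ("Not available")
toy_lut = {1.0: 2, -99: 0}  # hypothetical SEDI -> sedtype_id pairs
toy['SED_TYPE'] = toy['sedi'].map(lambda x: toy_lut.get(x, 0))
print(toy)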

Use the RemapSedimentCB callback to remap the SEDI values in the HELCOM dataset to the corresponding MARIS standard sediment type, stored as SED_TYPE.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut)
    ])

tfm()
tfm.dfs['SEDIMENT']['SED_TYPE'].unique()
array([ 0,  2, 58, 30, 59, 55, 56, 36, 29, 47,  4, 54, 33,  6, 44, 42, 48,
       61, 57, 28, 49, 32, 45, 39, 46, 38, 31, 60, 62, 26, 53, 52,  1, 51,
       37, 34, 50,  7, 10, 41, 43, 35])

Remap Filtering Status

HELCOM filtered status is encoded as follows in the FILT column:

dfs = load_data(src_dir, use_cache=True)
get_unique_across_dfs(dfs, col_name='filt', as_df=True).head(5)
index value
0 0 n
1 1 F
2 2 N
3 3 NaN

MARIS uses a different encoding for filtered status:

pd.read_excel(filtered_lut_path())
id name
0 -1 Not applicable
1 0 Not available
2 1 Yes
3 2 No

With only four categories to remap, the Remapper is overkill. We can use a simple dictionary to map the values:

Exported source
lut_filtered = {
    'N': 2, # No
    'n': 2, # No
    'F': 1 # Yes
}
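As a quick illustration of why a plain dictionary suffices here (this mirrors what RemapFiltCB does below): any code missing from lut_filtered, including NaN, falls back to 0, i.e. "Not available" in the MARIS encoding.

import numpy as np
import pandas as pd

filt = pd.Series(['N', 'n', 'F', np.nan, '?'])  # '?' stands in for any unexpected code
print(filt.map(lambda x: lut_filtered.get(x, 0)).tolist())
# [2, 2, 1, 0, 0]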

RemapFiltCB converts the HELCOM filt data to the MARIS FILT format.


source

RemapFiltCB

 RemapFiltCB (lut_filtered:dict={'N': 2, 'n': 2, 'F': 1})

Lookup filt value in dataframe using the lookup table.

Type Default Details
lut_filtered dict {‘N’: 2, ‘n’: 2, ‘F’: 1} Dictionary mapping filt codes to their corresponding names
Exported source
class RemapFiltCB(Callback):
    "Lookup filt value in dataframe using the lookup table."
    def __init__(self,
                 lut_filtered: dict=lut_filtered, # Dictionary mapping filt codes to their corresponding names
                ):
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            if 'filt' in df.columns:
                df['FILT'] = df['filt'].map(lambda x: self.lut_filtered.get(x, 0))

For instance:

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[RemapFiltCB(lut_filtered)])

print(tfm()['SEAWATER']['FILT'].unique())
[0 2 1]

Add Laboratory ID (REVIEW)

Tip

FEEDBACK FOR NEXT VERSION: Review the inclusion of LAB in the NetCDF output, note with minor updates to dbo_lab.xlsx it would offer a way to include SMP_ID.

This section could be simplified by including all HELCOM ‘LABORATORY’ names in the MARIS standard laboratory names lookup table (dbo_lab.xlsx). For example, STUK, KRIL, and RISO are absent from the lab_abb column of that lookup table.

Let’s use the utility function get_unique_across_dfs to review the unique laboratory IDs in the HELCOM dataset:

# Transpose to display the dataframe horizontally
get_unique_across_dfs(tfm.dfs, col_name='laboratory', as_df=True).T
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
index 0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
value BFFG LREB SAAS KRIL EBRS DHIG LRPC LEPA CLOR SSSI ... VTIG STUK NCRS LVDC IMGW JORC NaN LVEA RISO ERPC

2 rows × 21 columns

The HELCOM dataset includes a lookup table LABORATORY_NAME.csv which captures the laboratory names and codes.

read_csv('LABORATORY_NAME.csv').head()
LABORATORY LABORATORY_NAME START_DATE END_DATE COUNTRY
0 BFFG BUNDESFORSCHUNGANSTALT FÜR FISCHEREI, GERMANY 01/01/86 00:00:00 12/31/07 00:00:00 6
1 CLOR CENTRAL LABORATORY FOR RADIOLOGICAL PROTECTION, POLAND 01/01/84 00:00:00 NaN 67
2 DHIG FEDERAL MARITIME AND HYDROGRAPHIC AGENCY, GERMANY 01/01/84 00:00:00 NaN 6
3 EBRS RADIATION SAFETY DEPARTMENT ENVIRONMENTAL BOARD, ESTONIA 01/01/10 00:00:00 NaN 91
4 EMHI ESTONIAN METEOROLOGICAL AND HYDROLOGICAL INSTITUTE, ESTONIA NaN NaN 91

Let’s take a look at the MARIS standard laboratory names:

maris_lab_lut=pd.read_excel(lab_lut_path())
maris_lab_lut.head(4)
lab_id lab_abb lab addr_1 addr_2 twn_zip country tel e_mail fax note Unnamed: 11 Unnamed: 12 Unnamed: 13
0 -1 Not applicable Not applicable NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0 Not available Not available NaN NaN NaN Not available NaN NaN NaN NaN NaN NaN NaN
2 1 IAEA-EL International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ NaN P.O. Box No. 800 MC-98012 Monaco Cedex Principality of Monaco NaN NaN NaN NaN NaN NaN update lab set lab = International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ , country = where lab_id = 1
3 2 INPAS Institute of Nuclear Physics - Academy of Sciences NaN NaN Tirana Albania NaN NaN NaN NaN NaN NaN update lab set lab = Institute of Nuclear Physics - Academy of Sciences , country = where lab_id = 2
Tip

FEEDBACK FOR DATA PROVIDER: Some entries in the laboratory column are NaN, see below.

def find_nan_entries(dfs, columns=None):
    """
    Returns a dictionary of DataFrames, each containing the complete rows where any of the specified columns have NaN values from the original DataFrames.
    
    Parameters:
        dfs (dict): A dictionary where keys are dataset names and values are pandas DataFrames.
        columns (list, optional): A list of column names to check for NaN values. If None, all columns are checked.
    
    Returns:
        dict: A dictionary where each key is the dataset name and the value is a DataFrame of complete rows that have NaN entries in the specified columns.
    """
    nan_entries = {}
    for key, df in dfs.items():
        # If columns are specified, check these columns for NaN values
        if columns is not None:
            # Find rows with NaN values in the specified columns
            nan_rows = df[columns].isnull().any(axis=1)
        else:
            # Find rows with any NaN values across all columns
            nan_rows = df.isnull().any(axis=1)
        
        # Use the boolean index to select the complete rows from the original DataFrame
        complete_nan_rows = df[nan_rows]
        
        if not complete_nan_rows.empty:
            nan_entries[key] = complete_nan_rows
    return nan_entries

nan_lab_df = find_nan_entries(dfs, columns=['laboratory'])

print ('Entries with NaN in the `LABORATORY` column:')
for key, df in nan_lab_df.items():
    print(f"{key}: \n{df}")
Entries with NaN in the `LABORATORY` column:
SEAWATER: 
                key nuclide   method < value_bq/m³  value_bq/m³  error%_m³  \
20556  WSSSM2015009      H3  STYR201             <       2450.0        NaN   
20557  WSSSM2015010      H3  STYR201           NaN       2510.0      29.17   
20558  WSSSM2015011      H3  STYR201             <       2450.0        NaN   
20559  WSSSM2015012      H3  STYR201           NaN       1740.0      41.26   
20560  WSSSM2015013      H3  STYR201           NaN       1650.0      43.53   
20561  WSSSM2015014      H3  STYR201             <       2277.0        NaN   
20562  WSSSM2015015      H3  STYR201             <       2277.0        NaN   
20563  WSSSM2015016      H3  STYR201             <       2277.0        NaN   

      date_of_entry_x  country laboratory  sequence  ... longitude (ddmmmm)  \
20556             NaN      NaN        NaN       NaN  ...                NaN   
20557             NaN      NaN        NaN       NaN  ...                NaN   
20558             NaN      NaN        NaN       NaN  ...                NaN   
20559             NaN      NaN        NaN       NaN  ...                NaN   
20560             NaN      NaN        NaN       NaN  ...                NaN   
20561             NaN      NaN        NaN       NaN  ...                NaN   
20562             NaN      NaN        NaN       NaN  ...                NaN   
20563             NaN      NaN        NaN       NaN  ...                NaN   

       longitude (dddddd)  tdepth  sdepth salin  ttemp  filt  mors_subbasin  \
20556                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   
20557                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   
20558                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   
20559                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   
20560                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   
20561                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   
20562                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   
20563                 NaN     NaN     NaN   NaN    NaN   NaN            NaN   

       helcom_subbasin  date_of_entry_y  
20556              NaN              NaN  
20557              NaN              NaN  
20558              NaN              NaN  
20559              NaN              NaN  
20560              NaN              NaN  
20561              NaN              NaN  
20562              NaN              NaN  
20563              NaN              NaN  

[8 rows x 27 columns]
SEDIMENT: 
                key nuclide  method < value_bq/kg  value_bq/kg  error%_kg  \
35821  SDHIG2016236   CS137  DHIG03           NaN       8.2952      2.351   

      < value_bq/m²  value_bq/m²  error%_m²    date_of_entry_x  ...  lowsli  \
35821           NaN   237.500899        NaN  05/13/19 00:00:00  ...     NaN   

      area  sedi oxic  dw%  loi%  mors_subbasin helcom_subbasin  sum_link  \
35821  NaN   NaN  NaN  NaN   NaN            NaN             NaN       NaN   

      date_of_entry_y  
35821             NaN  

[1 rows x 35 columns]
Tip

FEEDBACK FOR NEXT VERSION: Consider integrating combine_lut_columns function into utils.ipynb. I’ve updated the remapper and match_maris_lut functions to accept either a lut_path or a DataFrame. This code could be further simplified by handling the file opening (e.g., pd.read_excel) directly within the remapper function, thereby always passing a DataFrame to match_maris_lut. Refer to the implementation in utils.ipynb for details.

The HELCOM description of a laboratory includes both the laboratory name and the country. Let’s update the maris_lab_lut to include the laboratory name and country in the same column.


source

combine_lut_columns

 combine_lut_columns (lut_path:Callable, combine_cols:List[str]=[])
Exported source
def combine_lut_columns(lut_path: Callable, combine_cols: List[str] = []):
    if lut_path:
        df_lut = pd.read_excel(lut_path()) 
        if combine_cols:
            # Combine the specified columns into a single column with space as separator
            df_lut['combined'] = df_lut[combine_cols].astype(str).agg(' '.join, axis=1)
            # Create a column name by joining column names with '_'
            combined_col_name = '_'.join(combine_cols)
            df_lut.rename(columns={'combined': combined_col_name}, inplace=True)
        return df_lut
df_lut = combine_lut_columns(lab_lut_path, ['lab','country'])
df_lut.head(3)
lab_id lab_abb lab addr_1 addr_2 twn_zip country tel e_mail fax note Unnamed: 11 Unnamed: 12 Unnamed: 13 lab_country
0 -1 Not applicable Not applicable NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Not applicable nan
1 0 Not available Not available NaN NaN NaN Not available NaN NaN NaN NaN NaN NaN NaN Not available Not available
2 1 IAEA-EL International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ NaN P.O. Box No. 800 MC-98012 Monaco Cedex Principality of Monaco NaN NaN NaN NaN NaN NaN update lab set lab = International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ , country = where lab_id = 1 International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ Principality of Monaco

Let’s now create an instance of the fuzzy-matching Remapper. This instance will match the LABORATORY column of the HELCOM dataset to the MARIS standard laboratory names using both the lab and country fields.

remapper = Remapper(provider_lut_df=read_csv('LABORATORY_NAME.csv'),
                    maris_lut_fn= combine_lut_columns(lut_path=lab_lut_path, combine_cols=['lab','country']),
                    maris_col_id='lab_id',
                    maris_col_name='lab_country',
                    provider_col_to_match='LABORATORY_NAME',
                    provider_col_key='LABORATORY',
                    fname_cache='lab_helcom.pkl')

Let’s try to match the LABORATORY names to the MARIS standard laboratory names as automatically as possible. The match_score column allows us to assess the results:

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing:   0%|          | 0/21 [00:00<?, ?it/s]Processing: 100%|██████████| 21/21 [00:00<00:00, 67.11it/s]
0 entries matched the criteria, while 21 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
SSSI Central Mining Institute Poland STATENS STRÅLSKYDDSINSTITUT, SWEDEN 23
KRIL Polytechnic Institute Romania V. G. KHLOPIN RADIUM INSTITUTE, RUSSIA 22
STUK Radiation and Nuclear Safety Authority Finland SÄTEILYTURVAKESKUS, RADIATION AND NUCLEAR SAFETY AUTHORITY, FINLAND 21
LRPC Radiation Protection Authority Norway RADIATION PROTECTION CENTRE, LITHUANIA 15
SAAS National Board of Nuclear Safety and Radiation Protection Germany NATIONAL BOARD FOR ATOMIC SAFETY AND RADIATION PROTECTION, GERMANY 10
RISO Risø National Laboratory - The Radiation Research Department Denmark RISÖ NATIONAL LABORATORY, RADIATION RESEARCH DEPARTMENT, DENMARK 8
LEPA Environmental Protection Agency Ireland ENVIRONMENTAL PROTECTION AGENCY, LITHUANIA 7
NCRS The Swedish University of Agricultural Sciences Sweden SWEDISH UNIVERSITY OF AGRICULTURAL SCIENCES, SWEDEN 5
CLOR Central Laboratory for Radiological Protection Poland CENTRAL LABORATORY FOR RADIOLOGICAL PROTECTION, POLAND 4
SSSM SVERIGE S STRÅL SÄKERHETS MYNDIGHETEN Sweden SVERIGE'S STRÅLSÄKERHETS MYNDIGHETEN, SWEDEN 3
VTIG JOHAN HEINRICH VON THÜNEN-INSTITUTE Germany JOHANN HEINRICH VON THÜNEN-INSTITUTE, GERMANY 2
EBRS Radiation Safety Department, Environmental Board Estonia RADIATION SAFETY DEPARTMENT ENVIRONMENTAL BOARD, ESTONIA 2
LVEA Latvian Environment Agency Latvia LATVIAN ENVIRONMENT AGENCY, LATVIA 1
BFFG BUNDESFORSCHUNGANSTALT FÜR FISCHEREI Germany BUNDESFORSCHUNGANSTALT FÜR FISCHEREI, GERMANY 1
LVDC Environmental Data Center of Latvia Latvia ENVIRONMENTAL DATA CENTER OF LATVIA, LATVIA 1
JORC Joint Research Center Lithuania JOINT RESEARCH CENTER, LITHUANIA 1
IMGW Institute of Meteorology and Water Management Poland INSTITUTE OF METEOROLOGY AND WATER MANAGEMENT, POLAND 1
ERPC Estonian Radiation Protection Centre Estonia ESTONIAN RADIATION PROTECTION CENTRE, ESTONIA 1
EMHI Estonian Meteorological and Hydrological Institute Estonia ESTONIAN METEOROLOGICAL AND HYDROLOGICAL INSTITUTE, ESTONIA 1
DHIG Federal Maritime and Hydrographic Agency Germany FEDERAL MARITIME AND HYDROGRAPHIC AGENCY, GERMANY 1
LREB Lielriga Regional Environmental Board Latvia LIELRIGA REGIONAL ENVIRONMENTAL BOARD, LATVIA 1

Although the match score is 1 or greater for all entries, many are nevertheless matched appropriately. Let’s manually correct the remaining mismatches by aligning the data provider’s laboratory names with those used in the MARIS LUT.

Exported source
fixes_lab_names = {
    'STATENS STRÅLSKYDDSINSTITUT, SWEDEN': 'Swedish Radiation Safety Authority Sweden',
    'V. G. KHLOPIN RADIUM INSTITUTE, RUSSIA': 'V.G. Khlopin Radium Institute - Lab. of Environmental Radioactive Contamination Monitoring Russian Federation',
    'ENVIRONMENTAL PROTECTION AGENCY, LITHUANIA': 'Lithuanian Environmental Protection Agency Lithuania',
    }

Now, let’s apply the manual corrections in fixes_lab_names and try again.

remapper.generate_lookup_table(as_df=True, fixes=fixes_lab_names)
remapper.select_match(match_score_threshold=1, verbose=True).head(5)
Processing:   0%|          | 0/21 [00:00<?, ?it/s]Processing: 100%|██████████| 21/21 [00:00<00:00, 60.11it/s]
3 entries matched the criteria, while 18 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
STUK Radiation and Nuclear Safety Authority Finland SÄTEILYTURVAKESKUS, RADIATION AND NUCLEAR SAFETY AUTHORITY, FINLAND 21
LRPC Radiation Protection Authority Norway RADIATION PROTECTION CENTRE, LITHUANIA 15
SAAS National Board of Nuclear Safety and Radiation Protection Germany NATIONAL BOARD FOR ATOMIC SAFETY AND RADIATION PROTECTION, GERMANY 10
RISO Risø National Laboratory - The Radiation Research Department Denmark RISÖ NATIONAL LABORATORY, RADIATION RESEARCH DEPARTMENT, DENMARK 8
NCRS The Swedish University of Agricultural Sciences Sweden SWEDISH UNIVERSITY OF AGRICULTURAL SCIENCES, SWEDEN 5

We have successfully matched the laboratory names to the MARIS standard laboratory names. We can now create a lookup table for the laboratory names.

Exported source
# Create a lookup table for laboratory names
lut_lab = lambda: Remapper(provider_lut_df=read_csv('LABORATORY_NAME.csv'),
                    maris_lut_fn= combine_lut_columns(lut_path=lab_lut_path, combine_cols=['lab','country']),
                    maris_col_id='lab_id',
                    maris_col_name='lab_country',
                    provider_col_to_match='LABORATORY_NAME',
                    provider_col_key='LABORATORY',
                    fname_cache='lab_helcom.pkl').generate_lookup_table(fixes=fixes_lab_names,as_df=False, overwrite=False)

We now apply the RemapCB callback, which remaps the laboratory names to MARIS LAB ids using the lut_lab lookup table.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER'])
    ])
tfm()
# For instance:
unique_labs = set()

# Get unique labs from all groups
for grp in ['BIOTA', 'SEAWATER', 'SEDIMENT']:
    if grp in tfm.dfs and 'laboratory' in tfm.dfs[grp].columns:
        # Get values and add to set
        labs = tfm.dfs[grp]['laboratory'].unique()
        unique_labs.update(labs)
    
print('Example of unique laboratory names: \n', unique_labs)
Example of unique laboratory names: 
 {'BFFG', 'LREB', 'SAAS', 'KRIL', 'EBRS', 'DHIG', 'LRPC', 'LEPA', 'CLOR', 'SSSI', 'SSSM', 'VTIG', 'STUK', 'NCRS', 'LVDC', 'IMGW', 'JORC', nan, 'LVEA', 'RISO', 'ERPC'}
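To inspect the remapped MARIS lab ids (the LAB column created by RemapCB) rather than the original HELCOM codes, one could for instance:

for grp in ['BIOTA', 'SEAWATER', 'SEDIMENT']:
    print(grp, tfm.dfs[grp]['LAB'].unique())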

Add Sample ID (REVIEW)

Tip

FEEDBACK FOR NEXT VERSION: Enhancing traceability of NetCDF entries to original samples in the datasource using a standardized SMP_ID.

Context:

Previously, the NetCDF output did not include a sample laboratory code (or SMP_ID), limiting our ability to trace data back to its source.

Issue Identified: The KEY column in the HELCOM dataset, which combines a sample type, a laboratory code, and an integer sequence, offers a way to trace data back to the HELCOM source. However, KEY is a string, which is not included in our NetCDF output. To preserve traceability, we propose including an integer SMP_ID in the NetCDF output.

Proposed Solution: For the HELCOM dataset, where the KEY column includes unique codes like WDHIG1996246 (comprising sample type, lab code, and sequence), we propose encoding this into a structured SMP_ID. This SMP_ID will use standardized MARIS Lookup Tables (LUTs) to convert both the sample type and laboratory code into integers.

Implementation Details: The SMP_ID will be formatted such that:

  • The first digit indicates the sample type (e.g., 1 for Seawater).
  • The next three digits represent the laboratory code (e.g., 313 for DHIG as standardized in dbo_lab.xlsx).
  • The remaining digits reflect the integer sequence from the HELCOM KEY.
  • Example: WDHIG1996246 becomes SMP_ID 13131996246.

Action Required: To adopt this approach, a review and update of the laboratory codes in the LUT (dbo_lab.xlsx) are necessary to ensure consistency and accuracy.
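A minimal sketch of the proposed encoding, reusing the WDHIG1996246 example from the tip above; the sample-type digit 1 and the lab id 313 are the illustrative values given there and assume the updated LUTs.

key = 'WDHIG1996246'      # sample type 'W', lab code 'DHIG', sequence 1996246
smp_type_id = 1           # 'W' (Seawater) -> 1, per the proposal above
lab_id = 313              # 'DHIG' -> 313, example id from dbo_lab.xlsx
sequence = int(key[5:])   # integer part of the HELCOM KEY

smp_id = int(f"{smp_type_id}{lab_id:03d}{sequence}")
print(smp_id)             # 13131996246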

First we will use check_unique_key_int to show the non-unique integer part of the KEY column.

def check_unique_key_int(tfm):
    """
    Extracts unique 'KEY' values from specified DataFrames, separates them into string and integer components,
    and groups keys by their integer components.

    Parameters:
    tfm (Transformer): The transformer object containing DataFrames.

    Returns:
    dict: A dictionary with an 'int_key_map' entry mapping each integer component to the list of complete keys that share it.
    """
    # Define the groups to extract keys from
    groups = ['SEAWATER', 'BIOTA', 'SEDIMENT']
    
    # Initialize a set to store unique keys
    unique_keys = set()
    
    # Collect unique keys from each DataFrame
    for grp in groups:
        unique_keys.update(tfm.dfs[grp]['key'].unique())
    
    # Initialize a dictionary to group keys by their integer components
    int_key_map = {}
    
    for key in unique_keys:
        # Assuming the integer part starts after the first 5 characters
        int_part = int(key[5:]) if key[5:].isdigit() else None  # Remaining part as integer
        
        if int_part is not None:
            if int_part not in int_key_map:
                int_key_map[int_part] = []  # Initialize list for this integer part
            int_key_map[int_part].append(key)  # Append the complete key to the list
    
    return {
        'int_key_map': int_key_map  # Return the mapping of integer parts to complete keys
    }

Below, we will generate a DataFrame where the index (labeled ‘INT COMPONENT OF KEY’) represents the integer portion extracted from the Helcom KEY. The ‘KEYS’ column lists all the KEY values that include this integer component. Originally, the plan was to use the integer part of the KEY column to create the SMP_ID. However, as demonstrated below, the integer part is not unique, which complicates this approach.

# Create DataFrame from dictionary and set index name and column name
unique_key_df = pd.DataFrame.from_dict(check_unique_key_int(tfm)).rename_axis('INT COMPONENT OF `KEY`')
unique_key_df=unique_key_df.rename(columns={unique_key_df.columns[0]: 'KEYS'})
unique_key_df.head(5)
KEYS
INT COMPONENT OF `KEY`
2010003 [SCLOR2010003, WSSSI2010003, BRISO2010003, WKRIL2010003, WSTUK2010003, BEBRS2010003, BSTUK2010003, WLEPA2010003, SSSSI2010003, SRISO2010003, BSSSM2010003, WIMGW2010003, BCLOR2010003, SKRIL2010003, SSTUK2010003, WLVEA2010003, SLEPA2010003, BVTIG2010003, SLVEA2010003, WEBRS2010003, WRISO2010003]
1988170 [SDHIG1988170]
2014018 [WIMGW2014018, SCLOR2014018, BCLOR2014018, SSTUK2014018, SEBRS2014018, WRISO2014018, WSTUK2014018, SSSSM2014018]
2012009 [SSTUK2012009, BVTIG2012009, WSTUK2012009, SKRIL2012009, BCLOR2012009, SEBRS2012009, WKRIL2012009, BSTUK2012009, WIMGW2012009, SCLOR2012009, BRISO2012009, WRISO2012009, BSSSM2012009, SLEPA2012009, SLVEA2012009, SSSSM2012009, WLEPA2012009]
1987561 [SDHIG1987561]

Below we will create a callback AddSampleIDCB to remap the KEY column to the SMP_ID column in each DataFrame.

Remember that in HELCOM, the KEY column is encoded to include the sample type (S=Sediment, W=Seawater, B=Biota) and the laboratory code (e.g., DHIG), followed by an integer sequence.

If we update the MARIS LUT (dbo_lab.xlsx) to include the HELCOM laboratory codes (i.e., update the lab_abb column), then both the LAB remapping and AddSampleIDCB could be much simpler.
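For illustration, a hedged sketch of what that simplification could look like: if the lab_abb column of dbo_lab.xlsx contained the HELCOM codes (DHIG, STUK, ...), the fuzzy Remapper could be replaced by a direct dictionary lookup. This is hypothetical and depends on the LUT update described above.

import pandas as pd

# Hypothetical: assumes lab_abb holds the HELCOM laboratory codes
lab_lut = pd.read_excel(lab_lut_path()).set_index('lab_abb')['lab_id'].to_dict()
for df in dfs.values():
    if 'laboratory' in df.columns:
        df['LAB'] = df['laboratory'].map(lab_lut).fillna(0).astype(int)  # 0 = Not available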


source

AddSampleIDCB

 AddSampleIDCB (lut_type:Dict[str,int])

Generate sample id, SMP_ID, from encoded group, encoded LAB and sequence values.

Exported source
class AddSampleIDCB(Callback):
    "Generate sample id, `SMP_ID`, from encoded group, encoded `LAB` and sequence values."
    def __init__(self, lut_type: Dict[str, int]):
        self.lut_type = lut_type
        
    def __call__(self, tfm: Transformer):
        for grp in tfm.dfs:
            self._remap_sample_id(tfm.dfs[grp], grp)
    
    def _remap_sample_id(self, df: pd.DataFrame, grp: str):
        """
        Remaps the 'KEY' column to 'SMP_ID' using the provided lookup table.
        Sets 'SMP_ID' to -1 if 'LAB' or 'SEQUENCE' is NaN.
        
        Parameters:
            df (pd.DataFrame): The DataFrame to process.
            grp (str): The group key from the DataFrame dictionary, used to access specific LUT values.
        """
        # Check for NaNs in 'LAB' or 'SEQUENCE' and compute 'SMP_ID' conditionally
        df['SMP_ID'] = np.where(
            df['LAB'].isna() | df['sequence'].isna(),
            -1,
            str(self.lut_type[grp]) + df['LAB'].astype(str).str.zfill(3) + df['sequence'].astype(str).str.zfill(7)
        )

        # Convert 'SMP_ID' to integer, handling floating point representations
        df['SMP_ID'] = df['SMP_ID'].apply(lambda x: int(float(x)) if isinstance(x, str) and '.' in x else int(x))
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                        RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
                        AddSampleIDCB(lut_type=SMP_TYPE_LUT),
                        CompareDfsAndTfmCB(dfs)
                        ])

print(tfm()['SEAWATER']['SMP_ID'].unique())
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
[12112012003 12112012004 12112012005 ... 11102023163 11102023164
 11102023165]
                           BIOTA  SEAWATER  SEDIMENT
Number of rows in dfs      16124     21634     40744
Number of rows in tfm.dfs  16124     21634     40744
Number of rows removed         0         0         0 

Add depths

The HELCOM dataset includes a column for the sampling depth (SDEPTH) for the SEAWATER and BIOTA datasets. Additionally, it contains a column for the total depth (TDEPTH) applicable to both the SEDIMENT and SEAWATER datasets. In this section, we will create a callback to incorporate both the sampling depth (smp_depth) and total depth (tot_depth) into the MARIS dataset.

class AddDepthCB(Callback):
    "Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns."
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            if 'sdepth' in df.columns:
                df['SMP_DEPTH'] = df['sdepth'].astype(float)
            if 'tdepth' in df.columns:
                df['TOT_DEPTH'] = df['tdepth'].astype(float)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[AddDepthCB()])
tfm()
for grp in tfm.dfs.keys():  
    if 'SMP_DEPTH' in tfm.dfs[grp].columns and 'TOT_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH','TOT_DEPTH']].drop_duplicates())
    elif 'SMP_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH']].drop_duplicates())
    elif 'TOT_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['TOT_DEPTH']].drop_duplicates())
BIOTA:        SMP_DEPTH
0            NaN
78         22.00
88         39.00
96         40.00
183        65.00
...          ...
15874      43.10
15921      30.43
15984       7.60
15985       5.50
15988      11.20

[301 rows x 1 columns]
SEAWATER:        SMP_DEPTH  TOT_DEPTH
0            0.0        NaN
1           29.0        NaN
4           39.0        NaN
6           62.0        NaN
10          71.0        NaN
...          ...        ...
21059       15.0       15.0
21217        7.0       16.0
21235       19.2       21.0
21312        1.0        5.5
21521        0.5        NaN

[1686 rows x 2 columns]
SEDIMENT:        TOT_DEPTH
0           25.0
6           61.0
19          31.0
33          39.0
42          36.0
...          ...
35882        3.9
36086      103.0
36449      108.9
36498        4.5
36899      125.0

[195 rows x 1 columns]

Add Salinity

Tip

FEEDBACK TO DATA PROVIDER

The HELCOM dataset includes a column for the salinity of the water (SALIN). According to the HELCOM documentation, the SALIN column represents “Salinity of water in PSU units”.

In the SEAWATER dataset, three entries have salinity values greater than 50 PSU. While salinity values greater than 50 PSU are possible, these entries may require further verification. Notably, these three entries have a salinity value of 99.99 PSU, which suggests potential data entry errors.

tfm.dfs['SEAWATER'][tfm.dfs['SEAWATER']['salin'] > 50]
key nuclide method < value_bq/m³ value_bq/m³ error%_m³ date_of_entry_x country laboratory sequence ... tdepth sdepth salin ttemp filt mors_subbasin helcom_subbasin date_of_entry_y SMP_DEPTH TOT_DEPTH
12288 WDHIG1998072 CS137 3 NaN 40.1 1.6 NaN 6.0 DHIG 1998072.0 ... 25.0 0.0 99.99 5.0 F 5.0 15.0 NaN 0.0 25.0
12289 WDHIG1998072 CS134 3 NaN 1.1 23.6 NaN 6.0 DHIG 1998072.0 ... 25.0 0.0 99.99 5.0 F 5.0 15.0 NaN 0.0 25.0
12290 WDHIG1998072 SR90 2 NaN 8.5 1.9 NaN 6.0 DHIG 1998072.0 ... 25.0 0.0 99.99 5.0 F 5.0 15.0 NaN 0.0 25.0

3 rows × 29 columns

Let’s add the salinity values to the SEAWATER DataFrame.


source

AddSalinityCB

 AddSalinityCB (salinity_col:str='salin')

Add salinity to the SEAWATER DataFrame.

Exported source
class AddSalinityCB(Callback):
    "Add salinity to the SEAWATER DataFrame."
    def __init__(self, salinity_col: str = 'salin'):
        self.salinity_col = salinity_col
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            if self.salinity_col in df.columns:
                df['SALINITY'] = df[self.salinity_col].astype(float)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[AddSalinityCB()])
tfm()
for grp in tfm.dfs.keys():  
    if 'SALINITY' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SALINITY']].drop_duplicates())
SEAWATER:        SALINITY
0           NaN
97        7.570
98        7.210
101       7.280
104       7.470
...         ...
21449    11.244
21450     7.426
21451     9.895
21452     2.805
21453     7.341

[2766 rows x 1 columns]

Add Temperature

Tip

FEEDBACK TO DATA PROVIDER

The HELCOM dataset includes a column for the temperature of the water (TTEMP). According to the HELCOM documentation, the TTEMP column represents ‘Water temperature in Celsius (ºC) degrees of sampled water’.

In the SEAWATER dataset, several entries have temperature values greater than 50ºC. These entries may require further verification. Notably, these entries have a temperature value of 99.99ºC, which suggests potential data entry errors, see below.

t_df= tfm.dfs['SEAWATER'][tfm.dfs['SEAWATER']['ttemp'] > 50]
print('Number of entries with temperature greater than 50ºC: ', t_df.shape[0])
t_df.head()
Number of entries with temperature greater than 50ºC:  92
key nuclide method < value_bq/m³ value_bq/m³ error%_m³ date_of_entry_x country laboratory sequence ... longitude (dddddd) tdepth sdepth salin ttemp filt mors_subbasin helcom_subbasin date_of_entry_y SALINITY
5954 WDHIG1995559 CS134 4 NaN 1.7 15.0 NaN 6.0 DHIG 1995559.0 ... 10.2033 13.0 11.0 14.81 99.9 N 5.0 15.0 NaN 14.81
5955 WDHIG1995559 CS137 4 NaN 58.7 2.0 NaN 6.0 DHIG 1995559.0 ... 10.2033 13.0 11.0 14.81 99.9 N 5.0 15.0 NaN 14.81
5960 WDHIG1995569 CS134 4 NaN 1.4 12.0 NaN 6.0 DHIG 1995569.0 ... 10.2777 14.0 12.0 14.80 99.9 N 5.0 15.0 NaN 14.80
5961 WDHIG1995569 CS137 4 NaN 62.8 1.0 NaN 6.0 DHIG 1995569.0 ... 10.2777 14.0 12.0 14.80 99.9 N 5.0 15.0 NaN 14.80
5964 WDHIG1995571 CS134 4 NaN 1.5 17.0 NaN 6.0 DHIG 1995571.0 ... 10.2000 19.0 17.0 14.59 99.9 N 5.0 15.0 NaN 14.59

5 rows × 28 columns

Let’s add the temperature values to the SEAWATER DataFrame.


source

AddTemperatureCB

 AddTemperatureCB (temperature_col:str='ttemp')

Add temperature to the SEAWATER DataFrame.

Exported source
class AddTemperatureCB(Callback):
    "Add temperature to the SEAWATER DataFrame."
    def __init__(self, temperature_col: str = 'ttemp'):
        self.temperature_col = temperature_col
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            if self.temperature_col in df.columns:
                df['TEMPERATURE'] = df[self.temperature_col].astype(float)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[AddTemperatureCB()])
tfm()
for grp in tfm.dfs.keys():  
    if 'TEMPERATURE' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['TEMPERATURE']].drop_duplicates())
SEAWATER:        TEMPERATURE
0              NaN
987           7.80
990           6.50
993           4.10
996           4.80
...            ...
21521         0.57
21523        18.27
21525        21.54
21529         4.94
21537         2.35

[1086 rows x 1 columns]

Add Methods (FOR NEXT VERSION)

The HELCOM dataset includes a look-up table, ANALYSIS_METHOD.csv, which captures the methods used by HELCOM in a free-text DESCRIPTION field. Let’s review the analysis method descriptions in the HELCOM dataset.

analysis_method_df = read_csv('ANALYSIS_METHOD.csv')
analysis_method_df.head(3)
METHOD DESCRIPTION COUNTRY
0 BFFG01 Gammaspectrometric analysis with Germanium detectors (p-type HGeLi's and HPGe's and 1 n-type HPGe), with efficiency 20-48% Energy resolution 1.8-2.3 keV at 1.33 MeV (not to in use any more) 6
1 BFFG02 Sr-90, a) Y-90 extraction method dried ash and added Y-90 + HCl, Ph adjustment and Y-90 extraction with HDEHP in n-heptane b) Modified version of classic nitric acid method (not to in use any more) 6
2 BFFG03 Pu238, Pu239241; Ashing and and drying the traces (not to in use any more) 6

Number of unique ANALYSIS_METHOD descriptions:

len(analysis_method_df['DESCRIPTION'].unique())
68
Exported source
lut_method = lambda: read_csv('ANALYSIS_METHOD.csv').set_index('METHOD').to_dict()['DESCRIPTION']
prepmet_lut = pd.read_excel(prepmet_lut_path())
sampmet_lut = pd.read_excel(sampmet_lut_path())
counmet_lut = pd.read_excel(counmet_lut_path())
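lut_method simply turns ANALYSIS_METHOD.csv into a METHOD -> DESCRIPTION dictionary; for instance, looking up one of the codes shown above:

method_lut = lut_method()
print(len(method_lut), 'method codes')
print(method_lut['BFFG01'][:60], '...')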

DISCUSS: repetition of counting methods in counmet_lut. When should we use each of them?

counmet_lut.head(10)
counmet_id counmet code
0 -1 Not applicable NaN
1 0 Not available 0
2 1 Atomic absorption AA
3 2 Alpha ALP
4 3 Alpha ionization chamber spectrometry ALPI
5 4 Alpha liquid scintillation spectrometry ALPL
6 5 Alpha semiconductor spectrometry ALPS
7 6 Alpha total ALPT
8 7 Accelerator mass spectrometry AMS
9 8 Beta BET

Add slice position (TOP and BOTTOM)


source

RemapSedSliceTopBottomCB

 RemapSedSliceTopBottomCB ()

Remap Sediment slice top and bottom to MARIS format.

Exported source
class RemapSedSliceTopBottomCB(Callback):
    "Remap Sediment slice top and bottom to MARIS format."
    def __call__(self, tfm: Transformer):
        "Iterate through all DataFrames in the transformer object and remap sediment slice top and bottom."
        tfm.dfs['SEDIMENT']['TOP'] = tfm.dfs['SEDIMENT']['uppsli']
        tfm.dfs['SEDIMENT']['BOTTOM'] = tfm.dfs['SEDIMENT']['lowsli']
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[RemapSedSliceTopBottomCB()])
tfm()
print(tfm.dfs['SEDIMENT'][['TOP','BOTTOM']].head())
    TOP  BOTTOM
0  15.0    20.0
1  20.0    25.0
2  25.0    30.0
3  30.0    35.0
4  35.0    40.0

Add dry weight, wet weight and percentage weight

Tip

FEEDBACK TO DATA PROVIDER: Some entries in the BASIS column of the BIOTA dataset report a value of F, which is not consistent with the HELCOM description provided in the metadata. The GUIDELINES FOR MONITORING OF RADIOACTIVE SUBSTANCES were obtained from here.

Let’s take a look at the BIOTA BASIS values:

dfs['BIOTA']['basis'].unique()
array(['W', nan, 'D', 'F'], dtype=object)

Number of entries for each BASIS value:

dfs['BIOTA']['basis'].value_counts()
basis
W    12164
D     3868
F       25
Name: count, dtype: int64
Tip

FEEDBACK TO DATA PROVIDER: Some entries for DW% (Dry weight as a percentage (%) of fresh weight) are much higher than 100%. Additionally, DW% is reported as 0% in some cases.

For BIOTA, the number of entries for DW% higher than 100%:

dfs['BIOTA']['dw%'][dfs['BIOTA']['dw%'] > 100].count()
20

For BIOTA, the number of entries for DW% equal to 0%:

dfs['BIOTA']['dw%'][dfs['BIOTA']['dw%'] == 0].count()
6

For SEDIMENT, the number of entries for DW% higher than 100%:

dfs['SEDIMENT']['dw%'][dfs['SEDIMENT']['dw%'] > 100].count()
625

For SEDIMENT, the number of entries for DW% equal to 0%:

dfs['SEDIMENT']['dw%'][dfs['SEDIMENT']['dw%'] == 0].count()
302
Tip

FEEDBACK TO DATA PROVIDER: Several SEDIMENT entries have DW% (Dry weight as percentage of fresh weight) values less than 1%. While technically possible, this would indicate samples contained more than 99% water content.

For SEDIMENT, the number of entries for DW% less than 1% but greater than 0.001%:

percent=1
dfs['SEDIMENT']['dw%'][(dfs['SEDIMENT']['dw%'] < percent) & (dfs['SEDIMENT']['dw%'] > 0.001)].count()
24

Let’s take a look at the MARIS description of the percentwt, drywt and wetwt variables:

  • percentwt: Dry weight as a ratio of fresh weight, expressed as a decimal.
  • drywt: Dry weight in grams.
  • wetwt: Fresh weight in grams.
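As a quick arithmetic check of how these three variables relate for a wet-basis (W) measurement, using a DW% of 18.453% and a fresh weight of 948 g (values that appear in the BIOTA output further below):

dw_percent = 18.453            # DW% as reported by HELCOM
wetwt = 948.0                  # fresh weight in grams (BASIS = 'W')

percentwt = dw_percent / 100   # 0.18453
drywt = wetwt * percentwt      # ~174.93 g
print(percentwt, drywt)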

Let’s take a look at the HELCOM dataset: the weight of the sample is not reported for SEDIMENT, but the percentage dry weight is reported as DW%.

dfs['SEDIMENT'].columns
Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'error%_kg',
       '< value_bq/m²', 'value_bq/m²', 'error%_m²', 'date_of_entry_x',
       'country', 'laboratory', 'sequence', 'date', 'year', 'month', 'day',
       'station', 'latitude (ddmmmm)', 'latitude (dddddd)',
       'longitude (ddmmmm)', 'longitude (dddddd)', 'device', 'tdepth',
       'uppsli', 'lowsli', 'area', 'sedi', 'oxic', 'dw%', 'loi%',
       'mors_subbasin', 'helcom_subbasin', 'sum_link', 'date_of_entry_y'],
      dtype='object')

The BIOTA dataset reports the weight of the sample as WEIGHT and the percentage dry weight as DW%. The BASIS column describes the basis (wet or dry) on which the value is reported.

dfs['BIOTA'].columns
Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'basis',
       'error%', 'number', 'date_of_entry_x', 'country', 'laboratory',
       'sequence', 'date', 'year', 'month', 'day', 'station',
       'latitude ddmmmm', 'latitude dddddd', 'longitude ddmmmm',
       'longitude dddddd', 'sdepth', 'rubin', 'biotatype', 'tissue', 'no',
       'length', 'weight', 'dw%', 'loi%', 'mors_subbasin', 'helcom_subbasin',
       'date_of_entry_y'],
      dtype='object')

source

LookupDryWetPercentWeightCB

 LookupDryWetPercentWeightCB ()

Lookup dry-wet ratio and format for MARIS.

Exported source
class LookupDryWetPercentWeightCB(Callback):
    "Lookup dry-wet ratio and format for MARIS."
    def __call__(self, tfm: Transformer):
        "Iterate through all DataFrames in the transformer object and apply the dry-wet ratio lookup."
        for grp in tfm.dfs.keys():
            if 'dw%' in tfm.dfs[grp].columns:
                self._apply_dry_wet_ratio(tfm.dfs[grp])
            if 'weight' in tfm.dfs[grp].columns and 'basis' in tfm.dfs[grp].columns:
                self._correct_basis(tfm.dfs[grp])
                self._apply_weight(tfm.dfs[grp])

    def _apply_dry_wet_ratio(self, df: pd.DataFrame) -> None:
        "Apply dry-wet ratio conversion and formatting to the given DataFrame."
        df['PERCENTWT'] = df['dw%'] / 100  # Convert percentage to fraction
        df.loc[df['PERCENTWT'] == 0, 'PERCENTWT'] = np.nan  # Convert 0% to NaN

    def _correct_basis(self, df: pd.DataFrame) -> None:
        "Correct BASIS values. Assuming F = Fresh weight, so F = W"
        df.loc[df['basis'] == 'F', 'basis'] = 'W'

    def _apply_weight(self, df: pd.DataFrame) -> None:
        "Apply weight conversion and formatting to the given DataFrame."
        dry_condition = df['basis'] == 'D'
        wet_condition = df['basis'] == 'W'
        
        df.loc[dry_condition, 'DRYWT'] = df['weight']
        df.loc[dry_condition & df['PERCENTWT'].notna(), 'WETWT'] = df['weight'] / df['PERCENTWT']
        
        df.loc[wet_condition, 'WETWT'] = df['weight']
        df.loc[wet_condition & df['PERCENTWT'].notna(), 'DRYWT'] = df['weight'] * df['PERCENTWT']
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LookupDryWetPercentWeightCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print('BIOTA:', tfm.dfs['BIOTA'][['PERCENTWT','DRYWT','WETWT']].head(), '\n')
print('SEDIMENT:', tfm.dfs['SEDIMENT']['PERCENTWT'].unique())
                           BIOTA  SEAWATER  SEDIMENT
Number of rows in dfs      16124     21634     40744
Number of rows in tfm.dfs  16124     21634     40744
Number of rows removed         0         0         0 

BIOTA:    PERCENTWT      DRYWT  WETWT
0    0.18453  174.93444  948.0
1    0.18453  174.93444  948.0
2    0.18453  174.93444  948.0
3    0.18453  174.93444  948.0
4    0.18458  177.93512  964.0 

SEDIMENT: [       nan 0.1        0.13       ... 0.24418605 0.25764192 0.26396495]

Note that the dry weight is greater than the wet weight for some entries in the BIOTA dataset because DW% exceeds 100%, as shown above. Let’s take a look at the number of entries where this is the case:

tfm.dfs['BIOTA'][['DRYWT','WETWT']][tfm.dfs['BIOTA']['DRYWT'] > tfm.dfs['BIOTA']['WETWT']].count()
DRYWT    20
WETWT    20
dtype: int64

Standardize Coordinates

Tip

FEEDBACK TO DATA PROVIDER: Column names for geographical coordinates are inconsistent across sample types (biota, sediment, seawater); some use parentheses and some do not.

dfs = load_data(src_dir, use_cache=True)
for grp in dfs.keys():
    print(f'{grp}: {[col for col in dfs[grp].columns if "lon" in col or "lat" in col]}')
BIOTA: ['latitude ddmmmm', 'latitude dddddd', 'longitude ddmmmm', 'longitude dddddd']
SEAWATER: ['latitude (ddmmmm)', 'latitude (dddddd)', 'longitude (ddmmmm)', 'longitude (dddddd)']
SEDIMENT: ['latitude (ddmmmm)', 'latitude (dddddd)', 'longitude (ddmmmm)', 'longitude (dddddd)']
Tip

FEEDBACK TO DATA PROVIDER: HELCOM SEAWATER data includes values of 0 or nan for both latitude and longitude.


source

ParseCoordinates

 ParseCoordinates (fn_convert_cor:Callable)

Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero.

Type Details
fn_convert_cor Callable Function that converts coordinates from degree-minute to decimal degree format
Exported source
class ParseCoordinates(Callback):
    "Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero."
    def __init__(self, 
                 fn_convert_cor: Callable # Function that converts coordinates from degree-minute to decimal degree format
                 ):
        self.fn_convert_cor = fn_convert_cor

    def __call__(self, tfm:Transformer):
        for df in tfm.dfs.values():
            self._format_coordinates(df)

    def _format_coordinates(self, df:pd.DataFrame) -> None:
        coord_cols = self._get_coord_columns(df.columns)
        
        
        for coord in ['lat', 'lon']:
            decimal_col, minute_col = coord_cols[f'{coord}_d'], coord_cols[f'{coord}_m']
            # Attempt to convert columns to numeric, coercing errors to NaN.
            df[decimal_col] = pd.to_numeric(df[decimal_col], errors='coerce')
            df[minute_col] = pd.to_numeric(df[minute_col], errors='coerce')
            condition = df[decimal_col].isna() | (df[decimal_col] == 0)
            df[coord.upper()] = np.where(condition,
                                 df[minute_col].apply(self._safe_convert),
                                 df[decimal_col])
        
        df.dropna(subset=['LAT', 'LON'], inplace=True)

    def _get_coord_columns(self, columns) -> dict:
        return {
            'lon_d': self._find_coord_column(columns, 'lon', 'dddddd'),
            'lat_d': self._find_coord_column(columns, 'lat', 'dddddd'),
            'lon_m': self._find_coord_column(columns, 'lon', 'ddmmmm'),
            'lat_m': self._find_coord_column(columns, 'lat', 'ddmmmm')
        }

    def _find_coord_column(self, columns, coord_type, coord_format) -> str:
        pattern = re.compile(f'{coord_type}.*{coord_format}', re.IGNORECASE)
        matching_columns = [col for col in columns if pattern.search(col)]
        return matching_columns[0] if matching_columns else None

    def _safe_convert(self, value) -> str:
        if pd.isna(value):
            return value
        try:
            return self.fn_convert_cor(value)
        except Exception as e:
            print(f"Error converting value {value}: {e}")
            return value
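For reference, a minimal sketch of a degrees/decimal-minutes to decimal-degrees conversion. It assumes the HELCOM ‘ddmmmm’ columns store degrees before the decimal point and decimal minutes after it (an assumption on my part); the conversion actually used here is marisco’s ddmm_to_dd.

def ddmm_to_dd_sketch(value: float) -> float:
    "Convert dd.mmmm (degrees + decimal minutes, assumed convention) to decimal degrees."
    degrees = int(value)
    minutes = (value - degrees) * 100   # e.g. 54.17 -> 17.0 minutes
    return degrees + minutes / 60

print(ddmm_to_dd_sketch(54.17))  # ~54.2833, cf. the LAT values below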
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[                    
                            ParseCoordinates(ddmm_to_dd),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['BIOTA'][['LAT','LON']])
                           BIOTA  SEAWATER  SEDIMENT
Number of rows in dfs      16124     21634     40744
Number of rows in tfm.dfs  16124     21626     40743
Number of rows removed         0         8         1 

             LAT        LON
0      54.283333  12.316667
1      54.283333  12.316667
2      54.283333  12.316667
3      54.283333  12.316667
4      54.283333  12.316667
...          ...        ...
16119  61.241500  21.395000
16120  61.241500  21.395000
16121  61.343333  21.385000
16122  61.343333  21.385000
16123  61.343333  21.385000

[16124 rows x 2 columns]

Let’s review the dropped rows for SEAWATER:

with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(tfm.dfs_dropped['SEAWATER'])
key nuclide method < value_bq/m³ value_bq/m³ error%_m³ date_of_entry_x country laboratory sequence date year month day station latitude (ddmmmm) latitude (dddddd) longitude (ddmmmm) longitude (dddddd) tdepth sdepth salin ttemp filt mors_subbasin helcom_subbasin date_of_entry_y
20556 WSSSM2015009 H3 STYR201 < 2450.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20557 WSSSM2015010 H3 STYR201 NaN 2510.0 29.17 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20558 WSSSM2015011 H3 STYR201 < 2450.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20559 WSSSM2015012 H3 STYR201 NaN 1740.0 41.26 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20560 WSSSM2015013 H3 STYR201 NaN 1650.0 43.53 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20561 WSSSM2015014 H3 STYR201 < 2277.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20562 WSSSM2015015 H3 STYR201 < 2277.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20563 WSSSM2015016 H3 STYR201 < 2277.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

SanitizeLonLatCB drops a row when both longitude and latitude equal 0 or when the data contains unrealistic longitude and latitude values, and converts the `,` decimal separator in longitude and latitude to a `.` separator.

dfs = load_data(src_dir,  use_cache=True)
tfm = Transformer(dfs, cbs=[
                            ParseCoordinates(ddmm_to_dd),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['BIOTA'][['LAT','LON']])
                           BIOTA  SEAWATER  SEDIMENT
Number of rows in dfs      16124     21634     40744
Number of rows in tfm.dfs  16124     21626     40743
Number of rows removed         0         8         1 

             LAT        LON
0      54.283333  12.316667
1      54.283333  12.316667
2      54.283333  12.316667
3      54.283333  12.316667
4      54.283333  12.316667
...          ...        ...
16119  61.241500  21.395000
16120  61.241500  21.395000
16121  61.343333  21.385000
16122  61.343333  21.385000
16123  61.343333  21.385000

[16124 rows x 2 columns]

Review all callbacks

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
                            RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SplitSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),       
                            NormalizeUncCB(),
                            RemapUnitCB(),
                            RemapDetectionLimitCB(coi_dl, lut_dl),                           
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
                            RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
                            RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA'),
                            RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut),
                            RemapFiltCB(lut_filtered),
                            RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
                            AddSampleIDCB(lut_type=SMP_TYPE_LUT),
                            AddDepthCB(),
                            AddSalinityCB(),
                            AddTemperatureCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetPercentWeightCB(),
                            ParseCoordinates(ddmm_to_dd),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
                           BIOTA  SEAWATER  SEDIMENT
Number of rows in dfs      16124     21634     40744
Number of rows in tfm.dfs  16094     21473     70449
Number of rows removed        30       161       144 

Let’s inspect the rows that are removed for the SEAWATER data:

grp='SEAWATER' # 'SEAWATER', 'BIOTA' or 'SEDIMENT'
print(f'{grp}, number of dropped rows: {tfm.dfs_dropped[grp].shape[0]}.')
print(f'Viewing dropped rows for {grp}:')
tfm.dfs_dropped[grp]
SEAWATER, number of dropped rows: 161.
Viewing dropped rows for SEAWATER:
key nuclide method < value_bq/m³ value_bq/m³ error%_m³ date_of_entry_x country laboratory sequence ... longitude (ddmmmm) longitude (dddddd) tdepth sdepth salin ttemp filt mors_subbasin helcom_subbasin date_of_entry_y
13439 WRISO2001025 CS137 RISO02 NaN NaN 10.0 NaN 26.0 RISO 2001025.0 ... 10.500 10.833333 22.0 20.0 0.00 NaN N 5.0 5.0 NaN
14017 WLEPA2002001 CS134 LEPA02 < NaN NaN NaN 93.0 LEPA 2002001.0 ... 21.030 21.050000 16.0 0.0 3.77 14.40 N 4.0 9.0 NaN
14020 WLEPA2002002 CS134 LEPA02 < NaN NaN NaN 93.0 LEPA 2002004.0 ... 20.574 20.956667 14.0 0.0 6.57 11.95 N 4.0 9.0 NaN
14023 WLEPA2002003 CS134 LEPA02 < NaN NaN NaN 93.0 LEPA 2002007.0 ... 19.236 19.393333 73.0 0.0 7.00 9.19 N 4.0 9.0 NaN
14026 WLEPA2002004 CS134 LEPA02 < NaN NaN NaN 93.0 LEPA 2002010.0 ... 20.205 20.341700 47.0 0.0 7.06 8.65 N 4.0 9.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21542 WLRPC2023011 SR90 LRPC02 NaN NaN NaN 05/03/24 00:00:00 93.0 LRPC 2023011.0 ... 20.480 20.800000 45.0 1.0 7.22 19.80 N 4.0 9.0 05/03/24 00:00:00
21543 WLRPC2023012 CS137 LRPC01 NaN NaN NaN 05/03/24 00:00:00 93.0 LRPC 2023012.0 ... 20.480 20.800000 45.0 1.0 7.23 8.80 N 4.0 9.0 05/03/24 00:00:00
21544 WLRPC2023012 SR90 LRPC02 NaN NaN NaN 05/03/24 00:00:00 93.0 LRPC 2023012.0 ... 20.480 20.800000 45.0 1.0 7.23 8.80 N 4.0 9.0 05/03/24 00:00:00
21545 WLRPC2023013 CS137 LRPC01 NaN NaN NaN 05/03/24 00:00:00 93.0 LRPC 2023013.0 ... 20.427 20.711700 41.0 1.0 7.23 19.30 N 4.0 9.0 05/03/24 00:00:00
21546 WLRPC2023013 SR90 LRPC02 NaN NaN NaN 05/03/24 00:00:00 93.0 LRPC 2023013.0 ... 20.427 20.711700 41.0 1.0 7.23 19.30 N 4.0 9.0 05/03/24 00:00:00

161 rows × 27 columns

Example change logs

dfs = load_data(src_dir, use_cache=True)

tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
                            RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SplitSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),       
                            NormalizeUncCB(),
                            RemapUnitCB(),
                            RemapDetectionLimitCB(coi_dl, lut_dl),                           
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
                            RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
                            RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA'),
                            RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut),
                            RemapFiltCB(lut_filtered),
                            RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
                            AddSampleIDCB(lut_type=SMP_TYPE_LUT),
                            AddDepthCB(),
                            AddSalinityCB(),
                            AddTemperatureCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetPercentWeightCB(),
                            ParseCoordinates(ddmm_to_dd),
                            SanitizeLonLatCB(),
                            ])

tfm()
tfm.logs
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column.",
 'Remap data provider nuclide names to standardized MARIS nuclide names.',
 'Standardize time format across all dataframes.',
 'Encode time as seconds since epoch.',
 'Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements.',
 'Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column.',
 'Convert from relative error to standard uncertainty.',
 'Set the `unit` id column in the DataFrames based on a lookup table.',
 'Remap value type to MARIS format.',
 "Remap values from 'rubin' to 'SPECIES' for groups: BIOTA.",
 "Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA.",
 "Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA.",
 'Lookup sediment id using lookup table.',
 'Lookup filt value in dataframe using the lookup table.',
 "Remap values from 'laboratory' to 'LAB' for groups: BIOTA, SEDIMENT and SEAWATER.",
 'Generate sample id, `SMP_ID`, from encoded group, encoded `LAB` and sequence values.',
 "Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns.",
 'Remap Sediment slice top and bottom to MARIS format.',
 'Lookup dry-wet ratio and format for MARIS.',
 'Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.']
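
These log entries are later concatenated into the `publisher_postprocess_logs` global attribute (see `KeyValuePairCB` in `get_attrs` below); a minimal sketch of the same join:

# Sketch: the change log is stored as a single comma-separated string attribute.
postprocess_logs = ', '.join(tfm.logs)
print(postprocess_logs[:120], '...')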

Feed global attributes


source

get_attrs

 get_attrs (tfm:marisco.callbacks.Transformer, zotero_key:str,
            kw:list=['oceanography', 'Earth Science > Oceans > Ocean
            Chemistry> Radionuclides', 'Earth Science > Human Dimensions >
            Environmental Impacts > Nuclear Radiation Exposure', 'Earth
            Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth
            Science > Oceans > Marine Sediments', 'Earth Science > Oceans
            > Ocean Chemistry, Earth Science > Oceans > Sea Ice >
            Isotopes', 'Earth Science > Oceans > Water Quality > Ocean
            Contaminants', 'Earth Science > Biological Classification >
            Animals/Vertebrates > Fish', 'Earth Science > Biosphere >
            Ecosystems > Marine Ecosystems', 'Earth Science > Biological
            Classification > Animals/Invertebrates > Mollusks', 'Earth
            Science > Biological Classification > Animals/Invertebrates >
            Arthropods > Crustaceans', 'Earth Science > Biological
            Classification > Plants > Macroalgae (Seaweeds)'])

Retrieve all global attributes.

Type Default Details
tfm Transformer Transformer object
zotero_key str Zotero dataset record key
kw list [‘oceanography’, ‘Earth Science > Oceans > Ocean Chemistry> Radionuclides’, ‘Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure’, ‘Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments’, ‘Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes’, ‘Earth Science > Oceans > Water Quality > Ocean Contaminants’, ‘Earth Science > Biological Classification > Animals/Vertebrates > Fish’, ‘Earth Science > Biosphere > Ecosystems > Marine Ecosystems’, ‘Earth Science > Biological Classification > Animals/Invertebrates > Mollusks’, ‘Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans’, ‘Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)’] List of keywords
Returns dict Global attributes
Exported source
def get_attrs(
    tfm: Transformer, # Transformer object
    zotero_key: str, # Zotero dataset record key
    kw: list = kw # List of keywords
    ) -> dict: # Global attributes
    "Retrieve all global attributes."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
get_attrs(tfm, zotero_key=zotero_key, kw=kw)
{'geospatial_lat_min': '31.17',
 'geospatial_lat_max': '65.75',
 'geospatial_lon_min': '9.6333',
 'geospatial_lon_max': '53.5',
 'geospatial_bounds': 'POLYGON ((9.6333 53.5, 31.17 53.5, 31.17 65.75, 9.6333 65.75, 9.6333 53.5))',
 'geospatial_vertical_max': '437.0',
 'geospatial_vertical_min': '0.0',
 'time_coverage_start': '1984-01-10T00:00:00',
 'time_coverage_end': '2023-11-30T00:00:00',
 'id': '26VMZZ2Q',
 'title': 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances',
 'summary': 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.',
 'creator_name': '[{"creatorType": "author", "name": "HELCOM MORS"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column., Remap data provider nuclide names to standardized MARIS nuclide names., Standardize time format across all dataframes., Encode time as seconds since epoch., Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements., Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column., Convert from relative error to standard uncertainty., Set the `unit` id column in the DataFrames based on a lookup table., Remap value type to MARIS format., Remap values from 'rubin' to 'SPECIES' for groups: BIOTA., Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA., Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA., Lookup sediment id using lookup table., Lookup filt value in dataframe using the lookup table., Remap values from 'laboratory' to 'LAB' for groups: BIOTA, SEDIMENT and SEAWATER., Generate sample id, `SMP_ID`, from encoded group, encoded `LAB` and sequence values., Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns., Remap Sediment slice top and bottom to MARIS format., Lookup dry-wet ratio and format for MARIS., Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}
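
Individual attributes can also be pulled from the returned dictionary as needed; for example (a sketch using keys shown above):

# Sketch: inspect selected global attributes from the returned dictionary.
attrs = get_attrs(tfm, zotero_key=zotero_key, kw=kw)
print(attrs['time_coverage_start'], '->', attrs['time_coverage_end'])
print(attrs['geospatial_bounds'])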

Encoding NetCDF


source

encode

 encode (src_dir:str, fname_out_nc:str, **kwargs)

Encode data to NetCDF.

Type Details
src_dir str Input file name
fname_out_nc str Output file name
kwargs  Additional arguments
Returns None
Exported source
def encode(
    src_dir: str, # Input file name
    fname_out_nc: str, # Output file name
    **kwargs # Additional arguments
    ) -> None:
    "Encode data to NetCDF."
    dfs = load_data(src_dir)
    tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
                            RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SplitSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),       
                            NormalizeUncCB(),
                            RemapUnitCB(),
                            RemapDetectionLimitCB(coi_dl, lut_dl),                           
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
                            RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
                            RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA'),
                            RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut),
                            RemapFiltCB(lut_filtered),
                            #RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
                            #AddSampleIDCB(lut_type=SMP_TYPE_LUT),
                            AddDepthCB(),
                            AddSalinityCB(),
                            AddTemperatureCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetPercentWeightCB(),
                            ParseCoordinates(ddmm_to_dd),
                            SanitizeLonLatCB(),
                            ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out_nc, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            verbose=kwargs.get('verbose', False),
                           )
    encoder.encode()
encode(src_dir, fname_out_nc, verbose=False)
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
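
As a quick sanity check before reviewing the file (a sketch, assuming the relative `fname_out_nc` path defined earlier), confirm the NetCDF output was written:

# Sketch: verify the NetCDF output file exists and report its size.
from pathlib import Path
out = Path(fname_out_nc)
if out.exists():
    print(f'{out.name}: {out.stat().st_size / 1e6:.1f} MB')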

NetCDF Review

First, let's review the global attributes of the NetCDF file:

contents = ExtractNetcdfContents(fname_out_nc)
print(contents.global_attrs)
{'id': '26VMZZ2Q', 'title': 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances', 'summary': 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.', 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)', 'history': 'TBD', 'keywords_vocabulary': 'GCMD Science Keywords', 'keywords_vocabulary_url': 'https://gcmd.earthdata.nasa.gov/static/kms/', 'record': 'TBD', 'featureType': 'TBD', 'cdm_data_type': 'TBD', 'Conventions': 'CF-1.10 ACDD-1.3', 'publisher_name': 'Paul MCGINNITY, Iolanda OSVATH, Florence DESCROIX-COMANDUCCI', 'publisher_email': 'p.mc-ginnity@iaea.org, i.osvath@iaea.org, F.Descroix-Comanducci@iaea.org', 'publisher_url': 'https://maris.iaea.org', 'publisher_institution': 'International Atomic Energy Agency - IAEA', 'creator_name': '[{"creatorType": "author", "name": "HELCOM MORS"}]', 'institution': 'TBD', 'metadata_link': 'TBD', 'creator_email': 'TBD', 'creator_url': 'TBD', 'references': 'TBD', 'license': 'Without prejudice to the applicable Terms and Conditions (https://nucleus.iaea.org/Pages/Others/Disclaimer.aspx), I hereby agree that any use of the data will contain appropriate acknowledgement of the data source(s) and the IAEA Marine Radioactivity Information System (MARIS).', 'comment': 'TBD', 'geospatial_lat_min': '31.17', 'geospatial_lon_min': '9.6333', 'geospatial_lat_max': '65.75', 'geospatial_lon_max': '53.5', 'geospatial_vertical_min': '0.0', 'geospatial_vertical_max': '437.0', 'geospatial_bounds': 'POLYGON ((9.6333 53.5, 31.17 53.5, 31.17 65.75, 9.6333 65.75, 9.6333 53.5))', 'geospatial_bounds_crs': 'EPSG:4326', 'time_coverage_start': '1984-01-10T00:00:00', 'time_coverage_end': '2023-11-30T00:00:00', 'local_time_zone': 'TBD', 'date_created': 'TBD', 'date_modified': 'TBD', 'publisher_postprocess_logs': "Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column., Remap data provider nuclide names to standardized MARIS nuclide names., Standardize time format across all dataframes., Encode time as seconds since epoch., Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements., Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column., Convert from relative error to standard uncertainty., Set the 
`unit` id column in the DataFrames based on a lookup table., Remap value type to MARIS format., Remap values from 'rubin' to 'SPECIES' for groups: BIOTA., Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA., Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA., Lookup sediment id using lookup table., Lookup filt value in dataframe using the lookup table., Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns., Remap Sediment slice top and bottom to MARIS format., Lookup dry-wet ratio and format for MARIS., Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}
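
Several attributes are still placeholders; a small sketch to list those marked 'TBD' for follow-up:

# Sketch: list global attributes still set to the 'TBD' placeholder.
tbd_attrs = [k for k, v in contents.global_attrs.items() if v == 'TBD']
print(tbd_attrs)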

Next, review the `publisher_postprocess_logs` attribute:

print(contents.global_attrs['publisher_postprocess_logs'])
Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column., Remap data provider nuclide names to standardized MARIS nuclide names., Standardize time format across all dataframes., Encode time as seconds since epoch., Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements., Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column., Convert from relative error to standard uncertainty., Set the `unit` id column in the DataFrames based on a lookup table., Remap value type to MARIS format., Remap values from 'rubin' to 'SPECIES' for groups: BIOTA., Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA., Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA., Lookup sediment id using lookup table., Lookup filt value in dataframe using the lookup table., Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns., Remap Sediment slice top and bottom to MARIS format., Lookup dry-wet ratio and format for MARIS., Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.

Now let's review the enums of the groups in the NetCDF file:

print(contents.enum_dicts)
{'BIOTA': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': '132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}, 'bio_group': {'Not applicable': '-1', 'Not available': '0', 'Birds': '1', 'Crustaceans': '2', 'Echinoderms': '3', 'Fish': '4', 'Mammals': '5', 'Molluscs': '6', 'Others': '7', 'Plankton': '8', 'Polychaete worms': '9', 'Reptile': '10', 'Seaweeds and plants': '11', 'Cephalopods': '12', 'Gastropods': '13', 'Bivalves': '14'}, 'species': {'NOT AVAILABLE': '0', 'Aristeus antennatus': '1', 'Apostichopus': '2', 'Saccharina japonica var religiosa': '3', 'Siganus fuscescens': '4', 'Alpheus dentipes': '5', 'Hexagrammos agrammus': '6', 'Ditrema temminckii': '7', 'Parapristipoma trilineatum': '8', 'Scombrops boops': '9', 'Pseudopleuronectes schrenki': '10', 'Desmarestia ligulata': '11', 'Saccharina japonica': '12', 'Neodilsea yendoana': '13', 'Costaria costata': '14', 'Sargassum yezoense': '15', 'Acanthephyra pelagica': '16', 'Sargassum ringgoldianum': '17', 'Acanthephyra 
quadrispinosa': '18', 'Sargassum thunbergii': '19', 'Sargassum patens': '20', 'Asterias rubens': '21', 'Sargassum miyabei': '22', 'Homarus gammarus': '23', 'Acanthephyra stylorostratis': '24', 'Acanthocybium solandri': '25', 'Acanthopagrus bifasciatus': '26', 'Acanthophora muscoides': '27', 'Acanthophora spicifera': '28', 'Acanthurus triostegus': '29', 'Actinopterygii': '30', 'Adamussium colbecki': '31', 'Ahnfeltiopsis densa': '32', 'Alepes melanoptera': '33', 'Ampharetidae': '34', 'Anchoviella lepidentostole': '35', 'Anguillidae': '36', 'Aphroditidae': '37', 'Arnoglossus': '38', 'Aurigequula fasciata': '39', 'Balaenoptera musculus': '40', 'Balaenoptera physalus': '41', 'Balistes': '42', 'Beryciformes': '43', 'Bryopsis maxima': '44', 'Callinectes sp': '45', 'Callorhinus ursinus': '46', 'Carassius auratus auratus': '47', 'Carcharhinus sorrah': '48', 'Caridae': '49', 'Clupea harengus': '50', 'Cathorops spixii': '51', 'Caulerpa racemosa': '52', 'Caulerpa scalpelliformis': '53', 'Caulerpa sertularioides': '54', 'Cellana radiata': '55', 'Coscinasterias tenuispina': '56', 'Centroceras clavulatum': '57', 'Centropomus parallelus': '58', 'Crangon crangon': '59', 'Ceramium diaphanum': '60', 'Ceramium rubrum': '61', 'Chaenocephalus aceratus': '62', 'Chaetodipterus faber': '63', 'Chaetomorpha antennina': '64', 'Chaetomorpha linoides': '65', 'Chelidonichthys kumu': '66', 'Chelon ramada': '67', 'Chiloscyllium': '68', 'Chionodraco hamatus': '69', 'Chlamys islandica': '70', 'Chlorophyta': '71', 'Chondrichthyes': '72', 'Chrysaora': '73', 'Cladophora nitellopsis': '74', 'Cladophora vagabunda': '75', 'Cladophoropsis membranacea': '76', 'Clupea': '77', 'Coccotylus truncatus': '78', 'Codium fragile': '79', 'Crassostrea': '80', 'Cynoscion acoupa': '81', 'Cynoscion jamaicensis': '82', 'Cynoscion leiarchus': '83', 'Engraulis encrasicolus': '84', 'Cypselurus agoo agoo': '85', 'Cystophora cristata': '86', 'Cystoseira barbata': '87', 'Cystoseira crinita': '88', 'Decapodiformes': '89', 'Decapterus russelli': '90', 'Decapterus scombrinus': '91', 'Delphinapterus leucas': '92', 'Delphinus capensis': '93', 'Diapterus rhombeus': '94', 'Dicentrarchus punctatus': '95', 'Fucus vesiculosus': '96', 'Funchalia woodwardi': '97', 'Ecklonia bicyclis': '98', 'Gadus morhua': '99', 'Ecklonia kurome': '100', 'Gennadas elegans': '101', 'Eisenia arborea': '102', 'Encrasicholina devisi': '103', 'Enteromorpha': '104', 'Enteromorpha flexuosa': '105', 'Enteromorpha intestinalis': '106', 'Epinephelinae': '107', 'Epinephelus diacanthus': '108', 'Exocoetidae': '109', 'Saccharina latissima': '110', 'Gracilaria corticata': '111', 'Ligur ensiferus': '112', 'Gracilaria debilis': '113', 'Gracilaria edulis': '114', 'Gracilariales': '115', 'Grateloupia elliptica': '116', 'Grateloupia filicina': '117', 'Lysmata seticaudata': '118', 'Gymnogongrus griffithsiae': '119', 'Mya arenaria': '120', 'Halichoerus grypus': '121', 'Macoma balthica': '122', 'Marthasterias glacialis': '123', 'Halimeda macroloba': '124', 'Harengula clupeola': '125', 'Harpagifer antarcticus': '126', 'Hemifusus ternatanus': '127', 'Hemiramphus brasiliensis': '128', 'Mytilus edulis': '129', 'Metapenaeus affinis': '130', 'Heteroscleromorpha': '131', 'Heterosigma akashiwo': '132', 'Hilsa ilisha': '133', 'Metapenaeus monoceros': '134', 'Metapenaeus stebbingi': '135', 'Holothuria': '136', 'Hoplobrotula armata': '137', 'Hypnea musciformis': '138', 'Merlangius merlangus': '139', 'Iridaea cordata': '140', 'Jania rubens': '141', 'Meganyctiphanes norvegica': '142', 'Johnius glaucus': '143', 
'Kappaphycus': '144', 'Kappaphycus alvarezii': '145', 'Laevistrombus canarium': '146', 'Lagenodelphis hosei': '147', 'Lambia': '148', 'Laminaria japonica': '149', 'Laminaria longissima': '150', 'Larimus breviceps': '151', 'Laurencia papillosa': '152', 'Leiognathidae': '153', 'Leiognathus dussumieri': '154', 'Lepidochelys olivacea': '155', 'Leptonychotes weddellii': '156', 'Limanda yokohamae': '157', 'Nephrops norvegicus': '158', 'Neuston': '159', 'Littoraria undulata': '160', 'Loligo vulgaris': '161', 'Lumbrineridae': '162', 'Lutjanus fulviflamma': '163', 'Marginisporum aberrans': '164', 'Megalaspis cordyla': '165', 'Octopus vulgaris': '166', 'Menticirrhus americanus': '167', 'Mesoplodon densirostris': '168', 'Palaemon longirostris': '169', 'Metapenaeus brevicornis': '170', 'Pasiphaea multidentata': '171', 'Pasiphaea sivado': '172', 'Parapenaeopsis stylifera': '173', 'Miichthys miiuy': '174', 'Mirounga leonina': '175', 'Brachidontes striatulus': '176', 'Monodon monoceros': '177', 'Mugil platanus': '178', 'Penaeus semisulcatus': '179', 'Mullus barbatus': '180', 'Mycteroperca rubra': '181', 'Philocheras echinulatus': '182', 'Myelophycus simplex': '183', 'Mytilus coruscus': '184', 'Penaeus indicus': '185', 'Natator depressus': '186', 'Pandalus jordani': '187', 'Melicertus kerathurus': '188', 'Parapenaeus longirostris': '189', 'Plesionika': '190', 'Platichthys flesus': '191', 'Pleuronectes platessa': '192', 'Nematopalaemon tenuipes': '193', 'Nematoscelis difficilis': '194', 'Nemipterus': '195', 'Aegaeon lacazei': '196', 'Nephtyidae': '197', 'Nereididae': '198', 'Netuma bilineata': '199', 'Nibea maculata': '200', 'Oceana serrulata': '201', 'Palaemon serratus': '202', 'Ocypode': '203', 'Odobenus rosmarus': '204', 'Ogcocephalus vespertilio': '205', 'Oligoplites saurus': '206', 'Onuphidae': '207', 'Opheliidae': '208', 'Opisthonema oglinum': '209', 'Opisthopterus tardoore': '210', 'Orientomysis mitsukurii': '211', 'Otolithes cuvieri': '212', 'Padina pavonica': '213', 'Padina tetrastromatica': '214', 'Padina vickersiae': '215', 'Pagellus affinis': '216', 'Pagophilus groenlandicus': '217', 'Paguroidea': '218', 'Pagurus': '219', 'Systellaspis debilis': '220', 'Sergestes': '221', 'Sergestes arcticus': '222', 'Pampus argenteus': '223', 'Sergestes arachnipodus': '224', 'Sergestes henseni': '225', 'Sergestes prehensilis': '226', 'Sergestes robustus': '227', 'Pangasius pangasius': '228', 'Panulirus homarus': '229', 'Paracentrotus lividus': '230', 'Pasiphaea sp': '231', 'Pectinariidae': '232', 'Penaeus': '233', 'Phoca vitulina': '234', 'Photopectoralis bindus': '235', 'Phyllospadix iwatensis': '236', 'Plectorhinchus mediterraneus': '237', 'Pleuronectes mochigarei': '238', 'Pleuronectes obscurus': '239', 'Plocamium brasiliense': '240', 'Polynemus paradiseus': '241', 'Polysiphonia': '242', 'Sprattus sprattus': '243', 'Scomber scombrus': '244', 'Polysiphonia fucoides': '245', 'Gonostomatidae': '246', 'Perca fluviatilis': '247', 'Pomadasys crocro': '248', 'Porphyra tenera': '249', 'Potamogeton pectinatus': '250', 'Priacanthus hamrur': '251', 'Pseudorhombus malayanus': '252', 'Pterocladiella capillacea': '253', 'Pusa caspica': '254', 'Pusa sibirica': '255', 'Pylaiella littoralis': '256', 'Sabellidae': '257', 'Salangichthys ishikawae': '258', 'Sarconema filiforme': '259', 'Sardinella albella': '260', 'Sardinella brasiliensis': '261', 'Sardinops melanostictus': '262', 'Sargassum cymosum': '263', 'Sargassum linearifolium': '264', 'Sargassum micracanthum': '265', 'Xiphias gladius': '266', 'Sargassum novae 
hollandiae': '267', 'Sargassum oligocystum': '268', 'Esox lucius': '269', 'Limanda limanda': '270', 'Abramis brama': '271', 'Anguilla anguilla': '272', 'Arctica islandica': '273', 'Cerastoderma edule': '274', 'Cyprinus carpio': '275', 'Echinodermata': '276', 'Fish larvae': '277', 'Myoxocephalus scorpius': '278', 'Osmerus eperlanus': '279', 'Plankton': '280', 'Scophthalmus maximus': '281', 'Rhodophyta': '282', 'Rutilus rutilus': '283', 'Saduria entomon': '284', 'Sander lucioperca': '285', 'Gasterosteus aculeatus': '286', 'Zoarces viviparus': '287', 'Gymnocephalus cernua': '288', 'Furcellaria lumbricalis': '289', 'Cladophora glomerata': '290', 'Lateolabrax japonicus': '291', 'Okamejei kenojei': '292', 'Sebastes pachycephalus': '293', 'Squalus acanthias': '294', 'Gadus macrocephalus': '295', 'Paralichthys olivaceus': '296', 'Ovalipes punctatus': '297', 'Pseudopleuronectes yokohamae': '298', 'Hemitripterus villosus': '299', 'Clidoderma asperrimum': '300', 'Microstomus achne': '301', 'Lepidotrigla microptera': '302', 'Hexagrammos otakii': '303', 'Kareius bicoloratus': '304', 'Pleuronichthys cornutus': '305', 'Enteroctopus dofleini': '306', 'Ammodytes personatus': '307', 'Lophius litulon': '308', 'Eopsetta grigorjewi': '309', 'Takifugu porphyreus': '310', 'Loliolus japonica': '311', 'Sepia andreana': '312', 'Sebastes cheni': '313', 'Portunus trituberculatus': '314', 'Sebastes schlegelii': '315', 'Pennahia argentata': '316', 'Platichthys stellatus': '317', 'Gadus chalcogrammus': '318', 'Chelidonichthys spinosus': '319', 'Conger myriaster': '320', 'Heterololigo bleekeri': '321', 'Stichaeus grigorjewi': '322', 'Pseudopleuronectes herzensteini': '323', 'Octopus conispadiceus': '324', 'Hippoglossoides dubius': '325', 'Cleisthenes pinetorum': '326', 'Glyptocephalus stelleri': '327', 'Tanakius kitaharae': '328', 'Nibea mitsukurii': '329', 'Dasyatis matsubarai': '330', 'Verasper moseri': '331', 'Hemitrygon akajei': '332', 'Triakis scyllium': '333', 'Trachurus japonicus': '334', 'Zeus faber': '335', 'Pagrus major': '336', 'Acanthopagrus schlegelii': '337', 'Dentex tumifrons': '338', 'Mustelus manazo': '339', 'Seriola quinqueradiata': '340', 'Hyperoglyphe japonica': '341', 'Carcharhinus': '342', 'Platycephalus': '343', 'Scomber japonicus': '344', 'Squatina japonica': '345', 'Alopias pelagicus': '346', 'Zenopsis nebulosa': '347', 'Cynoglossus joyneri': '348', 'Verasper variegatus': '349', 'Oncorhynchus keta': '350', 'Physiculus japonicus': '351', 'Oplegnathus punctatus': '352', 'Arothron hispidus': '353', 'Stereolepis doederleini': '354', 'Takifugu snyderi': '355', 'Scomber australasicus': '356', 'Liparis tanakae': '357', 'Thamnaconus modestus': '358', 'Gnathophis nystromi': '359', 'Sebastes oblongus': '360', 'Sebastiscus marmoratus': '361', 'Takifugu pardalis': '362', 'Mugil cephalus': '363', 'Ditrema temminckii temminckii': '364', 'Konosirus punctatus': '365', 'Tribolodon brandtii': '366', 'Oncorhynchus masou': '367', 'Aluterus monoceros': '368', 'Todarodes pacificus': '369', 'Myoxocephalus stelleri': '370', 'Myliobatis tobijei': '371', 'Scyliorhinus torazame': '372', 'Lophiomus setigerus': '373', 'Heterodontus japonicus': '374', 'Sebastes vulpes': '375', 'Paraplagusia japonica': '376', 'Ostrea edulis': '377', 'Melanogrammus aeglefinus': '378', 'Pollachius virens': '379', 'Pollachius pollachius': '380', 'Sebastes marinus': '381', 'Anarhichas minor': '382', 'Anarhichas denticulatus': '383', 'Reinhardtius hippoglossoides': '384', 'Trisopterus esmarkii': '385', 'Micromesistius poutassou': '386', 
'Coryphaenoides rupestris': '387', 'Argentina silus': '388', 'Salmo salar': '389', 'Sebastes viviparus': '390', 'Buccinum undatum': '391', 'Fucus serratus': '392', 'Merluccius merluccius': '393', 'Littorina littorea': '394', 'Fucus': '395', 'Rhodymenia': '396', 'Solea solea': '397', 'Trachurus trachurus': '398', 'Eutrigla gurnardus': '399', 'Pelvetia canaliculata': '400', 'Ascophyllum nodosum': '401', 'Mallotus villosus': '402', 'Pecten maximus': '403', 'Hippoglossoides platessoides': '404', 'Sebastes mentella': '405', 'Modiolus modiolus': '406', 'Boreogadus saida': '407', 'Sepia': '408', 'Gadus': '409', 'Sardina pilchardus': '410', 'Pleuronectiformes': '411', 'Molva molva': '412', 'Patella': '413', 'Crassostrea gigas': '414', 'Dasyatis pastinaca': '415', 'Lophius piscatorius': '416', 'Porphyra umbilicalis': '417', 'Patella vulgata': '418', 'Brosme brosme': '419', 'Glyptocephalus cynoglossus': '420', 'Galeus melastomus': '421', 'Chimaera monstrosa': '422', 'Etmopterus spinax': '423', 'Dicentrarchus labrax': '424', 'Osilinus lineatus': '425', 'Hippoglossus hippoglossus': '426', 'Cyclopterus lumpus': '427', 'Molva dypterygia': '428', 'Microstomus kitt': '429', 'Fucus distichus': '430', 'Tapes': '431', 'Sebastes norvegicus': '432', 'Phycis blennoides': '433', 'Fucus spiralis': '434', 'Laminaria digitata': '435', 'Dipturus batis': '436', 'Anarhichas lupus': '437', 'Lumpenus lampretaeformis': '438', 'Lycodes vahlii': '439', 'Argentina sphyraena': '440', 'Trisopterus minutus': '441', 'Thunnus': '442', 'Hyperoplus lanceolatus': '443', 'Gaidropsarus argentatus': '444', 'Engraulis japonicus': '445', 'Mytilus galloprovincialis': '446', 'Undaria pinnatifida': '447', 'Chlorophthalmus albatrossis': '448', 'Sargassum fusiforme': '449', 'Eisenia bicyclis': '450', 'Spisula sachalinensis': '451', 'Strongylocentrotus nudus': '452', 'Haliotis discus hannai': '453', 'Dexistes rikuzenius': '454', 'Ruditapes philippinarum': '455', 'Apostichopus japonicus': '456', 'Pterothrissus gissu': '457', 'Helicolenus hilgendorfii': '458', 'Buccinum isaotakii': '459', 'Neptunea intersculpta': '460', 'Apostichopus nigripunctatus': '461', 'Sebastes thompsoni': '462', 'Oratosquilla oratoria': '463', 'Oncorhynchus kisutch': '464', 'Erimacrus isenbeckii': '465', 'Sillago japonica': '466', 'Trachysalambria curvirostris': '467', 'Mytilus unguiculatus': '468', 'Crassostrea nippona': '469', 'Laminariales': '470', 'Uroteuthis edulis': '471', 'Takifugu poecilonotus': '472', 'Neptunea arthritica': '473', 'Katsuwonus pelamis': '474', 'Doederleinia berycoides': '475', 'Metapenaeopsis dalei': '476', 'Seriola dumerili': '477', 'Pseudorhombus pentophthalmus': '478', 'Stephanolepis cirrhifer': '479', 'Cookeolus japonicus': '480', 'Panulirus japonicus': '481', 'Thunnus orientalis': '482', 'Halocynthia roretzi': '483', 'Etrumeus sadina': '484', 'Cololabis saira': '485', 'Coryphaena hippurus': '486', 'Sarda orientalis': '487', 'Octopus ocellatus': '488', 'Sardinops sagax': '489', 'Sphyraena pinguis': '490', 'Sebastes ventricosus': '491', 'Occella iburia': '492', 'Glossanodon semifasciatus': '493', 'Mizuhopecten yessoensis': '494', 'Neosalangichthys ishikawae': '495', 'Bothrocara tanakae': '496', 'Malacocottus zonurus': '497', 'Coelorinchus macrochir': '498', 'Neptunea constricta': '499', 'Beringius polynematicus': '500', 'Sebastes nivosus': '501', 'Pandalus eous': '502', 'Synaphobranchus kaupii': '503', 'Sebastolobus macrochir': '504', 'Marsupenaeus japonicus': '505', 'Japelion hirasei': '506', 'Pleurogrammus azonus': '507', 'Monostroma 
nitidum': '508', 'Atheresthes evermanni': '509', 'Takifugu rubripes': '510', 'Chionoecetes opilio': '511', 'Pandalopsis coccinata': '512', 'Chionoecetes japonicus': '513', 'Sebastes matsubarae': '514', 'Scombrops gilberti': '515', 'Hyporhamphus sajori': '516', 'Trichiurus lepturus': '517', 'Alcichthys elongatus': '518', 'Volutharpa perryi': '519', 'Mercenaria stimpsoni': '520', 'Berryteuthis magister': '521', 'Aptocyclus ventricosus': '522', 'Euphausia pacifica': '523', 'Salangichthys microdon': '524', 'Telmessus acutidens': '525', 'Ceratophyllum demersum': '526', 'Pandalus nipponensis': '527', 'Sebastes owstoni': '528', 'Cociella crocodilus': '529', 'Conger japonicus': '530', 'Sardinella zunasi': '531', 'Cheilopogon pinnatibarbatus japonicus': '532', 'Oplegnathus fasciatus': '533', 'Macridiscus aequilatera': '534', 'Repomucenus ornatipinnis': '535', 'Clupea pallasii': '536', 'Scorpaena neglecta': '537', 'Scomberomorus niphonius': '538', 'Leucopsarion petersii': '539', 'Sebastes scythropus': '540', 'Strongylura anastomella': '541', 'Laemonema longipes': '542', 'Fusitriton oregonensis': '543', 'Japelion pericochlion': '544', 'Sebastes steindachneri': '545', 'Auxis rochei': '546', 'Lobotes surinamensis': '547', 'Auxis thazard': '548', 'Chlorophthalmus borealis': '549', 'Etelis coruscans': '550', 'Sebastes inermis': '551', 'Cynoglossus interruptus': '552', 'Erilepis zonifer': '553', 'Tridentiger obscurus': '554', 'Caranx sexfasciatus': '555', 'Thunnus thynnus': '556', 'Takifugu stictonotus': '557', 'Euthynnus affinis': '558', 'Synagrops japonicus': '559', 'Okamejei schmidti': '560', 'Suggrundus meerdervoortii': '561', 'Sebastes baramenuke': '562', 'Pleurogrammus monopterygius': '563', 'Decapterus maruadsi': '564', 'Girella punctata': '565', 'Sphyraena japonica': '566', 'Ommastrephes bartramii': '567', 'Sepiella japonica': '568', 'Sepioteuthis lessoniana': '569', 'Eucleoteuthis luminosa': '570', 'Gloiopeltis furcata': '571', 'Macrobrachium nipponense': '572', 'Sepia kobiensis': '573', 'Eriocheir japonica': '574', 'Magallana nippona': '575', 'Meretrix lusoria': '576', 'Chondrus ocellatus': '577', 'Chondrus elatus': '578', 'Gloiopeltis': '579', 'Holothuroidea': '580', 'Corbicula japonica': '581', 'Sunetta menstrualis': '582', 'Pseudorhombus cinnamoneus': '583', 'Takifugu niphobles': '584', 'Lagocephalus gloveri': '585', 'Beryx splendens': '586', 'Parastichopus nigripunctatus': '587', 'Venerupis philippinarum': '588', 'Haliotis': '589', 'Liparis agassizii': '590', 'Seriola lalandi': '591', 'Niphon spinosus': '592', 'Pleuronichthys japonicus': '593', 'Sergia lucens': '594', 'Sphoeroides pachygaster': '595', 'Coryphaenoides acrolepis': '596', 'Pseudopleuronectes obscurus': '597', 'Pyropia yezoensis': '598', 'Isurus oxyrinchus': '599', 'Sargassum fulvellum': '600', 'Prionace glauca': '601', 'Kajikia audax': '602', 'Thunnus albacares': '603', 'Thunnus alalunga': '604', 'Thunnus obesus': '605', 'Lamna ditropis': '606', 'Glyptocidaris crenularis': '607', 'Asterias amurensis': '608', 'Sepiida': '609', 'Congridae': '610', 'Takifugu': '611', 'Sargassum horneri': '612', 'Haliotis discus': '613', 'Pleuronectidae': '614', 'Acanthogobius flavimanus': '615', 'Acanthogobius lactipes': '616', 'Pholis nebulosa': '617', 'Hemigrapsus penicillatus': '618', 'Palaemon paucidens': '619', 'Mysidae': '620', 'Zostera marina': '621', 'Ulva pertusa': '622', 'Gobiidae': '623', 'Atherinidae': '624', 'Tribolodon': '625', 'Alpheus': '626', 'Polychaeta': '627', 'Sebastes': '628', 'Charybdis japonica': '629', 'Hemigrapsus': 
'630', 'Favonigobius gymnauchen': '631', 'Palaemon': '632', 'Planiliza haematocheila': '633', 'Palaemonidae': '634', 'Pholis crassispina': '635', 'Laminaria': '636', 'Distolasterias nipon': '637', 'Lophiiformes': '638', 'Alpheus brevicristatus': '639', 'Undaria undariodes': '640', 'Neomysis awatschensis': '641', 'Alpheidae': '642', 'Macrobrachium': '643', 'Hediste': '644', 'Gymnogobius breunigii': '645', 'Luidia quinaria': '646', 'Rhizoprionodon acutus': '647', 'Carangoides equula': '648', 'Carcinoplax longimana': '649', 'Anomura': '650', 'Spatangoida': '651', 'Plesiobatis daviesi': '652', 'Eusphyra blochii': '653', 'Ruditapes variegata': '654', 'Sinonovacula constricta': '655', 'Penaeus monodon': '656', 'Litopenaeus vannamei': '657', 'Solenocera crassicornis': '658', 'Stomatopoda': '659', 'Teuthida': '660', 'Octopus': '661', 'Larimichthys polyactis': '662', 'Scomberomorini': '663', 'Channa argus': '664', 'Ranina ranina': '665', 'Lates calcarifer': '666', 'Scomberomorus commerson': '667', 'Lutjanus malabaricus': '668', 'Thenus parindicus': '669', 'Amusium pleuronectes': '670', 'Loligo': '671', 'Plectropomus leopardus': '672', 'Sillago ciliata': '673', 'Scylla serrata': '674', 'Pinctada maxima': '675', 'Lutjanus argentimaculatus': '676', 'Protonibea diacanthus': '677', 'Polydactylus macrochir': '678', 'Rachycentron canadum': '679', 'Ibacus peronii': '680', 'Arripis trutta': '681', 'Sarda australis': '682', 'Seriola hippos': '683', 'Choerodon schoenleinii': '684', 'Panulirus ornatus': '685', 'Neotrygon kuhlii': '686', 'Lethrinus nebulosus': '687', 'Parupeneus multifasciatus': '688', 'Saccostrea cucullata': '689', 'Lutjanus sebae': '690', 'Thunnus maccoyii': '691', 'Acanthopagrus butcheri': '692', 'Lambis lambis': '693', 'Gerres subfasciatus': '694', 'Zooplankton': '695', 'Phytoplankton': '696', 'Rapana venosa': '697', 'Scapharca inaequivalvis': '698', 'Ulva intestinalis': '699', 'Ulva linza': '700', 'Ceramium virgatum': '701', 'Gayralia oxysperma': '702', 'Vertebrata fucoides': '703', 'Stuckenia pectinata': '704', 'Rochia nilotica': '705', 'Ctenochaetus striatus': '706', 'Serranidae': '707', 'Turbo setosus': '708', 'Pandalidae': '709', 'Gymnosarda unicolor': '710', 'Epinephelini': '711', 'Pisces': '712', 'Liza klunzingeri': '713', 'Acanthopagrus latus': '714', 'Liza subviridis': '715', 'Sparidentex hasta': '716', 'Otolithes ruber': '717', 'Crenidens crenidens': '718', 'Ensis': '719', 'Gastropoda': '720', 'Euheterodonta': '721', 'Scomber': '722', 'Theragra chalcogramma': '723', 'Engraulidae': '724', 'Ostreidae': '725', 'Phaeophyceae': '726', 'Porphyra': '727', 'Ulva reticulata': '728', 'Perna viridis': '729', 'Fenneropenaeus indicus': '730', 'Merluccius': '731', 'Soleidae': '732', 'Mugilidae': '733', 'Marine algae': '734', 'Scarus rivulatus': '735', 'Scarus coeruleus': '736', 'Sardinella fimbriata': '737', 'Dussumieria acuta': '738', 'Lutjanus kasmira': '739', 'Lutjanus rivulatus': '740', 'Lutjanus bohar': '741', 'Priacanthus blochii': '742', 'Pelates quadrilineatus': '743', 'Epinephelus fasciatus': '744', 'Upeneus vittatus': '745', 'Lethrinus laticaudis': '746', 'Lethrinus lentjan': '747', 'Lethrinus microdon': '748', 'Sphyraena barracuda': '749', 'Alectis indica': '750', 'Epinephelus latifasciatus': '751', 'Nemipterus japonicus': '752', 'Raconda russeliana': '753', 'Lactarius lactarius': '754', 'Aetomylaeus bovinus': '755', 'Pennahia anea': '756', 'Leiognathus fasciatus': '757', 'Sardinella longiceps': '758', 'Tenualosa ilisha': '759', 'Pellona ditchela': '760', 'Stolephorus indicus': 
'761', 'Setipinna breviceps': '762', 'Rastrelliger kanagurta': '763', 'Chanos chanos': '764', 'Lepturacanthus savala': '765', 'Epinephelus niveatus': '766', 'Lutjanus johnii': '767', 'Carangoides malabaricus': '768', 'Ablennes hians': '769', 'Chirocentrus dorab': '770', 'Scomberomorus cavalla': '771', 'Scomberomorus semifasciatus': '772', 'Scomberomorus guttatus': '773', 'Etrumeus teres': '774', 'Spondyliosoma cantharus': '775', 'Brama brama': '776', 'Dasyatis zugei': '777', 'Harpadon nehereus': '778', 'Carcharhinus melanopterus': '779', 'Penaeus plebejus': '780', 'Sepia officinalis': '781', 'Johnius dussumieri': '782', 'Lutjanus campechanus': '783', 'Ruditapes decussatus': '784', 'Carcinus aestuarii': '785', 'Squilla mantis': '786', 'Epinephelus polyphekadion': '787', 'Lutjanus gibbus': '788', 'Lethrinus mahsena': '789', 'Epinephelus chlorostigma': '790', 'Carangoides bajad': '791', 'Aethaloperca rogaa': '792', 'Atule mate': '793', 'Macolor niger': '794', 'Carangoides fulvoguttatus': '795', 'Plectropomus areolatus': '796', 'Cephalopholis argus': '797', 'Cephalopholis': '798', 'Scarus sordidus': '799', 'Scomberomorus tritor': '800', 'Triaenodon obesus': '801', 'Pomadasys commersonnii': '802', 'Monotaxis grandoculis': '803', 'Plectropomus maculatus': '804', 'Trachinotus blochii': '805', 'Pristipomoides filamentosus': '806', 'Acanthurus gahhm': '807', 'Acanthurus sohal': '808', 'Siganus argenteus': '809', 'Naso unicornis': '810', 'Chanos': '811', 'Oedalechilus labiosus': '812', 'Plectorhinchus gaterinus': '813', 'Mercenaria mercenaria': '814', 'Mytilus': '815', 'Turbo cornutus': '816', 'Decapoda': '817', 'Sphyraena': '818', 'Arius maculatus': '819', 'Penaeus merguiensis': '820', 'Tegillarca granosa': '821', 'Mullus barbatus barbatus': '822', 'Chamelea gallina': '823', 'Metanephrops thomsoni': '824', 'Magallana gigas': '825', 'Branchiostegus japonicus': '826', 'Cephalopoda': '827', 'Lutjanidae': '828', 'Lethrinidae': '829', 'Sphyraena argentea': '830', 'Chirocentrus nudus': '831', 'Trachinotus': '832', 'Mugil auratus': '833', 'Euthynnus alletteratus': '834', 'Sparus aurata': '835', 'Pagrus caeruleostictus': '836', 'Scorpaena scrofa': '837', 'Pagellus erythrinus': '838', 'Epinephelus aeneus': '839', 'Dentex maroccanus': '840', 'Caranx rhonchus': '841', 'Sardinella': '842', 'Siganus': '843', 'Solea': '844', 'Diplodus sargus': '845', 'Lithognathus mormyrus': '846', 'Oblada melanura': '847', 'Siganus rivulatus': '848', 'Chelon labrosus': '849', 'Cynoscion microlepidotus': '850', 'Genypterus brasiliensis': '851', 'Myoxocephalus polyacanthocephalus': '852', 'Hexagrammos lagocephalus': '853', 'Hexagrammos decagrammus': '854', 'Sebastes ciliatus': '855', 'Lepidopsetta polyxystra': '856', 'Clupeiformes': '857', 'Gadidae': '858', 'Brachyura': '859', 'Dasyatis': '860', 'Carcharias': '861', 'Saurida': '862', 'Upeneus': '863', 'Cynoglossus': '864', 'Scomberomorus': '865', 'Terapon': '866', 'Leiognathus': '867', 'Terapontidae': '868', 'Caranx': '869', 'Diplodus': '870', 'Plectorhinchus flavomaculatus': '871', 'Salmonidae': '872', 'Mollusca': '873', 'Boops boops': '874', 'Sarpa salpa': '875', 'Pagellus acarne': '876', 'Spicara smaris': '877', 'Diplodus vulgaris': '878', 'Chelidonichthys lucerna': '879', 'Sarda sarda': '880', 'Serranus cabrilla': '881', 'Diplodus annularis': '882', 'Pagrus pagrus': '883', 'Alosa fallax': '884', 'Belone belone': '885', 'Dentex dentex': '886', 'Sphyraena viridensis': '887', 'Trisopterus capelanus': '888', 'Arnoglossus laterna': '889', 'Procambarus clarkii': '890', 
'Nemadactylus macropterus': '891', 'Pagrus auratus': '892', 'Jasus edwardsii': '893', 'Perna canaliculus': '894', 'Pseudophycis bachus': '895', 'Haliotis iris': '896', 'Hoplostethus atlanticus': '897', 'Rhombosolea leporina': '898', 'Zygochlamys delicatula': '899', 'Galeorhinus galeus': '900', 'Parapercis colias': '901', 'Tiostrea chilensis': '902', 'Genypterus blacodes': '903', 'Evechinus chloroticus': '904', 'Austrovenus stutchburyi': '905', 'Micromesistius australis': '906', 'Macruronus novaezelandiae': '907', 'Nototodarus': '908', 'Perna perna': '909', 'Sepia pharaonis': '910', 'Turbo bruneus': '911', 'Portunus sanguinolentus': '912', 'Charybdis natator': '913', 'Charybdis lucifera': '914', 'Panulirus argus': '915', 'Ethmalosa fimbriata': '916', 'Sardinella brachysoma': '917', 'Thryssa mystax': '918', 'Plicofollis dussumieri': '919', 'Nibea soldado': '920', 'Epinephelus melanostigma': '921', 'Megalops cyprinoides': '922', 'Decapterus macarellus': '923', 'Drepane punctata': '924', 'Sillago sihama': '925', 'Tylosurus crocodilus crocodilus': '926', 'Saurida tumbil': '927', 'Cynoglossus macrostomus': '928', 'Parupeneus indicus': '929', 'Synechogobius hasta': '930', 'Busycotypus canaliculatus': '931', 'Pampus cinereus': '932', 'Pomadasys kaakan': '933', 'Epinephelus coioides': '934', 'Sepiella inermis': '935', 'Uroteuthis duvauceli': '936', 'Stomatella auricula': '937', 'Cerithium scabridum': '938', 'Marcia recens': '939', 'Circe intermedia': '940', 'Marcia opima': '941', 'Fulvia fragile': '942', 'Charybdis feriatus': '943', 'Charybdis annulata': '944', 'Atergatis integerrimus': '945', 'Matuta lunaris': '946', 'Calappa lophos': '947', 'Uca annulipes': '948', 'Chlamys varia': '949', 'Cololabis adocetus': '950', 'Seriola lalandi dorsalis': '951', 'Brunneifusus ternatanus': '952', 'Metapenaeus joyneri': '953', 'Epinephelus tauvina': '954', 'Coilia dussumieri': '955', 'Carcharhinus dussumieri': '956', 'Upeneus tragula': '957', 'Sartoriana spinigera': '958', 'Lamellidens marginalis': '959', 'Polydactylus sextarius': '960', 'Johnius macrorhynus': '961', 'Hexanematichthys sagor': '962', 'Sargassum swartzii': '963', 'Argyrops spinifer': '964', 'Synodus intermedius': '965', 'Muraenesox cinereus': '966', 'Carangoides armatus': '967', 'Eleutheronema tetradactylum': '968', 'Mustelus mosis': '969', 'Nemipterus bipunctatus': '970', 'Lutjanus quinquelineatus': '971', 'Platycephalus indicus': '972', 'Rhabdosargus haffara': '973', 'Argyrops filamentosus': '974', 'Brachirus orientalis': '975', 'Mene maculata': '976', 'Hemiramphus marginatus': '977', 'Encrasicholina heteroloba': '978', 'Trachinotus africanus': '979', 'Bramidae': '980', 'Escualosa thoracata': '981', 'Sepia arabica': '982', 'Scatophagus argus': '983', 'Parastromateus niger': '984', 'Planiliza subviridis': '985', 'Labeo rohita': '986', 'Oreochromis niloticus': '987', 'Cardiidae': '988', 'Sargassum angustifolium': '989', 'Pomacea bridgesii': '990', 'Sebastes fasciatus': '991', 'Batoidea': '992', 'Urophycis chuss': '993', 'Dalatias licha': '994', 'Trisopterus luscus': '995', 'Scyliorhinus canicula': '996', 'Ruvettus pretiosus': '997', 'Aphanopus carbo': '998', 'Alepocephalus bairdii': '999', 'Centroscymnus coelolepis': '1000', 'Loligo forbesii': '1001', 'Lutjanus cyanopterus': '1002', 'Mugil liza': '1003', 'Micropogonias furnieri': '1004', 'Balistes capriscus': '1005', 'Haemulidae': '1006', 'Stenotomus caprinus': '1007', 'Hemanthias leptus': '1008', 'Micropogonias undulatus': '1009', 'Cynoscion nebulosus': '1010', 'Rhomboplites aurorubens': 
'1011', 'Bothidae': '1012', 'Pogonias cromis': '1013', 'Lutjanus synagris': '1014', 'Netuma thalassina': '1015', 'Sillaginopsis panijus': '1016', 'Leptomelanosoma indicum': '1017', 'Therapon': '1018', 'Pterotolithus maculatus': '1019', 'Ilisha filigera': '1020', 'Hilsa kelee': '1021', 'Pampus chinensis': '1022', 'Palaemon styliferus': '1023', 'Argyrosomus regius': '1024', 'Lutjanus': '1025', 'Sciades': '1026', 'Mullus': '1027', 'Albula vulpes': '1028', 'Selar crumenophthalmus': '1029', 'Centropomus': '1030', 'Sardinella aurita': '1031', 'Harengula humeralis': '1032', 'Diapterus auratus': '1033', 'Gerres cinereus': '1034', 'Haemulon parra': '1035', 'Ocyurus chrysurus': '1036', 'Sphyraena guachancho': '1037', 'Anoplopoma fimbria': '1038', 'Nerita versicolor': '1039', 'Bulla striata': '1040', 'Melongena melongena': '1041', 'Trachycardium muricatum': '1042', 'Isognomon alatus': '1043', 'Brachidontes exustus': '1044', 'Crassostrea virginica': '1045', 'Protothaca granulata': '1046', 'Cittarium pica': '1047', 'Penaeus schmitti': '1048', 'Penaeus notialis': '1049', 'Callinectes sapidus': '1050', 'Callinectes danae': '1051', 'Dasyatidae': '1052', 'Caridea': '1053', 'Nephropidae': '1054', 'Sparus': '1055', 'Sargassum boveanum': '1056', 'Haliotis tuberculata': '1057', 'Littorinidae': '1058', 'Seaweed': '1059', 'Echinoidea': '1060', 'Ostreida': '1061', 'Donax trunculus': '1062', 'Scrobicularia plana': '1063', 'Venus verrucosa': '1064', 'Solen marginatus': '1065', 'Testudines': '1066', 'Mullidae': '1067', 'Amphipoda': '1068', 'Cystosphaera jacquinotii': '1069', 'Daption capense': '1070', 'Desmarestia anceps': '1071', 'Himantothallus grandifolius': '1072', 'Mirounga': '1073', 'Nacella concinna': '1074', 'Notothenia coriiceps': '1075', 'Pygoscelis antarcticus': '1076', 'Pygoscelis papua': '1077', 'Oncorhynchus gorbuscha': '1078', 'Oncorhynchus mykiss': '1079', 'Oncorhynchus nerka': '1080', 'Oncorhynchus tshawytscha': '1081', 'Erignathus barbatus': '1082', 'Pusa hispida': '1083', 'Hippoglossus stenolepis': '1084', 'Squalus suckleyi': '1085', 'Sargassum': '1086', 'Codium': '1087', 'Membranoptera alata': '1088', 'Dictyota dichotoma': '1089', 'Plocamium cartilagineum': '1090', 'Galatea paradoxa': '1091', 'Crassostrea tulipa': '1092', 'Macrobrachium sp': '1093', 'Portunus': '1094', 'Tympanotonos fuscatus': '1095', 'Thais': '1096', 'Bivalvia': '1097', 'Cynoglossus senegalensis': '1098', 'Carlarius heudelotii': '1099', 'Fontitrygon margarita': '1100', 'Chrysichthys nigrodigitatus': '1101', 'Acanthephyra purpurea': '1102', 'Actinauge abyssorum': '1103', 'Alaria marginata': '1104', 'Anadara transversa': '1105', 'Anthomedusae': '1106', 'Archosargus probatocephalus': '1107', 'Argyropelecus aculeatus': '1108', 'Ariopsis felis': '1109', 'Astrometis sertulifera': '1110', 'Astropecten': '1111', 'Atherina breviceps': '1112', 'Atolla': '1113', 'Aulacomya atra': '1114', 'Auxis rochei rochei': '1115', 'Auxis thazard thazard': '1116', 'Avicennia marina': '1117', 'Balaena mysticetus': '1118', 'Balaenoptera acutorostrata': '1119', 'Balanus': '1120', 'Berardius bairdii': '1121', 'Beroe': '1122', 'Boopsoidea inornata': '1123', 'Calanoida': '1124', 'Calanus finmarchicus finmarchicus': '1125', 'Callorhinchus milii': '1126', 'Cepphus columba': '1127', 'Cladonia rangiferina': '1128', 'Clinus superciliosus': '1129', 'Codium tomentosum': '1130', 'Copepoda': '1131', 'Coregonus autumnalis': '1132', 'Coregonus nasus': '1133', 'Coregonus sardinella': '1134', 'Coryphaenoides armatus': '1135', 'Coryphoblennius galerita': '1136', 'Creseis 
sp': '1137', 'Crinoidea': '1138', 'Crossota': '1139', 'Cryptochiton stelleri': '1140', 'Delphinus delphis': '1141', 'Diacria': '1142', 'Dichistius capensis': '1143', 'Dosinia alta': '1144', 'Dugong dugon': '1145', 'Electrona risso': '1146', 'Engraulis capensis': '1147', 'Ensis siliqua': '1148', 'Eryonidae': '1149', 'Eualaria fistulosa': '1150', 'Eupasiphae gilesii': '1151', 'Euphausiacea': '1152', 'Euphausiidae': '1153', 'Eurypharynx pelecanoides': '1154', 'Eurythenes gryllus': '1155', 'Euthynnus lineatus': '1156', 'Fratercula cirrhata': '1157', 'Galeichthys feliceps': '1158', 'Gelidium corneum': '1159', 'Gibbula umbilicalis': '1160', 'Gnathophausia ingens': '1161', 'Gonatus fabricii': '1162', 'Haliaeetus leucocephalus': '1163', 'Haliclona': '1164', 'Halodule uninervis': '1165', 'Hemilepidotus': '1166', 'Hemilepidotus jordani': '1167', 'Heterocarpus ensifer': '1168', 'Heterodontus portusjacksoni': '1169', 'Hippasteria phrygiana': '1170', 'Homola barbata': '1171', 'Hyperoodon planifrons': '1172', 'Hypleurochilus geminatus': '1173', 'Invertebrata': '1174', 'Isognomon bicolor': '1175', 'Isopoda': '1176', 'Kogia breviceps': '1177', 'Labrus bergylta': '1178', 'Lagenorhynchus obliquidens': '1179', 'Lampris guttatus': '1180', 'Larus glaucescens': '1181', 'Leander serratus': '1182', 'Libinia emarginata': '1183', 'Lichia amia': '1184', 'Lipophrys pholis': '1185', 'Lipophrys trigloides': '1186', 'Lithognathus lithognathus': '1187', 'Lithophaga aristata': '1188', 'Lobianchia gemellarii': '1189', 'Loliginidae': '1190', 'Loligo reynaudii': '1191', 'Lophius budegassa': '1192', 'Magallana angulata': '1193', 'Majoidea': '1194', 'Megachasma pelagios': '1195', 'Megaptera novaeangliae': '1196', 'Menippe mercenaria': '1197', 'Mesoplodon carlhubbsi': '1198', 'Mesoplodon stejnegeri': '1199', 'Microstomus pacificus': '1200', 'Morone saxatilis': '1201', 'Mullus surmuletus': '1202', 'Mycteroperca xenarcha': '1203', 'Myliobatis australis': '1204', 'Mysida': '1205', 'Mytilus californianus': '1206', 'Mytilus trossulus': '1207', 'Nephasoma Nephasoma flagriferum': '1208', 'Nudibranchia': '1209', 'Odobenus rosmarus divergens': '1210', 'Ommastrephidae': '1211', 'Ophiomusa lymani': '1212', 'Ophiothrix lineata': '1213', 'Orcinus orca': '1214', 'Ostracoda': '1215', 'Pagellus bogaraveo': '1216', 'Pandalus borealis': '1217', 'Paphies subtriangulata': '1218', 'Parabrotula': '1219', 'Paracalanus': '1220', 'Patella aspera': '1221', 'Periphylla': '1222', 'Phocoena phocoena': '1223', 'Phocoenoides dalli': '1224', 'Phronima': '1225', 'Physeter macrocephalus': '1226', 'Pinctada radiata': '1227', 'Plesionika edwardsii': '1228', 'Pododesmus macrochisma': '1229', 'Pomatomus saltatrix': '1230', 'Portunus pelagicus': '1231', 'Praunus': '1232', 'Pyrosoma': '1233', 'Rangifer tarandus': '1234', 'Rhabdosargus globiceps': '1235', 'Saccorhiza polyschides': '1236', 'Sagitta': '1237', 'Salpa': '1238', 'Salvelinus alpinus': '1239', 'Salvelinus malma': '1240', 'Sarda chiliensis': '1241', 'Sargassum aquifolium': '1242', 'Scalibregmatidae': '1243', 'Sebastes alutus': '1244', 'Sebastes melanops': '1245', 'Seriola dorsalis': '1246', 'Serranus scriba': '1247', 'Sigmops bathyphilus': '1248', 'Silicula fragilis': '1249', 'Sipunculidae': '1250', 'Somateria mollissima': '1251', 'Somateria spectabilis': '1252', 'Sparodon durbanensis': '1253', 'Spicara maena': '1254', 'Squatina australis': '1255', 'Striostrea margaritacea': '1256', 'Stromateus fiatola': '1257', 'Strongylocentrotus polyacanthus': '1258', 'Taractichthys steindachneri': '1259', 'Tectura 
scutum': '1260', 'Tegula viridula': '1261', 'Thais haemastoma': '1262', 'Thegrefg': '1263', 'Themisto': '1264', 'Thunnus tonggol': '1265', 'Trachurus picturatus': '1266', 'Trachurus symmetricus': '1267', 'Trygonorrhina fasciata': '1268', 'Ulva lactuca': '1269', 'Ursus maritimus': '1270', 'Vampyroteuthis infernalis': '1271', 'Ziphius cavirostris': '1272', 'Alepes kleinii': '1273', 'Alepes vari': '1274', 'Decapterus macrosoma': '1275', 'Lutjanus madras': '1276', 'Lutjanus russellii': '1277', 'Rastrelliger brachysoma': '1278', 'Rastrelliger faughni': '1279', 'Selar boops': '1280', 'Selaroides leptolepis': '1281', 'Sphyraena obtusata': '1282', 'Geloina expansa': '1283', 'Caesio erythrogaster': '1284', 'Euristhmus microceps': '1285', 'Pomacanthus annularis': '1286', 'Scylla': '1287', 'Plotosus lineatus': '1288', 'Prionotus stephanophrys': '1289', 'Trachurus murphyi': '1290', 'Dosidicus gigas': '1291', 'Sarda chiliensis chiliensis': '1292', 'Cynoscion analis': '1293', 'Merluccius gayi peruanus': '1294', 'Brotula ordwayi': '1295', 'Loligo gahi': '1296', 'Merluccius gayi': '1297', 'Ophichthus remiger': '1298', 'Penaeus sp': '1299', 'Trachinotus paitensis': '1300', 'Cheilopogon heterurus': '1301', 'Engraulis ringens': '1302', 'Sciaena deliciosa': '1303', 'Isacia conceptionis': '1304', 'Odontesthes regia': '1305', 'Bodianus diplotaenia': '1306', 'Concholepas concholepas': '1307', 'Diplectrum conceptione': '1308', 'Genypterus maculatus': '1309', 'Labrisomus philippii': '1310', 'Paralabrax humeralis': '1311', 'Prionotus horrens': '1312', 'Dasyatis akajei': '1313', 'Arctoscopus japonicus': '1314', 'Sepia esculenta': '1315', 'Bothrocara hollandi': '1316', 'Cynoglossidae': '1317', 'Lepidotrigla': '1318', 'Lepidotrigla alata': '1319', 'Octopus sinensis': '1320', 'Rhabdosargus sarba': '1321', 'Lophiidae': '1322', 'Muraenesox': '1323', 'Physiculus maximowiczi': '1324', 'Pleuronectoidei': '1325', 'Sciaenidae': '1326', 'Triglidae': '1327', 'Atherina presbyter': '1328', 'Bentheogennema intermedia': '1329', 'Benthesicymidae': '1330', 'Benthesicymus': '1331', 'Buccinum striatissimum': '1332', 'Callinectes': '1333', 'Cancer pagurus': '1334', 'Chaetognatha': '1335', 'Chama macerophylla': '1336', 'Cirripedia': '1337', 'Cyclosalpa': '1338', 'Cymopolia barbata': '1339', 'Cynoscion': '1340', 'Cystoseira amentacea': '1341', 'Ectocarpus siliculosus': '1342', 'Ellisolandia elongata': '1343', 'Enteromorpha linza': '1344', 'Euphausia superba': '1345', 'Gaidropsarus mediterraneus': '1346', 'Gennadas valens': '1347', 'Globicephala': '1348', 'Haliptilon virgatum': '1349', 'Halocynthia aurantium': '1350', 'Heliocidaris crassispina': '1351', 'Hymenodora gracilis': '1352', 'Lagodon rhomboides': '1353', 'Lepas Anatifa anatifera': '1354', 'Lobophora variegata': '1355', 'Macrocystis pyrifera': '1356', 'Maculabatis gerrardi': '1357', 'Nemacystus decipiens': '1358', 'Neptunea polycostata': '1359', 'Padina pavonia': '1360', 'Penaeidae': '1361', 'Petricolinae': '1362', 'Polynemidae': '1363', 'Pristipomoides aquilonaris': '1364', 'Pyropia fallax': '1365', 'Radiolaria': '1366', 'Salpidae': '1367', 'Sardinops melanosticta': '1368', 'Sargassum vulgare': '1369', 'Sciaena umbra': '1370', 'Scorpaena porcus': '1371', 'Sergestidae': '1372', 'Sicyonia brevirostris': '1373', 'Sphaerococcus coronopifolius': '1374', 'Stenella coeruleoalba': '1375', 'Stichopus japonicus': '1376', 'Thalia democratica': '1377', 'Themisto gaudichaudii': '1378', 'Undaria': '1379', 'Analipus japonicus': '1380', 'Sargassum yamadae': '1381', 'Ahnfeltiopsis paradoxa': 
'1382', 'Scytosiphon lomentaria': '1383', 'Chondria crassicaulis': '1384', 'Grateloupia lanceolata': '1385', 'Colpomenia sinuosa': '1386', 'Chondrus giganteus': '1387', 'Sargassum muticum': '1388', 'Ulva prolifera': '1389', 'Petalonia fascia': '1390', 'Balanus roseus': '1391', 'Chaetomorpha moniligera': '1392', 'Lomentaria hakodatensis': '1393', 'Neodilsea longissima': '1394', 'Polyopes affinis': '1395', 'Schizymenia dubyi': '1396', 'Dictyopteris pacifica': '1397', 'Ahnfeltiopsis flabelliformis': '1398', 'Bangia fuscopurpurea': '1399', 'Calliarthron': '1400', 'Cladophora': '1401', 'Cladophora albida': '1402', 'Dasya sessilis': '1403', 'Delesseria serrulata': '1404', 'Ecklonia cava': '1405', 'Gelidium elegans': '1406', 'Grateloupia turuturu': '1407', 'Hypnea asiatica': '1408', 'Mazzaella japonica': '1409', 'Pachydictyon coriaceum': '1410', 'Padina arborescens': '1411', 'Pterosiphonia pinnulata': '1412', 'Alatocladia yessoensis': '1413', 'Bryopsis plumosa': '1414', 'Ceramium kondoi': '1415', 'Chondracanthus intermedius': '1416', 'Codium contractum': '1417', 'Codium lucasii': '1418', 'Corallina pilulifera': '1419', 'Dictyopteris undulata': '1420', 'Gastroclonium pacificum': '1421', 'Gelidium amansii': '1422', 'Grateloupia sparsa': '1423', 'Laurencia okamurae': '1424', 'Leathesia marina': '1425', 'Lomentaria catenata': '1426', 'Meristotheca papulosa': '1427', 'Sargassum confusum': '1428', 'Sargassum siliquastrum': '1429', 'Tinocladia crassa': '1430', 'Saccharina yendoana': '1431', 'Thalassiophyllum clathrus': '1432', 'Mytilida': '1433', 'Pteriomorphia': '1434', 'Conger': '1435', 'Scyliorhinidae': '1436', 'Labrus': '1437', 'Algae': '1438', 'Necora puber': '1439', 'Anguilla': '1440', 'Rajidae': '1441', 'Buccinidae': '1442', 'Crustacea': '1443', 'Green algae': '1444', 'Ammodytes japonicus': '1445', 'Evynnis tumifrons': '1446', 'Gnathophis nystromi nystromi': '1447', 'Loligo bleekeri': '1448', 'Platichthys bicoloratus': '1449', 'Limanda punctatissima': '1450', 'Loliolus Nipponololigo japonica': '1451', 'Acanthopagrus schlegelii schlegelii': '1452', 'Sepiolina': '1453', 'Gelidium': '1454', 'Atrina pectinata': '1455', 'Echinocardium cordatum': '1456', 'Lamnidae': '1457', 'Meretrix lamarckii': '1458', 'Noctiluca scintillans': '1459', 'Philine argentata': '1460', 'Sergestes lucens': '1461', 'Corbicula sandai': '1462', 'Ulva': '1463', 'Actiniaria': '1464', 'Ctenopharyngodon idella': '1465', 'Ophiuroidea': '1466', 'Scomberoides lysan': '1467', 'Scomberoides tol': '1468', 'Sebastolobus': '1469', 'Selachimorpha': '1470', 'Selene setapinnis': '1471', 'Selene vomer': '1472', 'Sepia elliptica': '1473', 'Sergestes sp': '1474', 'Setipinna taty': '1475', 'Siganus canaliculatus': '1476', 'Sigmops gracile': '1477', 'Solenocera sp': '1478', 'Sparidae': '1479', 'Spermatophytina': '1480', 'Sphoeroides testudineus': '1481', 'Sphyraena jello': '1482', 'Spyridia hypnoides': '1483', 'Squaliformes': '1484', 'Squillidae': '1485', 'Stegophiura sladeni': '1486', 'Stenella longirostris': '1487', 'Stenobrachius leucopsarus': '1488', 'Sternaspidae': '1489', 'Stoechospermum polypodioides': '1490', 'Stolephorus commersonnii': '1491', 'Stromateus cinereus': '1492', 'Stromateus niger': '1493', 'Stromateus sinensis': '1494', 'Synidotea': '1495', 'Takifugu vermicularis': '1496', 'Telatrygon zugei': '1497', 'Terapon jarbua': '1498', 'Terebellidae': '1499', 'Thryssa dussumieri': '1500', 'Thunnini': '1501', 'Tibia curta': '1502', 'Tonna dolium': '1503', 'Trachinus draco': '1504', 'Trematomus bernacchii': '1505', 'Tridacna': '1506', 
'Trinectes paulistanus': '1507', 'Trochus radiatus': '1508', 'Turbinaria': '1509', 'Tursiops truncatus': '1510', 'Ucides': '1511', 'Ulva compressa': '1512', 'Ulva fasciata': '1513', 'Ulva flexuosa': '1514', 'Ulva rigida': '1515', 'Upeneus taeniopterus': '1516', 'Upogebiidae': '1517', 'Uroteuthis Photololigo edulis': '1518', 'Valoniopsis pachynema': '1519', 'Veneridae': '1520', 'Venus foveolata': '1521', 'Vertebrata': '1522', 'Volutharpa ampullacea perryi': '1523', 'Zannichellia palustris': '1524', 'Zeus japonicus': '1525', 'Favites': '1526', 'Gadiformes': '1527', 'Gafrarium dispar': '1528', 'Galaxaura frutescens': '1529', 'Gelidium crinale': '1530', 'Genidens genidens': '1531', 'Girella elevata': '1532', 'Girella tricuspidata': '1533', 'Dentex hypselosomus': '1534', 'Saurida elongata': '1535', 'Pseudolabrus eoethinus': '1536', 'Atrobucca nibe': '1537', 'Diagramma pictum': '1538', 'Sepia lycidas': '1539', 'Plectorhinchus cinctus': '1540', 'Metapenaeopsis acclivis': '1541', 'Metapenaeopsis barbata': '1542', 'Nibea albiflora': '1543', 'Girella leonina': '1544', 'Sphyraenidae': '1545', 'Parapercis pulchella': '1546', 'Parapercis sexfasciata': '1547', 'Thysanoteuthis rhombus': '1548', 'Lepidotrigla kishinouyi': '1549', 'Cystoseira': '1550', 'Padina': '1551', 'Halimeda': '1552', 'Pacifastacus leniusculus': '1553', 'Salmo trutta': '1554', 'Chondrus crispus': '1555', 'Ictalurus punctatus': '1556', 'Acanthurus': '1557', 'Scombridae': '1558', 'Leukoma staminea': '1559', 'Trochidae': '1560', 'Protonibea': '1562', 'Anchoa compressa': '1563', 'Ensis magnus': '1564', 'Bolinus brandaris': '1565', 'Lutjanus notatus': '1566', 'Lethrinus olivaceus': '1567', 'Carassius auratus': '1569', 'Mugil': '1570', 'Gobius': '1571', 'Lajonkairia lajonkairii': '1572', 'Chrysophrys auratus': '1573', 'Galeorhinus australis': '1574', 'Nototodarus sloanii gouldi': '1575', 'Tylosurus crocodilus': '1576', 'Acanthogobius hasta': '1577', 'Penaeus chinensis': '1578', 'Ruditapes variegatus': '1579', 'Marcia marmorata': '1580', 'Rachycentron': '1581', 'Scomber kanagurta': '1582', 'Arius': '1583', 'Panulirus versicolor': '1584', 'Tilapia zillii': '1585', 'Schizoporella errata': '1586', 'Phallusia nigra': '1587', 'Physeter catodon': '1588', 'Salmo trutta trutta': '1589', 'Tachysurus thalassinus': '1590', 'Sillago domina': '1591', 'Otolithus argenteus': '1592', 'Trichiurus haumela': '1593', 'Otolithes maculata': '1594', 'Hilsa kanagurta': '1595', 'Oreochromis mossambicus': '1596', 'Siluriformes': '1597', 'Theodoxus euxinus': '1598', 'Formio niger': '1599', 'Rastrelliger': '1600', 'Nephasoma flagriferum': '1601', 'Ophiomusium lymani': '1602', 'Nematonurus armatus': '1603', 'Thalamitoides spinigera': '1604', 'Capros aper': '1605', 'Gadiculus argenteus thori': '1606', 'Phorcus lineatus': '1607', 'Penaeus vannamei': '1608', 'Raja montagui': '1609', 'Scophthalmus rhombus': '1610', 'Crambe maritima': '1611', 'Fucus ceranoides': '1612', 'Maja squinado': '1613', 'Salicornia europaea': '1614', 'Aequipecten opercularis': '1615', 'Galathea squamifera': '1616', 'Cynoglossus semilaevis': '1617', 'Loliolus beka': '1619', 'Octopus variabilis': '1620', 'Abudefduf sexfasciatus': '1621', 'Acanthurus blochii': '1622', 'Achillea millefolium': '1623', 'Alaria crassifolia': '1624', 'Albulidae': '1625', 'Ammodytes': '1626', 'Anadara satowi': '1627', 'Argyrosomus japonicus': '1628', 'Ascidiacea': '1629', 'Aulopiformes': '1630', 'Babylonia japonica': '1631', 'Babylonia kirana': '1632', 'Bathylagidae': '1633', 'Beryx decadactylus': '1634', 'Branchiostegus': 
'1635', 'Buccinum': '1636', 'Caesio lunaris': '1637', 'Callionymus curvicornis': '1638', 'Campylaephora hypnaeoides': '1639', 'Cetoscarus ocellatus': '1640', 'Charonia tritonis': '1641', 'Chelon haematocheilus': '1642', 'Chlorurus sordidus': '1643', 'Choerodon azurio': '1644', 'Chromis notata': '1645', 'Cladosiphon okamuranus': '1646', 'Cociella punctata': '1647', 'Coryphaena': '1648', 'Cyclina sinensis': '1649', 'Cymbacephalus beauforti': '1650', 'Dendrobranchiata': '1651', 'Digenea simplex': '1652', 'Ditrema viride': '1653', 'Enteromorpha prolifera': '1654', 'Epinephelus': '1655', 'Epinephelus akaara': '1656', 'Epinephelus awoara': '1657', 'Etelis carbunculus': '1658', 'Fistularia commersonii': '1659', 'Fulvia mutica': '1660', 'Fusinus colus': '1661', 'Gafrarium tumidum': '1662', 'Gelidiaceae': '1663', 'Girella cyanea': '1664', 'Girella mezina': '1665', 'Goniistius zonatus': '1666', 'Gracilaria': '1667', 'Gymnocranius euanus': '1668', 'Heikeopsis japonica': '1669', 'Hemitrygon': '1670', 'Hippoglossoides pinetorum': '1671', 'Holothuria atra': '1672', 'Holothuria leucospilota': '1673', 'Idiosepiidae': '1674', 'Inegocia japonica': '1675', 'Inimicus didactylus': '1676', 'Ishige': '1677', 'Lagocephalus spadiceus': '1678', 'Lambis truncata': '1679', 'Leiognathus equula': '1680', 'Lethrinus xanthochilus': '1681', 'Lutjanus erythropterus': '1682', 'Lutjanus semicinctus': '1683', 'Monodonta labio': '1684', 'Monostroma kuroshiense': '1685', 'Mulloidichthys flavolineatus': '1686', 'Mulloidichthys vanicolensis': '1687', 'Muraenesocidae': '1688', 'Myagropsis myagroides': '1689', 'Mytilisepta virgata': '1690', 'Naso brevirostris': '1691', 'Nematalosa japonica': '1692', 'Nemipterus virgatus': '1693', 'Nipponacmea': '1694', 'Nuchequula nuchalis': '1695', 'Octopus cyanea': '1696', 'Panopea generosa': '1697', 'Paralichthys': '1698', 'Paralithodes camtschaticus': '1699', 'Parascolopsis inermis': '1700', 'Pectinidae': '1701', 'Pentapodus aureofasciatus': '1702', 'Pinctada fucata': '1703', 'Pitar citrinus': '1704', 'Platycephalidae': '1705', 'Plecoglossus altivelis': '1706', 'Pleuronectes herzensteini': '1707', 'Priacanthus macracanthus': '1708', 'Pristipomoides': '1709', 'Psenopsis anomala': '1710', 'Pseudobalistes fuscus': '1711', 'Pseudocaranx dentex': '1712', 'Pseudolabrus sieboldi': '1713', 'Pseudorhombus arsius': '1714', 'Pterocaesio chrysozona': '1715', 'Rhynchopelates oxyrhynchus': '1716', 'Ryukyupercis gushikeni': '1717', 'Saccostrea echinata': '1718', 'Sargassum hemiphyllum': '1719', 'Sargassum piluliferum': '1720', 'Saurida micropectoralis': '1721', 'Saurida undosquamis': '1722', 'Saurida wanieso': '1723', 'Scarus forsteni': '1724', 'Scarus ghobban': '1725', 'Scarus ovifrons': '1726', 'Scarus rubroviolaceus': '1727', 'Scyphozoa': '1728', 'Sebastes iracundus': '1729', 'Semicossyphus reticulatus': '1730', 'Sepia latimanus': '1731', 'Siganus guttatus': '1732', 'Siganus luridus': '1733', 'Sphaerotrichia divaricata': '1734', 'Sphyrnidae': '1735', 'Spondylus regius': '1736', 'Spratelloides gracilis': '1737', 'Sthenoteuthis oualaniensis': '1738', 'Tetraodontidae': '1739', 'Trichiurus lepturus japonicus': '1740', 'Tridacna crocea': '1741', 'Turbo argyrostomus': '1742', 'Tylosurus pacificus': '1743', 'Ulvophyceae': '1744', 'Upeneus japonicus': '1745', 'Upeneus moluccensis': '1746', 'Uranoscopus japonicus': '1747', 'Anguilliformes': '1748', 'Crithmum maritimum': '1749', 'Littorina': '1750', 'Nucella lapillus': '1752', 'Scyliorhinus stellaris': '1753', 'Annelida': '1754', 'Aphrodita aculeata': '1755', 
'Callionymus lyra': '1756', 'Urticina felina': '1757', 'Gebiidea': '1758', 'Bonellia viridis': '1759', 'Alcyonium glomeratum': '1760'}, 'body_part': {'Not applicable': '-1', 'Not available': '0', 'Whole animal': '1', 'Whole animal eviscerated': '2', 'Whole animal eviscerated without head': '3', 'Flesh with bones': '4', 'Blood': '5', 'Skeleton': '6', 'Bones': '7', 'Exoskeleton': '8', 'Endoskeleton': '9', 'Shells': '10', 'Molt': '11', 'Skin': '12', 'Head': '13', 'Tooth': '14', 'Otolith': '15', 'Fins': '16', 'Faecal pellet': '17', 'Byssus': '18', 'Soft parts': '19', 'Viscera': '20', 'Stomach': '21', 'Hepatopancreas': '22', 'Digestive gland': '23', 'Pyloric caeca': '24', 'Liver': '25', 'Intestine': '26', 'Kidney': '27', 'Spleen': '28', 'Brain': '29', 'Eye': '30', 'Fat': '31', 'Heart': '32', 'Branchial heart': '33', 'Muscle': '34', 'Mantle': '35', 'Gills': '36', 'Gonad': '37', 'Ovary': '38', 'Testes': '39', 'Whole plant': '40', 'Flower': '41', 'Leaf': '42', 'Old leaf': '43', 'Young leaf': '44', 'Leaf upper part': '45', 'Leaf lower part': '46', 'Scales': '47', 'Root rhizome': '48', 'Whole macro alga': '49', 'Phytoplankton': '50', 'Thallus': '51', 'Flesh without bones': '52', 'Stomach and intestine': '53', 'Whole haptophytic plants': '54', 'Loose drifting plants': '55', 'Growing tips': '56', 'Upper parts of plants': '57', 'Lower parts of plants': '58', 'Shells carapace': '59', 'Flesh with scales': '60'}}, 'SEAWATER': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': '132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 
'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}, 'filt': {'Not applicable': '-1', 'Not available': '0', 'Yes': '1', 'No': '2'}}, 'SEDIMENT': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': '132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}, 'sed_type': {'Not applicable': '-1', 'Not available': '0', 'Clay': '1', 'Gravel': '2', 'Marsh': '3', 'Mud': '4', 'Muddy sand': '5', 'Sand': '6', 'Fine sand': '7', 'Sandy mud': '8', 'Pebby sand': '9', 'Silt and clay': '10', 'Silt and gravel': 
'11', 'Silt': '12', 'Silty sand': '13', 'Sludge': '14', 'Turf': '15', 'Very coarse sand': '16', 'Coarse sand': '17', 'Medium sand': '18', 'Very fine sand': '19', 'Coarse silt': '20', 'Medium silt': '21', 'Fine silt': '22', 'Very fine silt': '23', 'Calcareous': '24', 'Glacial': '25', 'Soft': '26', 'Sulphidic': '27', 'Fe Mg concretions': '28', 'Sand and gravel': '29', 'Pure sand': '30', 'Sand and fine sand': '31', 'Sand and clay': '32', 'Sand and mud': '33', 'Fine sand and gravel': '34', 'Fine sand and sand': '35', 'Pure fine sand': '36', 'Fine sand and silt': '37', 'Fine sand and clay': '38', 'Fine sand and mud': '39', 'Silt and sand': '40', 'Silt and fine sand': '41', 'Pure silt': '42', 'Silt and mud': '43', 'Clay and gravel': '44', 'Clay and sand': '45', 'Clay and fine sand': '46', 'Pure clay': '47', 'Clay and silt': '48', 'Clay and mud': '49', 'Glacial clay': '50', 'Soft clay': '51', 'Sulphidic clay': '52', 'Clay and Fe Mg concretions': '53', 'Mud and gravel': '54', 'Mud and sand': '55', 'Mud and fine sand': '56', 'Mud and clay': '57', 'Pure mud': '58', 'Soft mud': '59', 'Sulphidic mud': '60', 'Mud and Fe Mg concretions': '61', 'Sand and silt': '62'}}}
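
These lookup tables map human-readable labels to the integer codes stored in the corresponding NetCDF variables. As a minimal sketch (assuming the nested dictionary printed above is bound to a name such as `lut` and that its top-level keys follow the sample-type names; both are illustrative assumptions, not part of the handler), any of these tables can be inverted to decode integer codes back into labels:

# Illustrative sketch only: `lut` stands for the nested lookup dictionary shown above.
body_part_lut = lut['BIOTA']['body_part']                        # label -> code (stored as string)
code_to_body_part = {int(code): label for label, code in body_part_lut.items()}
code_to_body_part[52]                                            # -> 'Flesh without bones'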

Let's review the data of the NetCDF file:

dfs = contents.dfs
dfs
{'BIOTA':              LON        LAT  SMP_DEPTH        TIME  NUCLIDE       VALUE  UNIT  \
 0      12.316667  54.283333        NaN  1348358400       31    0.010140     5   
 1      12.316667  54.283333        NaN  1348358400        4  135.300003     5   
 2      12.316667  54.283333        NaN  1348358400        9    0.013980     5   
 3      12.316667  54.283333        NaN  1348358400       33    4.338000     5   
 4      12.316667  54.283333        NaN  1348358400       31    0.009614     5   
 ...          ...        ...        ...         ...      ...         ...   ...   
 16089  21.395000  61.241501        2.0  1652140800       33   13.700000     4   
 16090  21.395000  61.241501        2.0  1652140800        9    0.500000     4   
 16091  21.385000  61.343334        NaN  1663200000        4   50.700001     4   
 16092  21.385000  61.343334        NaN  1663200000       33    0.880000     4   
 16093  21.385000  61.343334        NaN  1663200000       12    6.600000     4   
 
             UNC  DL  BIO_GROUP  SPECIES  BODY_PART       DRYWT  WETWT  \
 0           NaN   2          4       99         52  174.934433  948.0   
 1      4.830210   1          4       99         52  174.934433  948.0   
 2           NaN   2          4       99         52  174.934433  948.0   
 3      0.150962   1          4       99         52  174.934433  948.0   
 4           NaN   2          4       99         52  177.935120  964.0   
 ...         ...  ..        ...      ...        ...         ...    ...   
 16089  0.520600   1         11       96         55         NaN    NaN   
 16090  0.045500   1         11       96         55         NaN    NaN   
 16091  4.106700   1         14      129          1         NaN    NaN   
 16092  0.140800   1         14      129          1         NaN    NaN   
 16093  0.349800   1         14      129          1         NaN    NaN   
 
        PERCENTWT  
 0        0.18453  
 1        0.18453  
 2        0.18453  
 3        0.18453  
 4        0.18458  
 ...          ...  
 16089        NaN  
 16090        NaN  
 16091        NaN  
 16092        NaN  
 16093        NaN  
 
 [16094 rows x 15 columns],
 'SEAWATER':              LON        LAT  SMP_DEPTH  TOT_DEPTH        TIME  NUCLIDE  \
 0      29.333300  60.083302        0.0        NaN  1337731200       33   
 1      29.333300  60.083302       29.0        NaN  1337731200       33   
 2      23.150000  59.433300        0.0        NaN  1339891200       33   
 3      27.983299  60.250000        0.0        NaN  1337817600       33   
 4      27.983299  60.250000       39.0        NaN  1337817600       33   
 ...          ...        ...        ...        ...         ...      ...   
 21468  13.499833  54.600334        0.0       47.0  1686441600        1   
 21469  13.499833  54.600334       45.0       47.0  1686441600        1   
 21470  14.200833  54.600334        0.0       11.0  1686614400        1   
 21471  14.665500  54.600334        0.0       20.0  1686614400        1   
 21472  14.330000  54.600334        0.0       17.0  1686614400        1   
 
             VALUE  UNIT        UNC  DL  FILT  
 0        5.300000     1   1.696000   1     0  
 1       19.900000     1   3.980000   1     0  
 2       25.500000     1   5.100000   1     0  
 3       17.000000     1   4.930000   1     0  
 4       22.200001     1   3.996000   1     0  
 ...           ...   ...        ...  ..   ...  
 21468  702.838074     1  51.276207   1     0  
 21469  725.855713     1  52.686260   1     0  
 21470  648.992920     1  48.154419   1     0  
 21471  627.178406     1  46.245316   1     0  
 21472  605.715088     1  45.691143   1     0  
 
 [21473 rows x 11 columns],
 'SEDIMENT':              LON        LAT  TOT_DEPTH        TIME  NUCLIDE        VALUE  \
 0      27.799999  60.466667       25.0  1337904000       33  1200.000000   
 1      27.799999  60.466667       25.0  1337904000       33   250.000000   
 2      27.799999  60.466667       25.0  1337904000       33   140.000000   
 3      27.799999  60.466667       25.0  1337904000       33    79.000000   
 4      27.799999  60.466667       25.0  1337904000       33    29.000000   
 ...          ...        ...        ...         ...      ...          ...   
 70444  15.537800  54.617832       62.0  1654646400       67     0.044000   
 70445  15.537800  54.617832       62.0  1654646400       77     2.500000   
 70446  15.537800  54.617832       62.0  1654646400        4  5873.000000   
 70447  15.537800  54.617832       62.0  1654646400       33    21.200001   
 70448  15.537800  54.617832       62.0  1654646400       77     0.370000   
 
        UNIT         UNC  DL  SED_TYPE   TOP  BOTTOM  PERCENTWT  
 0         4  240.000000   1         0  15.0    20.0        NaN  
 1         4   50.000000   1         0  20.0    25.0        NaN  
 2         4   29.400000   1         0  25.0    30.0        NaN  
 3         4   15.800000   1         0  30.0    35.0        NaN  
 4         4    6.960000   1         0  35.0    40.0        NaN  
 ...     ...         ...  ..       ...   ...     ...        ...  
 70444     4    0.015312   1        10  15.0    17.0   0.257642  
 70445     4    0.185000   1        10  15.0    17.0   0.257642  
 70446     4  164.444000   1        10  17.0    19.0   0.263965  
 70447     4    2.162400   1        10  17.0    19.0   0.263965  
 70448     4    0.048100   1        10  17.0    19.0   0.263965  
 
 [70449 rows x 13 columns]}
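
As a complementary check (a minimal sketch, assuming the netCDF4 package is available in the environment), the sample-type groups written to the output file can be listed directly:

import netCDF4

with netCDF4.Dataset(fname_out_nc) as nc:
    print(list(nc.groups))   # groups present in the generated NetCDF file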

Let's review the biota data:

nc_dfs_biota = dfs['BIOTA']
nc_dfs_biota
LON LAT SMP_DEPTH TIME NUCLIDE VALUE UNIT UNC DL BIO_GROUP SPECIES BODY_PART DRYWT WETWT PERCENTWT
0 12.316667 54.283333 NaN 1348358400 31 0.010140 5 NaN 2 4 99 52 174.934433 948.0 0.18453
1 12.316667 54.283333 NaN 1348358400 4 135.300003 5 4.830210 1 4 99 52 174.934433 948.0 0.18453
2 12.316667 54.283333 NaN 1348358400 9 0.013980 5 NaN 2 4 99 52 174.934433 948.0 0.18453
3 12.316667 54.283333 NaN 1348358400 33 4.338000 5 0.150962 1 4 99 52 174.934433 948.0 0.18453
4 12.316667 54.283333 NaN 1348358400 31 0.009614 5 NaN 2 4 99 52 177.935120 964.0 0.18458
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16089 21.395000 61.241501 2.0 1652140800 33 13.700000 4 0.520600 1 11 96 55 NaN NaN NaN
16090 21.395000 61.241501 2.0 1652140800 9 0.500000 4 0.045500 1 11 96 55 NaN NaN NaN
16091 21.385000 61.343334 NaN 1663200000 4 50.700001 4 4.106700 1 14 129 1 NaN NaN NaN
16092 21.385000 61.343334 NaN 1663200000 33 0.880000 4 0.140800 1 14 129 1 NaN NaN NaN
16093 21.385000 61.343334 NaN 1663200000 12 6.600000 4 0.349800 1 14 129 1 NaN NaN NaN

16094 rows × 15 columns
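
The TIME column is stored as an integer timestamp. As an illustrative sketch (assuming the MARIS convention of seconds since 1970-01-01, which is not restated here), it can be converted to readable dates with pandas:

import pandas as pd

pd.to_datetime(nc_dfs_biota['TIME'], unit='s').head()   # assumes Unix-epoch seconds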

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            SplitSedimentValuesCB(coi_sediment),
                            SanitizeValueCB(coi_val),
                            NormalizeUncCB(),                  
                            RemapUnitCB(),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats), '\n')

for grp in ['BIOTA', 'SEDIMENT', 'SEAWATER']:
    print(f'Unique DL values for {grp}: {tfm.dfs[grp]["DL"].unique()}')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
                           BIOTA  SEAWATER  SEDIMENT
Number of rows in dfs      16124     21634     40744
Number of rows in tfm.dfs  16094     21481     70451
Number of rows removed        30       153       144 

Unique DL values for BIOTA: [2 1 0]
Unique DL values for SEDIMENT: [1 2 0]
Unique DL values for SEAWATER: [1 2 0]
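
The DL codes above follow the detection-limit lookup printed earlier in this notebook. A small illustrative snippet to translate them back into labels:

# Labels copied from the 'dl' lookup shown earlier.
dl_labels = {-1: 'Not applicable', 0: 'Not available', 1: 'Detected value',
             2: 'Detection limit', 3: 'Not detected', 4: 'Derived'}
{grp: [dl_labels[int(c)] for c in tfm.dfs[grp]['DL'].unique()]
 for grp in ['BIOTA', 'SEDIMENT', 'SEAWATER']}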

Let's review the raw HELCOM sediment data:

nc_dfs_sediment = dfs['SEDIMENT']
nc_dfs_sediment
key nuclide method < value_bq/kg value_bq/kg error%_kg < value_bq/m² value_bq/m² error%_m² date_of_entry_x ... lowsli area sedi oxic dw% loi% mors_subbasin helcom_subbasin sum_link date_of_entry_y
0 SKRIL2012116 CS137 NaN NaN 1200.000 20.0 NaN NaN NaN 08/20/14 00:00:00 ... 20.0 0.00600 NaN NaN NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00
1 SKRIL2012117 CS137 NaN NaN 250.000 20.0 NaN NaN NaN 08/20/14 00:00:00 ... 25.0 0.00600 NaN NaN NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00
2 SKRIL2012118 CS137 NaN NaN 140.000 21.0 NaN NaN NaN 08/20/14 00:00:00 ... 30.0 0.00600 NaN NaN NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00
3 SKRIL2012119 CS137 NaN NaN 79.000 20.0 NaN NaN NaN 08/20/14 00:00:00 ... 35.0 0.00600 NaN NaN NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00
4 SKRIL2012120 CS137 NaN NaN 29.000 24.0 NaN NaN NaN 08/20/14 00:00:00 ... 40.0 0.00600 NaN NaN NaN NaN 11.0 11.0 NaN 08/20/14 00:00:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
40739 SCLOR2022071 PU238 CLOR08 NaN 0.007 32.9 NaN 0.044 34.8 05/03/24 00:00:00 ... 17.0 0.01178 34.0 A 25.764192 NaN 6.0 6.0 NaN 05/03/24 00:00:00
40740 SCLOR2022071 PU239240 CLOR08 NaN 0.420 4.6 NaN 2.500 7.4 05/03/24 00:00:00 ... 17.0 0.01178 34.0 A 25.764192 NaN 6.0 6.0 NaN 05/03/24 00:00:00
40741 SCLOR2022072 K40 CLOR01 NaN 956.000 1.3 NaN 5873.000 2.8 05/03/24 00:00:00 ... 19.0 0.01178 34.0 A 26.396495 NaN 6.0 6.0 NaN 05/03/24 00:00:00
40742 SCLOR2022072 CS137 CLOR01 NaN 3.460 9.9 NaN 21.200 10.2 05/03/24 00:00:00 ... 19.0 0.01178 34.0 A 26.396495 NaN 6.0 6.0 NaN 05/03/24 00:00:00
40743 SCLOR2022072 PU239240 CLOR08 NaN 0.060 10.4 NaN 0.370 13.0 05/03/24 00:00:00 ... 19.0 0.01178 34.0 A 26.396495 NaN 6.0 6.0 NaN 05/03/24 00:00:00

40744 rows × 35 columns
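
Note that each raw sediment record can report an activity both per kg and per m² (the value_bq/kg and value_bq/m² columns above); the SplitSedimentValuesCB callback used in the pipeline above splits these into separate rows, which is why tfm.dfs['SEDIMENT'] holds more rows than the raw table. A quick, illustrative count of how many values each column carries (column names taken as displayed above):

nc_dfs_sediment[['value_bq/kg', 'value_bq/m²']].notna().sum()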

Let's review the raw HELCOM seawater data:

nc_dfs_seawater = dfs['SEAWATER']
nc_dfs_seawater
key nuclide method < value_bq/m³ value_bq/m³ error%_m³ date_of_entry_x country laboratory sequence ... longitude (ddmmmm) longitude (dddddd) tdepth sdepth salin ttemp filt mors_subbasin helcom_subbasin date_of_entry_y
0 WKRIL2012003 CS137 NaN NaN 5.300000 32.000000 08/20/14 00:00:00 90.0 KRIL 2012003.0 ... 29.2000 29.333300 NaN 0.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00
1 WKRIL2012004 CS137 NaN NaN 19.900000 20.000000 08/20/14 00:00:00 90.0 KRIL 2012004.0 ... 29.2000 29.333300 NaN 29.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00
2 WKRIL2012005 CS137 NaN NaN 25.500000 20.000000 08/20/14 00:00:00 90.0 KRIL 2012005.0 ... 23.0900 23.150000 NaN 0.0 NaN NaN NaN 11.0 3.0 08/20/14 00:00:00
3 WKRIL2012006 CS137 NaN NaN 17.000000 29.000000 08/20/14 00:00:00 90.0 KRIL 2012006.0 ... 27.5900 27.983300 NaN 0.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00
4 WKRIL2012007 CS137 NaN NaN 22.200000 18.000000 08/20/14 00:00:00 90.0 KRIL 2012007.0 ... 27.5900 27.983300 NaN 39.0 NaN NaN NaN 11.0 11.0 08/20/14 00:00:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21629 WDHIG2023112 H3 DHIG04 NaN 702.838089 7.295593 05/03/24 00:00:00 6.0 DHIG 2023112.0 ... 13.2999 13.499833 47.0 0.0 7.89 NaN NaN 2.0 2.0 05/03/24 00:00:00
21630 WDHIG2023113 H3 DHIG04 NaN 725.855727 7.258503 05/03/24 00:00:00 6.0 DHIG 2023113.0 ... 13.2999 13.499833 47.0 45.0 14.80 NaN NaN 2.0 2.0 05/03/24 00:00:00
21631 WDHIG2023143 H3 DHIG04 NaN 648.992944 7.419868 05/03/24 00:00:00 6.0 DHIG 2023143.0 ... 14.1205 14.200833 11.0 0.0 5.70 NaN NaN 2.0 6.0 05/03/24 00:00:00
21632 WDHIG2023145 H3 DHIG04 NaN 627.178435 7.373550 05/03/24 00:00:00 6.0 DHIG 2023145.0 ... 14.3993 14.665500 20.0 0.0 7.76 NaN NaN 2.0 6.0 05/03/24 00:00:00
21633 WDHIG2023147 H3 DHIG04 NaN 605.715107 7.543339 05/03/24 00:00:00 6.0 DHIG 2023147.0 ... 14.1980 14.330000 17.0 0.0 7.67 NaN NaN 2.0 2.0 05/03/24 00:00:00

21634 rows × 27 columns

Data Format Conversion

The MARIS data processing workflow involves two key steps:

  1. NetCDF to standardized CSV compatible with the OpenRefine pipeline
    • Convert standardized NetCDF files to CSV formats compatible with OpenRefine using the NetCDFDecoder.
    • Preserve data integrity and variable relationships.
    • Maintain standardized nomenclature and units.
  2. Database Integration
    • Process the converted CSV files using OpenRefine.
    • Apply data cleaning and standardization rules.
    • Export validated data to the MARIS master database.

This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the NetCDFDecoder class.

decode(fname_in=fname_out_nc, verbose=True)
Saved BIOTA to ../../_data/output/100-HELCOM-MORS-2024_BIOTA.csv
Saved SEAWATER to ../../_data/output/100-HELCOM-MORS-2024_SEAWATER.csv
Saved SEDIMENT to ../../_data/output/100-HELCOM-MORS-2024_SEDIMENT.csv
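
As a final sanity check (an illustrative step, not part of the handler), the decoded CSV files can be reloaded with pandas, for example the BIOTA file reported above:

import pandas as pd

pd.read_csv('../../_data/output/100-HELCOM-MORS-2024_BIOTA.csv').head()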