This data pipeline, known as a “handler” in Marisco terminology, is designed to clean, standardize, and encode OSPAR data into NetCDF format. The handler processes raw OSPAR data, applying various transformations and lookups to align it with MARIS data standards.

Key functions of this handler:

This handler is a crucial component in the Marisco data processing workflow, ensuring OSPAR data is properly integrated into the MARIS database.

Tip

For new MARIS users, please refer to Understanding MARIS Data Formats (NetCDF and Open Refine) for detailed information.

The present notebook aims to be an instance of Literate Programming: a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case marisco/handlers/ospar.py), the code snippet is added to the module using #| export, as provided by the wonderful nbdev library.

from IPython.display import display, Markdown

Configuration and File Paths

The handler requires several configuration parameters:

  1. src_dir: Path to the maris-crawlers folder containing the OSPAR data in CSV format;
  2. fname_out_nc: Output path and filename for the NetCDF file (relative paths supported);
  3. zotero_key: Key for retrieving dataset attributes from Zotero.

Tip

FEEDBACK FOR NEXT VERSION: Update src_dir to use Franck’s repository.

Exported source
src_dir = 'https://raw.githubusercontent.com/niallmurphy93/maris-crawlers/refs/heads/main/data/processed/OSPAR'
fname_out_nc = '../../_data/output/191-OSPAR-2024.nc'
zotero_key ='LQRA4MMK' # OSPAR MORS zotero key

Load data

OSPAR data is provided as a zipped Microsoft Access database. To facilitate easier access and integration, we process this dataset and convert it into .csv files. These processed files are then made available in the maris-crawlers repository on GitHub. Once converted, the dataset is in a format that is readily compatible with the marisco data pipeline, ensuring seamless data handling and analysis.


source

read_csv

 read_csv (file_name,
           dir='https://raw.githubusercontent.com/niallmurphy93/maris-
           crawlers/refs/heads/main/data/processed/OSPAR')
Exported source
default_smp_types = {  
    'Biota': 'BIOTA', 
    'Seawater': 'SEAWATER', 
}
Exported source
def read_csv(file_name, dir=src_dir):
    file_path = f'{dir}/{file_name}'
    return pd.read_csv(file_path)

source

load_data

 load_data (src_url:str, smp_types:dict={'Biota': 'BIOTA', 'Seawater':
            'SEAWATER'}, use_cache:bool=False, save_to_cache:bool=False,
            verbose:bool=False)

Load OSPAR data and return the data in a dictionary of dataframes with the dictionary key as the sample type.

Exported source
def load_data(src_url: str, 
              smp_types: dict = default_smp_types, 
              use_cache: bool = False,
              save_to_cache: bool = False,
              verbose: bool = False) -> Dict[str, pd.DataFrame]:
    "Load OSPAR data and return the data in a dictionary of dataframes with the dictionary key as the sample type."
    
    def safe_file_path(url: str) -> str:
        """Safely encode spaces in a URL."""
        return url.replace(" ", "%20")

    def get_file_path(dir_path: str, file_prefix: str) -> str:
        """Construct the full file path based on directory and file prefix."""
        file_path = f"{dir_path}/{file_prefix} data.csv"
        return safe_file_path(file_path) if not use_cache else file_path

    def load_and_process_csv(file_path: str) -> pd.DataFrame:
        """Load a CSV file and process it."""
        if use_cache and not Path(file_path).exists():
            if verbose:
                print(f"{file_path} not found in cache.")
            return pd.DataFrame()

        if verbose:
            start_time = time.time()

        try:
            df = pd.read_csv(file_path)
            df.columns = df.columns.str.lower()
            if verbose:
                print(f"Data loaded from {file_path} in {time.time() - start_time:.2f} seconds.")
            return df
        except Exception as e:
            if verbose:
                print(f"Failed to load {file_path}: {e}")
            return pd.DataFrame()

    def save_to_cache_dir(df: pd.DataFrame, file_prefix: str):
        """Save the DataFrame to the cache directory."""
        cache_dir = cache_path()
        cache_file_path = f"{cache_dir}/{file_prefix} data.csv"
        df.to_csv(cache_file_path, index=False)
        if verbose:
            print(f"Data saved to cache at {cache_file_path}")

    data = {}
    for file_prefix, smp_type in smp_types.items():
        dir_path = cache_path() if use_cache else src_url
        file_path = get_file_path(dir_path, file_prefix)
        df = load_and_process_csv(file_path)

        if save_to_cache and not df.empty:
            save_to_cache_dir(df, file_prefix)

        data[smp_type] = df

    return data
load_data(src_dir, save_to_cache=True, verbose=True)
Data loaded from https://raw.githubusercontent.com/niallmurphy93/maris-crawlers/refs/heads/main/data/processed/OSPAR/Biota%20data.csv in 0.43 seconds.
Data saved to cache at /home/niallmurphy93/.marisco/cache/Biota data.csv
Data loaded from https://raw.githubusercontent.com/niallmurphy93/maris-crawlers/refs/heads/main/data/processed/OSPAR/Seawater%20data.csv in 0.39 seconds.
Data saved to cache at /home/niallmurphy93/.marisco/cache/Seawater data.csv
{'BIOTA':           id contracting party  rsc sub-division             station id  \
 0          1           Belgium                 8  Kloosterzande-Schelde   
 1          2           Belgium                 8  Kloosterzande-Schelde   
 2          3           Belgium                 8  Kloosterzande-Schelde   
 3          4           Belgium                 8  Kloosterzande-Schelde   
 4          5           Belgium                 8  Kloosterzande-Schelde   
 ...      ...               ...               ...                    ...   
 15946  98058            Sweden                12         Ringhals (R22)   
 15947  98059            Sweden                12         Ringhals (R23)   
 15948  98060            Sweden                11                    SW7   
 15949  98061            Sweden                11                   SW6a   
 15950  98062            Sweden                12         Ringhals (R25)   
 
       sample id  latd  latm  lats latdir  longd  ...      sampling date  \
 0      DA 17531    51  23.0  36.0      N      4  ...  03/03/10 00:00:00   
 1      DA 17534    51  23.0  36.0      N      4  ...  06/14/10 00:00:00   
 2      DA 17537    51  23.0  36.0      N      4  ...  09/27/10 00:00:00   
 3      DA 17540    51  23.0  36.0      N      4  ...  12/08/10 00:00:00   
 4      DA 17531    51  23.0  36.0      N      4  ...  03/03/10 00:00:00   
 ...         ...   ...   ...   ...    ...    ...  ...                ...   
 15946       NaN    57  15.0   9.0      N     12  ...  08/09/22 00:00:00   
 15947       NaN    57  18.0  23.0      N     12  ...  09/23/22 00:00:00   
 15948       NaN    58  36.0  12.0      N     11  ...  11/07/22 00:00:00   
 15949       NaN    57  18.0   9.0      N     11  ...  09/20/22 00:00:00   
 15950       NaN    57  20.0   7.0      N     12  ...  09/02/22 00:00:00   
 
        nuclide value type activity or mda uncertainty        unit  \
 0        137Cs          <        0.326416         NaN  Bq/kg f.w.   
 1        137Cs          <        0.442704         NaN  Bq/kg f.w.   
 2        137Cs          <        0.412989         NaN  Bq/kg f.w.   
 3        137Cs          <        0.202768         NaN  Bq/kg f.w.   
 4        226Ra          <        0.652833         NaN  Bq/kg f.w.   
 ...        ...        ...             ...         ...         ...   
 15946    137Cs          =        0.384000    0.024192  Bq/kg f.w.   
 15947    137Cs          =        0.456000    0.024168  Bq/kg f.w.   
 15948    137Cs          =        0.122000    0.062000  Bq/kg f.w.   
 15949    137Cs          <        0.310000         NaN  Bq/kg f.w.   
 15950    137Cs          =        0.306000    0.014382  Bq/kg f.w.   
 
                             data provider measurement comment  \
 0                                 SCK•CEN                 NaN   
 1                                 SCK•CEN                 NaN   
 2                                 SCK•CEN                 NaN   
 3                                 SCK•CEN                 NaN   
 4                                 SCK•CEN                 NaN   
 ...                                   ...                 ...   
 15946  Swedish Radiation Safety Authority                 NaN   
 15947  Swedish Radiation Safety Authority                 NaN   
 15948  Swedish Radiation Safety Authority                 NaN   
 15949  Swedish Radiation Safety Authority                 NaN   
 15950  Swedish Radiation Safety Authority                 NaN   
 
                  sample comment reference comment  
 0                           NaN               NaN  
 1                           NaN               NaN  
 2                           NaN               NaN  
 3                           NaN               NaN  
 4                           NaN               NaN  
 ...                         ...               ...  
 15946  converted from  dw to fw               NaN  
 15947  converted from  dw to fw               NaN  
 15948                       NaN               NaN  
 15949                       NaN               NaN  
 15950  converted from  dw to fw               NaN  
 
 [15951 rows x 27 columns],
 'SEAWATER':            id contracting party  rsc sub-division   station id sample id  \
 0           1           Belgium               8.0  Belgica-W01    WNZ 01   
 1           2           Belgium               8.0  Belgica-W02    WNZ 02   
 2           3           Belgium               8.0  Belgica-W03    WNZ 03   
 3           4           Belgium               8.0  Belgica-W04    WNZ 04   
 4           5           Belgium               8.0  Belgica-W05    WNZ 05   
 ...       ...               ...               ...          ...       ...   
 19188  120364           Ireland               4.0           N2       NaN   
 19189  120365           Ireland               4.0           N3       NaN   
 19190  120366           Ireland               4.0           N8       NaN   
 19191  120367           Ireland               4.0           N9       NaN   
 19192  120368           Ireland               4.0          N10       NaN   
 
        latd  latm  lats latdir  longd  ...      sampling date  nuclide  \
 0        51  22.0  31.0      N      3  ...  01/27/10 00:00:00    137Cs   
 1        51  13.0  25.0      N      2  ...  01/27/10 00:00:00    137Cs   
 2        51  11.0   4.0      N      2  ...  01/27/10 00:00:00    137Cs   
 3        51  25.0  13.0      N      3  ...  01/27/10 00:00:00    137Cs   
 4        51  24.0  58.0      N      2  ...  01/26/10 00:00:00    137Cs   
 ...     ...   ...   ...    ...    ...  ...                ...      ...   
 19188    53  36.0   0.0      N      5  ...                NaN      NaN   
 19189    53  44.0   0.0      N      5  ...                NaN      NaN   
 19190    53  39.0   0.0      N      5  ...                NaN      NaN   
 19191    53  53.0   0.0      N      5  ...                NaN      NaN   
 19192    53  52.0   0.0      N      5  ...                NaN      NaN   
 
       value type activity or mda  uncertainty  unit data provider  \
 0              <            0.20          NaN  Bq/l       SCK•CEN   
 1              <            0.27          NaN  Bq/l       SCK•CEN   
 2              <            0.26          NaN  Bq/l       SCK•CEN   
 3              <            0.25          NaN  Bq/l       SCK•CEN   
 4              <            0.20          NaN  Bq/l       SCK•CEN   
 ...          ...             ...          ...   ...           ...   
 19188        NaN             NaN          NaN   NaN           NaN   
 19189        NaN             NaN          NaN   NaN           NaN   
 19190        NaN             NaN          NaN   NaN           NaN   
 19191        NaN             NaN          NaN   NaN           NaN   
 19192        NaN             NaN          NaN   NaN           NaN   
 
       measurement comment                                     sample comment  \
 0                     NaN                                                NaN   
 1                     NaN                                                NaN   
 2                     NaN                                                NaN   
 3                     NaN                                                NaN   
 4                     NaN                                                NaN   
 ...                   ...                                                ...   
 19188           2021 data  The Irish Navy attempted a few times to collec...   
 19189           2021 data  The Irish Navy attempted a few times to collec...   
 19190           2021 data  The Irish Navy attempted a few times to collec...   
 19191           2021 data  The Irish Navy attempted a few times to collec...   
 19192           2021 data  The Irish Navy attempted a few times to collec...   
 
        reference comment  
 0                    NaN  
 1                    NaN  
 2                    NaN  
 3                    NaN  
 4                    NaN  
 ...                  ...  
 19188                NaN  
 19189                NaN  
 19190                NaN  
 19191                NaN  
 19192                NaN  
 
 [19193 rows x 25 columns]}

Nuclide Name Normalization

Tip

FEEDBACK FOR NEXT VERSION: In the lookup at nuc_lut_path, do we need nc_name? We used nc_name when we were pivoting the table from long to wide format. Should we remove it?

We standardize the nuclide names in the OSPAR dataset to align with the standardized names provided in the MARISCO lookup table. The lookup process uses three key columns:

  • nuclide_id: A unique identifier for each nuclide.
  • nuclide: The standardized name of the nuclide as per our conventions.
  • nc_name: The corresponding name used in NetCDF files.

Below, we examine the structure and contents of the lookup table:

nuc_lut_df = pd.read_excel(nuc_lut_path())
nuc_lut_df.head()
nuclide_id nuclide atomicnb massnb nusymbol half_life hl_unit nc_name
0 -1 NOT APPLICABLE NaN NaN NaN NaN NaN NOT APPLICABLE
1 0 NOT AVAILABLE 0.0 0.0 0 0.00 - NOT AVAILABLE
2 1 TRITIUM 1.0 3.0 3H 12.35 Y h3
3 2 BERYLLIUM 4.0 7.0 7Be 53.30 D be7
4 3 CARBON 6.0 14.0 14C 5730.00 Y c14

OSPAR reports the nuclide measured in the nuclide column. However, as shown below, the nuclide names are not standardized.

dfs = load_data(src_dir, use_cache=True, verbose=True)
df = get_unique_across_dfs(dfs, 'nuclide', as_df=True)
df.T
Data loaded from /home/niallmurphy93/.marisco/cache/Biota data.csv in 0.05 seconds.
Data loaded from /home/niallmurphy93/.marisco/cache/Seawater data.csv in 0.04 seconds.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
value 210Po 239,240Pu 228Ra Cs-137 3H 99Tc 226Ra 239, 240 Pu NaN CS-137 137Cs 137Cs 210Po 99Tc 238Pu 210Pb 241Am 99Tc

Lower & strip nuclide names

To streamline the process of standardizing nuclide data, we employ the LowerStripNameCB callback. This function is applied to each DataFrame within our dictionary of DataFrames. Specifically, LowerStripNameCB simplifies the nuclide names by converting them to lowercase and removing any leading or trailing whitespace.
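As a rough illustration of the lowercase-and-strip step, here is a minimal pandas sketch. The helper name lower_strip and the sample values are hypothetical and are not part of marisco; the actual LowerStripNameCB implementation may differ.

```python
import pandas as pd

def lower_strip(df: pd.DataFrame, col_src: str, col_dst: str) -> pd.DataFrame:
    """Return a copy where col_dst holds col_src lowercased and stripped.

    Hypothetical helper for illustration only (not the marisco callback).
    """
    out = df.copy()
    # .str methods leave missing values as missing
    out[col_dst] = out[col_src].str.strip().str.lower()
    return out

# Illustrative values only
raw = pd.DataFrame({'nuclide': ['Cs-137', ' 137Cs ', 'CS-137', None]})
clean = lower_strip(raw, col_src='nuclide', col_dst='nuclide')
print(clean['nuclide'].tolist())
```

Only leading and trailing whitespace is removed, so a name with interior spaces such as '239, 240 pu' still differs from '239,240pu' after this step and has to be reconciled during remapping.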

dfs = load_data(src_dir, use_cache=True, verbose=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='nuclide')])
dfs_output=tfm()
for key, df in dfs_output.items():
    print(f'{key} nuclides: ')
    print(df['nuclide'].unique())
Data loaded from /home/niallmurphy93/.marisco/cache/Biota data.csv in 0.04 seconds.
Data loaded from /home/niallmurphy93/.marisco/cache/Seawater data.csv in 0.04 seconds.
BIOTA nuclides: 
['137cs' '226ra' '228ra' '239,240pu' '99tc' '210po' '210pb' '3h' 'cs-137'
 '238pu' '239, 240 pu' '241am']
SEAWATER nuclides: 
['137cs' '239,240pu' '226ra' '228ra' '99tc' '3h' '210po' '210pb' nan]

Remap nuclide names to MARIS data formats

Tip

FEEDBACK TO DATA PROVIDER: The nuclide column has inconsistent naming, for example:

  • Cs-137, 137Cs or CS-137
  • 239, 240 pu or 239,240 pu
  • ra-226 and 226ra

See below:

get_unique_across_dfs(dfs, col_name='nuclide', as_df=True).T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
value 210Po 239,240Pu 228Ra Cs-137 3H 99Tc 226Ra 239, 240 Pu NaN CS-137 137Cs 137Cs 210Po 99Tc 238Pu 210Pb 241Am 99Tc

Next, we map nuclide names used by OSPAR to the MARIS standard nuclide names.

Remapping data provider nomenclatures to MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:

  1. Inspect data provider nomenclature;
  2. Match automatically against MARIS nomenclature (using a fuzzy matching algorithm);
  3. Fix potential mismatches;
  4. Apply the lookup table to the dataframe.

We will refer to this process as IMFA (Inspect, Match, Fix, Apply).
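The Match step can be pictured with a small edit-distance sketch. This is an assumption: the page does not state which metric Remapper uses, although the match_score values reported later in this section are consistent with a Levenshtein-style distance where 0 means an exact match. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(name: str, candidates: list) -> tuple:
    """Return (closest candidate, score); a lower score means a closer match."""
    scored = [(edit_distance(name, c), c) for c in candidates]
    score, match = min(scored)
    return match, score

# Hypothetical subset of standardized names
print(best_match('cs-137', ['cs137', 'h3', 'tc99']))
```

A non-zero score flags an entry for manual review in the Fix step, since the closest string is not necessarily the correct nuclide.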

Let’s now create an instance of a fuzzy matching algorithm Remapper. This instance will align the nuclide names from the OSPAR dataset with the MARIS standard nuclide names, as defined in the lookup table located at nuc_lut_path and previously shown as nuc_lut_df.

remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_output, col_name='nuclide', as_df=True),
                    maris_lut_fn=nuc_lut_path,
                    maris_col_id='nuclide_id',
                    maris_col_name='nc_name',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='nuclides_ospar.pkl')

Now, we can automatically match the OSPAR nuclide names to the MARIS standard. The match_score column helps us evaluate the results.

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing:   0%|          | 0/13 [00:00<?, ?it/s]Processing: 100%|██████████| 13/13 [00:00<00:00, 50.27it/s]
1 entries matched the criteria, while 12 entries had a match score of 1 or higher.
matched_maris_name source_name match_score
source_key
239, 240 pu pu240 239, 240 pu 8
239,240pu pu240 239,240pu 6
228ra u235 228ra 4
137cs i133 137cs 4
210pb ru106 210pb 4
241am pu241 241am 4
226ra u234 226ra 4
210po ru106 210po 4
238pu u238 238pu 3
99tc tu 99tc 3
3h tu 3h 2
cs-137 cs137 cs-137 1

We now manually review the unmatched nuclide names and construct a dictionary to map them to the MARIS standard.

The dictionary fixes_nuclide_names applies manual corrections to the nuclide names before the remapping process begins. The generate_lookup_table function constructs a lookup table for this purpose and includes an overwrite parameter, set to True by default. When activated, this parameter enables the function to update the existing cache with a new pickle file containing the updated lookup table. We are now prepared to test the remapping process.
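The exported fixes_nuclide_names dictionary is not reproduced on this page. The sketch below shows the general shape such a dictionary might take, using a few provider names and target names taken from the printed matches in this section. The variable fixes_sketch and the helper resolve are hypothetical and do not reflect the Remapper internals.

```python
# Hypothetical stand-in for fixes_nuclide_names (not the author's dictionary):
# keys are cleaned provider names, values are MARIS nc_name entries.
fixes_sketch = {
    '137cs': 'cs137',
    '3h': 'h3',
    '99tc': 'tc99',
    '239,240pu': 'pu239_240_tot',
    '239, 240 pu': 'pu239_240_tot',
}

def resolve(source_name: str, fuzzy_guess: str, fixes: dict) -> str:
    """Prefer a manual fix over the fuzzy guess when one exists."""
    return fixes.get(source_name, fuzzy_guess)

print(resolve('137cs', 'i133', fixes_sketch))    # manual fix overrides the guess
print(resolve('cs-137', 'cs137', fixes_sketch))  # no fix listed, guess is kept
```

Keeping the corrections in a plain dictionary makes them easy to review and to extend when the provider introduces a new spelling.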

remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
fc.test_eq(len(remapper.select_match(match_score_threshold=1)), 0)
Processing:   0%|          | 0/13 [00:00<?, ?it/s]Processing: 100%|██████████| 13/13 [00:00<00:00, 48.19it/s]

To view all remapped nuclides in a DataFrame, set the match_score_threshold to 0 and enable as_df. Disabling as_df provides a more detailed response that includes the matched_id. This matched_id serves as the unique integer key in the lookup table, establishing a one-to-one relationship between each integer and the standardized MARIS nuclide names.

as_df=False
remapper.generate_lookup_table(as_df=as_df, fixes=fixes_nuclide_names)
matches=remapper.select_match(match_score_threshold=0, verbose=True)
if as_df:
    display(matches.T)
else:
    print(matches)
Processing:   0%|          | 0/13 [00:00<?, ?it/s]Processing: 100%|██████████| 13/13 [00:00<00:00, 52.95it/s]
0 entries matched the criteria, while 13 entries had a match score of 0 or higher.
{'228ra': Match(matched_id=54, matched_maris_name='ra228', source_name='228ra', match_score=0), '3h': Match(matched_id=1, matched_maris_name='h3', source_name='3h', match_score=0), '137cs': Match(matched_id=33, matched_maris_name='cs137', source_name='137cs', match_score=0), 'cs-137': Match(matched_id=33, matched_maris_name='cs137', source_name='cs-137', match_score=0), nan: Match(matched_id=-1, matched_maris_name='Unknown', source_name=nan, match_score=0), '238pu': Match(matched_id=67, matched_maris_name='pu238', source_name='238pu', match_score=0), '210pb': Match(matched_id=41, matched_maris_name='pb210', source_name='210pb', match_score=0), '241am': Match(matched_id=72, matched_maris_name='am241', source_name='241am', match_score=0), '226ra': Match(matched_id=53, matched_maris_name='ra226', source_name='226ra', match_score=0), '99tc': Match(matched_id=15, matched_maris_name='tc99', source_name='99tc', match_score=0), '210po': Match(matched_id=47, matched_maris_name='po210', source_name='210po', match_score=0), '239,240pu': Match(matched_id=77, matched_maris_name='pu239_240_tot', source_name='239,240pu', match_score=0), '239, 240 pu': Match(matched_id=77, matched_maris_name='pu239_240_tot', source_name='239, 240 pu', match_score=0)}

The nuclide names have been successfully remapped. We now create a callback named RemapNuclideNameCB to translate the OSPAR dataset’s nuclide names into the standard nuclide_ids used by MARIS. This callback employs the lut_nuclides lambda function, which provides the required lookup table. Note that the overwrite=False parameter is specified in the Remapper constructor of the lut_nuclides lambda function to utilize the cached version.


source

RemapNuclideNameCB

 RemapNuclideNameCB (fn_lut:Callable, col_name:str)

Remap data provider nuclide names to standardized MARIS nuclide names.

Type Details
fn_lut Callable Function that returns the lookup table dictionary
col_name str Column name to remap
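To make the role of fn_lut concrete, the following sketch maps cleaned nuclide names to integer ids with pandas. The simplified Match tuple, lut_sketch, and remap_to_ids are hypothetical names for illustration; the ids are those shown in the printed matches above, and the real callback may handle missing names differently.

```python
from collections import namedtuple
import pandas as pd

# Simplified, hypothetical version of the Match record printed above
Match = namedtuple('Match', ['matched_id', 'matched_maris_name'])

def lut_sketch() -> dict:
    """Stand-in for lut_nuclides: returns {provider name: Match}."""
    return {
        '137cs': Match(33, 'cs137'),
        'cs-137': Match(33, 'cs137'),
        '3h': Match(1, 'h3'),
        '99tc': Match(15, 'tc99'),
    }

def remap_to_ids(df, fn_lut, col_name, col_dst='NUCLIDE'):
    """Write the matched id of each name in col_name to col_dst."""
    lut = fn_lut()
    out = df.copy()
    out[col_dst] = out[col_name].map(lambda name: lut[name].matched_id)
    return out

df_demo = pd.DataFrame({'nuclide': ['137cs', 'cs-137', '3h', '99tc']})
print(remap_to_ids(df_demo, lut_sketch, 'nuclide')['NUCLIDE'].tolist())
# [33, 33, 1, 15]
```

Passing a function rather than a dictionary defers building (or loading from cache) the lookup table until the callback actually runs.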

Let’s see it in action, along with the LowerStripNameCB callback:

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide')
                            ])
dfs_out = tfm()

# For instance
for key in dfs_out.keys():
    print(f'Unique nuclide_ids for {key} NUCLIDE column: ', dfs_out[key]['NUCLIDE'].unique())
Unique nuclide_ids for BIOTA NUCLIDE column:  [33 53 54 77 15 47 41  1 67 72]
Unique nuclide_ids for SEAWATER NUCLIDE column:  [33 77 53 54 15  1 47 41 -1]

Standardize Time

Tip

FEEDBACK TO DATA PROVIDER: ‘NaN’ values found for sampling date column in the SEAWATER dataset.

dfs = load_data(src_dir, use_cache=True)

for key in dfs.keys():
    if dfs[key]['sampling date'].isnull().sum() > 0:
        print(f"NaN values found for 'sampling date' in {key} dataset. A total of {dfs[key]['sampling date'].isnull().sum()} NaN values found.")
        print(f'Example:')
        with pd.option_context('display.max_columns', None):
            display(dfs[key][dfs[key]['sampling date'].isnull()].head(2))
    else:
        print(f"No NaN values found for 'sampling date' in {key} dataset.")
No NaN values found for 'sampling date' in BIOTA dataset.
NaN values found for 'sampling date' in SEAWATER dataset. A total of 10 NaN values found.
Example:
id contracting party rsc sub-division station id sample id latd latm lats latdir longd longm longs longdir sample type sampling depth sampling date nuclide value type activity or mda uncertainty unit data provider measurement comment sample comment reference comment
14776 97948 Sweden 11.0 SW7 1 58 36.0 12.0 N 11 14.0 42.0 E WATER 1.0 NaN 3H NaN NaN NaN Bq/l Swedish Radiation Safety Authority no 3H this year due to broken LSC NaN NaN
14780 97952 Sweden 12.0 Ringhals (R35) 7 57 14.0 5.0 N 11 56.0 8.0 E WATER 1.0 NaN 3H NaN NaN NaN Bq/l Swedish Radiation Safety Authority no 3H this year due to broken LSC NaN NaN

We create a callback that parses the date-time strings in the dictionary of DataFrames (formatted as %m/%d/%y %H:%M:%S) into datetime objects and, in the process, handles missing dates and times.


source

ParseTimeCB

 ParseTimeCB (col_src:dict={'BIOTA': 'sampling date', 'SEAWATER':
              'sampling date'}, col_dst:str='TIME', format:str='%m/%d/%y
              %H:%M:%S')

Parse the time format in the dataframe and check for inconsistencies.

Type Default Details
col_src dict {‘BIOTA’: ‘sampling date’, ‘SEAWATER’: ‘sampling date’} Column name to remap
col_dst str TIME Column name to remap
format str %m/%d/%y %H:%M:%S Time format
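The parsing step can be approximated with pandas alone. This is a minimal sketch under the assumption that rows whose date cannot be parsed are set aside rather than kept; parse_time is a hypothetical helper, not the ParseTimeCB source.

```python
import pandas as pd

def parse_time(df, col_src, col_dst='TIME', fmt='%m/%d/%y %H:%M:%S'):
    """Parse col_src into datetimes; return (valid rows, rows with no valid date)."""
    out = df.copy()
    # errors='coerce' turns missing or unparseable entries into NaT
    out[col_dst] = pd.to_datetime(out[col_src], format=fmt, errors='coerce')
    valid = out[col_dst].notna()
    return out[valid], out[~valid]

# Illustrative rows in the OSPAR date format, one with a missing date
raw_dates = pd.DataFrame(
    {'sampling date': ['01/27/10 00:00:00', None, '09/20/22 00:00:00']})
kept, removed = parse_time(raw_dates, 'sampling date')
print(len(kept), len(removed))  # 2 1
```

Returning the rejected rows separately makes it possible to report them back to the data provider instead of dropping them silently.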

Apply the transformer for callback ParseTimeCB.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    ParseTimeCB(),
    CompareDfsAndTfmCB(dfs)])
tfm()


display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

display(Markdown("<b> Example of parsed time column:</b>"))
with pd.option_context('display.max_rows', None):
    display(tfm.dfs['SEAWATER']['TIME'].head(2))
10 invalid rows found in group 'SEAWATER' during time parsing callback (ParseTimeCB).

Row Count Comparison Before and After Transformation:

BIOTA SEAWATER
Number of rows in original dataframes (dfs): 15951 19193
Number of rows in transformed dataframes (tfm.dfs): 15951 19183
Number of rows removed (tfm.dfs_removed): 0 10

Example of parsed time column:

0   2010-01-27
1   2010-01-27
Name: TIME, dtype: datetime64[ns]

The NetCDF time format requires the time to be encoded as the number of milliseconds since a time of origin. In our case the time of origin is 1970-01-01, as indicated in the CONFIGS['units']['time'] dictionary in configs.ipynb.

EncodeTimeCB transforms the datetime object from ParseTimeCB into the MARIS NetCDF time format.
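The encoding itself amounts to counting time units elapsed since the origin. The sketch below is an assumption about the arithmetic, not the EncodeTimeCB source: encode_time is a hypothetical helper, and its unit argument should be set to whatever unit the configuration actually specifies.

```python
import pandas as pd

def encode_time(times: pd.Series, unit: str = 'ms',
                origin: str = '1970-01-01') -> pd.Series:
    """Encode datetimes as whole numbers of `unit` elapsed since `origin`."""
    delta = times - pd.Timestamp(origin)
    # Floor division of a timedelta Series by a Timedelta yields integers
    return delta // pd.Timedelta(1, unit=unit)

parsed = pd.Series(pd.to_datetime(['2010-01-27', '2010-01-28']))
print(encode_time(parsed, unit='ms').tolist())
print(encode_time(parsed, unit='s').tolist())
```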

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))
10 invalid rows found in group 'SEAWATER' during time parsing callback (ParseTimeCB).

Row Count Comparison Before and After Transformation:

BIOTA SEAWATER
Number of rows in original dataframes (dfs): 15951 19193
Number of rows in transformed dataframes (tfm.dfs): 15951 19183
Number of rows removed (tfm.dfs_removed): 0 10

Sanitize value

We create a callback, SanitizeValueCB, to consolidate measurement values into a single column named VALUE and remove any NaN entries.


source

SanitizeValueCB

 SanitizeValueCB (value_col:dict={'BIOTA': 'activity or mda', 'SEAWATER':
                  'activity or mda'})

Sanitize value by removing blank entries and populating value column.

Type Default Details
value_col dict {‘BIOTA’: ‘activity or mda’, ‘SEAWATER’: ‘activity or mda’} Column name to sanitize
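In pandas terms, the sanitizing step copies the measurement column to VALUE and sets aside rows without a value. The helper below is a hypothetical sketch of that behaviour and may not match the SanitizeValueCB source line for line.

```python
import pandas as pd

def sanitize_value(df, value_col, col_dst='VALUE'):
    """Copy value_col to col_dst; return (rows with a value, rows without)."""
    out = df.copy()
    out[col_dst] = out[value_col]
    valid = out[col_dst].notna()
    return out[valid], out[~valid]

# Illustrative measurements, one of them missing
raw_values = pd.DataFrame({'activity or mda': [0.20, None, 0.27]})
kept_values, removed_values = sanitize_value(raw_values, 'activity or mda')
print(kept_values['VALUE'].tolist())  # [0.2, 0.27]
print(len(removed_values))            # 1
```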
Exported source
value_cols = {'BIOTA': 'activity or mda', 'SEAWATER': 'activity or mda'}
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            CompareDfsAndTfmCB(dfs)])

tfm()

display(Markdown("<b> Example of VALUE column:</b>"))
with pd.option_context('display.max_rows', None):
    display(tfm.dfs['SEAWATER'][['VALUE']].head())

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

display(Markdown("<b> Example of removed data:</b>"))
with pd.option_context('display.max_columns', None):
    display(tfm.dfs_removed['SEAWATER'].head(2))
10 invalid rows found in group 'SEAWATER' during sanitize value callback.

Example of VALUE column:

VALUE
0 0.20
1 0.27
2 0.26
3 0.25
4 0.20

Row Count Comparison Before and After Transformation:

BIOTA SEAWATER
Number of rows in original dataframes (dfs): 15951 19193
Number of rows in transformed dataframes (tfm.dfs): 15951 19183
Number of rows removed (tfm.dfs_removed): 0 10

Example of removed data:

id contracting party rsc sub-division station id sample id latd latm lats latdir longd longm longs longdir sample type sampling depth sampling date nuclide value type activity or mda uncertainty unit data provider measurement comment sample comment reference comment
14776 97948 Sweden 11.0 SW7 1 58 36.0 12.0 N 11 14.0 42.0 E WATER 1.0 NaN 3H NaN NaN NaN Bq/l Swedish Radiation Safety Authority no 3H this year due to broken LSC NaN NaN
14780 97952 Sweden 12.0 Ringhals (R35) 7 57 14.0 5.0 N 11 56.0 8.0 E WATER 1.0 NaN 3H NaN NaN NaN Bq/l Swedish Radiation Safety Authority no 3H this year due to broken LSC NaN NaN

Normalize uncertainty

We create a callback, NormalizeUncCB, to standardize the uncertainty value to the MARIS format. For each sample type in the OSPAR dataset, the reported uncertainty is given as an expanded uncertainty with a coverage factor 𝑘=2. For further details, refer to the OSPAR reporting guidelines. In MARIS the uncertainty values are reported as standard uncertainty with a coverage factor 𝑘=1.

NormalizeUncCB callback normalizes the uncertainty using the following lambda function:
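The exported lambda is not reproduced on this page. A minimal sketch consistent with the conversion from expanded uncertainty (k=2) to standard uncertainty (k=1) is a division by two, which also agrees with the UNC values printed below. The name convert_unc is hypothetical.

```python
import math

# Hypothetical stand-in for the default fn_convert_unc:
# expanded uncertainty (k=2) -> standard uncertainty (k=1)
convert_unc = lambda unc: unc / 2

reported = [0.024192, 0.062, math.nan]  # k=2 values; NaN means not reported
standard = [convert_unc(u) for u in reported]
print(standard[:2])  # [0.012096, 0.031]
```

Missing uncertainties stay missing after the conversion, because dividing NaN by two yields NaN.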


source

NormalizeUncCB

 NormalizeUncCB (col_unc:dict={'BIOTA': 'uncertainty', 'SEAWATER':
                 'uncertainty'}, fn_convert_unc:Callable=<function
                 <lambda>>)

Normalize uncertainty values in DataFrames.

Type Default Details
col_unc dict {‘BIOTA’: ‘uncertainty’, ‘SEAWATER’: ‘uncertainty’} Column name to normalize
fn_convert_unc Callable Function correcting coverage factor
Exported source
unc_cols = {'BIOTA': 'uncertainty', 'SEAWATER': 'uncertainty'}
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
        SanitizeValueCB(),               
        NormalizeUncCB()
    ])
tfm()

display(Markdown("<b> Example of VALUE and UNC columns:</b>"))  
for grp in ['SEAWATER', 'BIOTA']:
    print(f'\n{grp}:')
    print(tfm.dfs[grp][['VALUE', 'UNC']])
10 invalid rows found in group 'SEAWATER' during sanitize value callback.

Example of VALUE and UNC columns:


SEAWATER:
          VALUE           UNC
0      0.200000           NaN
1      0.270000           NaN
2      0.260000           NaN
3      0.250000           NaN
4      0.200000           NaN
...         ...           ...
19183  0.000005  2.600000e-07
19184  6.152000  3.076000e-01
19185  0.005390  1.078000e-03
19186  0.001420  2.840000e-04
19187  6.078000  3.039000e-01

[19183 rows x 2 columns]

BIOTA:
          VALUE       UNC
0      0.326416       NaN
1      0.442704       NaN
2      0.412989       NaN
3      0.202768       NaN
4      0.652833       NaN
...         ...       ...
15946  0.384000  0.012096
15947  0.456000  0.012084
15948  0.122000  0.031000
15949  0.310000       NaN
15950  0.306000  0.007191

[15951 rows x 2 columns]
Tip

FEEDBACK TO DATA PROVIDER: The SEAWATER dataset includes instances where the uncertainty values significantly exceed the corresponding measurement values. While such occurrences are not inherently erroneous, they merit attention and may warrant further verification.

To demonstrate instances where the uncertainty significantly surpasses the measurement values, we will initially compute the ‘relative uncertainty’ as a percentage for the seawater dataset.

dfs = load_data(src_dir, use_cache=True)
for grp in ['SEAWATER', 'BIOTA']:
    # Relative uncertainty (%) = uncertainty / value * 100
    tfm.dfs[grp]['relative_uncertainty'] = (
        tfm.dfs[grp]['uncertainty'] / tfm.dfs[grp]['activity or mda'] * 100
    )

Now we will retrieve all rows where the relative uncertainty exceeds 100% for the seawater dataset.

threshold = 100
grp='SEAWATER'
cols_to_show=['id', 'contracting party', 'nuclide', 'value type', 'activity or mda', 'uncertainty', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

display(Markdown(f"<b> Example of data with relative uncertainty greater than {threshold}%:</b>"))
with pd.option_context('display.max_rows', None):
    display(df.head())
Number of rows where relative uncertainty is greater than 100%: 
 95 

Example of data with relative uncertainty greater than 100%:

        id contracting party nuclide value type  activity or mda  uncertainty  unit  relative_uncertainty
969  11075    United Kingdom   137Cs          =           0.0028       0.3276  Bq/l               11700.0
971  11077    United Kingdom   137Cs          =           0.0029       0.3364  Bq/l               11600.0
973  11079    United Kingdom   137Cs          =           0.0025       0.3325  Bq/l               13300.0
975  11081    United Kingdom   137Cs          =           0.0025       0.3450  Bq/l               13800.0
977  11083    United Kingdom   137Cs          =           0.0038       0.3344  Bq/l                8800.0
Tip

FEEDBACK TO DATA PROVIDER: The BIOTA dataset includes instances where the uncertainty values significantly exceed the corresponding measurement values. While such occurrences are not inherently erroneous, they merit attention and may warrant further verification.

Now we will retrieve all rows where the relative uncertainty exceeds 100% for the biota dataset.

threshold = 100
grp='BIOTA' 
cols_to_show=['id', 'contracting party', 'nuclide', 'value type', 'activity or mda', 'uncertainty', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

display(Markdown(f"<b> Example of data with relative uncertainty greater than {threshold}%:</b>"))
with pd.option_context('display.max_rows', None):
    display(df.head())
Number of rows where relative uncertainty is greater than 100%: 
 100 

Example of data with relative uncertainty greater than 100%:

         id contracting party    nuclide value type  activity or mda  uncertainty        unit  relative_uncertainty
249    3101            Norway      137Cs          =           0.0500       0.1000  Bq/kg f.w.            200.000000
306    3158            Norway      137Cs          =           0.1500       0.1600  Bq/kg f.w.            106.666667
775    8152            Norway      137Cs          =           0.0340       0.0500  Bq/kg f.w.            147.058824
788    8165            Norway      137Cs          =           0.0300       0.0500  Bq/kg f.w.            166.666667
1839  19571           Belgium  239,240Pu          =           0.0074       0.0093  Bq/kg f.w.            125.675676

Remap units

Tip

FEEDBACK TO DATA PROVIDER: The Unit column contains NaN values for the SEAWATER dataset, as shown below.

number_rows_to_show=2
df=dfs['SEAWATER'][dfs['SEAWATER']['unit'].isnull()]
print(f'Number of rows with NaN in unit column: \n {df.shape[0]} \n')
display(Markdown(f"<b> Example of data with NaN in unit column:</b>"))
with pd.option_context('display.max_columns', None):
    display(df.head(number_rows_to_show))
Number of rows with NaN in unit column: 
 8 

Example of data with NaN in unit column:

id contracting party rsc sub-division station id sample id latd latm lats latdir longd longm longs longdir sample type sampling depth sampling date nuclide value type activity or mda uncertainty unit data provider measurement comment sample comment reference comment
16161 120369 Ireland 1.0 Salthill NaN 53 15.0 40.0 N 9 4.0 15.0 W NaN NaN NaN NaN NaN NaN NaN NaN NaN 2021 data Woodstown (County Waterford) and Salthill (Cou... NaN
16162 120370 Ireland 1.0 Woodstown NaN 52 11.0 55.0 N 6 58.0 47.0 W NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Let’s inspect the unique units used by OSPAR:

get_unique_across_dfs(dfs, col_name='unit', as_df=True)
index value
0 0 NaN
1 1 Bq/l
2 2 Bq/kg f.w.
3 3 BQ/L
4 4 Bq/L
Tip

FEEDBACK TO DATA PROVIDER: Standardizing the units would simplify data processing, as the units are not consistent across the dataset. For example, BQ/L, Bq/l, and Bq/L are used interchangeably.

We will establish unit renaming rules for the OSPAR dataset:
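The exported rules are not reproduced on this page. As a minimal sketch, assuming the rules are a plain dictionary from provider unit strings to MARIS unit IDs, they might look like the following. The IDs 1 and 5 are taken from the transformed output shown further down; the names `example_unit_rules` and `remap_unit` are hypothetical and are not part of marisco.

```python
from typing import Optional

# Illustrative only: provider unit string -> MARIS unit ID.
# IDs 1 and 5 mirror the values printed after the transformation further down.
example_unit_rules = {
    'Bq/l': 1,
    'Bq/L': 1,
    'BQ/L': 1,
    'Bq/kg f.w.': 5,
}

def remap_unit(unit: Optional[str], default_unit: str) -> int:
    """Look up a unit ID, falling back to the group's default unit when the
    unit is missing or not covered by the rules (assumed behaviour)."""
    if unit is None or unit not in example_unit_rules:
        unit = default_unit
    return example_unit_rules[unit]

print(remap_unit('BQ/L', default_unit='Bq/l'))  # spelling variant of Bq/l
print(remap_unit(None, default_unit='Bq/l'))    # missing unit uses the default
```

Mapping every spelling variant to the same ID keeps the normalization in one place, so downstream callbacks only ever see the MARIS identifiers.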

Now we will create a callback, RemapUnitCB, to remap the units in the dataframes. Each group gets a default unit: Bq/l for SEAWATER and Bq/kg f.w. for BIOTA.


source

RemapUnitCB

 RemapUnitCB (lut:Dict[str,str], default_units:Dict[str,str]={'SEAWATER':
              'Bq/l', 'BIOTA': 'Bq/kg f.w.'}, verbose:bool=False)

Callback to update DataFrame ‘UNIT’ columns based on a lookup table.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SanitizeValueCB(), # Remove blank value entries (also removes NaN values in Unit column) 
                            RemapUnitCB(renaming_unit_rules, verbose=True),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

print('Unique Unit values:')
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['UNIT'].unique()}")
10 invalid rows found in group 'SEAWATER' during sanitize value callback.

Row Count Comparison Before and After Transformation:

                                                     BIOTA  SEAWATER
Number of rows in original dataframes (dfs):         15951     19193
Number of rows in transformed dataframes (tfm.dfs):  15951     19183
Number of rows removed (tfm.dfs_removed):                0        10
Unique Unit values:
BIOTA: [5]
SEAWATER: [1]

Remap detection limit

Tip

FEEDBACK TO DATA PROVIDER: The Value type column contains numerous NaN entries.

# Count the number of NaN entries in the 'value type' column for 'SEAWATER'
na_count_seawater = dfs['SEAWATER']['value type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'SEAWATER': {na_count_seawater}")

# Count the number of NaN entries in the 'value type' column for 'BIOTA'
na_count_biota = dfs['BIOTA']['value type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'BIOTA': {na_count_biota}")
Number of NaN 'Value type' entries in 'SEAWATER': 64
Number of NaN 'Value type' entries in 'BIOTA': 23

In the OSPAR dataset, the detection limit is denoted by < in the Value type column. When the Value type is <, the Activity or MDA column specifies the detection limit. Conversely, when the Value type is =, the Activity or MDA column holds an actual measurement. Let’s review the entries in the Value type column for the OSPAR dataset:

for grp in dfs.keys():
    print(f'{grp}:')
    print(tfm.dfs[grp]['value type'].unique())
BIOTA:
['<' '=' nan]
SEAWATER:
['<' '=' nan]

In MARIS, detection limits are encoded as follows:

pd.read_excel(detection_limit_lut_path())
   id  name            name_sanitized
0  -1  Not applicable  Not applicable
1   0  Not Available   Not available
2   1  =               Detected value
3   2  <               Detection limit
4   3  ND              Not detected
5   4  DE              Derived

We can create a lambda function to retrieve the MARIS lookup table.

We can define the columns of interest in both the SEAWATER and BIOTA DataFrames for the detection limit column.

We now create a callback, RemapDetectionLimitCB, to remap OSPAR detection limit values to MARIS-formatted values using the lookup table. Because the Value type column contains NaN entries, the callback sets the detection limit to ‘=’ whenever a row has both a value and an uncertainty but its current detection limit value is not among the lookup keys.


source

RemapDetectionLimitCB

 RemapDetectionLimitCB (coi:dict, fn_lut:Callable)

Remap detection limit values to MARIS format using a lookup table.
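The rule described above can be sketched in plain Python for a single measurement. This is an illustration under stated assumptions, not the exported callback: the lookup is assumed to map the `name` column of the MARIS table to its `id`, the function names `is_missing` and `remap_detection_limit` are hypothetical, and the final fallback to 0 (Not Available) is an assumption as well.

```python
import math

# name -> id, copied from the MARIS detection limit table shown above
lut = {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

def is_missing(x):
    """True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def remap_detection_limit(value_type, value, unc):
    """Return the MARIS detection limit ID for one measurement."""
    if value_type in lut:
        return lut[value_type]
    # Unknown or missing value type: treat as a detected value ('=')
    # only when both the value and its uncertainty are present.
    if not is_missing(value) and not is_missing(unc):
        return lut['=']
    return lut['Not Available']  # assumed fallback, not confirmed by the source

print(remap_detection_limit('<', 0.2, None))
print(remap_detection_limit(float('nan'), 0.31, 0.007))
```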

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            NormalizeUncCB(),                  
                            RemapUnitCB(renaming_unit_rules, verbose=True),
                            RemapDetectionLimitCB(coi_dl, lut_dl)])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['DL'].unique()}")
10 invalid rows found in group 'SEAWATER' during sanitize value callback.
BIOTA: [2 1]
SEAWATER: [2 1]

Remap Biota species

The OSPAR dataset contains biota species information in the Species column of the biota DataFrame. To ensure consistency with MARIS standards, it is necessary to remap these species names. We will employ a similar approach to that used for standardizing nuclide names, IMFA (Inspect, Match, Fix, Apply).

We first inspect the unique Species values of the OSPAR Biota dataset:

dfs = load_data(src_dir, use_cache=True)
with pd.option_context('display.max_columns', None):
    display(get_unique_across_dfs(dfs, col_name='species', as_df=True).T)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
value Ostrea edulis CHIMAERA MONSTROSA Brosme brosme Cyclopterus lumpus EUTRIGLA GURNARDUS Dicentrarchus labrax Pelvetia canaliculata GALEUS MELASTOMUS Glyptocephalus cynoglossus Reinhardtius hippoglossoides Pecten maximus NUCELLA LAPILLUS Littorina littorea Trisopterus minutus Unknown Gadus morhua Sebastes viviparus Anarhichas minor GLYPTOCEPHALUS CYNOGLOSSUS MOLVA MOLVA Platichthys flesus CLUPEA HARENGUS Hyperoplus lanceolatus DIPTURUS BATIS CRASSOSTREA GIGAS OSTREA EDULIS Galeus melastomus Eutrigla gurnardus Mytilus edulis Anguilla anguilla MONODONTA LINEATA Trachurus trachurus ASCOPHYLLUM NODOSUM PLATICHTHYS FLESUS Lycodes vahlii RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA LIMANDA LIMANDA ASCOPHYLLUN NODOSUM Tapes sp. Scomber scombrus Microstomus kitt MICROMESISTIUS POUTASSOU RHODYMENIA spp Ostrea Edulis SCOPHTHALMUS RHOMBUS SCOMBER SCOMBRUS Cerastoderma edule ETMOPTERUS SPINAX Squalus acanthias NaN RAJIDAE/BATOIDEA HIPPOGLOSSUS HIPPOGLOSSUS MYTILUS EDULIS MERLANGIUS MERLANGUS LAMINARIA DIGITATA SOLEA SOLEA (S.VULGARIS) PATELLA VULGATA Melanogrammus aeglefinus Limanda limanda Solea solea (S.vulgaris) Thunnus thynnus GADUS MORHUA Clupea Harengus Sebastes norvegicus Capros aper Gadiculus argenteus Merlangius merlangus SEBASTES MARINUS Penaeus vannamei Patella sp. BROSME BROSME TRACHURUS TRACHURUS Argentina sphyraena Gadus morhua Anarhichas denticulatus Mixture of green, red and brown algae Clupea harengus FUCUS SPP. LITTORINA LITTOREA Phycis blennoides Argentina silus BUCCINUM UNDATUM Merluccius merluccius Fucus sp. Rhodymenia spp. Melanogrammus aeglefinus Homarus gammarus Fucus serratus FUCUS SPIRALIS Gaidropsarus argenteus Hippoglossus hippoglossus PLEURONECTES PLATESSA Clupea harengus MERLANGUIS MERLANGUIS Pollachius virens MERLUCCIUS MERLUCCIUS Trisopterus esmarkii Pleuronectiformes [order] PALMARIA PALMATA PECTEN MAXIMUS Sprattus sprattus Sepia spp. 
Pollachius pollachius SPRATTUS SPRATTUS Ascophyllum nodosum Mytilus Edulis Pleuronectes platessa RAJA DIPTURUS BATIS PECTINIDAE Raja montagui Salmo salar Fucus Vesiculosus REINHARDTIUS HIPPOGLOSSOIDES HIPPOGLOSSOIDES PLATESSOIDES Sebastes vivipares Limanda Limanda Sebastes marinus Gadus Morhua Gadus sp. Merlangius Merlangus Fucus distichus PORPHYRA UMBILICALIS CYCLOPTERUS LUMPUS MELANOGRAMMUS AEGLEFINUS Sebastes Mentella Coryphaenoides rupestris Pleuronectes platessa Hippoglossoides platessoides Crassostrea gigas PELVETIA CANALICULATA Lumpenus lampretaeformis unknown Thunnus sp. Boreogadus Saida Trisopterus esmarki OSILINUS LINEATUS Molva molva MOLVA DYPTERYGIA Buccinum undatum Sebastes mentella MERLUCCIUS MERLUCCIUS Sardina pilchardus PLUERONECTES PLATESSA SALMO SALAR Lophius piscatorius CERASTODERMA (CARDIUM) EDULE Anarhichas lupus Dasyatis pastinaca FUCUS spp BOREOGADUS SAIDA Boreogadus saida Modiolus modiolus FUCUS SERRATUS Nephrops norvegicus FUCUS VESICULOSUS Cerastoderma (Cardium) Edule POLLACHIUS VIRENS SEBASTES MENTELLA DICENTRARCHUS (MORONE) LABRAX ANARHICHAS LUPUS Flatfish Fucus vesiculosus Micromesistius poutassou PATELLA Phoca vitulina Mallotus villosus Gadiculus argenteus thori

We attempt to match the OSPAR species column to the species column of the MARIS nomenclature using the Remapper. First, we initialize the Remapper:

remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='species_ospar.pkl')

Next, we perform the matching and generate a lookup table that includes the match score, which quantifies the degree of match accuracy:

remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)
Processing: 100%|██████████| 167/167 [00:22<00:00,  7.30it/s]
129 entries matched the criteria, while 38 entries had a match score of 1 or higher.
source_key RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA Mixture of green, red and brown algae SOLEA SOLEA (S.VULGARIS) Solea solea (S.vulgaris) Cerastoderma (Cardium) Edule CERASTODERMA (CARDIUM) EDULE DICENTRARCHUS (MORONE) LABRAX Pleuronectiformes [order] RAJIDAE/BATOIDEA PALMARIA PALMATA MONODONTA LINEATA Gadiculus argenteus Unknown unknown RAJA DIPTURUS BATIS Sepia spp. Flatfish Rhodymenia spp. FUCUS SPP. Gadus sp. Thunnus sp. Tapes sp. FUCUS spp Fucus sp. Patella sp. RHODYMENIA spp MERLANGUIS MERLANGUIS PLUERONECTES PLATESSA Gaidropsarus argenteus Melanogrammus aeglefinus Pleuronectes platessa Trisopterus esmarki Hippoglossus hippoglossus Sebastes vivipares MERLUCCIUS MERLUCCIUS Gadus morhua ASCOPHYLLUN NODOSUM Clupea harengus
matched_maris_name Lomentaria catenata Mercenaria mercenaria Loligo vulgaris Loligo vulgaris Cerastoderma edule Cerastoderma edule Dicentrarchus labrax Pleuronectiformes Batoidea Alaria marginata Monodonta labio Pampus argenteus Undaria Undaria Dipturus batis Sepia Lambia Rhodymenia Fucus Penaeus sp. Thunnus Tapes Fucus Fucus Patella Rhodymenia Merlangius merlangus Pleuronectes platessa Gaidropsarus argentatus Melanogrammus aeglefinus Pleuronectes platessa Trisopterus esmarkii Hippoglossus hippoglossus Sebastes viviparus Merluccius merluccius Gadus morhua Ascophyllum nodosum Clupea harengus
source_name RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA Mixture of green, red and brown algae SOLEA SOLEA (S.VULGARIS) Solea solea (S.vulgaris) Cerastoderma (Cardium) Edule CERASTODERMA (CARDIUM) EDULE DICENTRARCHUS (MORONE) LABRAX Pleuronectiformes [order] RAJIDAE/BATOIDEA PALMARIA PALMATA MONODONTA LINEATA Gadiculus argenteus Unknown unknown RAJA DIPTURUS BATIS Sepia spp. Flatfish Rhodymenia spp. FUCUS SPP. Gadus sp. Thunnus sp. Tapes sp. FUCUS spp Fucus sp. Patella sp. RHODYMENIA spp MERLANGUIS MERLANGUIS PLUERONECTES PLATESSA Gaidropsarus argenteus Melanogrammus aeglefinus Pleuronectes platessa Trisopterus esmarki Hippoglossus hippoglossus Sebastes vivipares MERLUCCIUS MERLUCCIUS Gadus morhua ASCOPHYLLUN NODOSUM Clupea harengus
match_score 31 26 12 12 10 10 9 8 8 7 6 6 5 5 5 5 5 5 5 4 4 4 4 4 4 4 3 2 2 1 1 1 1 1 1 1 1 1
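The scores in this table are consistent with a case-insensitive edit distance, where 0 means an exact match and larger values mean a poorer match. The sketch below assumes that interpretation. It is a textbook Levenshtein distance, not necessarily the routine the Remapper uses internally, and `best_match` is a hypothetical helper.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(name, candidates):
    """Return the candidate closest to `name`, ignoring case, with its score."""
    scored = [(c, levenshtein(name.lower(), c.lower())) for c in candidates]
    return min(scored, key=lambda pair: pair[1])

maris_names = ['Sebastes viviparus', 'Merlangius merlangus', 'Fucus']
print(best_match('Sebastes vivipares', maris_names))
print(best_match('MERLANGUIS MERLANGUIS', maris_names))
```

Under this reading, a low score such as 1 usually signals a typo or trailing whitespace, while a high score signals that no close MARIS name exists and a manual fix is needed.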

Below, we fix the entries that are not properly matched by the Remapper:
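The actual `fixes_biota_species` dictionary is not reproduced on this page. As a hedged illustration of the mechanism only, a fixes dictionary maps a provider name to the MARIS name it should resolve to, and takes precedence over the automatic match. The corrections in `example_fixes` and the helper `apply_fixes` are hypothetical examples, not the handler's real entries.

```python
# Automatic matches taken from the table above (provider name -> matched MARIS name)
auto_matches = {
    'Solea solea (S.vulgaris)': 'Loligo vulgaris',
    'PALMARIA PALMATA': 'Alaria marginata',
    'Sebastes vivipares': 'Sebastes viviparus',
}

# Hypothetical manual corrections; real values must exist in the MARIS species table
example_fixes = {
    'Solea solea (S.vulgaris)': 'Solea solea',
    'PALMARIA PALMATA': 'Palmaria palmata',
}

def apply_fixes(matches, fixes):
    """Return a new mapping in which manual fixes override automatic matches."""
    return {**matches, **fixes}

corrected = apply_fixes(auto_matches, example_fixes)
for source, target in corrected.items():
    print(f'{source!r} -> {target!r}')
```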

We can now review the remapping results, incorporating the adjustments from the fixes_biota_species dictionary:

remapper.generate_lookup_table(fixes=fixes_biota_species)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)
Processing: 100%|██████████| 167/167 [00:22<00:00,  7.37it/s]
139 entries matched the criteria, while 28 entries had a match score of 1 or higher.
source_key Cerastoderma (Cardium) Edule CERASTODERMA (CARDIUM) EDULE DICENTRARCHUS (MORONE) LABRAX Pleuronectiformes [order] Gadiculus argenteus MONODONTA LINEATA FUCUS SPP. Rhodymenia spp. RAJA DIPTURUS BATIS Sepia spp. Tapes sp. RHODYMENIA spp FUCUS spp Patella sp. Fucus sp. Thunnus sp. MERLANGUIS MERLANGUIS PLUERONECTES PLATESSA Gaidropsarus argenteus MERLUCCIUS MERLUCCIUS Clupea harengus Sebastes vivipares Pleuronectes platessa Hippoglossus hippoglossus Trisopterus esmarki Melanogrammus aeglefinus ASCOPHYLLUN NODOSUM Gadus morhua
matched_maris_name Cerastoderma edule Cerastoderma edule Dicentrarchus labrax Pleuronectiformes Pampus argenteus Monodonta labio Fucus Rhodymenia Dipturus batis Sepia Tapes Rhodymenia Fucus Patella Fucus Thunnus Merlangius merlangus Pleuronectes platessa Gaidropsarus argentatus Merluccius merluccius Clupea harengus Sebastes viviparus Pleuronectes platessa Hippoglossus hippoglossus Trisopterus esmarkii Melanogrammus aeglefinus Ascophyllum nodosum Gadus morhua
source_name Cerastoderma (Cardium) Edule CERASTODERMA (CARDIUM) EDULE DICENTRARCHUS (MORONE) LABRAX Pleuronectiformes [order] Gadiculus argenteus MONODONTA LINEATA FUCUS SPP. Rhodymenia spp. RAJA DIPTURUS BATIS Sepia spp. Tapes sp. RHODYMENIA spp FUCUS spp Patella sp. Fucus sp. Thunnus sp. MERLANGUIS MERLANGUIS PLUERONECTES PLATESSA Gaidropsarus argenteus MERLUCCIUS MERLUCCIUS Clupea harengus Sebastes vivipares Pleuronectes platessa Hippoglossus hippoglossus Trisopterus esmarki Melanogrammus aeglefinus ASCOPHYLLUN NODOSUM Gadus morhua
match_score 10 10 9 8 6 6 5 5 5 5 4 4 4 4 4 4 3 2 2 1 1 1 1 1 1 1 1 1

On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now define a lambda function that instantiates the Remapper and returns the corrected lookup table.

Putting it all together, we now apply the RemapCB callback to our data. This process adds a SPECIES column to our BIOTA dataframe, which contains the standardized species IDs.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['SPECIES'].unique()
array([ 377,  129,   96,    0,  192,   99,   50,  378,  270,  379,  380,
        381,  382,  383,  384,  385,  244,  386,  387,  388,  389,  390,
        391,  392,  393,  394,  395,  396,  274,  397,  398,  243,  399,
        400,  401,  402,  403,  404,  405,  406,  407,  191,  139,  408,
        410,  412,  413,  272,  414,  415,  416,  417,  418,  419,  420,
        421,  422,  423,  424,  425,  426,  427,  428,  411,  429,  430,
        431,  432,  433,  434,  435,  436,  437,  438,  439,  440,  441,
        442,  443,  444,  294, 1684, 1610, 1609, 1605, 1608,   23, 1606,
        234,  556, 1701, 1752,  158,  223])

Enhance Species Data Using Biological group

The Biological group column in the OSPAR dataset provides valuable insights related to species. We will leverage this information to enrich the SPECIES column. To achieve this, we will employ the generic RemapCB callback to create an enhanced_species column. Subsequently, this enhanced_species column will be used to further enrich the SPECIES column.

First we inspect the unique values in the biological group column.

get_unique_across_dfs(dfs, col_name='biological group', as_df=True)
index value
0 0 fish
1 1 MOLLUSCS
2 2 Seaweeds
3 3 FISH
4 4 seaweed
5 5 Seaweed
6 6 molluscs
7 7 SEAWEED
8 8 Fish
9 9 Molluscs

We will remap the values of the biological group column to the species column of the MARIS nomenclature, again using a Remapper object:

remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological group', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='enhance_species_ospar.pkl')

As before, we inspect the resulting matches.

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)
Processing: 100%|██████████| 10/10 [00:01<00:00,  8.26it/s]
matched_maris_name source_name match_score
source_key
fish Fucus fish 4
FISH Fucus FISH 4
Fish Fucus Fish 4
MOLLUSCS Mollusca MOLLUSCS 1
Seaweeds Seaweed Seaweeds 1
molluscs Mollusca molluscs 1
Molluscs Mollusca Molluscs 1

We can see that some entries require manual fixes.

Now we will apply the manual fixes to the lookup table and review.

remapper.generate_lookup_table(fixes=fixes_enhanced_biota_species)
remapper.select_match(match_score_threshold=1)
Processing: 100%|██████████| 10/10 [00:01<00:00,  6.69it/s]
matched_maris_name source_name match_score
source_key
MOLLUSCS Mollusca MOLLUSCS 1
Seaweeds Seaweed Seaweeds 1
molluscs Mollusca molluscs 1
Molluscs Mollusca Molluscs 1

On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now define a lambda function that instantiates the Remapper and returns the corrected lookup table.

Now we can apply RemapCB which results in the addition of an enhanced_species column in our BIOTA DataFrame.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['enhanced_species'].unique()
array([ 873, 1059,  712])

With the enhanced_species column, we can enrich the SPECIES column. Where SPECIES has no match, we substitute the value from enhanced_species, provided that value is valid.


source

EnhanceSpeciesCB

 EnhanceSpeciesCB ()

Enhance the ‘SPECIES’ column using the ‘enhanced_species’ column if conditions are met.
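A minimal pandas sketch of this rule follows. It assumes that an unmatched species is coded as 0 (the earlier SPECIES output contains a 0 that disappears after enhancement) and that a valid enhanced_species is any non-null, non-zero ID. The function name `enhance_species` is hypothetical and the callback's actual conditions may differ.

```python
import pandas as pd

def enhance_species(df: pd.DataFrame) -> pd.DataFrame:
    """Fill unmatched SPECIES (assumed to be coded 0) from enhanced_species."""
    out = df.copy()
    unmatched = out['SPECIES'] == 0
    valid = out['enhanced_species'].notna() & (out['enhanced_species'] != 0)
    rows = unmatched & valid
    out.loc[rows, 'SPECIES'] = out.loc[rows, 'enhanced_species']
    return out

biota = pd.DataFrame({
    'SPECIES': [377, 0, 0],
    'enhanced_species': [712, 712, 0],
})
print(enhance_species(biota)['SPECIES'].tolist())
```

Rows that already have a species match keep it; the broader biological group is used only as a fallback.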

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),
    EnhanceSpeciesCB()
    ])

tfm()['BIOTA']['SPECIES'].unique()
array([ 377,  129,   96,  712,  192,   99,   50,  378,  270,  379,  380,
        381,  382,  383,  384,  385,  244,  386,  387,  388,  389,  390,
        391,  392,  393,  394,  395,  396,  274,  161,  398,  243,  399,
        400,  401,  402,  403,  404,  405,  406,  407, 1379,  191,  139,
        408, 1299,  410,  148,  412,  413,  272,  414,  415,  416,  417,
        418,  419,  420,  421,  422,  423,  424,  425,  426,  427,  428,
        411,  429,  430,  431,  814,  432,  433,  434,  435,  436,  437,
        438,  439,  440,  441,  442,  443,  444,  294,  992, 1426, 1684,
       1610, 1609, 1605, 1608,   23, 1606,  234,  556, 1701, 1752, 1104,
        158,  223])

All entries are matched for the SPECIES column.

Remap Biota tissues

The OSPAR dataset includes entries where the Body Part is labeled as whole. However, the MARIS data standard requires a more specific distinction for the body_part field, differentiating between Whole animal and Whole plant. Fortunately, the OSPAR dataset provides a Biological group field that allows us to make this distinction.

To address this discrepancy and ensure compatibility with MARIS standards, we will:

1. Create a temporary column body_part_temp that combines information from both Body Part and Biological group.
2. Use this temporary column to perform the lookup using our Remapper object.

Let’s create the temporary column, body_part_temp, that combines Body Part and Biological group.


source

AddBodypartTempCB

 AddBodypartTempCB ()

Add a temporary column with the body part and biological group combined.
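Judging from the values printed below, the combined key appears to be the body part and the biological group joined by a single space and lowercased, without stripping extra whitespace. The sketch assumes exactly that; the function `body_part_temp` is a hypothetical stand-in for the column-wise operation in the callback.

```python
def body_part_temp(body_part: str, biological_group: str) -> str:
    """Combine body part and biological group into one lowercase lookup key."""
    return f'{body_part} {biological_group}'.lower()

print(body_part_temp('Whole plant', 'SEAWEED'))
print(body_part_temp('Muscle', 'Fish'))
```

Appending the biological group is what lets a generic label such as whole be resolved later to either Whole animal or Whole plant.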

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            ])
dfs_test = tfm()
dfs_test['BIOTA']['body_part_temp'].unique()
array(['whole animal molluscs', 'whole plant seaweed', 'whole fish fish',
       'flesh without bones fish', 'whole animal fish', 'muscle fish',
       'head fish', 'soft parts molluscs', 'growing tips seaweed',
       'soft parts fish', 'unknown fish', 'flesh without bone fish',
       'flesh fish', 'flesh with scales fish', 'liver fish',
       'flesh without bones seaweed', 'whole  fish',
       'flesh without bones molluscs', 'whole  seaweed',
       'whole plant seaweeds', 'whole fish', 'whole without head fish',
       'mix of muscle and whole fish without liver fish',
       'whole fisk fish', 'muscle  fish', 'cod medallion fish',
       'tail and claws fish'], dtype=object)

To align the body_part_temp column with the bodypar column in the MARIS nomenclature, we will use the Remapper. However, since the OSPAR dataset lacks a predefined lookup table for the body_part column, we must first create one. This is accomplished by extracting unique values from the body_part_temp column.

get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True).head()
index value
0 0 whole without head fish
1 1 soft parts fish
2 2 flesh without bones molluscs
3 3 growing tips seaweed
4 4 muscle fish

We can now remap the body_part_temp column to the bodypar column in the MARIS nomenclature using the Remapper. Subsequently, we will inspect the results:

remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True),
                    maris_lut_fn=bodyparts_lut_path,
                    maris_col_id='bodypar_id',
                    maris_col_name='bodypar',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='tissues_ospar.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=0, verbose=True).T)
Processing: 100%|██████████| 27/27 [00:00<00:00, 102.11it/s]
0 entries matched the criteria, while 27 entries had a match score of 0 or higher.
source_key mix of muscle and whole fish without liver fish whole without head fish cod medallion fish tail and claws fish unknown fish whole fisk fish whole fish fish whole plant seaweeds whole animal molluscs soft parts molluscs flesh without bones molluscs whole plant seaweed flesh without bones seaweed growing tips seaweed flesh fish whole seaweed muscle fish liver fish flesh without bones fish muscle fish whole fish head fish whole animal fish soft parts fish whole fish flesh with scales fish flesh without bone fish
matched_maris_name Flesh without bones Flesh without bones Old leaf Stomach and intestine Growing tips Whole animal Whole animal Whole plant Whole animal Soft parts Flesh without bones Whole plant Flesh without bones Growing tips Shells Whole plant Muscle Liver Flesh without bones Muscle Whole animal Head Whole animal Soft parts Whole animal Flesh with scales Flesh without bones
source_name mix of muscle and whole fish without liver fish whole without head fish cod medallion fish tail and claws fish unknown fish whole fisk fish whole fish fish whole plant seaweeds whole animal molluscs soft parts molluscs flesh without bones molluscs whole plant seaweed flesh without bones seaweed growing tips seaweed flesh fish whole seaweed muscle fish liver fish flesh without bones fish muscle fish whole fish head fish whole animal fish soft parts fish whole fish flesh with scales fish flesh without bone fish
match_score 31 13 13 13 9 9 9 9 9 9 9 8 8 8 7 7 6 5 5 5 5 5 5 5 5 5 4

Many of the lookup entries are sufficient for our needs. However, for values that don’t find a match, we can use the fixes_biota_tissues dictionary to apply manual corrections. First we will create the dictionary.

Now we will generate the lookup table and apply the manual fixes defined in the fixes_biota_tissues dictionary.

remapper.generate_lookup_table(fixes=fixes_biota_tissues)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)
Processing: 100%|██████████| 27/27 [00:00<00:00, 94.75it/s]
1 entries matched the criteria, while 26 entries had a match score of 1 or higher.
source_key whole animal molluscs flesh without bones molluscs whole fisk fish soft parts molluscs whole fish fish whole plant seaweeds growing tips seaweed whole plant seaweed whole seaweed muscle fish flesh without bones fish liver fish whole animal fish flesh with scales fish soft parts fish whole fish head fish whole fish muscle fish flesh without bone fish flesh without bones seaweed mix of muscle and whole fish without liver fish tail and claws fish unknown fish cod medallion fish whole without head fish
matched_maris_name Whole animal Flesh without bones Whole animal Soft parts Whole animal Whole plant Growing tips Whole plant Whole plant Muscle Flesh without bones Liver Whole animal Flesh with scales Soft parts Whole animal Head Whole animal Muscle Flesh without bones (Not available) (Not available) (Not available) (Not available) (Not available) (Not available)
source_name whole animal molluscs flesh without bones molluscs whole fisk fish soft parts molluscs whole fish fish whole plant seaweeds growing tips seaweed whole plant seaweed whole seaweed muscle fish flesh without bones fish liver fish whole animal fish flesh with scales fish soft parts fish whole fish head fish whole fish muscle fish flesh without bone fish flesh without bones seaweed mix of muscle and whole fish without liver fish tail and claws fish unknown fish cod medallion fish whole without head fish
match_score 9 9 9 9 9 9 8 8 7 6 5 5 5 5 5 5 5 5 5 4 2 2 2 2 2 2

At this stage, the majority of entries have been successfully matched to the MARIS nomenclature. Entries that remain unmatched are appropriately marked as ‘not available’. We are now ready to proceed with the final remapping process. We will define a lambda function to instantiate the Remapper, which will then generate and return the corrected lookup table.

Putting it all together, we now apply the RemapCB callback. This process results in the addition of a BODY_PART column to our BIOTA DataFrame.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA')
                            ])
tfm()
tfm.dfs['BIOTA']['BODY_PART'].unique()
array([ 1, 40, 52, 34, 13, 19, 56,  0,  4, 60, 25])

Remap biogroup

The MARIS species lookup table contains a biogroup_id column that associates each species with its corresponding biogroup. We will leverage this relationship to create a BIO_GROUP column in the BIOTA DataFrame.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ 
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),
    EnhanceSpeciesCB(),
    RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA')
    ])

print(tfm()['BIOTA']['BIO_GROUP'].unique())
[14 11  4 13 12  2  5]

Add Sample ID

The OSPAR dataset includes an ID column, which we will use to create a SMP_ID column.


AddSampleIdCB

 AddSampleIdCB ()

Include a SMP_ID column from the ID column of OSPAR

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            AddSampleIdCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['SMP_ID'].unique()}")

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
BIOTA: [    1     2     3 ... 98060 98061 98062]
SEAWATER: [     1      2      3 ... 120366 120367 120368]
                                                    BIOTA  SEAWATER
Number of rows in original dataframes (dfs):        15951     19193
Number of rows in transformed dataframes (tfm.d...  15951     19193
Number of rows removed (tfm.dfs_removed):               0         0 

Add depth

The OSPAR dataset features a Sampling depth column specifically for the SEAWATER dataset. In this section, we will develop a callback to integrate the sampling depth, denoted as SMP_DEPTH, into the MARIS dataset.


AddDepthCB

 AddDepthCB ()

Ensure depth values are floats and add ‘SMP_DEPTH’ columns.
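
The idea behind this callback can be sketched without pandas. This is a minimal illustration, not the marisco implementation: the `to_depth` helper is hypothetical, and treating blanks as NaN and accepting a `,` decimal separator are assumptions.

```python
import math

def to_depth(raw):
    """Return the depth as a float, or NaN when the value is missing or unparsable."""
    if raw is None:
        return math.nan
    try:
        # Assumption: a ',' decimal separator may occur and is normalized to '.'.
        return float(str(raw).strip().replace(",", "."))
    except ValueError:
        return math.nan

rows = [
    {"Sampling depth": "3"},
    {"Sampling depth": 21},
    {"Sampling depth": ""},
    {"Sampling depth": "1,5"},
]
for row in rows:
    row["SMP_DEPTH"] = to_depth(row["Sampling depth"])

print([row["SMP_DEPTH"] for row in rows])
```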

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    AddDepthCB()
    ])
tfm()
for grp in tfm.dfs.keys():  
    if 'SMP_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH']].drop_duplicates())
SEAWATER:        SMP_DEPTH
0            3.0
80           2.0
81          21.0
85          31.0
87          32.0
...          ...
16022       71.0
16023       66.0
16025       81.0
16385     1660.0
16389     1500.0

[134 rows x 1 columns]

Standardize Coordinates

The OSPAR dataset offers coordinates in degrees, minutes, and seconds (DMS). The following callback is designed to convert DMS to decimal degrees.


ConvertLonLatCB

 ConvertLonLatCB ()

Convert Coordinates to decimal degrees (DDD.DDDDD°).
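
The conversion itself is simple arithmetic: decimal degrees = degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. The function below is a hypothetical helper written for illustration, not the body of `ConvertLonLatCB`.

```python
def dms_to_decimal(degrees, minutes, seconds, direction):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if direction.strip().upper() in ("S", "W") else value

lat = dms_to_decimal(51, 22.0, 31.0, "N")
lon = dms_to_decimal(5, 56.0, 0.0, "W")
print(round(lat, 6), round(lon, 6))  # 51.375278 -5.933333
```

The two sample inputs are taken from the SEAWATER rows displayed below, so the printed values can be compared with the LAT and LON columns there.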

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB()
                            ])
tfm()

with pd.option_context('display.max_columns', None):
    display(tfm.dfs['SEAWATER'][['LAT','latd', 'latm', 'lats', 'LON', 'latdir', 'longd', 'longm','longs', 'longdir']])
LAT latd latm lats LON latdir longd longm longs longdir
0 51.375278 51 22.0 31.0 3.188056 N 3 11.0 17.0 E
1 51.223611 51 13.0 25.0 2.859444 N 2 51.0 34.0 E
2 51.184444 51 11.0 4.0 2.713611 N 2 42.0 49.0 E
3 51.420278 51 25.0 13.0 3.262222 N 3 15.0 44.0 E
4 51.416111 51 24.0 58.0 2.809722 N 2 48.0 35.0 E
... ... ... ... ... ... ... ... ... ... ...
19188 53.600000 53 36.0 0.0 -5.933333 N 5 56.0 0.0 W
19189 53.733333 53 44.0 0.0 -5.416667 N 5 25.0 0.0 W
19190 53.650000 53 39.0 0.0 -5.233333 N 5 14.0 0.0 W
19191 53.883333 53 53.0 0.0 -5.550000 N 5 33.0 0.0 W
19192 53.866667 53 52.0 0.0 -5.883333 N 5 53.0 0.0 W

19193 rows × 10 columns

SanitizeLonLatCB drops a row when both longitude and latitude equal 0 or when the coordinates contain unrealistic values. It also converts a `,` decimal separator to `.`.
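
These sanitization rules can be sketched in plain Python. The helpers `parse_coord` and `is_valid` are hypothetical, and reading "unrealistic" as latitude outside [-90, 90] or longitude outside [-180, 180] is an assumption rather than a statement about the callback's exact checks.

```python
def parse_coord(raw):
    """Accept numbers or strings that use ',' as the decimal separator."""
    return float(str(raw).replace(",", "."))

def is_valid(lat, lon):
    """Reject the (0, 0) placeholder and values outside the assumed valid ranges."""
    if lat == 0 and lon == 0:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180

rows = [
    {"LAT": "51,375278", "LON": "3,188056"},  # comma separator, valid
    {"LAT": 0, "LON": 0},                     # placeholder, dropped
    {"LAT": 95.0, "LON": 10.0},               # latitude out of range, dropped
    {"LAT": 53.6, "LON": -5.933333},          # valid
]

cleaned = []
for row in rows:
    lat, lon = parse_coord(row["LAT"]), parse_coord(row["LON"])
    if is_valid(lat, lon):
        cleaned.append({"LAT": lat, "LON": lon})

print(len(rows) - len(cleaned), "rows dropped")  # 2 rows dropped
```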

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

with pd.option_context('display.max_columns', None):
    display(tfm.dfs['SEAWATER'][['LAT','LON']])

Row Count Comparison Before and After Transformation:

BIOTA SEAWATER
Number of rows in original dataframes (dfs): 15951 19193
Number of rows in transformed dataframes (tfm.dfs): 15951 19193
Number of rows removed (tfm.dfs_removed): 0 0
LAT LON
0 51.375278 3.188056
1 51.223611 2.859444
2 51.184444 2.713611
3 51.420278 3.262222
4 51.416111 2.809722
... ... ...
19188 53.600000 -5.933333
19189 53.733333 -5.416667
19190 53.650000 -5.233333
19191 53.883333 -5.550000
19192 53.866667 -5.883333

19193 rows × 2 columns

Review all callbacks

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
10 invalid rows found in group 'SEAWATER' during time parsing callback (ParseTimeCB).
                                                    BIOTA  SEAWATER
Number of rows in original dataframes (dfs):        15951     19193
Number of rows in transformed dataframes (tfm.d...  15951     19183
Number of rows removed (tfm.dfs_removed):               0        10 

Example change logs

Review the change logs that will be recorded in the NetCDF encoding.

dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            ])

# Transform
tfm()
# Check transformation logs
tfm.logs
10 invalid rows found in group 'SEAWATER' during time parsing callback (ParseTimeCB).
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column.",
 'Remap data provider nuclide names to standardized MARIS nuclide names.',
 'Parse the time format in the dataframe and check for inconsistencies.',
 'Encode time as seconds since epoch.',
 'Sanitize value by removing blank entries and populating `value` column.',
 'Normalize uncertainty values in DataFrames.',
 "Callback to update DataFrame 'UNIT' columns based on a lookup table.",
 'Remap detection limit values to MARIS format using a lookup table.',
 "Remap values from 'species' to 'SPECIES' for groups: BIOTA.",
 "Remap values from 'biological group' to 'enhanced_species' for groups: BIOTA.",
 "Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met.",
 'Add a temporary column with the body part and biological group combined.',
 "Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA.",
 'Include a SMP_ID column from the ID column of OSPAR',
 "Ensure depth values are floats and add 'SMP_DEPTH' columns.",
 'Convert Coordinates to decimal degrees (DDD.DDDDD°).',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.']
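
The log entries above read like the docstrings of the callbacks, listed in execution order. The sketch below shows one way such a log could be collected. `MiniTransformer` and the two callbacks are hypothetical and far simpler than the marisco `Transformer`; recording docstrings as log entries is an assumption drawn from the output above.

```python
class StripNamesCB:
    "Strip surrounding whitespace from names."
    def __call__(self, data):
        data["names"] = [n.strip() for n in data["names"]]

class LowerNamesCB:
    "Convert names to lowercase."
    def __call__(self, data):
        data["names"] = [n.lower() for n in data["names"]]

class MiniTransformer:
    """Run callbacks in order and record each callback's docstring as a log entry."""
    def __init__(self, data, cbs):
        self.data, self.cbs, self.logs = dict(data), cbs, []

    def __call__(self):
        for cb in self.cbs:
            cb(self.data)                        # apply the transformation
            self.logs.append(type(cb).__doc__)   # keep a human-readable trace
        return self.data

mini = MiniTransformer({"names": [" Cs137 ", "K40"]}, cbs=[StripNamesCB(), LowerNamesCB()])
mini()
print(mini.logs)
```

Because the log is derived from the callbacks themselves, it stays in sync with the pipeline whenever a callback is added, removed, or reordered.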

Feed global attributes


get_attrs

 get_attrs (tfm:marisco.callbacks.Transformer, zotero_key:str,
            kw:list=['oceanography', 'Earth Science > Oceans > Ocean
            Chemistry> Radionuclides', 'Earth Science > Human Dimensions >
            Environmental Impacts > Nuclear Radiation Exposure', 'Earth
            Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth
            Science > Oceans > Marine Sediments', 'Earth Science > Oceans
            > Ocean Chemistry, Earth Science > Oceans > Sea Ice >
            Isotopes', 'Earth Science > Oceans > Water Quality > Ocean
            Contaminants', 'Earth Science > Biological Classification >
            Animals/Vertebrates > Fish', 'Earth Science > Biosphere >
            Ecosystems > Marine Ecosystems', 'Earth Science > Biological
            Classification > Animals/Invertebrates > Mollusks', 'Earth
            Science > Biological Classification > Animals/Invertebrates >
            Arthropods > Crustaceans', 'Earth Science > Biological
            Classification > Plants > Macroalgae (Seaweeds)'])

Retrieve all global attributes.

Type Default Details
tfm Transformer Transformer object
zotero_key str Zotero dataset record key
kw list [‘oceanography’, ‘Earth Science > Oceans > Ocean Chemistry> Radionuclides’, ‘Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure’, ‘Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments’, ‘Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes’, ‘Earth Science > Oceans > Water Quality > Ocean Contaminants’, ‘Earth Science > Biological Classification > Animals/Vertebrates > Fish’, ‘Earth Science > Biosphere > Ecosystems > Marine Ecosystems’, ‘Earth Science > Biological Classification > Animals/Invertebrates > Mollusks’, ‘Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans’, ‘Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)’] List of keywords
Returns dict Global attributes
get_attrs(tfm, zotero_key=zotero_key, kw=kw)
{'geospatial_lat_min': '49.43222222222222',
 'geospatial_lat_max': '81.26805555555555',
 'geospatial_lon_min': '-58.23166666666667',
 'geospatial_lon_max': '36.181666666666665',
 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 49.43222222222222 36.181666666666665, 49.43222222222222 81.26805555555555, -58.23166666666667 81.26805555555555, -58.23166666666667 36.181666666666665))',
 'geospatial_vertical_max': '1850.0',
 'geospatial_vertical_min': '0.0',
 'time_coverage_start': '1995-01-01T00:00:00',
 'time_coverage_end': '2022-12-31T00:00:00',
 'id': 'LQRA4MMK',
 'title': 'OSPAR Environmental Monitoring of Radioactive Substances',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column., Remap data provider nuclide names to standardized MARIS nuclide names., Parse the time format in the dataframe and check for inconsistencies., Encode time as seconds since epoch., Sanitize value by removing blank entries and populating `value` column., Normalize uncertainty values in DataFrames., Callback to update DataFrame 'UNIT' columns based on a lookup table., Remap detection limit values to MARIS format using a lookup table., Remap values from 'species' to 'SPECIES' for groups: BIOTA., Remap values from 'biological group' to 'enhanced_species' for groups: BIOTA., Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met., Add a temporary column with the body part and biological group combined., Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA., Include a SMP_ID column from the ID column of OSPAR, Ensure depth values are floats and add 'SMP_DEPTH' columns., Convert Coordinates to decimal degrees (DDD.DDDDD°)., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}
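
The extent-related attributes above are summaries of the transformed data. As a minimal sketch, assuming time is stored as seconds since the Unix epoch (as the EncodeTimeCB log entry suggests), they could be derived as follows. `extent_attrs` is a hypothetical helper, not the `get_attrs` implementation, and it leaves out the Zotero metadata and the bounds polygon.

```python
from datetime import datetime, timezone

def extent_attrs(lats, lons, times):
    """Summarize coordinates and epoch-second timestamps as string attributes."""
    def iso(ts):
        # Format an epoch timestamp as an ISO 8601 string in UTC.
        return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

    return {
        "geospatial_lat_min": str(min(lats)),
        "geospatial_lat_max": str(max(lats)),
        "geospatial_lon_min": str(min(lons)),
        "geospatial_lon_max": str(max(lons)),
        "time_coverage_start": iso(min(times)),
        "time_coverage_end": iso(max(times)),
    }

attrs = extent_attrs(
    lats=[51.375278, 53.6],
    lons=[3.188056, -5.933333],
    times=[788918400, 1672444800],  # 1995-01-01 and 2022-12-31 at 00:00 UTC
)
print(attrs["time_coverage_start"], attrs["time_coverage_end"])
```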

Encoding NetCDF


encode

 encode (fname_out_nc:str, **kwargs)

Encode data to NetCDF.

Type Details
fname_out_nc str Output file name
kwargs Additional arguments
Returns None
encode(fname_out_nc, verbose=False)
10 invalid rows found in group 'SEAWATER' during time parsing callback (ParseTimeCB).

NetCDF Review

First, let's review the global attributes of the NetCDF file:

contents = ExtractNetcdfContents(fname_out_nc)
print(contents.global_attrs)
{'id': 'LQRA4MMK', 'title': 'OSPAR Environmental Monitoring of Radioactive Substances', 'summary': '', 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)', 'history': 'TBD', 'keywords_vocabulary': 'GCMD Science Keywords', 'keywords_vocabulary_url': 'https://gcmd.earthdata.nasa.gov/static/kms/', 'record': 'TBD', 'featureType': 'TBD', 'cdm_data_type': 'TBD', 'Conventions': 'CF-1.10 ACDD-1.3', 'publisher_name': 'Paul MCGINNITY, Iolanda OSVATH, Florence DESCROIX-COMANDUCCI', 'publisher_email': 'p.mc-ginnity@iaea.org, i.osvath@iaea.org, F.Descroix-Comanducci@iaea.org', 'publisher_url': 'https://maris.iaea.org', 'publisher_institution': 'International Atomic Energy Agency - IAEA', 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]', 'institution': 'TBD', 'metadata_link': 'TBD', 'creator_email': 'TBD', 'creator_url': 'TBD', 'references': 'TBD', 'license': 'Without prejudice to the applicable Terms and Conditions (https://nucleus.iaea.org/Pages/Others/Disclaimer.aspx), I hereby agree that any use of the data will contain appropriate acknowledgement of the data source(s) and the IAEA Marine Radioactivity Information System (MARIS).', 'comment': 'TBD', 
'geospatial_lat_min': '49.43222222222222', 'geospatial_lon_min': '-58.23166666666667', 'geospatial_lat_max': '81.26805555555555', 'geospatial_lon_max': '36.181666666666665', 'geospatial_vertical_min': '0.0', 'geospatial_vertical_max': '1850.0', 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 49.43222222222222 36.181666666666665, 49.43222222222222 81.26805555555555, -58.23166666666667 81.26805555555555, -58.23166666666667 36.181666666666665))', 'geospatial_bounds_crs': 'EPSG:4326', 'time_coverage_start': '1995-01-01T00:00:00', 'time_coverage_end': '2022-12-31T00:00:00', 'local_time_zone': 'TBD', 'date_created': 'TBD', 'date_modified': 'TBD', 'publisher_postprocess_logs': "Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column., Remap data provider nuclide names to standardized MARIS nuclide names., Parse the time format in the dataframe and check for inconsistencies., Encode time as seconds since epoch., Sanitize value by removing blank entries and populating `value` column., Normalize uncertainty values in DataFrames., Callback to update DataFrame 'UNIT' columns based on a lookup table., Remap detection limit values to MARIS format using a lookup table., Remap values from 'species' to 'SPECIES' for groups: BIOTA., Remap values from 'biological group' to 'enhanced_species' for groups: BIOTA., Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met., Add a temporary column with the body part and biological group combined., Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA., Include a SMP_ID column from the ID column of OSPAR, Ensure depth values are floats and add 'SMP_DEPTH' columns., Convert Coordinates to decimal degrees (DDD.DDDDD°)., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}

Review the publisher_postprocess_logs.

print(contents.global_attrs['publisher_postprocess_logs'])
Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column., Remap data provider nuclide names to standardized MARIS nuclide names., Parse the time format in the dataframe and check for inconsistencies., Encode time as seconds since epoch., Sanitize value by removing blank entries and populating `value` column., Normalize uncertainty values in DataFrames., Callback to update DataFrame 'UNIT' columns based on a lookup table., Remap detection limit values to MARIS format using a lookup table., Remap values from 'species' to 'SPECIES' for groups: BIOTA., Remap values from 'biological group' to 'enhanced_species' for groups: BIOTA., Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met., Add a temporary column with the body part and biological group combined., Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA., Include a SMP_ID column from the ID column of OSPAR, Ensure depth values are floats and add 'SMP_DEPTH' columns., Convert Coordinates to decimal degrees (DDD.DDDDD°)., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.

Now let's review the enums of the groups in the NetCDF file:

print(contents.enum_dicts)
{'BIOTA': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': 
'132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}, 'species': {'NOT AVAILABLE': '0', 'Aristeus antennatus': '1', 'Apostichopus': '2', 'Saccharina japonica var religiosa': '3', 'Siganus fuscescens': '4', 'Alpheus dentipes': '5', 'Hexagrammos agrammus': '6', 'Ditrema temminckii': '7', 'Parapristipoma trilineatum': '8', 'Scombrops boops': '9', 'Pseudopleuronectes schrenki': '10', 'Desmarestia ligulata': '11', 'Saccharina japonica': '12', 'Neodilsea yendoana': '13', 'Costaria costata': '14', 'Sargassum yezoense': '15', 'Acanthephyra pelagica': '16', 'Sargassum ringgoldianum': '17', 'Acanthephyra quadrispinosa': '18', 'Sargassum thunbergii': '19', 'Sargassum patens': '20', 'Asterias rubens': '21', 'Sargassum miyabei': '22', 'Homarus gammarus': '23', 'Acanthephyra stylorostratis': '24', 'Acanthocybium solandri': '25', 'Acanthopagrus bifasciatus': '26', 'Acanthophora muscoides': '27', 'Acanthophora spicifera': '28', 'Acanthurus triostegus': '29', 'Actinopterygii': '30', 'Adamussium colbecki': '31', 'Ahnfeltiopsis densa': '32', 'Alepes melanoptera': '33', 'Ampharetidae': '34', 'Anchoviella lepidentostole': '35', 'Anguillidae': '36', 'Aphroditidae': '37', 'Arnoglossus': '38', 'Aurigequula fasciata': '39', 'Balaenoptera musculus': '40', 'Balaenoptera physalus': '41', 'Balistes': '42', 'Beryciformes': '43', 'Bryopsis maxima': '44', 'Callinectes sp': '45', 
'Callorhinus ursinus': '46', 'Carassius auratus auratus': '47', 'Carcharhinus sorrah': '48', 'Caridae': '49', 'Clupea harengus': '50', 'Cathorops spixii': '51', 'Caulerpa racemosa': '52', 'Caulerpa scalpelliformis': '53', 'Caulerpa sertularioides': '54', 'Cellana radiata': '55', 'Coscinasterias tenuispina': '56', 'Centroceras clavulatum': '57', 'Centropomus parallelus': '58', 'Crangon crangon': '59', 'Ceramium diaphanum': '60', 'Ceramium rubrum': '61', 'Chaenocephalus aceratus': '62', 'Chaetodipterus faber': '63', 'Chaetomorpha antennina': '64', 'Chaetomorpha linoides': '65', 'Chelidonichthys kumu': '66', 'Chelon ramada': '67', 'Chiloscyllium': '68', 'Chionodraco hamatus': '69', 'Chlamys islandica': '70', 'Chlorophyta': '71', 'Chondrichthyes': '72', 'Chrysaora': '73', 'Cladophora nitellopsis': '74', 'Cladophora vagabunda': '75', 'Cladophoropsis membranacea': '76', 'Clupea': '77', 'Coccotylus truncatus': '78', 'Codium fragile': '79', 'Crassostrea': '80', 'Cynoscion acoupa': '81', 'Cynoscion jamaicensis': '82', 'Cynoscion leiarchus': '83', 'Engraulis encrasicolus': '84', 'Cypselurus agoo agoo': '85', 'Cystophora cristata': '86', 'Cystoseira barbata': '87', 'Cystoseira crinita': '88', 'Decapodiformes': '89', 'Decapterus russelli': '90', 'Decapterus scombrinus': '91', 'Delphinapterus leucas': '92', 'Delphinus capensis': '93', 'Diapterus rhombeus': '94', 'Dicentrarchus punctatus': '95', 'Fucus vesiculosus': '96', 'Funchalia woodwardi': '97', 'Ecklonia bicyclis': '98', 'Gadus morhua': '99', 'Ecklonia kurome': '100', 'Gennadas elegans': '101', 'Eisenia arborea': '102', 'Encrasicholina devisi': '103', 'Enteromorpha': '104', 'Enteromorpha flexuosa': '105', 'Enteromorpha intestinalis': '106', 'Epinephelinae': '107', 'Epinephelus diacanthus': '108', 'Exocoetidae': '109', 'Saccharina latissima': '110', 'Gracilaria corticata': '111', 'Ligur ensiferus': '112', 'Gracilaria debilis': '113', 'Gracilaria edulis': '114', 'Gracilariales': '115', 'Grateloupia elliptica': '116', 
'Grateloupia filicina': '117', 'Lysmata seticaudata': '118', 'Gymnogongrus griffithsiae': '119', 'Mya arenaria': '120', 'Halichoerus grypus': '121', 'Macoma balthica': '122', 'Marthasterias glacialis': '123', 'Halimeda macroloba': '124', 'Harengula clupeola': '125', 'Harpagifer antarcticus': '126', 'Hemifusus ternatanus': '127', 'Hemiramphus brasiliensis': '128', 'Mytilus edulis': '129', 'Metapenaeus affinis': '130', 'Heteroscleromorpha': '131', 'Heterosigma akashiwo': '132', 'Hilsa ilisha': '133', 'Metapenaeus monoceros': '134', 'Metapenaeus stebbingi': '135', 'Holothuria': '136', 'Hoplobrotula armata': '137', 'Hypnea musciformis': '138', 'Merlangius merlangus': '139', 'Iridaea cordata': '140', 'Jania rubens': '141', 'Meganyctiphanes norvegica': '142', 'Johnius glaucus': '143', 'Kappaphycus': '144', 'Kappaphycus alvarezii': '145', 'Laevistrombus canarium': '146', 'Lagenodelphis hosei': '147', 'Lambia': '148', 'Laminaria japonica': '149', 'Laminaria longissima': '150', 'Larimus breviceps': '151', 'Laurencia papillosa': '152', 'Leiognathidae': '153', 'Leiognathus dussumieri': '154', 'Lepidochelys olivacea': '155', 'Leptonychotes weddellii': '156', 'Limanda yokohamae': '157', 'Nephrops norvegicus': '158', 'Neuston': '159', 'Littoraria undulata': '160', 'Loligo vulgaris': '161', 'Lumbrineridae': '162', 'Lutjanus fulviflamma': '163', 'Marginisporum aberrans': '164', 'Megalaspis cordyla': '165', 'Octopus vulgaris': '166', 'Menticirrhus americanus': '167', 'Mesoplodon densirostris': '168', 'Palaemon longirostris': '169', 'Metapenaeus brevicornis': '170', 'Pasiphaea multidentata': '171', 'Pasiphaea sivado': '172', 'Parapenaeopsis stylifera': '173', 'Miichthys miiuy': '174', 'Mirounga leonina': '175', 'Brachidontes striatulus': '176', 'Monodon monoceros': '177', 'Mugil platanus': '178', 'Penaeus semisulcatus': '179', 'Mullus barbatus': '180', 'Mycteroperca rubra': '181', 'Philocheras echinulatus': '182', 'Myelophycus simplex': '183', 'Mytilus coruscus': '184', 'Penaeus 
indicus': '185', 'Natator depressus': '186', 'Pandalus jordani': '187', 'Melicertus kerathurus': '188', 'Parapenaeus longirostris': '189', 'Plesionika': '190', 'Platichthys flesus': '191', 'Pleuronectes platessa': '192', 'Nematopalaemon tenuipes': '193', 'Nematoscelis difficilis': '194', 'Nemipterus': '195', 'Aegaeon lacazei': '196', 'Nephtyidae': '197', 'Nereididae': '198', 'Netuma bilineata': '199', 'Nibea maculata': '200', 'Oceana serrulata': '201', 'Palaemon serratus': '202', 'Ocypode': '203', 'Odobenus rosmarus': '204', 'Ogcocephalus vespertilio': '205', 'Oligoplites saurus': '206', 'Onuphidae': '207', 'Opheliidae': '208', 'Opisthonema oglinum': '209', 'Opisthopterus tardoore': '210', 'Orientomysis mitsukurii': '211', 'Otolithes cuvieri': '212', 'Padina pavonica': '213', 'Padina tetrastromatica': '214', 'Padina vickersiae': '215', 'Pagellus affinis': '216', 'Pagophilus groenlandicus': '217', 'Paguroidea': '218', 'Pagurus': '219', 'Systellaspis debilis': '220', 'Sergestes': '221', 'Sergestes arcticus': '222', 'Pampus argenteus': '223', 'Sergestes arachnipodus': '224', 'Sergestes henseni': '225', 'Sergestes prehensilis': '226', 'Sergestes robustus': '227', 'Pangasius pangasius': '228', 'Panulirus homarus': '229', 'Paracentrotus lividus': '230', 'Pasiphaea sp': '231', 'Pectinariidae': '232', 'Penaeus': '233', 'Phoca vitulina': '234', 'Photopectoralis bindus': '235', 'Phyllospadix iwatensis': '236', 'Plectorhinchus mediterraneus': '237', 'Pleuronectes mochigarei': '238', 'Pleuronectes obscurus': '239', 'Plocamium brasiliense': '240', 'Polynemus paradiseus': '241', 'Polysiphonia': '242', 'Sprattus sprattus': '243', 'Scomber scombrus': '244', 'Polysiphonia fucoides': '245', 'Gonostomatidae': '246', 'Perca fluviatilis': '247', 'Pomadasys crocro': '248', 'Porphyra tenera': '249', 'Potamogeton pectinatus': '250', 'Priacanthus hamrur': '251', 'Pseudorhombus malayanus': '252', 'Pterocladiella capillacea': '253', 'Pusa caspica': '254', 'Pusa sibirica': '255', 'Pylaiella 
littoralis': '256', 'Sabellidae': '257', 'Salangichthys ishikawae': '258', 'Sarconema filiforme': '259', 'Sardinella albella': '260', 'Sardinella brasiliensis': '261', 'Sardinops melanostictus': '262', 'Sargassum cymosum': '263', 'Sargassum linearifolium': '264', 'Sargassum micracanthum': '265', 'Xiphias gladius': '266', 'Sargassum novae hollandiae': '267', 'Sargassum oligocystum': '268', 'Esox lucius': '269', 'Limanda limanda': '270', 'Abramis brama': '271', 'Anguilla anguilla': '272', 'Arctica islandica': '273', 'Cerastoderma edule': '274', 'Cyprinus carpio': '275', 'Echinodermata': '276', 'Fish larvae': '277', 'Myoxocephalus scorpius': '278', 'Osmerus eperlanus': '279', 'Plankton': '280', 'Scophthalmus maximus': '281', 'Rhodophyta': '282', 'Rutilus rutilus': '283', 'Saduria entomon': '284', 'Sander lucioperca': '285', 'Gasterosteus aculeatus': '286', 'Zoarces viviparus': '287', 'Gymnocephalus cernua': '288', 'Furcellaria lumbricalis': '289', 'Cladophora glomerata': '290', 'Lateolabrax japonicus': '291', 'Okamejei kenojei': '292', 'Sebastes pachycephalus': '293', 'Squalus acanthias': '294', 'Gadus macrocephalus': '295', 'Paralichthys olivaceus': '296', 'Ovalipes punctatus': '297', 'Pseudopleuronectes yokohamae': '298', 'Hemitripterus villosus': '299', 'Clidoderma asperrimum': '300', 'Microstomus achne': '301', 'Lepidotrigla microptera': '302', 'Hexagrammos otakii': '303', 'Kareius bicoloratus': '304', 'Pleuronichthys cornutus': '305', 'Enteroctopus dofleini': '306', 'Ammodytes personatus': '307', 'Lophius litulon': '308', 'Eopsetta grigorjewi': '309', 'Takifugu porphyreus': '310', 'Loliolus japonica': '311', 'Sepia andreana': '312', 'Sebastes cheni': '313', 'Portunus trituberculatus': '314', 'Sebastes schlegelii': '315', 'Pennahia argentata': '316', 'Platichthys stellatus': '317', 'Gadus chalcogrammus': '318', 'Chelidonichthys spinosus': '319', 'Conger myriaster': '320', 'Heterololigo bleekeri': '321', 'Stichaeus grigorjewi': '322', 'Pseudopleuronectes 
herzensteini': '323', 'Octopus conispadiceus': '324', 'Hippoglossoides dubius': '325', 'Cleisthenes pinetorum': '326', 'Glyptocephalus stelleri': '327', 'Tanakius kitaharae': '328', 'Nibea mitsukurii': '329', 'Dasyatis matsubarai': '330', 'Verasper moseri': '331', 'Hemitrygon akajei': '332', 'Triakis scyllium': '333', 'Trachurus japonicus': '334', 'Zeus faber': '335', 'Pagrus major': '336', 'Acanthopagrus schlegelii': '337', 'Dentex tumifrons': '338', 'Mustelus manazo': '339', 'Seriola quinqueradiata': '340', 'Hyperoglyphe japonica': '341', 'Carcharhinus': '342', 'Platycephalus': '343', 'Scomber japonicus': '344', 'Squatina japonica': '345', 'Alopias pelagicus': '346', 'Zenopsis nebulosa': '347', 'Cynoglossus joyneri': '348', 'Verasper variegatus': '349', 'Oncorhynchus keta': '350', 'Physiculus japonicus': '351', 'Oplegnathus punctatus': '352', 'Arothron hispidus': '353', 'Stereolepis doederleini': '354', 'Takifugu snyderi': '355', 'Scomber australasicus': '356', 'Liparis tanakae': '357', 'Thamnaconus modestus': '358', 'Gnathophis nystromi': '359', 'Sebastes oblongus': '360', 'Sebastiscus marmoratus': '361', 'Takifugu pardalis': '362', 'Mugil cephalus': '363', 'Ditrema temminckii temminckii': '364', 'Konosirus punctatus': '365', 'Tribolodon brandtii': '366', 'Oncorhynchus masou': '367', 'Aluterus monoceros': '368', 'Todarodes pacificus': '369', 'Myoxocephalus stelleri': '370', 'Myliobatis tobijei': '371', 'Scyliorhinus torazame': '372', 'Lophiomus setigerus': '373', 'Heterodontus japonicus': '374', 'Sebastes vulpes': '375', 'Paraplagusia japonica': '376', 'Ostrea edulis': '377', 'Melanogrammus aeglefinus': '378', 'Pollachius virens': '379', 'Pollachius pollachius': '380', 'Sebastes marinus': '381', 'Anarhichas minor': '382', 'Anarhichas denticulatus': '383', 'Reinhardtius hippoglossoides': '384', 'Trisopterus esmarkii': '385', 'Micromesistius poutassou': '386', 'Coryphaenoides rupestris': '387', 'Argentina silus': '388', 'Salmo salar': '389', 'Sebastes viviparus': 
'390', 'Buccinum undatum': '391', 'Fucus serratus': '392', 'Merluccius merluccius': '393', 'Littorina littorea': '394', 'Fucus': '395', 'Rhodymenia': '396', 'Solea solea': '397', 'Trachurus trachurus': '398', 'Eutrigla gurnardus': '399', 'Pelvetia canaliculata': '400', 'Ascophyllum nodosum': '401', 'Mallotus villosus': '402', 'Pecten maximus': '403', 'Hippoglossoides platessoides': '404', 'Sebastes mentella': '405', 'Modiolus modiolus': '406', 'Boreogadus saida': '407', 'Sepia': '408', 'Gadus': '409', 'Sardina pilchardus': '410', 'Pleuronectiformes': '411', 'Molva molva': '412', 'Patella': '413', 'Crassostrea gigas': '414', 'Dasyatis pastinaca': '415', 'Lophius piscatorius': '416', 'Porphyra umbilicalis': '417', 'Patella vulgata': '418', 'Brosme brosme': '419', 'Glyptocephalus cynoglossus': '420', 'Galeus melastomus': '421', 'Chimaera monstrosa': '422', 'Etmopterus spinax': '423', 'Dicentrarchus labrax': '424', 'Osilinus lineatus': '425', 'Hippoglossus hippoglossus': '426', 'Cyclopterus lumpus': '427', 'Molva dypterygia': '428', 'Microstomus kitt': '429', 'Fucus distichus': '430', 'Tapes': '431', 'Sebastes norvegicus': '432', 'Phycis blennoides': '433', 'Fucus spiralis': '434', 'Laminaria digitata': '435', 'Dipturus batis': '436', 'Anarhichas lupus': '437', 'Lumpenus lampretaeformis': '438', 'Lycodes vahlii': '439', 'Argentina sphyraena': '440', 'Trisopterus minutus': '441', 'Thunnus': '442', 'Hyperoplus lanceolatus': '443', 'Gaidropsarus argentatus': '444', 'Engraulis japonicus': '445', 'Mytilus galloprovincialis': '446', 'Undaria pinnatifida': '447', 'Chlorophthalmus albatrossis': '448', 'Sargassum fusiforme': '449', 'Eisenia bicyclis': '450', 'Spisula sachalinensis': '451', 'Strongylocentrotus nudus': '452', 'Haliotis discus hannai': '453', 'Dexistes rikuzenius': '454', 'Ruditapes philippinarum': '455', 'Apostichopus japonicus': '456', 'Pterothrissus gissu': '457', 'Helicolenus hilgendorfii': '458', 'Buccinum isaotakii': '459', 'Neptunea intersculpta': '460', 
'Apostichopus nigripunctatus': '461', 'Sebastes thompsoni': '462', 'Oratosquilla oratoria': '463', 'Oncorhynchus kisutch': '464', 'Erimacrus isenbeckii': '465', 'Sillago japonica': '466', 'Trachysalambria curvirostris': '467', 'Mytilus unguiculatus': '468', 'Crassostrea nippona': '469', 'Laminariales': '470', 'Uroteuthis edulis': '471', 'Takifugu poecilonotus': '472', 'Neptunea arthritica': '473', 'Katsuwonus pelamis': '474', 'Doederleinia berycoides': '475', 'Metapenaeopsis dalei': '476', 'Seriola dumerili': '477', 'Pseudorhombus pentophthalmus': '478', 'Stephanolepis cirrhifer': '479', 'Cookeolus japonicus': '480', 'Panulirus japonicus': '481', 'Thunnus orientalis': '482', 'Halocynthia roretzi': '483', 'Etrumeus sadina': '484', 'Cololabis saira': '485', 'Coryphaena hippurus': '486', 'Sarda orientalis': '487', 'Octopus ocellatus': '488', 'Sardinops sagax': '489', 'Sphyraena pinguis': '490', 'Sebastes ventricosus': '491', 'Occella iburia': '492', 'Glossanodon semifasciatus': '493', 'Mizuhopecten yessoensis': '494', 'Neosalangichthys ishikawae': '495', 'Bothrocara tanakae': '496', 'Malacocottus zonurus': '497', 'Coelorinchus macrochir': '498', 'Neptunea constricta': '499', 'Beringius polynematicus': '500', 'Sebastes nivosus': '501', 'Pandalus eous': '502', 'Synaphobranchus kaupii': '503', 'Sebastolobus macrochir': '504', 'Marsupenaeus japonicus': '505', 'Japelion hirasei': '506', 'Pleurogrammus azonus': '507', 'Monostroma nitidum': '508', 'Atheresthes evermanni': '509', 'Takifugu rubripes': '510', 'Chionoecetes opilio': '511', 'Pandalopsis coccinata': '512', 'Chionoecetes japonicus': '513', 'Sebastes matsubarae': '514', 'Scombrops gilberti': '515', 'Hyporhamphus sajori': '516', 'Trichiurus lepturus': '517', 'Alcichthys elongatus': '518', 'Volutharpa perryi': '519', 'Mercenaria stimpsoni': '520', 'Berryteuthis magister': '521', 'Aptocyclus ventricosus': '522', 'Euphausia pacifica': '523', 'Salangichthys microdon': '524', 'Telmessus acutidens': '525', 'Ceratophyllum 
demersum': '526', 'Pandalus nipponensis': '527', 'Sebastes owstoni': '528', 'Cociella crocodilus': '529', 'Conger japonicus': '530', 'Sardinella zunasi': '531', 'Cheilopogon pinnatibarbatus japonicus': '532', 'Oplegnathus fasciatus': '533', 'Macridiscus aequilatera': '534', 'Repomucenus ornatipinnis': '535', 'Clupea pallasii': '536', 'Scorpaena neglecta': '537', 'Scomberomorus niphonius': '538', 'Leucopsarion petersii': '539', 'Sebastes scythropus': '540', 'Strongylura anastomella': '541', 'Laemonema longipes': '542', 'Fusitriton oregonensis': '543', 'Japelion pericochlion': '544', 'Sebastes steindachneri': '545', 'Auxis rochei': '546', 'Lobotes surinamensis': '547', 'Auxis thazard': '548', 'Chlorophthalmus borealis': '549', 'Etelis coruscans': '550', 'Sebastes inermis': '551', 'Cynoglossus interruptus': '552', 'Erilepis zonifer': '553', 'Tridentiger obscurus': '554', 'Caranx sexfasciatus': '555', 'Thunnus thynnus': '556', 'Takifugu stictonotus': '557', 'Euthynnus affinis': '558', 'Synagrops japonicus': '559', 'Okamejei schmidti': '560', 'Suggrundus meerdervoortii': '561', 'Sebastes baramenuke': '562', 'Pleurogrammus monopterygius': '563', 'Decapterus maruadsi': '564', 'Girella punctata': '565', 'Sphyraena japonica': '566', 'Ommastrephes bartramii': '567', 'Sepiella japonica': '568', 'Sepioteuthis lessoniana': '569', 'Eucleoteuthis luminosa': '570', 'Gloiopeltis furcata': '571', 'Macrobrachium nipponense': '572', 'Sepia kobiensis': '573', 'Eriocheir japonica': '574', 'Magallana nippona': '575', 'Meretrix lusoria': '576', 'Chondrus ocellatus': '577', 'Chondrus elatus': '578', 'Gloiopeltis': '579', 'Holothuroidea': '580', 'Corbicula japonica': '581', 'Sunetta menstrualis': '582', 'Pseudorhombus cinnamoneus': '583', 'Takifugu niphobles': '584', 'Lagocephalus gloveri': '585', 'Beryx splendens': '586', 'Parastichopus nigripunctatus': '587', 'Venerupis philippinarum': '588', 'Haliotis': '589', 'Liparis agassizii': '590', 'Seriola lalandi': '591', 'Niphon spinosus': 
'592', 'Pleuronichthys japonicus': '593', 'Sergia lucens': '594', 'Sphoeroides pachygaster': '595', 'Coryphaenoides acrolepis': '596', 'Pseudopleuronectes obscurus': '597', 'Pyropia yezoensis': '598', 'Isurus oxyrinchus': '599', 'Sargassum fulvellum': '600', 'Prionace glauca': '601', 'Kajikia audax': '602', 'Thunnus albacares': '603', 'Thunnus alalunga': '604', 'Thunnus obesus': '605', 'Lamna ditropis': '606', 'Glyptocidaris crenularis': '607', 'Asterias amurensis': '608', 'Sepiida': '609', 'Congridae': '610', 'Takifugu': '611', 'Sargassum horneri': '612', 'Haliotis discus': '613', 'Pleuronectidae': '614', 'Acanthogobius flavimanus': '615', 'Acanthogobius lactipes': '616', 'Pholis nebulosa': '617', 'Hemigrapsus penicillatus': '618', 'Palaemon paucidens': '619', 'Mysidae': '620', 'Zostera marina': '621', 'Ulva pertusa': '622', 'Gobiidae': '623', 'Atherinidae': '624', 'Tribolodon': '625', 'Alpheus': '626', 'Polychaeta': '627', 'Sebastes': '628', 'Charybdis japonica': '629', 'Hemigrapsus': '630', 'Favonigobius gymnauchen': '631', 'Palaemon': '632', 'Planiliza haematocheila': '633', 'Palaemonidae': '634', 'Pholis crassispina': '635', 'Laminaria': '636', 'Distolasterias nipon': '637', 'Lophiiformes': '638', 'Alpheus brevicristatus': '639', 'Undaria undariodes': '640', 'Neomysis awatschensis': '641', 'Alpheidae': '642', 'Macrobrachium': '643', 'Hediste': '644', 'Gymnogobius breunigii': '645', 'Luidia quinaria': '646', 'Rhizoprionodon acutus': '647', 'Carangoides equula': '648', 'Carcinoplax longimana': '649', 'Anomura': '650', 'Spatangoida': '651', 'Plesiobatis daviesi': '652', 'Eusphyra blochii': '653', 'Ruditapes variegata': '654', 'Sinonovacula constricta': '655', 'Penaeus monodon': '656', 'Litopenaeus vannamei': '657', 'Solenocera crassicornis': '658', 'Stomatopoda': '659', 'Teuthida': '660', 'Octopus': '661', 'Larimichthys polyactis': '662', 'Scomberomorini': '663', 'Channa argus': '664', 'Ranina ranina': '665', 'Lates calcarifer': '666', 'Scomberomorus commerson': 
'667', 'Lutjanus malabaricus': '668', 'Thenus parindicus': '669', 'Amusium pleuronectes': '670', 'Loligo': '671', 'Plectropomus leopardus': '672', 'Sillago ciliata': '673', 'Scylla serrata': '674', 'Pinctada maxima': '675', 'Lutjanus argentimaculatus': '676', 'Protonibea diacanthus': '677', 'Polydactylus macrochir': '678', 'Rachycentron canadum': '679', 'Ibacus peronii': '680', 'Arripis trutta': '681', 'Sarda australis': '682', 'Seriola hippos': '683', 'Choerodon schoenleinii': '684', 'Panulirus ornatus': '685', 'Neotrygon kuhlii': '686', 'Lethrinus nebulosus': '687', 'Parupeneus multifasciatus': '688', 'Saccostrea cucullata': '689', 'Lutjanus sebae': '690', 'Thunnus maccoyii': '691', 'Acanthopagrus butcheri': '692', 'Lambis lambis': '693', 'Gerres subfasciatus': '694', 'Zooplankton': '695', 'Phytoplankton': '696', 'Rapana venosa': '697', 'Scapharca inaequivalvis': '698', 'Ulva intestinalis': '699', 'Ulva linza': '700', 'Ceramium virgatum': '701', 'Gayralia oxysperma': '702', 'Vertebrata fucoides': '703', 'Stuckenia pectinata': '704', 'Rochia nilotica': '705', 'Ctenochaetus striatus': '706', 'Serranidae': '707', 'Turbo setosus': '708', 'Pandalidae': '709', 'Gymnosarda unicolor': '710', 'Epinephelini': '711', 'Pisces': '712', 'Liza klunzingeri': '713', 'Acanthopagrus latus': '714', 'Liza subviridis': '715', 'Sparidentex hasta': '716', 'Otolithes ruber': '717', 'Crenidens crenidens': '718', 'Ensis': '719', 'Gastropoda': '720', 'Euheterodonta': '721', 'Scomber': '722', 'Theragra chalcogramma': '723', 'Engraulidae': '724', 'Ostreidae': '725', 'Phaeophyceae': '726', 'Porphyra': '727', 'Ulva reticulata': '728', 'Perna viridis': '729', 'Fenneropenaeus indicus': '730', 'Merluccius': '731', 'Soleidae': '732', 'Mugilidae': '733', 'Marine algae': '734', 'Scarus rivulatus': '735', 'Scarus coeruleus': '736', 'Sardinella fimbriata': '737', 'Dussumieria acuta': '738', 'Lutjanus kasmira': '739', 'Lutjanus rivulatus': '740', 'Lutjanus bohar': '741', 'Priacanthus blochii': '742', 
'Pelates quadrilineatus': '743', 'Epinephelus fasciatus': '744', 'Upeneus vittatus': '745', 'Lethrinus laticaudis': '746', 'Lethrinus lentjan': '747', 'Lethrinus microdon': '748', 'Sphyraena barracuda': '749', 'Alectis indica': '750', 'Epinephelus latifasciatus': '751', 'Nemipterus japonicus': '752', 'Raconda russeliana': '753', 'Lactarius lactarius': '754', 'Aetomylaeus bovinus': '755', 'Pennahia anea': '756', 'Leiognathus fasciatus': '757', 'Sardinella longiceps': '758', 'Tenualosa ilisha': '759', 'Pellona ditchela': '760', 'Stolephorus indicus': '761', 'Setipinna breviceps': '762', 'Rastrelliger kanagurta': '763', 'Chanos chanos': '764', 'Lepturacanthus savala': '765', 'Epinephelus niveatus': '766', 'Lutjanus johnii': '767', 'Carangoides malabaricus': '768', 'Ablennes hians': '769', 'Chirocentrus dorab': '770', 'Scomberomorus cavalla': '771', 'Scomberomorus semifasciatus': '772', 'Scomberomorus guttatus': '773', 'Etrumeus teres': '774', 'Spondyliosoma cantharus': '775', 'Brama brama': '776', 'Dasyatis zugei': '777', 'Harpadon nehereus': '778', 'Carcharhinus melanopterus': '779', 'Penaeus plebejus': '780', 'Sepia officinalis': '781', 'Johnius dussumieri': '782', 'Lutjanus campechanus': '783', 'Ruditapes decussatus': '784', 'Carcinus aestuarii': '785', 'Squilla mantis': '786', 'Epinephelus polyphekadion': '787', 'Lutjanus gibbus': '788', 'Lethrinus mahsena': '789', 'Epinephelus chlorostigma': '790', 'Carangoides bajad': '791', 'Aethaloperca rogaa': '792', 'Atule mate': '793', 'Macolor niger': '794', 'Carangoides fulvoguttatus': '795', 'Plectropomus areolatus': '796', 'Cephalopholis argus': '797', 'Cephalopholis': '798', 'Scarus sordidus': '799', 'Scomberomorus tritor': '800', 'Triaenodon obesus': '801', 'Pomadasys commersonnii': '802', 'Monotaxis grandoculis': '803', 'Plectropomus maculatus': '804', 'Trachinotus blochii': '805', 'Pristipomoides filamentosus': '806', 'Acanthurus gahhm': '807', 'Acanthurus sohal': '808', 'Siganus argenteus': '809', 'Naso unicornis': 
'810', 'Chanos': '811', 'Oedalechilus labiosus': '812', 'Plectorhinchus gaterinus': '813', 'Mercenaria mercenaria': '814', 'Mytilus': '815', 'Turbo cornutus': '816', 'Decapoda': '817', 'Sphyraena': '818', 'Arius maculatus': '819', 'Penaeus merguiensis': '820', 'Tegillarca granosa': '821', 'Mullus barbatus barbatus': '822', 'Chamelea gallina': '823', 'Metanephrops thomsoni': '824', 'Magallana gigas': '825', 'Branchiostegus japonicus': '826', 'Cephalopoda': '827', 'Lutjanidae': '828', 'Lethrinidae': '829', 'Sphyraena argentea': '830', 'Chirocentrus nudus': '831', 'Trachinotus': '832', 'Mugil auratus': '833', 'Euthynnus alletteratus': '834', 'Sparus aurata': '835', 'Pagrus caeruleostictus': '836', 'Scorpaena scrofa': '837', 'Pagellus erythrinus': '838', 'Epinephelus aeneus': '839', 'Dentex maroccanus': '840', 'Caranx rhonchus': '841', 'Sardinella': '842', 'Siganus': '843', 'Solea': '844', 'Diplodus sargus': '845', 'Lithognathus mormyrus': '846', 'Oblada melanura': '847', 'Siganus rivulatus': '848', 'Chelon labrosus': '849', 'Cynoscion microlepidotus': '850', 'Genypterus brasiliensis': '851', 'Myoxocephalus polyacanthocephalus': '852', 'Hexagrammos lagocephalus': '853', 'Hexagrammos decagrammus': '854', 'Sebastes ciliatus': '855', 'Lepidopsetta polyxystra': '856', 'Clupeiformes': '857', 'Gadidae': '858', 'Brachyura': '859', 'Dasyatis': '860', 'Carcharias': '861', 'Saurida': '862', 'Upeneus': '863', 'Cynoglossus': '864', 'Scomberomorus': '865', 'Terapon': '866', 'Leiognathus': '867', 'Terapontidae': '868', 'Caranx': '869', 'Diplodus': '870', 'Plectorhinchus flavomaculatus': '871', 'Salmonidae': '872', 'Mollusca': '873', 'Boops boops': '874', 'Sarpa salpa': '875', 'Pagellus acarne': '876', 'Spicara smaris': '877', 'Diplodus vulgaris': '878', 'Chelidonichthys lucerna': '879', 'Sarda sarda': '880', 'Serranus cabrilla': '881', 'Diplodus annularis': '882', 'Pagrus pagrus': '883', 'Alosa fallax': '884', 'Belone belone': '885', 'Dentex dentex': '886', 'Sphyraena viridensis': 
'887', 'Trisopterus capelanus': '888', 'Arnoglossus laterna': '889', 'Procambarus clarkii': '890', 'Nemadactylus macropterus': '891', 'Pagrus auratus': '892', 'Jasus edwardsii': '893', 'Perna canaliculus': '894', 'Pseudophycis bachus': '895', 'Haliotis iris': '896', 'Hoplostethus atlanticus': '897', 'Rhombosolea leporina': '898', 'Zygochlamys delicatula': '899', 'Galeorhinus galeus': '900', 'Parapercis colias': '901', 'Tiostrea chilensis': '902', 'Genypterus blacodes': '903', 'Evechinus chloroticus': '904', 'Austrovenus stutchburyi': '905', 'Micromesistius australis': '906', 'Macruronus novaezelandiae': '907', 'Nototodarus': '908', 'Perna perna': '909', 'Sepia pharaonis': '910', 'Turbo bruneus': '911', 'Portunus sanguinolentus': '912', 'Charybdis natator': '913', 'Charybdis lucifera': '914', 'Panulirus argus': '915', 'Ethmalosa fimbriata': '916', 'Sardinella brachysoma': '917', 'Thryssa mystax': '918', 'Plicofollis dussumieri': '919', 'Nibea soldado': '920', 'Epinephelus melanostigma': '921', 'Megalops cyprinoides': '922', 'Decapterus macarellus': '923', 'Drepane punctata': '924', 'Sillago sihama': '925', 'Tylosurus crocodilus crocodilus': '926', 'Saurida tumbil': '927', 'Cynoglossus macrostomus': '928', 'Parupeneus indicus': '929', 'Synechogobius hasta': '930', 'Busycotypus canaliculatus': '931', 'Pampus cinereus': '932', 'Pomadasys kaakan': '933', 'Epinephelus coioides': '934', 'Sepiella inermis': '935', 'Uroteuthis duvauceli': '936', 'Stomatella auricula': '937', 'Cerithium scabridum': '938', 'Marcia recens': '939', 'Circe intermedia': '940', 'Marcia opima': '941', 'Fulvia fragile': '942', 'Charybdis feriatus': '943', 'Charybdis annulata': '944', 'Atergatis integerrimus': '945', 'Matuta lunaris': '946', 'Calappa lophos': '947', 'Uca annulipes': '948', 'Chlamys varia': '949', 'Cololabis adocetus': '950', 'Seriola lalandi dorsalis': '951', 'Brunneifusus ternatanus': '952', 'Metapenaeus joyneri': '953', 'Epinephelus tauvina': '954', 'Coilia dussumieri': '955', 
'Carcharhinus dussumieri': '956', 'Upeneus tragula': '957', 'Sartoriana spinigera': '958', 'Lamellidens marginalis': '959', 'Polydactylus sextarius': '960', 'Johnius macrorhynus': '961', 'Hexanematichthys sagor': '962', 'Sargassum swartzii': '963', 'Argyrops spinifer': '964', 'Synodus intermedius': '965', 'Muraenesox cinereus': '966', 'Carangoides armatus': '967', 'Eleutheronema tetradactylum': '968', 'Mustelus mosis': '969', 'Nemipterus bipunctatus': '970', 'Lutjanus quinquelineatus': '971', 'Platycephalus indicus': '972', 'Rhabdosargus haffara': '973', 'Argyrops filamentosus': '974', 'Brachirus orientalis': '975', 'Mene maculata': '976', 'Hemiramphus marginatus': '977', 'Encrasicholina heteroloba': '978', 'Trachinotus africanus': '979', 'Bramidae': '980', 'Escualosa thoracata': '981', 'Sepia arabica': '982', 'Scatophagus argus': '983', 'Parastromateus niger': '984', 'Planiliza subviridis': '985', 'Labeo rohita': '986', 'Oreochromis niloticus': '987', 'Cardiidae': '988', 'Sargassum angustifolium': '989', 'Pomacea bridgesii': '990', 'Sebastes fasciatus': '991', 'Batoidea': '992', 'Urophycis chuss': '993', 'Dalatias licha': '994', 'Trisopterus luscus': '995', 'Scyliorhinus canicula': '996', 'Ruvettus pretiosus': '997', 'Aphanopus carbo': '998', 'Alepocephalus bairdii': '999', 'Centroscymnus coelolepis': '1000', 'Loligo forbesii': '1001', 'Lutjanus cyanopterus': '1002', 'Mugil liza': '1003', 'Micropogonias furnieri': '1004', 'Balistes capriscus': '1005', 'Haemulidae': '1006', 'Stenotomus caprinus': '1007', 'Hemanthias leptus': '1008', 'Micropogonias undulatus': '1009', 'Cynoscion nebulosus': '1010', 'Rhomboplites aurorubens': '1011', 'Bothidae': '1012', 'Pogonias cromis': '1013', 'Lutjanus synagris': '1014', 'Netuma thalassina': '1015', 'Sillaginopsis panijus': '1016', 'Leptomelanosoma indicum': '1017', 'Therapon': '1018', 'Pterotolithus maculatus': '1019', 'Ilisha filigera': '1020', 'Hilsa kelee': '1021', 'Pampus chinensis': '1022', 'Palaemon styliferus': '1023', 
'Argyrosomus regius': '1024', 'Lutjanus': '1025', 'Sciades': '1026', 'Mullus': '1027', 'Albula vulpes': '1028', 'Selar crumenophthalmus': '1029', 'Centropomus': '1030', 'Sardinella aurita': '1031', 'Harengula humeralis': '1032', 'Diapterus auratus': '1033', 'Gerres cinereus': '1034', 'Haemulon parra': '1035', 'Ocyurus chrysurus': '1036', 'Sphyraena guachancho': '1037', 'Anoplopoma fimbria': '1038', 'Nerita versicolor': '1039', 'Bulla striata': '1040', 'Melongena melongena': '1041', 'Trachycardium muricatum': '1042', 'Isognomon alatus': '1043', 'Brachidontes exustus': '1044', 'Crassostrea virginica': '1045', 'Protothaca granulata': '1046', 'Cittarium pica': '1047', 'Penaeus schmitti': '1048', 'Penaeus notialis': '1049', 'Callinectes sapidus': '1050', 'Callinectes danae': '1051', 'Dasyatidae': '1052', 'Caridea': '1053', 'Nephropidae': '1054', 'Sparus': '1055', 'Sargassum boveanum': '1056', 'Haliotis tuberculata': '1057', 'Littorinidae': '1058', 'Seaweed': '1059', 'Echinoidea': '1060', 'Ostreida': '1061', 'Donax trunculus': '1062', 'Scrobicularia plana': '1063', 'Venus verrucosa': '1064', 'Solen marginatus': '1065', 'Testudines': '1066', 'Mullidae': '1067', 'Amphipoda': '1068', 'Cystosphaera jacquinotii': '1069', 'Daption capense': '1070', 'Desmarestia anceps': '1071', 'Himantothallus grandifolius': '1072', 'Mirounga': '1073', 'Nacella concinna': '1074', 'Notothenia coriiceps': '1075', 'Pygoscelis antarcticus': '1076', 'Pygoscelis papua': '1077', 'Oncorhynchus gorbuscha': '1078', 'Oncorhynchus mykiss': '1079', 'Oncorhynchus nerka': '1080', 'Oncorhynchus tshawytscha': '1081', 'Erignathus barbatus': '1082', 'Pusa hispida': '1083', 'Hippoglossus stenolepis': '1084', 'Squalus suckleyi': '1085', 'Sargassum': '1086', 'Codium': '1087', 'Membranoptera alata': '1088', 'Dictyota dichotoma': '1089', 'Plocamium cartilagineum': '1090', 'Galatea paradoxa': '1091', 'Crassostrea tulipa': '1092', 'Macrobrachium sp': '1093', 'Portunus': '1094', 'Tympanotonos fuscatus': '1095', 'Thais': 
'1096', 'Bivalvia': '1097', 'Cynoglossus senegalensis': '1098', 'Carlarius heudelotii': '1099', 'Fontitrygon margarita': '1100', 'Chrysichthys nigrodigitatus': '1101', 'Acanthephyra purpurea': '1102', 'Actinauge abyssorum': '1103', 'Alaria marginata': '1104', 'Anadara transversa': '1105', 'Anthomedusae': '1106', 'Archosargus probatocephalus': '1107', 'Argyropelecus aculeatus': '1108', 'Ariopsis felis': '1109', 'Astrometis sertulifera': '1110', 'Astropecten': '1111', 'Atherina breviceps': '1112', 'Atolla': '1113', 'Aulacomya atra': '1114', 'Auxis rochei rochei': '1115', 'Auxis thazard thazard': '1116', 'Avicennia marina': '1117', 'Balaena mysticetus': '1118', 'Balaenoptera acutorostrata': '1119', 'Balanus': '1120', 'Berardius bairdii': '1121', 'Beroe': '1122', 'Boopsoidea inornata': '1123', 'Calanoida': '1124', 'Calanus finmarchicus finmarchicus': '1125', 'Callorhinchus milii': '1126', 'Cepphus columba': '1127', 'Cladonia rangiferina': '1128', 'Clinus superciliosus': '1129', 'Codium tomentosum': '1130', 'Copepoda': '1131', 'Coregonus autumnalis': '1132', 'Coregonus nasus': '1133', 'Coregonus sardinella': '1134', 'Coryphaenoides armatus': '1135', 'Coryphoblennius galerita': '1136', 'Creseis sp': '1137', 'Crinoidea': '1138', 'Crossota': '1139', 'Cryptochiton stelleri': '1140', 'Delphinus delphis': '1141', 'Diacria': '1142', 'Dichistius capensis': '1143', 'Dosinia alta': '1144', 'Dugong dugon': '1145', 'Electrona risso': '1146', 'Engraulis capensis': '1147', 'Ensis siliqua': '1148', 'Eryonidae': '1149', 'Eualaria fistulosa': '1150', 'Eupasiphae gilesii': '1151', 'Euphausiacea': '1152', 'Euphausiidae': '1153', 'Eurypharynx pelecanoides': '1154', 'Eurythenes gryllus': '1155', 'Euthynnus lineatus': '1156', 'Fratercula cirrhata': '1157', 'Galeichthys feliceps': '1158', 'Gelidium corneum': '1159', 'Gibbula umbilicalis': '1160', 'Gnathophausia ingens': '1161', 'Gonatus fabricii': '1162', 'Haliaeetus leucocephalus': '1163', 'Haliclona': '1164', 'Halodule uninervis': '1165', 
'Hemilepidotus': '1166', 'Hemilepidotus jordani': '1167', 'Heterocarpus ensifer': '1168', 'Heterodontus portusjacksoni': '1169', 'Hippasteria phrygiana': '1170', 'Homola barbata': '1171', 'Hyperoodon planifrons': '1172', 'Hypleurochilus geminatus': '1173', 'Invertebrata': '1174', 'Isognomon bicolor': '1175', 'Isopoda': '1176', 'Kogia breviceps': '1177', 'Labrus bergylta': '1178', 'Lagenorhynchus obliquidens': '1179', 'Lampris guttatus': '1180', 'Larus glaucescens': '1181', 'Leander serratus': '1182', 'Libinia emarginata': '1183', 'Lichia amia': '1184', 'Lipophrys pholis': '1185', 'Lipophrys trigloides': '1186', 'Lithognathus lithognathus': '1187', 'Lithophaga aristata': '1188', 'Lobianchia gemellarii': '1189', 'Loliginidae': '1190', 'Loligo reynaudii': '1191', 'Lophius budegassa': '1192', 'Magallana angulata': '1193', 'Majoidea': '1194', 'Megachasma pelagios': '1195', 'Megaptera novaeangliae': '1196', 'Menippe mercenaria': '1197', 'Mesoplodon carlhubbsi': '1198', 'Mesoplodon stejnegeri': '1199', 'Microstomus pacificus': '1200', 'Morone saxatilis': '1201', 'Mullus surmuletus': '1202', 'Mycteroperca xenarcha': '1203', 'Myliobatis australis': '1204', 'Mysida': '1205', 'Mytilus californianus': '1206', 'Mytilus trossulus': '1207', 'Nephasoma Nephasoma flagriferum': '1208', 'Nudibranchia': '1209', 'Odobenus rosmarus divergens': '1210', 'Ommastrephidae': '1211', 'Ophiomusa lymani': '1212', 'Ophiothrix lineata': '1213', 'Orcinus orca': '1214', 'Ostracoda': '1215', 'Pagellus bogaraveo': '1216', 'Pandalus borealis': '1217', 'Paphies subtriangulata': '1218', 'Parabrotula': '1219', 'Paracalanus': '1220', 'Patella aspera': '1221', 'Periphylla': '1222', 'Phocoena phocoena': '1223', 'Phocoenoides dalli': '1224', 'Phronima': '1225', 'Physeter macrocephalus': '1226', 'Pinctada radiata': '1227', 'Plesionika edwardsii': '1228', 'Pododesmus macrochisma': '1229', 'Pomatomus saltatrix': '1230', 'Portunus pelagicus': '1231', 'Praunus': '1232', 'Pyrosoma': '1233', 'Rangifer tarandus': 
'1234', 'Rhabdosargus globiceps': '1235', 'Saccorhiza polyschides': '1236', 'Sagitta': '1237', 'Salpa': '1238', 'Salvelinus alpinus': '1239', 'Salvelinus malma': '1240', 'Sarda chiliensis': '1241', 'Sargassum aquifolium': '1242', 'Scalibregmatidae': '1243', 'Sebastes alutus': '1244', 'Sebastes melanops': '1245', 'Seriola dorsalis': '1246', 'Serranus scriba': '1247', 'Sigmops bathyphilus': '1248', 'Silicula fragilis': '1249', 'Sipunculidae': '1250', 'Somateria mollissima': '1251', 'Somateria spectabilis': '1252', 'Sparodon durbanensis': '1253', 'Spicara maena': '1254', 'Squatina australis': '1255', 'Striostrea margaritacea': '1256', 'Stromateus fiatola': '1257', 'Strongylocentrotus polyacanthus': '1258', 'Taractichthys steindachneri': '1259', 'Tectura scutum': '1260', 'Tegula viridula': '1261', 'Thais haemastoma': '1262', 'Thegrefg': '1263', 'Themisto': '1264', 'Thunnus tonggol': '1265', 'Trachurus picturatus': '1266', 'Trachurus symmetricus': '1267', 'Trygonorrhina fasciata': '1268', 'Ulva lactuca': '1269', 'Ursus maritimus': '1270', 'Vampyroteuthis infernalis': '1271', 'Ziphius cavirostris': '1272', 'Alepes kleinii': '1273', 'Alepes vari': '1274', 'Decapterus macrosoma': '1275', 'Lutjanus madras': '1276', 'Lutjanus russellii': '1277', 'Rastrelliger brachysoma': '1278', 'Rastrelliger faughni': '1279', 'Selar boops': '1280', 'Selaroides leptolepis': '1281', 'Sphyraena obtusata': '1282', 'Geloina expansa': '1283', 'Caesio erythrogaster': '1284', 'Euristhmus microceps': '1285', 'Pomacanthus annularis': '1286', 'Scylla': '1287', 'Plotosus lineatus': '1288', 'Prionotus stephanophrys': '1289', 'Trachurus murphyi': '1290', 'Dosidicus gigas': '1291', 'Sarda chiliensis chiliensis': '1292', 'Cynoscion analis': '1293', 'Merluccius gayi peruanus': '1294', 'Brotula ordwayi': '1295', 'Loligo gahi': '1296', 'Merluccius gayi': '1297', 'Ophichthus remiger': '1298', 'Penaeus sp': '1299', 'Trachinotus paitensis': '1300', 'Cheilopogon heterurus': '1301', 'Engraulis ringens': '1302', 
'Sciaena deliciosa': '1303', 'Isacia conceptionis': '1304', 'Odontesthes regia': '1305', 'Bodianus diplotaenia': '1306', 'Concholepas concholepas': '1307', 'Diplectrum conceptione': '1308', 'Genypterus maculatus': '1309', 'Labrisomus philippii': '1310', 'Paralabrax humeralis': '1311', 'Prionotus horrens': '1312', 'Dasyatis akajei': '1313', 'Arctoscopus japonicus': '1314', 'Sepia esculenta': '1315', 'Bothrocara hollandi': '1316', 'Cynoglossidae': '1317', 'Lepidotrigla': '1318', 'Lepidotrigla alata': '1319', 'Octopus sinensis': '1320', 'Rhabdosargus sarba': '1321', 'Lophiidae': '1322', 'Muraenesox': '1323', 'Physiculus maximowiczi': '1324', 'Pleuronectoidei': '1325', 'Sciaenidae': '1326', 'Triglidae': '1327', 'Atherina presbyter': '1328', 'Bentheogennema intermedia': '1329', 'Benthesicymidae': '1330', 'Benthesicymus': '1331', 'Buccinum striatissimum': '1332', 'Callinectes': '1333', 'Cancer pagurus': '1334', 'Chaetognatha': '1335', 'Chama macerophylla': '1336', 'Cirripedia': '1337', 'Cyclosalpa': '1338', 'Cymopolia barbata': '1339', 'Cynoscion': '1340', 'Cystoseira amentacea': '1341', 'Ectocarpus siliculosus': '1342', 'Ellisolandia elongata': '1343', 'Enteromorpha linza': '1344', 'Euphausia superba': '1345', 'Gaidropsarus mediterraneus': '1346', 'Gennadas valens': '1347', 'Globicephala': '1348', 'Haliptilon virgatum': '1349', 'Halocynthia aurantium': '1350', 'Heliocidaris crassispina': '1351', 'Hymenodora gracilis': '1352', 'Lagodon rhomboides': '1353', 'Lepas Anatifa anatifera': '1354', 'Lobophora variegata': '1355', 'Macrocystis pyrifera': '1356', 'Maculabatis gerrardi': '1357', 'Nemacystus decipiens': '1358', 'Neptunea polycostata': '1359', 'Padina pavonia': '1360', 'Penaeidae': '1361', 'Petricolinae': '1362', 'Polynemidae': '1363', 'Pristipomoides aquilonaris': '1364', 'Pyropia fallax': '1365', 'Radiolaria': '1366', 'Salpidae': '1367', 'Sardinops melanosticta': '1368', 'Sargassum vulgare': '1369', 'Sciaena umbra': '1370', 'Scorpaena porcus': '1371', 'Sergestidae': 
'1372', 'Sicyonia brevirostris': '1373', 'Sphaerococcus coronopifolius': '1374', 'Stenella coeruleoalba': '1375', 'Stichopus japonicus': '1376', 'Thalia democratica': '1377', 'Themisto gaudichaudii': '1378', 'Undaria': '1379', 'Analipus japonicus': '1380', 'Sargassum yamadae': '1381', 'Ahnfeltiopsis paradoxa': '1382', 'Scytosiphon lomentaria': '1383', 'Chondria crassicaulis': '1384', 'Grateloupia lanceolata': '1385', 'Colpomenia sinuosa': '1386', 'Chondrus giganteus': '1387', 'Sargassum muticum': '1388', 'Ulva prolifera': '1389', 'Petalonia fascia': '1390', 'Balanus roseus': '1391', 'Chaetomorpha moniligera': '1392', 'Lomentaria hakodatensis': '1393', 'Neodilsea longissima': '1394', 'Polyopes affinis': '1395', 'Schizymenia dubyi': '1396', 'Dictyopteris pacifica': '1397', 'Ahnfeltiopsis flabelliformis': '1398', 'Bangia fuscopurpurea': '1399', 'Calliarthron': '1400', 'Cladophora': '1401', 'Cladophora albida': '1402', 'Dasya sessilis': '1403', 'Delesseria serrulata': '1404', 'Ecklonia cava': '1405', 'Gelidium elegans': '1406', 'Grateloupia turuturu': '1407', 'Hypnea asiatica': '1408', 'Mazzaella japonica': '1409', 'Pachydictyon coriaceum': '1410', 'Padina arborescens': '1411', 'Pterosiphonia pinnulata': '1412', 'Alatocladia yessoensis': '1413', 'Bryopsis plumosa': '1414', 'Ceramium kondoi': '1415', 'Chondracanthus intermedius': '1416', 'Codium contractum': '1417', 'Codium lucasii': '1418', 'Corallina pilulifera': '1419', 'Dictyopteris undulata': '1420', 'Gastroclonium pacificum': '1421', 'Gelidium amansii': '1422', 'Grateloupia sparsa': '1423', 'Laurencia okamurae': '1424', 'Leathesia marina': '1425', 'Lomentaria catenata': '1426', 'Meristotheca papulosa': '1427', 'Sargassum confusum': '1428', 'Sargassum siliquastrum': '1429', 'Tinocladia crassa': '1430', 'Saccharina yendoana': '1431', 'Thalassiophyllum clathrus': '1432', 'Mytilida': '1433', 'Pteriomorphia': '1434', 'Conger': '1435', 'Scyliorhinidae': '1436', 'Labrus': '1437', 'Algae': '1438', 'Necora puber': '1439', 
'Anguilla': '1440', 'Rajidae': '1441', 'Buccinidae': '1442', 'Crustacea': '1443', 'Green algae': '1444', 'Ammodytes japonicus': '1445', 'Evynnis tumifrons': '1446', 'Gnathophis nystromi nystromi': '1447', 'Loligo bleekeri': '1448', 'Platichthys bicoloratus': '1449', 'Limanda punctatissima': '1450', 'Loliolus Nipponololigo japonica': '1451', 'Acanthopagrus schlegelii schlegelii': '1452', 'Sepiolina': '1453', 'Gelidium': '1454', 'Atrina pectinata': '1455', 'Echinocardium cordatum': '1456', 'Lamnidae': '1457', 'Meretrix lamarckii': '1458', 'Noctiluca scintillans': '1459', 'Philine argentata': '1460', 'Sergestes lucens': '1461', 'Corbicula sandai': '1462', 'Ulva': '1463', 'Actiniaria': '1464', 'Ctenopharyngodon idella': '1465', 'Ophiuroidea': '1466', 'Scomberoides lysan': '1467', 'Scomberoides tol': '1468', 'Sebastolobus': '1469', 'Selachimorpha': '1470', 'Selene setapinnis': '1471', 'Selene vomer': '1472', 'Sepia elliptica': '1473', 'Sergestes sp': '1474', 'Setipinna taty': '1475', 'Siganus canaliculatus': '1476', 'Sigmops gracile': '1477', 'Solenocera sp': '1478', 'Sparidae': '1479', 'Spermatophytina': '1480', 'Sphoeroides testudineus': '1481', 'Sphyraena jello': '1482', 'Spyridia hypnoides': '1483', 'Squaliformes': '1484', 'Squillidae': '1485', 'Stegophiura sladeni': '1486', 'Stenella longirostris': '1487', 'Stenobrachius leucopsarus': '1488', 'Sternaspidae': '1489', 'Stoechospermum polypodioides': '1490', 'Stolephorus commersonnii': '1491', 'Stromateus cinereus': '1492', 'Stromateus niger': '1493', 'Stromateus sinensis': '1494', 'Synidotea': '1495', 'Takifugu vermicularis': '1496', 'Telatrygon zugei': '1497', 'Terapon jarbua': '1498', 'Terebellidae': '1499', 'Thryssa dussumieri': '1500', 'Thunnini': '1501', 'Tibia curta': '1502', 'Tonna dolium': '1503', 'Trachinus draco': '1504', 'Trematomus bernacchii': '1505', 'Tridacna': '1506', 'Trinectes paulistanus': '1507', 'Trochus radiatus': '1508', 'Turbinaria': '1509', 'Tursiops truncatus': '1510', 'Ucides': '1511', 
'Ulva compressa': '1512', 'Ulva fasciata': '1513', 'Ulva flexuosa': '1514', 'Ulva rigida': '1515', 'Upeneus taeniopterus': '1516', 'Upogebiidae': '1517', 'Uroteuthis Photololigo edulis': '1518', 'Valoniopsis pachynema': '1519', 'Veneridae': '1520', 'Venus foveolata': '1521', 'Vertebrata': '1522', 'Volutharpa ampullacea perryi': '1523', 'Zannichellia palustris': '1524', 'Zeus japonicus': '1525', 'Favites': '1526', 'Gadiformes': '1527', 'Gafrarium dispar': '1528', 'Galaxaura frutescens': '1529', 'Gelidium crinale': '1530', 'Genidens genidens': '1531', 'Girella elevata': '1532', 'Girella tricuspidata': '1533', 'Dentex hypselosomus': '1534', 'Saurida elongata': '1535', 'Pseudolabrus eoethinus': '1536', 'Atrobucca nibe': '1537', 'Diagramma pictum': '1538', 'Sepia lycidas': '1539', 'Plectorhinchus cinctus': '1540', 'Metapenaeopsis acclivis': '1541', 'Metapenaeopsis barbata': '1542', 'Nibea albiflora': '1543', 'Girella leonina': '1544', 'Sphyraenidae': '1545', 'Parapercis pulchella': '1546', 'Parapercis sexfasciata': '1547', 'Thysanoteuthis rhombus': '1548', 'Lepidotrigla kishinouyi': '1549', 'Cystoseira': '1550', 'Padina': '1551', 'Halimeda': '1552', 'Pacifastacus leniusculus': '1553', 'Salmo trutta': '1554', 'Chondrus crispus': '1555', 'Ictalurus punctatus': '1556', 'Acanthurus': '1557', 'Scombridae': '1558', 'Leukoma staminea': '1559', 'Trochidae': '1560', 'Protonibea': '1562', 'Anchoa compressa': '1563', 'Ensis magnus': '1564', 'Bolinus brandaris': '1565', 'Lutjanus notatus': '1566', 'Lethrinus olivaceus': '1567', 'Carassius auratus': '1569', 'Mugil': '1570', 'Gobius': '1571', 'Lajonkairia lajonkairii': '1572', 'Chrysophrys auratus': '1573', 'Galeorhinus australis': '1574', 'Nototodarus sloanii gouldi': '1575', 'Tylosurus crocodilus': '1576', 'Acanthogobius hasta': '1577', 'Penaeus chinensis': '1578', 'Ruditapes variegatus': '1579', 'Marcia marmorata': '1580', 'Rachycentron': '1581', 'Scomber kanagurta': '1582', 'Arius': '1583', 'Panulirus versicolor': '1584', 
'Tilapia zillii': '1585', 'Schizoporella errata': '1586', 'Phallusia nigra': '1587', 'Physeter catodon': '1588', 'Salmo trutta trutta': '1589', 'Tachysurus thalassinus': '1590', 'Sillago domina': '1591', 'Otolithus argenteus': '1592', 'Trichiurus haumela': '1593', 'Otolithes maculata': '1594', 'Hilsa kanagurta': '1595', 'Oreochromis mossambicus': '1596', 'Siluriformes': '1597', 'Theodoxus euxinus': '1598', 'Formio niger': '1599', 'Rastrelliger': '1600', 'Nephasoma flagriferum': '1601', 'Ophiomusium lymani': '1602', 'Nematonurus armatus': '1603', 'Thalamitoides spinigera': '1604', 'Capros aper': '1605', 'Gadiculus argenteus thori': '1606', 'Phorcus lineatus': '1607', 'Penaeus vannamei': '1608', 'Raja montagui': '1609', 'Scophthalmus rhombus': '1610', 'Crambe maritima': '1611', 'Fucus ceranoides': '1612', 'Maja squinado': '1613', 'Salicornia europaea': '1614', 'Aequipecten opercularis': '1615', 'Galathea squamifera': '1616', 'Cynoglossus semilaevis': '1617', 'Loliolus beka': '1619', 'Octopus variabilis': '1620', 'Abudefduf sexfasciatus': '1621', 'Acanthurus blochii': '1622', 'Achillea millefolium': '1623', 'Alaria crassifolia': '1624', 'Albulidae': '1625', 'Ammodytes': '1626', 'Anadara satowi': '1627', 'Argyrosomus japonicus': '1628', 'Ascidiacea': '1629', 'Aulopiformes': '1630', 'Babylonia japonica': '1631', 'Babylonia kirana': '1632', 'Bathylagidae': '1633', 'Beryx decadactylus': '1634', 'Branchiostegus': '1635', 'Buccinum': '1636', 'Caesio lunaris': '1637', 'Callionymus curvicornis': '1638', 'Campylaephora hypnaeoides': '1639', 'Cetoscarus ocellatus': '1640', 'Charonia tritonis': '1641', 'Chelon haematocheilus': '1642', 'Chlorurus sordidus': '1643', 'Choerodon azurio': '1644', 'Chromis notata': '1645', 'Cladosiphon okamuranus': '1646', 'Cociella punctata': '1647', 'Coryphaena': '1648', 'Cyclina sinensis': '1649', 'Cymbacephalus beauforti': '1650', 'Dendrobranchiata': '1651', 'Digenea simplex': '1652', 'Ditrema viride': '1653', 'Enteromorpha prolifera': '1654', 
'Epinephelus': '1655', 'Epinephelus akaara': '1656', 'Epinephelus awoara': '1657', 'Etelis carbunculus': '1658', 'Fistularia commersonii': '1659', 'Fulvia mutica': '1660', 'Fusinus colus': '1661', 'Gafrarium tumidum': '1662', 'Gelidiaceae': '1663', 'Girella cyanea': '1664', 'Girella mezina': '1665', 'Goniistius zonatus': '1666', 'Gracilaria': '1667', 'Gymnocranius euanus': '1668', 'Heikeopsis japonica': '1669', 'Hemitrygon': '1670', 'Hippoglossoides pinetorum': '1671', 'Holothuria atra': '1672', 'Holothuria leucospilota': '1673', 'Idiosepiidae': '1674', 'Inegocia japonica': '1675', 'Inimicus didactylus': '1676', 'Ishige': '1677', 'Lagocephalus spadiceus': '1678', 'Lambis truncata': '1679', 'Leiognathus equula': '1680', 'Lethrinus xanthochilus': '1681', 'Lutjanus erythropterus': '1682', 'Lutjanus semicinctus': '1683', 'Monodonta labio': '1684', 'Monostroma kuroshiense': '1685', 'Mulloidichthys flavolineatus': '1686', 'Mulloidichthys vanicolensis': '1687', 'Muraenesocidae': '1688', 'Myagropsis myagroides': '1689', 'Mytilisepta virgata': '1690', 'Naso brevirostris': '1691', 'Nematalosa japonica': '1692', 'Nemipterus virgatus': '1693', 'Nipponacmea': '1694', 'Nuchequula nuchalis': '1695', 'Octopus cyanea': '1696', 'Panopea generosa': '1697', 'Paralichthys': '1698', 'Paralithodes camtschaticus': '1699', 'Parascolopsis inermis': '1700', 'Pectinidae': '1701', 'Pentapodus aureofasciatus': '1702', 'Pinctada fucata': '1703', 'Pitar citrinus': '1704', 'Platycephalidae': '1705', 'Plecoglossus altivelis': '1706', 'Pleuronectes herzensteini': '1707', 'Priacanthus macracanthus': '1708', 'Pristipomoides': '1709', 'Psenopsis anomala': '1710', 'Pseudobalistes fuscus': '1711', 'Pseudocaranx dentex': '1712', 'Pseudolabrus sieboldi': '1713', 'Pseudorhombus arsius': '1714', 'Pterocaesio chrysozona': '1715', 'Rhynchopelates oxyrhynchus': '1716', 'Ryukyupercis gushikeni': '1717', 'Saccostrea echinata': '1718', 'Sargassum hemiphyllum': '1719', 'Sargassum piluliferum': '1720', 'Saurida 
micropectoralis': '1721', 'Saurida undosquamis': '1722', 'Saurida wanieso': '1723', 'Scarus forsteni': '1724', 'Scarus ghobban': '1725', 'Scarus ovifrons': '1726', 'Scarus rubroviolaceus': '1727', 'Scyphozoa': '1728', 'Sebastes iracundus': '1729', 'Semicossyphus reticulatus': '1730', 'Sepia latimanus': '1731', 'Siganus guttatus': '1732', 'Siganus luridus': '1733', 'Sphaerotrichia divaricata': '1734', 'Sphyrnidae': '1735', 'Spondylus regius': '1736', 'Spratelloides gracilis': '1737', 'Sthenoteuthis oualaniensis': '1738', 'Tetraodontidae': '1739', 'Trichiurus lepturus japonicus': '1740', 'Tridacna crocea': '1741', 'Turbo argyrostomus': '1742', 'Tylosurus pacificus': '1743', 'Ulvophyceae': '1744', 'Upeneus japonicus': '1745', 'Upeneus moluccensis': '1746', 'Uranoscopus japonicus': '1747', 'Anguilliformes': '1748', 'Crithmum maritimum': '1749', 'Littorina': '1750', 'Nucella lapillus': '1752', 'Scyliorhinus stellaris': '1753', 'Annelida': '1754', 'Aphrodita aculeata': '1755', 'Callionymus lyra': '1756', 'Urticina felina': '1757', 'Gebiidea': '1758', 'Bonellia viridis': '1759', 'Alcyonium glomeratum': '1760'}, 'body_part': {'Not applicable': '-1', 'Not available': '0', 'Whole animal': '1', 'Whole animal eviscerated': '2', 'Whole animal eviscerated without head': '3', 'Flesh with bones': '4', 'Blood': '5', 'Skeleton': '6', 'Bones': '7', 'Exoskeleton': '8', 'Endoskeleton': '9', 'Shells': '10', 'Molt': '11', 'Skin': '12', 'Head': '13', 'Tooth': '14', 'Otolith': '15', 'Fins': '16', 'Faecal pellet': '17', 'Byssus': '18', 'Soft parts': '19', 'Viscera': '20', 'Stomach': '21', 'Hepatopancreas': '22', 'Digestive gland': '23', 'Pyloric caeca': '24', 'Liver': '25', 'Intestine': '26', 'Kidney': '27', 'Spleen': '28', 'Brain': '29', 'Eye': '30', 'Fat': '31', 'Heart': '32', 'Branchial heart': '33', 'Muscle': '34', 'Mantle': '35', 'Gills': '36', 'Gonad': '37', 'Ovary': '38', 'Testes': '39', 'Whole plant': '40', 'Flower': '41', 'Leaf': '42', 'Old leaf': '43', 'Young leaf': '44', 'Leaf 
upper part': '45', 'Leaf lower part': '46', 'Scales': '47', 'Root rhizome': '48', 'Whole macro alga': '49', 'Phytoplankton': '50', 'Thallus': '51', 'Flesh without bones': '52', 'Stomach and intestine': '53', 'Whole haptophytic plants': '54', 'Loose drifting plants': '55', 'Growing tips': '56', 'Upper parts of plants': '57', 'Lower parts of plants': '58', 'Shells carapace': '59', 'Flesh with scales': '60'}}, 'SEAWATER': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 
'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': '132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}}}

Let's review the data stored in the NetCDF file:

dfs = contents.dfs
dfs
{'BIOTA':              LON        LAT        TIME  SMP_ID  NUCLIDE     VALUE  UNIT  \
 0       4.031111  51.393333  1267574400       1       33  0.326416     5   
 1       4.031111  51.393333  1276473600       2       33  0.442704     5   
 2       4.031111  51.393333  1285545600       3       33  0.412989     5   
 3       4.031111  51.393333  1291766400       4       33  0.202768     5   
 4       4.031111  51.393333  1267574400       5       53  0.652833     5   
 ...          ...        ...         ...     ...      ...       ...   ...   
 15946  12.087778  57.252499  1660003200   98058       33  0.384000     5   
 15947  12.107500  57.306389  1663891200   98059       33  0.456000     5   
 15948  11.245000  58.603333  1667779200   98060       33  0.122000     5   
 15949  11.905278  57.302502  1663632000   98061       33  0.310000     5   
 15950  12.076667  57.335278  1662076800   98062       33  0.306000     5   
 
             UNC  DL  SPECIES  BODY_PART  
 0           NaN   2      377          1  
 1           NaN   2      377          1  
 2           NaN   2      377          1  
 3           NaN   2      377          1  
 4           NaN   2      377          1  
 ...         ...  ..      ...        ...  
 15946  0.012096   1      272         52  
 15947  0.012084   1      272         52  
 15948  0.031000   1      129         19  
 15949       NaN   2      129         19  
 15950  0.007191   1       96         40  
 
 [15951 rows x 11 columns],
 'SEAWATER':             LON        LAT  SMP_DEPTH        TIME  SMP_ID  NUCLIDE     VALUE  \
 0      3.188056  51.375278        3.0  1264550400       1       33  0.200000   
 1      2.859444  51.223610        3.0  1264550400       2       33  0.270000   
 2      2.713611  51.184444        3.0  1264550400       3       33  0.260000   
 3      3.262222  51.420277        3.0  1264550400       4       33  0.250000   
 4      2.809722  51.416111        3.0  1264464000       5       33  0.200000   
 ...         ...        ...        ...         ...     ...      ...       ...   
 19178  4.615278  52.831944        1.0  1573649640   97102       77  0.000005   
 19179  3.565556  51.411945        1.0  1575977820   96936        1  6.152000   
 19180  3.565556  51.411945        1.0  1575977820   96949       53  0.005390   
 19181  3.565556  51.411945        1.0  1575977820   96962       54  0.001420   
 19182  3.493889  51.719444        1.0  1576680180   96982        1  6.078000   
 
        UNIT           UNC  DL  
 0         1           NaN   2  
 1         1           NaN   2  
 2         1           NaN   2  
 3         1           NaN   2  
 4         1           NaN   2  
 ...     ...           ...  ..  
 19178     1  2.600000e-07   1  
 19179     1  3.076000e-01   1  
 19180     1  1.078000e-03   1  
 19181     1  2.840000e-04   1  
 19182     1  3.039000e-01   1  
 
 [19183 rows x 10 columns]}

Let's review the biota data:

nc_dfs_biota = dfs['BIOTA']
nc_dfs_biota
             LON        LAT        TIME  SMP_ID  NUCLIDE     VALUE  UNIT       UNC  DL  SPECIES  BODY_PART
0       4.031111  51.393333  1267574400       1       33  0.326416     5       NaN   2      377          1
1       4.031111  51.393333  1276473600       2       33  0.442704     5       NaN   2      377          1
2       4.031111  51.393333  1285545600       3       33  0.412989     5       NaN   2      377          1
3       4.031111  51.393333  1291766400       4       33  0.202768     5       NaN   2      377          1
4       4.031111  51.393333  1267574400       5       53  0.652833     5       NaN   2      377          1
...          ...        ...         ...     ...      ...       ...   ...       ...  ..      ...        ...
15946  12.087778  57.252499  1660003200   98058       33  0.384000     5  0.012096   1      272         52
15947  12.107500  57.306389  1663891200   98059       33  0.456000     5  0.012084   1      272         52
15948  11.245000  58.603333  1667779200   98060       33  0.122000     5  0.031000   1      129         19
15949  11.905278  57.302502  1663632000   98061       33  0.310000     5       NaN   2      129         19
15950  12.076667  57.335278  1662076800   98062       33  0.306000     5  0.007191   1       96         40

15951 rows × 11 columns

Let's review the seawater data:

nc_dfs_seawater = dfs['SEAWATER']
nc_dfs_seawater
            LON        LAT  SMP_DEPTH        TIME  SMP_ID  NUCLIDE     VALUE  UNIT           UNC  DL
0      3.188056  51.375278        3.0  1264550400       1       33  0.200000     1           NaN   2
1      2.859444  51.223610        3.0  1264550400       2       33  0.270000     1           NaN   2
2      2.713611  51.184444        3.0  1264550400       3       33  0.260000     1           NaN   2
3      3.262222  51.420277        3.0  1264550400       4       33  0.250000     1           NaN   2
4      2.809722  51.416111        3.0  1264464000       5       33  0.200000     1           NaN   2
...         ...        ...        ...         ...     ...      ...       ...   ...           ...  ..
19178  4.615278  52.831944        1.0  1573649640   97102       77  0.000005     1  2.600000e-07   1
19179  3.565556  51.411945        1.0  1575977820   96936        1  6.152000     1  3.076000e-01   1
19180  3.565556  51.411945        1.0  1575977820   96949       53  0.005390     1  1.078000e-03   1
19181  3.565556  51.411945        1.0  1575977820   96962       54  0.001420     1  2.840000e-04   1
19182  3.493889  51.719444        1.0  1576680180   96982        1  6.078000     1  3.039000e-01   1

19183 rows × 10 columns
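
The NUCLIDE, UNIT and DL columns hold integer codes that refer to the lookup tables printed earlier. As a minimal sketch, the first seawater row can be translated back into readable labels as follows. It assumes that TIME is stored as seconds since 1970-01-01 UTC, which this notebook does not confirm. The `SEAWATER_ENUMS` excerpt and the `invert` and `decode_row` helpers are hypothetical names introduced for illustration; they are not part of marisco.

```python
from datetime import datetime, timezone

# Excerpt of the SEAWATER lookup tables printed above (label -> code).
SEAWATER_ENUMS = {
    'nuclide': {'h3': '1', 'cs137': '33', 'ra226': '53', 'ra228': '54',
                'pu239_240_tot': '77'},
    'unit': {'Bq per m3': '1'},
    'dl': {'Detected value': '1', 'Detection limit': '2'},
}

def invert(lookup):
    """Turn a label -> code mapping into a code -> label mapping with int keys."""
    return {int(code): label for label, code in lookup.items()}

def decode_row(row, enums=SEAWATER_ENUMS):
    """Return a copy of `row` with coded columns replaced by their labels."""
    decoded = dict(row)
    decoded['NUCLIDE'] = invert(enums['nuclide'])[row['NUCLIDE']]
    decoded['UNIT'] = invert(enums['unit'])[row['UNIT']]
    decoded['DL'] = invert(enums['dl'])[row['DL']]
    # Assumption: TIME is expressed in seconds since the Unix epoch (UTC).
    decoded['TIME'] = datetime.fromtimestamp(row['TIME'], tz=timezone.utc).date().isoformat()
    return decoded

# First row of the seawater table shown above.
first_row = {'LON': 3.188056, 'LAT': 51.375278, 'SMP_DEPTH': 3.0,
             'TIME': 1264550400, 'SMP_ID': 1, 'NUCLIDE': 33,
             'VALUE': 0.2, 'UNIT': 1, 'UNC': float('nan'), 'DL': 2}

decoded = decode_row(first_row)
print(decoded['NUCLIDE'], decoded['UNIT'], decoded['DL'], decoded['TIME'])
```

Under these assumptions the row reads as a cs137 measurement in Bq per m3, reported as a detection limit, which is consistent with its missing UNC value.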

Data Format Conversion

The MARIS data processing workflow involves two key steps:

  1. NetCDF to Standardized CSV Compatible with OpenRefine Pipeline
    • Convert standardized NetCDF files to CSV formats compatible with OpenRefine using the NetCDFDecoder.
    • Preserve data integrity and variable relationships.
    • Maintain standardized nomenclature and units.
  2. Database Integration
    • Process the converted CSV files using OpenRefine.
    • Apply data cleaning and standardization rules.
    • Export validated data to the MARIS master database.

This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the NetCDFDecoder class.

decode(fname_in=fname_out_nc, verbose=True)
Saved BIOTA to ../../_data/output/191-OSPAR-2024_BIOTA.csv
Saved SEAWATER to ../../_data/output/191-OSPAR-2024_SEAWATER.csv
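
The log lines above suggest that each NetCDF group is written to its own CSV file, named after the input file with the group name appended. The sketch below reproduces that naming convention; it is inferred only from the output shown here, and `csv_path_for_group` is a hypothetical helper, not a marisco function.

```python
from pathlib import PurePosixPath

def csv_path_for_group(fname_nc, group):
    """Build '<stem>_<GROUP>.csv' next to the NetCDF file (convention inferred from the log)."""
    nc_path = PurePosixPath(fname_nc)
    return nc_path.with_name(f"{nc_path.stem}_{group}.csv")

fname_out_nc = '../../_data/output/191-OSPAR-2024.nc'
csv_paths = {group: str(csv_path_for_group(fname_out_nc, group))
             for group in ('BIOTA', 'SEAWATER')}
print(csv_paths)
```

Such a helper can be handy for locating the exported files before loading them into OpenRefine.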