Exported source
src_dir = 'https://raw.githubusercontent.com/franckalbinet/maris-crawlers/refs/heads/main/data/processed/HELCOM%20MORS'
fname_out_nc = '../../_data/output/100-HELCOM-MORS-2024.nc'
zotero_key = '26VMZZ2Q' # HELCOM MORS zotero key
This data pipeline, known as a “handler” in Marisco terminology, is designed to clean, standardize, and encode HELCOM data into NetCDF format. The handler processes raw HELCOM data, applying various transformations and lookups to align it with MARIS data standards.
Key functions of this handler:
NetCDF format compatible with MARIS requirements
This handler is a crucial component in the Marisco data processing workflow, ensuring HELCOM data is properly integrated into the MARIS database.
For new MARIS users, please refer to Understanding MARIS Data Formats (NetCDF and Open Refine) for detailed information.
The present notebook aims to be an instance of Literate Programming: a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case marisco/handlers/helcom.py), the code snippet is added to the module using the #| exports directive provided by the wonderful nbdev library.
src_dir: path to the maris-crawlers folder containing the HELCOM data in CSV format.
fname_out_nc: path and filename for the NetCDF output. The path can be defined as a relative path.
Zotero key: used to retrieve attributes related to the dataset from Zotero. The MARIS datasets include a library available on Zotero.
FEEDBACK FOR NEXT VERSION: Review the NetCDF file naming convention as discussed here. I think we should include ‘MARISCO’ in the filename and attributes to acknowledge the contributions and branding of the project.
HELCOM MORS (Monitoring of Radioactive Substances in the Baltic Sea) data is provided as a zipped Microsoft Access database. We automatically fetch this dataset and export its database tables as .csv files using a GitHub action here: maris-crawlers.
The dataset is then accessible in a format amenable to the marisco data pipeline.
read_csv (file_name, dir='https://raw.githubusercontent.com/franckalbinet/maris-crawlers/refs/heads/main/data/processed/HELCOM%20MORS')
load_data (src_url:str, smp_types:dict={'BIO': 'BIOTA', 'SEA': 'SEAWATER', 'SED': 'SEDIMENT'}, use_cache:bool=False, save_to_cache:bool=False, verbose:bool=False)
Load HELCOM data and return the data in a dictionary of dataframes with the dictionary key as the sample type.
def load_data(src_url: str,
              smp_types: dict = default_smp_types,
              use_cache: bool = False,
              save_to_cache: bool = False,
              verbose: bool = False) -> Dict[str, pd.DataFrame]:
    "Load HELCOM data and return the data in a dictionary of dataframes with the dictionary key as the sample type."

    def load_and_merge(file_prefix: str) -> pd.DataFrame:
        # Read either from the local cache or from the remote source
        if use_cache:
            dir = cache_path()
        else:
            dir = src_url

        file_smp_path = f'{dir}/{file_prefix}01.csv'
        file_meas_path = f'{dir}/{file_prefix}02.csv'

        if use_cache:
            if not Path(file_smp_path).exists():
                print(f'{file_smp_path} not found.')
            if not Path(file_meas_path).exists():
                print(f'{file_meas_path} not found.')

        if verbose:
            start_time = time.time()

        df_meas = read_csv(f'{file_prefix}02.csv', dir)
        df_smp = read_csv(f'{file_prefix}01.csv', dir)

        df_meas.columns = df_meas.columns.str.lower()
        df_smp.columns = df_smp.columns.str.lower()

        # Attach sample metadata to each measurement row
        merged_df = pd.merge(df_meas, df_smp, on='key', how='left')

        if verbose:
            print(f"Downloaded data for {file_prefix}01.csv and {file_prefix}02.csv in {time.time() - start_time:.2f} seconds.")

        if save_to_cache:
            dir = cache_path()
            df_smp.to_csv(f'{dir}/{file_prefix}01.csv', index=False)
            df_meas.to_csv(f'{dir}/{file_prefix}02.csv', index=False)
            if verbose:
                print(f"Saved downloaded data to cache at {dir}/{file_prefix}01.csv and {dir}/{file_prefix}02.csv")

        return merged_df

    return {smp_type: load_and_merge(file_prefix) for file_prefix, smp_type in smp_types.items()}
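The merge step inside load_and_merge attaches sample metadata to every measurement row through a left join on key. The following is a minimal plain-Python sketch of that idea, independent of pandas and marisco. The helper name left_join and the toy rows are hypothetical, and unlike pd.merge it adds no _x/_y suffixes to clashing column names (measurement fields simply take precedence).

```python
from typing import Dict, List


def left_join(meas_rows: List[Dict], smp_rows: List[Dict], key: str = 'key') -> List[Dict]:
    """Attach sample metadata to each measurement row (left join on `key`)."""
    smp_by_key = {row[key]: row for row in smp_rows}
    # Every measurement row is kept, even when no matching sample exists.
    return [{**smp_by_key.get(meas[key], {}), **meas} for meas in meas_rows]


# Toy rows (hypothetical values, not taken from the HELCOM tables).
samples = [{'key': 'A1', 'station': 'ST1'}]
measurements = [
    {'key': 'A1', 'nuclide': 'CS137'},
    {'key': 'A1', 'nuclide': 'SR90'},
    {'key': 'B2', 'nuclide': 'K40'},  # no matching sample row
]

merged = left_join(measurements, samples)
print(merged)
```

With a left join the number of output rows equals the number of measurement rows, which is why each dataframe shape reported by the handler follows the measurement table rather than the sample table.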
dfs is a dictionary of dataframes created from the HELCOM dataset located at the path src_dir. The data is split by sample type: each dictionary key is a sample type and each value is the corresponding dataframe.
dfs = load_data(src_dir, save_to_cache=True, verbose=True)
print('keys/sample types: ', dfs.keys())
for key in dfs.keys():
    print(f'{key} columns: ', dfs[key].columns)
    print(f'{key} shape: ', dfs[key].shape)
Downloaded data for BIO01.csv and BIO02.csv in 1.54 seconds.
Saved downloaded data to cache at /home/niallmurphy93/.marisco/cache/BIO01.csv and /home/niallmurphy93/.marisco/cache/BIO02.csv
Downloaded data for SEA01.csv and SEA02.csv in 1.30 seconds.
Saved downloaded data to cache at /home/niallmurphy93/.marisco/cache/SEA01.csv and /home/niallmurphy93/.marisco/cache/SEA02.csv
Downloaded data for SED01.csv and SED02.csv in 1.92 seconds.
Saved downloaded data to cache at /home/niallmurphy93/.marisco/cache/SED01.csv and /home/niallmurphy93/.marisco/cache/SED02.csv
keys/sample types: dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])
BIOTA columns: Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'basis',
'error%', 'number', 'date_of_entry_x', 'country', 'laboratory',
'sequence', 'date', 'year', 'month', 'day', 'station',
'latitude ddmmmm', 'latitude dddddd', 'longitude ddmmmm',
'longitude dddddd', 'sdepth', 'rubin', 'biotatype', 'tissue', 'no',
'length', 'weight', 'dw%', 'loi%', 'mors_subbasin', 'helcom_subbasin',
'date_of_entry_y'],
dtype='object')
BIOTA shape: (16124, 33)
SEAWATER columns: Index(['key', 'nuclide', 'method', '< value_bq/m³', 'value_bq/m³', 'error%_m³',
'date_of_entry_x', 'country', 'laboratory', 'sequence', 'date', 'year',
'month', 'day', 'station', 'latitude (ddmmmm)', 'latitude (dddddd)',
'longitude (ddmmmm)', 'longitude (dddddd)', 'tdepth', 'sdepth', 'salin',
'ttemp', 'filt', 'mors_subbasin', 'helcom_subbasin', 'date_of_entry_y'],
dtype='object')
SEAWATER shape: (21634, 27)
SEDIMENT columns: Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'error%_kg',
'< value_bq/m²', 'value_bq/m²', 'error%_m²', 'date_of_entry_x',
'country', 'laboratory', 'sequence', 'date', 'year', 'month', 'day',
'station', 'latitude (ddmmmm)', 'latitude (dddddd)',
'longitude (ddmmmm)', 'longitude (dddddd)', 'device', 'tdepth',
'uppsli', 'lowsli', 'area', 'sedi', 'oxic', 'dw%', 'loi%',
'mors_subbasin', 'helcom_subbasin', 'sum_link', 'date_of_entry_y'],
dtype='object')
SEDIMENT shape: (40744, 35)
FEEDBACK TO DATA PROVIDER: Some nuclide names contain one or multiple trailing spaces.
This is demonstrated below for the NUCLIDE column:
df = get_unique_across_dfs(load_data(src_dir, use_cache=True, verbose=True), 'nuclide', as_df=True, include_nchars=True)
df['stripped_chars'] = df['value'].str.strip().str.replace(' ', '').str.len()
print(df[df['n_chars'] != df['stripped_chars']])
Downloaded data for BIO01.csv and BIO02.csv in 0.03 seconds.
Downloaded data for SEA01.csv and SEA02.csv in 0.05 seconds.
Downloaded data for SED01.csv and SED02.csv in 0.09 seconds.
index value n_chars stripped_chars
0 0 SR90 5 4
19 19 CS137 8 5
27 27 CS137 9 5
34 34 PU238 8 5
40 40 CS137 6 5
42 42 CO60 8 4
45 45 TC99 7 4
68 68 SR90 7 4
80 80 K40 8 3
82 82 CS134 8 5
88 88 SR90 8 4
91 91 SR90 6 4
95 95 AM241 8 5
To fix this issue, we use the LowerStripNameCB callback. For each dataframe in the dictionary of dataframes, it corrects the nuclide name by converting it to lowercase and stripping any leading or trailing whitespace.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE')])
tfm()
for key, df in tfm.dfs.items():
    print(f'{key} Nuclides: ')
    print(df['NUCLIDE'].unique())
print(tfm.logs)
BIOTA Nuclides:
['cs134' 'k40' 'co60' 'cs137' 'sr90' 'ag108m' 'mn54' 'co58' 'ag110m'
'zn65' 'sb125' 'pu239240' 'ru106' 'be7' 'ce144' 'pb210' 'po210' 'sb124'
'sr89' 'zr95' 'te129m' 'ru103' 'nb95' 'ce141' 'la140' 'i131' 'ba140'
'pu238' 'u235' 'bi214' 'pb214' 'pb212' 'tl208' 'ac228' 'ra223' 'eu155'
'ra226' 'gd153' 'sn113' 'fe59' 'tc99' 'co57' 'sn117m' 'eu152' 'sc46'
'rb86' 'ra224' 'th232' 'cs134137' 'am241' 'ra228' 'th228' 'k-40' 'cs138'
'cs139' 'cs140' 'cs141' 'cs142' 'cs143' 'cs144' 'cs145' 'cs146']
SEAWATER Nuclides:
['cs137' 'sr90' 'h3' 'cs134' 'pu238' 'pu239240' 'am241' 'cm242' 'cm244'
'tc99' 'k40' 'ru103' 'sr89' 'sb125' 'nb95' 'ru106' 'zr95' 'ag110m'
'cm243244' 'ba140' 'ce144' 'u234' 'u238' 'co60' 'pu239' 'pb210' 'po210'
'np237' 'pu240' 'mn54']
SEDIMENT Nuclides:
['cs137' 'ra226' 'ra228' 'k40' 'sr90' 'cs134137' 'cs134' 'pu239240'
'pu238' 'co60' 'ru103' 'ru106' 'sb125' 'ag110m' 'ce144' 'am241' 'be7'
'th228' 'pb210' 'co58' 'mn54' 'zr95' 'ba140' 'po210' 'ra224' 'nb95'
'pu238240' 'pu241' 'pu239' 'eu155' 'ir192' 'th232' 'cd109' 'sb124' 'zn65'
'th234' 'tl208' 'pb212' 'pb214' 'bi214' 'ac228' 'ra223' 'u235' 'bi212']
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column."]
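The normalization applied to nuclide names can be illustrated without marisco. This is a minimal sketch in plain Python: the function name normalize_name and the sample values are hypothetical, and it only mirrors the lowercase-and-strip behavior described in the log message above.

```python
def normalize_name(name: str) -> str:
    """Lowercase a name and strip leading/trailing whitespace."""
    return name.strip().lower()


# Hypothetical raw values, some padded with trailing spaces.
raw = ['SR90 ', 'CS137   ', 'K-40']

# Names whose length changes after stripping carry extra whitespace.
padded = [n for n in raw if len(n) != len(n.strip())]
cleaned = [normalize_name(n) for n in raw]

print(padded)
print(cleaned)
```

Note that stripping does not touch inner characters, so a variant such as 'k-40' still needs a manual fix later in the remapping step.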
Below, we map nuclide names used by HELCOM to the MARIS standard nuclide names.
Remapping data provider nomenclatures to MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:
We will refer to this process as IMFA (Inspect, Match, Fix, Apply).
The get_unique_across_dfs function is a utility in MARISCO that retrieves unique values from a specified column across all DataFrames. Note that there is one DataFrame for each sample type, such as biota, sediment, etc.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE')])
dfs_output = tfm()
# Transpose to display the dataframe horizontally
get_unique_across_dfs(dfs_output, col_name='NUCLIDE', as_df=True).T
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 |
value | nb95 | cs140 | bi212 | ru106 | pb214 | cs134137 | po210 | pb210 | bi214 | th228 | ... | cs144 | h3 | pu238240 | th232 | pu238 | cs138 | u238 | cs141 | pb212 | co57 |
2 rows × 77 columns
Let’s now create an instance of Remapper, a fuzzy matching utility. This instance will match the nuclide names of the HELCOM dataset to the MARIS standard nuclide names.
Let’s try to match HELCOM nuclide names to MARIS standard nuclide names as automatically as possible. The match_score column allows us to assess the results:
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 0%| | 0/77 [00:00<?, ?it/s]Processing: 100%|██████████| 77/77 [00:01<00:00, 41.86it/s]
63 entries matched the criteria, while 14 entries had a match score of 1 or higher.
source_key | matched_maris_name | source_name | match_score |
---|---|---|---|
cs134137 | cs137 | cs134137 | 3 |
cm243244 | cm242 | cm243244 | 3 |
pu239240 | pu239 | pu239240 | 3 |
pu238240 | pu240 | pu238240 | 3 |
cs143 | ce140 | cs143 | 2 |
cs145 | ce140 | cs145 | 2 |
cs142 | ce140 | cs142 | 2 |
cs140 | ce140 | cs140 | 1 |
cs139 | ce139 | cs139 | 1 |
k-40 | k40 | k-40 | 1 |
cs146 | cs136 | cs146 | 1 |
cs144 | cs134 | cs144 | 1 |
cs138 | cs134 | cs138 | 1 |
cs141 | ce141 | cs141 | 1 |
We can now manually inspect the unmatched nuclide names and create a table to correct them to the MARIS standard:
fixes_nuclide_names = {
'cs134137': 'cs134_137_tot',
'cm243244': 'cm243_244_tot',
'pu239240': 'pu239_240_tot',
'pu238240': 'pu238_240_tot',
'cs143': 'cs137',
'cs145': 'cs137',
'cs142': 'cs137',
'cs141': 'cs137',
'cs144': 'cs137',
'k-40': 'k40',
'cs140': 'cs137',
'cs146': 'cs137',
'cs139': 'cs137',
'cs138': 'cs137'
}
We now include the table fixes_nuclide_names, which applies manual corrections to the nuclide names before the remapping process. The generate_lookup_table function has an overwrite parameter (default is True), which, when set to True, creates a pickle file cache of the lookup table. We can now test the remapping process:
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
fc.test_eq(len(remapper.select_match(match_score_threshold=1, verbose=True)), 0)
Processing: 0%| | 0/77 [00:00<?, ?it/s]Processing: 100%|██████████| 77/77 [00:01<00:00, 48.43it/s]
77 entries matched the criteria, while 0 entries had a match score of 1 or higher.
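The match, fix, and re-match cycle can be sketched in plain Python. The scores reported earlier (3 for cs134137 against cs137, 1 for k-40 against k40) are consistent with an edit distance, so this sketch assumes a Levenshtein distance; that is an assumption, not a statement about how Remapper is implemented. The names edit_distance and best_match and the short list of standard names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_match(name: str,
               standard_names: List[str],
               fixes: Optional[Dict[str, str]] = None) -> Tuple[str, int]:
    """Return (closest standard name, score); manual fixes are applied first."""
    target = (fixes or {}).get(name, name)
    score, match = min((edit_distance(target, s), s) for s in standard_names)
    return match, score


# Hypothetical subset of standard names and manual fixes.
standard = ['cs134', 'cs137', 'k40', 'cs134_137_tot']
fixes = {'cs134137': 'cs134_137_tot', 'k-40': 'k40'}

print(best_match('k-40', standard))          # imperfect match, score above 0
print(best_match('k-40', standard, fixes))   # exact match once the fix is applied
```

A score of 0 means an exact match, which is what the test above checks for: once every problematic name has a manual fix, no entry is left with a score of 1 or higher.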
The test passes! We can now create a callback RemapNuclideNameCB to remap the nuclide names. Note that we pass overwrite=False to generate_lookup_table so that the cached version is now used.
# Create a lookup table for nuclide names
lut_nuclides = lambda df: Remapper(provider_lut_df=df,
maris_lut_fn=nuc_lut_path,
maris_col_id='nuclide_id',
maris_col_name='nc_name',
provider_col_to_match='value',
provider_col_key='value',
fname_cache='nuclides_helcom.pkl').generate_lookup_table(fixes=fixes_nuclide_names,
as_df=False, overwrite=False)
We now create the callback RemapNuclideNameCB, which will remap the nuclide names using the lut_nuclides lookup table.
RemapNuclideNameCB (fn_lut:Callable, col_name:str)
Remap data provider nuclide names to standardized MARIS nuclide names.
| | Type | Details |
---|---|---|
fn_lut | Callable | Function that returns the lookup table dictionary |
col_name | str | Column name to remap |
class RemapNuclideNameCB(Callback):
    "Remap data provider nuclide names to standardized MARIS nuclide names."
    def __init__(self,
                 fn_lut: Callable, # Function that returns the lookup table dictionary
                 col_name: str # Column name to remap
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        df_uniques = get_unique_across_dfs(tfm.dfs, col_name=self.col_name, as_df=True)
        #lut = {k: v.matched_maris_name for k, v in self.fn_lut(df_uniques).items()}
        lut = {k: v.matched_id for k, v in self.fn_lut(df_uniques).items()}
        for k in tfm.dfs.keys():
            tfm.dfs[k]['NUCLIDE'] = tfm.dfs[k][self.col_name].replace(lut)
Let’s see it in action, along with the LowerStripNameCB callback:
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
CompareDfsAndTfmCB(dfs)
])
dfs_out = tfm()
# For instance
for key in dfs_out.keys():
    print(f'{key} NUCLIDE unique: ', dfs_out[key]['NUCLIDE'].unique())
print(tfm.logs)
BIOTA NUCLIDE unique: [31 4 9 33 12 21 6 8 22 10 24 77 17 2 37 41 47 23 11 13 25 16 14 36
35 29 34 67 63 46 43 42 94 55 50 40 53 87 92 86 15 7 93 85 91 90 51 59
76 72 54 57]
SEAWATER NUCLIDE unique: [33 12 1 31 67 77 72 73 75 15 4 16 11 24 14 17 13 22 80 34 37 62 64 9
68 41 47 65 69 6]
SEDIMENT NUCLIDE unique: [ 33 53 54 4 12 76 31 77 67 9 16 17 24 22 37 72 2 57
41 8 6 13 34 47 51 14 89 70 68 40 88 59 84 23 10 60
94 42 43 46 55 50 63 130]
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column.", 'Remap data provider nuclide names to standardized MARIS nuclide names.', 'Create a dataframe of removed data. Data included in the `tfm` not in the `dfs`.']
FEEDBACK TO DATA PROVIDER: Time/date is provided in the DATE, YEAR, MONTH, DAY columns. Note that DATE contains missing values, as indicated below. When it is missing, we fall back on the YEAR, MONTH, DAY columns. Note also that DAY and MONTH sometimes contain 0; in this case we systematically set them to 1.
dfs = load_data(src_dir, use_cache=True)
for key in dfs.keys():
    print(f'{key} DATE null values: ', dfs[key]['date'].isna().sum())
BIOTA DATE null values: 88
SEAWATER DATE null values: 554
SEDIMENT DATE null values: 830
ParseTimeCB ()
Standardize time format across all dataframes.
class ParseTimeCB(Callback):
    "Standardize time format across all dataframes."
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            self._process_dates(df)

    def _process_dates(self, df: pd.DataFrame) -> None:
        "Process and correct date and time information in the DataFrame."
        df['TIME'] = self._parse_date(df)
        self._handle_missing_dates(df)
        self._fill_missing_time(df)

    def _parse_date(self, df: pd.DataFrame) -> pd.Series:
        "Parse the DATE column if present."
        return pd.to_datetime(df['date'], format='%m/%d/%y %H:%M:%S', errors='coerce')

    def _handle_missing_dates(self, df: pd.DataFrame):
        "Handle cases where DAY or MONTH is 0 or missing."
        df.loc[df["day"] == 0, "day"] = 1
        df.loc[df["month"] == 0, "month"] = 1
        missing_day_month = (df["day"].isna()) & (df["month"].isna()) & (df["year"].notna())
        df.loc[missing_day_month, ["day", "month"]] = 1

    def _fill_missing_time(self, df: pd.DataFrame) -> None:
        "Fill missing time values using year, month, and day columns."
        missing_time = df['TIME'].isna()
        df.loc[missing_time, 'TIME'] = pd.to_datetime(
            df.loc[missing_time, ['year', 'month', 'day']],
            format='%Y%m%d',
            errors='coerce'
        )
Apply the transformer with the ParseTimeCB callback, then print the TIME column for SEAWATER. Passing the CompareDfsAndTfmCB callback allows us to compare the original dataframes with the transformed dataframes using the compare_stats attribute.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['SEAWATER'][['TIME']])
BIOTA SEAWATER SEDIMENT
Number of rows in original dataframes (dfs): 16124 21634 40744
Number of rows in transformed dataframes (tfm.dfs): 16124 21634 40744
Number of rows removed (tfm.dfs_removed): 0 0 0
TIME
0 2012-05-23
1 2012-05-23
2 2012-06-17
3 2012-05-24
4 2012-05-24
... ...
21629 2023-06-11
21630 2023-06-11
21631 2023-06-13
21632 2023-06-13
21633 2023-06-13
[21634 rows x 1 columns]
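The fallback logic used for dates can be summarized on a single record. The sketch below uses only the standard library; the function name parse_time is hypothetical, and it simplifies the callback slightly by replacing a missing or zero DAY or MONTH individually rather than only when both are missing.

```python
from datetime import datetime
from typing import Optional


def parse_time(date_str: Optional[str],
               year: Optional[float],
               month: Optional[float],
               day: Optional[float]) -> Optional[datetime]:
    """Parse the date string if possible, else fall back on year/month/day."""
    if date_str:
        try:
            return datetime.strptime(date_str, '%m/%d/%y %H:%M:%S')
        except ValueError:
            pass  # malformed string: fall through to the fallback
    if year is None:
        return None
    # A value of 0 or a missing value is replaced by 1.
    month_int = int(month) if month else 1
    day_int = int(day) if day else 1
    try:
        return datetime(int(year), month_int, day_int)
    except ValueError:
        return None  # e.g. an impossible calendar date


print(parse_time('05/23/12 00:00:00', 2012.0, 5.0, 23.0))
print(parse_time(None, 2012.0, 0.0, 0.0))
```

Records for which neither the date string nor the year is usable yield no time at all; such rows are the ones later reported as missing time values.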
The NetCDF time format requires that time be encoded as the number of seconds since a specified origin. In our case, the origin is 1970-01-01, as indicated in the cdl.toml file under the [vars.defaults.time.attrs] section.
EncodeTimeCB converts the HELCOM time format to the MARIS NetCDF time format.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
EncodeTimeCB(),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.logs)
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
BIOTA SEAWATER SEDIMENT
Number of rows in original dataframes (dfs): 16124 21634 40744
Number of rows in transformed dataframes (tfm.dfs): 16124 21626 40743
Number of rows removed (tfm.dfs_removed): 0 8 1
['Standardize time format across all dataframes.', 'Encode time as seconds since epoch.', 'Create a dataframe of removed data. Data included in the `tfm` not in the `dfs`.']
| | key | nuclide | method | < value_bq/m³ | value_bq/m³ | error%_m³ | date_of_entry_x | country | laboratory | sequence | date | year | month | day | station | latitude (ddmmmm) | latitude (dddddd) | longitude (ddmmmm) | longitude (dddddd) | tdepth | sdepth | salin | ttemp | filt | mors_subbasin | helcom_subbasin | date_of_entry_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | WKRIL2012003 | CS137 | NaN | NaN | 5.3 | 32.0 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012003.0 | 05/23/12 00:00:00 | 2012.0 | 5.0 | 23.0 | RU10 | 60.05 | 60.0833 | 29.20 | 29.3333 | NaN | 0.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
1 | WKRIL2012004 | CS137 | NaN | NaN | 19.9 | 20.0 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012004.0 | 05/23/12 00:00:00 | 2012.0 | 5.0 | 23.0 | RU10 | 60.05 | 60.0833 | 29.20 | 29.3333 | NaN | 29.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
2 | WKRIL2012005 | CS137 | NaN | NaN | 25.5 | 20.0 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012005.0 | 06/17/12 00:00:00 | 2012.0 | 6.0 | 17.0 | RU11 | 59.26 | 59.4333 | 23.09 | 23.1500 | NaN | 0.0 | NaN | NaN | NaN | 11.0 | 3.0 | 08/20/14 00:00:00 |
3 | WKRIL2012006 | CS137 | NaN | NaN | 17.0 | 29.0 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012006.0 | 05/24/12 00:00:00 | 2012.0 | 5.0 | 24.0 | RU19 | 60.15 | 60.2500 | 27.59 | 27.9833 | NaN | 0.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
4 | WKRIL2012007 | CS137 | NaN | NaN | 22.2 | 18.0 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012007.0 | 05/24/12 00:00:00 | 2012.0 | 5.0 | 24.0 | RU19 | 60.15 | 60.2500 | 27.59 | 27.9833 | NaN | 39.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
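The log message emitted by EncodeTimeCB reads "Encode time as seconds since epoch". A minimal standard-library sketch of such an encoding is given below; the function name to_epoch_seconds is hypothetical, and treating the naive timestamps as UTC is an assumption made for the example.

```python
from datetime import datetime, timezone


def to_epoch_seconds(dt: datetime) -> int:
    """Whole seconds elapsed since 1970-01-01 00:00:00, treating `dt` as UTC."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())


# First SEAWATER sampling date shown above.
encoded = to_epoch_seconds(datetime(2012, 5, 23))
print(encoded)
```

Rows without a valid time cannot be encoded, which is consistent with the warnings printed above for SEAWATER and SEDIMENT.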
HELCOM reports two values for the SEDIMENT sample type: VALUE_Bq/kg and VALUE_Bq/m². We need to split these and use a single column VALUE for the MARIS standard. We will use the UNIT column to identify the reported values.
Let’s take a look at the MARIS unit lookup table:
| | unit_id | unit | unit_sanitized | ordlist | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
---|---|---|---|---|---|---|---|
0 | -1 | Not applicable | Not applicable | NaN | NaN | NaN | NaN |
1 | 0 | NOT AVAILABLE | NOT AVAILABLE | 0.0 | NaN | NaN | NaN |
2 | 1 | Bq/m3 | Bq per m3 | 1.0 | Bq/m3 | NaN | Bq/m<sup>3</sup> |
3 | 2 | Bq/m2 | Bq per m2 | 2.0 | NaN | NaN | NaN |
4 | 3 | Bq/kg | Bq per kg | 3.0 | NaN | NaN | NaN |
5 | 4 | Bq/kgd | Bq per kgd | 4.0 | NaN | NaN | NaN |
6 | 5 | Bq/kgw | Bq per kgw | 5.0 | NaN | NaN | NaN |
7 | 6 | kg/kg | kg per kg | 6.0 | NaN | NaN | NaN |
8 | 7 | TU | TU | 7.0 | NaN | NaN | NaN |
9 | 8 | DELTA/mill | DELTA per mill | 8.0 | NaN | NaN | NaN |
10 | 9 | atom/kg | atom per kg | 9.0 | NaN | NaN | NaN |
11 | 10 | atom/kgd | atom per kgd | 10.0 | NaN | NaN | NaN |
12 | 11 | atom/kgw | atom per kgw | 11.0 | NaN | NaN | NaN |
13 | 12 | atom/l | atom per l | 12.0 | NaN | NaN | NaN |
14 | 13 | Bq/kgC | Bq per kgC | 13.0 | NaN | NaN | NaN |
We will define the columns of interest for the SEDIMENT measurement types:
We define the SplitSedimentValuesCB callback to split the sediment entries into separate rows for Bq/kg and Bq/m² values. A leading underscore marks the temporary columns created during the splitting process.
SplitSedimentValuesCB (coi:Dict[str,Dict[str,Any]])
Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements.
| | Type | Details |
---|---|---|
coi | Dict | Columns of interest with value, uncertainty, DL columns and units |
class SplitSedimentValuesCB(Callback):
    "Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements."
    def __init__(self,
                 coi: Dict[str, Dict[str, Any]] # Columns of interest with value, uncertainty, DL columns and units
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        if 'SEDIMENT' not in tfm.dfs:
            return

        df = tfm.dfs['SEDIMENT']
        dfs_to_concat = []

        # For each measurement type (kg and m2)
        for measure_type, cols in self.coi.items():
            # If any of value/uncertainty/DL exists, keep the row
            has_data = (
                df[cols['VALUE']].notna() |
                df[cols['UNC']].notna() |
                df[cols['DL']].notna()
            )

            if has_data.any():
                df_measure = df[has_data].copy()

                # Copy columns to standardized names
                df_measure['_VALUE'] = df_measure[cols['VALUE']]
                df_measure['_UNC'] = df_measure[cols['UNC']]
                df_measure['_DL'] = df_measure[cols['DL']]
                df_measure['_UNIT'] = cols['UNIT']

                dfs_to_concat.append(df_measure)

        # Combine all measurement type dataframes
        if dfs_to_concat:
            tfm.dfs['SEDIMENT'] = pd.concat(dfs_to_concat, ignore_index=True)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SplitSedimentValuesCB(coi_sediment),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
tfm.dfs['SEDIMENT'].head()
BIOTA SEAWATER SEDIMENT
Number of rows in original dataframes (dfs): 16124 21634 40744
Number of rows in transformed dataframes (tfm.dfs): 16124 21634 70697
Number of rows removed (tfm.dfs_removed): 0 0 0
| | key | nuclide | method | < value_bq/kg | value_bq/kg | error%_kg | < value_bq/m² | value_bq/m² | error%_m² | date_of_entry_x | ... | dw% | loi% | mors_subbasin | helcom_subbasin | sum_link | date_of_entry_y | _VALUE | _UNC | _DL | _UNIT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SKRIL2012116 | CS137 | NaN | NaN | 1200.0 | 20.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 | 1200.0 | 20.0 | NaN | 3 |
1 | SKRIL2012117 | CS137 | NaN | NaN | 250.0 | 20.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 | 250.0 | 20.0 | NaN | 3 |
2 | SKRIL2012118 | CS137 | NaN | NaN | 140.0 | 21.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 | 140.0 | 21.0 | NaN | 3 |
3 | SKRIL2012119 | CS137 | NaN | NaN | 79.0 | 20.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 | 79.0 | 20.0 | NaN | 3 |
4 | SKRIL2012120 | CS137 | NaN | NaN | 29.0 | 24.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 | 29.0 | 24.0 | NaN | 3 |
5 rows × 39 columns
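The row count grows after the split because a sediment record that reports both a Bq/kg and a Bq/m² value produces two rows. The sketch below reproduces the idea on plain dictionaries. The configuration coi_sediment_example is hypothetical: its column names are taken from the SEDIMENT dataframe listed earlier, the unit ids 3 (Bq/kg) and 2 (Bq/m2) are assumed from the MARIS unit table, and the toy values are invented.

```python
from typing import Any, Dict, List

# Hypothetical columns-of-interest configuration (assumed, see lead-in).
coi_sediment_example: Dict[str, Dict[str, Any]] = {
    'kg_type': {'VALUE': 'value_bq/kg', 'UNC': 'error%_kg', 'DL': '< value_bq/kg', 'UNIT': 3},
    'm2_type': {'VALUE': 'value_bq/m²', 'UNC': 'error%_m²', 'DL': '< value_bq/m²', 'UNIT': 2},
}


def split_rows(rows: List[Dict[str, Any]], coi: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Emit one output row per measurement type that carries any data."""
    out = []
    for cols in coi.values():
        for row in rows:
            # Keep the row if any of value/uncertainty/DL is present.
            if any(row.get(cols[k]) is not None for k in ('VALUE', 'UNC', 'DL')):
                out.append({**row,
                            '_VALUE': row.get(cols['VALUE']),
                            '_UNC': row.get(cols['UNC']),
                            '_DL': row.get(cols['DL']),
                            '_UNIT': cols['UNIT']})
    return out


# Toy records: S1 reports both measurement types, S2 only Bq/kg.
rows = [
    {'key': 'S1', 'value_bq/kg': 1200.0, 'error%_kg': 20.0, 'value_bq/m²': 350.0, 'error%_m²': 15.0},
    {'key': 'S2', 'value_bq/kg': 250.0, 'error%_kg': 20.0},
]
split = split_rows(rows, coi_sediment_example)
print([(r['key'], r['_VALUE'], r['_UNIT']) for r in split])
```

Two input records become three output rows here, mirroring the increase in SEDIMENT rows seen in the comparison statistics above.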
FEEDBACK TO DATA PROVIDER: Some records of the HELCOM dataset have missing values in the VALUE column; see the output after applying the SanitizeValueCB callback.
We gather the columns containing measurement values (named differently across sample types) into a single column VALUE and remove NA where needed.
SanitizeValueCB (coi:Dict[str,Dict[str,str]])
Sanitize measurement values by removing blanks and standardizing to use the VALUE column.
| | Type | Details |
---|---|---|
coi | Dict | Columns of interest. Format: {group_name: {‘VALUE’: ‘column_name’}} |
class SanitizeValueCB(Callback):
    "Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column."
    def __init__(self,
                 coi: Dict[str, Dict[str, str]] # Columns of interest. Format: {group_name: {'VALUE': 'column_name'}}
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        for grp, df in tfm.dfs.items():
            value_col = self.coi[grp]['VALUE']
            # Count NaN values before dropping
            initial_nan_count = df[value_col].isna().sum()
            df.dropna(subset=[value_col], inplace=True)
            # Count NaN values after dropping
            final_nan_count = df[value_col].isna().sum()
            dropped_nan_count = initial_nan_count - final_nan_count
            # Print the number of dropped NaN values
            if dropped_nan_count > 0:
                print(f"Warning: {dropped_nan_count} missing value(s) in {value_col} for group {grp}.")
            df['VALUE'] = df[value_col]
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SplitSedimentValuesCB(coi_sediment),
SanitizeValueCB(coi_val),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
BIOTA SEAWATER SEDIMENT
Number of rows in original dataframes (dfs): 16124 21634 40744
Number of rows in transformed dataframes (tfm.dfs): 16094 21481 70451
Number of rows removed (tfm.dfs_removed): 30 153 144
The function unc_rel2stan converts a relative uncertainty (in percent) to a standard (absolute) uncertainty.
unc_rel2stan (df:pandas.core.frame.DataFrame, meas_col:str, unc_col:str)
Convert relative uncertainty to absolute uncertainty.
| | Type | Details |
---|---|---|
df | DataFrame | DataFrame containing measurement and uncertainty columns |
meas_col | str | Name of the column with measurement values |
unc_col | str | Name of the column with relative uncertainty values (percentages) |
Returns | Series | Series with calculated absolute uncertainties |
def unc_rel2stan(
    df: pd.DataFrame, # DataFrame containing measurement and uncertainty columns
    meas_col: str, # Name of the column with measurement values
    unc_col: str # Name of the column with relative uncertainty values (percentages)
) -> pd.Series: # Series with calculated absolute uncertainties
    "Convert relative uncertainty to absolute uncertainty."
    return df.apply(lambda row: row[unc_col] * row[meas_col] / 100, axis=1)
For each sample type in the HELCOM dataset, the UNC is provided as a relative uncertainty. The column names for both the VALUE and the UNC vary by sample type. The coi_units_unc dictionary defines the column names for the VALUE and UNC for each sample type.
The NormalizeUncCB callback normalizes the UNC by converting from relative uncertainty to standard uncertainty.
NormalizeUncCB (fn_convert_unc:Callable=<function unc_rel2stan>, coi:List[Tuple[str,str,str]]=[('SEAWATER', 'value_bq/m³', 'error%_m³'), ('BIOTA', 'value_bq/kg', 'error%'), ('SEDIMENT', '_VALUE', '_UNC')])
Convert from relative error to standard uncertainty.
| | Type | Default | Details |
---|---|---|---|
fn_convert_unc | Callable | unc_rel2stan | Function converting relative uncertainty to absolute uncertainty |
coi | List | [(‘SEAWATER’, ‘value_bq/m³’, ’error%_m³’), (‘BIOTA’, ‘value_bq/kg’, ‘error%’), (‘SEDIMENT’, ’_VALUE’, ’_UNC’)] | List of columns of interest |
class NormalizeUncCB(Callback):
    "Convert from relative error to standard uncertainty."
    def __init__(self,
                 fn_convert_unc: Callable=unc_rel2stan, # Function converting relative uncertainty to absolute uncertainty
                 coi: List[Tuple[str, str, str]]=coi_units_unc # List of columns of interest
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        for grp, val, unc in self.coi:
            if grp in tfm.dfs:
                df = tfm.dfs[grp]
                df['UNC'] = self.fn_convert_unc(df, val, unc)
Apply the transformer for callback [NormalizeUncCB](https://franckalbinet.github.io/marisco/handlers/ospar.html#normalizeunccb). Then, print the value (i.e. activity per unit) and standard uncertainty for each sample type.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SplitSedimentValuesCB(coi_sediment),
SanitizeValueCB(coi_val),
NormalizeUncCB(),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(tfm.dfs['SEAWATER'][['VALUE', 'UNC']][:2])
print(tfm.dfs['BIOTA'][['VALUE', 'UNC']][:2])
print(tfm.dfs['SEDIMENT'][['VALUE', 'UNC']][:2])
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
VALUE UNC
0 5.3 1.696
1 19.9 3.980
VALUE UNC
0 0.01014 NaN
1 135.30000 4.83021
VALUE UNC
0 1200.0 240.0
1 250.0 50.0
BIOTA SEAWATER SEDIMENT
Number of rows in original dataframes (dfs): 16124 21634 40744
Number of rows in transformed dataframes (tfm.dfs): 16094 21481 70451
Number of rows removed (tfm.dfs_removed): 30 153 144
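The printed uncertainties can be checked by hand: the absolute uncertainty is the measured value multiplied by the relative uncertainty in percent, divided by 100. The small sketch below does this for the first SEAWATER row (value 5.3, error 32 %) and the first SEDIMENT row (value 1200, error 20 %); the function name rel_to_abs_unc is hypothetical.

```python
import math


def rel_to_abs_unc(value: float, rel_unc_percent: float) -> float:
    """Absolute uncertainty from a value and its relative uncertainty in percent."""
    return value * rel_unc_percent / 100


seawater_unc = rel_to_abs_unc(5.3, 32.0)
sediment_unc = rel_to_abs_unc(1200.0, 20.0)

print(round(seawater_unc, 3))
print(sediment_unc)
```

Both results agree with the UNC column printed above (1.696 and 240.0).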
FEEDBACK TO DATA PROVIDER: The handling of unit types varies between biota and sediment sample types. For consistency and ease of use, it would be beneficial to have dedicated unit columns for all sample types.
Given the inconsistent handling of units across sample types, we need to define custom mapping rules for standardizing the units. The units available in MARIS are:
| | unit_id | unit | unit_sanitized |
---|---|---|---|
0 | -1 | Not applicable | Not applicable |
1 | 0 | NOT AVAILABLE | NOT AVAILABLE |
2 | 1 | Bq/m3 | Bq per m3 |
3 | 2 | Bq/m2 | Bq per m2 |
4 | 3 | Bq/kg | Bq per kg |
5 | 4 | Bq/kgd | Bq per kgd |
6 | 5 | Bq/kgw | Bq per kgw |
7 | 6 | kg/kg | kg per kg |
8 | 7 | TU | TU |
9 | 8 | DELTA/mill | DELTA per mill |
10 | 9 | atom/kg | atom per kg |
11 | 10 | atom/kgd | atom per kgd |
12 | 11 | atom/kgw | atom per kgw |
13 | 12 | atom/l | atom per l |
14 | 13 | Bq/kgC | Bq per kgC |
We define unit renaming rules for HELCOM in an ad hoc way:
RemapUnitCB (lut_units:dict={'SEAWATER': 1, 'SEDIMENT': 4, 'BIOTA': {'D': 4, 'W': 5, 'F': 5}})
Set the unit id column in the DataFrames based on a lookup table.
| | Type | Default | Details |
---|---|---|---|
lut_units | dict | {‘SEAWATER’: 1, ‘SEDIMENT’: 4, ‘BIOTA’: {‘D’: 4, ‘W’: 5, ‘F’: 5}} | Dictionary containing renaming rules for different unit categories |
class RemapUnitCB(Callback):
    "Set the `unit` id column in the DataFrames based on a lookup table."
    def __init__(self,
                 lut_units: dict=lut_units # Dictionary containing renaming rules for different unit categories
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        for grp in tfm.dfs.keys():
            if grp in ['SEAWATER', 'SEDIMENT']:
                tfm.dfs[grp]['UNIT'] = self.lut_units[grp]
            else:
                # Use the instance lookup table (not the module-level one); unknown basis maps to 0
                tfm.dfs[grp]['UNIT'] = tfm.dfs[grp]['basis'].apply(lambda x: self.lut_units[grp].get(x, 0))
Apply the transformer with the RemapUnitCB callback, then print the unique UNIT values for each sample type.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[RemapUnitCB(),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
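for grp in ['BIOTA', 'SEDIMENT', 'SEAWATER']:
    # Read the already transformed dataframes instead of re-running the transformer on each iteration
    print(f"{grp}: {tfm.dfs[grp]['UNIT'].unique()}")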
BIOTA SEAWATER SEDIMENT
Number of rows in original dataframes (dfs): 16124 21634 40744
Number of rows in transformed dataframes (tfm.dfs): 16124 21634 40744
Number of rows removed (tfm.dfs_removed): 0 0 0
BIOTA: [5 0 4]
SEDIMENT: [4]
SEAWATER: [1]
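The BIOTA result above contains a 0 because some rows carry a basis code that is absent from the lookup. The following standalone sketch reproduces that behaviour on a toy dataframe; it reuses the default lut_units shown above, while the sample rows are made up for illustration:

```python
import pandas as pd

# Default lookup from the callback signature above
lut_units = {'SEAWATER': 1, 'SEDIMENT': 4, 'BIOTA': {'D': 4, 'W': 5, 'F': 5}}

# Toy biota rows (illustrative only); the last row has no basis code
biota = pd.DataFrame({'basis': ['W', 'D', 'F', None]})

# Dry weight -> Bq/kgd (4), wet or fresh weight -> Bq/kgw (5), anything else -> 0 (NOT AVAILABLE)
biota['UNIT'] = biota['basis'].map(lambda x: lut_units['BIOTA'].get(x, 0))
print(biota)
```

Seawater and sediment need no per-row lookup because each receives a single constant unit id.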
Detection limits are encoded as follows in MARIS:
id | name | name_sanitized | |
---|---|---|---|
0 | -1 | Not applicable | Not applicable |
1 | 0 | Not Available | Not available |
2 | 1 | = | Detected value |
3 | 2 | < | Detection limit |
4 | 3 | ND | Not detected |
5 | 4 | DE | Derived |
Based on columns of interest for each sample type:
We apply the following business logic to encode the detection limit. RemapDetectionLimitCB creates a DL (detection limit) column with values determined as follows:
1. Look up the value-type (DL) column for the sample type (< VALUE_Bq/m³ or < VALUE_Bq/kg) in the table returned by the function get_detectionlimit_lut.
2. If < VALUE_Bq/m³ or < VALUE_Bq/kg is NaN but both the activity value (VALUE_Bq/m³ or VALUE_Bq/kg) and the standard uncertainty (ERROR%_m³, ERROR%, or ERROR%_kg) are provided, assign the ID 1 (i.e. "Detected value").
3. Set any remaining NaN values in the DL column to 0 (i.e. Not Available).
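The three rules can be tried on a toy dataframe. This is a standalone sketch, not the callback itself: the column names dl_raw, value and unc are hypothetical stand-ins for the provider columns, and lut is a hand-written subset of the MARIS detection limit table shown earlier.

```python
import numpy as np
import pandas as pd

# Hand-written subset of the MARIS detection limit lookup (name -> id)
lut = {'Not Available': 0, '=': 1, '<': 2}

# Toy rows with hypothetical column names
df = pd.DataFrame({
    'dl_raw': ['<', np.nan, np.nan, '?'],
    'value':  [2450.0, 2510.0, np.nan, 3.0],
    'unc':    [np.nan, 29.17, np.nan, np.nan],
})

df['DL'] = df['dl_raw']
known = df['DL'].isin(list(lut))

# Rule 2: no usable DL flag, but value and uncertainty are both present -> '='
detected = df['value'].notna() & df['unc'].notna() & ~known
df.loc[detected, 'DL'] = '='

# Rule 3: everything still unmatched becomes 'Not Available'
df.loc[~df['DL'].isin(list(lut)), 'DL'] = 'Not Available'

# Rule 1: map the names to MARIS ids
df['DL'] = df['DL'].map(lut)
print(df)
```

Note that a row with a value but no uncertainty (the last row) ends up as 0, not 1, because rule 2 requires both.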
RemapDetectionLimitCB (coi:dict, fn_lut:Callable)
Remap value type to MARIS format.
Type | Details | |
---|---|---|
coi | dict | Configuration options for column names |
fn_lut | Callable | Function that returns a lookup table |
class RemapDetectionLimitCB(Callback):
    "Remap value type to MARIS format."
    def __init__(self,
                 coi: dict, # Configuration options for column names
                 fn_lut: Callable # Function that returns a lookup table
                 ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        "Remap detection limits in the DataFrames using the lookup table."
        lut = self.fn_lut()
        for grp in tfm.dfs:
            df = tfm.dfs[grp]
            self._update_detection_limit(df, grp, lut)

    def _update_detection_limit(self,
                                df: pd.DataFrame, # The DataFrame to modify
                                grp: str, # The group name to get the column configuration
                                lut: dict # The lookup table dictionary
                                ) -> None:
        "Update detection limit column in the DataFrame based on lookup table and rules."
        # Check if the group exists in the column configuration passed to the callback
        if grp not in self.coi:
            raise ValueError(f"Group '{grp}' not found in coi configuration.")
        # Access column names from the configuration (self.coi, not the module-level coi_dl)
        value_col = self.coi[grp]['VALUE']
        uncertainty_col = self.coi[grp]['UNC']
        detection_col = self.coi[grp]['DL']
        # Initialize detection limit column
        df['DL'] = df[detection_col]
        # Set detection limits based on conditions
        self._set_detection_limits(df, value_col, uncertainty_col, lut)

    def _set_detection_limits(self, df: pd.DataFrame, value_col: str, uncertainty_col: str, lut: dict) -> None:
        "Set detection limits based on value and uncertainty columns."
        # Condition for setting '='
        # 'DL' defaults to equal (i.e. '=') if there is a value and uncertainty and 'DL' value is not
        # in the lookup table.
        condition_eq = (df[value_col].notna() &
                        df[uncertainty_col].notna() &
                        ~df['DL'].isin(lut.keys())
                        )
        df.loc[condition_eq, 'DL'] = '='
        # Set 'Not Available' for unmatched detection limits
        df.loc[~df['DL'].isin(lut.keys()), 'DL'] = 'Not Available'
        # Perform lookup to map detection limits
        df['DL'] = df['DL'].map(lut)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
SplitSedimentValuesCB(coi_sediment),
SanitizeValueCB(coi_val),
NormalizeUncCB(),
RemapUnitCB(),
RemapDetectionLimitCB(coi_dl, lut_dl),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
for grp in ['BIOTA', 'SEDIMENT', 'SEAWATER']:
print(f'Unique DL values for {grp}: {tfm.dfs[grp]["DL"].unique()}')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
BIOTA SEAWATER SEDIMENT
Number of rows in original dataframes (dfs): 16124 21634 40744
Number of rows in transformed dataframes (tfm.dfs): 16094 21481 70451
Number of rows removed (tfm.dfs_removed): 30 153 144
Unique DL values for BIOTA: [2 1 0]
Unique DL values for SEDIMENT: [1 2 0]
Unique DL values for SEAWATER: [1 2 0]
FEEDBACK TO DATA PROVIDER: Some of the rubin codes in the HELCOM biota dataset differ from those in the RUBIN_NAME lookup table, as shown below. Some are mistyped; others contain trailing spaces:
{'CHAR BALT', 'FUCU SPP', 'FUCU VES ', 'FURC LUMB', 'GADU MOR ', 'STUC PECT'}
We will remap the HELCOM RUBIN column to the MARIS SPECIES column using the IMFA (Inspect, Match, Fix, Apply) pattern. First, let's inspect the RUBIN_NAME.csv file provided by HELCOM, which describes the nomenclature of biota species.
RUBIN_ID | RUBIN | SCIENTIFIC NAME | ENGLISH NAME | |
---|---|---|---|---|
0 | 11 | ABRA BRA | ABRAMIS BRAMA | BREAM |
1 | 12 | ANGU ANG | ANGUILLA ANGUILLA | EEL |
2 | 13 | ARCT ISL | ARCTICA ISLANDICA | ISLAND CYPRINE |
3 | 14 | ASTE RUB | ASTERIAS RUBENS | COMMON STARFISH |
4 | 15 | CARD EDU | CARDIUM EDULE | COCKLE |
Now we try to MATCH the SCIENTIFIC NAME
column of HELCOM
BIOTA
dataset to the species
column of the MARIS species lookup table, again using a Remapper
object:
remapper = Remapper(provider_lut_df=read_csv('RUBIN_NAME.csv'),
maris_lut_fn=species_lut_path,
maris_col_id='species_id',
maris_col_name='species',
provider_col_to_match='SCIENTIFIC NAME',
provider_col_key='RUBIN',
fname_cache='species_helcom.pkl'
)
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 46/46 [00:06<00:00, 7.43it/s]
38 entries matched the criteria, while 8 entries had a match score of 1 or higher.
matched_maris_name | source_name | match_score | |
---|---|---|---|
source_key | |||
STIZ LUC | Sander lucioperca | STIZOSTEDION LUCIOPERCA | 10 |
LAMI SAC | Laminaria japonica | LAMINARIA SACCHARINA | 7 |
CARD EDU | Cardiidae | CARDIUM EDULE | 6 |
CH HI;BA | Macoma balthica | CHARA BALTICA | 6 |
ENCH CIM | Echinodermata | ENCHINODERMATA CIM | 5 |
PSET MAX | Pinctada maxima | PSETTA MAXIMA | 5 |
MACO BAL | Macoma balthica | MACOMA BALTICA | 1 |
STUC PEC | Stuckenia pectinata | STUCKENIA PECTINATE | 1 |
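The scores in the table above behave like an edit distance on case-normalized names: 0 would be an exact match, and MACOMA BALTICA differs from Macoma balthica by a single inserted letter. The sketch below illustrates that idea with a plain Levenshtein distance. It is an assumption about how the score behaves, not the Remapper implementation, and the helper names levenshtein and best_match are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    "Number of single-character insertions, deletions or substitutions turning a into b."
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(name: str, candidates: list) -> tuple:
    "Return (score, candidate) for the closest candidate, ignoring case."
    return min((levenshtein(name.lower(), c.lower()), c) for c in candidates)

# A few MARIS names taken from the table above
maris_names = ['Macoma balthica', 'Sander lucioperca', 'Stuckenia pectinata']
print(best_match('MACOMA BALTICA', maris_names))
print(best_match('STUCKENIA PECTINATE', maris_names))
```

A low non-zero score usually signals a spelling variant that is safe to accept, whereas a high score (as for STIZOSTEDION LUCIOPERCA) signals a synonym that needs a manual fix.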
Below, we will correct the entries that were not properly matched by the Remapper
object:
And give the remapper
another try:
remapper.generate_lookup_table(fixes=fixes_biota_species)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 46/46 [00:06<00:00, 6.78it/s]
42 entries matched the criteria, while 4 entries had a match score of 1 or higher.
matched_maris_name | source_name | match_score | |
---|---|---|---|
source_key | |||
ENCH CIM | Echinodermata | ENCHINODERMATA CIM | 5 |
MACO BAL | Macoma balthica | MACOMA BALTICA | 1 |
STIZ LUC | Sander lucioperca | STIZOSTEDION LUCIOPERCA | 1 |
STUC PEC | Stuckenia pectinata | STUCKENIA PECTINATE | 1 |
Visual inspection suggests that the remaining imperfectly matched entries are acceptable, so we can proceed.
We can now use the generic RemapCB
callback to perform the remapping of the RUBIN
column to the species
column after having defined the lookup table lut_biota
.
lut_biota = lambda: Remapper(provider_lut_df=read_csv('RUBIN_NAME.csv'),
maris_lut_fn=species_lut_path,
maris_col_id='species_id',
maris_col_name='species',
provider_col_to_match='SCIENTIFIC NAME',
provider_col_key='RUBIN',
fname_cache='species_helcom.pkl'
).generate_lookup_table(fixes=fixes_biota_species, as_df=False, overwrite=False)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA')
])
tfm()
tfm.dfs['BIOTA'].columns
tfm.dfs['BIOTA']['SPECIES'].unique()
array([ 99, 243, 50, 139, 270, 192, 191, 284, 84, 269, 122,
96, 287, 279, 278, 288, 286, 244, 129, 275, 271, 285,
283, 247, 120, 59, 280, 274, 273, 290, 289, 272, 277,
276, 21, 282, 110, 281, 245, 704, 1524, 703, 0, 621,
60])
Let’s inspect the TISSUE.csv file provided by HELCOM, which describes the tissue nomenclature. Biota tissue is known as body part in the MARIS dataset.
remapper = Remapper(provider_lut_df=read_csv('TISSUE.csv'),
maris_lut_fn=bodyparts_lut_path,
maris_col_id='bodypar_id',
maris_col_name='bodypar',
provider_col_to_match='TISSUE_DESCRIPTION',
provider_col_key='TISSUE',
fname_cache='tissues_helcom.pkl'
)
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 29/29 [00:00<00:00, 74.81it/s]
21 entries matched the criteria, while 8 entries had a match score of 1 or higher.
matched_maris_name | source_name | match_score | |
---|---|---|---|
source_key | |||
3 | Flesh without bones | WHOLE FISH WITHOUT HEAD AND ENTRAILS | 20 |
2 | Flesh without bones | WHOLE FISH WITHOUT ENTRAILS | 13 |
8 | Soft parts | SKIN/EPIDERMIS | 10 |
5 | Flesh without bones | FLESH WITHOUT BONES (FILETS) | 9 |
1 | Whole animal | WHOLE FISH | 5 |
12 | Brain | ENTRAILS | 5 |
15 | Stomach and intestine | STOMACH + INTESTINE | 3 |
41 | Whole animal | WHOLE ANIMALS | 1 |
We address several entries that were not correctly matched by the Remapper object, as detailed below:
remapper.generate_lookup_table(as_df=True, fixes=fixes_biota_tissues)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 29/29 [00:00<00:00, 99.87it/s]
25 entries matched the criteria, while 4 entries had a match score of 1 or higher.
matched_maris_name | source_name | match_score | |
---|---|---|---|
source_key | |||
5 | Flesh without bones | FLESH WITHOUT BONES (FILETS) | 9 |
1 | Whole animal | WHOLE FISH | 5 |
15 | Stomach and intestine | STOMACH + INTESTINE | 3 |
41 | Whole animal | WHOLE ANIMALS | 1 |
Visual inspection suggests that the remaining imperfectly matched entries are acceptable, so we can proceed.
We can now use the generic RemapCB
callback to perform the remapping of the TISSUE
column to the body_part
column after having defined the lookup table lut_tissues
.
lut_tissues = lambda: Remapper(provider_lut_df=read_csv('TISSUE.csv'),
maris_lut_fn=bodyparts_lut_path,
maris_col_id='bodypar_id',
maris_col_name='bodypar',
provider_col_to_match='TISSUE_DESCRIPTION',
provider_col_key='TISSUE',
fname_cache='tissues_helcom.pkl'
).generate_lookup_table(fixes=fixes_biota_tissues, as_df=False, overwrite=False)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
])
print(tfm()['BIOTA'][['tissue', 'BODY_PART']][:5])
tissue BODY_PART
0 5 52
1 5 52
2 5 52
3 5 52
4 5 52
lut_biogroup_from_biota
reads the file at species_lut_path()
and from the contents of this file creates a dictionary linking species_id
to biogroup_id
.
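A species-to-biogroup dictionary of this kind can be built from any table that holds both columns. The sketch below uses a toy stand-in for the MARIS species lookup table: the ids and names are illustrative only, and the fallback to 0 for unknown species is an assumption.

```python
import pandas as pd

# Toy stand-in for the MARIS species lookup table (illustrative ids only)
species_lut = pd.DataFrame({
    'species_id':  [1, 2, 3],
    'species':     ['species A', 'species B', 'species C'],
    'biogroup_id': [4, 4, 11],
})

# Dictionary linking species_id to biogroup_id
biogroup_by_species = dict(zip(species_lut['species_id'], species_lut['biogroup_id']))

# Apply it to an already remapped SPECIES column; 99 is not in the toy table
biota = pd.DataFrame({'SPECIES': [3, 1, 99]})
biota['BIO_GROUP'] = biota['SPECIES'].map(lambda x: biogroup_by_species.get(x, 0))
print(biota)
```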
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA')
])
print(tfm()['BIOTA']['BIO_GROUP'].unique())
[ 4 2 14 11 8 3 0]
FEEDBACK TO DATA PROVIDER: The SEDI
values 56
and 73
are not found in the SEDIMENT_TYPE.csv
lookup table provided. Note also there are many nan
values in the SEDIMENT_TYPE.csv
file.
We reassign them to -99 for now, but this should be clarified or fixed by the data provider. This is demonstrated below.
df_sed_lut = read_csv('SEDIMENT_TYPE.csv')
dfs = load_data(src_dir, use_cache=True)
sediment_sedi = set(dfs['SEDIMENT']['sedi'].unique())
lookup_sedi = set(df_sed_lut['SEDI'])
missing = sediment_sedi - lookup_sedi
print(f"Missing sediment type values in HELCOM lookup table: {missing if missing else 'None'}")
Missing sediment type values in HELCOM lookup table: {56.0, 73.0, nan}
Once again, we employ the IMFA (Inspect, Match, Fix, Apply) pattern to remap the HELCOM sediment types. Let’s inspect the SEDIMENT_TYPE.csv
file provided by HELCOM describing the sediment type nomenclature:
SEDI | SEDIMENT TYPE | RECOMMENDED TO BE USED | |
---|---|---|---|
0 | -99 | NO DATA | NaN |
1 | 0 | GRAVEL | YES |
2 | 1 | SAND | YES |
3 | 2 | FINE SAND | NO |
4 | 3 | SILT | YES |
Let’s try to match as many as possible:
remapper = Remapper(provider_lut_df=read_csv('SEDIMENT_TYPE.csv'),
maris_lut_fn=sediments_lut_path,
maris_col_id='sedtype_id',
maris_col_name='sedtype',
provider_col_to_match='SEDIMENT TYPE',
provider_col_key='SEDI',
fname_cache='sediments_helcom.pkl'
)
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
We address the remaining unmatched values by adding fixes_sediments:
remapper.generate_lookup_table(as_df=True, fixes=fixes_sediments)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 47/47 [00:00<00:00, 102.08it/s]
44 entries matched the criteria, while 3 entries had a match score of 1 or higher.
matched_maris_name | source_name | match_score | |
---|---|---|---|
source_key | |||
-99 | (Not available) | NO DATA | 2 |
50 | Mud and gravel | MUD AND GARVEL | 2 |
46 | Glacial clay | CLACIAL CLAY | 1 |
A visual inspection of the remaining values shows that they are acceptable to proceed.
RemapSedimentCB (fn_lut:Callable, sed_grp_name:str='SEDIMENT', sed_col_name:str='sedi', replace_lut:dict=None)
Lookup sediment id using lookup table.
Type | Default | Details | |
---|---|---|---|
fn_lut | Callable | Function that returns the lookup table dictionary | |
sed_grp_name | str | SEDIMENT | The name of the sediment group |
sed_col_name | str | sedi | The name of the sediment column |
replace_lut | dict | None | Dictionary for replacing SEDI values |
class RemapSedimentCB(Callback):
    "Lookup sediment id using lookup table."
    def __init__(self,
                 fn_lut: Callable, # Function that returns the lookup table dictionary
                 sed_grp_name: str = 'SEDIMENT', # The name of the sediment group
                 sed_col_name: str = 'sedi', # The name of the sediment column
                 replace_lut: dict = None # Dictionary for replacing SEDI values
                 ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        "Remap sediment types using lookup table."
        df = tfm.dfs[self.sed_grp_name]
        # Fix inconsistent values and get lookup table
        if self.replace_lut:
            df[self.sed_col_name] = df[self.sed_col_name].replace(self.replace_lut)
        lut = self.fn_lut()
        # Map sediment types, defaulting to 0 (Not available) for unmatched values
        # Use the configured column name instead of a hard-coded 'sedi'
        df['SED_TYPE'] = df[self.sed_col_name].map(
            lambda x: lut.get(x, Match(0, None, None, None)).matched_id
        )
lut_sediments = lambda: Remapper(provider_lut_df=read_csv('SEDIMENT_TYPE.csv'),
maris_lut_fn=sediments_lut_path,
maris_col_id='sedtype_id',
maris_col_name='sedtype',
provider_col_to_match='SEDIMENT TYPE',
provider_col_key='SEDI',
fname_cache='sediments_helcom.pkl'
).generate_lookup_table(fixes=fixes_sediments, as_df=False, overwrite=False)
Reassign the SEDI values 56, 73, and nan to -99, which is in turn remapped to 0 (Not available) in compliance with the MARIS nomenclature:
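The definition of sed_replace_lut is not shown in this section. A plausible form, given the description above, is a plain dictionary passed to pandas replace. The sketch below is a standalone illustration on a toy column, and the exact contents of the dictionary are an assumption.

```python
import numpy as np
import pandas as pd

# Assumed replacement rules: codes missing from SEDIMENT_TYPE.csv, plus NaN, become -99
sed_replace_lut = {56: -99, 73: -99, np.nan: -99}

# Toy sedi column (illustrative values); 1 is a valid HELCOM code (SAND)
sedi = pd.Series([56.0, 73.0, np.nan, 1.0])
cleaned = sedi.replace(sed_replace_lut)
print(cleaned.tolist())
```

Valid codes pass through unchanged, so the replacement can be applied before the lookup without side effects.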
Utilize the RemapSedimentCB
callback to remap the SEDI
values in the HELCOM dataset to the corresponding MARIS standard sediment type, referred to as SED_TYPE
.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut)
])
tfm()
tfm.dfs['SEDIMENT']['SED_TYPE'].unique()
array([ 0, 2, 58, 30, 59, 55, 56, 36, 29, 47, 4, 54, 33, 6, 44, 42, 48,
61, 57, 28, 49, 32, 45, 39, 46, 38, 31, 60, 62, 26, 53, 52, 1, 51,
37, 34, 50, 7, 10, 41, 43, 35])
HELCOM filtered status is encoded as follows in the FILT
column:
dfs = load_data(src_dir, use_cache=True)
get_unique_across_dfs(dfs, col_name='filt', as_df=True).head(5)
index | value | |
---|---|---|
0 | 0 | n |
1 | 1 | F |
2 | 2 | N |
3 | 3 | NaN |
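For readers without marisco installed, the idea behind collecting unique values across sample types can be sketched in a few lines. The helper unique_across is hypothetical and is not the get_unique_across_dfs implementation. Unlike the output above, it drops NaN so that the result can be sorted.

```python
import pandas as pd

def unique_across(dfs: dict, col: str) -> list:
    "Sorted unique non-null values of col across all dataframes that contain it."
    found = set()
    for df in dfs.values():
        if col in df.columns:
            found.update(df[col].dropna().unique())
    return sorted(found)

# Toy dataframes keyed by sample type (illustrative values)
toy_dfs = {
    'SEAWATER': pd.DataFrame({'filt': ['n', 'F', None]}),
    'SEDIMENT': pd.DataFrame({'sedi': [1, 2]}),   # no filt column
    'BIOTA':    pd.DataFrame({'filt': ['N', 'F']}),
}
print(unique_across(toy_dfs, 'filt'))
```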
MARIS uses a different encoding for filtered status:
With only four categories to remap, the Remapper is overkill. A simple dictionary is enough to map the values:
RemapFiltCB
converts the HELCOM filt
data to the MARIS FILT
format.
RemapFiltCB (lut_filtered:dict={'N': 2, 'n': 2, 'F': 1})
Lookup filt value in dataframe using the lookup table.
Type | Default | Details | |
---|---|---|---|
lut_filtered | dict | {‘N’: 2, ‘n’: 2, ‘F’: 1} | Dictionary mapping filt codes to their corresponding names |
class RemapFiltCB(Callback):
    "Lookup filt value in dataframe using the lookup table."
    def __init__(self,
                 lut_filtered: dict=lut_filtered, # Dictionary mapping filt codes to their corresponding names
                 ):
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            if 'filt' in df.columns:
                df['FILT'] = df['filt'].map(lambda x: self.lut_filtered.get(x, 0))
For instance:
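A toy filt column shows the effect of the lookup. This is a standalone pandas sketch rather than the callback itself. The dictionary is the default from the signature above, and the sample values mirror the four HELCOM categories listed earlier.

```python
import numpy as np
import pandas as pd

# Default lookup from the RemapFiltCB signature
lut_filtered = {'N': 2, 'n': 2, 'F': 1}

# The four categories found in the HELCOM filt column
filt = pd.Series(['n', 'F', 'N', np.nan])

# Both 'N' and 'n' map to 2, 'F' maps to 1, and missing values fall back to 0
FILT = filt.map(lambda x: lut_filtered.get(x, 0))
print(FILT.tolist())
```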
FEEDBACK FOR NEXT VERSION: Review the inclusion of LAB in the NetCDF output. Note that, with minor updates to dbo_lab.xlsx, it would offer a way to include SMP_ID.
This section could be simplified by including all Helcom ‘LABORATORY’ names in the MARIS standard laboratory names lookup table (dbo_lab.xlsx). For example STUK, KRIL, RISO, etc. are absent from the MARIS standard laboratory names lookup table lab_abb column.
Let's use the get_unique_across_dfs utility function to review the unique laboratory IDs in the HELCOM dataset:
# Transpose to display the dataframe horizontally
get_unique_across_dfs(tfm.dfs, col_name='laboratory', as_df=True).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
value | BFFG | LREB | SAAS | KRIL | EBRS | DHIG | LRPC | LEPA | CLOR | SSSI | ... | VTIG | STUK | NCRS | LVDC | IMGW | JORC | NaN | LVEA | RISO | ERPC |
2 rows × 21 columns
The HELCOM dataset includes a lookup table LABORATORY_NAME.csv
which captures the laboratory names and codes.
LABORATORY | LABORATORY_NAME | START_DATE | END_DATE | COUNTRY | |
---|---|---|---|---|---|
0 | BFFG | BUNDESFORSCHUNGANSTALT FÜR FISCHEREI, GERMANY | 01/01/86 00:00:00 | 12/31/07 00:00:00 | 6 |
1 | CLOR | CENTRAL LABORATORY FOR RADIOLOGICAL PROTECTION, POLAND | 01/01/84 00:00:00 | NaN | 67 |
2 | DHIG | FEDERAL MARITIME AND HYDROGRAPHIC AGENCY, GERMANY | 01/01/84 00:00:00 | NaN | 6 |
3 | EBRS | RADIATION SAFETY DEPARTMENT ENVIRONMENTAL BOARD, ESTONIA | 01/01/10 00:00:00 | NaN | 91 |
4 | EMHI | ESTONIAN METEOROLOGICAL AND HYDROLOGICAL INSTITUTE, ESTONIA | NaN | NaN | 91 |
Let's take a look at the MARIS standard laboratory names:
lab_id | lab_abb | lab | addr_1 | addr_2 | twn_zip | country | tel | e_mail | fax | note | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1 | Not applicable | Not applicable | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 0 | Not available | Not available | NaN | NaN | NaN | Not available | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1 | IAEA-EL | International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ | NaN | P.O. Box No. 800 | MC-98012 Monaco Cedex | Principality of Monaco | NaN | NaN | NaN | NaN | NaN | NaN | update lab set lab = International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ , country = where lab_id = 1 |
3 | 2 | INPAS | Institute of Nuclear Physics - Academy of Sciences | NaN | NaN | Tirana | Albania | NaN | NaN | NaN | NaN | NaN | NaN | update lab set lab = Institute of Nuclear Physics - Academy of Sciences , country = where lab_id = 2 |
FEEDBACK FOR DATA PROVIDER: One entry for the laboratory
column includes a ‘NaN’, see below.
def find_nan_entries(dfs, columns=None):
    """
    Returns a dictionary of DataFrames, each containing the complete rows where any of the specified columns have NaN values from the original DataFrames.

    Parameters:
        dfs (dict): A dictionary where keys are dataset names and values are pandas DataFrames.
        columns (list, optional): A list of column names to check for NaN values. If None, all columns are checked.

    Returns:
        dict: A dictionary where each key is the dataset name and the value is a DataFrame of complete rows that have NaN entries in the specified columns.
    """
    nan_entries = {}
    for key, df in dfs.items():
        # If columns are specified, check these columns for NaN values
        if columns is not None:
            # Find rows with NaN values in the specified columns
            nan_rows = df[columns].isnull().any(axis=1)
        else:
            # Find rows with any NaN values across all columns
            nan_rows = df.isnull().any(axis=1)
        # Use the boolean index to select the complete rows from the original DataFrame
        complete_nan_rows = df[nan_rows]
        if not complete_nan_rows.empty:
            nan_entries[key] = complete_nan_rows
    return nan_entries
nan_lab_df = find_nan_entries(dfs, columns=['laboratory'])
print ('Entries with NaN in the `LABORATORY` column:')
for key, df in nan_lab_df.items():
print(f"{key}: \n{df}")
Entries with NaN in the `LABORATORY` column:
SEAWATER:
key nuclide method < value_bq/m³ value_bq/m³ error%_m³ \
20556 WSSSM2015009 H3 STYR201 < 2450.0 NaN
20557 WSSSM2015010 H3 STYR201 NaN 2510.0 29.17
20558 WSSSM2015011 H3 STYR201 < 2450.0 NaN
20559 WSSSM2015012 H3 STYR201 NaN 1740.0 41.26
20560 WSSSM2015013 H3 STYR201 NaN 1650.0 43.53
20561 WSSSM2015014 H3 STYR201 < 2277.0 NaN
20562 WSSSM2015015 H3 STYR201 < 2277.0 NaN
20563 WSSSM2015016 H3 STYR201 < 2277.0 NaN
date_of_entry_x country laboratory sequence ... longitude (ddmmmm) \
20556 NaN NaN NaN NaN ... NaN
20557 NaN NaN NaN NaN ... NaN
20558 NaN NaN NaN NaN ... NaN
20559 NaN NaN NaN NaN ... NaN
20560 NaN NaN NaN NaN ... NaN
20561 NaN NaN NaN NaN ... NaN
20562 NaN NaN NaN NaN ... NaN
20563 NaN NaN NaN NaN ... NaN
longitude (dddddd) tdepth sdepth salin ttemp filt mors_subbasin \
20556 NaN NaN NaN NaN NaN NaN NaN
20557 NaN NaN NaN NaN NaN NaN NaN
20558 NaN NaN NaN NaN NaN NaN NaN
20559 NaN NaN NaN NaN NaN NaN NaN
20560 NaN NaN NaN NaN NaN NaN NaN
20561 NaN NaN NaN NaN NaN NaN NaN
20562 NaN NaN NaN NaN NaN NaN NaN
20563 NaN NaN NaN NaN NaN NaN NaN
helcom_subbasin date_of_entry_y
20556 NaN NaN
20557 NaN NaN
20558 NaN NaN
20559 NaN NaN
20560 NaN NaN
20561 NaN NaN
20562 NaN NaN
20563 NaN NaN
[8 rows x 27 columns]
SEDIMENT:
key nuclide method < value_bq/kg value_bq/kg error%_kg \
35821 SDHIG2016236 CS137 DHIG03 NaN 8.2952 2.351
< value_bq/m² value_bq/m² error%_m² date_of_entry_x ... lowsli \
35821 NaN 237.500899 NaN 05/13/19 00:00:00 ... NaN
area sedi oxic dw% loi% mors_subbasin helcom_subbasin sum_link \
35821 NaN NaN NaN NaN NaN NaN NaN NaN
date_of_entry_y
35821 NaN
[1 rows x 35 columns]
FEEDBACK FOR NEXT VERSION: Consider integrating combine_lut_columns function into utils.ipynb. I’ve updated the remapper and match_maris_lut functions to accept either a lut_path or a DataFrame. This code could be further simplified by handling the file opening (e.g., pd.read_excel) directly within the remapper function, thereby always passing a DataFrame to match_maris_lut. Refer to the implementation in utils.ipynb for details.
The HELCOM laboratory description includes both the laboratory name and the country. Let's update the maris_lab_lut so that it holds the laboratory name and country in a single column.
combine_lut_columns (lut_path:Callable, combine_cols:List[str]=[])
def combine_lut_columns(lut_path: Callable, combine_cols: List[str] = []):
    if lut_path:
        df_lut = pd.read_excel(lut_path())
        if combine_cols:
            # Combine the specified columns into a single column with space as separator
            df_lut['combined'] = df_lut[combine_cols].astype(str).agg(' '.join, axis=1)
            # Create a column name by joining column names with '_'
            combined_col_name = '_'.join(combine_cols)
            df_lut.rename(columns={'combined': combined_col_name}, inplace=True)
        return df_lut
lab_id | lab_abb | lab | addr_1 | addr_2 | twn_zip | country | tel | e_mail | fax | note | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | lab_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1 | Not applicable | Not applicable | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Not applicable nan |
1 | 0 | Not available | Not available | NaN | NaN | NaN | Not available | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Not available Not available |
2 | 1 | IAEA-EL | International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ | NaN | P.O. Box No. 800 | MC-98012 Monaco Cedex | Principality of Monaco | NaN | NaN | NaN | NaN | NaN | NaN | update lab set lab = International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ , country = where lab_id = 1 | International Atomic Energy Agency - Environment Laboratory _former Marine Environment Laboratory_ Principality of Monaco |
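The first row of the table above ends in "Not applicable nan" because astype(str) turns a missing country into the literal string 'nan'. The toy example below reproduces this and shows one possible way to avoid it. The fillna variant is a suggestion, not what combine_lut_columns does.

```python
import numpy as np
import pandas as pd

# Two rows modelled on the MARIS laboratory table above
df_lut = pd.DataFrame({
    'lab':     ['Not applicable', 'Not available'],
    'country': [np.nan, 'Not available'],
})
cols = ['lab', 'country']

# Same expression as in combine_lut_columns: NaN becomes the string 'nan'
as_is = df_lut[cols].astype(str).agg(' '.join, axis=1)

# Possible alternative: blank out missing values first, then trim the stray space
cleaned = df_lut[cols].fillna('').astype(str).agg(' '.join, axis=1).str.strip()

print(as_is.tolist())
print(cleaned.tolist())
```

A trailing 'nan' adds three spurious characters to the string being fuzzy-matched, which can inflate the match score for laboratories with no country recorded.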
Let’s now create a Remapper instance, which performs fuzzy matching. This instance will match the LABORATORY_NAME column of the HELCOM lookup table to the MARIS standard laboratory names using both the lab and country fields.
remapper = Remapper(provider_lut_df=read_csv('LABORATORY_NAME.csv'),
maris_lut_fn= combine_lut_columns(lut_path=lab_lut_path, combine_cols=['lab','country']),
maris_col_id='lab_id',
maris_col_name='lab_country',
provider_col_to_match='LABORATORY_NAME',
provider_col_key='LABORATORY',
fname_cache='lab_helcom.pkl')
Let's try to match LABORATORY names to MARIS standard laboratory names as automatically as possible. The match_score column lets us assess the results:
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)
Processing: 100%|██████████| 21/21 [00:00<00:00, 67.11it/s]
0 entries matched the criteria, while 21 entries had a match score of 1 or higher.
matched_maris_name | source_name | match_score | |
---|---|---|---|
source_key | |||
SSSI | Central Mining Institute Poland | STATENS STRÅLSKYDDSINSTITUT, SWEDEN | 23 |
KRIL | Polytechnic Institute Romania | V. G. KHLOPIN RADIUM INSTITUTE, RUSSIA | 22 |
STUK | Radiation and Nuclear Safety Authority Finland | SÄTEILYTURVAKESKUS, RADIATION AND NUCLEAR SAFETY AUTHORITY, FINLAND | 21 |
LRPC | Radiation Protection Authority Norway | RADIATION PROTECTION CENTRE, LITHUANIA | 15 |
SAAS | National Board of Nuclear Safety and Radiation Protection Germany | NATIONAL BOARD FOR ATOMIC SAFETY AND RADIATION PROTECTION, GERMANY | 10 |
RISO | Risø National Laboratory - The Radiation Research Department Denmark | RISÖ NATIONAL LABORATORY, RADIATION RESEARCH DEPARTMENT, DENMARK | 8 |
LEPA | Environmental Protection Agency Ireland | ENVIRONMENTAL PROTECTION AGENCY, LITHUANIA | 7 |
NCRS | The Swedish University of Agricultural Sciences Sweden | SWEDISH UNIVERSITY OF AGRICULTURAL SCIENCES, SWEDEN | 5 |
CLOR | Central Laboratory for Radiological Protection Poland | CENTRAL LABORATORY FOR RADIOLOGICAL PROTECTION, POLAND | 4 |
SSSM | SVERIGE S STRÅL SÄKERHETS MYNDIGHETEN Sweden | SVERIGE'S STRÅLSÄKERHETS MYNDIGHETEN, SWEDEN | 3 |
VTIG | JOHAN HEINRICH VON THÜNEN-INSTITUTE Germany | JOHANN HEINRICH VON THÜNEN-INSTITUTE, GERMANY | 2 |
EBRS | Radiation Safety Department, Environmental Board Estonia | RADIATION SAFETY DEPARTMENT ENVIRONMENTAL BOARD, ESTONIA | 2 |
LVEA | Latvian Environment Agency Latvia | LATVIAN ENVIRONMENT AGENCY, LATVIA | 1 |
BFFG | BUNDESFORSCHUNGANSTALT FÜR FISCHEREI Germany | BUNDESFORSCHUNGANSTALT FÜR FISCHEREI, GERMANY | 1 |
LVDC | Environmental Data Center of Latvia Latvia | ENVIRONMENTAL DATA CENTER OF LATVIA, LATVIA | 1 |
JORC | Joint Research Center Lithuania | JOINT RESEARCH CENTER, LITHUANIA | 1 |
IMGW | Institute of Meteorology and Water Management Poland | INSTITUTE OF METEOROLOGY AND WATER MANAGEMENT, POLAND | 1 |
ERPC | Estonian Radiation Protection Centre Estonia | ESTONIAN RADIATION PROTECTION CENTRE, ESTONIA | 1 |
EMHI | Estonian Meteorological and Hydrological Institute Estonia | ESTONIAN METEOROLOGICAL AND HYDROLOGICAL INSTITUTE, ESTONIA | 1 |
DHIG | Federal Maritime and Hydrographic Agency Germany | FEDERAL MARITIME AND HYDROGRAPHIC AGENCY, GERMANY | 1 |
LREB | Lielriga Regional Environmental Board Latvia | LIELRIGA REGIONAL ENVIRONMENTAL BOARD, LATVIA | 1 |
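To build intuition for scores like those in the match_score column above, the sketch below computes a plain Levenshtein edit distance between two lowercased names. It is only an illustration, assuming the score behaves like an edit distance (lower means closer); the helper edit_distance is hypothetical and is not the Remapper implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (hypothetical helper)."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Only a comma separates these two names once case is ignored
d_close = edit_distance('LATVIAN ENVIRONMENT AGENCY, LATVIA',
                        'Latvian Environment Agency Latvia')
# Unrelated names are many edits apart
d_far = edit_distance('STATENS STRÅLSKYDDSINSTITUT, SWEDEN',
                      'Central Mining Institute Poland')
print(d_close, d_far)
```

Under this assumption, a low score points to differences in punctuation or spelling only, while a high score usually calls for a manual check.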
Although the match score is 1 or greater for all entries, many are still matched appropriately. Let’s manually correct the mismatched values by aligning the data provider’s laboratory names with those used in the MARIS LUT.
fixes_lab_names = {
'STATENS STRÅLSKYDDSINSTITUT, SWEDEN': 'Swedish Radiation Safety Authority Sweden',
'V. G. KHLOPIN RADIUM INSTITUTE, RUSSIA': 'V.G. Khlopin Radium Institute - Lab. of Environmental Radioactive Contamination Monitoring Russian Federation',
'ENVIRONMENTAL PROTECTION AGENCY, LITHUANIA': 'Lithuanian Environmental Protection Agency Lithuania',
}
Now, let’s apply the manual corrections, fixes_lab_names, and try again.
remapper.generate_lookup_table(as_df=True, fixes=fixes_lab_names)
remapper.select_match(match_score_threshold=1, verbose=True).head(5)
3 entries matched the criteria, while 18 entries had a match score of 1 or higher.
source_key | matched_maris_name | source_name | match_score |
---|---|---|---|
STUK | Radiation and Nuclear Safety Authority Finland | SÄTEILYTURVAKESKUS, RADIATION AND NUCLEAR SAFETY AUTHORITY, FINLAND | 21 |
LRPC | Radiation Protection Authority Norway | RADIATION PROTECTION CENTRE, LITHUANIA | 15 |
SAAS | National Board of Nuclear Safety and Radiation Protection Germany | NATIONAL BOARD FOR ATOMIC SAFETY AND RADIATION PROTECTION, GERMANY | 10 |
RISO | Risø National Laboratory - The Radiation Research Department Denmark | RISÖ NATIONAL LABORATORY, RADIATION RESEARCH DEPARTMENT, DENMARK | 8 |
NCRS | The Swedish University of Agricultural Sciences Sweden | SWEDISH UNIVERSITY OF AGRICULTURAL SCIENCES, SWEDEN | 5 |
We have successfully matched the laboratory names to the MARIS standard laboratory names. We can now create a lookup table for the laboratory names.
# Create a lookup table for laboratory names
lut_lab = lambda: Remapper(provider_lut_df=read_csv('LABORATORY_NAME.csv'),
maris_lut_fn= combine_lut_columns(lut_path=lab_lut_path, combine_cols=['lab','country']),
maris_col_id='lab_id',
maris_col_name='lab_country',
provider_col_to_match='LABORATORY_NAME',
provider_col_key='LABORATORY',
fname_cache='lab_helcom.pkl').generate_lookup_table(fixes=fixes_lab_names,as_df=False, overwrite=False)
We now use the RemapCB callback to remap the laboratory codes in the laboratory column to MARIS lab_id values (stored in the LAB column) using the lut_lab lookup table.
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER'])
])
tfm()
# For instance:
unique_labs = set()
# Get unique labs from all groups
for grp in ['BIOTA', 'SEAWATER', 'SEDIMENT']:
    if grp in tfm.dfs and 'laboratory' in tfm.dfs[grp].columns:
        # Get values and add to set
        labs = tfm.dfs[grp]['laboratory'].unique()
        unique_labs.update(labs)
print('Example of unique laboratory names: \n', unique_labs)
Example of unique laboratory names:
{'BFFG', 'LREB', 'SAAS', 'KRIL', 'EBRS', 'DHIG', 'LRPC', 'LEPA', 'CLOR', 'SSSI', 'SSSM', 'VTIG', 'STUK', 'NCRS', 'LVDC', 'IMGW', 'JORC', nan, 'LVEA', 'RISO', 'ERPC'}
FEEDBACK FOR NEXT VERSION: Enhancing traceability of NetCDF entries to original samples in the datasource using a standardized SMP_ID.
Context: Previously, the NetCDF output did not include a sample laboratory code (or SMP_ID), limiting our ability to trace data back to its source.
Issue Identified: The KEY column in the HELCOM dataset, which combines a sample type, a laboratory code, and an integer sequence, offers a way to trace data back to the HELCOM source. However, KEY is a string and is not included in our NetCDF output. To allow tracing data back to the HELCOM source, we propose to include an integer SMP_ID in the NetCDF output.
Proposed Solution: For the HELCOM dataset, where the KEY column includes unique codes like WDHIG1996246 (comprising sample type, lab code, and sequence), we propose encoding this into a structured SMP_ID. This SMP_ID will use standardized MARIS Lookup Tables (LUTs) to convert both the sample type and laboratory code into integers.
Implementation Details: The SMP_ID will be formatted such that:
- The first digit indicates the sample type (e.g., 1 for Seawater).
- The next three digits represent the laboratory code (e.g., 313 for DHIG as standardized in dbo_lab.xlsx).
- The remaining digits reflect the integer sequence from the HELCOM KEY.
Example: WDHIG1996246 becomes SMP_ID 13131996246.
Action Required: To adopt this approach, a review and update of the laboratory codes in the LUT (dbo_lab.xlsx) are necessary to ensure consistency and accuracy.
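The proposed encoding can be sketched as follows. The two dictionaries are hypothetical stand-ins holding only the codes quoted in the example above (1 for Seawater, 313 for DHIG); the real values would come from the MARIS LUTs, and the function name key_to_smp_id is introduced here for illustration.

```python
import re

# Hypothetical lookup tables for illustration only; real codes come from MARIS LUTs.
SAMPLE_TYPE_CODES = {'W': 1}   # 1 for Seawater, as in the example above
LAB_CODES = {'DHIG': 313}      # 313 for DHIG, as in the example above

def key_to_smp_id(key: str) -> int:
    """Encode a HELCOM KEY such as 'WDHIG1996246' as an integer SMP_ID."""
    # One letter for the sample type, four letters for the lab, then the integer sequence
    m = re.fullmatch(r'([A-Z])([A-Z]{4})(\d+)', key)
    if m is None:
        raise ValueError(f'Unexpected KEY format: {key!r}')
    smp_type, lab, seq = m.groups()
    return int(f'{SAMPLE_TYPE_CODES[smp_type]}{LAB_CODES[lab]:03d}{seq}')

print(key_to_smp_id('WDHIG1996246'))  # 13131996246
```

Note that such values exceed the 32-bit signed integer range, so the NetCDF variable holding them would need a 64-bit integer type.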
First, we will use check_unique_key_int to show that the integer part of the KEY column is not unique.
def check_unique_key_int(tfm):
    """
    Collects unique 'key' values from the SEAWATER, BIOTA and SEDIMENT DataFrames and
    groups them by their integer component (the part following the first 5 characters).
    Parameters:
    tfm (Transformer): The transformer object containing DataFrames.
    Returns:
    dict: A dictionary with a single entry, 'int_key_map', mapping each integer component to the list of keys sharing it.
    """
    # Define the groups to extract keys from
    groups = ['SEAWATER', 'BIOTA', 'SEDIMENT']
    # Initialize a set to store unique keys
    unique_keys = set()
    # Collect unique keys from each DataFrame
    for grp in groups:
        unique_keys.update(tfm.dfs[grp]['key'].unique())
    # Initialize a dictionary to group keys by their integer components
    int_key_map = {}
    for key in unique_keys:
        # Assuming the integer part starts after the first 5 characters
        int_part = int(key[5:]) if key[5:].isdigit() else None  # Remaining part as integer
        if int_part is not None:
            if int_part not in int_key_map:
                int_key_map[int_part] = []  # Initialize list for this integer part
            int_key_map[int_part].append(key)  # Append the complete key to the list
    return {
        'int_key_map': int_key_map  # Return the mapping of integer parts to complete keys
    }
Below, we will generate a DataFrame where the index (labeled ‘INT COMPONENT OF KEY’) represents the integer portion extracted from the HELCOM KEY. The ‘KEYS’ column lists all the KEY values that include this integer component. Originally, the plan was to use the integer part of the KEY column to create the SMP_ID. However, as demonstrated below, the integer part is not unique, which complicates this approach.
# Create DataFrame from dictionary and set index name and column name
unique_key_df = pd.DataFrame.from_dict(check_unique_key_int(tfm)).rename_axis('INT COMPONENT OF `KEY`')
unique_key_df=unique_key_df.rename(columns={unique_key_df.columns[0]: 'KEYS'})
unique_key_df.head(5)
INT COMPONENT OF `KEY` | KEYS |
---|---|
2010003 | [SCLOR2010003, WSSSI2010003, BRISO2010003, WKRIL2010003, WSTUK2010003, BEBRS2010003, BSTUK2010003, WLEPA2010003, SSSSI2010003, SRISO2010003, BSSSM2010003, WIMGW2010003, BCLOR2010003, SKRIL2010003, SSTUK2010003, WLVEA2010003, SLEPA2010003, BVTIG2010003, SLVEA2010003, WEBRS2010003, WRISO2010003] |
1988170 | [SDHIG1988170] |
2014018 | [WIMGW2014018, SCLOR2014018, BCLOR2014018, SSTUK2014018, SEBRS2014018, WRISO2014018, WSTUK2014018, SSSSM2014018] |
2012009 | [SSTUK2012009, BVTIG2012009, WSTUK2012009, SKRIL2012009, BCLOR2012009, SEBRS2012009, WKRIL2012009, BSTUK2012009, WIMGW2012009, SCLOR2012009, BRISO2012009, WRISO2012009, BSSSM2012009, SLEPA2012009, SLVEA2012009, SSSSM2012009, WLEPA2012009] |
1987561 | [SDHIG1987561] |
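The table above shows that the integer part alone is shared by many keys. The sketch below, using a handful of keys copied from that table, contrasts grouping by the integer part with grouping by sample type, laboratory code and integer part together. The variable names are introduced for illustration, and the result only holds for this small sample.

```python
from collections import defaultdict

keys = ['SCLOR2010003', 'WSSSI2010003', 'BRISO2010003',
        'SDHIG1988170', 'WIMGW2014018', 'SCLOR2014018']

by_int = defaultdict(set)       # integer part -> keys sharing it
by_lab_int = defaultdict(set)   # (type, lab, integer part) -> keys

for key in keys:
    # First character: sample type; next four: lab code; remainder: integer sequence
    smp_type, lab, int_part = key[0], key[1:5], int(key[5:])
    by_int[int_part].add(key)
    by_lab_int[(smp_type, lab, int_part)].add(key)

# Integer parts that map to more than one key
shared = {k: sorted(v) for k, v in by_int.items() if len(v) > 1}
print(shared)
# Whether the full combination identifies a single key in this sample
print(all(len(v) == 1 for v in by_lab_int.values()))
```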
Below we will create a callback AddSampleIDCB to remap the KEY column to the SMP_ID column in each DataFrame.
Remember that in HELCOM, the KEY column is encoded to include the sample type (S=Sediment, W=Seawater, B=Biota), the laboratory code (e.g., DHIG), followed by an integer sequence.
If we update the MARIS LUT (dbo_lab.xlsx) to include laboratory codes (i.e. update the lab_abb column), then the remapping of the LAB and the AddSampleIDCB can be much simpler.
AddSampleIDCB (lut_type:Dict[str,int])
Generate sample id, SMP_ID, from encoded group, encoded LAB and sequence values.
class AddSampleIDCB(Callback):
    "Generate sample id, `SMP_ID`, from encoded group, encoded `LAB` and sequence values."
    def __init__(self, lut_type: Dict[str, int]):
        self.lut_type = lut_type
    def __call__(self, tfm: Transformer):
        for grp in tfm.dfs:
            self._remap_sample_id(tfm.dfs[grp], grp)
    def _remap_sample_id(self, df: pd.DataFrame, grp: str):
        """
        Remaps the 'KEY' column to 'SMP_ID' using the provided lookup table.
        Sets 'SMP_ID' to -1 if 'LAB' or 'SEQUENCE' is NaN.
        Parameters:
        df (pd.DataFrame): The DataFrame to process.
        grp (str): The group key from the DataFrame dictionary, used to access specific LUT values.
        """
        # Check for NaNs in 'LAB' or 'SEQUENCE' and compute 'SMP_ID' conditionally
        df['SMP_ID'] = np.where(
            df['LAB'].isna() | df['sequence'].isna(),
            -1,
            str(self.lut_type[grp]) + df['LAB'].astype(str).str.zfill(3) + df['sequence'].astype(str).str.zfill(7)
        )
        # Convert 'SMP_ID' to integer, handling floating point representations
        df['SMP_ID'] = df['SMP_ID'].apply(lambda x: int(float(x)) if isinstance(x, str) and '.' in x else int(x))
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
AddSampleIDCB(lut_type=SMP_TYPE_LUT),
CompareDfsAndTfmCB(dfs)
])
print(tfm()['SEAWATER']['SMP_ID'].unique())
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
[12112012003 12112012004 12112012005 ... 11102023163 11102023164
11102023165]
BIOTA SEAWATER SEDIMENT
Number of rows in dfs 16124 21634 40744
Number of rows in tfm.dfs 16124 21634 40744
Number of rows removed 0 0 0
The HELCOM dataset includes a column for the sampling depth (SDEPTH) for the SEAWATER and BIOTA datasets. Additionally, it contains a column for the total depth (TDEPTH) applicable to both the SEDIMENT and SEAWATER datasets. In this section, we will create a callback to incorporate both the sampling depth (SMP_DEPTH) and total depth (TOT_DEPTH) into the MARIS dataset.
class AddDepthCB(Callback):
    "Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns."
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            if 'sdepth' in df.columns:
                df['SMP_DEPTH'] = df['sdepth'].astype(float)
            if 'tdepth' in df.columns:
                df['TOT_DEPTH'] = df['tdepth'].astype(float)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[AddDepthCB()])
tfm()
for grp in tfm.dfs.keys():
    if 'SMP_DEPTH' in tfm.dfs[grp].columns and 'TOT_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH','TOT_DEPTH']].drop_duplicates())
    elif 'SMP_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH']].drop_duplicates())
    elif 'TOT_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['TOT_DEPTH']].drop_duplicates())
BIOTA: SMP_DEPTH
0 NaN
78 22.00
88 39.00
96 40.00
183 65.00
... ...
15874 43.10
15921 30.43
15984 7.60
15985 5.50
15988 11.20
[301 rows x 1 columns]
SEAWATER: SMP_DEPTH TOT_DEPTH
0 0.0 NaN
1 29.0 NaN
4 39.0 NaN
6 62.0 NaN
10 71.0 NaN
... ... ...
21059 15.0 15.0
21217 7.0 16.0
21235 19.2 21.0
21312 1.0 5.5
21521 0.5 NaN
[1686 rows x 2 columns]
SEDIMENT: TOT_DEPTH
0 25.0
6 61.0
19 31.0
33 39.0
42 36.0
... ...
35882 3.9
36086 103.0
36449 108.9
36498 4.5
36899 125.0
[195 rows x 1 columns]
FEEDBACK TO DATA PROVIDER
The HELCOM dataset includes a column for the salinity of the water (SALIN). According to the HELCOM documentation, the SALIN column represents “Salinity of water in PSU units”.
In the SEAWATER dataset, three entries have salinity values greater than 50 PSU. While salinity values greater than 50 PSU are possible, these entries may require further verification. Notably, these three entries have a salinity value of 99.99 PSU, which suggests potential data entry errors.
key | nuclide | method | < value_bq/m³ | value_bq/m³ | error%_m³ | date_of_entry_x | country | laboratory | sequence | ... | tdepth | sdepth | salin | ttemp | filt | mors_subbasin | helcom_subbasin | date_of_entry_y | SMP_DEPTH | TOT_DEPTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12288 | WDHIG1998072 | CS137 | 3 | NaN | 40.1 | 1.6 | NaN | 6.0 | DHIG | 1998072.0 | ... | 25.0 | 0.0 | 99.99 | 5.0 | F | 5.0 | 15.0 | NaN | 0.0 | 25.0 |
12289 | WDHIG1998072 | CS134 | 3 | NaN | 1.1 | 23.6 | NaN | 6.0 | DHIG | 1998072.0 | ... | 25.0 | 0.0 | 99.99 | 5.0 | F | 5.0 | 15.0 | NaN | 0.0 | 25.0 |
12290 | WDHIG1998072 | SR90 | 2 | NaN | 8.5 | 1.9 | NaN | 6.0 | DHIG | 1998072.0 | ... | 25.0 | 0.0 | 99.99 | 5.0 | F | 5.0 | 15.0 | NaN | 0.0 | 25.0 |
3 rows × 29 columns
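A filter of the following kind could produce a listing like the one above. This is a minimal sketch on a toy DataFrame: only the first key and its 99.99 value are taken from the table, the other rows are made up for illustration, and the 50 PSU threshold simply mirrors the text.

```python
import pandas as pd

# Toy stand-in for the SEAWATER DataFrame (second and third rows are illustrative)
seawater = pd.DataFrame({
    'key': ['WDHIG1998072', 'WSTUK2012003', 'WRISO2001025'],
    'salin': [99.99, 7.57, None],
})

# Rows whose salinity exceeds the threshold; NaN never satisfies the comparison
threshold_psu = 50
suspect = seawater[seawater['salin'] > threshold_psu]
print(f'Number of entries with salinity greater than {threshold_psu} PSU: {suspect.shape[0]}')
print(suspect)
```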
Let’s add the salinity values to the SEAWATER DataFrame.
AddSalinityCB (salinity_col:str='salin')
Add salinity to the SEAWATER DataFrame.
class AddSalinityCB(Callback):
    "Add salinity to the SEAWATER DataFrame."
    def __init__(self, salinity_col: str = 'salin'):
        self.salinity_col = salinity_col
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            if self.salinity_col in df.columns:
                df['SALINITY'] = df[self.salinity_col].astype(float)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[AddSalinityCB()])
tfm()
for grp in tfm.dfs.keys():
    if 'SALINITY' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SALINITY']].drop_duplicates())
SEAWATER: SALINITY
0 NaN
97 7.570
98 7.210
101 7.280
104 7.470
... ...
21449 11.244
21450 7.426
21451 9.895
21452 2.805
21453 7.341
[2766 rows x 1 columns]
FEEDBACK TO DATA PROVIDER
The HELCOM dataset includes a column for the temperature of the water (TTEMP). According to the HELCOM documentation, the TTEMP column represents: ‘Water temperature in Celsius (ºC) degrees of sampled water’
In the SEAWATER dataset, several entries have temperature values greater than 50ºC. These entries may require further verification. Notably, the entries shown below have a temperature value of 99.9ºC, which suggests potential data entry errors.
t_df= tfm.dfs['SEAWATER'][tfm.dfs['SEAWATER']['ttemp'] > 50]
print('Number of entries with temperature greater than 50ºC: ', t_df.shape[0])
t_df.head()
Number of entries with temperature greater than 50ºC: 92
key | nuclide | method | < value_bq/m³ | value_bq/m³ | error%_m³ | date_of_entry_x | country | laboratory | sequence | ... | longitude (dddddd) | tdepth | sdepth | salin | ttemp | filt | mors_subbasin | helcom_subbasin | date_of_entry_y | SALINITY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5954 | WDHIG1995559 | CS134 | 4 | NaN | 1.7 | 15.0 | NaN | 6.0 | DHIG | 1995559.0 | ... | 10.2033 | 13.0 | 11.0 | 14.81 | 99.9 | N | 5.0 | 15.0 | NaN | 14.81 |
5955 | WDHIG1995559 | CS137 | 4 | NaN | 58.7 | 2.0 | NaN | 6.0 | DHIG | 1995559.0 | ... | 10.2033 | 13.0 | 11.0 | 14.81 | 99.9 | N | 5.0 | 15.0 | NaN | 14.81 |
5960 | WDHIG1995569 | CS134 | 4 | NaN | 1.4 | 12.0 | NaN | 6.0 | DHIG | 1995569.0 | ... | 10.2777 | 14.0 | 12.0 | 14.80 | 99.9 | N | 5.0 | 15.0 | NaN | 14.80 |
5961 | WDHIG1995569 | CS137 | 4 | NaN | 62.8 | 1.0 | NaN | 6.0 | DHIG | 1995569.0 | ... | 10.2777 | 14.0 | 12.0 | 14.80 | 99.9 | N | 5.0 | 15.0 | NaN | 14.80 |
5964 | WDHIG1995571 | CS134 | 4 | NaN | 1.5 | 17.0 | NaN | 6.0 | DHIG | 1995571.0 | ... | 10.2000 | 19.0 | 17.0 | 14.59 | 99.9 | N | 5.0 | 15.0 | NaN | 14.59 |
5 rows × 28 columns
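If values such as 99.9 turn out to be placeholders for missing measurements, one option would be to mask them before encoding. The sketch below is an assumption about how that could be done and is not part of the handler; the helper mask_implausible and the threshold of 50 are hypothetical.

```python
import pandas as pd

def mask_implausible(s: pd.Series, upper: float) -> pd.Series:
    """Replace values above `upper` with NaN, leaving all other values unchanged."""
    return s.mask(s > upper)

# Illustrative temperatures, including one suspicious value and one missing value
ttemp = pd.Series([99.9, 7.8, 6.5, None, 21.54])
cleaned = mask_implausible(ttemp, upper=50)
print(cleaned.tolist())
```

Whether masking is appropriate depends on confirmation from the data provider, so the handler keeps the reported values as they are.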
Let’s add the temperature values to the SEAWATER DataFrame.
AddTemperatureCB (temperature_col:str='ttemp')
Add temperature to the SEAWATER DataFrame.
class AddTemperatureCB(Callback):
    "Add temperature to the SEAWATER DataFrame."
    def __init__(self, temperature_col: str = 'ttemp'):
        self.temperature_col = temperature_col
    def __call__(self, tfm: Transformer):
        for df in tfm.dfs.values():
            if self.temperature_col in df.columns:
                df['TEMPERATURE'] = df[self.temperature_col].astype(float)
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[AddTemperatureCB()])
tfm()
for grp in tfm.dfs.keys():
    if 'TEMPERATURE' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['TEMPERATURE']].drop_duplicates())
SEAWATER: TEMPERATURE
0 NaN
987 7.80
990 6.50
993 4.10
996 4.80
... ...
21521 0.57
21523 18.27
21525 21.54
21529 4.94
21537 2.35
[1086 rows x 1 columns]
The HELCOM dataset includes a look-up table, ANALYSIS_METHOD.csv, which captures the methods used by HELCOM in a free-text description field. Let’s review the analysis method descriptions of the HELCOM dataset.
METHOD | DESCRIPTION | COUNTRY | |
---|---|---|---|
0 | BFFG01 | Gammaspectrometric analysis with Germanium detectors (p-type HGeLi's and HPGe's and 1 n-type HPGe), with efficiency 20-48% Energy resolution 1.8-2.3 keV at 1.33 MeV (not to in use any more) | 6 |
1 | BFFG02 | Sr-90, a) Y-90 extraction method dried ash and added Y-90 + HCl, Ph adjustment and Y-90 extraction with HDEHP in n-heptane b) Modified version of classic nitric acid method (not to in use any more) | 6 |
2 | BFFG03 | Pu238, Pu239241; Ashing and and drying the traces (not to in use any more) | 6 |
Number of unique ANALYSIS_METHOD DESCRIPTION
DISCUSS repetition of counting method in counmet_lut. When should we use each of them?
counmet_id | counmet | code | |
---|---|---|---|
0 | -1 | Not applicable | NaN |
1 | 0 | Not available | 0 |
2 | 1 | Atomic absorption | AA |
3 | 2 | Alpha | ALP |
4 | 3 | Alpha ionization chamber spectrometry | ALPI |
5 | 4 | Alpha liquid scintillation spectrometry | ALPL |
6 | 5 | Alpha semiconductor spectrometry | ALPS |
7 | 6 | Alpha total | ALPT |
8 | 7 | Accelerator mass spectrometry | AMS |
9 | 8 | Beta | BET |
RemapSedSliceTopBottomCB ()
Remap Sediment slice top and bottom to MARIS format.
class RemapSedSliceTopBottomCB(Callback):
    "Remap Sediment slice top and bottom to MARIS format."
    def __call__(self, tfm: Transformer):
        "Copy the SEDIMENT slice limits ('uppsli', 'lowsli') to the 'TOP' and 'BOTTOM' columns."
        tfm.dfs['SEDIMENT']['TOP'] = tfm.dfs['SEDIMENT']['uppsli']
        tfm.dfs['SEDIMENT']['BOTTOM'] = tfm.dfs['SEDIMENT']['lowsli']
FEEDBACK TO DATA PROVIDER: Some entries for the BASIS column of the BIOTA dataset report a value of F, which is not consistent with the HELCOM description provided in the metadata. The GUIDELINES FOR MONITORING OF RADIOACTIVE SUBSTANCES was obtained from here.
Let’s take a look at the BIOTA BASIS values:
Number of entries for each BASIS value:
FEEDBACK TO DATA PROVIDER: Some entries for DW% (Dry weight as percentage (%) of fresh weight) are much higher than 100%. Additionally, DW% is reported as 0% in some cases.
For BIOTA, the number of entries for DW% higher than 100%:
For BIOTA, the number of entries for DW% equal to 0%:
For SEDIMENT, the number of entries for DW% higher than 100%:
For SEDIMENT, the number of entries for DW% equal to 0%:
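Counts of this kind can be obtained with boolean masks on the dw% column. A minimal sketch on a toy DataFrame follows; the values are made up for illustration, and in the notebook the column would come from dfs['BIOTA'] or dfs['SEDIMENT'].

```python
import pandas as pd

# Toy stand-in for a group DataFrame with a 'dw%' column (values are illustrative)
df = pd.DataFrame({'dw%': [18.453, 0.0, 120.0, 25.0, None, 0.0]})

# Summing a boolean mask counts the rows where the condition holds; NaN counts as False
n_above_100 = int((df['dw%'] > 100).sum())
n_zero = int((df['dw%'] == 0).sum())
print(f'Entries with DW% higher than 100%: {n_above_100}')
print(f'Entries with DW% equal to 0%: {n_zero}')
```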
FEEDBACK TO DATA PROVIDER: Several SEDIMENT entries have DW% (Dry weight as percentage of fresh weight) values less than 1%. While technically possible, this would indicate that the samples contained more than 99% water.
For SEDIMENT, the number of entries for DW% less than 1% but greater than 0.001%:
percent=1
dfs['SEDIMENT']['dw%'][(dfs['SEDIMENT']['dw%'] < percent) & (dfs['SEDIMENT']['dw%'] > 0.001)].count()
24
Let’s take a look at the MARIS description of the percentwt, drywt and wetwt variables:
- percentwt: Dry weight as ratio of fresh weight, expressed as a decimal.
- drywt: Dry weight in grams.
- wetwt: Fresh weight in grams.
In the HELCOM dataset, the weight of the sample is not reported for SEDIMENT. However, the percentage dry weight is reported as DW%.
Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'error%_kg',
'< value_bq/m²', 'value_bq/m²', 'error%_m²', 'date_of_entry_x',
'country', 'laboratory', 'sequence', 'date', 'year', 'month', 'day',
'station', 'latitude (ddmmmm)', 'latitude (dddddd)',
'longitude (ddmmmm)', 'longitude (dddddd)', 'device', 'tdepth',
'uppsli', 'lowsli', 'area', 'sedi', 'oxic', 'dw%', 'loi%',
'mors_subbasin', 'helcom_subbasin', 'sum_link', 'date_of_entry_y'],
dtype='object')
The BIOTA dataset reports the weight of the sample as WEIGHT and the percentage dry weight as DW%. The BASIS column describes the basis on which the value is reported.
Index(['key', 'nuclide', 'method', '< value_bq/kg', 'value_bq/kg', 'basis',
'error%', 'number', 'date_of_entry_x', 'country', 'laboratory',
'sequence', 'date', 'year', 'month', 'day', 'station',
'latitude ddmmmm', 'latitude dddddd', 'longitude ddmmmm',
'longitude dddddd', 'sdepth', 'rubin', 'biotatype', 'tissue', 'no',
'length', 'weight', 'dw%', 'loi%', 'mors_subbasin', 'helcom_subbasin',
'date_of_entry_y'],
dtype='object')
LookupDryWetPercentWeightCB ()
Lookup dry-wet ratio and format for MARIS.
class LookupDryWetPercentWeightCB(Callback):
    "Lookup dry-wet ratio and format for MARIS."
    def __call__(self, tfm: Transformer):
        "Iterate through all DataFrames in the transformer object and apply the dry-wet ratio lookup."
        for grp in tfm.dfs.keys():
            if 'dw%' in tfm.dfs[grp].columns:
                self._apply_dry_wet_ratio(tfm.dfs[grp])
            if 'weight' in tfm.dfs[grp].columns and 'basis' in tfm.dfs[grp].columns:
                self._correct_basis(tfm.dfs[grp])
                self._apply_weight(tfm.dfs[grp])
    def _apply_dry_wet_ratio(self, df: pd.DataFrame) -> None:
        "Apply dry-wet ratio conversion and formatting to the given DataFrame."
        df['PERCENTWT'] = df['dw%'] / 100  # Convert percentage to fraction
        df.loc[df['PERCENTWT'] == 0, 'PERCENTWT'] = np.nan  # Convert 0% to NaN (np.NaN was removed in NumPy 2.0)
    def _correct_basis(self, df: pd.DataFrame) -> None:
        "Correct BASIS values. Assuming F = Fresh weight, so F = W"
        df.loc[df['basis'] == 'F', 'basis'] = 'W'
    def _apply_weight(self, df: pd.DataFrame) -> None:
        "Apply weight conversion and formatting to the given DataFrame."
        dry_condition = df['basis'] == 'D'
        wet_condition = df['basis'] == 'W'
        # Reported weight is a dry weight: derive the wet weight where PERCENTWT is known
        df.loc[dry_condition, 'DRYWT'] = df['weight']
        df.loc[dry_condition & df['PERCENTWT'].notna(), 'WETWT'] = df['weight'] / df['PERCENTWT']
        # Reported weight is a wet weight: derive the dry weight where PERCENTWT is known
        df.loc[wet_condition, 'WETWT'] = df['weight']
        df.loc[wet_condition & df['PERCENTWT'].notna(), 'DRYWT'] = df['weight'] * df['PERCENTWT']
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
LookupDryWetPercentWeightCB(),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print('BIOTA:', tfm.dfs['BIOTA'][['PERCENTWT','DRYWT','WETWT']].head(), '\n')
print('SEDIMENT:', tfm.dfs['SEDIMENT']['PERCENTWT'].unique())
BIOTA SEAWATER SEDIMENT
Number of rows in dfs 16124 21634 40744
Number of rows in tfm.dfs 16124 21634 40744
Number of rows removed 0 0 0
BIOTA: PERCENTWT DRYWT WETWT
0 0.18453 174.93444 948.0
1 0.18453 174.93444 948.0
2 0.18453 174.93444 948.0
3 0.18453 174.93444 948.0
4 0.18458 177.93512 964.0
SEDIMENT: [ nan 0.1 0.13 ... 0.24418605 0.25764192 0.26396495]
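The relationship between the three quantities can be checked by hand against the first BIOTA row printed above (PERCENTWT 0.18453, WETWT 948.0, DRYWT 174.93444). The two helper functions below are hypothetical and only restate the arithmetic used by the callback.

```python
def dry_from_wet(wet_g: float, dw_percent: float) -> float:
    """Dry weight (g) from fresh weight (g) and dry weight percentage."""
    return wet_g * dw_percent / 100

def wet_from_dry(dry_g: float, dw_percent: float) -> float:
    """Fresh weight (g) from dry weight (g) and dry weight percentage."""
    return dry_g / (dw_percent / 100)

# A DW% of 18.453 corresponds to PERCENTWT = 0.18453
dry = dry_from_wet(948.0, 18.453)
wet = wet_from_dry(dry, 18.453)
print(round(dry, 5), round(wet, 5))
```

The same formulas also show why a DW% above 100 yields a dry weight larger than the wet weight.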
Note that the dry weight is greater than the wet weight for some entries in the BIOTA dataset because DW% is greater than 100% (see above). Let’s take a look at the number of entries where this is the case:
FEEDBACK TO DATA PROVIDER: Column names for geographical coordinates are inconsistent across sample types (biota, sediment, seawater): some use parentheses, others do not.
dfs = load_data(src_dir, use_cache=True)
for grp in dfs.keys():
    print(f'{grp}: {[col for col in dfs[grp].columns if "lon" in col or "lat" in col]}')
BIOTA: ['latitude ddmmmm', 'latitude dddddd', 'longitude ddmmmm', 'longitude dddddd']
SEAWATER: ['latitude (ddmmmm)', 'latitude (dddddd)', 'longitude (ddmmmm)', 'longitude (dddddd)']
SEDIMENT: ['latitude (ddmmmm)', 'latitude (dddddd)', 'longitude (ddmmmm)', 'longitude (dddddd)']
FEEDBACK TO DATA PROVIDER: HELCOM SEAWATER data includes values of 0 or nan for both latitude and longitude.
ParseCoordinates (fn_convert_cor:Callable)
Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero.
Type | Details | |
---|---|---|
fn_convert_cor | Callable | Function that converts coordinates from degree-minute to decimal degree format |
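As an illustration of the degree-minute to decimal degree conversion expected from fn_convert_cor, here is a minimal sketch. It assumes the ddmmmm columns store degrees before the decimal point and minutes after it, which is consistent with SEAWATER rows shown later in this notebook (for example, longitude (ddmmmm) 10.500 alongside longitude (dddddd) 10.833333). The function ddmm_to_dd_sketch is hypothetical and is not the ddmm_to_dd implementation used by the handler.

```python
import math

def ddmm_to_dd_sketch(ddmm: float) -> float:
    """Convert dd.mmmm (degrees, then minutes after the decimal point) to decimal degrees."""
    frac, degrees = math.modf(ddmm)   # e.g. 10.50 -> (0.50, 10.0)
    minutes = frac * 100              # 0.50 -> 50 minutes
    return degrees + minutes / 60     # 10 + 50/60

lon_a = ddmm_to_dd_sketch(10.500)
lon_b = ddmm_to_dd_sketch(21.030)
print(round(lon_a, 6), round(lon_b, 6))
```

Because math.modf keeps the sign on both parts, the same arithmetic also works for negative coordinates.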
class ParseCoordinates(Callback):
    "Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero."
    def __init__(self,
                 fn_convert_cor: Callable # Function that converts coordinates from degree-minute to decimal degree format
                 ):
        self.fn_convert_cor = fn_convert_cor
    def __call__(self, tfm:Transformer):
        for df in tfm.dfs.values():
            self._format_coordinates(df)
    def _format_coordinates(self, df:pd.DataFrame) -> None:
        coord_cols = self._get_coord_columns(df.columns)
        for coord in ['lat', 'lon']:
            decimal_col, minute_col = coord_cols[f'{coord}_d'], coord_cols[f'{coord}_m']
            # Attempt to convert columns to numeric, coercing errors to NaN.
            df[decimal_col] = pd.to_numeric(df[decimal_col], errors='coerce')
            df[minute_col] = pd.to_numeric(df[minute_col], errors='coerce')
            # Fall back to the degree-minute column where the decimal value is missing or zero
            condition = df[decimal_col].isna() | (df[decimal_col] == 0)
            df[coord.upper()] = np.where(condition,
                                         df[minute_col].apply(self._safe_convert),
                                         df[decimal_col])
        # Drop rows for which no coordinate could be recovered
        df.dropna(subset=['LAT', 'LON'], inplace=True)
    def _get_coord_columns(self, columns) -> dict:
        return {
            'lon_d': self._find_coord_column(columns, 'lon', 'dddddd'),
            'lat_d': self._find_coord_column(columns, 'lat', 'dddddd'),
            'lon_m': self._find_coord_column(columns, 'lon', 'ddmmmm'),
            'lat_m': self._find_coord_column(columns, 'lat', 'ddmmmm')
        }
    def _find_coord_column(self, columns, coord_type, coord_format) -> str:
        # Matches column names with or without parentheses, e.g. 'latitude (dddddd)' or 'latitude dddddd'
        pattern = re.compile(f'{coord_type}.*{coord_format}', re.IGNORECASE)
        matching_columns = [col for col in columns if pattern.search(col)]
        return matching_columns[0] if matching_columns else None
    def _safe_convert(self, value) -> float:
        if pd.isna(value):
            return value
        try:
            return self.fn_convert_cor(value)
        except Exception as e:
            print(f"Error converting value {value}: {e}")
            return value
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
ParseCoordinates(ddmm_to_dd),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['BIOTA'][['LAT','LON']])
BIOTA SEAWATER SEDIMENT
Number of rows in dfs 16124 21634 40744
Number of rows in tfm.dfs 16124 21626 40743
Number of rows removed 0 8 1
LAT LON
0 54.283333 12.316667
1 54.283333 12.316667
2 54.283333 12.316667
3 54.283333 12.316667
4 54.283333 12.316667
... ... ...
16119 61.241500 21.395000
16120 61.241500 21.395000
16121 61.343333 21.385000
16122 61.343333 21.385000
16123 61.343333 21.385000
[16124 rows x 2 columns]
Let’s review the dropped rows for SEAWATER:
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(tfm.dfs_dropped['SEAWATER'])
key | nuclide | method | < value_bq/m³ | value_bq/m³ | error%_m³ | date_of_entry_x | country | laboratory | sequence | date | year | month | day | station | latitude (ddmmmm) | latitude (dddddd) | longitude (ddmmmm) | longitude (dddddd) | tdepth | sdepth | salin | ttemp | filt | mors_subbasin | helcom_subbasin | date_of_entry_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20556 | WSSSM2015009 | H3 | STYR201 | < | 2450.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20557 | WSSSM2015010 | H3 | STYR201 | NaN | 2510.0 | 29.17 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20558 | WSSSM2015011 | H3 | STYR201 | < | 2450.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20559 | WSSSM2015012 | H3 | STYR201 | NaN | 1740.0 | 41.26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20560 | WSSSM2015013 | H3 | STYR201 | NaN | 1650.0 | 43.53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20561 | WSSSM2015014 | H3 | STYR201 | < | 2277.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20562 | WSSSM2015015 | H3 | STYR201 | < | 2277.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20563 | WSSSM2015016 | H3 | STYR201 | < | 2277.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Sanitizing the coordinates drops a row when both longitude and latitude equal 0 or when the data contains unrealistic longitude or latitude values. It also converts the , decimal separator in longitude and latitude values to a . separator.
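The behaviour described above can be sketched as follows. This is not the SanitizeLonLatCB implementation: the function name sanitize_lon_lat_sketch is hypothetical, and "unrealistic" is assumed here to mean a latitude outside [-90, 90] or a longitude outside [-180, 180].

```python
import pandas as pd

def sanitize_lon_lat_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with (0, 0) or out-of-range coordinates; accept ',' as decimal separator."""
    out = df.copy()
    for col in ('LAT', 'LON'):
        # '54,28' -> '54.28', then parse; unparseable values become NaN
        as_text = out[col].astype(str).str.replace(',', '.', regex=False)
        out[col] = pd.to_numeric(as_text, errors='coerce')
    both_zero = (out['LAT'] == 0) & (out['LON'] == 0)
    # Assumed plausibility bounds; NaN values fail the range check and are dropped too
    in_range = out['LAT'].between(-90, 90) & out['LON'].between(-180, 180)
    return out[in_range & ~both_zero]

df = pd.DataFrame({'LAT': ['54,283333', 0, 95.0, 61.2415],
                   'LON': ['12,316667', 0, 20.0, 21.395]})
result = sanitize_lon_lat_sketch(df)
print(result)
```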
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
ParseCoordinates(ddmm_to_dd),
SanitizeLonLatCB(),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['BIOTA'][['LAT','LON']])
BIOTA SEAWATER SEDIMENT
Number of rows in dfs 16124 21634 40744
Number of rows in tfm.dfs 16124 21626 40743
Number of rows removed 0 8 1
LAT LON
0 54.283333 12.316667
1 54.283333 12.316667
2 54.283333 12.316667
3 54.283333 12.316667
4 54.283333 12.316667
... ... ...
16119 61.241500 21.395000
16120 61.241500 21.395000
16121 61.343333 21.385000
16122 61.343333 21.385000
16123 61.343333 21.385000
[16124 rows x 2 columns]
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
ParseTimeCB(),
EncodeTimeCB(),
SplitSedimentValuesCB(coi_sediment),
SanitizeValueCB(coi_val),
NormalizeUncCB(),
RemapUnitCB(),
RemapDetectionLimitCB(coi_dl, lut_dl),
RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA'),
RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut),
RemapFiltCB(lut_filtered),
RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
AddSampleIDCB(lut_type=SMP_TYPE_LUT),
AddDepthCB(),
AddSalinityCB(),
AddTemperatureCB(),
RemapSedSliceTopBottomCB(),
LookupDryWetPercentWeightCB(),
ParseCoordinates(ddmm_to_dd),
SanitizeLonLatCB(),
CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
BIOTA SEAWATER SEDIMENT
Number of rows in dfs 16124 21634 40744
Number of rows in tfm.dfs 16094 21473 70449
Number of rows removed 30 161 144
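The SEDIMENT column can look contradictory at first: the transformed frame has more rows than the original (70449 vs 40744), yet 144 rows are reported as removed. One plausible reading is that the sediment split duplicates rows while removed rows are counted by original index. The toy example below sketches that accounting with plain pandas; it assumes index-based comparison, which is not confirmed here for `CompareDfsAndTfmCB`.

```python
import pandas as pd

# Toy "original" frame: four rows identified by their index.
original = pd.DataFrame({'value': [1.0, None, 3.0, 4.0]}, index=[0, 1, 2, 3])

# Toy "transformed" frame: row 1 was dropped (missing value),
# rows 2 and 3 were each split into two rows that keep the original index.
transformed = pd.DataFrame(
    {'value': [1.0, 3.0, 3.5, 4.0, 4.5]},
    index=[0, 2, 2, 3, 3],
)

n_original = len(original)
n_transformed = len(transformed)

# Rows whose original index no longer appears anywhere in the transformed frame.
dropped = original.loc[~original.index.isin(transformed.index)]

print(n_original, n_transformed, len(dropped))
```

Under this accounting the row count can grow while the number of removed rows is still positive, as in the SEDIMENT column above.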
Let's inspect the rows removed from the SEAWATER data:
grp='SEAWATER' # 'SEAWATER', 'BIOTA' or 'SEDIMENT'
print(f'{grp}, number of dropped rows: {tfm.dfs_dropped[grp].shape[0]}.')
print(f'Viewing dropped rows for {grp}:')
tfm.dfs_dropped[grp]
SEAWATER, number of dropped rows: 161.
Viewing dropped rows for SEAWATER:
| | key | nuclide | method | < value_bq/m³ | value_bq/m³ | error%_m³ | date_of_entry_x | country | laboratory | sequence | ... | longitude (ddmmmm) | longitude (dddddd) | tdepth | sdepth | salin | ttemp | filt | mors_subbasin | helcom_subbasin | date_of_entry_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13439 | WRISO2001025 | CS137 | RISO02 | NaN | NaN | 10.0 | NaN | 26.0 | RISO | 2001025.0 | ... | 10.500 | 10.833333 | 22.0 | 20.0 | 0.00 | NaN | N | 5.0 | 5.0 | NaN |
14017 | WLEPA2002001 | CS134 | LEPA02 | < | NaN | NaN | NaN | 93.0 | LEPA | 2002001.0 | ... | 21.030 | 21.050000 | 16.0 | 0.0 | 3.77 | 14.40 | N | 4.0 | 9.0 | NaN |
14020 | WLEPA2002002 | CS134 | LEPA02 | < | NaN | NaN | NaN | 93.0 | LEPA | 2002004.0 | ... | 20.574 | 20.956667 | 14.0 | 0.0 | 6.57 | 11.95 | N | 4.0 | 9.0 | NaN |
14023 | WLEPA2002003 | CS134 | LEPA02 | < | NaN | NaN | NaN | 93.0 | LEPA | 2002007.0 | ... | 19.236 | 19.393333 | 73.0 | 0.0 | 7.00 | 9.19 | N | 4.0 | 9.0 | NaN |
14026 | WLEPA2002004 | CS134 | LEPA02 | < | NaN | NaN | NaN | 93.0 | LEPA | 2002010.0 | ... | 20.205 | 20.341700 | 47.0 | 0.0 | 7.06 | 8.65 | N | 4.0 | 9.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
21542 | WLRPC2023011 | SR90 | LRPC02 | NaN | NaN | NaN | 05/03/24 00:00:00 | 93.0 | LRPC | 2023011.0 | ... | 20.480 | 20.800000 | 45.0 | 1.0 | 7.22 | 19.80 | N | 4.0 | 9.0 | 05/03/24 00:00:00 |
21543 | WLRPC2023012 | CS137 | LRPC01 | NaN | NaN | NaN | 05/03/24 00:00:00 | 93.0 | LRPC | 2023012.0 | ... | 20.480 | 20.800000 | 45.0 | 1.0 | 7.23 | 8.80 | N | 4.0 | 9.0 | 05/03/24 00:00:00 |
21544 | WLRPC2023012 | SR90 | LRPC02 | NaN | NaN | NaN | 05/03/24 00:00:00 | 93.0 | LRPC | 2023012.0 | ... | 20.480 | 20.800000 | 45.0 | 1.0 | 7.23 | 8.80 | N | 4.0 | 9.0 | 05/03/24 00:00:00 |
21545 | WLRPC2023013 | CS137 | LRPC01 | NaN | NaN | NaN | 05/03/24 00:00:00 | 93.0 | LRPC | 2023013.0 | ... | 20.427 | 20.711700 | 41.0 | 1.0 | 7.23 | 19.30 | N | 4.0 | 9.0 | 05/03/24 00:00:00 |
21546 | WLRPC2023013 | SR90 | LRPC02 | NaN | NaN | NaN | 05/03/24 00:00:00 | 93.0 | LRPC | 2023013.0 | ... | 20.427 | 20.711700 | 41.0 | 1.0 | 7.23 | 19.30 | N | 4.0 | 9.0 | 05/03/24 00:00:00 |
161 rows × 27 columns
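For SEAWATER, the warnings report 8 missing time values and 153 missing measurement values, which together match the 161 dropped rows (this holds only if no row is missing both, which is not verified here). A quick way to tally such reasons on a dropped-rows frame is sketched below on toy data. The frame stands in for `tfm.dfs_dropped['SEAWATER']`, and the `date` column name is hypothetical because the real date column is elided in the table above.

```python
import pandas as pd

# Toy stand-in for tfm.dfs_dropped['SEAWATER']; the 'date' column name is hypothetical.
dropped_toy = pd.DataFrame({
    'key': ['W1', 'W2', 'W3'],
    'value_bq/m³': [None, None, 12.0],
    'date': ['2002-05-01', '2002-05-02', None],
})
dropped_toy['date'] = pd.to_datetime(dropped_toy['date'])

# Count how many dropped rows lack a measurement value or a date.
reasons = pd.Series({
    'missing value': int(dropped_toy['value_bq/m³'].isna().sum()),
    'missing date': int(dropped_toy['date'].isna().sum()),
})
print(reasons)
```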
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
ParseTimeCB(),
EncodeTimeCB(),
SplitSedimentValuesCB(coi_sediment),
SanitizeValueCB(coi_val),
NormalizeUncCB(),
RemapUnitCB(),
RemapDetectionLimitCB(coi_dl, lut_dl),
RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA'),
RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut),
RemapFiltCB(lut_filtered),
RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
AddSampleIDCB(lut_type=SMP_TYPE_LUT),
AddDepthCB(),
AddSalinityCB(),
AddTemperatureCB(),
RemapSedSliceTopBottomCB(),
LookupDryWetPercentWeightCB(),
ParseCoordinates(ddmm_to_dd),
SanitizeLonLatCB(),
])
tfm()
tfm.logs
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column.",
'Remap data provider nuclide names to standardized MARIS nuclide names.',
'Standardize time format across all dataframes.',
'Encode time as seconds since epoch.',
'Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements.',
'Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column.',
'Convert from relative error to standard uncertainty.',
'Set the `unit` id column in the DataFrames based on a lookup table.',
'Remap value type to MARIS format.',
"Remap values from 'rubin' to 'SPECIES' for groups: BIOTA.",
"Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA.",
"Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA.",
'Lookup sediment id using lookup table.',
'Lookup filt value in dataframe using the lookup table.',
"Remap values from 'laboratory' to 'LAB' for groups: BIOTA, SEDIMENT and SEAWATER.",
'Generate sample id, `SMP_ID`, from encoded group, encoded `LAB` and sequence values.',
"Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns.",
'Remap Sediment slice top and bottom to MARIS format.',
'Lookup dry-wet ratio and format for MARIS.',
'Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero.',
'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.']
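Two of the logged steps are simple arithmetic and can be sketched in a few lines. The example below assumes that "epoch" means the Unix epoch (1970-01-01 UTC), that timestamps are read as UTC, and that the `error%` columns hold a percentage of the measured value; none of these is stated explicitly here. The function names are hypothetical and are not the callbacks used by the handler.

```python
from datetime import datetime, timezone

def to_epoch_seconds(iso: str) -> int:
    """Seconds since 1970-01-01T00:00:00 UTC, reading a naive ISO timestamp as UTC (assumption)."""
    dt = datetime.fromisoformat(iso).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def rel_error_to_unc(value: float, error_pct: float) -> float:
    """Absolute uncertainty from a value and its relative error in percent (assumed convention)."""
    return value * error_pct / 100

print(to_epoch_seconds('1984-01-10T00:00:00'))  # earliest sample date reported for this dataset
print(rel_error_to_unc(50.0, 10.0))             # toy value of 50 with a 10 % relative error
```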
get_attrs (tfm:marisco.callbacks.Transformer, zotero_key:str, kw:list=['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides', 'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments', 'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality > Ocean Contaminants', 'Earth Science > Biological Classification > Animals/Vertebrates > Fish', 'Earth Science > Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks', 'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans', 'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'])
Retrieve all global attributes.
| | Type | Default | Details |
|---|---|---|---|
| tfm | Transformer | | Transformer object |
| zotero_key | str | | Zotero dataset record key |
| kw | list | [‘oceanography’, ‘Earth Science > Oceans > Ocean Chemistry> Radionuclides’, ‘Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure’, ‘Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments’, ‘Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes’, ‘Earth Science > Oceans > Water Quality > Ocean Contaminants’, ‘Earth Science > Biological Classification > Animals/Vertebrates > Fish’, ‘Earth Science > Biosphere > Ecosystems > Marine Ecosystems’, ‘Earth Science > Biological Classification > Animals/Invertebrates > Mollusks’, ‘Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans’, ‘Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)’] | List of keywords |
| Returns | dict | | Global attributes |
def get_attrs(
tfm: Transformer, # Transformer object
zotero_key: str, # Zotero dataset record key
kw: list = kw # List of keywords
) -> dict: # Global attributes
"Retrieve all global attributes."
return GlobAttrsFeeder(tfm.dfs, cbs=[
BboxCB(),
DepthRangeCB(),
TimeRangeCB(),
ZoteroCB(zotero_key, cfg=cfg()),
KeyValuePairCB('keywords', ', '.join(kw)),
KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
])()
{'geospatial_lat_min': '31.17',
'geospatial_lat_max': '65.75',
'geospatial_lon_min': '9.6333',
'geospatial_lon_max': '53.5',
'geospatial_bounds': 'POLYGON ((9.6333 53.5, 31.17 53.5, 31.17 65.75, 9.6333 65.75, 9.6333 53.5))',
'geospatial_vertical_max': '437.0',
'geospatial_vertical_min': '0.0',
'time_coverage_start': '1984-01-10T00:00:00',
'time_coverage_end': '2023-11-30T00:00:00',
'id': '26VMZZ2Q',
'title': 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances',
'summary': 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.',
'creator_name': '[{"creatorType": "author", "name": "HELCOM MORS"}]',
'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
'publisher_postprocess_logs': "Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column., Remap data provider nuclide names to standardized MARIS nuclide names., Standardize time format across all dataframes., Encode time as seconds since epoch., Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements., Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column., Convert from relative error to standard uncertainty., Set the `unit` id column in the DataFrames based on a lookup table., Remap value type to MARIS format., Remap values from 'rubin' to 'SPECIES' for groups: BIOTA., Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA., Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA., Lookup sediment id using lookup table., Lookup filt value in dataframe using the lookup table., Remap values from 'laboratory' to 'LAB' for groups: BIOTA, SEDIMENT and SEAWATER., Generate sample id, `SMP_ID`, from encoded group, encoded `LAB` and sequence values., Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns., Remap Sediment slice top and bottom to MARIS format., Lookup dry-wet ratio and format for MARIS., Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}
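Both `keywords` and `publisher_postprocess_logs` are built with `', '.join(...)`. Two of the default `kw` entries already contain a comma, so the joined string cannot be split back into the original list entries. The sketch below shows this on a three-item subset of the default list; it only illustrates the join behaviour and does not suggest a change to the handler.

```python
# Subset of the default `kw` list shown in the signature above.
kw_subset = [
    'oceanography',
    'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
    'Earth Science > Oceans > Water Quality > Ocean Contaminants',
]

joined = ', '.join(kw_subset)    # same join as used for the 'keywords' attribute
recovered = joined.split(', ')   # naive attempt to get the entries back

print(len(kw_subset), len(recovered))
```

A reader of the NetCDF file who splits `keywords` on `', '` therefore gets individual keywords rather than the original list entries, which may or may not be the intent.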
encode (src_dir:str, fname_out_nc:str, **kwargs)
Encode data to NetCDF.
| | Type | Details |
|---|---|---|
| src_dir | str | Input file name |
| fname_out_nc | str | Output file name |
| kwargs | | Additional arguments |
| Returns | None | |
def encode(
src_dir: str, # Input file name
fname_out_nc: str, # Output file name
**kwargs # Additional arguments
) -> None:
"Encode data to NetCDF."
dfs = load_data(src_dir)
tfm = Transformer(dfs, cbs=[
LowerStripNameCB(col_src='nuclide', col_dst='NUCLIDE'),
RemapNuclideNameCB(lut_nuclides, col_name='NUCLIDE'),
ParseTimeCB(),
EncodeTimeCB(),
SplitSedimentValuesCB(coi_sediment),
SanitizeValueCB(coi_val),
NormalizeUncCB(),
RemapUnitCB(),
RemapDetectionLimitCB(coi_dl, lut_dl),
RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='rubin', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_tissues, col_remap='BODY_PART', col_src='tissue', dest_grps='BIOTA'),
RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA'),
RemapSedimentCB(fn_lut=lut_sediments, replace_lut=sed_replace_lut),
RemapFiltCB(lut_filtered),
#RemapCB(fn_lut=lut_lab, col_remap='LAB', col_src='laboratory', dest_grps=['BIOTA','SEDIMENT','SEAWATER']),
#AddSampleIDCB(lut_type=SMP_TYPE_LUT),
AddDepthCB(),
AddSalinityCB(),
AddTemperatureCB(),
RemapSedSliceTopBottomCB(),
LookupDryWetPercentWeightCB(),
ParseCoordinates(ddmm_to_dd),
SanitizeLonLatCB(),
])
tfm()
encoder = NetCDFEncoder(tfm.dfs,
dest_fname=fname_out_nc,
global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
verbose=kwargs.get('verbose', False),
)
encoder.encode()
Warning: 8 missing time value(s) in SEAWATER
Warning: 1 missing time value(s) in SEDIMENT
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
First, let's review the global attributes of the NetCDF file:
{'id': '26VMZZ2Q', 'title': 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances', 'summary': 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.', 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)', 'history': 'TBD', 'keywords_vocabulary': 'GCMD Science Keywords', 'keywords_vocabulary_url': 'https://gcmd.earthdata.nasa.gov/static/kms/', 'record': 'TBD', 'featureType': 'TBD', 'cdm_data_type': 'TBD', 'Conventions': 'CF-1.10 ACDD-1.3', 'publisher_name': 'Paul MCGINNITY, Iolanda OSVATH, Florence DESCROIX-COMANDUCCI', 'publisher_email': 'p.mc-ginnity@iaea.org, i.osvath@iaea.org, F.Descroix-Comanducci@iaea.org', 'publisher_url': 'https://maris.iaea.org', 
'publisher_institution': 'International Atomic Energy Agency - IAEA', 'creator_name': '[{"creatorType": "author", "name": "HELCOM MORS"}]', 'institution': 'TBD', 'metadata_link': 'TBD', 'creator_email': 'TBD', 'creator_url': 'TBD', 'references': 'TBD', 'license': 'Without prejudice to the applicable Terms and Conditions (https://nucleus.iaea.org/Pages/Others/Disclaimer.aspx), I hereby agree that any use of the data will contain appropriate acknowledgement of the data source(s) and the IAEA Marine Radioactivity Information System (MARIS).', 'comment': 'TBD', 'geospatial_lat_min': '31.17', 'geospatial_lon_min': '9.6333', 'geospatial_lat_max': '65.75', 'geospatial_lon_max': '53.5', 'geospatial_vertical_min': '0.0', 'geospatial_vertical_max': '437.0', 'geospatial_bounds': 'POLYGON ((9.6333 53.5, 31.17 53.5, 31.17 65.75, 9.6333 65.75, 9.6333 53.5))', 'geospatial_bounds_crs': 'EPSG:4326', 'time_coverage_start': '1984-01-10T00:00:00', 'time_coverage_end': '2023-11-30T00:00:00', 'local_time_zone': 'TBD', 'date_created': 'TBD', 'date_modified': 'TBD', 'publisher_postprocess_logs': "Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column., Remap data provider nuclide names to standardized MARIS nuclide names., Standardize time format across all dataframes., Encode time as seconds since epoch., Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements., Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column., Convert from relative error to standard uncertainty., Set the `unit` id column in the DataFrames based on a lookup table., Remap value type to MARIS format., Remap values from 'rubin' to 'SPECIES' for groups: BIOTA., Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA., Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA., Lookup sediment id using lookup table., Lookup filt value in dataframe using the lookup table., Ensure depth values are floats and add 
'SMP_DEPTH' and 'TOT_DEPTH' columns., Remap Sediment slice top and bottom to MARIS format., Lookup dry-wet ratio and format for MARIS., Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}
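The geospatial attributes above are worth a consistency check. Read in the usual WKT `x y` (longitude latitude) order, `geospatial_bounds` spans longitudes 9.6333 to 31.17 and latitudes 53.5 to 65.75, whereas the scalar attributes report `geospatial_lat_min` as 31.17 and `geospatial_lon_max` as 53.5. The sketch below recomputes the four scalars from the polygon and lists any that disagree. It assumes the polygon uses longitude-first order and makes no claim about which side is correct.

```python
import re

# Values copied from the global attributes shown above.
attrs = {
    'geospatial_lat_min': '31.17',
    'geospatial_lat_max': '65.75',
    'geospatial_lon_min': '9.6333',
    'geospatial_lon_max': '53.5',
    'geospatial_bounds': 'POLYGON ((9.6333 53.5, 31.17 53.5, 31.17 65.75, 9.6333 65.75, 9.6333 53.5))',
}

# Extract the "x y" pairs of the ring; x is assumed to be longitude, y latitude.
pairs = re.findall(r'(-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)', attrs['geospatial_bounds'])
lons = [float(x) for x, _ in pairs]
lats = [float(y) for _, y in pairs]

from_polygon = {
    'geospatial_lat_min': min(lats), 'geospatial_lat_max': max(lats),
    'geospatial_lon_min': min(lons), 'geospatial_lon_max': max(lons),
}

# Keep only the attributes whose scalar value differs from the polygon-derived one.
mismatches = {k: (attrs[k], v) for k, v in from_polygon.items() if float(attrs[k]) != v}
print(mismatches)
```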
Review the publisher_postprocess_logs.
Convert 'nuclide' column values to lowercase, strip spaces, and store in 'NUCLIDE' column., Remap data provider nuclide names to standardized MARIS nuclide names., Standardize time format across all dataframes., Encode time as seconds since epoch., Separate sediment entries into distinct rows for Bq/kg and Bq/m² measurements., Sanitize measurement values by removing blanks and standardizing to use the `VALUE` column., Convert from relative error to standard uncertainty., Set the `unit` id column in the DataFrames based on a lookup table., Remap value type to MARIS format., Remap values from 'rubin' to 'SPECIES' for groups: BIOTA., Remap values from 'tissue' to 'BODY_PART' for groups: BIOTA., Remap values from 'SPECIES' to 'BIO_GROUP' for groups: BIOTA., Lookup sediment id using lookup table., Lookup filt value in dataframe using the lookup table., Ensure depth values are floats and add 'SMP_DEPTH' and 'TOT_DEPTH' columns., Remap Sediment slice top and bottom to MARIS format., Lookup dry-wet ratio and format for MARIS., Get geographical coordinates from columns expressed in degrees decimal format or from columns in degrees/minutes decimal format where degrees decimal format is missing or zero., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.
Now let's review the enums of the groups in the NetCDF file:
{'BIOTA': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': 
'132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}, 'bio_group': {'Not applicable': '-1', 'Not available': '0', 'Birds': '1', 'Crustaceans': '2', 'Echinoderms': '3', 'Fish': '4', 'Mammals': '5', 'Molluscs': '6', 'Others': '7', 'Plankton': '8', 'Polychaete worms': '9', 'Reptile': '10', 'Seaweeds and plants': '11', 'Cephalopods': '12', 'Gastropods': '13', 'Bivalves': '14'}, 'species': {'NOT AVAILABLE': '0', 'Aristeus antennatus': '1', 'Apostichopus': '2', 'Saccharina japonica var religiosa': '3', 'Siganus fuscescens': '4', 'Alpheus dentipes': '5', 'Hexagrammos agrammus': '6', 'Ditrema temminckii': '7', 'Parapristipoma trilineatum': '8', 'Scombrops boops': '9', 'Pseudopleuronectes schrenki': '10', 'Desmarestia ligulata': '11', 'Saccharina japonica': '12', 'Neodilsea yendoana': '13', 'Costaria costata': '14', 'Sargassum yezoense': '15', 'Acanthephyra pelagica': '16', 'Sargassum ringgoldianum': '17', 'Acanthephyra quadrispinosa': '18', 'Sargassum thunbergii': '19', 'Sargassum patens': '20', 'Asterias rubens': '21', 'Sargassum miyabei': '22', 'Homarus gammarus': '23', 'Acanthephyra stylorostratis': '24', 'Acanthocybium solandri': '25', 'Acanthopagrus bifasciatus': '26', 'Acanthophora muscoides': '27', 'Acanthophora spicifera': '28', 'Acanthurus triostegus': '29', 'Actinopterygii': '30', 'Adamussium colbecki': '31', 'Ahnfeltiopsis densa': '32', 'Alepes 
melanoptera': '33', 'Ampharetidae': '34', 'Anchoviella lepidentostole': '35', 'Anguillidae': '36', 'Aphroditidae': '37', 'Arnoglossus': '38', 'Aurigequula fasciata': '39', 'Balaenoptera musculus': '40', 'Balaenoptera physalus': '41', 'Balistes': '42', 'Beryciformes': '43', 'Bryopsis maxima': '44', 'Callinectes sp': '45', 'Callorhinus ursinus': '46', 'Carassius auratus auratus': '47', 'Carcharhinus sorrah': '48', 'Caridae': '49', 'Clupea harengus': '50', 'Cathorops spixii': '51', 'Caulerpa racemosa': '52', 'Caulerpa scalpelliformis': '53', 'Caulerpa sertularioides': '54', 'Cellana radiata': '55', 'Coscinasterias tenuispina': '56', 'Centroceras clavulatum': '57', 'Centropomus parallelus': '58', 'Crangon crangon': '59', 'Ceramium diaphanum': '60', 'Ceramium rubrum': '61', 'Chaenocephalus aceratus': '62', 'Chaetodipterus faber': '63', 'Chaetomorpha antennina': '64', 'Chaetomorpha linoides': '65', 'Chelidonichthys kumu': '66', 'Chelon ramada': '67', 'Chiloscyllium': '68', 'Chionodraco hamatus': '69', 'Chlamys islandica': '70', 'Chlorophyta': '71', 'Chondrichthyes': '72', 'Chrysaora': '73', 'Cladophora nitellopsis': '74', 'Cladophora vagabunda': '75', 'Cladophoropsis membranacea': '76', 'Clupea': '77', 'Coccotylus truncatus': '78', 'Codium fragile': '79', 'Crassostrea': '80', 'Cynoscion acoupa': '81', 'Cynoscion jamaicensis': '82', 'Cynoscion leiarchus': '83', 'Engraulis encrasicolus': '84', 'Cypselurus agoo agoo': '85', 'Cystophora cristata': '86', 'Cystoseira barbata': '87', 'Cystoseira crinita': '88', 'Decapodiformes': '89', 'Decapterus russelli': '90', 'Decapterus scombrinus': '91', 'Delphinapterus leucas': '92', 'Delphinus capensis': '93', 'Diapterus rhombeus': '94', 'Dicentrarchus punctatus': '95', 'Fucus vesiculosus': '96', 'Funchalia woodwardi': '97', 'Ecklonia bicyclis': '98', 'Gadus morhua': '99', 'Ecklonia kurome': '100', 'Gennadas elegans': '101', 'Eisenia arborea': '102', 'Encrasicholina devisi': '103', 'Enteromorpha': '104', 'Enteromorpha flexuosa': '105', 
'Enteromorpha intestinalis': '106', 'Epinephelinae': '107', 'Epinephelus diacanthus': '108', 'Exocoetidae': '109', 'Saccharina latissima': '110', 'Gracilaria corticata': '111', 'Ligur ensiferus': '112', 'Gracilaria debilis': '113', 'Gracilaria edulis': '114', 'Gracilariales': '115', 'Grateloupia elliptica': '116', 'Grateloupia filicina': '117', 'Lysmata seticaudata': '118', 'Gymnogongrus griffithsiae': '119', 'Mya arenaria': '120', 'Halichoerus grypus': '121', 'Macoma balthica': '122', 'Marthasterias glacialis': '123', 'Halimeda macroloba': '124', 'Harengula clupeola': '125', 'Harpagifer antarcticus': '126', 'Hemifusus ternatanus': '127', 'Hemiramphus brasiliensis': '128', 'Mytilus edulis': '129', 'Metapenaeus affinis': '130', 'Heteroscleromorpha': '131', 'Heterosigma akashiwo': '132', 'Hilsa ilisha': '133', 'Metapenaeus monoceros': '134', 'Metapenaeus stebbingi': '135', 'Holothuria': '136', 'Hoplobrotula armata': '137', 'Hypnea musciformis': '138', 'Merlangius merlangus': '139', 'Iridaea cordata': '140', 'Jania rubens': '141', 'Meganyctiphanes norvegica': '142', 'Johnius glaucus': '143', 'Kappaphycus': '144', 'Kappaphycus alvarezii': '145', 'Laevistrombus canarium': '146', 'Lagenodelphis hosei': '147', 'Lambia': '148', 'Laminaria japonica': '149', 'Laminaria longissima': '150', 'Larimus breviceps': '151', 'Laurencia papillosa': '152', 'Leiognathidae': '153', 'Leiognathus dussumieri': '154', 'Lepidochelys olivacea': '155', 'Leptonychotes weddellii': '156', 'Limanda yokohamae': '157', 'Nephrops norvegicus': '158', 'Neuston': '159', 'Littoraria undulata': '160', 'Loligo vulgaris': '161', 'Lumbrineridae': '162', 'Lutjanus fulviflamma': '163', 'Marginisporum aberrans': '164', 'Megalaspis cordyla': '165', 'Octopus vulgaris': '166', 'Menticirrhus americanus': '167', 'Mesoplodon densirostris': '168', 'Palaemon longirostris': '169', 'Metapenaeus brevicornis': '170', 'Pasiphaea multidentata': '171', 'Pasiphaea sivado': '172', 'Parapenaeopsis stylifera': '173', 'Miichthys 
miiuy': '174', 'Mirounga leonina': '175', 'Brachidontes striatulus': '176', 'Monodon monoceros': '177', 'Mugil platanus': '178', 'Penaeus semisulcatus': '179', 'Mullus barbatus': '180', 'Mycteroperca rubra': '181', 'Philocheras echinulatus': '182', 'Myelophycus simplex': '183', 'Mytilus coruscus': '184', 'Penaeus indicus': '185', 'Natator depressus': '186', 'Pandalus jordani': '187', 'Melicertus kerathurus': '188', 'Parapenaeus longirostris': '189', 'Plesionika': '190', 'Platichthys flesus': '191', 'Pleuronectes platessa': '192', 'Nematopalaemon tenuipes': '193', 'Nematoscelis difficilis': '194', 'Nemipterus': '195', 'Aegaeon lacazei': '196', 'Nephtyidae': '197', 'Nereididae': '198', 'Netuma bilineata': '199', 'Nibea maculata': '200', 'Oceana serrulata': '201', 'Palaemon serratus': '202', 'Ocypode': '203', 'Odobenus rosmarus': '204', 'Ogcocephalus vespertilio': '205', 'Oligoplites saurus': '206', 'Onuphidae': '207', 'Opheliidae': '208', 'Opisthonema oglinum': '209', 'Opisthopterus tardoore': '210', 'Orientomysis mitsukurii': '211', 'Otolithes cuvieri': '212', 'Padina pavonica': '213', 'Padina tetrastromatica': '214', 'Padina vickersiae': '215', 'Pagellus affinis': '216', 'Pagophilus groenlandicus': '217', 'Paguroidea': '218', 'Pagurus': '219', 'Systellaspis debilis': '220', 'Sergestes': '221', 'Sergestes arcticus': '222', 'Pampus argenteus': '223', 'Sergestes arachnipodus': '224', 'Sergestes henseni': '225', 'Sergestes prehensilis': '226', 'Sergestes robustus': '227', 'Pangasius pangasius': '228', 'Panulirus homarus': '229', 'Paracentrotus lividus': '230', 'Pasiphaea sp': '231', 'Pectinariidae': '232', 'Penaeus': '233', 'Phoca vitulina': '234', 'Photopectoralis bindus': '235', 'Phyllospadix iwatensis': '236', 'Plectorhinchus mediterraneus': '237', 'Pleuronectes mochigarei': '238', 'Pleuronectes obscurus': '239', 'Plocamium brasiliense': '240', 'Polynemus paradiseus': '241', 'Polysiphonia': '242', 'Sprattus sprattus': '243', 'Scomber scombrus': '244', 'Polysiphonia 
fucoides': '245', 'Gonostomatidae': '246', 'Perca fluviatilis': '247', 'Pomadasys crocro': '248', 'Porphyra tenera': '249', 'Potamogeton pectinatus': '250', 'Priacanthus hamrur': '251', 'Pseudorhombus malayanus': '252', 'Pterocladiella capillacea': '253', 'Pusa caspica': '254', 'Pusa sibirica': '255', 'Pylaiella littoralis': '256', 'Sabellidae': '257', 'Salangichthys ishikawae': '258', 'Sarconema filiforme': '259', 'Sardinella albella': '260', 'Sardinella brasiliensis': '261', 'Sardinops melanostictus': '262', 'Sargassum cymosum': '263', 'Sargassum linearifolium': '264', 'Sargassum micracanthum': '265', 'Xiphias gladius': '266', 'Sargassum novae hollandiae': '267', 'Sargassum oligocystum': '268', 'Esox lucius': '269', 'Limanda limanda': '270', 'Abramis brama': '271', 'Anguilla anguilla': '272', 'Arctica islandica': '273', 'Cerastoderma edule': '274', 'Cyprinus carpio': '275', 'Echinodermata': '276', 'Fish larvae': '277', 'Myoxocephalus scorpius': '278', 'Osmerus eperlanus': '279', 'Plankton': '280', 'Scophthalmus maximus': '281', 'Rhodophyta': '282', 'Rutilus rutilus': '283', 'Saduria entomon': '284', 'Sander lucioperca': '285', 'Gasterosteus aculeatus': '286', 'Zoarces viviparus': '287', 'Gymnocephalus cernua': '288', 'Furcellaria lumbricalis': '289', 'Cladophora glomerata': '290', 'Lateolabrax japonicus': '291', 'Okamejei kenojei': '292', 'Sebastes pachycephalus': '293', 'Squalus acanthias': '294', 'Gadus macrocephalus': '295', 'Paralichthys olivaceus': '296', 'Ovalipes punctatus': '297', 'Pseudopleuronectes yokohamae': '298', 'Hemitripterus villosus': '299', 'Clidoderma asperrimum': '300', 'Microstomus achne': '301', 'Lepidotrigla microptera': '302', 'Hexagrammos otakii': '303', 'Kareius bicoloratus': '304', 'Pleuronichthys cornutus': '305', 'Enteroctopus dofleini': '306', 'Ammodytes personatus': '307', 'Lophius litulon': '308', 'Eopsetta grigorjewi': '309', 'Takifugu porphyreus': '310', 'Loliolus japonica': '311', 'Sepia andreana': '312', 'Sebastes cheni': 
'313', 'Portunus trituberculatus': '314', 'Sebastes schlegelii': '315', 'Pennahia argentata': '316', 'Platichthys stellatus': '317', 'Gadus chalcogrammus': '318', 'Chelidonichthys spinosus': '319', 'Conger myriaster': '320', 'Heterololigo bleekeri': '321', 'Stichaeus grigorjewi': '322', 'Pseudopleuronectes herzensteini': '323', 'Octopus conispadiceus': '324', 'Hippoglossoides dubius': '325', 'Cleisthenes pinetorum': '326', 'Glyptocephalus stelleri': '327', 'Tanakius kitaharae': '328', 'Nibea mitsukurii': '329', 'Dasyatis matsubarai': '330', 'Verasper moseri': '331', 'Hemitrygon akajei': '332', 'Triakis scyllium': '333', 'Trachurus japonicus': '334', 'Zeus faber': '335', 'Pagrus major': '336', 'Acanthopagrus schlegelii': '337', 'Dentex tumifrons': '338', 'Mustelus manazo': '339', 'Seriola quinqueradiata': '340', 'Hyperoglyphe japonica': '341', 'Carcharhinus': '342', 'Platycephalus': '343', 'Scomber japonicus': '344', 'Squatina japonica': '345', 'Alopias pelagicus': '346', 'Zenopsis nebulosa': '347', 'Cynoglossus joyneri': '348', 'Verasper variegatus': '349', 'Oncorhynchus keta': '350', 'Physiculus japonicus': '351', 'Oplegnathus punctatus': '352', 'Arothron hispidus': '353', 'Stereolepis doederleini': '354', 'Takifugu snyderi': '355', 'Scomber australasicus': '356', 'Liparis tanakae': '357', 'Thamnaconus modestus': '358', 'Gnathophis nystromi': '359', 'Sebastes oblongus': '360', 'Sebastiscus marmoratus': '361', 'Takifugu pardalis': '362', 'Mugil cephalus': '363', 'Ditrema temminckii temminckii': '364', 'Konosirus punctatus': '365', 'Tribolodon brandtii': '366', 'Oncorhynchus masou': '367', 'Aluterus monoceros': '368', 'Todarodes pacificus': '369', 'Myoxocephalus stelleri': '370', 'Myliobatis tobijei': '371', 'Scyliorhinus torazame': '372', 'Lophiomus setigerus': '373', 'Heterodontus japonicus': '374', 'Sebastes vulpes': '375', 'Paraplagusia japonica': '376', 'Ostrea edulis': '377', 'Melanogrammus aeglefinus': '378', 'Pollachius virens': '379', 'Pollachius 
pollachius': '380', 'Sebastes marinus': '381', 'Anarhichas minor': '382', 'Anarhichas denticulatus': '383', 'Reinhardtius hippoglossoides': '384', 'Trisopterus esmarkii': '385', 'Micromesistius poutassou': '386', 'Coryphaenoides rupestris': '387', 'Argentina silus': '388', 'Salmo salar': '389', 'Sebastes viviparus': '390', 'Buccinum undatum': '391', 'Fucus serratus': '392', 'Merluccius merluccius': '393', 'Littorina littorea': '394', 'Fucus': '395', 'Rhodymenia': '396', 'Solea solea': '397', 'Trachurus trachurus': '398', 'Eutrigla gurnardus': '399', 'Pelvetia canaliculata': '400', 'Ascophyllum nodosum': '401', 'Mallotus villosus': '402', 'Pecten maximus': '403', 'Hippoglossoides platessoides': '404', 'Sebastes mentella': '405', 'Modiolus modiolus': '406', 'Boreogadus saida': '407', 'Sepia': '408', 'Gadus': '409', 'Sardina pilchardus': '410', 'Pleuronectiformes': '411', 'Molva molva': '412', 'Patella': '413', 'Crassostrea gigas': '414', 'Dasyatis pastinaca': '415', 'Lophius piscatorius': '416', 'Porphyra umbilicalis': '417', 'Patella vulgata': '418', 'Brosme brosme': '419', 'Glyptocephalus cynoglossus': '420', 'Galeus melastomus': '421', 'Chimaera monstrosa': '422', 'Etmopterus spinax': '423', 'Dicentrarchus labrax': '424', 'Osilinus lineatus': '425', 'Hippoglossus hippoglossus': '426', 'Cyclopterus lumpus': '427', 'Molva dypterygia': '428', 'Microstomus kitt': '429', 'Fucus distichus': '430', 'Tapes': '431', 'Sebastes norvegicus': '432', 'Phycis blennoides': '433', 'Fucus spiralis': '434', 'Laminaria digitata': '435', 'Dipturus batis': '436', 'Anarhichas lupus': '437', 'Lumpenus lampretaeformis': '438', 'Lycodes vahlii': '439', 'Argentina sphyraena': '440', 'Trisopterus minutus': '441', 'Thunnus': '442', 'Hyperoplus lanceolatus': '443', 'Gaidropsarus argentatus': '444', 'Engraulis japonicus': '445', 'Mytilus galloprovincialis': '446', 'Undaria pinnatifida': '447', 'Chlorophthalmus albatrossis': '448', 'Sargassum fusiforme': '449', 'Eisenia bicyclis': '450', 
'Spisula sachalinensis': '451', 'Strongylocentrotus nudus': '452', 'Haliotis discus hannai': '453', 'Dexistes rikuzenius': '454', 'Ruditapes philippinarum': '455', 'Apostichopus japonicus': '456', 'Pterothrissus gissu': '457', 'Helicolenus hilgendorfii': '458', 'Buccinum isaotakii': '459', 'Neptunea intersculpta': '460', 'Apostichopus nigripunctatus': '461', 'Sebastes thompsoni': '462', 'Oratosquilla oratoria': '463', 'Oncorhynchus kisutch': '464', 'Erimacrus isenbeckii': '465', 'Sillago japonica': '466', 'Trachysalambria curvirostris': '467', 'Mytilus unguiculatus': '468', 'Crassostrea nippona': '469', 'Laminariales': '470', 'Uroteuthis edulis': '471', 'Takifugu poecilonotus': '472', 'Neptunea arthritica': '473', 'Katsuwonus pelamis': '474', 'Doederleinia berycoides': '475', 'Metapenaeopsis dalei': '476', 'Seriola dumerili': '477', 'Pseudorhombus pentophthalmus': '478', 'Stephanolepis cirrhifer': '479', 'Cookeolus japonicus': '480', 'Panulirus japonicus': '481', 'Thunnus orientalis': '482', 'Halocynthia roretzi': '483', 'Etrumeus sadina': '484', 'Cololabis saira': '485', 'Coryphaena hippurus': '486', 'Sarda orientalis': '487', 'Octopus ocellatus': '488', 'Sardinops sagax': '489', 'Sphyraena pinguis': '490', 'Sebastes ventricosus': '491', 'Occella iburia': '492', 'Glossanodon semifasciatus': '493', 'Mizuhopecten yessoensis': '494', 'Neosalangichthys ishikawae': '495', 'Bothrocara tanakae': '496', 'Malacocottus zonurus': '497', 'Coelorinchus macrochir': '498', 'Neptunea constricta': '499', 'Beringius polynematicus': '500', 'Sebastes nivosus': '501', 'Pandalus eous': '502', 'Synaphobranchus kaupii': '503', 'Sebastolobus macrochir': '504', 'Marsupenaeus japonicus': '505', 'Japelion hirasei': '506', 'Pleurogrammus azonus': '507', 'Monostroma nitidum': '508', 'Atheresthes evermanni': '509', 'Takifugu rubripes': '510', 'Chionoecetes opilio': '511', 'Pandalopsis coccinata': '512', 'Chionoecetes japonicus': '513', 'Sebastes matsubarae': '514', 'Scombrops gilberti': '515', 
'Hyporhamphus sajori': '516', 'Trichiurus lepturus': '517', 'Alcichthys elongatus': '518', 'Volutharpa perryi': '519', 'Mercenaria stimpsoni': '520', 'Berryteuthis magister': '521', 'Aptocyclus ventricosus': '522', 'Euphausia pacifica': '523', 'Salangichthys microdon': '524', 'Telmessus acutidens': '525', 'Ceratophyllum demersum': '526', 'Pandalus nipponensis': '527', 'Sebastes owstoni': '528', 'Cociella crocodilus': '529', 'Conger japonicus': '530', 'Sardinella zunasi': '531', 'Cheilopogon pinnatibarbatus japonicus': '532', 'Oplegnathus fasciatus': '533', 'Macridiscus aequilatera': '534', 'Repomucenus ornatipinnis': '535', 'Clupea pallasii': '536', 'Scorpaena neglecta': '537', 'Scomberomorus niphonius': '538', 'Leucopsarion petersii': '539', 'Sebastes scythropus': '540', 'Strongylura anastomella': '541', 'Laemonema longipes': '542', 'Fusitriton oregonensis': '543', 'Japelion pericochlion': '544', 'Sebastes steindachneri': '545', 'Auxis rochei': '546', 'Lobotes surinamensis': '547', 'Auxis thazard': '548', 'Chlorophthalmus borealis': '549', 'Etelis coruscans': '550', 'Sebastes inermis': '551', 'Cynoglossus interruptus': '552', 'Erilepis zonifer': '553', 'Tridentiger obscurus': '554', 'Caranx sexfasciatus': '555', 'Thunnus thynnus': '556', 'Takifugu stictonotus': '557', 'Euthynnus affinis': '558', 'Synagrops japonicus': '559', 'Okamejei schmidti': '560', 'Suggrundus meerdervoortii': '561', 'Sebastes baramenuke': '562', 'Pleurogrammus monopterygius': '563', 'Decapterus maruadsi': '564', 'Girella punctata': '565', 'Sphyraena japonica': '566', 'Ommastrephes bartramii': '567', 'Sepiella japonica': '568', 'Sepioteuthis lessoniana': '569', 'Eucleoteuthis luminosa': '570', 'Gloiopeltis furcata': '571', 'Macrobrachium nipponense': '572', 'Sepia kobiensis': '573', 'Eriocheir japonica': '574', 'Magallana nippona': '575', 'Meretrix lusoria': '576', 'Chondrus ocellatus': '577', 'Chondrus elatus': '578', 'Gloiopeltis': '579', 'Holothuroidea': '580', 'Corbicula japonica': '581', 
'Sunetta menstrualis': '582', 'Pseudorhombus cinnamoneus': '583', 'Takifugu niphobles': '584', 'Lagocephalus gloveri': '585', 'Beryx splendens': '586', 'Parastichopus nigripunctatus': '587', 'Venerupis philippinarum': '588', 'Haliotis': '589', 'Liparis agassizii': '590', 'Seriola lalandi': '591', 'Niphon spinosus': '592', 'Pleuronichthys japonicus': '593', 'Sergia lucens': '594', 'Sphoeroides pachygaster': '595', 'Coryphaenoides acrolepis': '596', 'Pseudopleuronectes obscurus': '597', 'Pyropia yezoensis': '598', 'Isurus oxyrinchus': '599', 'Sargassum fulvellum': '600', 'Prionace glauca': '601', 'Kajikia audax': '602', 'Thunnus albacares': '603', 'Thunnus alalunga': '604', 'Thunnus obesus': '605', 'Lamna ditropis': '606', 'Glyptocidaris crenularis': '607', 'Asterias amurensis': '608', 'Sepiida': '609', 'Congridae': '610', 'Takifugu': '611', 'Sargassum horneri': '612', 'Haliotis discus': '613', 'Pleuronectidae': '614', 'Acanthogobius flavimanus': '615', 'Acanthogobius lactipes': '616', 'Pholis nebulosa': '617', 'Hemigrapsus penicillatus': '618', 'Palaemon paucidens': '619', 'Mysidae': '620', 'Zostera marina': '621', 'Ulva pertusa': '622', 'Gobiidae': '623', 'Atherinidae': '624', 'Tribolodon': '625', 'Alpheus': '626', 'Polychaeta': '627', 'Sebastes': '628', 'Charybdis japonica': '629', 'Hemigrapsus': '630', 'Favonigobius gymnauchen': '631', 'Palaemon': '632', 'Planiliza haematocheila': '633', 'Palaemonidae': '634', 'Pholis crassispina': '635', 'Laminaria': '636', 'Distolasterias nipon': '637', 'Lophiiformes': '638', 'Alpheus brevicristatus': '639', 'Undaria undariodes': '640', 'Neomysis awatschensis': '641', 'Alpheidae': '642', 'Macrobrachium': '643', 'Hediste': '644', 'Gymnogobius breunigii': '645', 'Luidia quinaria': '646', 'Rhizoprionodon acutus': '647', 'Carangoides equula': '648', 'Carcinoplax longimana': '649', 'Anomura': '650', 'Spatangoida': '651', 'Plesiobatis daviesi': '652', 'Eusphyra blochii': '653', 'Ruditapes variegata': '654', 'Sinonovacula constricta': 
'655', 'Penaeus monodon': '656', 'Litopenaeus vannamei': '657', 'Solenocera crassicornis': '658', 'Stomatopoda': '659', 'Teuthida': '660', 'Octopus': '661', 'Larimichthys polyactis': '662', 'Scomberomorini': '663', 'Channa argus': '664', 'Ranina ranina': '665', 'Lates calcarifer': '666', 'Scomberomorus commerson': '667', 'Lutjanus malabaricus': '668', 'Thenus parindicus': '669', 'Amusium pleuronectes': '670', 'Loligo': '671', 'Plectropomus leopardus': '672', 'Sillago ciliata': '673', 'Scylla serrata': '674', 'Pinctada maxima': '675', 'Lutjanus argentimaculatus': '676', 'Protonibea diacanthus': '677', 'Polydactylus macrochir': '678', 'Rachycentron canadum': '679', 'Ibacus peronii': '680', 'Arripis trutta': '681', 'Sarda australis': '682', 'Seriola hippos': '683', 'Choerodon schoenleinii': '684', 'Panulirus ornatus': '685', 'Neotrygon kuhlii': '686', 'Lethrinus nebulosus': '687', 'Parupeneus multifasciatus': '688', 'Saccostrea cucullata': '689', 'Lutjanus sebae': '690', 'Thunnus maccoyii': '691', 'Acanthopagrus butcheri': '692', 'Lambis lambis': '693', 'Gerres subfasciatus': '694', 'Zooplankton': '695', 'Phytoplankton': '696', 'Rapana venosa': '697', 'Scapharca inaequivalvis': '698', 'Ulva intestinalis': '699', 'Ulva linza': '700', 'Ceramium virgatum': '701', 'Gayralia oxysperma': '702', 'Vertebrata fucoides': '703', 'Stuckenia pectinata': '704', 'Rochia nilotica': '705', 'Ctenochaetus striatus': '706', 'Serranidae': '707', 'Turbo setosus': '708', 'Pandalidae': '709', 'Gymnosarda unicolor': '710', 'Epinephelini': '711', 'Pisces': '712', 'Liza klunzingeri': '713', 'Acanthopagrus latus': '714', 'Liza subviridis': '715', 'Sparidentex hasta': '716', 'Otolithes ruber': '717', 'Crenidens crenidens': '718', 'Ensis': '719', 'Gastropoda': '720', 'Euheterodonta': '721', 'Scomber': '722', 'Theragra chalcogramma': '723', 'Engraulidae': '724', 'Ostreidae': '725', 'Phaeophyceae': '726', 'Porphyra': '727', 'Ulva reticulata': '728', 'Perna viridis': '729', 'Fenneropenaeus indicus': 
'730', 'Merluccius': '731', 'Soleidae': '732', 'Mugilidae': '733', 'Marine algae': '734', 'Scarus rivulatus': '735', 'Scarus coeruleus': '736', 'Sardinella fimbriata': '737', 'Dussumieria acuta': '738', 'Lutjanus kasmira': '739', 'Lutjanus rivulatus': '740', 'Lutjanus bohar': '741', 'Priacanthus blochii': '742', 'Pelates quadrilineatus': '743', 'Epinephelus fasciatus': '744', 'Upeneus vittatus': '745', 'Lethrinus laticaudis': '746', 'Lethrinus lentjan': '747', 'Lethrinus microdon': '748', 'Sphyraena barracuda': '749', 'Alectis indica': '750', 'Epinephelus latifasciatus': '751', 'Nemipterus japonicus': '752', 'Raconda russeliana': '753', 'Lactarius lactarius': '754', 'Aetomylaeus bovinus': '755', 'Pennahia anea': '756', 'Leiognathus fasciatus': '757', 'Sardinella longiceps': '758', 'Tenualosa ilisha': '759', 'Pellona ditchela': '760', 'Stolephorus indicus': '761', 'Setipinna breviceps': '762', 'Rastrelliger kanagurta': '763', 'Chanos chanos': '764', 'Lepturacanthus savala': '765', 'Epinephelus niveatus': '766', 'Lutjanus johnii': '767', 'Carangoides malabaricus': '768', 'Ablennes hians': '769', 'Chirocentrus dorab': '770', 'Scomberomorus cavalla': '771', 'Scomberomorus semifasciatus': '772', 'Scomberomorus guttatus': '773', 'Etrumeus teres': '774', 'Spondyliosoma cantharus': '775', 'Brama brama': '776', 'Dasyatis zugei': '777', 'Harpadon nehereus': '778', 'Carcharhinus melanopterus': '779', 'Penaeus plebejus': '780', 'Sepia officinalis': '781', 'Johnius dussumieri': '782', 'Lutjanus campechanus': '783', 'Ruditapes decussatus': '784', 'Carcinus aestuarii': '785', 'Squilla mantis': '786', 'Epinephelus polyphekadion': '787', 'Lutjanus gibbus': '788', 'Lethrinus mahsena': '789', 'Epinephelus chlorostigma': '790', 'Carangoides bajad': '791', 'Aethaloperca rogaa': '792', 'Atule mate': '793', 'Macolor niger': '794', 'Carangoides fulvoguttatus': '795', 'Plectropomus areolatus': '796', 'Cephalopholis argus': '797', 'Cephalopholis': '798', 'Scarus sordidus': '799', 
'Scomberomorus tritor': '800', 'Triaenodon obesus': '801', 'Pomadasys commersonnii': '802', 'Monotaxis grandoculis': '803', 'Plectropomus maculatus': '804', 'Trachinotus blochii': '805', 'Pristipomoides filamentosus': '806', 'Acanthurus gahhm': '807', 'Acanthurus sohal': '808', 'Siganus argenteus': '809', 'Naso unicornis': '810', 'Chanos': '811', 'Oedalechilus labiosus': '812', 'Plectorhinchus gaterinus': '813', 'Mercenaria mercenaria': '814', 'Mytilus': '815', 'Turbo cornutus': '816', 'Decapoda': '817', 'Sphyraena': '818', 'Arius maculatus': '819', 'Penaeus merguiensis': '820', 'Tegillarca granosa': '821', 'Mullus barbatus barbatus': '822', 'Chamelea gallina': '823', 'Metanephrops thomsoni': '824', 'Magallana gigas': '825', 'Branchiostegus japonicus': '826', 'Cephalopoda': '827', 'Lutjanidae': '828', 'Lethrinidae': '829', 'Sphyraena argentea': '830', 'Chirocentrus nudus': '831', 'Trachinotus': '832', 'Mugil auratus': '833', 'Euthynnus alletteratus': '834', 'Sparus aurata': '835', 'Pagrus caeruleostictus': '836', 'Scorpaena scrofa': '837', 'Pagellus erythrinus': '838', 'Epinephelus aeneus': '839', 'Dentex maroccanus': '840', 'Caranx rhonchus': '841', 'Sardinella': '842', 'Siganus': '843', 'Solea': '844', 'Diplodus sargus': '845', 'Lithognathus mormyrus': '846', 'Oblada melanura': '847', 'Siganus rivulatus': '848', 'Chelon labrosus': '849', 'Cynoscion microlepidotus': '850', 'Genypterus brasiliensis': '851', 'Myoxocephalus polyacanthocephalus': '852', 'Hexagrammos lagocephalus': '853', 'Hexagrammos decagrammus': '854', 'Sebastes ciliatus': '855', 'Lepidopsetta polyxystra': '856', 'Clupeiformes': '857', 'Gadidae': '858', 'Brachyura': '859', 'Dasyatis': '860', 'Carcharias': '861', 'Saurida': '862', 'Upeneus': '863', 'Cynoglossus': '864', 'Scomberomorus': '865', 'Terapon': '866', 'Leiognathus': '867', 'Terapontidae': '868', 'Caranx': '869', 'Diplodus': '870', 'Plectorhinchus flavomaculatus': '871', 'Salmonidae': '872', 'Mollusca': '873', 'Boops boops': '874', 'Sarpa 
salpa': '875', 'Pagellus acarne': '876', 'Spicara smaris': '877', 'Diplodus vulgaris': '878', 'Chelidonichthys lucerna': '879', 'Sarda sarda': '880', 'Serranus cabrilla': '881', 'Diplodus annularis': '882', 'Pagrus pagrus': '883', 'Alosa fallax': '884', 'Belone belone': '885', 'Dentex dentex': '886', 'Sphyraena viridensis': '887', 'Trisopterus capelanus': '888', 'Arnoglossus laterna': '889', 'Procambarus clarkii': '890', 'Nemadactylus macropterus': '891', 'Pagrus auratus': '892', 'Jasus edwardsii': '893', 'Perna canaliculus': '894', 'Pseudophycis bachus': '895', 'Haliotis iris': '896', 'Hoplostethus atlanticus': '897', 'Rhombosolea leporina': '898', 'Zygochlamys delicatula': '899', 'Galeorhinus galeus': '900', 'Parapercis colias': '901', 'Tiostrea chilensis': '902', 'Genypterus blacodes': '903', 'Evechinus chloroticus': '904', 'Austrovenus stutchburyi': '905', 'Micromesistius australis': '906', 'Macruronus novaezelandiae': '907', 'Nototodarus': '908', 'Perna perna': '909', 'Sepia pharaonis': '910', 'Turbo bruneus': '911', 'Portunus sanguinolentus': '912', 'Charybdis natator': '913', 'Charybdis lucifera': '914', 'Panulirus argus': '915', 'Ethmalosa fimbriata': '916', 'Sardinella brachysoma': '917', 'Thryssa mystax': '918', 'Plicofollis dussumieri': '919', 'Nibea soldado': '920', 'Epinephelus melanostigma': '921', 'Megalops cyprinoides': '922', 'Decapterus macarellus': '923', 'Drepane punctata': '924', 'Sillago sihama': '925', 'Tylosurus crocodilus crocodilus': '926', 'Saurida tumbil': '927', 'Cynoglossus macrostomus': '928', 'Parupeneus indicus': '929', 'Synechogobius hasta': '930', 'Busycotypus canaliculatus': '931', 'Pampus cinereus': '932', 'Pomadasys kaakan': '933', 'Epinephelus coioides': '934', 'Sepiella inermis': '935', 'Uroteuthis duvauceli': '936', 'Stomatella auricula': '937', 'Cerithium scabridum': '938', 'Marcia recens': '939', 'Circe intermedia': '940', 'Marcia opima': '941', 'Fulvia fragile': '942', 'Charybdis feriatus': '943', 'Charybdis annulata': 
'944', 'Atergatis integerrimus': '945', 'Matuta lunaris': '946', 'Calappa lophos': '947', 'Uca annulipes': '948', 'Chlamys varia': '949', 'Cololabis adocetus': '950', 'Seriola lalandi dorsalis': '951', 'Brunneifusus ternatanus': '952', 'Metapenaeus joyneri': '953', 'Epinephelus tauvina': '954', 'Coilia dussumieri': '955', 'Carcharhinus dussumieri': '956', 'Upeneus tragula': '957', 'Sartoriana spinigera': '958', 'Lamellidens marginalis': '959', 'Polydactylus sextarius': '960', 'Johnius macrorhynus': '961', 'Hexanematichthys sagor': '962', 'Sargassum swartzii': '963', 'Argyrops spinifer': '964', 'Synodus intermedius': '965', 'Muraenesox cinereus': '966', 'Carangoides armatus': '967', 'Eleutheronema tetradactylum': '968', 'Mustelus mosis': '969', 'Nemipterus bipunctatus': '970', 'Lutjanus quinquelineatus': '971', 'Platycephalus indicus': '972', 'Rhabdosargus haffara': '973', 'Argyrops filamentosus': '974', 'Brachirus orientalis': '975', 'Mene maculata': '976', 'Hemiramphus marginatus': '977', 'Encrasicholina heteroloba': '978', 'Trachinotus africanus': '979', 'Bramidae': '980', 'Escualosa thoracata': '981', 'Sepia arabica': '982', 'Scatophagus argus': '983', 'Parastromateus niger': '984', 'Planiliza subviridis': '985', 'Labeo rohita': '986', 'Oreochromis niloticus': '987', 'Cardiidae': '988', 'Sargassum angustifolium': '989', 'Pomacea bridgesii': '990', 'Sebastes fasciatus': '991', 'Batoidea': '992', 'Urophycis chuss': '993', 'Dalatias licha': '994', 'Trisopterus luscus': '995', 'Scyliorhinus canicula': '996', 'Ruvettus pretiosus': '997', 'Aphanopus carbo': '998', 'Alepocephalus bairdii': '999', 'Centroscymnus coelolepis': '1000', 'Loligo forbesii': '1001', 'Lutjanus cyanopterus': '1002', 'Mugil liza': '1003', 'Micropogonias furnieri': '1004', 'Balistes capriscus': '1005', 'Haemulidae': '1006', 'Stenotomus caprinus': '1007', 'Hemanthias leptus': '1008', 'Micropogonias undulatus': '1009', 'Cynoscion nebulosus': '1010', 'Rhomboplites aurorubens': '1011', 'Bothidae': 
'1012', 'Pogonias cromis': '1013', 'Lutjanus synagris': '1014', 'Netuma thalassina': '1015', 'Sillaginopsis panijus': '1016', 'Leptomelanosoma indicum': '1017', 'Therapon': '1018', 'Pterotolithus maculatus': '1019', 'Ilisha filigera': '1020', 'Hilsa kelee': '1021', 'Pampus chinensis': '1022', 'Palaemon styliferus': '1023', 'Argyrosomus regius': '1024', 'Lutjanus': '1025', 'Sciades': '1026', 'Mullus': '1027', 'Albula vulpes': '1028', 'Selar crumenophthalmus': '1029', 'Centropomus': '1030', 'Sardinella aurita': '1031', 'Harengula humeralis': '1032', 'Diapterus auratus': '1033', 'Gerres cinereus': '1034', 'Haemulon parra': '1035', 'Ocyurus chrysurus': '1036', 'Sphyraena guachancho': '1037', 'Anoplopoma fimbria': '1038', 'Nerita versicolor': '1039', 'Bulla striata': '1040', 'Melongena melongena': '1041', 'Trachycardium muricatum': '1042', 'Isognomon alatus': '1043', 'Brachidontes exustus': '1044', 'Crassostrea virginica': '1045', 'Protothaca granulata': '1046', 'Cittarium pica': '1047', 'Penaeus schmitti': '1048', 'Penaeus notialis': '1049', 'Callinectes sapidus': '1050', 'Callinectes danae': '1051', 'Dasyatidae': '1052', 'Caridea': '1053', 'Nephropidae': '1054', 'Sparus': '1055', 'Sargassum boveanum': '1056', 'Haliotis tuberculata': '1057', 'Littorinidae': '1058', 'Seaweed': '1059', 'Echinoidea': '1060', 'Ostreida': '1061', 'Donax trunculus': '1062', 'Scrobicularia plana': '1063', 'Venus verrucosa': '1064', 'Solen marginatus': '1065', 'Testudines': '1066', 'Mullidae': '1067', 'Amphipoda': '1068', 'Cystosphaera jacquinotii': '1069', 'Daption capense': '1070', 'Desmarestia anceps': '1071', 'Himantothallus grandifolius': '1072', 'Mirounga': '1073', 'Nacella concinna': '1074', 'Notothenia coriiceps': '1075', 'Pygoscelis antarcticus': '1076', 'Pygoscelis papua': '1077', 'Oncorhynchus gorbuscha': '1078', 'Oncorhynchus mykiss': '1079', 'Oncorhynchus nerka': '1080', 'Oncorhynchus tshawytscha': '1081', 'Erignathus barbatus': '1082', 'Pusa hispida': '1083', 'Hippoglossus 
stenolepis': '1084', 'Squalus suckleyi': '1085', 'Sargassum': '1086', 'Codium': '1087', 'Membranoptera alata': '1088', 'Dictyota dichotoma': '1089', 'Plocamium cartilagineum': '1090', 'Galatea paradoxa': '1091', 'Crassostrea tulipa': '1092', 'Macrobrachium sp': '1093', 'Portunus': '1094', 'Tympanotonos fuscatus': '1095', 'Thais': '1096', 'Bivalvia': '1097', 'Cynoglossus senegalensis': '1098', 'Carlarius heudelotii': '1099', 'Fontitrygon margarita': '1100', 'Chrysichthys nigrodigitatus': '1101', 'Acanthephyra purpurea': '1102', 'Actinauge abyssorum': '1103', 'Alaria marginata': '1104', 'Anadara transversa': '1105', 'Anthomedusae': '1106', 'Archosargus probatocephalus': '1107', 'Argyropelecus aculeatus': '1108', 'Ariopsis felis': '1109', 'Astrometis sertulifera': '1110', 'Astropecten': '1111', 'Atherina breviceps': '1112', 'Atolla': '1113', 'Aulacomya atra': '1114', 'Auxis rochei rochei': '1115', 'Auxis thazard thazard': '1116', 'Avicennia marina': '1117', 'Balaena mysticetus': '1118', 'Balaenoptera acutorostrata': '1119', 'Balanus': '1120', 'Berardius bairdii': '1121', 'Beroe': '1122', 'Boopsoidea inornata': '1123', 'Calanoida': '1124', 'Calanus finmarchicus finmarchicus': '1125', 'Callorhinchus milii': '1126', 'Cepphus columba': '1127', 'Cladonia rangiferina': '1128', 'Clinus superciliosus': '1129', 'Codium tomentosum': '1130', 'Copepoda': '1131', 'Coregonus autumnalis': '1132', 'Coregonus nasus': '1133', 'Coregonus sardinella': '1134', 'Coryphaenoides armatus': '1135', 'Coryphoblennius galerita': '1136', 'Creseis sp': '1137', 'Crinoidea': '1138', 'Crossota': '1139', 'Cryptochiton stelleri': '1140', 'Delphinus delphis': '1141', 'Diacria': '1142', 'Dichistius capensis': '1143', 'Dosinia alta': '1144', 'Dugong dugon': '1145', 'Electrona risso': '1146', 'Engraulis capensis': '1147', 'Ensis siliqua': '1148', 'Eryonidae': '1149', 'Eualaria fistulosa': '1150', 'Eupasiphae gilesii': '1151', 'Euphausiacea': '1152', 'Euphausiidae': '1153', 'Eurypharynx pelecanoides': 
'1154', 'Eurythenes gryllus': '1155', 'Euthynnus lineatus': '1156', 'Fratercula cirrhata': '1157', 'Galeichthys feliceps': '1158', 'Gelidium corneum': '1159', 'Gibbula umbilicalis': '1160', 'Gnathophausia ingens': '1161', 'Gonatus fabricii': '1162', 'Haliaeetus leucocephalus': '1163', 'Haliclona': '1164', 'Halodule uninervis': '1165', 'Hemilepidotus': '1166', 'Hemilepidotus jordani': '1167', 'Heterocarpus ensifer': '1168', 'Heterodontus portusjacksoni': '1169', 'Hippasteria phrygiana': '1170', 'Homola barbata': '1171', 'Hyperoodon planifrons': '1172', 'Hypleurochilus geminatus': '1173', 'Invertebrata': '1174', 'Isognomon bicolor': '1175', 'Isopoda': '1176', 'Kogia breviceps': '1177', 'Labrus bergylta': '1178', 'Lagenorhynchus obliquidens': '1179', 'Lampris guttatus': '1180', 'Larus glaucescens': '1181', 'Leander serratus': '1182', 'Libinia emarginata': '1183', 'Lichia amia': '1184', 'Lipophrys pholis': '1185', 'Lipophrys trigloides': '1186', 'Lithognathus lithognathus': '1187', 'Lithophaga aristata': '1188', 'Lobianchia gemellarii': '1189', 'Loliginidae': '1190', 'Loligo reynaudii': '1191', 'Lophius budegassa': '1192', 'Magallana angulata': '1193', 'Majoidea': '1194', 'Megachasma pelagios': '1195', 'Megaptera novaeangliae': '1196', 'Menippe mercenaria': '1197', 'Mesoplodon carlhubbsi': '1198', 'Mesoplodon stejnegeri': '1199', 'Microstomus pacificus': '1200', 'Morone saxatilis': '1201', 'Mullus surmuletus': '1202', 'Mycteroperca xenarcha': '1203', 'Myliobatis australis': '1204', 'Mysida': '1205', 'Mytilus californianus': '1206', 'Mytilus trossulus': '1207', 'Nephasoma Nephasoma flagriferum': '1208', 'Nudibranchia': '1209', 'Odobenus rosmarus divergens': '1210', 'Ommastrephidae': '1211', 'Ophiomusa lymani': '1212', 'Ophiothrix lineata': '1213', 'Orcinus orca': '1214', 'Ostracoda': '1215', 'Pagellus bogaraveo': '1216', 'Pandalus borealis': '1217', 'Paphies subtriangulata': '1218', 'Parabrotula': '1219', 'Paracalanus': '1220', 'Patella aspera': '1221', 'Periphylla': 
'1222', 'Phocoena phocoena': '1223', 'Phocoenoides dalli': '1224', 'Phronima': '1225', 'Physeter macrocephalus': '1226', 'Pinctada radiata': '1227', 'Plesionika edwardsii': '1228', 'Pododesmus macrochisma': '1229', 'Pomatomus saltatrix': '1230', 'Portunus pelagicus': '1231', 'Praunus': '1232', 'Pyrosoma': '1233', 'Rangifer tarandus': '1234', 'Rhabdosargus globiceps': '1235', 'Saccorhiza polyschides': '1236', 'Sagitta': '1237', 'Salpa': '1238', 'Salvelinus alpinus': '1239', 'Salvelinus malma': '1240', 'Sarda chiliensis': '1241', 'Sargassum aquifolium': '1242', 'Scalibregmatidae': '1243', 'Sebastes alutus': '1244', 'Sebastes melanops': '1245', 'Seriola dorsalis': '1246', 'Serranus scriba': '1247', 'Sigmops bathyphilus': '1248', 'Silicula fragilis': '1249', 'Sipunculidae': '1250', 'Somateria mollissima': '1251', 'Somateria spectabilis': '1252', 'Sparodon durbanensis': '1253', 'Spicara maena': '1254', 'Squatina australis': '1255', 'Striostrea margaritacea': '1256', 'Stromateus fiatola': '1257', 'Strongylocentrotus polyacanthus': '1258', 'Taractichthys steindachneri': '1259', 'Tectura scutum': '1260', 'Tegula viridula': '1261', 'Thais haemastoma': '1262', 'Thegrefg': '1263', 'Themisto': '1264', 'Thunnus tonggol': '1265', 'Trachurus picturatus': '1266', 'Trachurus symmetricus': '1267', 'Trygonorrhina fasciata': '1268', 'Ulva lactuca': '1269', 'Ursus maritimus': '1270', 'Vampyroteuthis infernalis': '1271', 'Ziphius cavirostris': '1272', 'Alepes kleinii': '1273', 'Alepes vari': '1274', 'Decapterus macrosoma': '1275', 'Lutjanus madras': '1276', 'Lutjanus russellii': '1277', 'Rastrelliger brachysoma': '1278', 'Rastrelliger faughni': '1279', 'Selar boops': '1280', 'Selaroides leptolepis': '1281', 'Sphyraena obtusata': '1282', 'Geloina expansa': '1283', 'Caesio erythrogaster': '1284', 'Euristhmus microceps': '1285', 'Pomacanthus annularis': '1286', 'Scylla': '1287', 'Plotosus lineatus': '1288', 'Prionotus stephanophrys': '1289', 'Trachurus murphyi': '1290', 'Dosidicus gigas': 
'1291', 'Sarda chiliensis chiliensis': '1292', 'Cynoscion analis': '1293', 'Merluccius gayi peruanus': '1294', 'Brotula ordwayi': '1295', 'Loligo gahi': '1296', 'Merluccius gayi': '1297', 'Ophichthus remiger': '1298', 'Penaeus sp': '1299', 'Trachinotus paitensis': '1300', 'Cheilopogon heterurus': '1301', 'Engraulis ringens': '1302', 'Sciaena deliciosa': '1303', 'Isacia conceptionis': '1304', 'Odontesthes regia': '1305', 'Bodianus diplotaenia': '1306', 'Concholepas concholepas': '1307', 'Diplectrum conceptione': '1308', 'Genypterus maculatus': '1309', 'Labrisomus philippii': '1310', 'Paralabrax humeralis': '1311', 'Prionotus horrens': '1312', 'Dasyatis akajei': '1313', 'Arctoscopus japonicus': '1314', 'Sepia esculenta': '1315', 'Bothrocara hollandi': '1316', 'Cynoglossidae': '1317', 'Lepidotrigla': '1318', 'Lepidotrigla alata': '1319', 'Octopus sinensis': '1320', 'Rhabdosargus sarba': '1321', 'Lophiidae': '1322', 'Muraenesox': '1323', 'Physiculus maximowiczi': '1324', 'Pleuronectoidei': '1325', 'Sciaenidae': '1326', 'Triglidae': '1327', 'Atherina presbyter': '1328', 'Bentheogennema intermedia': '1329', 'Benthesicymidae': '1330', 'Benthesicymus': '1331', 'Buccinum striatissimum': '1332', 'Callinectes': '1333', 'Cancer pagurus': '1334', 'Chaetognatha': '1335', 'Chama macerophylla': '1336', 'Cirripedia': '1337', 'Cyclosalpa': '1338', 'Cymopolia barbata': '1339', 'Cynoscion': '1340', 'Cystoseira amentacea': '1341', 'Ectocarpus siliculosus': '1342', 'Ellisolandia elongata': '1343', 'Enteromorpha linza': '1344', 'Euphausia superba': '1345', 'Gaidropsarus mediterraneus': '1346', 'Gennadas valens': '1347', 'Globicephala': '1348', 'Haliptilon virgatum': '1349', 'Halocynthia aurantium': '1350', 'Heliocidaris crassispina': '1351', 'Hymenodora gracilis': '1352', 'Lagodon rhomboides': '1353', 'Lepas Anatifa anatifera': '1354', 'Lobophora variegata': '1355', 'Macrocystis pyrifera': '1356', 'Maculabatis gerrardi': '1357', 'Nemacystus decipiens': '1358', 'Neptunea polycostata': 
'1359', 'Padina pavonia': '1360', 'Penaeidae': '1361', 'Petricolinae': '1362', 'Polynemidae': '1363', 'Pristipomoides aquilonaris': '1364', 'Pyropia fallax': '1365', 'Radiolaria': '1366', 'Salpidae': '1367', 'Sardinops melanosticta': '1368', 'Sargassum vulgare': '1369', 'Sciaena umbra': '1370', 'Scorpaena porcus': '1371', 'Sergestidae': '1372', 'Sicyonia brevirostris': '1373', 'Sphaerococcus coronopifolius': '1374', 'Stenella coeruleoalba': '1375', 'Stichopus japonicus': '1376', 'Thalia democratica': '1377', 'Themisto gaudichaudii': '1378', 'Undaria': '1379', 'Analipus japonicus': '1380', 'Sargassum yamadae': '1381', 'Ahnfeltiopsis paradoxa': '1382', 'Scytosiphon lomentaria': '1383', 'Chondria crassicaulis': '1384', 'Grateloupia lanceolata': '1385', 'Colpomenia sinuosa': '1386', 'Chondrus giganteus': '1387', 'Sargassum muticum': '1388', 'Ulva prolifera': '1389', 'Petalonia fascia': '1390', 'Balanus roseus': '1391', 'Chaetomorpha moniligera': '1392', 'Lomentaria hakodatensis': '1393', 'Neodilsea longissima': '1394', 'Polyopes affinis': '1395', 'Schizymenia dubyi': '1396', 'Dictyopteris pacifica': '1397', 'Ahnfeltiopsis flabelliformis': '1398', 'Bangia fuscopurpurea': '1399', 'Calliarthron': '1400', 'Cladophora': '1401', 'Cladophora albida': '1402', 'Dasya sessilis': '1403', 'Delesseria serrulata': '1404', 'Ecklonia cava': '1405', 'Gelidium elegans': '1406', 'Grateloupia turuturu': '1407', 'Hypnea asiatica': '1408', 'Mazzaella japonica': '1409', 'Pachydictyon coriaceum': '1410', 'Padina arborescens': '1411', 'Pterosiphonia pinnulata': '1412', 'Alatocladia yessoensis': '1413', 'Bryopsis plumosa': '1414', 'Ceramium kondoi': '1415', 'Chondracanthus intermedius': '1416', 'Codium contractum': '1417', 'Codium lucasii': '1418', 'Corallina pilulifera': '1419', 'Dictyopteris undulata': '1420', 'Gastroclonium pacificum': '1421', 'Gelidium amansii': '1422', 'Grateloupia sparsa': '1423', 'Laurencia okamurae': '1424', 'Leathesia marina': '1425', 'Lomentaria catenata': '1426', 
'Meristotheca papulosa': '1427', 'Sargassum confusum': '1428', 'Sargassum siliquastrum': '1429', 'Tinocladia crassa': '1430', 'Saccharina yendoana': '1431', 'Thalassiophyllum clathrus': '1432', 'Mytilida': '1433', 'Pteriomorphia': '1434', 'Conger': '1435', 'Scyliorhinidae': '1436', 'Labrus': '1437', 'Algae': '1438', 'Necora puber': '1439', 'Anguilla': '1440', 'Rajidae': '1441', 'Buccinidae': '1442', 'Crustacea': '1443', 'Green algae': '1444', 'Ammodytes japonicus': '1445', 'Evynnis tumifrons': '1446', 'Gnathophis nystromi nystromi': '1447', 'Loligo bleekeri': '1448', 'Platichthys bicoloratus': '1449', 'Limanda punctatissima': '1450', 'Loliolus Nipponololigo japonica': '1451', 'Acanthopagrus schlegelii schlegelii': '1452', 'Sepiolina': '1453', 'Gelidium': '1454', 'Atrina pectinata': '1455', 'Echinocardium cordatum': '1456', 'Lamnidae': '1457', 'Meretrix lamarckii': '1458', 'Noctiluca scintillans': '1459', 'Philine argentata': '1460', 'Sergestes lucens': '1461', 'Corbicula sandai': '1462', 'Ulva': '1463', 'Actiniaria': '1464', 'Ctenopharyngodon idella': '1465', 'Ophiuroidea': '1466', 'Scomberoides lysan': '1467', 'Scomberoides tol': '1468', 'Sebastolobus': '1469', 'Selachimorpha': '1470', 'Selene setapinnis': '1471', 'Selene vomer': '1472', 'Sepia elliptica': '1473', 'Sergestes sp': '1474', 'Setipinna taty': '1475', 'Siganus canaliculatus': '1476', 'Sigmops gracile': '1477', 'Solenocera sp': '1478', 'Sparidae': '1479', 'Spermatophytina': '1480', 'Sphoeroides testudineus': '1481', 'Sphyraena jello': '1482', 'Spyridia hypnoides': '1483', 'Squaliformes': '1484', 'Squillidae': '1485', 'Stegophiura sladeni': '1486', 'Stenella longirostris': '1487', 'Stenobrachius leucopsarus': '1488', 'Sternaspidae': '1489', 'Stoechospermum polypodioides': '1490', 'Stolephorus commersonnii': '1491', 'Stromateus cinereus': '1492', 'Stromateus niger': '1493', 'Stromateus sinensis': '1494', 'Synidotea': '1495', 'Takifugu vermicularis': '1496', 'Telatrygon zugei': '1497', 'Terapon jarbua': 
'1498', 'Terebellidae': '1499', 'Thryssa dussumieri': '1500', 'Thunnini': '1501', 'Tibia curta': '1502', 'Tonna dolium': '1503', 'Trachinus draco': '1504', 'Trematomus bernacchii': '1505', 'Tridacna': '1506', 'Trinectes paulistanus': '1507', 'Trochus radiatus': '1508', 'Turbinaria': '1509', 'Tursiops truncatus': '1510', 'Ucides': '1511', 'Ulva compressa': '1512', 'Ulva fasciata': '1513', 'Ulva flexuosa': '1514', 'Ulva rigida': '1515', 'Upeneus taeniopterus': '1516', 'Upogebiidae': '1517', 'Uroteuthis Photololigo edulis': '1518', 'Valoniopsis pachynema': '1519', 'Veneridae': '1520', 'Venus foveolata': '1521', 'Vertebrata': '1522', 'Volutharpa ampullacea perryi': '1523', 'Zannichellia palustris': '1524', 'Zeus japonicus': '1525', 'Favites': '1526', 'Gadiformes': '1527', 'Gafrarium dispar': '1528', 'Galaxaura frutescens': '1529', 'Gelidium crinale': '1530', 'Genidens genidens': '1531', 'Girella elevata': '1532', 'Girella tricuspidata': '1533', 'Dentex hypselosomus': '1534', 'Saurida elongata': '1535', 'Pseudolabrus eoethinus': '1536', 'Atrobucca nibe': '1537', 'Diagramma pictum': '1538', 'Sepia lycidas': '1539', 'Plectorhinchus cinctus': '1540', 'Metapenaeopsis acclivis': '1541', 'Metapenaeopsis barbata': '1542', 'Nibea albiflora': '1543', 'Girella leonina': '1544', 'Sphyraenidae': '1545', 'Parapercis pulchella': '1546', 'Parapercis sexfasciata': '1547', 'Thysanoteuthis rhombus': '1548', 'Lepidotrigla kishinouyi': '1549', 'Cystoseira': '1550', 'Padina': '1551', 'Halimeda': '1552', 'Pacifastacus leniusculus': '1553', 'Salmo trutta': '1554', 'Chondrus crispus': '1555', 'Ictalurus punctatus': '1556', 'Acanthurus': '1557', 'Scombridae': '1558', 'Leukoma staminea': '1559', 'Trochidae': '1560', 'Protonibea': '1562', 'Anchoa compressa': '1563', 'Ensis magnus': '1564', 'Bolinus brandaris': '1565', 'Lutjanus notatus': '1566', 'Lethrinus olivaceus': '1567', 'Carassius auratus': '1569', 'Mugil': '1570', 'Gobius': '1571', 'Lajonkairia lajonkairii': '1572', 'Chrysophrys auratus': 
'1573', 'Galeorhinus australis': '1574', 'Nototodarus sloanii gouldi': '1575', 'Tylosurus crocodilus': '1576', 'Acanthogobius hasta': '1577', 'Penaeus chinensis': '1578', 'Ruditapes variegatus': '1579', 'Marcia marmorata': '1580', 'Rachycentron': '1581', 'Scomber kanagurta': '1582', 'Arius': '1583', 'Panulirus versicolor': '1584', 'Tilapia zillii': '1585', 'Schizoporella errata': '1586', 'Phallusia nigra': '1587', 'Physeter catodon': '1588', 'Salmo trutta trutta': '1589', 'Tachysurus thalassinus': '1590', 'Sillago domina': '1591', 'Otolithus argenteus': '1592', 'Trichiurus haumela': '1593', 'Otolithes maculata': '1594', 'Hilsa kanagurta': '1595', 'Oreochromis mossambicus': '1596', 'Siluriformes': '1597', 'Theodoxus euxinus': '1598', 'Formio niger': '1599', 'Rastrelliger': '1600', 'Nephasoma flagriferum': '1601', 'Ophiomusium lymani': '1602', 'Nematonurus armatus': '1603', 'Thalamitoides spinigera': '1604', 'Capros aper': '1605', 'Gadiculus argenteus thori': '1606', 'Phorcus lineatus': '1607', 'Penaeus vannamei': '1608', 'Raja montagui': '1609', 'Scophthalmus rhombus': '1610', 'Crambe maritima': '1611', 'Fucus ceranoides': '1612', 'Maja squinado': '1613', 'Salicornia europaea': '1614', 'Aequipecten opercularis': '1615', 'Galathea squamifera': '1616', 'Cynoglossus semilaevis': '1617', 'Loliolus beka': '1619', 'Octopus variabilis': '1620', 'Abudefduf sexfasciatus': '1621', 'Acanthurus blochii': '1622', 'Achillea millefolium': '1623', 'Alaria crassifolia': '1624', 'Albulidae': '1625', 'Ammodytes': '1626', 'Anadara satowi': '1627', 'Argyrosomus japonicus': '1628', 'Ascidiacea': '1629', 'Aulopiformes': '1630', 'Babylonia japonica': '1631', 'Babylonia kirana': '1632', 'Bathylagidae': '1633', 'Beryx decadactylus': '1634', 'Branchiostegus': '1635', 'Buccinum': '1636', 'Caesio lunaris': '1637', 'Callionymus curvicornis': '1638', 'Campylaephora hypnaeoides': '1639', 'Cetoscarus ocellatus': '1640', 'Charonia tritonis': '1641', 'Chelon haematocheilus': '1642', 'Chlorurus 
sordidus': '1643', 'Choerodon azurio': '1644', 'Chromis notata': '1645', 'Cladosiphon okamuranus': '1646', 'Cociella punctata': '1647', 'Coryphaena': '1648', 'Cyclina sinensis': '1649', 'Cymbacephalus beauforti': '1650', 'Dendrobranchiata': '1651', 'Digenea simplex': '1652', 'Ditrema viride': '1653', 'Enteromorpha prolifera': '1654', 'Epinephelus': '1655', 'Epinephelus akaara': '1656', 'Epinephelus awoara': '1657', 'Etelis carbunculus': '1658', 'Fistularia commersonii': '1659', 'Fulvia mutica': '1660', 'Fusinus colus': '1661', 'Gafrarium tumidum': '1662', 'Gelidiaceae': '1663', 'Girella cyanea': '1664', 'Girella mezina': '1665', 'Goniistius zonatus': '1666', 'Gracilaria': '1667', 'Gymnocranius euanus': '1668', 'Heikeopsis japonica': '1669', 'Hemitrygon': '1670', 'Hippoglossoides pinetorum': '1671', 'Holothuria atra': '1672', 'Holothuria leucospilota': '1673', 'Idiosepiidae': '1674', 'Inegocia japonica': '1675', 'Inimicus didactylus': '1676', 'Ishige': '1677', 'Lagocephalus spadiceus': '1678', 'Lambis truncata': '1679', 'Leiognathus equula': '1680', 'Lethrinus xanthochilus': '1681', 'Lutjanus erythropterus': '1682', 'Lutjanus semicinctus': '1683', 'Monodonta labio': '1684', 'Monostroma kuroshiense': '1685', 'Mulloidichthys flavolineatus': '1686', 'Mulloidichthys vanicolensis': '1687', 'Muraenesocidae': '1688', 'Myagropsis myagroides': '1689', 'Mytilisepta virgata': '1690', 'Naso brevirostris': '1691', 'Nematalosa japonica': '1692', 'Nemipterus virgatus': '1693', 'Nipponacmea': '1694', 'Nuchequula nuchalis': '1695', 'Octopus cyanea': '1696', 'Panopea generosa': '1697', 'Paralichthys': '1698', 'Paralithodes camtschaticus': '1699', 'Parascolopsis inermis': '1700', 'Pectinidae': '1701', 'Pentapodus aureofasciatus': '1702', 'Pinctada fucata': '1703', 'Pitar citrinus': '1704', 'Platycephalidae': '1705', 'Plecoglossus altivelis': '1706', 'Pleuronectes herzensteini': '1707', 'Priacanthus macracanthus': '1708', 'Pristipomoides': '1709', 'Psenopsis anomala': '1710', 
'Pseudobalistes fuscus': '1711', 'Pseudocaranx dentex': '1712', 'Pseudolabrus sieboldi': '1713', 'Pseudorhombus arsius': '1714', 'Pterocaesio chrysozona': '1715', 'Rhynchopelates oxyrhynchus': '1716', 'Ryukyupercis gushikeni': '1717', 'Saccostrea echinata': '1718', 'Sargassum hemiphyllum': '1719', 'Sargassum piluliferum': '1720', 'Saurida micropectoralis': '1721', 'Saurida undosquamis': '1722', 'Saurida wanieso': '1723', 'Scarus forsteni': '1724', 'Scarus ghobban': '1725', 'Scarus ovifrons': '1726', 'Scarus rubroviolaceus': '1727', 'Scyphozoa': '1728', 'Sebastes iracundus': '1729', 'Semicossyphus reticulatus': '1730', 'Sepia latimanus': '1731', 'Siganus guttatus': '1732', 'Siganus luridus': '1733', 'Sphaerotrichia divaricata': '1734', 'Sphyrnidae': '1735', 'Spondylus regius': '1736', 'Spratelloides gracilis': '1737', 'Sthenoteuthis oualaniensis': '1738', 'Tetraodontidae': '1739', 'Trichiurus lepturus japonicus': '1740', 'Tridacna crocea': '1741', 'Turbo argyrostomus': '1742', 'Tylosurus pacificus': '1743', 'Ulvophyceae': '1744', 'Upeneus japonicus': '1745', 'Upeneus moluccensis': '1746', 'Uranoscopus japonicus': '1747', 'Anguilliformes': '1748', 'Crithmum maritimum': '1749', 'Littorina': '1750', 'Nucella lapillus': '1752', 'Scyliorhinus stellaris': '1753', 'Annelida': '1754', 'Aphrodita aculeata': '1755', 'Callionymus lyra': '1756', 'Urticina felina': '1757', 'Gebiidea': '1758', 'Bonellia viridis': '1759', 'Alcyonium glomeratum': '1760'}, 'body_part': {'Not applicable': '-1', 'Not available': '0', 'Whole animal': '1', 'Whole animal eviscerated': '2', 'Whole animal eviscerated without head': '3', 'Flesh with bones': '4', 'Blood': '5', 'Skeleton': '6', 'Bones': '7', 'Exoskeleton': '8', 'Endoskeleton': '9', 'Shells': '10', 'Molt': '11', 'Skin': '12', 'Head': '13', 'Tooth': '14', 'Otolith': '15', 'Fins': '16', 'Faecal pellet': '17', 'Byssus': '18', 'Soft parts': '19', 'Viscera': '20', 'Stomach': '21', 'Hepatopancreas': '22', 'Digestive gland': '23', 'Pyloric caeca': 
'24', 'Liver': '25', 'Intestine': '26', 'Kidney': '27', 'Spleen': '28', 'Brain': '29', 'Eye': '30', 'Fat': '31', 'Heart': '32', 'Branchial heart': '33', 'Muscle': '34', 'Mantle': '35', 'Gills': '36', 'Gonad': '37', 'Ovary': '38', 'Testes': '39', 'Whole plant': '40', 'Flower': '41', 'Leaf': '42', 'Old leaf': '43', 'Young leaf': '44', 'Leaf upper part': '45', 'Leaf lower part': '46', 'Scales': '47', 'Root rhizome': '48', 'Whole macro alga': '49', 'Phytoplankton': '50', 'Thallus': '51', 'Flesh without bones': '52', 'Stomach and intestine': '53', 'Whole haptophytic plants': '54', 'Loose drifting plants': '55', 'Growing tips': '56', 'Upper parts of plants': '57', 'Lower parts of plants': '58', 'Shells carapace': '59', 'Flesh with scales': '60'}}, 'SEAWATER': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 
'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': '132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}, 'filt': {'Not applicable': '-1', 'Not available': '0', 'Yes': '1', 'No': '2'}}, 'SEDIMENT': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 
'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', 'pu240': '69', 'pu241': '70', 'am240': '71', 'am241': '72', 'cm242': '73', 'cm243': '74', 'cm244': '75', 'cs134_137_tot': '76', 'pu239_240_tot': '77', 'pu239_240_iii_iv_tot': '78', 'pu239_240_v_vi_tot': '79', 'cm243_244_tot': '80', 'pu238_pu239_240_tot_ratio': '81', 'am241_pu239_240_tot_ratio': '82', 'cs137_134_ratio': '83', 'cd109': '84', 'eu152': '85', 'fe59': '86', 'gd153': '87', 'ir192': '88', 'pu238_240_tot': '89', 'rb86': '90', 'sc46': '91', 'sn113': '92', 'sn117m': '93', 'tl208': '94', 'mo99': '95', 'tc99m': '96', 'ru105': '97', 'te129': '98', 'te132': '99', 'i132': '100', 'i135': '101', 'cs136': '102', 'tbeta': '103', 'talpha': '104', 'i133': '105', 'th230': '106', 'pa231': '107', 'u236': '108', 'ag111': '109', 'in116m': '110', 'te123m': '111', 'sb127': '112', 'ba133': '113', 'ce139': '114', 'tl201': '116', 'hg203': '117', 'na22': '122', 'pa234m': '123', 'am243': '124', 'se75': '126', 'sr85': '127', 'y88': '128', 'ce140': '129', 'bi212': '130', 'u236_238_ratio': '131', 'i125': '132', 'ba137m': '133', 'u232': '134', 'pa233': '135', 'ru106_rh106_tot': '136', 'tu': '137', 'tbeta40k': '138', 'fe55': '139', 'ce144_pr144_tot': '140', 'pu240_pu239_ratio': '141', 'u233': '142', 'pu239_242_tot': '143', 'ac227': '144'}, 'unit': {'Not applicable': '-1', 'NOT AVAILABLE': '0', 'Bq per m3': '1', 'Bq per m2': '2', 'Bq per kg': '3', 'Bq per kgd': '4', 'Bq per kgw': '5', 'kg per kg': '6', 'TU': '7', 'DELTA per mill': '8', 'atom per kg': '9', 'atom 
per kgd': '10', 'atom per kgw': '11', 'atom per l': '12', 'Bq per kgC': '13'}, 'dl': {'Not applicable': '-1', 'Not available': '0', 'Detected value': '1', 'Detection limit': '2', 'Not detected': '3', 'Derived': '4'}, 'sed_type': {'Not applicable': '-1', 'Not available': '0', 'Clay': '1', 'Gravel': '2', 'Marsh': '3', 'Mud': '4', 'Muddy sand': '5', 'Sand': '6', 'Fine sand': '7', 'Sandy mud': '8', 'Pebby sand': '9', 'Silt and clay': '10', 'Silt and gravel': '11', 'Silt': '12', 'Silty sand': '13', 'Sludge': '14', 'Turf': '15', 'Very coarse sand': '16', 'Coarse sand': '17', 'Medium sand': '18', 'Very fine sand': '19', 'Coarse silt': '20', 'Medium silt': '21', 'Fine silt': '22', 'Very fine silt': '23', 'Calcareous': '24', 'Glacial': '25', 'Soft': '26', 'Sulphidic': '27', 'Fe Mg concretions': '28', 'Sand and gravel': '29', 'Pure sand': '30', 'Sand and fine sand': '31', 'Sand and clay': '32', 'Sand and mud': '33', 'Fine sand and gravel': '34', 'Fine sand and sand': '35', 'Pure fine sand': '36', 'Fine sand and silt': '37', 'Fine sand and clay': '38', 'Fine sand and mud': '39', 'Silt and sand': '40', 'Silt and fine sand': '41', 'Pure silt': '42', 'Silt and mud': '43', 'Clay and gravel': '44', 'Clay and sand': '45', 'Clay and fine sand': '46', 'Pure clay': '47', 'Clay and silt': '48', 'Clay and mud': '49', 'Glacial clay': '50', 'Soft clay': '51', 'Sulphidic clay': '52', 'Clay and Fe Mg concretions': '53', 'Mud and gravel': '54', 'Mud and sand': '55', 'Mud and fine sand': '56', 'Mud and clay': '57', 'Pure mud': '58', 'Soft mud': '59', 'Sulphidic mud': '60', 'Mud and Fe Mg concretions': '61', 'Sand and silt': '62'}}}
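The enumeration dictionary printed above maps human-readable labels to string codes, per group and per variable. To read the encoded tables that follow, it can help to invert such a mapping. The sketch below copies a small subset of the `SEAWATER` entries; the helper names `invert_enum` and `decode_row` are hypothetical (not part of marisco), and it assumes the NetCDF column names are the upper-cased enumeration keys (`NUCLIDE`, `UNIT`, `DL`).

```python
# Subset of the SEAWATER enumerations printed above (label -> code, codes are strings).
seawater_enums = {
    'nuclide': {'h3': '1', 'k40': '4', 'sr90': '12', 'cs137': '33'},
    'unit': {'Bq per m3': '1', 'Bq per kg': '3'},
    'dl': {'Not available': '0', 'Detected value': '1', 'Detection limit': '2'},
}

def invert_enum(enum):
    """Return a code -> label mapping, with codes converted to int."""
    return {int(code): label for label, code in enum.items()}

def decode_row(row, enums):
    """Replace integer codes by labels for every column that has an enumeration.

    Assumption: column names are the upper-cased enumeration keys.
    Unknown codes and columns without an enumeration are left untouched.
    """
    decoders = {name.upper(): invert_enum(enum) for name, enum in enums.items()}
    return {
        col: decoders[col].get(val, val) if col in decoders else val
        for col, val in row.items()
    }

# Values taken from the first SEAWATER row shown in this notebook.
row = {'NUCLIDE': 33, 'VALUE': 5.3, 'UNIT': 1, 'DL': 1}
decoded = decode_row(row, seawater_enums)
print(decoded)
```

Codes missing from the subset are passed through unchanged, so the helper can be tried on partial lookups without raising errors.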
Let's review the data in the NetCDF file:
{'BIOTA': LON LAT SMP_DEPTH TIME NUCLIDE VALUE UNIT \
0 12.316667 54.283333 NaN 1348358400 31 0.010140 5
1 12.316667 54.283333 NaN 1348358400 4 135.300003 5
2 12.316667 54.283333 NaN 1348358400 9 0.013980 5
3 12.316667 54.283333 NaN 1348358400 33 4.338000 5
4 12.316667 54.283333 NaN 1348358400 31 0.009614 5
... ... ... ... ... ... ... ...
16089 21.395000 61.241501 2.0 1652140800 33 13.700000 4
16090 21.395000 61.241501 2.0 1652140800 9 0.500000 4
16091 21.385000 61.343334 NaN 1663200000 4 50.700001 4
16092 21.385000 61.343334 NaN 1663200000 33 0.880000 4
16093 21.385000 61.343334 NaN 1663200000 12 6.600000 4
UNC DL BIO_GROUP SPECIES BODY_PART DRYWT WETWT \
0 NaN 2 4 99 52 174.934433 948.0
1 4.830210 1 4 99 52 174.934433 948.0
2 NaN 2 4 99 52 174.934433 948.0
3 0.150962 1 4 99 52 174.934433 948.0
4 NaN 2 4 99 52 177.935120 964.0
... ... .. ... ... ... ... ...
16089 0.520600 1 11 96 55 NaN NaN
16090 0.045500 1 11 96 55 NaN NaN
16091 4.106700 1 14 129 1 NaN NaN
16092 0.140800 1 14 129 1 NaN NaN
16093 0.349800 1 14 129 1 NaN NaN
PERCENTWT
0 0.18453
1 0.18453
2 0.18453
3 0.18453
4 0.18458
... ...
16089 NaN
16090 NaN
16091 NaN
16092 NaN
16093 NaN
[16094 rows x 15 columns],
'SEAWATER': LON LAT SMP_DEPTH TOT_DEPTH TIME NUCLIDE \
0 29.333300 60.083302 0.0 NaN 1337731200 33
1 29.333300 60.083302 29.0 NaN 1337731200 33
2 23.150000 59.433300 0.0 NaN 1339891200 33
3 27.983299 60.250000 0.0 NaN 1337817600 33
4 27.983299 60.250000 39.0 NaN 1337817600 33
... ... ... ... ... ... ...
21468 13.499833 54.600334 0.0 47.0 1686441600 1
21469 13.499833 54.600334 45.0 47.0 1686441600 1
21470 14.200833 54.600334 0.0 11.0 1686614400 1
21471 14.665500 54.600334 0.0 20.0 1686614400 1
21472 14.330000 54.600334 0.0 17.0 1686614400 1
VALUE UNIT UNC DL FILT
0 5.300000 1 1.696000 1 0
1 19.900000 1 3.980000 1 0
2 25.500000 1 5.100000 1 0
3 17.000000 1 4.930000 1 0
4 22.200001 1 3.996000 1 0
... ... ... ... .. ...
21468 702.838074 1 51.276207 1 0
21469 725.855713 1 52.686260 1 0
21470 648.992920 1 48.154419 1 0
21471 627.178406 1 46.245316 1 0
21472 605.715088 1 45.691143 1 0
[21473 rows x 11 columns],
'SEDIMENT': LON LAT TOT_DEPTH TIME NUCLIDE VALUE \
0 27.799999 60.466667 25.0 1337904000 33 1200.000000
1 27.799999 60.466667 25.0 1337904000 33 250.000000
2 27.799999 60.466667 25.0 1337904000 33 140.000000
3 27.799999 60.466667 25.0 1337904000 33 79.000000
4 27.799999 60.466667 25.0 1337904000 33 29.000000
... ... ... ... ... ... ...
70444 15.537800 54.617832 62.0 1654646400 67 0.044000
70445 15.537800 54.617832 62.0 1654646400 77 2.500000
70446 15.537800 54.617832 62.0 1654646400 4 5873.000000
70447 15.537800 54.617832 62.0 1654646400 33 21.200001
70448 15.537800 54.617832 62.0 1654646400 77 0.370000
UNIT UNC DL SED_TYPE TOP BOTTOM PERCENTWT
0 4 240.000000 1 0 15.0 20.0 NaN
1 4 50.000000 1 0 20.0 25.0 NaN
2 4 29.400000 1 0 25.0 30.0 NaN
3 4 15.800000 1 0 30.0 35.0 NaN
4 4 6.960000 1 0 35.0 40.0 NaN
... ... ... .. ... ... ... ...
70444 4 0.015312 1 10 15.0 17.0 0.257642
70445 4 0.185000 1 10 15.0 17.0 0.257642
70446 4 164.444000 1 10 17.0 19.0 0.263965
70447 4 2.162400 1 10 17.0 19.0 0.263965
70448 4 0.048100 1 10 17.0 19.0 0.263965
[70449 rows x 13 columns]}
Let's review the biota data:
| | LON | LAT | SMP_DEPTH | TIME | NUCLIDE | VALUE | UNIT | UNC | DL | BIO_GROUP | SPECIES | BODY_PART | DRYWT | WETWT | PERCENTWT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12.316667 | 54.283333 | NaN | 1348358400 | 31 | 0.010140 | 5 | NaN | 2 | 4 | 99 | 52 | 174.934433 | 948.0 | 0.18453 |
1 | 12.316667 | 54.283333 | NaN | 1348358400 | 4 | 135.300003 | 5 | 4.830210 | 1 | 4 | 99 | 52 | 174.934433 | 948.0 | 0.18453 |
2 | 12.316667 | 54.283333 | NaN | 1348358400 | 9 | 0.013980 | 5 | NaN | 2 | 4 | 99 | 52 | 174.934433 | 948.0 | 0.18453 |
3 | 12.316667 | 54.283333 | NaN | 1348358400 | 33 | 4.338000 | 5 | 0.150962 | 1 | 4 | 99 | 52 | 174.934433 | 948.0 | 0.18453 |
4 | 12.316667 | 54.283333 | NaN | 1348358400 | 31 | 0.009614 | 5 | NaN | 2 | 4 | 99 | 52 | 177.935120 | 964.0 | 0.18458 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16089 | 21.395000 | 61.241501 | 2.0 | 1652140800 | 33 | 13.700000 | 4 | 0.520600 | 1 | 11 | 96 | 55 | NaN | NaN | NaN |
16090 | 21.395000 | 61.241501 | 2.0 | 1652140800 | 9 | 0.500000 | 4 | 0.045500 | 1 | 11 | 96 | 55 | NaN | NaN | NaN |
16091 | 21.385000 | 61.343334 | NaN | 1663200000 | 4 | 50.700001 | 4 | 4.106700 | 1 | 14 | 129 | 1 | NaN | NaN | NaN |
16092 | 21.385000 | 61.343334 | NaN | 1663200000 | 33 | 0.880000 | 4 | 0.140800 | 1 | 14 | 129 | 1 | NaN | NaN | NaN |
16093 | 21.385000 | 61.343334 | NaN | 1663200000 | 12 | 6.600000 | 4 | 0.349800 | 1 | 14 | 129 | 1 | NaN | NaN | NaN |
16094 rows × 15 columns
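The `TIME` column holds large integers rather than dates. A minimal sketch for reading them, assuming they are seconds elapsed since 1970-01-01 UTC (this unit is an assumption here; the file's own `TIME` attributes are the authoritative source). The helper name `epoch_to_iso` is hypothetical.

```python
from datetime import datetime, timezone

def epoch_to_iso(seconds):
    """Convert seconds since the Unix epoch (assumed unit) to a UTC date string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%d')

# TIME values taken from the first and last biota rows shown above.
for t in (1348358400, 1652140800):
    print(t, epoch_to_iso(t))
```

Passing an explicit `tz=timezone.utc` keeps the result independent of the local timezone of the machine running the code.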
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    SplitSedimentValuesCB(coi_sediment),
    SanitizeValueCB(coi_val),
    NormalizeUncCB(),
    RemapUnitCB(),
    RemapDetectionLimitCB(coi_dl, lut_dl),
    CompareDfsAndTfmCB(dfs)
])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats), '\n')
for grp in ['BIOTA', 'SEDIMENT', 'SEAWATER']:
    print(f'Unique DL values for {grp}: {tfm.dfs[grp]["DL"].unique()}')
Warning: 30 missing value(s) in value_bq/kg for group BIOTA.
Warning: 153 missing value(s) in value_bq/m³ for group SEAWATER.
Warning: 246 missing value(s) in _VALUE for group SEDIMENT.
                           BIOTA  SEAWATER  SEDIMENT
Number of rows in dfs      16124     21634     40744
Number of rows in tfm.dfs  16094     21481     70451
Number of rows removed        30       153       144
Unique DL values for BIOTA: [2 1 0]
Unique DL values for SEDIMENT: [1 2 0]
Unique DL values for SEAWATER: [1 2 0]
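The `DL` codes printed above (0, 1, 2) match the `dl` enumeration: `Not available`, `Detected value` and `Detection limit`. The sketch below illustrates one plausible way to derive such a code from a raw flag column and a value column. It is not the `RemapDetectionLimitCB` implementation: the function name `encode_dl` is hypothetical, and the rule that a `<` flag marks a detection limit is an assumption.

```python
# Codes taken from the 'dl' enumeration shown earlier in this notebook.
DL_NOT_AVAILABLE = 0
DL_DETECTED_VALUE = 1
DL_DETECTION_LIMIT = 2

def is_missing(x):
    """True for None and for float NaN (NaN is the only value not equal to itself)."""
    return x is None or x != x

def encode_dl(flag, value):
    """Hypothetical rule: '<' means detection limit; no flag plus a value means detected."""
    if flag == '<':
        return DL_DETECTION_LIMIT
    if is_missing(flag) and not is_missing(value):
        return DL_DETECTED_VALUE
    # Unknown flag, or no usable value.
    return DL_NOT_AVAILABLE

# Illustrative (flag, value) pairs, not rows from the HELCOM dataset.
samples = [('<', 0.01), (None, 4.3), (float('nan'), 135.3), (None, None), ('?', 1.0)]
codes = [encode_dl(flag, value) for flag, value in samples]
print(codes)
```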
Let's review the sediment data:
| | key | nuclide | method | < value_bq/kg | value_bq/kg | error%_kg | < value_bq/m² | value_bq/m² | error%_m² | date_of_entry_x | ... | lowsli | area | sedi | oxic | dw% | loi% | mors_subbasin | helcom_subbasin | sum_link | date_of_entry_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SKRIL2012116 | CS137 | NaN | NaN | 1200.000 | 20.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | 20.0 | 0.00600 | NaN | NaN | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 |
1 | SKRIL2012117 | CS137 | NaN | NaN | 250.000 | 20.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | 25.0 | 0.00600 | NaN | NaN | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 |
2 | SKRIL2012118 | CS137 | NaN | NaN | 140.000 | 21.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | 30.0 | 0.00600 | NaN | NaN | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 |
3 | SKRIL2012119 | CS137 | NaN | NaN | 79.000 | 20.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | 35.0 | 0.00600 | NaN | NaN | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 |
4 | SKRIL2012120 | CS137 | NaN | NaN | 29.000 | 24.0 | NaN | NaN | NaN | 08/20/14 00:00:00 | ... | 40.0 | 0.00600 | NaN | NaN | NaN | NaN | 11.0 | 11.0 | NaN | 08/20/14 00:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40739 | SCLOR2022071 | PU238 | CLOR08 | NaN | 0.007 | 32.9 | NaN | 0.044 | 34.8 | 05/03/24 00:00:00 | ... | 17.0 | 0.01178 | 34.0 | A | 25.764192 | NaN | 6.0 | 6.0 | NaN | 05/03/24 00:00:00 |
40740 | SCLOR2022071 | PU239240 | CLOR08 | NaN | 0.420 | 4.6 | NaN | 2.500 | 7.4 | 05/03/24 00:00:00 | ... | 17.0 | 0.01178 | 34.0 | A | 25.764192 | NaN | 6.0 | 6.0 | NaN | 05/03/24 00:00:00 |
40741 | SCLOR2022072 | K40 | CLOR01 | NaN | 956.000 | 1.3 | NaN | 5873.000 | 2.8 | 05/03/24 00:00:00 | ... | 19.0 | 0.01178 | 34.0 | A | 26.396495 | NaN | 6.0 | 6.0 | NaN | 05/03/24 00:00:00 |
40742 | SCLOR2022072 | CS137 | CLOR01 | NaN | 3.460 | 9.9 | NaN | 21.200 | 10.2 | 05/03/24 00:00:00 | ... | 19.0 | 0.01178 | 34.0 | A | 26.396495 | NaN | 6.0 | 6.0 | NaN | 05/03/24 00:00:00 |
40743 | SCLOR2022072 | PU239240 | CLOR08 | NaN | 0.060 | 10.4 | NaN | 0.370 | 13.0 | 05/03/24 00:00:00 | ... | 19.0 | 0.01178 | 34.0 | A | 26.396495 | NaN | 6.0 | 6.0 | NaN | 05/03/24 00:00:00 |
40744 rows × 35 columns
Let's review the seawater data:
| | key | nuclide | method | < value_bq/m³ | value_bq/m³ | error%_m³ | date_of_entry_x | country | laboratory | sequence | ... | longitude (ddmmmm) | longitude (dddddd) | tdepth | sdepth | salin | ttemp | filt | mors_subbasin | helcom_subbasin | date_of_entry_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | WKRIL2012003 | CS137 | NaN | NaN | 5.300000 | 32.000000 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012003.0 | ... | 29.2000 | 29.333300 | NaN | 0.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
1 | WKRIL2012004 | CS137 | NaN | NaN | 19.900000 | 20.000000 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012004.0 | ... | 29.2000 | 29.333300 | NaN | 29.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
2 | WKRIL2012005 | CS137 | NaN | NaN | 25.500000 | 20.000000 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012005.0 | ... | 23.0900 | 23.150000 | NaN | 0.0 | NaN | NaN | NaN | 11.0 | 3.0 | 08/20/14 00:00:00 |
3 | WKRIL2012006 | CS137 | NaN | NaN | 17.000000 | 29.000000 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012006.0 | ... | 27.5900 | 27.983300 | NaN | 0.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
4 | WKRIL2012007 | CS137 | NaN | NaN | 22.200000 | 18.000000 | 08/20/14 00:00:00 | 90.0 | KRIL | 2012007.0 | ... | 27.5900 | 27.983300 | NaN | 39.0 | NaN | NaN | NaN | 11.0 | 11.0 | 08/20/14 00:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
21629 | WDHIG2023112 | H3 | DHIG04 | NaN | 702.838089 | 7.295593 | 05/03/24 00:00:00 | 6.0 | DHIG | 2023112.0 | ... | 13.2999 | 13.499833 | 47.0 | 0.0 | 7.89 | NaN | NaN | 2.0 | 2.0 | 05/03/24 00:00:00 |
21630 | WDHIG2023113 | H3 | DHIG04 | NaN | 725.855727 | 7.258503 | 05/03/24 00:00:00 | 6.0 | DHIG | 2023113.0 | ... | 13.2999 | 13.499833 | 47.0 | 45.0 | 14.80 | NaN | NaN | 2.0 | 2.0 | 05/03/24 00:00:00 |
21631 | WDHIG2023143 | H3 | DHIG04 | NaN | 648.992944 | 7.419868 | 05/03/24 00:00:00 | 6.0 | DHIG | 2023143.0 | ... | 14.1205 | 14.200833 | 11.0 | 0.0 | 5.70 | NaN | NaN | 2.0 | 6.0 | 05/03/24 00:00:00 |
21632 | WDHIG2023145 | H3 | DHIG04 | NaN | 627.178435 | 7.373550 | 05/03/24 00:00:00 | 6.0 | DHIG | 2023145.0 | ... | 14.3993 | 14.665500 | 20.0 | 0.0 | 7.76 | NaN | NaN | 2.0 | 6.0 | 05/03/24 00:00:00 |
21633 | WDHIG2023147 | H3 | DHIG04 | NaN | 605.715107 | 7.543339 | 05/03/24 00:00:00 | 6.0 | DHIG | 2023147.0 | ... | 14.1980 | 14.330000 | 17.0 | 0.0 | 7.67 | NaN | NaN | 2.0 | 2.0 | 05/03/24 00:00:00 |
21634 rows × 27 columns
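The raw seawater table reports uncertainty as a percentage (`error%_m³`), whereas the NetCDF output stores `UNC` in the same unit as `VALUE`. The first rows are consistent with a simple relative-to-absolute conversion, sketched below. The function name `relative_to_absolute_unc` is hypothetical and this is an illustration of the arithmetic, not the `NormalizeUncCB` code.

```python
import math

def relative_to_absolute_unc(value, error_pct):
    """Convert a relative uncertainty in percent to the unit of `value`."""
    return value * error_pct / 100.0

# (value_bq/m³, error%_m³, UNC in the NetCDF output) for the first three seawater rows.
rows = [(5.3, 32.0, 1.696), (19.9, 20.0, 3.98), (25.5, 20.0, 5.1)]
for value, error_pct, expected in rows:
    unc = relative_to_absolute_unc(value, error_pct)
    print(value, error_pct, round(unc, 3), math.isclose(unc, expected))
```

`math.isclose` is used for the comparison because floating-point products rarely match a printed decimal exactly.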
The MARIS data processing workflow involves two key steps. This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the NetCDFDecoder class.