NetCDF2CSV

A data pipeline that converts MARIS NetCDF files into MARIS Standard OpenRefine CSV format.

This module converts NetCDF files into CSV files that follow the MARIS Standard OpenRefine format. While MARISCO has replaced OpenRefine in the data cleaning and preparation pipeline, the MARIS master database still requires input files to conform to this CSV format specification. The conversion is performed using the marisco library.

Tip

For new MARIS users, please refer to field definitions for detailed information about Maris fields.

Dependencies

Required packages and internal modules for data format transformations

from IPython.display import display, Markdown

Configuration and File Paths

# fname_in =  Path('../../_data/output/tepco.nc')
fname_in =  Path('../../_data/output/100-HELCOM-MORS-2024.nc')
fname_out = fname_in.with_suffix('.csv')
output_format = 'openrefine_csv'

Data Loading

Load data from standardized MARIS NetCDF files using ExtractNetcdfContents. The NetCDF files follow CF conventions and include standardized variable names and metadata according to MARIS specifications.

contents = ExtractNetcdfContents(fname_in)

Show the dictionary of dataframes extracted from the NetCDF file.

contents.dfs.keys()
dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])

Show an example of the DataFrame extracted from the NetCDF file.

with pd.option_context('display.max_columns', None):
    display(contents.dfs['SEAWATER'].head())
SMP_ID_PROVIDER LON LAT SMP_DEPTH TOT_DEPTH TIME SMP_ID NUCLIDE VALUE UNIT UNC DL FILT
0 WKRIL2012003 29.333300 60.083302 0.0 NaN 1337731200 1 33 5.300000 1 1.696 1 0
1 WKRIL2012004 29.333300 60.083302 29.0 NaN 1337731200 2 33 19.900000 1 3.980 1 0
2 WKRIL2012005 23.150000 59.433300 0.0 NaN 1339891200 3 33 25.500000 1 5.100 1 0
3 WKRIL2012006 27.983299 60.250000 0.0 NaN 1337817600 4 33 17.000000 1 4.930 1 0
4 WKRIL2012007 27.983299 60.250000 39.0 NaN 1337817600 5 33 22.200001 1 3.996 1 0

Show an example of the dictionary of enums extracted from the NetCDF file as a DataFrame.

grp='SEAWATER'
print(f'Variables in {grp} group: {contents.enum_dicts[grp].keys()}')
var='dl'
with pd.option_context('display.max_columns', None):
    display(pd.DataFrame.from_dict(contents.enum_dicts[grp][var], orient='index').T)
Variables in SEAWATER group: dict_keys(['nuclide', 'unit', 'dl', 'filt'])
Not applicable Not available Detected value Detection limit Not detected Derived
0 -1 0 1 2 3 4

Show the global attributes extracted from the NetCDF file.

print("First few attributes from global attributes:", list(contents.global_attrs.items())[:5])
First few attributes from global attributes: [('id', '26VMZZ2Q'), ('title', 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances'), ('summary', 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.'), ('keywords', 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'), ('history', 'TBD')]

Show the custom maps extracted from the NetCDF file.

grp='SEAWATER'
print(f'Custom maps in {grp} group: {contents.custom_maps[grp].keys()}')
with pd.option_context('display.max_columns', None):
    display(pd.DataFrame.from_dict(contents.custom_maps[grp], orient='index'))
Custom maps in SEAWATER group: dict_keys([])

Validate NetCDF Enumerations

Verify that enumerated values in the NetCDF file match current MARIS lookup tables.

ImportantFEEDBACK TO DATA PROVIDERS

The enumeration validation process is a diagnostic step that identifies inconsistencies between NetCDF enumerations and MARIS lookup tables. While this validation does not modify the dataset, it generates detailed feedback about any mismatches or undefined values.

NC_VARS
{'SMP_ID': 'smp_id',
 'SMP_ID_PROVIDER': 'id_provider',
 'LON': 'lon',
 'LAT': 'lat',
 'SMP_DEPTH': 'smp_depth',
 'TOT_DEPTH': 'tot_depth',
 'TIME': 'time',
 'AREA': 'area',
 'NUCLIDE': 'nuclide',
 'VALUE': 'value',
 'UNIT': 'unit',
 'UNC': 'unc',
 'DL': 'dl',
 'DLV': 'dlv',
 'FILT': 'filt',
 'COUNT_MET': 'count_met',
 'SAMP_MET': 'samp_met',
 'PREP_MET': 'prep_met',
 'VOL': 'vol',
 'SAL': 'sal',
 'TEMP': 'temp',
 'PH': 'ph',
 'BIO_GROUP': 'bio_group',
 'SPECIES': 'species',
 'BODY_PART': 'body_part',
 'SED_TYPE': 'sed_type',
 'TOP': 'top',
 'BOTTOM': 'bottom',
 'DRYWT': 'drywt',
 'WETWT': 'wetwt',
 'PERCENTWT': 'percentwt',
 'LAB': 'lab',
 'PROFILE_ID': 'profile_id',
 'STATION': 'station'}

source

ValidateEnumsCB

 ValidateEnumsCB (contents, maris_enums, verbose=False)

Validate enumeration mappings between NetCDF file and MARIS lookup tables.

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data= contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        ValidateEnumsCB(
            contents = contents,
            maris_enums=Enums(lut_src_dir=lut_path())
        ),
    ]
)
tfm()
{'BIOTA':       SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH        TIME  SMP_ID  \
 0        BVTIG2012041  12.316667  54.283333        NaN  1348358400       1   
 1        BVTIG2012041  12.316667  54.283333        NaN  1348358400       2   
 2        BVTIG2012041  12.316667  54.283333        NaN  1348358400       3   
 3        BVTIG2012041  12.316667  54.283333        NaN  1348358400       4   
 4        BVTIG2012040  12.316667  54.283333        NaN  1348358400       5   
 ...               ...        ...        ...        ...         ...     ...   
 16089    BSTUK2022010  21.395000  61.241501        2.0  1652140800   16090   
 16090    BSTUK2022010  21.395000  61.241501        2.0  1652140800   16091   
 16091    BSTUK2022011  21.385000  61.343334        NaN  1663200000   16092   
 16092    BSTUK2022011  21.385000  61.343334        NaN  1663200000   16093   
 16093    BSTUK2022011  21.385000  61.343334        NaN  1663200000   16094   
 
        NUCLIDE       VALUE  UNIT       UNC  DL  BIO_GROUP  SPECIES  BODY_PART  \
 0           31    0.010140     5       NaN   2          4       99         52   
 1            4  135.300003     5  4.830210   1          4       99         52   
 2            9    0.013980     5       NaN   2          4       99         52   
 3           33    4.338000     5  0.150962   1          4       99         52   
 4           31    0.009614     5       NaN   2          4       99         52   
 ...        ...         ...   ...       ...  ..        ...      ...        ...   
 16089       33   13.700000     4  0.520600   1         11       96         55   
 16090        9    0.500000     4  0.045500   1         11       96         55   
 16091        4   50.700001     4  4.106700   1         14      129          1   
 16092       33    0.880000     4  0.140800   1         14      129          1   
 16093       12    6.600000     4  0.349800   1         14      129          1   
 
             DRYWT  WETWT  PERCENTWT  
 0      174.934433  948.0    0.18453  
 1      174.934433  948.0    0.18453  
 2      174.934433  948.0    0.18453  
 3      174.934433  948.0    0.18453  
 4      177.935120  964.0    0.18458  
 ...           ...    ...        ...  
 16089         NaN    NaN        NaN  
 16090         NaN    NaN        NaN  
 16091         NaN    NaN        NaN  
 16092         NaN    NaN        NaN  
 16093         NaN    NaN        NaN  
 
 [16094 rows x 17 columns],
 'SEAWATER':       SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH  TOT_DEPTH        TIME  \
 0        WKRIL2012003  29.333300  60.083302        0.0        NaN  1337731200   
 1        WKRIL2012004  29.333300  60.083302       29.0        NaN  1337731200   
 2        WKRIL2012005  23.150000  59.433300        0.0        NaN  1339891200   
 3        WKRIL2012006  27.983299  60.250000        0.0        NaN  1337817600   
 4        WKRIL2012007  27.983299  60.250000       39.0        NaN  1337817600   
 ...               ...        ...        ...        ...        ...         ...   
 21468    WDHIG2023112  13.499833  54.600334        0.0       47.0  1686441600   
 21469    WDHIG2023113  13.499833  54.600334       45.0       47.0  1686441600   
 21470    WDHIG2023143  14.200833  54.600334        0.0       11.0  1686614400   
 21471    WDHIG2023145  14.665500  54.600334        0.0       20.0  1686614400   
 21472    WDHIG2023147  14.330000  54.600334        0.0       17.0  1686614400   
 
        SMP_ID  NUCLIDE       VALUE  UNIT        UNC  DL  FILT  
 0           1       33    5.300000     1   1.696000   1     0  
 1           2       33   19.900000     1   3.980000   1     0  
 2           3       33   25.500000     1   5.100000   1     0  
 3           4       33   17.000000     1   4.930000   1     0  
 4           5       33   22.200001     1   3.996000   1     0  
 ...       ...      ...         ...   ...        ...  ..   ...  
 21468   21469        1  702.838074     1  51.276207   1     0  
 21469   21470        1  725.855713     1  52.686260   1     0  
 21470   21471        1  648.992920     1  48.154419   1     0  
 21471   21472        1  627.178406     1  46.245316   1     0  
 21472   21473        1  605.715088     1  45.691143   1     0  
 
 [21473 rows x 13 columns],
 'SEDIMENT':       SMP_ID_PROVIDER        LON        LAT  TOT_DEPTH        TIME  SMP_ID  \
 0        SKRIL2012116  27.799999  60.466667       25.0  1337904000       1   
 1        SKRIL2012117  27.799999  60.466667       25.0  1337904000       2   
 2        SKRIL2012118  27.799999  60.466667       25.0  1337904000       3   
 3        SKRIL2012119  27.799999  60.466667       25.0  1337904000       4   
 4        SKRIL2012120  27.799999  60.466667       25.0  1337904000       5   
 ...               ...        ...        ...        ...         ...     ...   
 70444    SCLOR2022071  15.537800  54.617832       62.0  1654646400   70445   
 70445    SCLOR2022071  15.537800  54.617832       62.0  1654646400   70446   
 70446    SCLOR2022072  15.537800  54.617832       62.0  1654646400   70447   
 70447    SCLOR2022072  15.537800  54.617832       62.0  1654646400   70448   
 70448    SCLOR2022072  15.537800  54.617832       62.0  1654646400   70449   
 
        NUCLIDE        VALUE  UNIT         UNC  DL  SED_TYPE   TOP  BOTTOM  \
 0           33  1200.000000     3  240.000000   1         0  15.0    20.0   
 1           33   250.000000     3   50.000000   1         0  20.0    25.0   
 2           33   140.000000     3   29.400000   1         0  25.0    30.0   
 3           33    79.000000     3   15.800000   1         0  30.0    35.0   
 4           33    29.000000     3    6.960000   1         0  35.0    40.0   
 ...        ...          ...   ...         ...  ..       ...   ...     ...   
 70444       67     0.044000     2    0.015312   1        10  15.0    17.0   
 70445       77     2.500000     2    0.185000   1        10  15.0    17.0   
 70446        4  5873.000000     2  164.444000   1        10  17.0    19.0   
 70447       33    21.200001     2    2.162400   1        10  17.0    19.0   
 70448       77     0.370000     2    0.048100   1        10  17.0    19.0   
 
        PERCENTWT  
 0            NaN  
 1            NaN  
 2            NaN  
 3            NaN  
 4            NaN  
 ...          ...  
 70444   0.257642  
 70445   0.257642  
 70446   0.263965  
 70447   0.263965  
 70448   0.263965  
 
 [70449 rows x 15 columns]}

Remove Non Compatible Columns

The [RemoveNonCompatibleVariablesCB](https://franckalbinet.github.io/marisco/api/netcdf2csv.html#removenoncompatiblevariablescb) callback filters out variables from the NetCDF format that are not listed in the VARS configuration.


source

RemoveNonCompatibleVariablesCB

 RemoveNonCompatibleVariablesCB (vars:Dict[str,str]={'LON': 'longitude',
                                 'LAT': 'latitude', 'SMP_DEPTH':
                                 'sampdepth', 'TOT_DEPTH': 'totdepth',
                                 'TIME': 'begperiod', 'AREA': 'area',
                                 'NUCLIDE': 'nuclide_id', 'VALUE':
                                 'activity', 'UNIT': 'unit_id', 'UNC':
                                 'uncertaint', 'DL': 'detection', 'DLV':
                                 'detection_lim', 'FILT': 'filtered',
                                 'COUNT_MET': 'counmet_id', 'SAMP_MET':
                                 'sampmet_id', 'PREP_MET': 'prepmet_id',
                                 'VOL': 'volume', 'SAL': 'salinity',
                                 'TEMP': 'temperatur', 'SPECIES':
                                 'species_id', 'BODY_PART': 'bodypar_id',
                                 'SED_TYPE': 'sedtype_id', 'TOP':
                                 'sliceup', 'BOTTOM': 'slicedown',
                                 'DRYWT': 'drywt', 'WETWT': 'wetwt',
                                 'PERCENTWT': 'percentwt', 'LAB':
                                 'lab_id', 'PROFILE_ID': 'profile_id',
                                 'SAMPLE_TYPE': 'samptype_id',
                                 'TAXONNAME': 'taxonname', 'TAXONREPNAME':
                                 'taxonrepname', 'TAXONRANK': 'taxonrank',
                                 'TAXONDB': 'taxondb', 'TAXONDBID':
                                 'taxondb_id', 'TAXONDBURL':
                                 'taxondb_url', 'REF_ID': 'ref_id',
                                 'SMP_ID_PROVIDER': 'samplabcode',
                                 'STATION': 'station'},
                                 verbose:bool=False)

Remove variables not listed in VARS configuration.

Type Default Details
vars Dict {‘LON’: ‘longitude’, ‘LAT’: ‘latitude’, ‘SMP_DEPTH’: ‘sampdepth’, ‘TOT_DEPTH’: ‘totdepth’, ‘TIME’: ‘begperiod’, ‘AREA’: ‘area’, ‘NUCLIDE’: ‘nuclide_id’, ‘VALUE’: ‘activity’, ‘UNIT’: ‘unit_id’, ‘UNC’: ‘uncertaint’, ‘DL’: ‘detection’, ‘DLV’: ‘detection_lim’, ‘FILT’: ‘filtered’, ‘COUNT_MET’: ‘counmet_id’, ‘SAMP_MET’: ‘sampmet_id’, ‘PREP_MET’: ‘prepmet_id’, ‘VOL’: ‘volume’, ‘SAL’: ‘salinity’, ‘TEMP’: ‘temperatur’, ‘SPECIES’: ‘species_id’, ‘BODY_PART’: ‘bodypar_id’, ‘SED_TYPE’: ‘sedtype_id’, ‘TOP’: ‘sliceup’, ‘BOTTOM’: ‘slicedown’, ‘DRYWT’: ‘drywt’, ‘WETWT’: ‘wetwt’, ‘PERCENTWT’: ‘percentwt’, ‘LAB’: ‘lab_id’, ‘PROFILE_ID’: ‘profile_id’, ‘SAMPLE_TYPE’: ‘samptype_id’, ‘TAXONNAME’: ‘taxonname’, ‘TAXONREPNAME’: ‘taxonrepname’, ‘TAXONRANK’: ‘taxonrank’, ‘TAXONDB’: ‘taxondb’, ‘TAXONDBID’: ‘taxondb_id’, ‘TAXONDBURL’: ‘taxondb_url’, ‘REF_ID’: ‘ref_id’, ‘SMP_ID_PROVIDER’: ‘samplabcode’, ‘STATION’: ‘station’} Dictionary mapping OR vars to NC vars
verbose bool False
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS, verbose=True),
    ]
)
tfm()
print(tfm.dfs['BIOTA'].head())
Removing variables that are not compatible with vars provided. 
Removing SMP_ID, BIO_GROUP from BIOTA dataset.
Removing variables that are not compatible with vars provided. 
Removing SMP_ID from SEAWATER dataset.
Removing variables that are not compatible with vars provided. 
Removing SMP_ID from SEDIMENT dataset.
  SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH        TIME  NUCLIDE  \
0    BVTIG2012041  12.316667  54.283333        NaN  1348358400       31   
1    BVTIG2012041  12.316667  54.283333        NaN  1348358400        4   
2    BVTIG2012041  12.316667  54.283333        NaN  1348358400        9   
3    BVTIG2012041  12.316667  54.283333        NaN  1348358400       33   
4    BVTIG2012040  12.316667  54.283333        NaN  1348358400       31   

        VALUE  UNIT       UNC  DL  SPECIES  BODY_PART       DRYWT  WETWT  \
0    0.010140     5       NaN   2       99         52  174.934433  948.0   
1  135.300003     5  4.830210   1       99         52  174.934433  948.0   
2    0.013980     5       NaN   2       99         52  174.934433  948.0   
3    4.338000     5  0.150962   1       99         52  174.934433  948.0   
4    0.009614     5       NaN   2       99         52  177.935120  964.0   

   PERCENTWT  
0    0.18453  
1    0.18453  
2    0.18453  
3    0.18453  
4    0.18458  

Sample ID Handling

The MARISCO NetCDF encoding pipeline uses SMP_ID to store unique identifiers for each sample measurement. These IDs can come from two sources:

  1. Directly from data providers who supply their own unique identifiers
  2. Auto-generated by MARISCO as incremented integers when providers don’t supply IDs

The Maris database ingestion pipeline expects:

  • The original SMP_ID if provided by the data source
  • None if the ID was auto-generated by MARISCO

Note: When writing to CSV, SMP_ID is renamed to samplabcode as defined in the CSV_VARS mapping in configs.ipynb.

The callback below implements this logic by providing an option to convert auto-generated SMP_ID values to None while preserving original provider-supplied IDs.


source

SampleIDConversionCB

 SampleIDConversionCB (nonify:bool=False)

Convert auto-generated SMP_ID values to None while preserving original provider-supplied IDs.

Type Default Details
nonify bool False If True, convert auto-generated SMP_ID values to None
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS, verbose=True),
        # SampleIDConversionCB(nonify=True)
    ]
)
tfm()
print(tfm.dfs['BIOTA'].head())
Removing variables that are not compatible with vars provided. 
Removing SMP_ID, BIO_GROUP from BIOTA dataset.
Removing variables that are not compatible with vars provided. 
Removing SMP_ID from SEAWATER dataset.
Removing variables that are not compatible with vars provided. 
Removing SMP_ID from SEDIMENT dataset.
  SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH        TIME  NUCLIDE  \
0    BVTIG2012041  12.316667  54.283333        NaN  1348358400       31   
1    BVTIG2012041  12.316667  54.283333        NaN  1348358400        4   
2    BVTIG2012041  12.316667  54.283333        NaN  1348358400        9   
3    BVTIG2012041  12.316667  54.283333        NaN  1348358400       33   
4    BVTIG2012040  12.316667  54.283333        NaN  1348358400       31   

        VALUE  UNIT       UNC  DL  SPECIES  BODY_PART       DRYWT  WETWT  \
0    0.010140     5       NaN   2       99         52  174.934433  948.0   
1  135.300003     5  4.830210   1       99         52  174.934433  948.0   
2    0.013980     5       NaN   2       99         52  174.934433  948.0   
3    4.338000     5  0.150962   1       99         52  174.934433  948.0   
4    0.009614     5       NaN   2       99         52  177.935120  964.0   

   PERCENTWT  
0    0.18453  
1    0.18453  
2    0.18453  
3    0.18453  
4    0.18458  

Add Taxon Information


source

get_taxon_info_lut

 get_taxon_info_lut (maris_lut:str, key_names:dict={'Taxonname':
                     'TAXONNAME', 'Taxonrank': 'TAXONRANK', 'TaxonDB':
                     'TAXONDB', 'TaxonDBID': 'TAXONDBID', 'TaxonDBURL':
                     'TAXONDBURL'})

Create lookup dictionary for taxon information from MARIS species lookup table.


source

AddTaxonInformationCB

 AddTaxonInformationCB (fn_lut:Callable=<function <lambda>>,
                        verbose:bool=False)

Add taxon information to BIOTA group based on species lookup table.

Type Default Details
fn_lut Callable Function that returns taxon lookup dictionary
verbose bool False
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddTaxonInformationCB(
            fn_lut=lut_taxon
        ),
    ]
)

tfm()
{'BIOTA':       SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH        TIME  SMP_ID  \
 0        BVTIG2012041  12.316667  54.283333        NaN  1348358400       1   
 1        BVTIG2012041  12.316667  54.283333        NaN  1348358400       2   
 2        BVTIG2012041  12.316667  54.283333        NaN  1348358400       3   
 3        BVTIG2012041  12.316667  54.283333        NaN  1348358400       4   
 4        BVTIG2012040  12.316667  54.283333        NaN  1348358400       5   
 ...               ...        ...        ...        ...         ...     ...   
 16089    BSTUK2022010  21.395000  61.241501        2.0  1652140800   16090   
 16090    BSTUK2022010  21.395000  61.241501        2.0  1652140800   16091   
 16091    BSTUK2022011  21.385000  61.343334        NaN  1663200000   16092   
 16092    BSTUK2022011  21.385000  61.343334        NaN  1663200000   16093   
 16093    BSTUK2022011  21.385000  61.343334        NaN  1663200000   16094   
 
        NUCLIDE       VALUE  UNIT       UNC  ...  SPECIES  BODY_PART  \
 0           31    0.010140     5       NaN  ...       99         52   
 1            4  135.300003     5  4.830210  ...       99         52   
 2            9    0.013980     5       NaN  ...       99         52   
 3           33    4.338000     5  0.150962  ...       99         52   
 4           31    0.009614     5       NaN  ...       99         52   
 ...        ...         ...   ...       ...  ...      ...        ...   
 16089       33   13.700000     4  0.520600  ...       96         55   
 16090        9    0.500000     4  0.045500  ...       96         55   
 16091        4   50.700001     4  4.106700  ...      129          1   
 16092       33    0.880000     4  0.140800  ...      129          1   
 16093       12    6.600000     4  0.349800  ...      129          1   
 
             DRYWT  WETWT  PERCENTWT          TAXONNAME  TAXONRANK   TAXONDB  \
 0      174.934433  948.0    0.18453       Gadus morhua    species  Wikidata   
 1      174.934433  948.0    0.18453       Gadus morhua    species  Wikidata   
 2      174.934433  948.0    0.18453       Gadus morhua    species  Wikidata   
 3      174.934433  948.0    0.18453       Gadus morhua    species  Wikidata   
 4      177.935120  964.0    0.18458       Gadus morhua    species  Wikidata   
 ...           ...    ...        ...                ...        ...       ...   
 16089         NaN    NaN        NaN  Fucus vesiculosus    species  Wikidata   
 16090         NaN    NaN        NaN  Fucus vesiculosus    species  Wikidata   
 16091         NaN    NaN        NaN     Mytilus edulis    species  Wikidata   
 16092         NaN    NaN        NaN     Mytilus edulis    species  Wikidata   
 16093         NaN    NaN        NaN     Mytilus edulis    species  Wikidata   
 
       TAXONDBID                             TAXONDBURL  
 0       Q199788  https://www.wikidata.org/wiki/Q199788  
 1       Q199788  https://www.wikidata.org/wiki/Q199788  
 2       Q199788  https://www.wikidata.org/wiki/Q199788  
 3       Q199788  https://www.wikidata.org/wiki/Q199788  
 4       Q199788  https://www.wikidata.org/wiki/Q199788  
 ...         ...                                    ...  
 16089   Q754755  https://www.wikidata.org/wiki/Q754755  
 16090   Q754755  https://www.wikidata.org/wiki/Q754755  
 16091    Q27855   https://www.wikidata.org/wiki/Q27855  
 16092    Q27855   https://www.wikidata.org/wiki/Q27855  
 16093    Q27855   https://www.wikidata.org/wiki/Q27855  
 
 [16094 rows x 22 columns],
 'SEAWATER':       SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH  TOT_DEPTH        TIME  \
 0        WKRIL2012003  29.333300  60.083302        0.0        NaN  1337731200   
 1        WKRIL2012004  29.333300  60.083302       29.0        NaN  1337731200   
 2        WKRIL2012005  23.150000  59.433300        0.0        NaN  1339891200   
 3        WKRIL2012006  27.983299  60.250000        0.0        NaN  1337817600   
 4        WKRIL2012007  27.983299  60.250000       39.0        NaN  1337817600   
 ...               ...        ...        ...        ...        ...         ...   
 21468    WDHIG2023112  13.499833  54.600334        0.0       47.0  1686441600   
 21469    WDHIG2023113  13.499833  54.600334       45.0       47.0  1686441600   
 21470    WDHIG2023143  14.200833  54.600334        0.0       11.0  1686614400   
 21471    WDHIG2023145  14.665500  54.600334        0.0       20.0  1686614400   
 21472    WDHIG2023147  14.330000  54.600334        0.0       17.0  1686614400   
 
        SMP_ID  NUCLIDE       VALUE  UNIT        UNC  DL  FILT  
 0           1       33    5.300000     1   1.696000   1     0  
 1           2       33   19.900000     1   3.980000   1     0  
 2           3       33   25.500000     1   5.100000   1     0  
 3           4       33   17.000000     1   4.930000   1     0  
 4           5       33   22.200001     1   3.996000   1     0  
 ...       ...      ...         ...   ...        ...  ..   ...  
 21468   21469        1  702.838074     1  51.276207   1     0  
 21469   21470        1  725.855713     1  52.686260   1     0  
 21470   21471        1  648.992920     1  48.154419   1     0  
 21471   21472        1  627.178406     1  46.245316   1     0  
 21472   21473        1  605.715088     1  45.691143   1     0  
 
 [21473 rows x 13 columns],
 'SEDIMENT':       SMP_ID_PROVIDER        LON        LAT  TOT_DEPTH        TIME  SMP_ID  \
 0        SKRIL2012116  27.799999  60.466667       25.0  1337904000       1   
 1        SKRIL2012117  27.799999  60.466667       25.0  1337904000       2   
 2        SKRIL2012118  27.799999  60.466667       25.0  1337904000       3   
 3        SKRIL2012119  27.799999  60.466667       25.0  1337904000       4   
 4        SKRIL2012120  27.799999  60.466667       25.0  1337904000       5   
 ...               ...        ...        ...        ...         ...     ...   
 70444    SCLOR2022071  15.537800  54.617832       62.0  1654646400   70445   
 70445    SCLOR2022071  15.537800  54.617832       62.0  1654646400   70446   
 70446    SCLOR2022072  15.537800  54.617832       62.0  1654646400   70447   
 70447    SCLOR2022072  15.537800  54.617832       62.0  1654646400   70448   
 70448    SCLOR2022072  15.537800  54.617832       62.0  1654646400   70449   
 
        NUCLIDE        VALUE  UNIT         UNC  DL  SED_TYPE   TOP  BOTTOM  \
 0           33  1200.000000     3  240.000000   1         0  15.0    20.0   
 1           33   250.000000     3   50.000000   1         0  20.0    25.0   
 2           33   140.000000     3   29.400000   1         0  25.0    30.0   
 3           33    79.000000     3   15.800000   1         0  30.0    35.0   
 4           33    29.000000     3    6.960000   1         0  35.0    40.0   
 ...        ...          ...   ...         ...  ..       ...   ...     ...   
 70444       67     0.044000     2    0.015312   1        10  15.0    17.0   
 70445       77     2.500000     2    0.185000   1        10  15.0    17.0   
 70446        4  5873.000000     2  164.444000   1        10  17.0    19.0   
 70447       33    21.200001     2    2.162400   1        10  17.0    19.0   
 70448       77     0.370000     2    0.048100   1        10  17.0    19.0   
 
        PERCENTWT  
 0            NaN  
 1            NaN  
 2            NaN  
 3            NaN  
 4            NaN  
 ...          ...  
 70444   0.257642  
 70445   0.257642  
 70446   0.263965  
 70447   0.263965  
 70448   0.263965  
 
 [70449 rows x 15 columns]}

Standardize Time

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        DecodeTimeCB(),
    ]
)

tfm()

print(tfm.dfs['SEAWATER']['TIME'])
0       2012-05-23
1       2012-05-23
2       2012-06-17
3       2012-05-24
4       2012-05-24
           ...    
21468   2023-06-11
21469   2023-06-11
21470   2023-06-13
21471   2023-06-13
21472   2023-06-13
Name: TIME, Length: 21473, dtype: datetime64[ns]

Add Sample Type ID

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddSampleTypeIdColumnCB(),
    ]
)

tfm()
print(tfm.dfs['SEAWATER']['SAMPLE_TYPE'].unique())
# print(tfm.dfs['BIOTA']['SAMPLE_TYPE'].unique())
# print(tfm.dfs['SEDIMENT']['SAMPLE_TYPE'].unique())
[1]

Add Reference ID

Include the ref_id (i.e., Zotero Archive Location). The ZoteroArchiveLocationCB performs a lookup of the Zotero Archive Location based on the Zotero key defined in the global attributes of the MARIS NetCDF file as id.

contents.global_attrs['id']
'26VMZZ2Q'

source

AddZoteroArchiveLocationCB

 AddZoteroArchiveLocationCB (attrs:str, cfg:dict)

Fetch and append ‘Loc. in Archive’ from Zotero to DataFrame.

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddZoteroArchiveLocationCB(contents.global_attrs, cfg=cfg()),
    ]
)
tfm()
print(tfm.dfs['SEAWATER']['REF_ID'].unique())
[100]

Remap to Open Refine specific mappings

WarningFEEDBACK FOR NEXT VERSION

[To be further clarified]

The current approach of remapping to OR-specific mappings should be reconsidered. Considering that we already utilize MARISCO lookup tables in NetCDF for creating enums, it would be beneficial to extend their use to OpenRefine data formats as well. By doing so, we could eliminate the need for OpenRefine-specific mappings, streamlining the data transformation process. Lets review the lookup tables used to create the enums for NetCDF:

enums = Enums(lut_src_dir=lut_path())
print(f'DL enums: {enums.types["DL"]}')
print(f'FILT enums: {enums.types["FILT"]}')
DL enums: {'Not applicable': -1, 'Not available': 0, 'Detected value': 1, 'Detection limit': 2, 'Not detected': 3, 'Derived': 4}
FILT enums: {'Not applicable': -1, 'Not available': 0, 'Yes': 1, 'No': 2}

For the detection limit lookup table (LUT), as shown below, the values required for the OpenRefine CSV format are listed under the ‘name’ column, whereas the enums utilize the ‘name_sanitized’ column.

Additionally, for the filtered LUT, also shown below, the values do not align consistently with the OpenRefine CSV format, which uses (Y, N, NA).

dl_lut = pd.read_excel(detection_limit_lut_path())
dl_lut
id name name_sanitized
0 -1 Not applicable Not applicable
1 0 Not Available Not available
2 1 = Detected value
3 2 < Detection limit
4 3 ND Not detected
5 4 DE Derived
filtered_lut = pd.read_excel(filtered_lut_path())
filtered_lut
id name
0 -1 Not applicable
1 0 Not available
2 1 Yes
3 2 No

We will create OpenRefine specific mappings for the detection limit and filtered data:

RemapToORSpecificMappingsCB remaps the values of the detection limit and filtered data to the OpenRefine CSV format.


source

RemapToORSpecificMappingsCB

 RemapToORSpecificMappingsCB (or_mappings:Dict[str,Dict]={'DL': {3: 'ND',
                              1: '=', 2: '<'}, 'FILT': {0: 'NA', 1: 'Y',
                              2: 'N'}},
                              output_format:str='openrefine_csv',
                              verbose:bool=False)

Convert values using OR mappings if columns exist in dataframe.

Type Default Details
or_mappings Dict {‘DL’: {3: ‘ND’, 1: ‘=’, 2: ‘<’}, ‘FILT’: {0: ‘NA’, 1: ‘Y’, 2: ‘N’}} Dictionary of column mappings,
output_format str openrefine_csv
verbose bool False
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data= contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemapToORSpecificMappingsCB(),
    ]
)

tfm()

# Loop through each group in the 'dfs' dictionary
for group_name, df in tfm.dfs.items():
    # Check if the group dataframe contains any of the columns specified in or_mappings.keys()
    relevant_columns = [col for col in or_mappings.keys() if col in df.columns]
    if relevant_columns:
        # Print the unique values from the relevant columns
        print(f"\nUnique values in {group_name} for columns {relevant_columns}:")
        for col in relevant_columns:
            print(f"{col}: {df[col].unique()}")
    else:
        print(f"No relevant columns found in {group_name} based on or_mappings keys.")

Unique values in BIOTA for columns ['DL']:
DL: ['<' '=' nan]

Unique values in SEAWATER for columns ['DL', 'FILT']:
DL: ['=' '<' nan]
FILT: ['NA' 'N' 'Y']

Unique values in SEDIMENT for columns ['DL']:
DL: ['=' '<' nan]

Remap to CSV data type format

CSV_DTYPES (defined in configs.ipynb) defines a state for each variable that contains a lookup table (i.e. enums). The state is either ‘decoded’ or ‘encoded’. Lets review the variable states as a DataFrame:

with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(pd.DataFrame.from_dict(CSV_DTYPES, orient='index').T)
AREA NUCLIDE UNIT DL FILT COUNT_MET SAMP_MET PREP_MET SPECIES BODY_PART SED_TYPE LAB
state decoded encoded encoded decoded decoded encoded encoded encoded encoded encoded encoded encoded
enums = Enums(lut_src_dir=lut_path())
enums.types.keys()
dict_keys(['AREA', 'BIO_GROUP', 'BODY_PART', 'COUNT_MET', 'DL', 'FILT', 'NUCLIDE', 'PREP_MET', 'SAMP_MET', 'SED_TYPE', 'SPECIES', 'UNIT', 'LAB'])

source

get_excluded_enums

 get_excluded_enums (output_format:str='openrefine_csv')

Get excluded enums based on output format.


source

DataFormatConversionCB

 DataFormatConversionCB (dtypes:Dict, excluded_mappings:Callable=<function
                         get_excluded_enums>,
                         output_format:str='openrefine_csv',
                         verbose:bool=False)

A callback to convert DataFrame enum values between encoded and decoded formats based on specified settings.

Type Default Details
dtypes Dict Dictionary defining data types and states for each lookup table
excluded_mappings Callable get_excluded_enums Dictionary of columns to exclude from conversion
output_format str openrefine_csv
verbose bool False Flag for verbose output
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    contents.dfs,
    cbs=[
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS, verbose=True),
        DataFormatConversionCB(
            dtypes=CSV_DTYPES,
            excluded_mappings = get_excluded_enums,
            output_format='openrefine_csv',
            verbose=True
        ),
    ]
)
tfm()
Removing variables that are not compatible with vars provided. 
Removing SMP_ID, BIO_GROUP from BIOTA dataset.
Removing variables that are not compatible with vars provided. 
Removing SMP_ID from SEAWATER dataset.
Removing variables that are not compatible with vars provided. 
Removing SMP_ID from SEDIMENT dataset.
Loaded enums: dict_keys(['AREA', 'BIO_GROUP', 'BODY_PART', 'COUNT_MET', 'DL', 'FILT', 'NUCLIDE', 'PREP_MET', 'SAMP_MET', 'SED_TYPE', 'SPECIES', 'UNIT', 'LAB'])
{'BIOTA':       SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH        TIME  NUCLIDE  \
 0        BVTIG2012041  12.316667  54.283333        NaN  1348358400       31   
 1        BVTIG2012041  12.316667  54.283333        NaN  1348358400        4   
 2        BVTIG2012041  12.316667  54.283333        NaN  1348358400        9   
 3        BVTIG2012041  12.316667  54.283333        NaN  1348358400       33   
 4        BVTIG2012040  12.316667  54.283333        NaN  1348358400       31   
 ...               ...        ...        ...        ...         ...      ...   
 16089    BSTUK2022010  21.395000  61.241501        2.0  1652140800       33   
 16090    BSTUK2022010  21.395000  61.241501        2.0  1652140800        9   
 16091    BSTUK2022011  21.385000  61.343334        NaN  1663200000        4   
 16092    BSTUK2022011  21.385000  61.343334        NaN  1663200000       33   
 16093    BSTUK2022011  21.385000  61.343334        NaN  1663200000       12   
 
             VALUE  UNIT       UNC  DL  SPECIES  BODY_PART       DRYWT  WETWT  \
 0        0.010140     5       NaN   2       99         52  174.934433  948.0   
 1      135.300003     5  4.830210   1       99         52  174.934433  948.0   
 2        0.013980     5       NaN   2       99         52  174.934433  948.0   
 3        4.338000     5  0.150962   1       99         52  174.934433  948.0   
 4        0.009614     5       NaN   2       99         52  177.935120  964.0   
 ...           ...   ...       ...  ..      ...        ...         ...    ...   
 16089   13.700000     4  0.520600   1       96         55         NaN    NaN   
 16090    0.500000     4  0.045500   1       96         55         NaN    NaN   
 16091   50.700001     4  4.106700   1      129          1         NaN    NaN   
 16092    0.880000     4  0.140800   1      129          1         NaN    NaN   
 16093    6.600000     4  0.349800   1      129          1         NaN    NaN   
 
        PERCENTWT  
 0        0.18453  
 1        0.18453  
 2        0.18453  
 3        0.18453  
 4        0.18458  
 ...          ...  
 16089        NaN  
 16090        NaN  
 16091        NaN  
 16092        NaN  
 16093        NaN  
 
 [16094 rows x 15 columns],
 'SEAWATER':       SMP_ID_PROVIDER        LON        LAT  SMP_DEPTH  TOT_DEPTH        TIME  \
 0        WKRIL2012003  29.333300  60.083302        0.0        NaN  1337731200   
 1        WKRIL2012004  29.333300  60.083302       29.0        NaN  1337731200   
 2        WKRIL2012005  23.150000  59.433300        0.0        NaN  1339891200   
 3        WKRIL2012006  27.983299  60.250000        0.0        NaN  1337817600   
 4        WKRIL2012007  27.983299  60.250000       39.0        NaN  1337817600   
 ...               ...        ...        ...        ...        ...         ...   
 21468    WDHIG2023112  13.499833  54.600334        0.0       47.0  1686441600   
 21469    WDHIG2023113  13.499833  54.600334       45.0       47.0  1686441600   
 21470    WDHIG2023143  14.200833  54.600334        0.0       11.0  1686614400   
 21471    WDHIG2023145  14.665500  54.600334        0.0       20.0  1686614400   
 21472    WDHIG2023147  14.330000  54.600334        0.0       17.0  1686614400   
 
        NUCLIDE       VALUE  UNIT        UNC  DL  FILT  
 0           33    5.300000     1   1.696000   1     0  
 1           33   19.900000     1   3.980000   1     0  
 2           33   25.500000     1   5.100000   1     0  
 3           33   17.000000     1   4.930000   1     0  
 4           33   22.200001     1   3.996000   1     0  
 ...        ...         ...   ...        ...  ..   ...  
 21468        1  702.838074     1  51.276207   1     0  
 21469        1  725.855713     1  52.686260   1     0  
 21470        1  648.992920     1  48.154419   1     0  
 21471        1  627.178406     1  46.245316   1     0  
 21472        1  605.715088     1  45.691143   1     0  
 
 [21473 rows x 12 columns],
 'SEDIMENT':       SMP_ID_PROVIDER        LON        LAT  TOT_DEPTH        TIME  NUCLIDE  \
 0        SKRIL2012116  27.799999  60.466667       25.0  1337904000       33   
 1        SKRIL2012117  27.799999  60.466667       25.0  1337904000       33   
 2        SKRIL2012118  27.799999  60.466667       25.0  1337904000       33   
 3        SKRIL2012119  27.799999  60.466667       25.0  1337904000       33   
 4        SKRIL2012120  27.799999  60.466667       25.0  1337904000       33   
 ...               ...        ...        ...        ...         ...      ...   
 70444    SCLOR2022071  15.537800  54.617832       62.0  1654646400       67   
 70445    SCLOR2022071  15.537800  54.617832       62.0  1654646400       77   
 70446    SCLOR2022072  15.537800  54.617832       62.0  1654646400        4   
 70447    SCLOR2022072  15.537800  54.617832       62.0  1654646400       33   
 70448    SCLOR2022072  15.537800  54.617832       62.0  1654646400       77   
 
              VALUE  UNIT         UNC  DL  SED_TYPE   TOP  BOTTOM  PERCENTWT  
 0      1200.000000     3  240.000000   1         0  15.0    20.0        NaN  
 1       250.000000     3   50.000000   1         0  20.0    25.0        NaN  
 2       140.000000     3   29.400000   1         0  25.0    30.0        NaN  
 3        79.000000     3   15.800000   1         0  30.0    35.0        NaN  
 4        29.000000     3    6.960000   1         0  35.0    40.0        NaN  
 ...            ...   ...         ...  ..       ...   ...     ...        ...  
 70444     0.044000     2    0.015312   1        10  15.0    17.0   0.257642  
 70445     2.500000     2    0.185000   1        10  15.0    17.0   0.257642  
 70446  5873.000000     2  164.444000   1        10  17.0    19.0   0.263965  
 70447    21.200001     2    2.162400   1        10  17.0    19.0   0.263965  
 70448     0.370000     2    0.048100   1        10  17.0    19.0   0.263965  
 
 [70449 rows x 14 columns]}

Review all callbacks

contents = ExtractNetcdfContents(fname_in)
output_format = 'openrefine_csv'
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        ValidateEnumsCB(
            contents = contents,
            maris_enums=Enums(lut_src_dir=lut_path())
        ),
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS),
        # SampleIDConversionCB(nonify=False),
        # RemapCustomMapsCB(),
        RemapToORSpecificMappingsCB(output_format=output_format),
        AddTaxonInformationCB(
            fn_lut=lut_taxon
        ),
        DecodeTimeCB(),
        AddSampleTypeIdColumnCB(),
        AddZoteroArchiveLocationCB(contents.global_attrs, cfg=cfg()),
        DataFormatConversionCB(
            dtypes=CSV_DTYPES,
            excluded_mappings = get_excluded_enums,
            output_format=output_format,
        ) 
        ]
)
tfm()
for grp in ['SEAWATER', 'BIOTA']:
    display(Markdown(f"<b>Head of the transformed `{grp}` DataFrame:</b>"))
    with pd.option_context('display.max_rows', None):
        display(tfm.dfs[grp].head())

Head of the transformed SEAWATER DataFrame:

SMP_ID_PROVIDER LON LAT SMP_DEPTH TOT_DEPTH TIME NUCLIDE VALUE UNIT UNC DL FILT SAMPLE_TYPE REF_ID
0 WKRIL2012003 29.333300 60.083302 0.0 NaN 2012-05-23 33 5.300000 1 1.696 = NA 1 100
1 WKRIL2012004 29.333300 60.083302 29.0 NaN 2012-05-23 33 19.900000 1 3.980 = NA 1 100
2 WKRIL2012005 23.150000 59.433300 0.0 NaN 2012-06-17 33 25.500000 1 5.100 = NA 1 100
3 WKRIL2012006 27.983299 60.250000 0.0 NaN 2012-05-24 33 17.000000 1 4.930 = NA 1 100
4 WKRIL2012007 27.983299 60.250000 39.0 NaN 2012-05-24 33 22.200001 1 3.996 = NA 1 100

Head of the transformed BIOTA DataFrame:

SMP_ID_PROVIDER LON LAT SMP_DEPTH TIME NUCLIDE VALUE UNIT UNC DL ... DRYWT WETWT PERCENTWT TAXONNAME TAXONRANK TAXONDB TAXONDBID TAXONDBURL SAMPLE_TYPE REF_ID
0 BVTIG2012041 12.316667 54.283333 NaN 2012-09-23 31 0.010140 5 NaN < ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
1 BVTIG2012041 12.316667 54.283333 NaN 2012-09-23 4 135.300003 5 4.830210 = ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
2 BVTIG2012041 12.316667 54.283333 NaN 2012-09-23 9 0.013980 5 NaN < ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
3 BVTIG2012041 12.316667 54.283333 NaN 2012-09-23 33 4.338000 5 0.150962 = ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
4 BVTIG2012040 12.316667 54.283333 NaN 2012-09-23 31 0.009614 5 NaN < ... 177.935120 964.0 0.18458 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100

5 rows × 22 columns

Decode


source

decode

 decode (fname_in:str, dest_out:str|None=None,
         output_format:str='openrefine_csv',
         remap_vars:Dict[str,str]={'LON': 'longitude', 'LAT': 'latitude',
         'SMP_DEPTH': 'sampdepth', 'TOT_DEPTH': 'totdepth', 'TIME':
         'begperiod', 'AREA': 'area', 'NUCLIDE': 'nuclide_id', 'VALUE':
         'activity', 'UNIT': 'unit_id', 'UNC': 'uncertaint', 'DL':
         'detection', 'DLV': 'detection_lim', 'FILT': 'filtered',
         'COUNT_MET': 'counmet_id', 'SAMP_MET': 'sampmet_id', 'PREP_MET':
         'prepmet_id', 'VOL': 'volume', 'SAL': 'salinity', 'TEMP':
         'temperatur', 'SPECIES': 'species_id', 'BODY_PART': 'bodypar_id',
         'SED_TYPE': 'sedtype_id', 'TOP': 'sliceup', 'BOTTOM':
         'slicedown', 'DRYWT': 'drywt', 'WETWT': 'wetwt', 'PERCENTWT':
         'percentwt', 'LAB': 'lab_id', 'PROFILE_ID': 'profile_id',
         'SAMPLE_TYPE': 'samptype_id', 'TAXONNAME': 'taxonname',
         'TAXONREPNAME': 'taxonrepname', 'TAXONRANK': 'taxonrank',
         'TAXONDB': 'taxondb', 'TAXONDBID': 'taxondb_id', 'TAXONDBURL':
         'taxondb_url', 'REF_ID': 'ref_id', 'SMP_ID_PROVIDER':
         'samplabcode', 'STATION': 'station'},
         remap_dtypes:Dict[str,str]={'AREA': {'state': 'decoded'},
         'NUCLIDE': {'state': 'encoded'}, 'UNIT': {'state': 'encoded'},
         'DL': {'state': 'decoded'}, 'FILT': {'state': 'decoded'},
         'COUNT_MET': {'state': 'encoded'}, 'SAMP_MET': {'state':
         'encoded'}, 'PREP_MET': {'state': 'encoded'}, 'SPECIES':
         {'state': 'encoded'}, 'BODY_PART': {'state': 'encoded'},
         'SED_TYPE': {'state': 'encoded'}, 'LAB': {'state': 'encoded'}},
         verbose:bool=False, **kwargs)

Decode data from NetCDF.

Type Default Details
fname_in str Input file name
dest_out str | None None Output file name (optional)
output_format str openrefine_csv Output format
remap_vars Dict {‘LON’: ‘longitude’, ‘LAT’: ‘latitude’, ‘SMP_DEPTH’: ‘sampdepth’, ‘TOT_DEPTH’: ‘totdepth’, ‘TIME’: ‘begperiod’, ‘AREA’: ‘area’, ‘NUCLIDE’: ‘nuclide_id’, ‘VALUE’: ‘activity’, ‘UNIT’: ‘unit_id’, ‘UNC’: ‘uncertaint’, ‘DL’: ‘detection’, ‘DLV’: ‘detection_lim’, ‘FILT’: ‘filtered’, ‘COUNT_MET’: ‘counmet_id’, ‘SAMP_MET’: ‘sampmet_id’, ‘PREP_MET’: ‘prepmet_id’, ‘VOL’: ‘volume’, ‘SAL’: ‘salinity’, ‘TEMP’: ‘temperatur’, ‘SPECIES’: ‘species_id’, ‘BODY_PART’: ‘bodypar_id’, ‘SED_TYPE’: ‘sedtype_id’, ‘TOP’: ‘sliceup’, ‘BOTTOM’: ‘slicedown’, ‘DRYWT’: ‘drywt’, ‘WETWT’: ‘wetwt’, ‘PERCENTWT’: ‘percentwt’, ‘LAB’: ‘lab_id’, ‘PROFILE_ID’: ‘profile_id’, ‘SAMPLE_TYPE’: ‘samptype_id’, ‘TAXONNAME’: ‘taxonname’, ‘TAXONREPNAME’: ‘taxonrepname’, ‘TAXONRANK’: ‘taxonrank’, ‘TAXONDB’: ‘taxondb’, ‘TAXONDBID’: ‘taxondb_id’, ‘TAXONDBURL’: ‘taxondb_url’, ‘REF_ID’: ‘ref_id’, ‘SMP_ID_PROVIDER’: ‘samplabcode’, ‘STATION’: ‘station’} Mapping of OR vars to NC vars
remap_dtypes Dict {‘AREA’: {‘state’: ‘decoded’}, ‘NUCLIDE’: {‘state’: ‘encoded’}, ‘UNIT’: {‘state’: ‘encoded’}, ‘DL’: {‘state’: ‘decoded’}, ‘FILT’: {‘state’: ‘decoded’}, ‘COUNT_MET’: {‘state’: ‘encoded’}, ‘SAMP_MET’: {‘state’: ‘encoded’}, ‘PREP_MET’: {‘state’: ‘encoded’}, ‘SPECIES’: {‘state’: ‘encoded’}, ‘BODY_PART’: {‘state’: ‘encoded’}, ‘SED_TYPE’: {‘state’: ‘encoded’}, ‘LAB’: {‘state’: ‘encoded’}} Mapping of OR vars to NC dtypes
verbose bool False If True, print verbose output
kwargs VAR_KEYWORD
Returns None Additional arguments
fname = Path('../../_data/output/100-HELCOM-MORS-2024.nc')
decode(fname_in=fname, dest_out=fname.with_suffix(''), output_format='openrefine_csv')