NetCDF2CSV

A data pipeline that converts MARIS NetCDF files into MARIS Standard OpenRefine CSV format.

This module converts NetCDF files into CSV files that follow the MARIS Standard OpenRefine format. While MARISCO has replaced OpenRefine in the data cleaning and preparation pipeline, the MARIS master database still requires input files to conform to this CSV format specification. The conversion is performed using the marisco library.

Tip

For new MARIS users, please refer to the field definitions for detailed information about MARIS fields.

Dependencies

Required packages and internal modules for data format transformations.

from pathlib import Path

import pandas as pd
from IPython.display import display, Markdown
# Callbacks and helpers used below (ExtractNetcdfContents, Transformer,
# Enums, lut_path, cfg, ...) are provided by the marisco library.

Configuration and File Paths

fname_in = Path('../../_data/output/100-HELCOM-MORS-2024.nc')
fname_out = fname_in.with_suffix('.csv')
output_format = 'openrefine_csv'

Data Loading

Load data from standardized MARIS NetCDF files using ExtractNetcdfContents. The NetCDF files follow CF conventions and include standardized variable names and metadata according to MARIS specifications.

contents = ExtractNetcdfContents(fname_in)

Show the dictionary of dataframes extracted from the NetCDF file.

contents.dfs.keys()
dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])

Show an example of the DataFrame extracted from the NetCDF file.

with pd.option_context('display.max_columns', None):
    display(contents.dfs['SEAWATER'].head())
LON LAT SMP_DEPTH TOT_DEPTH TIME SMP_ID NUCLIDE VALUE UNIT UNC DL FILT
0 29.333300 60.083302 0.0 NaN 1337731200 0 33 5.300000 1 1.696 1 0
1 29.333300 60.083302 29.0 NaN 1337731200 1 33 19.900000 1 3.980 1 0
2 23.150000 59.433300 0.0 NaN 1339891200 2 33 25.500000 1 5.100 1 0
3 27.983299 60.250000 0.0 NaN 1337817600 3 33 17.000000 1 4.930 1 0
4 27.983299 60.250000 39.0 NaN 1337817600 4 33 22.200001 1 3.996 1 0

Show an example of the dictionary of enums extracted from the NetCDF file as a DataFrame.

grp='SEAWATER'
print(f'Variables in {grp} group: {contents.enum_dicts[grp].keys()}')
var='nuclide'
with pd.option_context('display.max_columns', None):
    display(pd.DataFrame.from_dict(contents.enum_dicts[grp][var], orient='index').T)
Variables in SEAWATER group: dict_keys(['nuclide', 'unit', 'dl', 'filt'])
NOT APPLICABLE NOT AVAILABLE h3 be7 c14 k40 cr51 mn54 co57 co58 co60 zn65 sr89 sr90 zr95 nb95 tc99 ru103 ru106 rh106 ag106m ag108 ag108m ag110m sb124 sb125 te129m i129 i131 cs127 cs134 cs137 ba140 la140 ce141 ce144 pm147 eu154 eu155 pb210 pb212 pb214 bi207 bi211 bi214 po210 rn220 rn222 ra223 ra224 ra225 ra226 ra228 ac228 th227 th228 th232 th234 pa234 u234 u235 u238 np237 np239 pu238 pu239 pu240 pu241 am240 am241 cm242 cm243 cm244 cs134_137_tot pu239_240_tot pu239_240_iii_iv_tot pu239_240_v_vi_tot cm243_244_tot pu238_pu239_240_tot_ratio am241_pu239_240_tot_ratio cs137_134_ratio cd109 eu152 fe59 gd153 ir192 pu238_240_tot rb86 sc46 sn113 sn117m tl208 mo99 tc99m ru105 te129 te132 i132 i135 cs136 tbeta talpha i133 th230 pa231 u236 ag111 in116m te123m sb127 ba133 ce139 tl201 hg203 na22 pa234m am243 se75 sr85 y88 ce140 bi212 u236_238_ratio i125 ba137m u232 pa233 ru106_rh106_tot tu tbeta40k fe55 ce144_pr144_tot pu240_pu239_ratio u233 pu239_242_tot ac227
0 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 28 29 30 31 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 116 117 122 123 124 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144

Show the global attributes extracted from the NetCDF file.

print("First few attributes from global attributes:", list(contents.global_attrs.items())[:5])
First few attributes from global attributes: [('id', '26VMZZ2Q'), ('title', 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances'), ('summary', 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.'), ('keywords', 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'), ('history', 'TBD')]

Show the custom maps extracted from the NetCDF file.

grp='SEAWATER'
print(f'Custom maps in {grp} group: {contents.custom_maps[grp].keys()}')
with pd.option_context('display.max_columns', None):
    display(pd.DataFrame.from_dict(contents.custom_maps[grp], orient='index'))
Custom maps in SEAWATER group: dict_keys([])

Validate NetCDF Enumerations

Verify that enumerated values in the NetCDF file match current MARIS lookup tables.

FEEDBACK TO DATA PROVIDERS

The enumeration validation process is a diagnostic step that identifies inconsistencies between NetCDF enumerations and MARIS lookup tables. While this validation does not modify the dataset, it generates detailed feedback about any mismatches or undefined values.
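As a sketch of the idea (the `h3` and `cs137` codes match the HELCOM NUCLIDE enums shown later in this notebook; the undefined entry is hypothetical), the validation amounts to comparing each NetCDF enum dictionary against the corresponding MARIS lookup table:

```python
# Compare a NetCDF enum dictionary against the corresponding MARIS
# lookup table; any key missing from the LUT, or mapped to a
# different code, is flagged for feedback.
nc_enum = {'h3': 1, 'cs137': 33, 'unknown_nuclide': 999}  # from the NetCDF
maris_enum = {'h3': 1, 'cs137': 33}                       # from the MARIS LUT
mismatches = {k: v for k, v in nc_enum.items()
              if maris_enum.get(k) != v}
print(mismatches)  # {'unknown_nuclide': 999}
```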


source

ValidateEnumsCB

 ValidateEnumsCB (contents, maris_enums, verbose=False)

Validate enumeration mappings between NetCDF file and MARIS lookup tables.

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data= contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        ValidateEnumsCB(
            contents = contents,
            maris_enums=Enums(lut_src_dir=lut_path())
        ),
    ]
)
tfm()
print('\n')

Remove Non-Compatible Variables

The [RemoveNonCompatibleVariablesCB](https://franckalbinet.github.io/marisco/api/netcdf2csv.html#removenoncompatiblevariablescb) callback filters out variables from the NetCDF format that are not listed in the VARS configuration.
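A minimal sketch of the filtering logic (the abridged whitelist here is hypothetical; the real callback uses the full CSV_VARS mapping):

```python
import pandas as pd

# Keep only columns whose names appear among the VARS mapping keys
# (abridged for this sketch) and report what gets dropped.
csv_vars = {'LON': 'longitude', 'LAT': 'latitude', 'VALUE': 'activity'}
df = pd.DataFrame({'LON': [29.3], 'LAT': [60.1], 'VALUE': [5.3], 'BIO_GROUP': [4]})
dropped = [c for c in df.columns if c not in csv_vars]
df = df.drop(columns=dropped)
print(dropped, list(df.columns))  # ['BIO_GROUP'] ['LON', 'LAT', 'VALUE']
```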


source

RemoveNonCompatibleVariablesCB

 RemoveNonCompatibleVariablesCB (vars:Dict[str,str]={'LON': 'longitude',
                                 'LAT': 'latitude', 'SMP_DEPTH':
                                 'sampdepth', 'TOT_DEPTH': 'totdepth',
                                 'TIME': 'begperiod', 'AREA': 'area',
                                 'NUCLIDE': 'nuclide_id', 'VALUE':
                                 'activity', 'UNIT': 'unit_id', 'UNC':
                                 'uncertaint', 'DL': 'detection', 'FILT':
                                 'filtered', 'COUNT_MET': 'counmet_id',
                                 'SAMP_MET': 'sampmet_id', 'PREP_MET':
                                 'prepmet_id', 'VOL': 'volume', 'SAL':
                                 'salinity', 'TEMP': 'temperatur',
                                 'SPECIES': 'species_id', 'BODY_PART':
                                 'bodypar_id', 'SED_TYPE': 'sedtype_id',
                                 'TOP': 'sliceup', 'BOTTOM': 'slicedown',
                                 'DRYWT': 'drywt', 'WETWT': 'wetwt',
                                 'PERCENTWT': 'percentwt', 'LAB':
                                 'lab_id', 'PROFILE_ID': 'profile_id',
                                 'SAMPLE_TYPE': 'samptype_id',
                                 'TAXONNAME': 'taxonname', 'TAXONREPNAME':
                                 'taxonrepname', 'TAXONRANK': 'taxonrank',
                                 'TAXONDB': 'taxondb', 'TAXONDBID':
                                 'taxondb_id', 'TAXONDBURL':
                                 'taxondb_url', 'REF_ID': 'ref_id',
                                 'SMP_ID': 'samplabcode'},
                                 verbose:bool=False)

Remove variables not listed in VARS configuration.

Type Default Details
vars Dict {‘LON’: ‘longitude’, ‘LAT’: ‘latitude’, ‘SMP_DEPTH’: ‘sampdepth’, ‘TOT_DEPTH’: ‘totdepth’, ‘TIME’: ‘begperiod’, ‘AREA’: ‘area’, ‘NUCLIDE’: ‘nuclide_id’, ‘VALUE’: ‘activity’, ‘UNIT’: ‘unit_id’, ‘UNC’: ‘uncertaint’, ‘DL’: ‘detection’, ‘FILT’: ‘filtered’, ‘COUNT_MET’: ‘counmet_id’, ‘SAMP_MET’: ‘sampmet_id’, ‘PREP_MET’: ‘prepmet_id’, ‘VOL’: ‘volume’, ‘SAL’: ‘salinity’, ‘TEMP’: ‘temperatur’, ‘SPECIES’: ‘species_id’, ‘BODY_PART’: ‘bodypar_id’, ‘SED_TYPE’: ‘sedtype_id’, ‘TOP’: ‘sliceup’, ‘BOTTOM’: ‘slicedown’, ‘DRYWT’: ‘drywt’, ‘WETWT’: ‘wetwt’, ‘PERCENTWT’: ‘percentwt’, ‘LAB’: ‘lab_id’, ‘PROFILE_ID’: ‘profile_id’, ‘SAMPLE_TYPE’: ‘samptype_id’, ‘TAXONNAME’: ‘taxonname’, ‘TAXONREPNAME’: ‘taxonrepname’, ‘TAXONRANK’: ‘taxonrank’, ‘TAXONDB’: ‘taxondb’, ‘TAXONDBID’: ‘taxondb_id’, ‘TAXONDBURL’: ‘taxondb_url’, ‘REF_ID’: ‘ref_id’, ‘SMP_ID’: ‘samplabcode’} Dictionary mapping OR vars to NC vars
verbose bool False
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS, verbose=True),
    ]
)
tfm()
print('\n')
Removing variables that are not compatible with vars provided. 
Removing BIO_GROUP from BIOTA dataset.

Add Taxon Information


source

get_taxon_info_lut

 get_taxon_info_lut (maris_lut:str, key_names:dict={'Taxonname':
                     'TAXONNAME', 'Taxonrank': 'TAXONRANK', 'TaxonDB':
                     'TAXONDB', 'TaxonDBID': 'TAXONDBID', 'TaxonDBURL':
                     'TAXONDBURL'})

Create lookup dictionary for taxon information from MARIS species lookup table.


source

AddTaxonInformationCB

 AddTaxonInformationCB (fn_lut:Callable=<function <lambda>>,
                        verbose:bool=False)

Add taxon information to BIOTA group based on species lookup table.

Type Default Details
fn_lut Callable Function that returns taxon lookup dictionary
verbose bool False
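The lookup itself can be sketched as a dict-of-dicts keyed by species id. The id 99 → Gadus morhua pairing matches the HELCOM data in this notebook; the rest is a simplified assumption about the real LUT structure:

```python
import pandas as pd

# Hypothetical excerpt of the taxon lookup: species id -> taxon fields.
lut = {99: {'TAXONNAME': 'Gadus morhua', 'TAXONRANK': 'species'}}
df = pd.DataFrame({'SPECIES': [99, 99]})
for col in ('TAXONNAME', 'TAXONRANK'):
    df[col] = df['SPECIES'].map(lambda sid: lut[sid][col])
print(df['TAXONNAME'].unique())  # ['Gadus morhua']
```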
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddTaxonInformationCB(
            fn_lut=lut_taxon
        ),
    ]
)

tfm()
print(tfm.dfs['BIOTA'][['TAXONNAME','TAXONRANK','TAXONDB','TAXONDBID','TAXONDBURL']])
               TAXONNAME TAXONRANK   TAXONDB TAXONDBID  \
0           Gadus morhua   species  Wikidata   Q199788   
1           Gadus morhua   species  Wikidata   Q199788   
2           Gadus morhua   species  Wikidata   Q199788   
3           Gadus morhua   species  Wikidata   Q199788   
4           Gadus morhua   species  Wikidata   Q199788   
...                  ...       ...       ...       ...   
16089  Fucus vesiculosus   species  Wikidata   Q754755   
16090  Fucus vesiculosus   species  Wikidata   Q754755   
16091     Mytilus edulis   species  Wikidata    Q27855   
16092     Mytilus edulis   species  Wikidata    Q27855   
16093     Mytilus edulis   species  Wikidata    Q27855   

                                  TAXONDBURL  
0      https://www.wikidata.org/wiki/Q199788  
1      https://www.wikidata.org/wiki/Q199788  
2      https://www.wikidata.org/wiki/Q199788  
3      https://www.wikidata.org/wiki/Q199788  
4      https://www.wikidata.org/wiki/Q199788  
...                                      ...  
16089  https://www.wikidata.org/wiki/Q754755  
16090  https://www.wikidata.org/wiki/Q754755  
16091   https://www.wikidata.org/wiki/Q27855  
16092   https://www.wikidata.org/wiki/Q27855  
16093   https://www.wikidata.org/wiki/Q27855  

[16094 rows x 5 columns]

Standardize Time

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        DecodeTimeCB(),
    ]
)

tfm()

print(tfm.dfs['BIOTA']['TIME'])
0       2012-09-23
1       2012-09-23
2       2012-09-23
3       2012-09-23
4       2012-09-23
           ...    
16089   2022-05-10
16090   2022-05-10
16091   2022-09-15
16092   2022-09-15
16093   2022-09-15
Name: TIME, Length: 16094, dtype: datetime64[ns]
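Under the hood, TIME is stored as Unix epoch seconds; a minimal sketch of the conversion (not DecodeTimeCB's actual implementation):

```python
import pandas as pd

# Convert Unix epoch seconds (as stored in the NetCDF TIME variable)
# to datetimes; values taken from the SEAWATER example earlier.
time_s = pd.Series([1337731200, 1339891200], name='TIME')
decoded = pd.to_datetime(time_s, unit='s')
print(decoded.dt.date.tolist())  # [datetime.date(2012, 5, 23), datetime.date(2012, 6, 17)]
```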

Add Sample Type ID

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddSampleTypeIdColumnCB(),
    ]
)

tfm()
print(tfm.dfs['SEAWATER']['SAMPLE_TYPE'].unique())
print(tfm.dfs['BIOTA']['SAMPLE_TYPE'].unique())
print(tfm.dfs['SEDIMENT']['SAMPLE_TYPE'].unique())
[1]
[2]
[3]
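Conceptually, the callback tags each group with a constant sample-type code (1 = seawater, 2 = biota, 3 = sediment, per the output above); a minimal sketch, not the actual implementation:

```python
import pandas as pd

# Constant sample-type code per group, matching the output above.
sample_types = {'SEAWATER': 1, 'BIOTA': 2, 'SEDIMENT': 3}
dfs = {grp: pd.DataFrame({'VALUE': [5.3, 19.9]}) for grp in sample_types}
for grp, df in dfs.items():
    df['SAMPLE_TYPE'] = sample_types[grp]
print(dfs['BIOTA']['SAMPLE_TYPE'].unique())  # [2]
```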

Add Reference ID

Include the ref_id (i.e., the Zotero archive location). The AddZoteroArchiveLocationCB callback looks up the Zotero archive location using the Zotero key stored as id in the global attributes of the MARIS NetCDF file.

contents.global_attrs['id']
'26VMZZ2Q'

source

AddZoteroArchiveLocationCB

 AddZoteroArchiveLocationCB (attrs:str, cfg:dict)

Fetch and append ‘Loc. in Archive’ from Zotero to DataFrame.

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddZoteroArchiveLocationCB(contents.global_attrs, cfg=cfg()),
    ]
)
tfm()
print(tfm.dfs['SEAWATER']['REF_ID'].unique())
[100]
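The payload shape below is a simplified assumption about the Zotero item returned for key 26VMZZ2Q; the real callback fetches the item via the Zotero API and reads its 'Loc. in Archive' field:

```python
# Hypothetical Zotero item payload; the real callback fetches this via
# the Zotero API using the NetCDF global attribute `id`.
item = {'key': '26VMZZ2Q', 'data': {'archiveLocation': '100'}}
ref_id = int(item['data']['archiveLocation'])
print(ref_id)  # 100
```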

Remap encoded custom maps

NetCDF variables can store custom mappings as attributes, which provide a way to map between encoded values and their human-readable representations. The RemapCustomMapsCB callback handles this conversion process.
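As a sketch of the idea (the lab codes below are hypothetical placeholders, not real MARIS values), remapping amounts to applying the stored map to the encoded column:

```python
import pandas as pd

# Encoded SMP_ID column plus a custom map as it might be stored in a
# NetCDF variable attribute; decoding is a plain pandas map.
smp_id = pd.Series([0, 1, 2], name='SMP_ID')
custom_map = {0: 'LAB-001', 1: 'LAB-002', 2: 'LAB-003'}  # hypothetical codes
print(smp_id.map(custom_map).tolist())  # ['LAB-001', 'LAB-002', 'LAB-003']
```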


source

RemapCustomMapsCB

 RemapCustomMapsCB (verbose:bool=False)

Remap encoded custom maps to decoded values.

contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemapCustomMapsCB(verbose=True),
    ]
)
tfm()
print('Example of remapped custom maps:')
for grp in tfm.dfs:
    print(grp)
    print(tfm.dfs[grp]['SMP_ID'].head())
Example of remapped custom maps:
BIOTA
0    0
1    0
2    0
3    0
4    1
Name: SMP_ID, dtype: uint64
SEAWATER
0    0
1    1
2    2
3    3
4    4
Name: SMP_ID, dtype: uint64
SEDIMENT
0    0
1    1
2    2
3    3
4    4
Name: SMP_ID, dtype: uint64

Remap to OpenRefine-specific mappings

FEEDBACK FOR NEXT VERSION

[To be further clarified]

The current approach of remapping to OR-specific mappings should be reconsidered. Since we already use MARISCO lookup tables to create the enums in NetCDF, it would be beneficial to extend their use to the OpenRefine data format as well. By doing so, we could eliminate the need for OpenRefine-specific mappings, streamlining the data transformation process. Let's review the lookup tables used to create the enums for NetCDF:

enums = Enums(lut_src_dir=lut_path())
print(f'DL enums: {enums.types["DL"]}')
print(f'FILT enums: {enums.types["FILT"]}')
DL enums: {'Not applicable': -1, 'Not available': 0, 'Detected value': 1, 'Detection limit': 2, 'Not detected': 3, 'Derived': 4}
FILT enums: {'Not applicable': -1, 'Not available': 0, 'Yes': 1, 'No': 2}

For the detection limit lookup table (LUT), as shown below, the values required for the OpenRefine CSV format are listed under the ‘name’ column, whereas the enums utilize the ‘name_sanitized’ column. Additionally, for the filtered LUT, also shown below, the values do not align consistently with the OpenRefine CSV format, which uses (Y, N, NA).

dl_lut = pd.read_excel(detection_limit_lut_path())
dl_lut
id name name_sanitized
0 -1 Not applicable Not applicable
1 0 Not Available Not available
2 1 = Detected value
3 2 < Detection limit
4 3 ND Not detected
5 4 DE Derived
filtered_lut = pd.read_excel(filtered_lut_path())
filtered_lut
id name
0 -1 Not applicable
1 0 Not available
2 1 Yes
3 2 No
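If the LUTs were used directly, the id-to-OpenRefine-value mapping could be derived from the 'name' column rather than hard-coded; a sketch using the detection-limit LUT shown above:

```python
import pandas as pd

# Reconstruct the detection-limit LUT shown above and build an
# id -> name mapping from it.
dl_lut = pd.DataFrame({
    'id':   [-1, 0, 1, 2, 3, 4],
    'name': ['Not applicable', 'Not Available', '=', '<', 'ND', 'DE'],
})
dl_map = dict(zip(dl_lut['id'], dl_lut['name']))
print(dl_map[1], dl_map[2], dl_map[3])  # = < ND
```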

We will create OpenRefine-specific mappings for the detection limit and filtered data:

RemapToORSpecificMappingsCB remaps the values of the detection limit and filtered data to the OpenRefine CSV format.


source

RemapToORSpecificMappingsCB

 RemapToORSpecificMappingsCB (or_mappings:Dict[str,Dict]={'DL': {0: 'ND',
                              1: '=', 2: '<'}, 'FILT': {0: 'NA', 1: 'Y',
                              2: 'N'}},
                              output_format:str='openrefine_csv',
                              verbose:bool=False)

Convert values using OR mappings if columns exist in dataframe.

Type Default Details
or_mappings Dict {‘DL’: {0: ‘ND’, 1: ‘=’, 2: ‘<’}, ‘FILT’: {0: ‘NA’, 1: ‘Y’, 2: ‘N’}} Dictionary of column mappings,
output_format str openrefine_csv
verbose bool False
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data= contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemapToORSpecificMappingsCB(),
    ]
)

tfm()

# Default OR mappings, matching RemapToORSpecificMappingsCB's signature
or_mappings = {'DL': {0: 'ND', 1: '=', 2: '<'}, 'FILT': {0: 'NA', 1: 'Y', 2: 'N'}}

# Loop through each group in the 'dfs' dictionary
for group_name, df in tfm.dfs.items():
    # Check if the group dataframe contains any of the columns specified in or_mappings.keys()
    relevant_columns = [col for col in or_mappings.keys() if col in df.columns]
    if relevant_columns:
        # Print the unique values from the relevant columns
        print(f"\nUnique values in {group_name} for columns {relevant_columns}:")
        for col in relevant_columns:
            print(f"{col}: {df[col].unique()}")
    else:
        print(f"No relevant columns found in {group_name} based on or_mappings keys.")

Unique values in BIOTA for columns ['DL']:
DL: ['<' '=' 'ND']

Unique values in SEAWATER for columns ['DL', 'FILT']:
DL: ['=' '<' 'ND']
FILT: ['NA' 'N' 'Y']

Unique values in SEDIMENT for columns ['DL']:
DL: ['=' '<' 'ND']

Remap to CSV data type format

CSV_DTYPES (defined in configs.ipynb) defines a state for each variable that has an associated lookup table (i.e., enums). The state is either ‘decoded’ or ‘encoded’. Let's review the variable states as a DataFrame:

with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(pd.DataFrame.from_dict(CSV_DTYPES, orient='index').T)
AREA NUCLIDE UNIT DL FILT COUNT_MET SAMP_MET PREP_MET SPECIES BODY_PART SED_TYPE LAB
state decoded encoded encoded decoded decoded encoded encoded encoded encoded encoded encoded encoded
enums = Enums(lut_src_dir=lut_path())
enums.types.keys()
dict_keys(['AREA', 'BIO_GROUP', 'BODY_PART', 'COUNT_MET', 'DL', 'FILT', 'NUCLIDE', 'PREP_MET', 'SAMP_MET', 'SED_TYPE', 'SPECIES', 'UNIT', 'LAB'])
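In this context, ‘encoded’ keeps the integer code while ‘decoded’ replaces it with the lookup-table name; a sketch using the DL enums shown earlier:

```python
# Invert the DL enum (name -> code) so integer codes can be decoded
# back to names.
dl_enum = {'Not applicable': -1, 'Not available': 0, 'Detected value': 1,
           'Detection limit': 2, 'Not detected': 3, 'Derived': 4}
code_to_name = {v: k for k, v in dl_enum.items()}
print([code_to_name[c] for c in [1, 2, 3]])
# ['Detected value', 'Detection limit', 'Not detected']
```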

source

get_excluded_enums

 get_excluded_enums (output_format:str='openrefine_csv')

Get excluded enums based on output format.


source

DataFormatConversionCB

 DataFormatConversionCB (dtypes:Dict, excluded_mappings:Callable=<function
                         get_excluded_enums>,
                         output_format:str='openrefine_csv',
                         verbose:bool=False)

A callback to convert DataFrame enum values between encoded and decoded formats based on specified settings.

Type Default Details
dtypes Dict Dictionary defining data types and states for each lookup table
excluded_mappings Callable get_excluded_enums Dictionary of columns to exclude from conversion
output_format str openrefine_csv
verbose bool False Flag for verbose output
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    contents.dfs,
    cbs=[
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS, verbose=True),
        DataFormatConversionCB(
            dtypes=CSV_DTYPES,
            excluded_mappings = get_excluded_enums,
            output_format='openrefine_csv',
            verbose=True
        ),
    ]
)
tfm()
Removing variables that are not compatible with vars provided. 
Removing BIO_GROUP from BIOTA dataset.
Loaded enums: dict_keys(['AREA', 'BIO_GROUP', 'BODY_PART', 'COUNT_MET', 'DL', 'FILT', 'NUCLIDE', 'PREP_MET', 'SAMP_MET', 'SED_TYPE', 'SPECIES', 'UNIT', 'LAB'])
{'BIOTA':              LON        LAT  SMP_DEPTH        TIME  SMP_ID  NUCLIDE  \
 0      12.316667  54.283333        NaN  1348358400       0       31   
 1      12.316667  54.283333        NaN  1348358400       0        4   
 2      12.316667  54.283333        NaN  1348358400       0        9   
 3      12.316667  54.283333        NaN  1348358400       0       33   
 4      12.316667  54.283333        NaN  1348358400       1       31   
 ...          ...        ...        ...         ...     ...      ...   
 16089  21.395000  61.241501        2.0  1652140800    4789       33   
 16090  21.395000  61.241501        2.0  1652140800    4789        9   
 16091  21.385000  61.343334        NaN  1663200000    4790        4   
 16092  21.385000  61.343334        NaN  1663200000    4790       33   
 16093  21.385000  61.343334        NaN  1663200000    4790       12   
 
             VALUE  UNIT       UNC  DL  SPECIES  BODY_PART       DRYWT  WETWT  \
 0        0.010140     5       NaN   2       99         52  174.934433  948.0   
 1      135.300003     5  4.830210   1       99         52  174.934433  948.0   
 2        0.013980     5       NaN   2       99         52  174.934433  948.0   
 3        4.338000     5  0.150962   1       99         52  174.934433  948.0   
 4        0.009614     5       NaN   2       99         52  177.935120  964.0   
 ...           ...   ...       ...  ..      ...        ...         ...    ...   
 16089   13.700000     4  0.520600   1       96         55         NaN    NaN   
 16090    0.500000     4  0.045500   1       96         55         NaN    NaN   
 16091   50.700001     4  4.106700   1      129          1         NaN    NaN   
 16092    0.880000     4  0.140800   1      129          1         NaN    NaN   
 16093    6.600000     4  0.349800   1      129          1         NaN    NaN   
 
        PERCENTWT  
 0        0.18453  
 1        0.18453  
 2        0.18453  
 3        0.18453  
 4        0.18458  
 ...          ...  
 16089        NaN  
 16090        NaN  
 16091        NaN  
 16092        NaN  
 16093        NaN  
 
 [16094 rows x 15 columns],
 'SEAWATER':              LON        LAT  SMP_DEPTH  TOT_DEPTH        TIME  SMP_ID  \
 0      29.333300  60.083302        0.0        NaN  1337731200       0   
 1      29.333300  60.083302       29.0        NaN  1337731200       1   
 2      23.150000  59.433300        0.0        NaN  1339891200       2   
 3      27.983299  60.250000        0.0        NaN  1337817600       3   
 4      27.983299  60.250000       39.0        NaN  1337817600       4   
 ...          ...        ...        ...        ...         ...     ...   
 21468  13.499833  54.600334        0.0       47.0  1686441600    9724   
 21469  13.499833  54.600334       45.0       47.0  1686441600    9725   
 21470  14.200833  54.600334        0.0       11.0  1686614400    9731   
 21471  14.665500  54.600334        0.0       20.0  1686614400    9732   
 21472  14.330000  54.600334        0.0       17.0  1686614400    9734   
 
        NUCLIDE       VALUE  UNIT        UNC  DL  FILT  
 0           33    5.300000     1   1.696000   1     0  
 1           33   19.900000     1   3.980000   1     0  
 2           33   25.500000     1   5.100000   1     0  
 3           33   17.000000     1   4.930000   1     0  
 4           33   22.200001     1   3.996000   1     0  
 ...        ...         ...   ...        ...  ..   ...  
 21468        1  702.838074     1  51.276207   1     0  
 21469        1  725.855713     1  52.686260   1     0  
 21470        1  648.992920     1  48.154419   1     0  
 21471        1  627.178406     1  46.245316   1     0  
 21472        1  605.715088     1  45.691143   1     0  
 
 [21473 rows x 12 columns],
 'SEDIMENT':              LON        LAT  TOT_DEPTH        TIME  SMP_ID  NUCLIDE  \
 0      27.799999  60.466667       25.0  1337904000       0       33   
 1      27.799999  60.466667       25.0  1337904000       1       33   
 2      27.799999  60.466667       25.0  1337904000       2       33   
 3      27.799999  60.466667       25.0  1337904000       3       33   
 4      27.799999  60.466667       25.0  1337904000       4       33   
 ...          ...        ...        ...         ...     ...      ...   
 70444  15.537800  54.617832       62.0  1654646400   14121       67   
 70445  15.537800  54.617832       62.0  1654646400   14121       77   
 70446  15.537800  54.617832       62.0  1654646400   14122        4   
 70447  15.537800  54.617832       62.0  1654646400   14122       33   
 70448  15.537800  54.617832       62.0  1654646400   14122       77   
 
              VALUE  UNIT         UNC  DL  SED_TYPE   TOP  BOTTOM  PERCENTWT  
 0      1200.000000     3  240.000000   1         0  15.0    20.0        NaN  
 1       250.000000     3   50.000000   1         0  20.0    25.0        NaN  
 2       140.000000     3   29.400000   1         0  25.0    30.0        NaN  
 3        79.000000     3   15.800000   1         0  30.0    35.0        NaN  
 4        29.000000     3    6.960000   1         0  35.0    40.0        NaN  
 ...            ...   ...         ...  ..       ...   ...     ...        ...  
 70444     0.044000     2    0.015312   1        10  15.0    17.0   0.257642  
 70445     2.500000     2    0.185000   1        10  15.0    17.0   0.257642  
 70446  5873.000000     2  164.444000   1        10  17.0    19.0   0.263965  
 70447    21.200001     2    2.162400   1        10  17.0    19.0   0.263965  
 70448     0.370000     2    0.048100   1        10  17.0    19.0   0.263965  
 
 [70449 rows x 14 columns]}

Review all callbacks

contents = ExtractNetcdfContents(fname_in)
output_format = 'openrefine_csv'
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        ValidateEnumsCB(
            contents = contents,
            maris_enums=Enums(lut_src_dir=lut_path())
        ),
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS) ,
        RemapCustomMapsCB(),
        RemapToORSpecificMappingsCB(output_format=output_format),
        AddTaxonInformationCB(
            fn_lut=lut_taxon
        ),
        DecodeTimeCB(),
        AddSampleTypeIdColumnCB(),
        AddZoteroArchiveLocationCB(contents.global_attrs, cfg=cfg()),
        DataFormatConversionCB(
            dtypes=CSV_DTYPES,
            excluded_mappings = get_excluded_enums,
            output_format=output_format,
        ) 
        ]
)
tfm()
for grp in ['SEAWATER', 'BIOTA']:
    display(Markdown(f"<b>Head of the transformed `{grp}` DataFrame:</b>"))
    with pd.option_context('display.max_rows', None):
        display(tfm.dfs[grp].head())

Head of the transformed SEAWATER DataFrame:

LON LAT SMP_DEPTH TOT_DEPTH TIME SMP_ID NUCLIDE VALUE UNIT UNC DL FILT SAMPLE_TYPE REF_ID
0 29.333300 60.083302 0.0 NaN 2012-05-23 0 33 5.300000 1 1.696 = NA 1 100
1 29.333300 60.083302 29.0 NaN 2012-05-23 1 33 19.900000 1 3.980 = NA 1 100
2 23.150000 59.433300 0.0 NaN 2012-06-17 2 33 25.500000 1 5.100 = NA 1 100
3 27.983299 60.250000 0.0 NaN 2012-05-24 3 33 17.000000 1 4.930 = NA 1 100
4 27.983299 60.250000 39.0 NaN 2012-05-24 4 33 22.200001 1 3.996 = NA 1 100

Head of the transformed BIOTA DataFrame:

LON LAT SMP_DEPTH TIME SMP_ID NUCLIDE VALUE UNIT UNC DL ... DRYWT WETWT PERCENTWT TAXONNAME TAXONRANK TAXONDB TAXONDBID TAXONDBURL SAMPLE_TYPE REF_ID
0 12.316667 54.283333 NaN 2012-09-23 0 31 0.010140 5 NaN < ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
1 12.316667 54.283333 NaN 2012-09-23 0 4 135.300003 5 4.830210 = ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
2 12.316667 54.283333 NaN 2012-09-23 0 9 0.013980 5 NaN < ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
3 12.316667 54.283333 NaN 2012-09-23 0 33 4.338000 5 0.150962 = ... 174.934433 948.0 0.18453 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100
4 12.316667 54.283333 NaN 2012-09-23 1 31 0.009614 5 NaN < ... 177.935120 964.0 0.18458 Gadus morhua species Wikidata Q199788 https://www.wikidata.org/wiki/Q199788 2 100

5 rows × 22 columns

Decode


source

decode

 decode (fname_in:str, dest_out:str|None=None,
         output_format:str='openrefine_csv',
         remap_vars:Dict[str,str]={'LON': 'longitude', 'LAT': 'latitude',
         'SMP_DEPTH': 'sampdepth', 'TOT_DEPTH': 'totdepth', 'TIME':
         'begperiod', 'AREA': 'area', 'NUCLIDE': 'nuclide_id', 'VALUE':
         'activity', 'UNIT': 'unit_id', 'UNC': 'uncertaint', 'DL':
         'detection', 'FILT': 'filtered', 'COUNT_MET': 'counmet_id',
         'SAMP_MET': 'sampmet_id', 'PREP_MET': 'prepmet_id', 'VOL':
         'volume', 'SAL': 'salinity', 'TEMP': 'temperatur', 'SPECIES':
         'species_id', 'BODY_PART': 'bodypar_id', 'SED_TYPE':
         'sedtype_id', 'TOP': 'sliceup', 'BOTTOM': 'slicedown', 'DRYWT':
         'drywt', 'WETWT': 'wetwt', 'PERCENTWT': 'percentwt', 'LAB':
         'lab_id', 'PROFILE_ID': 'profile_id', 'SAMPLE_TYPE':
         'samptype_id', 'TAXONNAME': 'taxonname', 'TAXONREPNAME':
         'taxonrepname', 'TAXONRANK': 'taxonrank', 'TAXONDB': 'taxondb',
         'TAXONDBID': 'taxondb_id', 'TAXONDBURL': 'taxondb_url', 'REF_ID':
         'ref_id', 'SMP_ID': 'samplabcode'},
         remap_dtypes:Dict[str,str]={'AREA': {'state': 'decoded'},
         'NUCLIDE': {'state': 'encoded'}, 'UNIT': {'state': 'encoded'},
         'DL': {'state': 'decoded'}, 'FILT': {'state': 'decoded'},
         'COUNT_MET': {'state': 'encoded'}, 'SAMP_MET': {'state':
         'encoded'}, 'PREP_MET': {'state': 'encoded'}, 'SPECIES':
         {'state': 'encoded'}, 'BODY_PART': {'state': 'encoded'},
         'SED_TYPE': {'state': 'encoded'}, 'LAB': {'state': 'encoded'}},
         verbose:bool=False, **kwargs)

Decode data from NetCDF.

Type Default Details
fname_in str Input file name
dest_out str | None None Output file name (optional)
output_format str openrefine_csv
remap_vars Dict {‘LON’: ‘longitude’, ‘LAT’: ‘latitude’, ‘SMP_DEPTH’: ‘sampdepth’, ‘TOT_DEPTH’: ‘totdepth’, ‘TIME’: ‘begperiod’, ‘AREA’: ‘area’, ‘NUCLIDE’: ‘nuclide_id’, ‘VALUE’: ‘activity’, ‘UNIT’: ‘unit_id’, ‘UNC’: ‘uncertaint’, ‘DL’: ‘detection’, ‘FILT’: ‘filtered’, ‘COUNT_MET’: ‘counmet_id’, ‘SAMP_MET’: ‘sampmet_id’, ‘PREP_MET’: ‘prepmet_id’, ‘VOL’: ‘volume’, ‘SAL’: ‘salinity’, ‘TEMP’: ‘temperatur’, ‘SPECIES’: ‘species_id’, ‘BODY_PART’: ‘bodypar_id’, ‘SED_TYPE’: ‘sedtype_id’, ‘TOP’: ‘sliceup’, ‘BOTTOM’: ‘slicedown’, ‘DRYWT’: ‘drywt’, ‘WETWT’: ‘wetwt’, ‘PERCENTWT’: ‘percentwt’, ‘LAB’: ‘lab_id’, ‘PROFILE_ID’: ‘profile_id’, ‘SAMPLE_TYPE’: ‘samptype_id’, ‘TAXONNAME’: ‘taxonname’, ‘TAXONREPNAME’: ‘taxonrepname’, ‘TAXONRANK’: ‘taxonrank’, ‘TAXONDB’: ‘taxondb’, ‘TAXONDBID’: ‘taxondb_id’, ‘TAXONDBURL’: ‘taxondb_url’, ‘REF_ID’: ‘ref_id’, ‘SMP_ID’: ‘samplabcode’}
remap_dtypes Dict {‘AREA’: {‘state’: ‘decoded’}, ‘NUCLIDE’: {‘state’: ‘encoded’}, ‘UNIT’: {‘state’: ‘encoded’}, ‘DL’: {‘state’: ‘decoded’}, ‘FILT’: {‘state’: ‘decoded’}, ‘COUNT_MET’: {‘state’: ‘encoded’}, ‘SAMP_MET’: {‘state’: ‘encoded’}, ‘PREP_MET’: {‘state’: ‘encoded’}, ‘SPECIES’: {‘state’: ‘encoded’}, ‘BODY_PART’: {‘state’: ‘encoded’}, ‘SED_TYPE’: {‘state’: ‘encoded’}, ‘LAB’: {‘state’: ‘encoded’}}
verbose bool False
kwargs VAR_KEYWORD
Returns None Additional arguments
fname = Path('../../_data/output/100-HELCOM-MORS-2024.nc')
decode(fname_in=fname, dest_out=fname.with_suffix(''), output_format='openrefine_csv')