from IPython.display import display, Markdown
NetCDF2CSV
A data pipeline that converts MARIS NetCDF files into MARIS Standard OpenRefine CSV format.
This module converts NetCDF files into CSV files that follow the MARIS Standard OpenRefine format. While MARISCO has replaced OpenRefine in the data cleaning and preparation pipeline, the MARIS master database still requires input files to conform to this CSV format specification. The conversion is performed using the marisco library.
For new MARIS users, please refer to the field definitions for detailed information about MARIS fields.
Dependencies
Required packages and internal modules for data format transformations
Configuration and File Paths
fname_in = Path('../../_data/output/100-HELCOM-MORS-2024.nc')
fname_out = fname_in.with_suffix('.csv')
output_format = 'openrefine_csv'
Data Loading
Load data from standardized MARIS NetCDF files using ExtractNetcdfContents. The NetCDF files follow CF conventions and include standardized variable names and metadata according to MARIS specifications.
contents = ExtractNetcdfContents(fname_in)
Show the dictionary of dataframes extracted from the NetCDF file.
contents.dfs.keys()
dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])
Show an example of the DataFrame extracted from the NetCDF file.
with pd.option_context('display.max_columns', None):
    display(contents.dfs['SEAWATER'].head())
LON | LAT | SMP_DEPTH | TOT_DEPTH | TIME | SMP_ID | NUCLIDE | VALUE | UNIT | UNC | DL | FILT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29.333300 | 60.083302 | 0.0 | NaN | 1337731200 | 0 | 33 | 5.300000 | 1 | 1.696 | 1 | 0 |
1 | 29.333300 | 60.083302 | 29.0 | NaN | 1337731200 | 1 | 33 | 19.900000 | 1 | 3.980 | 1 | 0 |
2 | 23.150000 | 59.433300 | 0.0 | NaN | 1339891200 | 2 | 33 | 25.500000 | 1 | 5.100 | 1 | 0 |
3 | 27.983299 | 60.250000 | 0.0 | NaN | 1337817600 | 3 | 33 | 17.000000 | 1 | 4.930 | 1 | 0 |
4 | 27.983299 | 60.250000 | 39.0 | NaN | 1337817600 | 4 | 33 | 22.200001 | 1 | 3.996 | 1 | 0 |
Show an example of the dictionary of enums extracted from the NetCDF file as a DataFrame.
grp = 'SEAWATER'
print(f'Variables in {grp} group: {contents.enum_dicts[grp].keys()}')
var = 'nuclide'
with pd.option_context('display.max_columns', None):
    display(pd.DataFrame.from_dict(contents.enum_dicts[grp][var], orient='index').T)
Variables in SEAWATER group: dict_keys(['nuclide', 'unit', 'dl', 'filt'])
NOT APPLICABLE | NOT AVAILABLE | h3 | be7 | c14 | k40 | cr51 | mn54 | co57 | co58 | co60 | zn65 | sr89 | sr90 | zr95 | nb95 | tc99 | ru103 | ru106 | rh106 | ag106m | ag108 | ag108m | ag110m | sb124 | sb125 | te129m | i129 | i131 | cs127 | cs134 | cs137 | ba140 | la140 | ce141 | ce144 | pm147 | eu154 | eu155 | pb210 | pb212 | pb214 | bi207 | bi211 | bi214 | po210 | rn220 | rn222 | ra223 | ra224 | ra225 | ra226 | ra228 | ac228 | th227 | th228 | th232 | th234 | pa234 | u234 | u235 | u238 | np237 | np239 | pu238 | pu239 | pu240 | pu241 | am240 | am241 | cm242 | cm243 | cm244 | cs134_137_tot | pu239_240_tot | pu239_240_iii_iv_tot | pu239_240_v_vi_tot | cm243_244_tot | pu238_pu239_240_tot_ratio | am241_pu239_240_tot_ratio | cs137_134_ratio | cd109 | eu152 | fe59 | gd153 | ir192 | pu238_240_tot | rb86 | sc46 | sn113 | sn117m | tl208 | mo99 | tc99m | ru105 | te129 | te132 | i132 | i135 | cs136 | tbeta | talpha | i133 | th230 | pa231 | u236 | ag111 | in116m | te123m | sb127 | ba133 | ce139 | tl201 | hg203 | na22 | pa234m | am243 | se75 | sr85 | y88 | ce140 | bi212 | u236_238_ratio | i125 | ba137m | u232 | pa233 | ru106_rh106_tot | tu | tbeta40k | fe55 | ce144_pr144_tot | pu240_pu239_ratio | u233 | pu239_242_tot | ac227 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 28 | 29 | 30 | 31 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 116 | 117 | 122 | 123 | 124 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 |
Show the global attributes extracted from the NetCDF file.
print("First few attributes from global attributes:", list(contents.global_attrs.items())[:5])
First few attributes from global attributes: [('id', '26VMZZ2Q'), ('title', 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances'), ('summary', 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annually by HELCOM MORS EG.'), ('keywords', 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'), ('history', 'TBD')]
Show the custom maps extracted from the NetCDF file.
grp = 'SEAWATER'
print(f'Custom maps in {grp} group: {contents.custom_maps[grp].keys()}')
with pd.option_context('display.max_columns', None):
    display(pd.DataFrame.from_dict(contents.custom_maps[grp], orient='index'))
Custom maps in SEAWATER group: dict_keys([])
Validate NetCDF Enumerations
Verify that enumerated values in the NetCDF file match current MARIS lookup tables.
The enumeration validation process is a diagnostic step that identifies inconsistencies between NetCDF enumerations and MARIS lookup tables. While this validation does not modify the dataset, it generates detailed feedback about any mismatches or undefined values.
ValidateEnumsCB
ValidateEnumsCB (contents, maris_enums, verbose=False)
Validate enumeration mappings between NetCDF file and MARIS lookup tables.
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        ValidateEnumsCB(
            contents=contents,
            maris_enums=Enums(lut_src_dir=lut_path())
        ),
    ]
)
tfm()
print('\n')
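Conceptually, the validation compares each NetCDF enum entry against the corresponding MARIS lookup table and reports anything missing or mismatched, without touching the data. A minimal pure-Python sketch (the dictionaries below are illustrative stand-ins, not the actual MARIS tables):

```python
# Illustrative enum dictionaries; real ones come from the NetCDF file and MARIS LUTs.
nc_enum = {'h3': 1, 'cs137': 33, 'xx999': 999}   # as stored in the NetCDF file
maris_lut = {'h3': 1, 'cs137': 33, 'sr90': 12}   # current MARIS lookup table

def validate_enum(nc, lut):
    "Return diagnostic messages without modifying the data."
    issues = []
    for name, code in nc.items():
        if name not in lut:
            issues.append(f"'{name}' not defined in MARIS lookup table")
        elif lut[name] != code:
            issues.append(f"'{name}': NetCDF id {code} != MARIS id {lut[name]}")
    return issues

print(validate_enum(nc_enum, maris_lut))  # ["'xx999' not defined in MARIS lookup table"]
```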
Remove Non Compatible Columns
The [RemoveNonCompatibleVariablesCB](https://franckalbinet.github.io/marisco/api/netcdf2csv.html#removenoncompatiblevariablescb) callback filters out variables from the NetCDF format that are not listed in the VARS configuration.
RemoveNonCompatibleVariablesCB
RemoveNonCompatibleVariablesCB (vars:Dict[str,str]={'LON': 'longitude', 'LAT': 'latitude', 'SMP_DEPTH': 'sampdepth', 'TOT_DEPTH': 'totdepth', 'TIME': 'begperiod', 'AREA': 'area', 'NUCLIDE': 'nuclide_id', 'VALUE': 'activity', 'UNIT': 'unit_id', 'UNC': 'uncertaint', 'DL': 'detection', 'FILT': 'filtered', 'COUNT_MET': 'counmet_id', 'SAMP_MET': 'sampmet_id', 'PREP_MET': 'prepmet_id', 'VOL': 'volume', 'SAL': 'salinity', 'TEMP': 'temperatur', 'SPECIES': 'species_id', 'BODY_PART': 'bodypar_id', 'SED_TYPE': 'sedtype_id', 'TOP': 'sliceup', 'BOTTOM': 'slicedown', 'DRYWT': 'drywt', 'WETWT': 'wetwt', 'PERCENTWT': 'percentwt', 'LAB': 'lab_id', 'PROFILE_ID': 'profile_id', 'SAMPLE_TYPE': 'samptype_id', 'TAXONNAME': 'taxonname', 'TAXONREPNAME': 'taxonrepname', 'TAXONRANK': 'taxonrank', 'TAXONDB': 'taxondb', 'TAXONDBID': 'taxondb_id', 'TAXONDBURL': 'taxondb_url', 'REF_ID': 'ref_id', 'SMP_ID': 'samplabcode'}, verbose:bool=False)
Remove variables not listed in VARS configuration.
Type | Default | Details | |
---|---|---|---|
vars | Dict | {‘LON’: ‘longitude’, ‘LAT’: ‘latitude’, ‘SMP_DEPTH’: ‘sampdepth’, ‘TOT_DEPTH’: ‘totdepth’, ‘TIME’: ‘begperiod’, ‘AREA’: ‘area’, ‘NUCLIDE’: ‘nuclide_id’, ‘VALUE’: ‘activity’, ‘UNIT’: ‘unit_id’, ‘UNC’: ‘uncertaint’, ‘DL’: ‘detection’, ‘FILT’: ‘filtered’, ‘COUNT_MET’: ‘counmet_id’, ‘SAMP_MET’: ‘sampmet_id’, ‘PREP_MET’: ‘prepmet_id’, ‘VOL’: ‘volume’, ‘SAL’: ‘salinity’, ‘TEMP’: ‘temperatur’, ‘SPECIES’: ‘species_id’, ‘BODY_PART’: ‘bodypar_id’, ‘SED_TYPE’: ‘sedtype_id’, ‘TOP’: ‘sliceup’, ‘BOTTOM’: ‘slicedown’, ‘DRYWT’: ‘drywt’, ‘WETWT’: ‘wetwt’, ‘PERCENTWT’: ‘percentwt’, ‘LAB’: ‘lab_id’, ‘PROFILE_ID’: ‘profile_id’, ‘SAMPLE_TYPE’: ‘samptype_id’, ‘TAXONNAME’: ‘taxonname’, ‘TAXONREPNAME’: ‘taxonrepname’, ‘TAXONRANK’: ‘taxonrank’, ‘TAXONDB’: ‘taxondb’, ‘TAXONDBID’: ‘taxondb_id’, ‘TAXONDBURL’: ‘taxondb_url’, ‘REF_ID’: ‘ref_id’, ‘SMP_ID’: ‘samplabcode’} | Dictionary mapping OR vars to NC vars |
verbose | bool | False |
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS, verbose=True),
    ]
)
tfm()
print('\n')
Removing variables that are not compatible with vars provided.
Removing BIO_GROUP from BIOTA dataset.
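In essence, the filtering above drops any DataFrame column that has no entry in the vars mapping. A minimal pandas sketch with a tiny stand-in for the real CSV_VARS dictionary:

```python
import pandas as pd

# Illustrative subset standing in for the real CSV_VARS mapping.
vars_map = {'LON': 'longitude', 'LAT': 'latitude', 'NUCLIDE': 'nuclide_id'}
df = pd.DataFrame({'LON': [29.3], 'LAT': [60.1], 'NUCLIDE': [33], 'BIO_GROUP': [4]})

incompatible = [c for c in df.columns if c not in vars_map]
print(f'Removing {incompatible} from dataset.')  # BIO_GROUP has no mapping
df = df.drop(columns=incompatible)
print(list(df.columns))  # ['LON', 'LAT', 'NUCLIDE']
```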
Add Taxon Information
get_taxon_info_lut
get_taxon_info_lut (maris_lut:str, key_names:dict={'Taxonname': 'TAXONNAME', 'Taxonrank': 'TAXONRANK', 'TaxonDB': 'TAXONDB', 'TaxonDBID': 'TAXONDBID', 'TaxonDBURL': 'TAXONDBURL'})
Create lookup dictionary for taxon information from MARIS species lookup table.
AddTaxonInformationCB
AddTaxonInformationCB (fn_lut:Callable=<function <lambda>>, verbose:bool=False)
Add taxon information to BIOTA group based on species lookup table.
Type | Default | Details | |
---|---|---|---|
fn_lut | Callable | Function that returns taxon lookup dictionary | |
verbose | bool | False |
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddTaxonInformationCB(
            fn_lut=lut_taxon
        ),
    ]
)
tfm()
print(tfm.dfs['BIOTA'][['TAXONNAME','TAXONRANK','TAXONDB','TAXONDBID','TAXONDBURL']])
TAXONNAME TAXONRANK TAXONDB TAXONDBID \
0 Gadus morhua species Wikidata Q199788
1 Gadus morhua species Wikidata Q199788
2 Gadus morhua species Wikidata Q199788
3 Gadus morhua species Wikidata Q199788
4 Gadus morhua species Wikidata Q199788
... ... ... ... ...
16089 Fucus vesiculosus species Wikidata Q754755
16090 Fucus vesiculosus species Wikidata Q754755
16091 Mytilus edulis species Wikidata Q27855
16092 Mytilus edulis species Wikidata Q27855
16093 Mytilus edulis species Wikidata Q27855
TAXONDBURL
0 https://www.wikidata.org/wiki/Q199788
1 https://www.wikidata.org/wiki/Q199788
2 https://www.wikidata.org/wiki/Q199788
3 https://www.wikidata.org/wiki/Q199788
4 https://www.wikidata.org/wiki/Q199788
... ...
16089 https://www.wikidata.org/wiki/Q754755
16090 https://www.wikidata.org/wiki/Q754755
16091 https://www.wikidata.org/wiki/Q27855
16092 https://www.wikidata.org/wiki/Q27855
16093 https://www.wikidata.org/wiki/Q27855
[16094 rows x 5 columns]
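The enrichment above can be sketched as a per-field lookup keyed on the species id. The two-entry table below is illustrative; the real lookup is built from the MARIS species table via get_taxon_info_lut:

```python
import pandas as pd

# Illustrative taxon lookup keyed by species id (not the real MARIS LUT).
taxon_lut = {
    99:  {'TAXONNAME': 'Gadus morhua',   'TAXONDB': 'Wikidata', 'TAXONDBID': 'Q199788'},
    129: {'TAXONNAME': 'Mytilus edulis', 'TAXONDB': 'Wikidata', 'TAXONDBID': 'Q27855'},
}
biota = pd.DataFrame({'SPECIES': [99, 129], 'VALUE': [0.01, 50.7]})

# Write each taxon field as a new column, looked up from the species id.
for field in ('TAXONNAME', 'TAXONDB', 'TAXONDBID'):
    biota[field] = biota['SPECIES'].map(lambda s: taxon_lut[s][field])
print(biota[['SPECIES', 'TAXONNAME', 'TAXONDBID']])
```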
Standardize Time
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        DecodeTimeCB(),
    ]
)
tfm()
print(tfm.dfs['BIOTA']['TIME'])
0 2012-09-23
1 2012-09-23
2 2012-09-23
3 2012-09-23
4 2012-09-23
...
16089 2022-05-10
16090 2022-05-10
16091 2022-09-15
16092 2022-09-15
16093 2022-09-15
Name: TIME, Length: 16094, dtype: datetime64[ns]
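The decoding itself is straightforward: TIME holds seconds since the Unix epoch, so the conversion amounts to:

```python
import pandas as pd

# TIME is stored as seconds since 1970-01-01 (Unix epoch).
time_s = pd.Series([1337731200, 1348358400])
decoded = pd.to_datetime(time_s, unit='s')
print([d.date().isoformat() for d in decoded])  # ['2012-05-23', '2012-09-23']
```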
Add Sample Type ID
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddSampleTypeIdColumnCB(),
    ]
)
tfm()
print(tfm.dfs['SEAWATER']['SAMPLE_TYPE'].unique())
print(tfm.dfs['BIOTA']['SAMPLE_TYPE'].unique())
print(tfm.dfs['SEDIMENT']['SAMPLE_TYPE'].unique())
[1]
[2]
[3]
Add Reference ID
Include the ref_id (i.e., the Zotero Archive Location). The AddZoteroArchiveLocationCB callback looks up the Zotero Archive Location using the Zotero key stored as id in the global attributes of the MARIS NetCDF file.
contents.global_attrs['id']
'26VMZZ2Q'
AddZoteroArchiveLocationCB
AddZoteroArchiveLocationCB (attrs:str, cfg:dict)
Fetch and append ‘Loc. in Archive’ from Zotero to DataFrame.
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        AddZoteroArchiveLocationCB(contents.global_attrs, cfg=cfg()),
    ]
)
tfm()
print(tfm.dfs['SEAWATER']['REF_ID'].unique())
[100]
Remap encoded custom maps
NetCDF variables can store custom mappings as attributes, which provide a way to map between encoded values and their human-readable representations. The RemapCustomMapsCB
callback handles this conversion process.
RemapCustomMapsCB
RemapCustomMapsCB (verbose:bool=False)
Remap encoded custom maps to decoded values.
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemapCustomMapsCB(verbose=True),
    ]
)
tfm()
print('Example of remapped custom maps:')
for grp in tfm.dfs:
    print(grp)
    print(tfm.dfs[grp]['SMP_ID'].head())
Example of remapped custom maps:
BIOTA
0 0
1 0
2 0
3 0
4 1
Name: SMP_ID, dtype: uint64
SEAWATER
0 0
1 1
2 2
3 3
4 4
Name: SMP_ID, dtype: uint64
SEDIMENT
0 0
1 1
2 2
3 3
4 4
Name: SMP_ID, dtype: uint64
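This particular file carries no custom maps (the SMP_ID values above stay as integers), but the remapping is conceptually a dictionary substitution: a variable's custom map stores {encoded integer -> original value}, and the callback restores the originals. A sketch with made-up sample codes:

```python
import pandas as pd

# Hypothetical custom map for SMP_ID: encoded integer -> original lab sample code.
custom_map = {0: 'SMP-0001', 1: 'SMP-0002', 2: 'SMP-0003'}
df = pd.DataFrame({'SMP_ID': [0, 1, 2, 2]})

df['SMP_ID'] = df['SMP_ID'].map(custom_map)
print(df['SMP_ID'].tolist())  # ['SMP-0001', 'SMP-0002', 'SMP-0003', 'SMP-0003']
```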
Remap to OpenRefine specific mappings
[To be further clarified]
The current approach of remapping to OR-specific mappings should be reconsidered. Since we already use MARISCO lookup tables to create the enums in NetCDF, it would be beneficial to extend their use to the OpenRefine data format as well. Doing so would eliminate the need for OpenRefine-specific mappings and streamline the data transformation process. Let's review the lookup tables used to create the enums for NetCDF:
enums = Enums(lut_src_dir=lut_path())
print(f'DL enums: {enums.types["DL"]}')
print(f'FILT enums: {enums.types["FILT"]}')
DL enums: {'Not applicable': -1, 'Not available': 0, 'Detected value': 1, 'Detection limit': 2, 'Not detected': 3, 'Derived': 4}
FILT enums: {'Not applicable': -1, 'Not available': 0, 'Yes': 1, 'No': 2}
For the detection limit lookup table (LUT), shown below, the values required by the OpenRefine CSV format are listed in the 'name' column, whereas the enums use the 'name_sanitized' column. For the filtered LUT, also shown below, the values do not align with the OpenRefine CSV format either, which uses Y, N, and NA.
dl_lut = pd.read_excel(detection_limit_lut_path())
dl_lut
id | name | name_sanitized | |
---|---|---|---|
0 | -1 | Not applicable | Not applicable |
1 | 0 | Not Available | Not available |
2 | 1 | = | Detected value |
3 | 2 | < | Detection limit |
4 | 3 | ND | Not detected |
5 | 4 | DE | Derived |
filtered_lut = pd.read_excel(filtered_lut_path())
filtered_lut
id | name | |
---|---|---|
0 | -1 | Not applicable |
1 | 0 | Not available |
2 | 1 | Yes |
3 | 2 | No |
We will create OpenRefine specific mappings for the detection limit and filtered data:
RemapToORSpecificMappingsCB remaps the values of the detection limit and filtered data to the OpenRefine CSV format.
RemapToORSpecificMappingsCB
RemapToORSpecificMappingsCB (or_mappings:Dict[str,Dict]={'DL': {0: 'ND', 1: '=', 2: '<'}, 'FILT': {0: 'NA', 1: 'Y', 2: 'N'}}, output_format:str='openrefine_csv', verbose:bool=False)
Convert values using OR mappings if columns exist in dataframe.
Type | Default | Details | |
---|---|---|---|
or_mappings | Dict | {‘DL’: {0: ‘ND’, 1: ‘=’, 2: ‘<’}, ‘FILT’: {0: ‘NA’, 1: ‘Y’, 2: ‘N’}} | Dictionary of column mappings, |
output_format | str | openrefine_csv | |
verbose | bool | False |
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        RemapToORSpecificMappingsCB(),
    ]
)
tfm()

# Default mappings applied by RemapToORSpecificMappingsCB
or_mappings = {'DL': {0: 'ND', 1: '=', 2: '<'}, 'FILT': {0: 'NA', 1: 'Y', 2: 'N'}}

# Loop through each group in the 'dfs' dictionary
for group_name, df in tfm.dfs.items():
    # Check if the group dataframe contains any of the columns in or_mappings
    relevant_columns = [col for col in or_mappings.keys() if col in df.columns]
    if relevant_columns:
        # Print the unique values from the relevant columns
        print(f"\nUnique values in {group_name} for columns {relevant_columns}:")
        for col in relevant_columns:
            print(f"{col}: {df[col].unique()}")
    else:
        print(f"No relevant columns found in {group_name} based on or_mappings keys.")
Unique values in BIOTA for columns ['DL']:
DL: ['<' '=' 'ND']
Unique values in SEAWATER for columns ['DL', 'FILT']:
DL: ['=' '<' 'ND']
FILT: ['NA' 'N' 'Y']
Unique values in SEDIMENT for columns ['DL']:
DL: ['=' '<' 'ND']
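The remapping shown above is a plain per-column dictionary substitution, driven by the default or_mappings:

```python
import pandas as pd

# Default OR mappings from RemapToORSpecificMappingsCB.
or_mappings = {'DL': {0: 'ND', 1: '=', 2: '<'}, 'FILT': {0: 'NA', 1: 'Y', 2: 'N'}}
df = pd.DataFrame({'DL': [1, 2, 0], 'FILT': [0, 1, 2]})

# Substitute each encoded value with its OpenRefine representation.
for col, mapping in or_mappings.items():
    df[col] = df[col].map(mapping)
print(df.to_dict('list'))  # {'DL': ['=', '<', 'ND'], 'FILT': ['NA', 'Y', 'N']}
```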
Remap to CSV data type format
CSV_DTYPES (defined in configs.ipynb) defines a state for each variable that has a lookup table (i.e., enums). The state is either 'decoded' or 'encoded'. Let's review the variable states as a DataFrame:
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(pd.DataFrame.from_dict(CSV_DTYPES, orient='index').T)
AREA | NUCLIDE | UNIT | DL | FILT | COUNT_MET | SAMP_MET | PREP_MET | SPECIES | BODY_PART | SED_TYPE | LAB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
state | decoded | encoded | encoded | decoded | decoded | encoded | encoded | encoded | encoded | encoded | encoded | encoded |
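The two states are two views of the same lookup table: an enum maps names to integer ids, so 'encoded' columns hold the ids and 'decoded' columns hold the names. Converting between them is just a dictionary inversion, sketched here with the DL enum values printed earlier:

```python
# DL enum (name -> id), as shown above from the MARIS lookup tables.
enum = {'Not applicable': -1, 'Not available': 0, 'Detected value': 1, 'Detection limit': 2}
to_name = {v: k for k, v in enum.items()}  # id -> name, for decoding

encoded = [1, 2, 1]
decoded = [to_name[v] for v in encoded]
print(decoded)  # ['Detected value', 'Detection limit', 'Detected value']
```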
enums = Enums(lut_src_dir=lut_path())
enums.types.keys()
dict_keys(['AREA', 'BIO_GROUP', 'BODY_PART', 'COUNT_MET', 'DL', 'FILT', 'NUCLIDE', 'PREP_MET', 'SAMP_MET', 'SED_TYPE', 'SPECIES', 'UNIT', 'LAB'])
get_excluded_enums
get_excluded_enums (output_format:str='openrefine_csv')
Get excluded enums based on output format.
DataFormatConversionCB
DataFormatConversionCB (dtypes:Dict, excluded_mappings:Callable=<function get_excluded_enums>, output_format:str='openrefine_csv', verbose:bool=False)
A callback to convert DataFrame enum values between encoded and decoded formats based on specified settings.
Type | Default | Details | |
---|---|---|---|
dtypes | Dict | Dictionary defining data types and states for each lookup table | |
excluded_mappings | Callable | get_excluded_enums | Dictionary of columns to exclude from conversion |
output_format | str | openrefine_csv | |
verbose | bool | False | Flag for verbose output |
contents = ExtractNetcdfContents(fname_in)
tfm = Transformer(
    contents.dfs,
    cbs=[
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS, verbose=True),
        DataFormatConversionCB(
            dtypes=CSV_DTYPES,
            excluded_mappings=get_excluded_enums,
            output_format='openrefine_csv',
            verbose=True
        ),
    ]
)
tfm()
Removing variables that are not compatible with vars provided.
Removing BIO_GROUP from BIOTA dataset.
Loaded enums: dict_keys(['AREA', 'BIO_GROUP', 'BODY_PART', 'COUNT_MET', 'DL', 'FILT', 'NUCLIDE', 'PREP_MET', 'SAMP_MET', 'SED_TYPE', 'SPECIES', 'UNIT', 'LAB'])
{'BIOTA': LON LAT SMP_DEPTH TIME SMP_ID NUCLIDE \
0 12.316667 54.283333 NaN 1348358400 0 31
1 12.316667 54.283333 NaN 1348358400 0 4
2 12.316667 54.283333 NaN 1348358400 0 9
3 12.316667 54.283333 NaN 1348358400 0 33
4 12.316667 54.283333 NaN 1348358400 1 31
... ... ... ... ... ... ...
16089 21.395000 61.241501 2.0 1652140800 4789 33
16090 21.395000 61.241501 2.0 1652140800 4789 9
16091 21.385000 61.343334 NaN 1663200000 4790 4
16092 21.385000 61.343334 NaN 1663200000 4790 33
16093 21.385000 61.343334 NaN 1663200000 4790 12
VALUE UNIT UNC DL SPECIES BODY_PART DRYWT WETWT \
0 0.010140 5 NaN 2 99 52 174.934433 948.0
1 135.300003 5 4.830210 1 99 52 174.934433 948.0
2 0.013980 5 NaN 2 99 52 174.934433 948.0
3 4.338000 5 0.150962 1 99 52 174.934433 948.0
4 0.009614 5 NaN 2 99 52 177.935120 964.0
... ... ... ... .. ... ... ... ...
16089 13.700000 4 0.520600 1 96 55 NaN NaN
16090 0.500000 4 0.045500 1 96 55 NaN NaN
16091 50.700001 4 4.106700 1 129 1 NaN NaN
16092 0.880000 4 0.140800 1 129 1 NaN NaN
16093 6.600000 4 0.349800 1 129 1 NaN NaN
PERCENTWT
0 0.18453
1 0.18453
2 0.18453
3 0.18453
4 0.18458
... ...
16089 NaN
16090 NaN
16091 NaN
16092 NaN
16093 NaN
[16094 rows x 15 columns],
'SEAWATER': LON LAT SMP_DEPTH TOT_DEPTH TIME SMP_ID \
0 29.333300 60.083302 0.0 NaN 1337731200 0
1 29.333300 60.083302 29.0 NaN 1337731200 1
2 23.150000 59.433300 0.0 NaN 1339891200 2
3 27.983299 60.250000 0.0 NaN 1337817600 3
4 27.983299 60.250000 39.0 NaN 1337817600 4
... ... ... ... ... ... ...
21468 13.499833 54.600334 0.0 47.0 1686441600 9724
21469 13.499833 54.600334 45.0 47.0 1686441600 9725
21470 14.200833 54.600334 0.0 11.0 1686614400 9731
21471 14.665500 54.600334 0.0 20.0 1686614400 9732
21472 14.330000 54.600334 0.0 17.0 1686614400 9734
NUCLIDE VALUE UNIT UNC DL FILT
0 33 5.300000 1 1.696000 1 0
1 33 19.900000 1 3.980000 1 0
2 33 25.500000 1 5.100000 1 0
3 33 17.000000 1 4.930000 1 0
4 33 22.200001 1 3.996000 1 0
... ... ... ... ... .. ...
21468 1 702.838074 1 51.276207 1 0
21469 1 725.855713 1 52.686260 1 0
21470 1 648.992920 1 48.154419 1 0
21471 1 627.178406 1 46.245316 1 0
21472 1 605.715088 1 45.691143 1 0
[21473 rows x 12 columns],
'SEDIMENT': LON LAT TOT_DEPTH TIME SMP_ID NUCLIDE \
0 27.799999 60.466667 25.0 1337904000 0 33
1 27.799999 60.466667 25.0 1337904000 1 33
2 27.799999 60.466667 25.0 1337904000 2 33
3 27.799999 60.466667 25.0 1337904000 3 33
4 27.799999 60.466667 25.0 1337904000 4 33
... ... ... ... ... ... ...
70444 15.537800 54.617832 62.0 1654646400 14121 67
70445 15.537800 54.617832 62.0 1654646400 14121 77
70446 15.537800 54.617832 62.0 1654646400 14122 4
70447 15.537800 54.617832 62.0 1654646400 14122 33
70448 15.537800 54.617832 62.0 1654646400 14122 77
VALUE UNIT UNC DL SED_TYPE TOP BOTTOM PERCENTWT
0 1200.000000 3 240.000000 1 0 15.0 20.0 NaN
1 250.000000 3 50.000000 1 0 20.0 25.0 NaN
2 140.000000 3 29.400000 1 0 25.0 30.0 NaN
3 79.000000 3 15.800000 1 0 30.0 35.0 NaN
4 29.000000 3 6.960000 1 0 35.0 40.0 NaN
... ... ... ... .. ... ... ... ...
70444 0.044000 2 0.015312 1 10 15.0 17.0 0.257642
70445 2.500000 2 0.185000 1 10 15.0 17.0 0.257642
70446 5873.000000 2 164.444000 1 10 17.0 19.0 0.263965
70447 21.200001 2 2.162400 1 10 17.0 19.0 0.263965
70448 0.370000 2 0.048100 1 10 17.0 19.0 0.263965
[70449 rows x 14 columns]}
Review all callbacks
contents = ExtractNetcdfContents(fname_in)
output_format = 'openrefine_csv'
tfm = Transformer(
    data=contents.dfs,
    custom_maps=contents.custom_maps,
    cbs=[
        ValidateEnumsCB(
            contents=contents,
            maris_enums=Enums(lut_src_dir=lut_path())
        ),
        RemoveNonCompatibleVariablesCB(vars=CSV_VARS),
        RemapCustomMapsCB(),
        RemapToORSpecificMappingsCB(output_format=output_format),
        AddTaxonInformationCB(
            fn_lut=lut_taxon
        ),
        DecodeTimeCB(),
        AddSampleTypeIdColumnCB(),
        AddZoteroArchiveLocationCB(contents.global_attrs, cfg=cfg()),
        DataFormatConversionCB(
            dtypes=CSV_DTYPES,
            excluded_mappings=get_excluded_enums,
            output_format=output_format,
        )
    ]
)
tfm()
for grp in ['SEAWATER', 'BIOTA']:
    display(Markdown(f"<b>Head of the transformed `{grp}` DataFrame:</b>"))
    with pd.option_context('display.max_rows', None):
        display(tfm.dfs[grp].head())
Head of the transformed SEAWATER DataFrame:
LON | LAT | SMP_DEPTH | TOT_DEPTH | TIME | SMP_ID | NUCLIDE | VALUE | UNIT | UNC | DL | FILT | SAMPLE_TYPE | REF_ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29.333300 | 60.083302 | 0.0 | NaN | 2012-05-23 | 0 | 33 | 5.300000 | 1 | 1.696 | = | NA | 1 | 100 |
1 | 29.333300 | 60.083302 | 29.0 | NaN | 2012-05-23 | 1 | 33 | 19.900000 | 1 | 3.980 | = | NA | 1 | 100 |
2 | 23.150000 | 59.433300 | 0.0 | NaN | 2012-06-17 | 2 | 33 | 25.500000 | 1 | 5.100 | = | NA | 1 | 100 |
3 | 27.983299 | 60.250000 | 0.0 | NaN | 2012-05-24 | 3 | 33 | 17.000000 | 1 | 4.930 | = | NA | 1 | 100 |
4 | 27.983299 | 60.250000 | 39.0 | NaN | 2012-05-24 | 4 | 33 | 22.200001 | 1 | 3.996 | = | NA | 1 | 100 |
Head of the transformed BIOTA DataFrame:
LON | LAT | SMP_DEPTH | TIME | SMP_ID | NUCLIDE | VALUE | UNIT | UNC | DL | ... | DRYWT | WETWT | PERCENTWT | TAXONNAME | TAXONRANK | TAXONDB | TAXONDBID | TAXONDBURL | SAMPLE_TYPE | REF_ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12.316667 | 54.283333 | NaN | 2012-09-23 | 0 | 31 | 0.010140 | 5 | NaN | < | ... | 174.934433 | 948.0 | 0.18453 | Gadus morhua | species | Wikidata | Q199788 | https://www.wikidata.org/wiki/Q199788 | 2 | 100 |
1 | 12.316667 | 54.283333 | NaN | 2012-09-23 | 0 | 4 | 135.300003 | 5 | 4.830210 | = | ... | 174.934433 | 948.0 | 0.18453 | Gadus morhua | species | Wikidata | Q199788 | https://www.wikidata.org/wiki/Q199788 | 2 | 100 |
2 | 12.316667 | 54.283333 | NaN | 2012-09-23 | 0 | 9 | 0.013980 | 5 | NaN | < | ... | 174.934433 | 948.0 | 0.18453 | Gadus morhua | species | Wikidata | Q199788 | https://www.wikidata.org/wiki/Q199788 | 2 | 100 |
3 | 12.316667 | 54.283333 | NaN | 2012-09-23 | 0 | 33 | 4.338000 | 5 | 0.150962 | = | ... | 174.934433 | 948.0 | 0.18453 | Gadus morhua | species | Wikidata | Q199788 | https://www.wikidata.org/wiki/Q199788 | 2 | 100 |
4 | 12.316667 | 54.283333 | NaN | 2012-09-23 | 1 | 31 | 0.009614 | 5 | NaN | < | ... | 177.935120 | 964.0 | 0.18458 | Gadus morhua | species | Wikidata | Q199788 | https://www.wikidata.org/wiki/Q199788 | 2 | 100 |
5 rows × 22 columns
Decode
decode
decode (fname_in:str, dest_out:str|None=None, output_format:str='openrefine_csv', remap_vars:Dict[str,str]={'LON': 'longitude', 'LAT': 'latitude', 'SMP_DEPTH': 'sampdepth', 'TOT_DEPTH': 'totdepth', 'TIME': 'begperiod', 'AREA': 'area', 'NUCLIDE': 'nuclide_id', 'VALUE': 'activity', 'UNIT': 'unit_id', 'UNC': 'uncertaint', 'DL': 'detection', 'FILT': 'filtered', 'COUNT_MET': 'counmet_id', 'SAMP_MET': 'sampmet_id', 'PREP_MET': 'prepmet_id', 'VOL': 'volume', 'SAL': 'salinity', 'TEMP': 'temperatur', 'SPECIES': 'species_id', 'BODY_PART': 'bodypar_id', 'SED_TYPE': 'sedtype_id', 'TOP': 'sliceup', 'BOTTOM': 'slicedown', 'DRYWT': 'drywt', 'WETWT': 'wetwt', 'PERCENTWT': 'percentwt', 'LAB': 'lab_id', 'PROFILE_ID': 'profile_id', 'SAMPLE_TYPE': 'samptype_id', 'TAXONNAME': 'taxonname', 'TAXONREPNAME': 'taxonrepname', 'TAXONRANK': 'taxonrank', 'TAXONDB': 'taxondb', 'TAXONDBID': 'taxondb_id', 'TAXONDBURL': 'taxondb_url', 'REF_ID': 'ref_id', 'SMP_ID': 'samplabcode'}, remap_dtypes:Dict[str,str]={'AREA': {'state': 'decoded'}, 'NUCLIDE': {'state': 'encoded'}, 'UNIT': {'state': 'encoded'}, 'DL': {'state': 'decoded'}, 'FILT': {'state': 'decoded'}, 'COUNT_MET': {'state': 'encoded'}, 'SAMP_MET': {'state': 'encoded'}, 'PREP_MET': {'state': 'encoded'}, 'SPECIES': {'state': 'encoded'}, 'BODY_PART': {'state': 'encoded'}, 'SED_TYPE': {'state': 'encoded'}, 'LAB': {'state': 'encoded'}}, verbose:bool=False, **kwargs)
Decode data from NetCDF.
Type | Default | Details | |
---|---|---|---|
fname_in | str | Input file name | |
dest_out | str | None | None | Output file name (optional) |
output_format | str | openrefine_csv | |
remap_vars | Dict | {‘LON’: ‘longitude’, ‘LAT’: ‘latitude’, ‘SMP_DEPTH’: ‘sampdepth’, ‘TOT_DEPTH’: ‘totdepth’, ‘TIME’: ‘begperiod’, ‘AREA’: ‘area’, ‘NUCLIDE’: ‘nuclide_id’, ‘VALUE’: ‘activity’, ‘UNIT’: ‘unit_id’, ‘UNC’: ‘uncertaint’, ‘DL’: ‘detection’, ‘FILT’: ‘filtered’, ‘COUNT_MET’: ‘counmet_id’, ‘SAMP_MET’: ‘sampmet_id’, ‘PREP_MET’: ‘prepmet_id’, ‘VOL’: ‘volume’, ‘SAL’: ‘salinity’, ‘TEMP’: ‘temperatur’, ‘SPECIES’: ‘species_id’, ‘BODY_PART’: ‘bodypar_id’, ‘SED_TYPE’: ‘sedtype_id’, ‘TOP’: ‘sliceup’, ‘BOTTOM’: ‘slicedown’, ‘DRYWT’: ‘drywt’, ‘WETWT’: ‘wetwt’, ‘PERCENTWT’: ‘percentwt’, ‘LAB’: ‘lab_id’, ‘PROFILE_ID’: ‘profile_id’, ‘SAMPLE_TYPE’: ‘samptype_id’, ‘TAXONNAME’: ‘taxonname’, ‘TAXONREPNAME’: ‘taxonrepname’, ‘TAXONRANK’: ‘taxonrank’, ‘TAXONDB’: ‘taxondb’, ‘TAXONDBID’: ‘taxondb_id’, ‘TAXONDBURL’: ‘taxondb_url’, ‘REF_ID’: ‘ref_id’, ‘SMP_ID’: ‘samplabcode’} | |
remap_dtypes | Dict | {‘AREA’: {‘state’: ‘decoded’}, ‘NUCLIDE’: {‘state’: ‘encoded’}, ‘UNIT’: {‘state’: ‘encoded’}, ‘DL’: {‘state’: ‘decoded’}, ‘FILT’: {‘state’: ‘decoded’}, ‘COUNT_MET’: {‘state’: ‘encoded’}, ‘SAMP_MET’: {‘state’: ‘encoded’}, ‘PREP_MET’: {‘state’: ‘encoded’}, ‘SPECIES’: {‘state’: ‘encoded’}, ‘BODY_PART’: {‘state’: ‘encoded’}, ‘SED_TYPE’: {‘state’: ‘encoded’}, ‘LAB’: {‘state’: ‘encoded’}} | |
verbose | bool | False | |
kwargs | VAR_KEYWORD | ||
Returns | None | Additional arguments |
fname = Path('../../_data/output/100-HELCOM-MORS-2024.nc')
decode(fname_in=fname, dest_out=fname.with_suffix(''), output_format='openrefine_csv')