Exported source
CFG_FNAME = 'configs.toml'
CDL_FNAME = 'cdl.toml'
NUCLIDE_LOOKUP_FNAME = 'dbo_nuclide.xlsx'
MARISCO_CFG_DIRNAME = '.marisco'
The .toml configuration files are copied to the .marisco folder under your home directory, along with associated utility functions. These .toml files can then be adapted to your specific needs if required.
base_path ()
Return the path to the .marisco folder under your home directory.
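As a rough illustration, a function of this kind could be written with pathlib as shown below. This is a minimal sketch and not the package's actual source: the name base_path_sketch is hypothetical, and only the MARISCO_CFG_DIRNAME value is taken from this page.

```python
from pathlib import Path

MARISCO_CFG_DIRNAME = '.marisco'  # value taken from the exported constants

def base_path_sketch() -> Path:
    """Hypothetical stand-in for base_path(): <home directory>/.marisco."""
    return Path.home() / MARISCO_CFG_DIRNAME

# Sub-directories can then be derived with the `/` operator.
lut_dir = base_path_sketch() / 'lut'
print(base_path_sketch())
print(lut_dir)
```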
By default, we create a folder named .marisco under your home directory that will receive all configuration files, as defined in BASE_PATH:
import os

CONFIGS = {
    'gh': {
        'owner': 'franckalbinet',
        'repo': 'marisco'
    },
    'names': {
        'nc_template': 'maris-template.nc'
    },
    'dirs': {
        'lut': str(base_path() / 'lut'), # Look-up tables
        'cache': str(base_path() / 'cache'), # Cache (e.g. WoRMS species)
        'tmp': str(base_path() / 'tmp')
    },
    'paths': {
        'luts': 'nbs/files/lut'
    },
    'units': {
        'time': 'seconds since 1970-01-01 00:00:00.0'
    },
    'zotero': {
        'api_key': os.getenv('ZOTERO_API_KEY'),
        'lib_id': '2432820'
    }
}
The CONFIGS dictionary defines general settings:
| Key | Value | Description |
|---|---|---|
| dirs/lut | /Users/franckalbinet/.marisco/lut | Location & name of the directory receiving lookup tables. |
| dirs/cache | /Users/franckalbinet/.marisco/cache | Location & name of the directory receiving cache files such as retrieved WoRMS species. |
| dirs/tmp | /Users/franckalbinet/.marisco/tmp | Location & name of the directory receiving temporary files. |
| gh/owner | franckalbinet | GitHub account owner. |
| gh/repo | marisco | GitHub repository from which specific files (e.g. lookup tables) are downloaded during installation. |
| names/nc_template | maris-template.nc | Name of the MARIS NetCDF4 template. |
| paths/luts | nbs/files/lut | GitHub repository directory name containing lookup tables. |
| units/time | seconds since 1970-01-01 00:00:00.0 | Reference date and time used for NetCDF time encoding. |
| zotero/api_key | your-zotero-api-key | Zotero API key ("ZOTERO_API_KEY" environment variable). |
| zotero/lib_id | 2432820 | Zotero library ID. |
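The keys in the table above are written as slash-separated paths (for example dirs/lut) that point at nested entries of CONFIGS. The sketch below shows one way such a path could be resolved against a nested dictionary. The helper get_nested is hypothetical (it is not part of marisco), and sample_cfg only reproduces a few entries of CONFIGS.

```python
from functools import reduce

def get_nested(cfg: dict, path: str, sep: str = '/'):
    """Hypothetical helper: resolve 'a/b' to cfg['a']['b']."""
    return reduce(lambda node, key: node[key], path.split(sep), cfg)

# A few entries copied from CONFIGS for illustration.
sample_cfg = {
    'gh': {'owner': 'franckalbinet', 'repo': 'marisco'},
    'names': {'nc_template': 'maris-template.nc'},
    'units': {'time': 'seconds since 1970-01-01 00:00:00.0'},
}

print(get_nested(sample_cfg, 'gh/repo'))            # marisco
print(get_nested(sample_cfg, 'names/nc_template'))  # maris-template.nc
```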
The main CONFIGS_CDL dictionary is used to generate a NetCDF CDL (Common Data Language) .toml file. This file is then used to generate a MARIS NetCDF template file. For further details, refer to the configs.ipynb file.
The vars/defaults section is printed further below, after the CONFIGS_CDL definition.
cfg ()
Return the configuration as a dictionary.
nuc_lut_path ()
Return the path to the nuclide lookup table.
lut_path ()
Return the path to the lookup tables directory.
cache_path ()
Return the path to the cache directory.
CONFIGS_CDL = {
'placeholder': '_to_be_filled_in_',
'grps': {
'sea': {
'name': 'seawater',
'id': 1
},
'bio': {
'name': 'biota',
'id': 2
},
'sed': {
'name': 'sediment',
'id': 3
},
'sus': {
'name': 'suspended-matter',
'id': 4
}
},
'global_attrs': {
# Do not update keys. Only values if required
'id': '', # zotero?
'title': '',
'summary': '',
'keywords': '',
'keywords_vocabulary': 'GCMD Science Keywords',
'keywords_vocabulary_url': 'https://gcmd.earthdata.nasa.gov/static/kms/',
'record': '',
'featureType': '',
'cdm_data_type': '',
# Conventions
'Conventions': 'CF-1.10 ACDD-1.3',
# Publisher [ACDD1.3]
'publisher_name': 'Paul MCGINNITY, Iolanda OSVATH, Florence DESCROIX-COMANDUCCI',
'publisher_email': 'p.mc-ginnity@iaea.org, i.osvath@iaea.org, F.Descroix-Comanducci@iaea.org',
'publisher_url': 'https://maris.iaea.org',
'publisher_institution': 'International Atomic Energy Agency - IAEA',
# Creator info [ACDD1.3]
'creator_name': '',
'institution': '',
'metadata_link': '',
'creator_email': '',
'creator_url': '',
'references': '',
'license': ' '.join(['Without prejudice to the applicable Terms and Conditions',
'(https://nucleus.iaea.org/Pages/Others/Disclaimer.aspx),',
'I hereby agree that any use of the data will contain appropriate',
'acknowledgement of the data source(s) and the IAEA Marine',
'Radioactivity Information System (MARIS).']),
'comment': '',
# Dataset info & coordinates [ACDD1.3]
#'project': '', # Network long name
#'platform': '', # Should be a long / full name
'geospatial_lat_min': '',
'geospatial_lon_min': '',
'geospatial_lat_max': '',
'geospatial_lon_max': '',
'geospatial_vertical_min': '',
'geospatial_vertical_max': '',
'geospatial_bounds': '', # wkt representation
'geospatial_bounds_crs': 'EPSG:4326',
# Time information
'time_coverage_start': '',
'time_coverage_end': '',
#'time_coverage_resolution': '',
'local_time_zone': '',
'date_created': '',
'date_modified': '',
#
# -- Additional metadata (custom to MARIS)
#
'publisher_postprocess_logs': ''
},
'dim': {
'name': 'sample',
'attrs': {
'long_name': 'Sample ID of measurement'
},
'dtype': 'u8'
},
'vars': {
'defaults': {
'lon': {
'name': 'lon',
'attrs': {
'long_name': 'Measurement longitude',
'standard_name': 'longitude',
'units': 'degrees_east',
'axis': 'X',
'_CoordinateAxisType': 'Lon'
},
'dtype': 'f4'
},
'lat': {
'name': 'lat',
'attrs': {
'long_name': 'Measurement latitude',
'standard_name': 'latitude',
'units': 'degrees_north',
'axis': 'Y',
'_CoordinateAxisType': 'Lat'
},
'dtype': 'f4'
},
'smp_depth': {
'name': 'smp_depth',
'attrs': {
'long_name': 'Sample depth below sea level',
'standard_name': 'sample_depth_below_sea_floor',
'units': 'm',
'axis': 'Z'
},
'dtype': 'f4'
},
'tot_depth': {
'name': 'tot_depth',
'attrs': {
'long_name': 'Total depth below sea level',
'standard_name': 'total_depth_below_sea_floor',
'units': 'm',
'axis': 'Z'
},
'dtype': 'f4'
},
'time': {
'name': 'time',
'attrs': {
'long_name': 'Time of measurement',
'standard_name': 'time',
'units': 'seconds since 1970-01-01 00:00:00.0',
'time_origin': '1970-01-01 00:00:00',
'time_zone': 'UTC',
'abbreviation': 'Date/Time',
'axis': 'T',
'calendar': 'gregorian'
},
'dtype': 'u8',
},
'area': {
'name': 'area',
'attrs': {
'long_name': 'Marine area/region id',
'standard_name': 'area_id'
},
'dtype': 'area_t'
},
},
'bio': {
'bio_group': {
'name': 'bio_group',
'attrs': {
'long_name': 'Biota group',
'standard_name': 'biota_group_tbd'
},
'dtype': 'bio_group_t'
},
'species': {
'name': 'species',
'attrs': {
'long_name': 'Species',
'standard_name': 'species'
},
'dtype': 'species_t'
},
'body_part': {
'name': 'body_part',
'attrs': {
'long_name': 'Body part',
'standard_name': 'body_part_tbd'
},
'dtype': 'body_part_t'
}
},
'sed': {
'sed_type': {
'name': 'sed_type',
'attrs': {
'long_name': 'Sediment type',
'standard_name': 'sediment_type_tbd'
},
'dtype': 'sed_type_t'
},
'top': {
'name': 'top',
'attrs': {
'long_name': 'Top depth of sediment layer',
'standard_name': 'top_depth_of_sediment_layer_tbd'
},
'dtype': 'f4'
},
'bottom': {
'name': 'bottom',
'attrs': {
'long_name': 'Bottom depth of sediment layer',
'standard_name': 'bottom_depth_of_sediment_layer_tbd'
},
'dtype': 'f4'
},
},
'suffixes': {
'uncertainty': {
'name': '_unc',
'attrs': {
'long_name': ' uncertainty',
'standard_name': '_uncertainty'
},
'dtype': 'f4'
},
'detection_limit': {
'name': '_dl',
'attrs': {
'long_name': ' detection limit',
'standard_name': '_detection_limit'
},
'dtype': 'dl_t'
},
'volume': {
'name': '_vol',
'attrs': {
'long_name': ' volume',
'standard_name': '_volume'
},
'dtype': 'f4'
},
'salinity': {
'name': '_sal',
'attrs': {
'long_name': ' salinity',
'standard_name': '_sal'
},
'dtype': 'f4'
},
'temperature': {
'name': '_temp',
'attrs': {
'long_name': ' temperature',
'standard_name': '_temp'
},
'dtype': 'f4'
},
'filtered': {
'name': '_filt',
'attrs': {
'long_name': ' filtered',
'standard_name': '_filtered'
},
'dtype': 'filt_t'
},
'counting_method': {
'name': '_counmet',
'attrs': {
'long_name': ' counting method',
'standard_name': '_counting_method'
},
'dtype': 'counmet_t'
},
'sampling_method': {
'name': '_sampmet',
'attrs': {
'long_name': ' sampling method',
'standard_name': '_sampling_method'
},
'dtype': 'sampmet_t'
},
'preparation_method': {
'name': '_prepmet',
'attrs': {
'long_name': ' preparation method',
'standard_name': '_preparation_method'
},
'dtype': 'prepmet_t'
},
'unit': {
'name': '_unit',
'attrs': {
'long_name': ' unit',
'standard_name': '_unit'
},
'dtype': 'unit_t'
}
}
},
'enums': [
{
'name': 'area_t',
'fname': 'dbo_area.xlsx',
'key': 'displayName',
'value':'areaId'
},
{
'name': 'bio_group_t',
'fname': 'dbo_biogroup.xlsx',
'key': 'biogroup',
'value':'biogroup_id'
},
{
'name': 'body_part_t',
'fname': 'dbo_bodypar.xlsx',
'key': 'bodypar',
'value':'bodypar_id'
},
{
'name': 'species_t',
'fname': 'dbo_species_cleaned.xlsx',
'key': 'species',
'value':'species_id'
},
{
'name': 'sed_type_t',
'fname': 'dbo_sedtype.xlsx',
'key': 'sedtype',
'value':'sedtype_id'
},
{
'name': 'unit_t',
'fname': 'dbo_unit.xlsx',
'key': 'unit_sanitized',
'value':'unit_id'
},
{
'name': 'dl_t',
'fname': 'dbo_detectlimit.xlsx',
'key': 'name_sanitized',
'value':'id'
},
{
'name': 'filt_t',
'fname': 'dbo_filtered.xlsx',
'key': 'name',
'value':'id'
},
{
'name': 'counmet_t',
'fname': 'dbo_counmet.xlsx',
'key': 'counmet',
'value':'counmet_id'
},
{
'name': 'sampmet_t',
'fname': 'dbo_sampmet.xlsx',
'key': 'sampmet',
'value':'sampmet_id'
},
{
'name': 'prepmet_t',
'fname': 'dbo_prepmet.xlsx',
'key': 'prepmet',
'value':'prepmet_id'
}
]
}
{ 'area': { 'attrs': { 'long_name': 'Marine area/region id',
'standard_name': 'area_id'},
'dtype': 'area_t',
'name': 'area'},
'lat': { 'attrs': { '_CoordinateAxisType': 'Lat',
'axis': 'Y',
'long_name': 'Measurement latitude',
'standard_name': 'latitude',
'units': 'degrees_north'},
'dtype': 'f4',
'name': 'lat'},
'lon': { 'attrs': { '_CoordinateAxisType': 'Lon',
'axis': 'X',
'long_name': 'Measurement longitude',
'standard_name': 'longitude',
'units': 'degrees_east'},
'dtype': 'f4',
'name': 'lon'},
'smp_depth': { 'attrs': { 'axis': 'Z',
'long_name': 'Sample depth below sea level',
'standard_name': 'sample_depth_below_sea_floor',
'units': 'm'},
'dtype': 'f4',
'name': 'smp_depth'},
'time': { 'attrs': { 'abbreviation': 'Date/Time',
'axis': 'T',
'calendar': 'gregorian',
'long_name': 'Time of measurement',
'standard_name': 'time',
'time_origin': '1970-01-01 00:00:00',
'time_zone': 'UTC',
'units': 'seconds since 1970-01-01 00:00:00.0'},
'dtype': 'u8',
'name': 'time'},
'tot_depth': { 'attrs': { 'axis': 'Z',
'long_name': 'Total depth below sea level',
'standard_name': 'total_depth_below_sea_floor',
'units': 'm'},
'dtype': 'f4',
'name': 'tot_depth'}}
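Every entry under suffixes in CONFIGS_CDL has a name starting with an underscore and a long_name starting with a space, which suggests that these values are meant to be appended to a base variable. The sketch below illustrates that reading only. The helper with_suffix, the base variable name cs137 and its long name are assumptions made for the example and are not taken from marisco.

```python
# One entry copied from CONFIGS_CDL['vars']['suffixes'].
suffixes = {
    'uncertainty': {
        'name': '_unc',
        'attrs': {'long_name': ' uncertainty', 'standard_name': '_uncertainty'},
        'dtype': 'f4',
    },
}

def with_suffix(base_name: str, base_long_name: str, suffix_cfg: dict) -> dict:
    """Hypothetical helper: derive a companion variable from a base variable."""
    return {
        'name': base_name + suffix_cfg['name'],
        'long_name': base_long_name + suffix_cfg['attrs']['long_name'],
        'dtype': suffix_cfg['dtype'],
    }

# 'cs137' and 'Caesium 137' are made-up inputs for illustration.
var = with_suffix('cs137', 'Caesium 137', suffixes['uncertainty'])
print(var)
```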
cdl_cfg ()
Return the CDL (Common Data Language) configuration as a dictionary.
grp_names ()
Return the group names as defined in cdl.toml.
species_lut_path ()
Return the path to the species lookup table.
bodyparts_lut_path ()
Return the path to the body parts lookup table.
biogroup_lut_path ()
Return the path to the biota group lookup table.
sediments_lut_path ()
Return the path to the sediment type lookup table.
unit_lut_path ()
Return the path to the unit lookup table.
detection_limit_lut_path ()
Return the path to the detection limit lookup table.
filtered_lut_path ()
Return the path to the filtered lookup table.
area_lut_path ()
Return the path to the area lookup table.
name2grp (name:str, cdl:dict)
| | Type | Details |
|---|---|---|
| name | str | Group name |
| cdl | dict | CDL configuration |
Example:
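A minimal sketch of such a lookup, assuming that name2grp maps a group name such as 'seawater' back to its key in the grps section of the CDL configuration ('sea'). That behaviour is inferred from the signature and from CONFIGS_CDL; name2grp_sketch is a hypothetical name and the actual implementation may differ.

```python
def name2grp_sketch(name: str, cdl: dict) -> str:
    """Hypothetical: return the `grps` key whose 'name' equals `name`."""
    name_to_key = {grp['name']: key for key, grp in cdl['grps'].items()}
    return name_to_key[name]

# Group definitions copied from CONFIGS_CDL['grps'].
cdl_sample = {
    'grps': {
        'sea': {'name': 'seawater', 'id': 1},
        'bio': {'name': 'biota', 'id': 2},
        'sed': {'name': 'sediment', 'id': 3},
        'sus': {'name': 'suspended-matter', 'id': 4},
    }
}

print(name2grp_sketch('seawater', cdl_sample))  # sea
print(name2grp_sketch('biota', cdl_sample))     # bio
```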
nc_tpl_name ()
Return the name of the MARIS NetCDF template as defined in configs.toml.
nc_tpl_path ()
Return the path of the MARIS NetCDF template as defined in configs.toml.
Enumeration types are used to avoid using strings as NetCDF4 variable values. Instead, enumeration types (lookup tables) such as {'Crustaceans': 2, 'Echinoderms': 3, ...} are prepended to the NetCDF file template and the associated ids (integers) are used as values.
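To make the idea concrete, the sketch below encodes a list of biota group labels as integer ids using a small lookup table. It uses plain Python only and does not touch NetCDF; the variable names are illustrative and the two entries are the ones quoted above.

```python
# Two entries quoted from the biota group enumeration.
bio_group_t = {'Crustaceans': 2, 'Echinoderms': 3}

# Labels as they might appear in a source dataset (made-up sample).
labels = ['Echinoderms', 'Crustaceans', 'Echinoderms']

# Store the integer id instead of the string label.
encoded = [bio_group_t[label] for label in labels]
print(encoded)  # [3, 2, 3]
```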
sanitize (s:str|float)
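Sanitize dictionary key to comply with NetCDF enumeration type:
- Remove (, ) and .
- Replace / and - with a space
- Strip the string
- Return the original value if it is missing (e.g. NaN)

| | Type | Details |
|---|---|---|
| s | str \| float | String or float to sanitize |
| Returns | str \| float | Sanitized string or original float |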
import re
import pandas as pd

def sanitize(
    s: str|float # String or float to sanitize
    ) -> str|float: # Sanitized string or original float
    """
    Sanitize dictionary key to comply with NetCDF enumeration type:
    - Remove `(`, `)`, `.`
    - Replace `/`, `-` with a space
    - Strip the string
    - Return the original value if it is missing (e.g., NaN)
    - Convert any other value to a stripped string
    """
    if isinstance(s, str):
        s = re.sub(r'[().]', '', s)
        return re.sub(r'[/-]', ' ', s).strip()
    elif pd.isna(s): # This covers np.nan, None, and pandas NaT
        return s
    else:
        return str(s).strip()
For example:
NetCDF4 enumeration types do not seem to accept keys containing non-alphanumeric characters such as parentheses, dots or slashes. As a result, MARIS lookup tables need to be sanitized.
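The effect of the sanitization rules can be checked on a few made-up keys. The function below is a simplified copy of sanitize that swaps pd.isna for a standard-library check so the snippet runs without pandas; the name sanitize_sketch and the sample keys are illustrative only.

```python
import math
import re

def sanitize_sketch(s):
    """Simplified copy of `sanitize` using only the standard library."""
    if isinstance(s, str):
        s = re.sub(r'[().]', '', s)             # drop parentheses and dots
        return re.sub(r'[/-]', ' ', s).strip()  # slashes and hyphens become spaces
    if s is None or (isinstance(s, float) and math.isnan(s)):
        return s                                # missing values pass through
    return str(s).strip()

print(sanitize_sketch('key (sanitized)'))  # key sanitized
print(sanitize_sketch('key san.itized'))   # key sanitized
print(sanitize_sketch('key-sanitized'))    # key sanitized
print(sanitize_sketch('key/sanitized'))    # key sanitized
```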
get_lut (src_dir:str, fname:str, key:str, value:str, do_sanitize:bool=True, reverse:bool=False)
Convert MARIS db lookup table excel file to dictionary {'name': id, ...} or {id: name, ...} if reverse is True.
| | Type | Default | Details |
|---|---|---|---|
| src_dir | str | | Directory containing lookup tables |
| fname | str | | Excel file lookup table name |
| key | str | | Excel file column name to be used as dict keys |
| value | str | | Excel file column name to be used as dict values |
| do_sanitize | bool | True | Sanitization required? |
| reverse | bool | False | Reverse lookup table (value, key) |
| Returns | dict | | MARIS lookup table (key, value) |
from pathlib import Path
import pandas as pd

def get_lut(
    src_dir: str, # Directory containing lookup tables
    fname: str, # Excel file lookup table name
    key: str, # Excel file column name to be used as dict keys
    value: str, # Excel file column name to be used as dict values
    do_sanitize: bool=True, # Sanitization required?
    reverse: bool=False # Reverse lookup table (value, key)
    ) -> dict: # MARIS lookup table (key, value)
    "Convert MARIS db lookup table excel file to dictionary `{'name': id, ...}` or `{id: name, ...}` if `reverse` is True."
    fname = Path(src_dir) / fname
    # Keep only the two columns of interest and drop rows without an id
    df = pd.read_excel(fname, usecols=[key, value]).dropna(subset=value)
    df[value] = df[value].astype('int')
    df = df.set_index(key)
    lut = df[value].to_dict()
    if do_sanitize:
        lut = {sanitize(k): v for k, v in lut.items()}
    # `try_int` is a helper defined elsewhere in the package
    lut = {try_int(k): try_int(v) for k, v in lut.items()}
    return {v: k for k, v in lut.items()} if reverse else lut
For example:
lut_src_dir = './files/lut'
get_lut(lut_src_dir, 'dbo_biogroup.xlsx', key='biogroup', value='biogroup_id', reverse=False)
{'Not applicable': -1,
'Not available': 0,
'Birds': 1,
'Crustaceans': 2,
'Echinoderms': 3,
'Fish': 4,
'Mammals': 5,
'Molluscs': 6,
'Others': 7,
'Plankton': 8,
'Polychaete worms': 9,
'Reptile': 10,
'Seaweeds and plants': 11,
'Cephalopods': 12,
'Gastropods': 13,
'Bivalves': 14}
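With reverse=True, get_lut returns the same table with keys and values swapped, as in its last line. The sketch below reproduces only that final step on a few entries copied from the output above, so it runs without the Excel files; the variable names are illustrative.

```python
# A few entries copied from the biota group lookup table above.
lut = {'Not applicable': -1, 'Not available': 0, 'Birds': 1, 'Fish': 4}

# Same expression as the last line of get_lut when reverse=True.
reversed_lut = {v: k for k, v in lut.items()}

print(reversed_lut[4])   # Fish
print(reversed_lut[-1])  # Not applicable
```

Note that swapping keys and values only round-trips cleanly when the ids are unique; if two names shared an id, the later one would overwrite the earlier one.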
Enums (lut_src_dir:str, cdl_enums:dict)
Return dictionaries of MARIS NetCDF’s enumeration types.
| | Type | Details |
|---|---|---|
| lut_src_dir | str | Directory containing lookup tables |
| cdl_enums | dict | CDL configuration enumeration types |
import fastcore.all as fc

class Enums():
    "Return dictionaries of MARIS NetCDF's enumeration types."
    def __init__(self,
                 lut_src_dir:str, # Directory containing lookup tables
                 cdl_enums:dict # CDL configuration enumeration types
                 ):
        fc.store_attr()
        self.types = self.lookup()

    def filter(self, name, values):
        # Keep only the entries of enumeration `name` whose id is in `values`
        return {name: id for name, id in self.types[name].items() if id in values}

    def lookup(self):
        # Build one lookup table per enumeration declared in the CDL configuration
        types = {}
        for enum in self.cdl_enums:
            name, fname, key, value = enum.values()
            lut = get_lut(self.lut_src_dir, fname, key=key, value=value)
            types[name] = lut
        return types
{'Not applicable': -1,
'Not available': 0,
'Detected value': 1,
'Detection limit': 2,
'Not detected': 3,
'Derived': 4}
{'Not applicable': -1,
'NOT AVAILABLE': 0,
'Bq per m3': 1,
'Bq per m2': 2,
'Bq per kg': 3,
'Bq per kgd': 4,
'Bq per kgw': 5,
'kg per kg': 6,
'TU': 7,
'DELTA per mill': 8,
'atom per kg': 9,
'atom per kgd': 10,
'atom per kgw': 11,
'atom per l': 12,
'Bq per kgC': 13}
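The filter method of Enums keeps only the entries of one enumeration whose id appears in a given collection of values. The standalone sketch below mirrors that dictionary comprehension on a few biota group entries shown earlier; filter_enum is a hypothetical free function used here so the example runs without the lookup files.

```python
# A few entries copied from the biota group lookup table.
bio_group_t = {'Birds': 1, 'Crustaceans': 2, 'Echinoderms': 3, 'Fish': 4}

def filter_enum(enum: dict, values) -> dict:
    """Keep only the entries of `enum` whose id is in `values`."""
    wanted = set(values)
    return {name: id_ for name, id_ in enum.items() if id_ in wanted}

# Ids that might occur in a given dataset (made-up sample).
print(filter_enum(bio_group_t, [2, 4]))  # {'Crustaceans': 2, 'Fish': 4}
```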
get_enum_dicts (lut_src_dir:str, cdl_enums:dict, **kwargs)
Return a dict of NetCDF enumeration types.
| | Type | Details |
|---|---|---|
| lut_src_dir | str | Directory containing lookup tables |
| cdl_enums | dict | CDL configuration enumeration types |
| kwargs | | Additional arguments passed to get_lut |
def get_enum_dicts(
    lut_src_dir:str, # Directory containing lookup tables
    cdl_enums:dict, # CDL configuration enumeration types
    **kwargs # Additional arguments
    ):
    "Return a dict of NetCDF enumeration types."
    enum_types = {}
    for enum in cdl_enums:
        name, fname, key, value = enum.values()
        lut = get_lut(lut_src_dir, fname, key=key, value=value, **kwargs)
        enum_types[name] = lut
    return enum_types
For example:
lut_src_dir_test = './files/lut'
cdl_enums_test = read_toml('./files/cdl.toml')['enums']
enums = get_enum_dicts(lut_src_dir=lut_src_dir_test,
cdl_enums=cdl_enums_test)
enums.keys()
dict_keys(['area_t', 'bio_group_t', 'body_part_t', 'species_t', 'sed_type_t', 'unit_t', 'dl_t', 'filt_t', 'counmet_t', 'sampmet_t', 'prepmet_t'])
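Both Enums.lookup and get_enum_dicts unpack each entry with enum.values(), which relies on the keys appearing in the order name, fname, key, value. A slightly more defensive variant, sketched below, reads the fields by key instead. This is only a suggestion: get_enum_dicts_sketch is a hypothetical name, and fake_loader is a stand-in for get_lut so that the example runs without Excel files.

```python
from operator import itemgetter

def get_enum_dicts_sketch(cdl_enums, loader):
    """Hypothetical variant of get_enum_dicts that reads fields by key."""
    fields = itemgetter('name', 'fname', 'key', 'value')
    enum_types = {}
    for enum in cdl_enums:
        name, fname, key, value = fields(enum)  # independent of key order
        enum_types[name] = loader(fname, key, value)
    return enum_types

def fake_loader(fname, key, value):
    """Stand-in for get_lut: just records what would be loaded."""
    return {'fname': fname, 'key': key, 'value': value}

# Same content as the 'filt_t' entry of CONFIGS_CDL, with keys shuffled on purpose.
cdl_enums = [{'fname': 'dbo_filtered.xlsx', 'value': 'id', 'name': 'filt_t', 'key': 'name'}]

enums = get_enum_dicts_sketch(cdl_enums, fake_loader)
print(enums)
```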