This module defines several dictionaries used to generate the .toml configuration files copied to the .marisco folder under your home directory, along with associated utility functions. These .toml files can then be adapted to your specific needs.

Configuration files



 base_path ()

Return the path to the .marisco folder under your home directory.

from pathlib import Path

MARISCO_CFG_DIRNAME = '.marisco'
CFG_FNAME = 'configs.toml'
CDL_FNAME = 'cdl.toml'
NUCLIDE_LOOKUP_FNAME = 'dbo_nuclide.xlsx'

def base_path(): 
    "Return the path to the `.marisco` folder under your home directory."
    return Path.home() / MARISCO_CFG_DIRNAME

By default, we create a folder named .marisco under your home directory (the path returned by base_path()) to receive all configuration files.

import os

CONFIGS = {
    'gh': {
        'owner': 'franckalbinet',
        'repo': 'marisco'
    },
    'names': {
        'nc_template': ''
    },
    'dirs': {
        'lut': str(base_path() / 'lut'), # Look-up tables
        'cache': str(base_path() / 'cache'), # Cache (e.g. WoRMS species)
        'tmp': str(base_path() / 'tmp')
    },
    'paths': {
        'luts': 'nbs/files/lut'
    },
    'units': {
        'time': 'seconds since 1970-01-01 00:00:00.0'
    },
    'zotero': {
        'api_key': os.getenv('ZOTERO_API_KEY'),
        'lib_id': '2432820'
    }
}

The CONFIGS dictionary defines general settings:

| Key | Value | Description |
|---|---|---|
| dirs/lut | /Users/franckalbinet/.marisco/lut | Location and name of the directory receiving lookup tables. |
| dirs/cache | /Users/franckalbinet/.marisco/cache | Location and name of the directory receiving cache files, such as retrieved WoRMS species. |
| dirs/tmp | /Users/franckalbinet/.marisco/tmp | Location and name of the directory receiving temporary files. |
| gh/owner | franckalbinet | GitHub account owner. |
| gh/repo | marisco | GitHub repository used to download specific files (e.g. lookup tables) during installation. |
| names/nc_template | | Name of the MARIS NetCDF4 template. |
| paths/luts | nbs/files/lut | GitHub repository directory containing lookup tables. |
| units/time | seconds since 1970-01-01 00:00:00.0 | Reference date and time used for NetCDF time encoding. |
| zotero/api_key | your-zotero-api-key | Zotero API key (read from the “ZOTERO_API_KEY” environment variable). |
| zotero/lib_id | 2432820 | Zotero library ID. |
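
The slash-separated keys used in this table (for example dirs/lut) mirror the nesting of the CONFIGS dictionary. As a minimal sketch of that correspondence, the `flatten` helper below is hypothetical (it is not part of marisco) and turns a nested dictionary into such keys:

```python
def flatten(d: dict, parent: str = '') -> dict:
    "Flatten a nested dict into {'a/b': value} pairs (hypothetical helper)."
    flat = {}
    for k, v in d.items():
        key = f'{parent}/{k}' if parent else k
        if isinstance(v, dict):
            flat.update(flatten(v, key))  # recurse into nested sections
        else:
            flat[key] = v
    return flat

# Subset of the CONFIGS dictionary
configs_subset = {
    'gh': {'owner': 'franckalbinet', 'repo': 'marisco'},
    'units': {'time': 'seconds since 1970-01-01 00:00:00.0'},
}

flat_configs = flatten(configs_subset)
print(flat_configs)
```

Each leaf value ends up under a key built from its section and option name, which is how the rows of the table can be read.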

The main CONFIGS_CDL dictionary is used to generate a NetCDF CDL (Common Data Language) .toml file, which is in turn used to generate a template MARIS NetCDF file. For further details, refer to the configs.ipynb file.

The vars/defaults section is printed below:



 cfg ()

Return the configuration as a dictionary.

def cfg(): 
    "Return the configuration as a dictionary."
    return read_toml(base_path() / CFG_FNAME)



 nuc_lut_path ()

Return the path to the nuclide lookup table.

def nuc_lut_path(): 
    "Return the path to the nuclide lookup table."
    return Path(cfg()['dirs']['lut']) / NUCLIDE_LOOKUP_FNAME



 lut_path ()

Return the path to the lookup tables directory.

def lut_path(): 
    "Return the path to the lookup tables directory."
    return Path(cfg()['dirs']['lut'])



 cache_path ()

Return the path to the cache directory.

def cache_path(): 
    "Return the path to the cache directory."
    return Path(cfg()['dirs']['cache'])

CONFIGS_CDL = {
    'placeholder': '_to_be_filled_in_',
    'grps': {
        'sea': {
            'name': 'seawater',
            'id': 1
        },
        'bio': {
            'name': 'biota',
            'id': 2
        },
        'sed': {
            'name': 'sediment',
            'id': 3
        },
        'sus': {
            'name': 'suspended-matter',
            'id': 4
        }
    },
    'global_attrs': {
        # Do not update keys. Only values if required
        'id': '', # zotero?
        'title': '',
        'summary': '',
        'keywords': '',
        'keywords_vocabulary': 'GCMD Science Keywords',
        'keywords_vocabulary_url': '',
        'record': '',
        'featureType': '',
        'cdm_data_type': '',

        # Conventions
        'Conventions': 'CF-1.10 ACDD-1.3',

        # Publisher [ACDD1.3]
        'publisher_name': 'Paul MCGINNITY, Iolanda OSVATH, Florence DESCROIX-COMANDUCCI',
        'publisher_email': ',,', 
        'publisher_url': '',
        'publisher_institution': 'International Atomic Energy Agency - IAEA', 

        # Creator info [ACDD1.3]
        'creator_name': '',
        'institution': '',
        'metadata_link': '',
        'creator_email': '',
        'creator_url': '',
        'references': '',
        'license': ' '.join(['Without prejudice to the applicable Terms and Conditions', 
                             'I hereby agree that any use of the data will contain appropriate',
                             'acknowledgement of the data source(s) and the IAEA Marine',
                             'Radioactivity Information System (MARIS).']),
        'comment': '',
        # Dataset info & coordinates [ACDD1.3]
        #'project': '', # Network long name
        #'platform': '', # Should be a long / full name
        'geospatial_lat_min': '', 
        'geospatial_lon_min': '',
        'geospatial_lat_max': '',
        'geospatial_lon_max': '',
        'geospatial_vertical_min': '',
        'geospatial_vertical_max': '',
        'geospatial_bounds': '', # wkt representation
        'geospatial_bounds_crs': 'EPSG:4326',

        # Time information
        'time_coverage_start': '',
        'time_coverage_end': '',
        #'time_coverage_resolution': '',
        'local_time_zone': '',
        'date_created': '',
        'date_modified': '',
        # -- Additional metadata (custom to MARIS)
        'publisher_postprocess_logs': ''
    },
    'dim': {
        'name': 'sample',
        'attrs': {
            'long_name': 'Sample ID of measurement'
        },
        'dtype': 'u8'
    },
    'vars': {
        'defaults': {
            'lon': {
                'name': 'lon',
                'attrs': {
                    'long_name': 'Measurement longitude',
                    'standard_name': 'longitude',
                    'units': 'degrees_east',
                    'axis': 'X',
                    '_CoordinateAxisType': 'Lon'
                },
                'dtype': 'f4'
            },
            'lat': {
                'name': 'lat',
                'attrs': {
                    'long_name': 'Measurement latitude',
                    'standard_name': 'latitude',
                    'units': 'degrees_north',
                    'axis': 'Y',
                    '_CoordinateAxisType': 'Lat'
                },
                'dtype': 'f4'
            },
            'smp_depth': {
                'name': 'smp_depth',
                'attrs': {
                    'long_name': 'Sample depth below sea level',
                    'standard_name': 'sample_depth_below_sea_floor',
                    'units': 'm',
                    'axis': 'Z'
                },
                'dtype': 'f4'
            },
            'tot_depth': {
                'name': 'tot_depth',
                'attrs': {
                    'long_name': 'Total depth below sea level',
                    'standard_name': 'total_depth_below_sea_floor',
                    'units': 'm',
                    'axis': 'Z'
                },
                'dtype': 'f4'
            },
            'time': {
                'name': 'time',
                'attrs': {
                    'long_name': 'Time of measurement',
                    'standard_name': 'time',
                    'units': 'seconds since 1970-01-01 00:00:00.0',
                    'time_origin': '1970-01-01 00:00:00',
                    'time_zone': 'UTC',
                    'abbreviation': 'Date/Time',
                    'axis': 'T',
                    'calendar': 'gregorian'
                },
                'dtype': 'u8',
            },
            'area': {
                'name': 'area',
                'attrs': {
                    'long_name': 'Marine area/region id',
                    'standard_name': 'area_id'
                },
                'dtype': 'area_t'
            }
        },
        'bio': {
            'bio_group': {
                'name': 'bio_group',
                'attrs': {
                    'long_name': 'Biota group',
                    'standard_name': 'biota_group_tbd'
                },
                'dtype': 'bio_group_t'
            },
            'species': {
                'name': 'species',
                'attrs': {
                    'long_name': 'Species',
                    'standard_name': 'species'
                },
                'dtype': 'species_t'
            },
            'body_part': {
                'name': 'body_part',
                'attrs': {
                    'long_name': 'Body part',
                    'standard_name': 'body_part_tbd'
                },
                'dtype': 'body_part_t'
            }
        },
        'sed': {
            'sed_type': {
                'name': 'sed_type',
                'attrs': {
                    'long_name': 'Sediment type',
                    'standard_name': 'sediment_type_tbd'
                },
                'dtype': 'sed_type_t'
            },
            'top': {
                'name': 'top',
                'attrs': {
                    'long_name': 'Top depth of sediment layer',
                    'standard_name': 'top_depth_of_sediment_layer_tbd'
                },
                'dtype': 'f4'
            },
            'bottom': {
                'name': 'bottom',
                'attrs': {
                    'long_name': 'Bottom depth of sediment layer',
                    'standard_name': 'bottom_depth_of_sediment_layer_tbd'
                },
                'dtype': 'f4'
            }
        },
        'suffixes':  {
            'uncertainty': {
                'name': '_unc',
                'attrs': {
                    'long_name': ' uncertainty',
                    'standard_name': '_uncertainty'
                },
                'dtype': 'f4'
            },
            'detection_limit': {
                'name': '_dl',
                'attrs': {
                    'long_name': ' detection limit',
                    'standard_name': '_detection_limit'
                },
                'dtype': 'dl_t'
            },
            'volume': {
                'name': '_vol',
                'attrs': {
                    'long_name': ' volume',
                    'standard_name': '_volume'
                },
                'dtype': 'f4'
            },
            'salinity': {
                'name': '_sal',
                'attrs': {
                    'long_name': ' salinity',
                    'standard_name': '_sal'
                },
                'dtype': 'f4'
            },
            'temperature': {
                'name': '_temp',
                'attrs': {
                    'long_name': ' temperature',
                    'standard_name': '_temp'
                },
                'dtype': 'f4'
            },
            'filtered': {
                'name': '_filt',
                'attrs': {
                    'long_name': ' filtered',
                    'standard_name': '_filtered'
                },
                'dtype': 'filt_t'
            },
            'counting_method': {
                'name': '_counmet',
                'attrs': {
                    'long_name': ' counting method',
                    'standard_name': '_counting_method'
                },
                'dtype': 'counmet_t'
            },
            'sampling_method': {
                'name': '_sampmet',
                'attrs': {
                    'long_name': ' sampling method',
                    'standard_name': '_sampling_method'
                },
                'dtype': 'sampmet_t'
            },
            'preparation_method': {
                'name': '_prepmet',
                'attrs': {
                    'long_name': ' preparation method',
                    'standard_name': '_preparation_method'
                },
                'dtype': 'prepmet_t'
            },
            'unit': {
                'name': '_unit',
                'attrs': {
                    'long_name': ' unit',
                    'standard_name': '_unit'
                },
                'dtype': 'unit_t'
            }
        }
    },
    'enums': [
        {
            'name': 'area_t', 
            'fname': 'dbo_area.xlsx', 
            'key': 'displayName', 
        },
        {
            'name': 'bio_group_t', 
            'fname': 'dbo_biogroup.xlsx', 
            'key': 'biogroup', 
        },
        {
            'name': 'body_part_t', 
            'fname': 'dbo_bodypar.xlsx', 
            'key': 'bodypar', 
        },
        {
            'name': 'species_t', 
            'fname': 'dbo_species_cleaned.xlsx', 
            'key': 'species', 
        },
        {
            'name': 'sed_type_t', 
            'fname': 'dbo_sedtype.xlsx', 
            'key': 'sedtype', 
        },
        {
            'name': 'unit_t', 
            'fname': 'dbo_unit.xlsx', 
            'key': 'unit_sanitized', 
        },
        {
            'name': 'dl_t', 
            'fname': 'dbo_detectlimit.xlsx', 
            'key': 'name_sanitized', 
        },
        {
            'name': 'filt_t', 
            'fname': 'dbo_filtered.xlsx', 
            'key': 'name',
        },
        {
            'name': 'counmet_t', 
            'fname': 'dbo_counmet.xlsx', 
            'key': 'counmet',
        },
        {
            'name': 'sampmet_t', 
            'fname': 'dbo_sampmet.xlsx', 
            'key': 'sampmet',
        },
        {
            'name': 'prepmet_t', 
            'fname': 'dbo_prepmet.xlsx', 
            'key': 'prepmet',
        }
    ]
}

{ 'area': { 'attrs': { 'long_name': 'Marine area/region id',
                       'standard_name': 'area_id'},
            'dtype': 'area_t',
            'name': 'area'},
  'lat': { 'attrs': { '_CoordinateAxisType': 'Lat',
                      'axis': 'Y',
                      'long_name': 'Measurement latitude',
                      'standard_name': 'latitude',
                      'units': 'degrees_north'},
           'dtype': 'f4',
           'name': 'lat'},
  'lon': { 'attrs': { '_CoordinateAxisType': 'Lon',
                      'axis': 'X',
                      'long_name': 'Measurement longitude',
                      'standard_name': 'longitude',
                      'units': 'degrees_east'},
           'dtype': 'f4',
           'name': 'lon'},
  'smp_depth': { 'attrs': { 'axis': 'Z',
                            'long_name': 'Sample depth below sea level',
                            'standard_name': 'sample_depth_below_sea_floor',
                            'units': 'm'},
                 'dtype': 'f4',
                 'name': 'smp_depth'},
  'time': { 'attrs': { 'abbreviation': 'Date/Time',
                       'axis': 'T',
                       'calendar': 'gregorian',
                       'long_name': 'Time of measurement',
                       'standard_name': 'time',
                       'time_origin': '1970-01-01 00:00:00',
                       'time_zone': 'UTC',
                       'units': 'seconds since 1970-01-01 00:00:00.0'},
            'dtype': 'u8',
            'name': 'time'},
  'tot_depth': { 'attrs': { 'axis': 'Z',
                            'long_name': 'Total depth below sea level',
                            'standard_name': 'total_depth_below_sea_floor',
                            'units': 'm'},
                 'dtype': 'f4',
                 'name': 'tot_depth'}}

write_toml(Path('./files') / CDL_FNAME, CONFIGS_CDL)
Creating files/cdl.toml



 cdl_cfg ()

Return the CDL (Common Data Language) configuration as a dictionary.

def cdl_cfg():
    "Return the CDL (Common Data Language) configuration as a dictionary."
    try:
        return read_toml(base_path() / CDL_FNAME)
    except FileNotFoundError:
        return CONFIGS_CDL



 grp_names ()

Return the group names as defined in cdl.toml.

def grp_names(): 
    "Return the group names as defined in `cdl.toml`."
    return [v['name'] for v in cdl_cfg()['grps'].values()]



 species_lut_path ()

Return the path to the species lookup table.

def species_lut_path():
    "Return the path to the species lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'species_t'][0]['fname']
    return src_dir / fname



 bodyparts_lut_path ()

Return the path to the body parts lookup table.

def bodyparts_lut_path():
    "Return the path to the body parts lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'body_part_t'][0]['fname']
    return src_dir / fname



 biogroup_lut_path ()

Return the path to the biota group lookup table.

def biogroup_lut_path():
    "Return the path to the biota group lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'bio_group_t'][0]['fname']
    return src_dir / fname



 sediments_lut_path ()

Return the path to the sediment type lookup table.

def sediments_lut_path():
    "Return the path to the sediment type lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'sed_type_t'][0]['fname']
    return src_dir / fname



 unit_lut_path ()

Return the path to the unit lookup table.

def unit_lut_path():
    "Return the path to the unit lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'unit_t'][0]['fname']
    return src_dir / fname



 detection_limit_lut_path ()

Return the path to the detection limit lookup table.

def detection_limit_lut_path():
    "Return the path to the detection limit lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'dl_t'][0]['fname']
    return src_dir / fname



 filtered_lut_path ()

Return the path to the filtered lookup table.

def filtered_lut_path():
    "Return the path to the filtered lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'filt_t'][0]['fname']
    return src_dir / fname



 area_lut_path ()

Return the path to the area lookup table.

def area_lut_path():
    "Return the path to the area lookup table."
    src_dir = lut_path()
    fname = [enum for enum in cdl_cfg()['enums'] if enum['name'] == 'area_t'][0]['fname']
    return src_dir / fname
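
The lookup table path helpers above all share one pattern: find the enum entry by its type name, then join its file name to the lookup table directory. A minimal sketch of that shared pattern follows; `enum_lut_path` is a hypothetical helper (not part of marisco) and the `lut` directory is a placeholder:

```python
from pathlib import Path

# Two entries taken from the `enums` section of the CDL configuration
enums = [
    {'name': 'species_t', 'fname': 'dbo_species_cleaned.xlsx', 'key': 'species'},
    {'name': 'unit_t', 'fname': 'dbo_unit.xlsx', 'key': 'unit_sanitized'},
]

def enum_lut_path(src_dir: Path, enums: list, name: str) -> Path:
    "Return the lookup table path for the enum called `name` (hypothetical helper)."
    # next() stops at the first match and raises StopIteration if there is none
    fname = next(e['fname'] for e in enums if e['name'] == name)
    return src_dir / fname

species_path = enum_lut_path(Path('lut'), enums, 'species_t')
print(species_path)
```

A single parameterised helper like this would avoid repeating the list comprehension in every function, at the cost of less discoverable function names.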

Utility functions



 name2grp (name:str, cdl:dict)
| | Type | Details |
|---|---|---|
| name | str | Group name |
| cdl | dict | CDL configuration |
    'u8': int,
    'f4': float
def name2grp(
    name: str, # Group name
    cdl: dict, # CDL configuration
    ):
    # Reverse the `cdl.toml` group dict so that a group's config key can be retrieved from its name
    return {v['name']:k  for k, v in cdl['grps'].items()}[name]


name2grp('seawater', cdl=CONFIGS_CDL)
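
To make the reverse lookup concrete, the following self-contained sketch rebuilds the group section of the CDL configuration and inverts it in the same way as name2grp (the variable names are illustrative only):

```python
# Group section as defined in the CDL configuration
cdl = {
    'grps': {
        'sea': {'name': 'seawater', 'id': 1},
        'bio': {'name': 'biota', 'id': 2},
        'sed': {'name': 'sediment', 'id': 3},
        'sus': {'name': 'suspended-matter', 'id': 4},
    }
}

# Invert {key: {'name': ...}} into {name: key}
name_to_key = {v['name']: k for k, v in cdl['grps'].items()}
print(name_to_key['seawater'])  # sea
```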



 nc_tpl_name ()

Return the name of the MARIS NetCDF template as defined in configs.toml

def nc_tpl_name():
    "Return the name of the MARIS NetCDF template as defined in `configs.toml`"
    p = base_path()
    return read_toml(p / 'configs.toml')['names']['nc_template']



 nc_tpl_path ()

Return the path of the MARIS NetCDF template as defined in configs.toml

def nc_tpl_path():
    "Return the path of the MARIS NetCDF template as defined in `configs.toml`"
    p = base_path()
    return p / read_toml(p / 'configs.toml')['names']['nc_template']

Enumeration types

Enumeration types are used to avoid using strings as NetCDF4 variable values. Instead, enumeration types (lookup tables) such as {'Crustaceans': 2, 'Echinoderms': 3, ...} are prepended to the NetCDF file template and associated ids (integers) are used as values.
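
As a small illustration of the idea, the sketch below encodes string labels as integer ids using a subset of the biota group lookup table. The sample values are hypothetical and not taken from any MARIS dataset:

```python
# Subset of the biota group lookup table
bio_group_lut = {'Crustaceans': 2, 'Echinoderms': 3, 'Fish': 4}

# Hypothetical sample values as they might appear in a source dataset
samples = ['Fish', 'Crustaceans', 'Fish']

# Store integer ids in the NetCDF variable rather than the strings themselves
encoded = [bio_group_lut[s] for s in samples]
print(encoded)  # [4, 2, 4]
```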



 sanitize (s:str|float)

Sanitize dictionary key to comply with NetCDF enumeration type:

  • Remove (, ), .
  • Replace / and - with a space
  • Strip the string
  • Return the original value if it is missing (e.g., NaN)

| | Type | Details |
|---|---|---|
| s | str \| float | String or float to sanitize |
| Returns | str \| float | Sanitized string or original float |

def sanitize(
    s: str|float # String or float to sanitize
    ) -> str|float:  # Sanitized string or original float
    """
    Sanitize dictionary key to comply with NetCDF enumeration type:

    - Remove `(`, `)`, `.`
    - Replace `/` and `-` with a space
    - Strip the string
    - Return the original value if it is missing (e.g., NaN)
    """
    if isinstance(s, str):
        s = re.sub(r'[().]', '', s)
        return re.sub(r'[/-]', ' ', s).strip()
    elif pd.isna(s):  # This covers np.nan, None, and pandas NaT
        return s
    else:
        return str(s).strip()

For example:

fc.test_eq(sanitize('key (sanitized)'), 'key sanitized')
fc.test_eq(sanitize('key san.itized'), 'key sanitized')
fc.test_eq(sanitize('key-sanitized'), 'key sanitized')
fc.test_eq(sanitize('key/sanitized'), 'key sanitized')

The NetCDF4 enumeration type does not seem to accept keys containing non-alphanumeric characters such as parentheses, dots, or slashes. As a result, the keys of the MARIS lookup tables need to be sanitized.
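
The sketch below applies the same two regular expressions to the keys of a whole lookup table. It is a standard-library approximation only: `sanitize_key` is a hypothetical name, the raw keys are invented for illustration, and unlike sanitize it passes every non-string value through unchanged.

```python
import re

def sanitize_key(s):
    "Standard-library sketch of `sanitize`; non-string values pass through."
    if not isinstance(s, str):
        return s
    s = re.sub(r'[().]', '', s)             # drop parentheses and dots
    return re.sub(r'[/-]', ' ', s).strip()  # slashes and hyphens become spaces

# Hypothetical raw lookup table with keys that need sanitizing
raw_lut = {'Bq/m3 (filtered)': 1, 'suspended-matter': 2}
clean_lut = {sanitize_key(k): v for k, v in raw_lut.items()}
print(clean_lut)
```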



 get_lut (src_dir:str, fname:str, key:str, value:str,
          do_sanitize:bool=True, reverse:bool=False)

Convert MARIS db lookup table excel file to dictionary {'name': id, ...} or {id: name, ...} if reverse is True.

| | Type | Default | Details |
|---|---|---|---|
| src_dir | str | | Directory containing lookup tables |
| fname | str | | Excel file lookup table name |
| key | str | | Excel file column name to be used as dict keys |
| value | str | | Excel file column name to be used as dict values |
| do_sanitize | bool | True | Sanitization required? |
| reverse | bool | False | Reverse lookup table (value, key) |
| Returns | dict | | MARIS lookup table (key, value) |
def get_lut(
    src_dir: str, # Directory containing lookup tables
    fname: str, # Excel file lookup table name
    key: str, # Excel file column name to be used as dict keys 
    value: str, # Excel file column name to be used as dict values 
    do_sanitize: bool=True, # Sanitization required?
    reverse: bool=False # Reverse lookup table (value, key)
    ) -> dict: # MARIS lookup table (key, value)
    "Convert MARIS db lookup table excel file to dictionary `{'name': id, ...}` or `{id: name, ...}` if `reverse` is True."
    fname = Path(src_dir) / fname
    df = pd.read_excel(fname, usecols=[key, value]).dropna(subset=value)
    df[value] = df[value].astype('int')
    df = df.set_index(key)
    lut = df[value].to_dict()
    if do_sanitize:
        lut = {sanitize(k): v for k, v in lut.items()}
    lut = {try_int(k): try_int(v) for k, v in lut.items()}    
    return {v: k for k, v in lut.items()} if reverse else lut

For example:

lut_src_dir = './files/lut'
get_lut(lut_src_dir, 'dbo_biogroup.xlsx', key='biogroup', value='biogroup_id', reverse=False)
{'Not applicable': -1,
 'Not available': 0,
 'Birds': 1,
 'Crustaceans': 2,
 'Echinoderms': 3,
 'Fish': 4,
 'Mammals': 5,
 'Molluscs': 6,
 'Others': 7,
 'Plankton': 8,
 'Polychaete worms': 9,
 'Reptile': 10,
 'Seaweeds and plants': 11,
 'Cephalopods': 12,
 'Gastropods': 13,
 'Bivalves': 14}
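
With reverse=True, get_lut swaps keys and values in its last line. The effect can be sketched on a subset of the table above without reading any Excel file:

```python
# Subset of the biota group lookup table returned with reverse=False
lut = {'Not applicable': -1, 'Not available': 0, 'Birds': 1, 'Crustaceans': 2}

# Same inversion as the final line of `get_lut` when `reverse` is True
reversed_lut = {v: k for k, v in lut.items()}
print(reversed_lut[2])  # Crustaceans
```

The inversion is only lossless when the ids are unique, since duplicate ids would overwrite each other in the reversed dictionary.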



 Enums (lut_src_dir:str, cdl_enums:dict)

Return dictionaries of MARIS NetCDF’s enumeration types.

| | Type | Details |
|---|---|---|
| lut_src_dir | str | Directory containing lookup tables |
| cdl_enums | dict | CDL configuration enumeration types |
class Enums():
    "Return dictionaries of MARIS NetCDF's enumeration types."
    def __init__(self, 
               lut_src_dir:str, # Directory containing lookup tables
               cdl_enums:dict # CDL configuration enumeration types
               ):
        # Keep the arguments as attributes; `lookup` relies on both
        self.lut_src_dir = lut_src_dir
        self.cdl_enums = cdl_enums
        self.types = self.lookup()
    def filter(self, name, values):
        return {name: id for name, id in self.types[name].items() if id in values}
    def lookup(self):
        types = {}
        for enum in self.cdl_enums:
            name, fname, key, value = enum.values()
            lut = get_lut(self.lut_src_dir, fname, key=key, value=value)
            types[name] = lut
        return types
lut_src_dir_test = './files/lut'
cdl_enums_test = read_toml('./files/cdl.toml')['enums']

enums = Enums(lut_src_dir=lut_src_dir_test, 
              cdl_enums=cdl_enums_test)
{'Not applicable': -1,
 'Not available': 0,
 'Detected value': 1,
 'Detection limit': 2,
 'Not detected': 3,
 'Derived': 4}
{'Not applicable': -1,
 'Bq per m3': 1,
 'Bq per m2': 2,
 'Bq per kg': 3,
 'Bq per kgd': 4,
 'Bq per kgw': 5,
 'kg per kg': 6,
 'TU': 7,
 'DELTA per mill': 8,
 'atom per kg': 9,
 'atom per kgd': 10,
 'atom per kgw': 11,
 'atom per l': 12,
 'Bq per kgC': 13}
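
The filter method keeps only the entries of one enumeration type whose ids appear in a given collection, for instance the ids actually present in a dataset. A minimal stand-alone sketch follows; it assumes the first table printed above is the dl_t type, and `filter_enum` is a hypothetical function mirroring Enums.filter:

```python
# Enumeration types keyed by name (the table is assumed to be `dl_t`)
types = {
    'dl_t': {
        'Not applicable': -1, 'Not available': 0, 'Detected value': 1,
        'Detection limit': 2, 'Not detected': 3, 'Derived': 4,
    }
}

def filter_enum(types: dict, name: str, values) -> dict:
    "Keep only the entries of enum `name` whose id is in `values`."
    return {label: id_ for label, id_ in types[name].items() if id_ in values}

used = filter_enum(types, 'dl_t', values=[1, 2])
print(used)  # {'Detected value': 1, 'Detection limit': 2}
```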



 get_enum_dicts (lut_src_dir:str, cdl_enums:dict, **kwargs)

Return a dict of NetCDF enumeration types.

| | Type | Details |
|---|---|---|
| lut_src_dir | str | Directory containing lookup tables |
| cdl_enums | dict | CDL configuration enumeration types |
def get_enum_dicts(
    lut_src_dir:str, # Directory containing lookup tables
    cdl_enums:dict, # CDL configuration enumeration types
    **kwargs # Additional arguments
    ):
    "Return a dict of NetCDF enumeration types."
    enum_types = {}
    for enum in cdl_enums:
        name, fname, key, value = enum.values()
        lut = get_lut(lut_src_dir, fname, key=key, value=value, **kwargs)
        enum_types[name] = lut
    return enum_types

For example:

lut_src_dir_test = './files/lut'
cdl_enums_test = read_toml('./files/cdl.toml')['enums']

enums = get_enum_dicts(lut_src_dir=lut_src_dir_test, 
                       cdl_enums=cdl_enums_test)
enums.keys()
dict_keys(['area_t', 'bio_group_t', 'body_part_t', 'species_t', 'sed_type_t', 'unit_t', 'dl_t', 'filt_t', 'counmet_t', 'sampmet_t', 'prepmet_t'])