This module implements a data pipeline, known as a “handler” in Marisco terminology, that converts the master MARIS database dump into NetCDF format. It enables batch encoding of all legacy datasets into NetCDF.

This handler loads the global dump, splits it by sample type, applies a series of transformation callbacks (column selection and renaming, type casting, lookup remapping, time encoding, coordinate sanitization), and encodes each reference as a NetCDF file.

The result is a set of NetCDF files, one for each unique reference ID in the input data.

Getting Started

For new MARIS users, please refer to Understanding MARIS Data Formats (NetCDF and Open Refine) for detailed information.

The present notebook is an instance of Literate Programming: a narrative that intersperses code snippets with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case marisco/handlers/helcom.py), the code snippet is added to the module using the #| exports directive provided by the wonderful nbdev library.
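As a purely illustrative example (this cell is not part of the handler), a notebook cell marked with the directive below would be copied into the target module when `nbdev_export` runs:

```python
#| exports
def double(x: int) -> int:
    "Illustrative helper: nbdev copies cells starting with `#| exports` into the module declared in the notebook's front matter."
    return x * 2
```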

Configuration & file paths

  • fname_in: path to the folder containing the MARIS dump data in CSV format.

  • dir_dest: path to the folder where the NetCDF output will be saved.

Exported source
# fname_in = Path().home() / 'pro/data/maris/2024-11-20 MARIS_QA_shapetype_id=1.txt'
fname_in = Path().home() / 'pro/data/maris/2025-06-03 MARIS_QA_shapetype_id = 1.txt'

dir_dest = '../../_data/output/dump'
df = pd.read_csv(fname_in, sep='\t', encoding='utf-8', low_memory=False)
df.head()
sample_id area_id areaname samptype_id samptype ref_id displaytext zoterourl ref_note datbase ... profile_id sampnote ref_fulltext ref_yearpub ref_sampleTypes LongLat shiftedcoordinates shiftedlong shiftedlat id
0 18810 1904 Indian Ocean 1 Seawater 97 ASPAMARD, 2004 https://www.zotero.org/groups/2432820/maris/it... P. Scotto(1975-2001);P. Morris (2003-2014) ASPAMARD ... NaN Because the date does not have month & day, an... ASPAMARD, 2004. Asia-Pacific Marine Radioactiv... 2004 1,3 111.983,-25.05 0xE6100000010CCDCCCCCCCC0C39C0F4FDD478E9FE5B40 111.983333 -25.050000 1
1 63633 1904 Indian Ocean 1 Seawater 99 Aoyama and Hirose, 2004 https://www.zotero.org/groups/2432820/maris/it... NaN HAM 2008 ... NaN Author: Y.Bourlat, et.al. Unknown latitude and... Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... 2004 1 61.533,-52.017 0xE6100000010CEEEBC03923024AC0787AA52C43C44E40 61.533333 -52.016667 2
2 63633 1904 Indian Ocean 1 Seawater 99 Aoyama and Hirose, 2004 https://www.zotero.org/groups/2432820/maris/it... NaN HAM 2008 ... NaN Author: Y.Bourlat, et.al. Unknown latitude and... Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... 2004 1 61.533,-52.017 0xE6100000010CEEEBC03923024AC0787AA52C43C44E40 61.533333 -52.016667 3
3 63635 1904 Indian Ocean 1 Seawater 99 Aoyama and Hirose, 2004 https://www.zotero.org/groups/2432820/maris/it... NaN HAM 2008 ... NaN Author: Y.Bourlat, et.al. Unknown latitude and... Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... 2004 1 57.483,-44.483 0xE6100000010C12143FC6DC3D46C012143FC6DCBD4C40 57.483333 -44.483333 4
4 63635 1904 Indian Ocean 1 Seawater 99 Aoyama and Hirose, 2004 https://www.zotero.org/groups/2432820/maris/it... NaN HAM 2008 ... NaN Author: Y.Bourlat, et.al. Unknown latitude and... Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... 2004 1 57.483,-44.483 0xE6100000010C12143FC6DC3D46C012143FC6DCBD4C40 57.483333 -44.483333 5

5 rows × 80 columns

Utils

Below is a utility class to load a specific MARIS dump dataset, optionally filtered by its ref_id.


source

DataLoader

 DataLoader (fname:str, exclude_ref_id:Optional[List[int]]=[9999])

Load specific MARIS dataset through its ref_id.

Type Default Details
fname str Path to the MARIS global dump file
exclude_ref_id Optional [9999] List of ref_id values to exclude from the dataframe
Exported source
class DataLoader:
    "Load specific MARIS dataset through its ref_id."
    LUT = {
        'Biota': 'BIOTA', 
        'Seawater': 'SEAWATER', 
        'Sediment': 'SEDIMENT', 
        'Suspended matter': 'SUSPENDED_MATTER'
    }

    def __init__(self, 
                 fname: str, # Path to the MARIS global dump file
                 exclude_ref_id: Optional[List[int]]=[9999] # List of ref_id values to exclude from the dataframe
                 ):
        fc.store_attr()
        self.df = self._load_data()

    def _load_data(self):
        df = pd.read_csv(self.fname, sep='\t', encoding='utf-8', low_memory=False)
        return df[~df.ref_id.isin(self.exclude_ref_id)] if self.exclude_ref_id else df

    def __call__(self, 
                 ref_id: int # Reference ID of interest
                 ) -> dict: # Dictionary of dataframes
        df = self.df[self.df.ref_id == ref_id].copy() if ref_id else self.df.copy()
        return {self.LUT[name]: grp for name, grp in df.groupby('samptype') if name in self.LUT}

source

get_zotero_key

 get_zotero_key (dfs)

Retrieve Zotero key from MARIS dump.

Exported source
def get_zotero_key(dfs):
    "Retrieve Zotero key from MARIS dump."
    return dfs[next(iter(dfs))][['zoterourl']].iloc[0].values[0].split('/')[-1]
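To illustrate on a toy `dfs` dictionary (with a hypothetical Zotero item key), the function simply takes the last path segment of the first row's `zoterourl`:

```python
import pandas as pd

def get_zotero_key(dfs):
    "Retrieve Zotero key from MARIS dump."
    return dfs[next(iter(dfs))][['zoterourl']].iloc[0].values[0].split('/')[-1]

# Toy input mimicking the dict of dataframes returned by DataLoader
dfs = {'SEAWATER': pd.DataFrame(
    {'zoterourl': ['https://www.zotero.org/groups/2432820/maris/items/ABCD1234']})}
print(get_zotero_key(dfs))  # → ABCD1234
```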

source

get_fname

 get_fname (dfs)

Get NetCDF filename.

Exported source
def get_fname(dfs):
    "Get NetCDF filename."
    return f"{next(iter(dfs.values()))['ref_id'].iloc[0]}.nc"
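The filename is derived from the `ref_id` of the first group, so for instance a dataset with ref_id 106 yields `106.nc` (toy input for illustration):

```python
import pandas as pd

def get_fname(dfs):
    "Get NetCDF filename."
    return f"{next(iter(dfs.values()))['ref_id'].iloc[0]}.nc"

dfs = {'SEAWATER': pd.DataFrame({'ref_id': [106, 106]})}
print(get_fname(dfs))  # → 106.nc
```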

Load data

Below is a quick overview of the MARIS dump data structure.

dataloader = DataLoader(fname_in)
ref_id = 106 # Some other ref_id examples: OSPAR: 191, HELCOM: 100, 717 (only seawater)

dfs = dataloader(ref_id=ref_id)
print(f'keys: {dfs.keys()}')
dfs['SEAWATER'].head()
keys: dict_keys(['SEAWATER'])
sample_id area_id areaname samptype_id samptype ref_id displaytext zoterourl ref_note datbase ... profile_id sampnote ref_fulltext ref_yearpub ref_sampleTypes LongLat shiftedcoordinates shiftedlong shiftedlat id
5829 73703 1904 Indian Ocean 1 Seawater 106 Yamada et al., 2006 https://www.zotero.org/groups/2432820/maris/it... NaN NaN ... NaN NaN Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... 2006 1 92.983,-0.008 0xE6100000010CB289C4EB97DB7FBFA52C431CEB3E5740 92.983056 -0.007778 5830
5830 73705 1904 Indian Ocean 1 Seawater 106 Yamada et al., 2006 https://www.zotero.org/groups/2432820/maris/it... NaN NaN ... NaN NaN Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... 2006 1 96.596,-4.029 0xE6100000010CAF5A99F04B1D10C0D95F764F1E265840 96.595556 -4.028611 5831
5831 73711 1904 Indian Ocean 1 Seawater 106 Yamada et al., 2006 https://www.zotero.org/groups/2432820/maris/it... NaN NaN ... NaN NaN Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... 2006 1 101.991,-9.997 0xE6100000010C91D5AD9E93FE23C08195438B6C7F5940 101.990556 -9.997222 5832
5832 73715 1904 Indian Ocean 1 Seawater 106 Yamada et al., 2006 https://www.zotero.org/groups/2432820/maris/it... NaN NaN ... NaN NaN Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... 2006 1 114.394,-18.496 0xE6100000010C575BB1BFEC7E32C0F0A7C64B37995C40 114.393889 -18.495833 5833
5833 73717 1904 Indian Ocean 1 Seawater 106 Yamada et al., 2006 https://www.zotero.org/groups/2432820/maris/it... NaN NaN ... NaN NaN Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... 2006 1 112.552,-23.066 0xE6100000010C54742497FF1037C017D9CEF753235C40 112.551667 -23.066389 5834

5 rows × 80 columns

Transform data

Select columns

Exported source
cois_renaming_rules = {
    'sample_id': 'SMP_ID',
    'latitude': 'LAT',
    'longitude': 'LON',
    'begperiod': 'TIME',
    'sampdepth': 'SMP_DEPTH',
    'totdepth': 'TOT_DEPTH',
    'station': 'STATION',
    'uncertaint': 'UNC',
    'unit_id': 'UNIT',
    'detection': 'DL',
    'area_id': 'AREA',
    'species_id': 'SPECIES',
    'biogroup_id': 'BIO_GROUP',
    'bodypar_id': 'BODY_PART',
    'sedtype_id': 'SED_TYPE',
    'volume': 'VOL',
    'salinity': 'SAL',
    'temperatur': 'TEMP',
    'sampmet_id': 'SAMP_MET',
    'prepmet_id': 'PREP_MET',
    'counmet_id': 'COUNT_MET',
    'activity': 'VALUE',
    'nuclide_id': 'NUCLIDE',
    'sliceup': 'TOP',
    'slicedown': 'BOTTOM'
}
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules)
    ])

print('Keys:', tfm().keys())
print('Columns:', tfm()['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['sample_id', 'latitude', 'longitude', 'begperiod', 'sampdepth',
       'totdepth', 'station', 'uncertaint', 'unit_id', 'detection', 'area_id',
       'species_id', 'biogroup_id', 'bodypar_id', 'sedtype_id', 'volume',
       'salinity', 'temperatur', 'sampmet_id', 'prepmet_id', 'counmet_id',
       'activity', 'nuclide_id', 'sliceup', 'slicedown'],
      dtype='object')

Rename columns

dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules)
    ])

dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'SMP_DEPTH', 'TOT_DEPTH', 'STATION',
       'UNC', 'UNIT', 'DL', 'AREA', 'SPECIES', 'BIO_GROUP', 'BODY_PART',
       'SED_TYPE', 'VOL', 'SAL', 'TEMP', 'SAMP_MET', 'PREP_MET', 'COUNT_MET',
       'VALUE', 'NUCLIDE', 'TOP', 'BOTTOM'],
      dtype='object')

Cast STATION to str type

This is required for variable-length (VLEN) NetCDF string variables.
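The casting step boils down to a fill-then-cast on the column; a minimal sketch on a standalone pandas Series with a missing station:

```python
import pandas as pd

station = pd.Series(['RINGHA', None, 'VAASA'])
station = station.fillna('').astype('string')  # missing values become empty strings
print(station.dtype)     # string
print(station.tolist())  # ['RINGHA', '', 'VAASA']
```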


source

CastStationToStringCB

 CastStationToStringCB ()

Convert STATION column to string type, filling any missing values with empty string

Exported source
class CastStationToStringCB(Callback):
    "Convert STATION column to string type, filling any missing values with empty string"
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            if 'STATION' in tfm.dfs[k].columns:
                tfm.dfs[k]['STATION'] = tfm.dfs[k]['STATION'].fillna('').astype('string')
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules),
    CastStationToStringCB()
    ])

dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].dtypes)
Keys: dict_keys(['SEAWATER'])
Columns: SMP_ID                int64
LAT                 float64
LON                 float64
TIME                 object
SMP_DEPTH           float64
TOT_DEPTH           float64
STATION      string[python]
UNC                 float64
UNIT                  int64
DL                   object
AREA                  int64
SPECIES               int64
BIO_GROUP             int64
BODY_PART             int64
SED_TYPE              int64
VOL                 float64
SAL                 float64
TEMP                float64
SAMP_MET              int64
PREP_MET              int64
COUNT_MET             int64
VALUE               float64
NUCLIDE               int64
TOP                 float64
BOTTOM              float64
dtype: object

Drop NaN only columns

We then remove columns containing only NaN values or ‘Not available’ (id=0 in MARIS lookup tables).
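The two criteria can be checked independently with plain pandas (toy dataframe for illustration, re-implementing the same logic as the callback below):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'VALUE': [2.5, 4.5],        # kept: real data
    'VOL': [np.nan, np.nan],    # dropped: all NaN
    'SED_TYPE': [0, 0],         # dropped: only 'Not available' (id=0)
})
df = df.dropna(axis=1, how='all')  # drop all-NaN columns
na_cols = [c for c in df.columns if df[c].nunique() == 1 and df[c].iloc[0] == 0]
df = df.drop(columns=na_cols)      # drop 'Not available'-only columns
print(df.columns.tolist())  # → ['VALUE']
```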


source

DropNAColumnsCB

 DropNAColumnsCB (na_value=0)

Drop variable containing only NaN or ‘Not available’ (id=0 in MARIS lookup tables).

Exported source
class DropNAColumnsCB(Callback):
    "Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables)."
    def __init__(self, na_value=0): fc.store_attr()
    def isMarisNA(self, col): 
        return len(col.unique()) == 1 and col.iloc[0] == self.na_value
    
    def dropMarisNA(self, df):
        na_cols = [col for col in df.columns if self.isMarisNA(df[col])]
        return df.drop(labels=na_cols, axis=1)
        
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = tfm.dfs[k].dropna(axis=1, how='all')
            tfm.dfs[k] = self.dropMarisNA(tfm.dfs[k])
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules),
    CastStationToStringCB(),
    DropNAColumnsCB()
    ])

dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'STATION', 'UNC', 'UNIT', 'DL', 'AREA',
       'SAL', 'TEMP', 'VALUE', 'NUCLIDE'],
      dtype='object')

Remap detection limit values

Category-based NetCDF variables are encoded as integers according to the MARIS lookup table dbo_detectlimit.xlsx. Recall that these lookup tables are included in the NetCDF file as custom enumeration types.

Exported source
dl_name_to_id = lambda: get_lut(lut_path(), 
                                'dbo_detectlimit.xlsx', 
                                key='name', 
                                value='id')
dl_name_to_id()
{'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}
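Applied to a toy detection-limit column, the mapping works through pandas `replace`:

```python
import pandas as pd

# Lookup table as returned by dl_name_to_id() above
lut = {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}
dl = pd.Series(['=', '<', 'ND'])
print(dl.replace(lut).tolist())  # → [1, 2, 3]
```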

source

SanitizeDetectionLimitCB

 SanitizeDetectionLimitCB (fn_lut=<function <lambda>>, dl_name='DL')

Assign Detection Limit name to its id based on MARIS nomenclature.

Exported source
class SanitizeDetectionLimitCB(Callback):
    "Assign Detection Limit name to its id based on MARIS nomenclature."
    def __init__(self,
                 fn_lut=dl_name_to_id,
                 dl_name='DL'):
        fc.store_attr()

    def __call__(self, tfm):
        lut = self.fn_lut()
        for k in tfm.dfs.keys():
            tfm.dfs[k][self.dl_name] = tfm.dfs[k][self.dl_name].replace(lut)
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules),
    CastStationToStringCB(),
    DropNAColumnsCB(),
    SanitizeDetectionLimitCB()
    ])

dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['BIOTA'].columns)
print(f'{dfs_tfm["BIOTA"]["DL"].unique()}')
print(f'{dfs_tfm["BIOTA"].head()}')
Keys: dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'SMP_DEPTH', 'STATION', 'UNC', 'UNIT',
       'DL', 'AREA', 'SPECIES', 'BIO_GROUP', 'BODY_PART', 'PREP_MET',
       'COUNT_MET', 'VALUE', 'NUCLIDE'],
      dtype='object')
[1 2]
        SMP_ID    LAT        LON                     TIME  SMP_DEPTH STATION  \
603199  638133  57.25  12.083333  1986-01-31 00:00:00.000        0.0  RINGHA   
603200  638133  57.25  12.083333  1986-01-31 00:00:00.000        0.0  RINGHA   
603201  638134  57.25  12.083333  1986-02-28 00:00:00.000        0.0  RINGHA   
603202  638134  57.25  12.083333  1986-02-28 00:00:00.000        0.0  RINGHA   
603203  638134  57.25  12.083333  1986-02-28 00:00:00.000        0.0  RINGHA   

         UNC  UNIT  DL  AREA  SPECIES  BIO_GROUP  BODY_PART  PREP_MET  \
603199  0.25     5   1  2374      129         14         19         7   
603200  0.45     5   1  2374      129         14         19         7   
603201  0.52     5   1  2374      129         14         19         7   
603202  0.49     5   1  2374      129         14         19         7   
603203  0.78     5   1  2374      129         14         19         7   

        COUNT_MET  VALUE  NUCLIDE  
603199         20    2.5       33  
603200         20    4.5        9  
603201         20    2.6       33  
603202         20    4.9        9  
603203         20    3.9       22  

Parse and encode time

Recall that in NetCDF, time needs to be encoded as an integer representing the number of seconds since a reference time. In our case we chose 1970-01-01 00:00:00.0, as defined in configs.ipynb.
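For example, the first BIOTA timestamp in the output further down (1986-01-31) encodes to 507513600 seconds since the epoch:

```python
import pandas as pd

t = pd.to_datetime('1986-01-31 00:00:00.000', format='ISO8601')
seconds = int(t.timestamp())  # seconds since 1970-01-01 00:00:00 UTC
print(seconds)  # → 507513600
```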


source

ParseTimeCB

 ParseTimeCB (time_name='TIME')

Parse time column from MARIS dump.

Exported source
class ParseTimeCB(Callback):
    "Parse time column from MARIS dump."
    def __init__(self,
                 time_name='TIME'):
        fc.store_attr()
        
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k][self.time_name] = pd.to_datetime(tfm.dfs[k][self.time_name], format='ISO8601')
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules),
    CastStationToStringCB(),
    DropNAColumnsCB(),
    SanitizeDetectionLimitCB(),
    ParseTimeCB(),
    EncodeTimeCB()
    ])

print(tfm()['BIOTA'])
        SMP_ID        LAT        LON        TIME  SMP_DEPTH STATION      UNC  \
603199  638133  57.250000  12.083333   507513600        0.0  RINGHA  0.25000   
603200  638133  57.250000  12.083333   507513600        0.0  RINGHA  0.45000   
603201  638134  57.250000  12.083333   509932800        0.0  RINGHA  0.52000   
603202  638134  57.250000  12.083333   509932800        0.0  RINGHA  0.49000   
603203  638134  57.250000  12.083333   509932800        0.0  RINGHA  0.78000   
...        ...        ...        ...         ...        ...     ...      ...   
965909  639100  63.050000  21.616667   518572800        0.0   VAASA  0.01440   
965910  639100  63.050000  21.616667   518572800        0.0   VAASA      NaN   
965911  639137  63.066667  21.400000  1114732800        0.0   VAASA  1.46500   
965912  639137  63.066667  21.400000  1114732800        0.0   VAASA  0.00204   
965913  639137  63.066667  21.400000  1114732800        0.0   VAASA  5.00000   

        UNIT  DL  AREA  SPECIES  BIO_GROUP  BODY_PART  PREP_MET  COUNT_MET  \
603199     5   1  2374      129         14         19         7         20   
603200     5   1  2374      129         14         19         7         20   
603201     5   1  2374      129         14         19         7         20   
603202     5   1  2374      129         14         19         7         20   
603203     5   1  2374      129         14         19         7         20   
...      ...  ..   ...      ...        ...        ...       ...        ...   
965909     5   1  9999      269          4         52        12          9   
965910     5   1  9999      269          4         52        12          9   
965911     5   1  9999      269          4         52         0         20   
965912     5   1  9999      269          4         52         0          8   
965913     5   1  9999      269          4         52         0         20   

          VALUE  NUCLIDE  
603199    2.500       33  
603200    4.500        9  
603201    2.600       33  
603202    4.900        9  
603203    3.900       22  
...         ...      ...  
965909    0.072       12  
965910    0.015       11  
965911   29.300       33  
965912    0.017       12  
965913  113.000        4  

[14872 rows x 17 columns]

Sanitize coordinates

We ensure that coordinates are within the valid range.
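SanitizeLonLatCB lives in a shared module; per its log message, it drops rows with invalid longitude and latitude values. A minimal sketch of that validity check (illustrative only, not the actual callback):

```python
import pandas as pd

df = pd.DataFrame({'LON': [12.08, 210.0], 'LAT': [57.25, 63.05]})
valid = df['LON'].between(-180, 180) & df['LAT'].between(-90, 90)
print(df[valid])  # keeps only the first row; LON=210 is out of range
```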

dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules),
    CastStationToStringCB(),
    DropNAColumnsCB(),
    SanitizeDetectionLimitCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

dfs_test = tfm()
dfs_test['BIOTA']
SMP_ID LAT LON TIME SMP_DEPTH STATION UNC UNIT DL AREA SPECIES BIO_GROUP BODY_PART PREP_MET COUNT_MET VALUE NUCLIDE
603199 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.25000 5 1 2374 129 14 19 7 20 2.500 33
603200 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.45000 5 1 2374 129 14 19 7 20 4.500 9
603201 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.52000 5 1 2374 129 14 19 7 20 2.600 33
603202 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.49000 5 1 2374 129 14 19 7 20 4.900 9
603203 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.78000 5 1 2374 129 14 19 7 20 3.900 22
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
965909 639100 63.050000 21.616667 518572800 0.0 VAASA 0.01440 5 1 9999 269 4 52 12 9 0.072 12
965910 639100 63.050000 21.616667 518572800 0.0 VAASA NaN 5 1 9999 269 4 52 12 9 0.015 11
965911 639137 63.066667 21.400000 1114732800 0.0 VAASA 1.46500 5 1 9999 269 4 52 0 20 29.300 33
965912 639137 63.066667 21.400000 1114732800 0.0 VAASA 0.00204 5 1 9999 269 4 52 0 8 0.017 12
965913 639137 63.066667 21.400000 1114732800 0.0 VAASA 5.00000 5 1 9999 269 4 52 0 20 113.000 4

14872 rows × 17 columns

Set unique index
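UniqueIndexCB also lives in a shared module; as the output below shows, each group gets a fresh 0-based index materialized as an `ID` column. A minimal sketch of the idea (illustrative, assuming a plain index reset):

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [2.5, 4.5]}, index=[603199, 603200])
df = df.reset_index(drop=True)  # fresh 0..n-1 index per group
df.insert(0, 'ID', df.index)    # materialize it as an ID column
print(df)
```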

dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules),
    CastStationToStringCB(),
    DropNAColumnsCB(),
    SanitizeDetectionLimitCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    UniqueIndexCB()
    ])

dfs_test = tfm()    
dfs_test['BIOTA']
ID SMP_ID LAT LON TIME SMP_DEPTH STATION UNC UNIT DL AREA SPECIES BIO_GROUP BODY_PART PREP_MET COUNT_MET VALUE NUCLIDE
0 0 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.25000 5 1 2374 129 14 19 7 20 2.500 33
1 1 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.45000 5 1 2374 129 14 19 7 20 4.500 9
2 2 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.52000 5 1 2374 129 14 19 7 20 2.600 33
3 3 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.49000 5 1 2374 129 14 19 7 20 4.900 9
4 4 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.78000 5 1 2374 129 14 19 7 20 3.900 22
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14867 14867 639100 63.050000 21.616667 518572800 0.0 VAASA 0.01440 5 1 9999 269 4 52 12 9 0.072 12
14868 14868 639100 63.050000 21.616667 518572800 0.0 VAASA NaN 5 1 9999 269 4 52 12 9 0.015 11
14869 14869 639137 63.066667 21.400000 1114732800 0.0 VAASA 1.46500 5 1 9999 269 4 52 0 20 29.300 33
14870 14870 639137 63.066667 21.400000 1114732800 0.0 VAASA 0.00204 5 1 9999 269 4 52 0 8 0.017 12
14871 14871 639137 63.066667 21.400000 1114732800 0.0 VAASA 5.00000 5 1 9999 269 4 52 0 20 113.000 4

14872 rows × 18 columns

Encode to NetCDF

dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
    SelectColumnsCB(cois_renaming_rules),
    RenameColumnsCB(cois_renaming_rules),
    CastStationToStringCB(),
    DropNAColumnsCB(),
    SanitizeDetectionLimitCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    UniqueIndexCB()
    ])

dfs_tfm = tfm()
tfm.logs
['Select columns of interest.',
 'Renaming variables to MARIS standard names.',
 'Convert STATION column to string type, filling any missing values with empty string',
 "Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables).",
 'Assign Detection Limit name to its id based on MARIS nomenclature.',
 'Parse time column from MARIS dump.',
 'Encode time as seconds since epoch.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
 'Set unique index for each group.']

source

get_attrs

 get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans >
            Ocean Chemistry> Radionuclides', 'Earth Science > Human
            Dimensions > Environmental Impacts > Nuclear Radiation
            Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean
            Tracers, Earth Science > Oceans > Marine Sediments', 'Earth
            Science > Oceans > Ocean Chemistry, Earth Science > Oceans >
            Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality >
            Ocean Contaminants', 'Earth Science > Biological
            Classification > Animals/Vertebrates > Fish', 'Earth Science >
            Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science >
            Biological Classification > Animals/Invertebrates > Mollusks',
            'Earth Science > Biological Classification >
            Animals/Invertebrates > Arthropods > Crustaceans', 'Earth
            Science > Biological Classification > Plants > Macroalgae
            (Seaweeds)'])

Retrieve global attributes from MARIS dump.

Exported source
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
Exported source
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
get_attrs(tfm, zotero_key='3W354SQG', kw=kw)
{'geospatial_lat_min': '137.8475',
 'geospatial_lat_max': '31.2575',
 'geospatial_lon_min': '88.9988888888889',
 'geospatial_lon_max': '-42.3086111111111',
 'geospatial_bounds': 'POLYGON ((88.9988888888889 -42.3086111111111, 137.8475 -42.3086111111111, 137.8475 31.2575, 88.9988888888889 31.2575, 88.9988888888889 -42.3086111111111))',
 'time_coverage_start': '1996-12-20T00:00:00',
 'time_coverage_end': '1997-02-11T00:00:00',
 'id': '3W354SQG',
 'title': 'Radioactivity Monitoring of the Irish Marine Environment 1991 and 1992',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "A.", "lastName": "McGarry"}, {"creatorType": "author", "firstName": "S.", "lastName": "Lyons"}, {"creatorType": "author", "firstName": "C.", "lastName": "McEnri"}, {"creatorType": "author", "firstName": "T.", "lastName": "Ryan"}, {"creatorType": "author", "firstName": "M.", "lastName": "O\'Colmain"}, {"creatorType": "author", "firstName": "J.D.", "lastName": "Cunningham"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Select columns of interest., Renaming variables to MARIS standard names., Convert STATION column to string type, filling any missing values with empty string, Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables)., Assign Detection Limit name to its id based on MARIS nomenclature., Parse time column from MARIS dump., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator., Set unique index for each group."}

source

encode

 encode (fname_in:str, dir_dest:str, **kwargs)

Encode MARIS dump to NetCDF.

Type Details
fname_in str Path to the MARIS dump data in CSV format
dir_dest str Path to the folder where the NetCDF output will be saved
kwargs VAR_KEYWORD
Exported source
def encode(
    fname_in: str, # Path to the MARIS dump data in CSV format
    dir_dest: str, # Path to the folder where the NetCDF output will be saved
    **kwargs # Additional keyword arguments
    ):
    "Encode MARIS dump to NetCDF."
    dataloader = DataLoader(fname_in)
    ref_ids = kwargs.get('ref_ids')
    if ref_ids is None:
        ref_ids = dataloader.df.ref_id.unique()
    print('Encoding ...')
    for ref_id in tqdm(ref_ids, leave=False):
        # if ref_id == 736: continue
        dfs = dataloader(ref_id=ref_id)
        print(get_fname(dfs))
        tfm = Transformer(dfs, cbs=[
            SelectColumnsCB(cois_renaming_rules),
            RenameColumnsCB(cois_renaming_rules),
            CastStationToStringCB(),
            DropNAColumnsCB(),
            SanitizeDetectionLimitCB(),
            ParseTimeCB(),
            EncodeTimeCB(),
            SanitizeLonLatCB(),
            UniqueIndexCB(),
            ])
        
        tfm()
        encoder = NetCDFEncoder(tfm.dfs, 
                                dest_fname=Path(dir_dest) / get_fname(dfs), 
                                global_attrs=get_attrs(tfm, zotero_key=get_zotero_key(dfs), kw=kw),
                                verbose=kwargs.get('verbose', False)
                                )
        encoder.encode()

Single dataset

ref_id = 106
encode(
    fname_in,
    dir_dest,
    verbose=False, 
    ref_ids=[ref_id])
Encoding ...
  0%|          | 0/1 [00:00<?, ?it/s]
106.nc

All datasets

encode(
    fname_in, 
    dir_dest, 
    ref_ids=None,
    verbose=False)