Exported source
# fname_in = Path().home() / 'pro/data/maris/2024-11-20 MARIS_QA_shapetype_id=1.txt'
fname_in = Path().home() / 'pro/data/maris/2025-06-03 MARIS_QA_shapetype_id = 1.txt'
dir_dest = '../../_data/output/dump'
This notebook contains a data pipeline, known as a “handler” in Marisco terminology, that converts the master MARIS database dump into NetCDF format. It enables batch encoding of all legacy datasets into NetCDF.
The key functions of this handler are described below. The result is a set of NetCDF files, one for each unique reference ID in the input data.
For new MARIS users, please refer to Understanding MARIS Data Formats (NetCDF and Open Refine) for detailed information.
The present notebook is an instance of Literate Programming in the sense that it is a narrative including code snippets interspersed with explanations. When a function or a class needs to be exported into a dedicated Python module (in our case marisco/handlers/helcom.py), the code snippet is added to the module using the #| exports directive provided by the wonderful nbdev library.
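For example, a notebook cell marked as below (the function name is just a placeholder) is copied verbatim into the generated module and its source is shown in the rendered documentation:

#| exports
def a_placeholder_function():
    "Cells starting with the `#| exports` directive are exported by nbdev."
    pass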
fname_in: path to the file containing the MARIS dump data in CSV format.
dir_dest: path to the folder where the NetCDF output will be saved.
sample_id | area_id | areaname | samptype_id | samptype | ref_id | displaytext | zoterourl | ref_note | datbase | ... | profile_id | sampnote | ref_fulltext | ref_yearpub | ref_sampleTypes | LongLat | shiftedcoordinates | shiftedlong | shiftedlat | id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18810 | 1904 | Indian Ocean | 1 | Seawater | 97 | ASPAMARD, 2004 | https://www.zotero.org/groups/2432820/maris/it... | P. Scotto(1975-2001);P. Morris (2003-2014) | ASPAMARD | ... | NaN | Because the date does not have month & day, an... | ASPAMARD, 2004. Asia-Pacific Marine Radioactiv... | 2004 | 1,3 | 111.983,-25.05 | 0xE6100000010CCDCCCCCCCC0C39C0F4FDD478E9FE5B40 | 111.983333 | -25.050000 | 1 |
1 | 63633 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 61.533,-52.017 | 0xE6100000010CEEEBC03923024AC0787AA52C43C44E40 | 61.533333 | -52.016667 | 2 |
2 | 63633 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 61.533,-52.017 | 0xE6100000010CEEEBC03923024AC0787AA52C43C44E40 | 61.533333 | -52.016667 | 3 |
3 | 63635 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 57.483,-44.483 | 0xE6100000010C12143FC6DC3D46C012143FC6DCBD4C40 | 57.483333 | -44.483333 | 4 |
4 | 63635 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 57.483,-44.483 | 0xE6100000010C12143FC6DC3D46C012143FC6DCBD4C40 | 57.483333 | -44.483333 | 5 |
5 rows × 80 columns
Below is a utility class to load a specific MARIS dump dataset, optionally filtered by its ref_id.
DataLoader (fname:str, exclude_ref_id:Optional[List[int]]=[9999])
Load specific MARIS dataset through its ref_id.
| Type | Default | Details |
---|---|---|---|
fname | str | | Path to the MARIS global dump file |
exclude_ref_id | Optional | [9999] | List of ref_id values to exclude from the dataframe |
class DataLoader:
    "Load specific MARIS dataset through its ref_id."
    LUT = {
        'Biota': 'BIOTA',
        'Seawater': 'SEAWATER',
        'Sediment': 'SEDIMENT',
        'Suspended matter': 'SUSPENDED_MATTER'
    }

    def __init__(self,
                 fname: str, # Path to the MARIS global dump file
                 exclude_ref_id: Optional[List[int]]=[9999] # ref_id values to exclude from the dataframe
                 ):
        fc.store_attr()
        self.df = self._load_data()

    def _load_data(self):
        "Read the tab-separated global dump and drop excluded ref_ids."
        df = pd.read_csv(self.fname, sep='\t', encoding='utf-8', low_memory=False)
        return df[~df.ref_id.isin(self.exclude_ref_id)] if self.exclude_ref_id else df

    def __call__(self,
                 ref_id: int # Reference ID of interest
                 ) -> dict: # Dictionary of dataframes
        df = self.df[self.df.ref_id == ref_id].copy() if ref_id else self.df.copy()
        return {self.LUT[name]: grp for name, grp in df.groupby('samptype') if name in self.LUT}
get_zotero_key (dfs)
Retrieve Zotero key from MARIS dump.
get_fname (dfs)
Get NetCDF filename.
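The source of these two helpers is not reproduced here. Purely as a hypothetical illustration (assuming the Zotero key is taken from the zoterourl column and the filename is derived from the dataset's ref_id), they could look like:

# Hypothetical sketches only -- the exported implementations may differ.
def get_zotero_key_sketch(dfs):
    "Illustrative: take the last path segment of the zoterourl column as the Zotero key."
    df = dfs[next(iter(dfs))]
    return df.zoterourl.iloc[0].rstrip('/').split('/')[-1]

def get_fname_sketch(dfs):
    "Illustrative: name the NetCDF file after the dataset ref_id."
    ref_id = dfs[next(iter(dfs))].ref_id.iloc[0]
    return f'{ref_id}.nc'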
Below is a quick overview of the MARIS dump data structure.
dataloader = DataLoader(fname_in)
ref_id = 106 # Some other ref_id examples: OSPAR: 191, HELCOM: 100, 717 (only seawater)
dfs = dataloader(ref_id=ref_id)
print(f'keys: {dfs.keys()}')
dfs['SEAWATER'].head()
keys: dict_keys(['SEAWATER'])
sample_id | area_id | areaname | samptype_id | samptype | ref_id | displaytext | zoterourl | ref_note | datbase | ... | profile_id | sampnote | ref_fulltext | ref_yearpub | ref_sampleTypes | LongLat | shiftedcoordinates | shiftedlong | shiftedlat | id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5829 | 73703 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 92.983,-0.008 | 0xE6100000010CB289C4EB97DB7FBFA52C431CEB3E5740 | 92.983056 | -0.007778 | 5830 |
5830 | 73705 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 96.596,-4.029 | 0xE6100000010CAF5A99F04B1D10C0D95F764F1E265840 | 96.595556 | -4.028611 | 5831 |
5831 | 73711 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 101.991,-9.997 | 0xE6100000010C91D5AD9E93FE23C08195438B6C7F5940 | 101.990556 | -9.997222 | 5832 |
5832 | 73715 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 114.394,-18.496 | 0xE6100000010C575BB1BFEC7E32C0F0A7C64B37995C40 | 114.393889 | -18.495833 | 5833 |
5833 | 73717 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 112.552,-23.066 | 0xE6100000010C54742497FF1037C017D9CEF753235C40 | 112.551667 | -23.066389 | 5834 |
5 rows × 80 columns
The cois_renaming_rules dictionary below maps the dump columns of interest to their MARIS standard names; it is used both to select and to rename columns in the transformer pipeline:
cois_renaming_rules = {
    'sample_id': 'SMP_ID',
    'latitude': 'LAT',
    'longitude': 'LON',
    'begperiod': 'TIME',
    'sampdepth': 'SMP_DEPTH',
    'totdepth': 'TOT_DEPTH',
    'station': 'STATION',
    'uncertaint': 'UNC',
    'unit_id': 'UNIT',
    'detection': 'DL',
    'area_id': 'AREA',
    'species_id': 'SPECIES',
    'biogroup_id': 'BIO_GROUP',
    'bodypar_id': 'BODY_PART',
    'sedtype_id': 'SED_TYPE',
    'volume': 'VOL',
    'salinity': 'SAL',
    'temperatur': 'TEMP',
    'sampmet_id': 'SAMP_MET',
    'prepmet_id': 'PREP_MET',
    'counmet_id': 'COUNT_MET',
    'activity': 'VALUE',
    'nuclide_id': 'NUCLIDE',
    'sliceup': 'TOP',
    'slicedown': 'BOTTOM'
}
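SelectColumnsCB and RenameColumnsCB are provided by marisco and their source is not reproduced here; as a rough mental model only (assuming the same Callback pattern as DropNAColumnsCB shown further down), they behave roughly as follows:

# Rough sketches only -- not the exported marisco implementations.
class SelectColumnsSketchCB(Callback):
    "Select columns of interest."
    def __init__(self, cois): fc.store_attr()
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            cols = [c for c in self.cois if c in tfm.dfs[k].columns]
            tfm.dfs[k] = tfm.dfs[k][cols]

class RenameColumnsSketchCB(Callback):
    "Renaming variables to MARIS standard names."
    def __init__(self, renaming_rules): fc.store_attr()
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = tfm.dfs[k].rename(columns=self.renaming_rules)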
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules)
])
print('Keys:', tfm().keys())
print('Columns:', tfm()['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['sample_id', 'latitude', 'longitude', 'begperiod', 'sampdepth',
'totdepth', 'station', 'uncertaint', 'unit_id', 'detection', 'area_id',
'species_id', 'biogroup_id', 'bodypar_id', 'sedtype_id', 'volume',
'salinity', 'temperatur', 'sampmet_id', 'prepmet_id', 'counmet_id',
'activity', 'nuclide_id', 'sliceup', 'slicedown'],
dtype='object')
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules)
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'SMP_DEPTH', 'TOT_DEPTH', 'STATION',
'UNC', 'UNIT', 'DL', 'AREA', 'SPECIES', 'BIO_GROUP', 'BODY_PART',
'SED_TYPE', 'VOL', 'SAL', 'TEMP', 'SAMP_MET', 'PREP_MET', 'COUNT_MET',
'VALUE', 'NUCLIDE', 'TOP', 'BOTTOM'],
dtype='object')
We convert the STATION column to str type. This is required for VLEN NetCDF variables.
CastStationToStringCB ()
Convert STATION column to string type, filling any missing values with empty string
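The exported source of CastStationToStringCB is not shown above; a minimal sketch of the idea, assuming the same Callback pattern used throughout this handler (the actual implementation may differ):

# Minimal sketch only -- not the exported implementation.
class CastStationToStringSketchCB(Callback):
    "Convert STATION column to string type, filling any missing values with empty string."
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            if 'STATION' in tfm.dfs[k].columns:
                tfm.dfs[k]['STATION'] = tfm.dfs[k]['STATION'].astype('string').fillna('')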
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB()
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].dtypes)
Keys: dict_keys(['SEAWATER'])
Columns: SMP_ID int64
LAT float64
LON float64
TIME object
SMP_DEPTH float64
TOT_DEPTH float64
STATION string[python]
UNC float64
UNIT int64
DL object
AREA int64
SPECIES int64
BIO_GROUP int64
BODY_PART int64
SED_TYPE int64
VOL float64
SAL float64
TEMP float64
SAMP_MET int64
PREP_MET int64
COUNT_MET int64
VALUE float64
NUCLIDE int64
TOP float64
BOTTOM float64
dtype: object
We then remove columns containing only NaN values or ‘Not available’ (id=0 in MARIS lookup tables).
DropNAColumnsCB (na_value=0)
Drop variable containing only NaN or ‘Not available’ (id=0 in MARIS lookup tables).
class DropNAColumnsCB(Callback):
    "Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables)."
    def __init__(self, na_value=0): fc.store_attr()

    def isMarisNA(self, col):
        return len(col.unique()) == 1 and col.iloc[0] == self.na_value

    def dropMarisNA(self, df):
        na_cols = [col for col in df.columns if self.isMarisNA(df[col])]
        return df.drop(labels=na_cols, axis=1)

    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = tfm.dfs[k].dropna(axis=1, how='all')
            tfm.dfs[k] = self.dropMarisNA(tfm.dfs[k])
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB()
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'STATION', 'UNC', 'UNIT', 'DL', 'AREA',
'SAL', 'TEMP', 'VALUE', 'NUCLIDE'],
dtype='object')
Category-based NetCDF variables are encoded as integer values based on the MARIS lookup table dbo_detectlimit.xlsx. We recall that these lookup tables are included in the NetCDF file as custom enumeration types.
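For intuition only, here is how a lookup table can be attached to a NetCDF file as an enumeration type with the netCDF4 library. The type name, variable name and member values below are illustrative, not the ones produced by NetCDFEncoder or stored in dbo_detectlimit.xlsx:

import netCDF4 as nc
import numpy as np

with nc.Dataset('enum_example.nc', 'w') as ds:
    ds.createDimension('sample', 3)
    # Illustrative members only: the real ids come from the MARIS lookup table.
    dl_t = ds.createEnumType(np.uint8, 'dl_t',
                             {'Not applicable': 0, 'Detected value': 1, 'Detection limit': 2})
    dl = ds.createVariable('DL', dl_t, ('sample',))
    dl[:] = np.array([1, 1, 2], dtype=np.uint8)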
SanitizeDetectionLimitCB (fn_lut=<function <lambda>>, dl_name='DL')
Assign Detection Limit name to its id based on MARIS nomenclature.
class SanitizeDetectionLimitCB(Callback):
    "Assign Detection Limit name to its id based on MARIS nomenclature."
    def __init__(self,
                 fn_lut=dl_name_to_id,
                 dl_name='DL'):
        fc.store_attr()

    def __call__(self, tfm):
        lut = self.fn_lut()
        for k in tfm.dfs.keys():
            tfm.dfs[k][self.dl_name] = tfm.dfs[k][self.dl_name].replace(lut)
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB()
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['BIOTA'].columns)
print(f'{dfs_tfm["BIOTA"]["DL"].unique()}')
print(f'{dfs_tfm["BIOTA"].head()}')
Keys: dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'SMP_DEPTH', 'STATION', 'UNC', 'UNIT',
'DL', 'AREA', 'SPECIES', 'BIO_GROUP', 'BODY_PART', 'PREP_MET',
'COUNT_MET', 'VALUE', 'NUCLIDE'],
dtype='object')
[1 2]
SMP_ID LAT LON TIME SMP_DEPTH STATION \
603199 638133 57.25 12.083333 1986-01-31 00:00:00.000 0.0 RINGHA
603200 638133 57.25 12.083333 1986-01-31 00:00:00.000 0.0 RINGHA
603201 638134 57.25 12.083333 1986-02-28 00:00:00.000 0.0 RINGHA
603202 638134 57.25 12.083333 1986-02-28 00:00:00.000 0.0 RINGHA
603203 638134 57.25 12.083333 1986-02-28 00:00:00.000 0.0 RINGHA
UNC UNIT DL AREA SPECIES BIO_GROUP BODY_PART PREP_MET \
603199 0.25 5 1 2374 129 14 19 7
603200 0.45 5 1 2374 129 14 19 7
603201 0.52 5 1 2374 129 14 19 7
603202 0.49 5 1 2374 129 14 19 7
603203 0.78 5 1 2374 129 14 19 7
COUNT_MET VALUE NUCLIDE
603199 20 2.5 33
603200 20 4.5 9
603201 20 2.6 33
603202 20 4.9 9
603203 20 3.9 22
Recall that in NetCDF format, time needs to be encoded as an integer representing the number of seconds since a reference time. In our case we chose 1970-01-01 00:00:00.0, as defined in configs.ipynb.
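As a quick sanity check of that convention (plain pandas, independent of the marisco callbacks), the first BIOTA timestamp shown below encodes as expected:

import pandas as pd

t = pd.to_datetime('1986-01-31 00:00:00.000')
seconds = int((t - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s'))
print(seconds)  # 507513600, the TIME value of the first BIOTA rows below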
ParseTimeCB (time_name='TIME')
Parse time column from MARIS dump.
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB()
])
print(tfm()['BIOTA'])
SMP_ID LAT LON TIME SMP_DEPTH STATION UNC \
603199 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.25000
603200 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.45000
603201 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.52000
603202 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.49000
603203 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.78000
... ... ... ... ... ... ... ...
965909 639100 63.050000 21.616667 518572800 0.0 VAASA 0.01440
965910 639100 63.050000 21.616667 518572800 0.0 VAASA NaN
965911 639137 63.066667 21.400000 1114732800 0.0 VAASA 1.46500
965912 639137 63.066667 21.400000 1114732800 0.0 VAASA 0.00204
965913 639137 63.066667 21.400000 1114732800 0.0 VAASA 5.00000
UNIT DL AREA SPECIES BIO_GROUP BODY_PART PREP_MET COUNT_MET \
603199 5 1 2374 129 14 19 7 20
603200 5 1 2374 129 14 19 7 20
603201 5 1 2374 129 14 19 7 20
603202 5 1 2374 129 14 19 7 20
603203 5 1 2374 129 14 19 7 20
... ... .. ... ... ... ... ... ...
965909 5 1 9999 269 4 52 12 9
965910 5 1 9999 269 4 52 12 9
965911 5 1 9999 269 4 52 0 20
965912 5 1 9999 269 4 52 0 8
965913 5 1 9999 269 4 52 0 20
VALUE NUCLIDE
603199 2.500 33
603200 4.500 9
603201 2.600 33
603202 4.900 9
603203 3.900 22
... ... ...
965909 0.072 12
965910 0.015 11
965911 29.300 33
965912 0.017 12
965913 113.000 4
[14872 rows x 17 columns]
We ensure that coordinates are within the valid range.
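SanitizeLonLatCB takes care of this (and, as the transformer logs further down show, it also converts ',' decimal separators to '.'). A minimal sketch of the range check alone, assuming the renamed LON/LAT columns:

import pandas as pd

def drop_invalid_coords(df, lon='LON', lat='LAT'):
    "Keep only rows whose coordinates fall within the valid geographic range."
    mask = df[lon].between(-180, 180) & df[lat].between(-90, 90)
    return df[mask]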
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB()
])
dfs_test = tfm()
dfs_test['BIOTA']
SMP_ID | LAT | LON | TIME | SMP_DEPTH | STATION | UNC | UNIT | DL | AREA | SPECIES | BIO_GROUP | BODY_PART | PREP_MET | COUNT_MET | VALUE | NUCLIDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
603199 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.25000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.500 | 33 |
603200 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.45000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.500 | 9 |
603201 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.52000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.600 | 33 |
603202 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.49000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.900 | 9 |
603203 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.78000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 3.900 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
965909 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | 0.01440 | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.072 | 12 |
965910 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | NaN | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.015 | 11 |
965911 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 1.46500 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 29.300 | 33 |
965912 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 0.00204 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 8 | 0.017 | 12 |
965913 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 5.00000 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 113.000 | 4 |
14872 rows × 17 columns
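We finally give each group a unique integer index, exposed as an ID column, via UniqueIndexCB. A minimal sketch of that idea (the exported callback may differ) is simply:

# Minimal sketch: reset the index and expose it as an 'ID' column, as in the output below.
def add_unique_index(df):
    df = df.reset_index(drop=True)
    df.insert(0, 'ID', df.index)
    return df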
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
UniqueIndexCB()
])
dfs_test = tfm()
dfs_test['BIOTA']
ID | SMP_ID | LAT | LON | TIME | SMP_DEPTH | STATION | UNC | UNIT | DL | AREA | SPECIES | BIO_GROUP | BODY_PART | PREP_MET | COUNT_MET | VALUE | NUCLIDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.25000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.500 | 33 |
1 | 1 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.45000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.500 | 9 |
2 | 2 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.52000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.600 | 33 |
3 | 3 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.49000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.900 | 9 |
4 | 4 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.78000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 3.900 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14867 | 14867 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | 0.01440 | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.072 | 12 |
14868 | 14868 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | NaN | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.015 | 11 |
14869 | 14869 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 1.46500 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 29.300 | 33 |
14870 | 14870 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 0.00204 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 8 | 0.017 | 12 |
14871 | 14871 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 5.00000 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 113.000 | 4 |
14872 rows × 18 columns
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
UniqueIndexCB()
])
dfs_test = tfm()
dfs_test['BIOTA']
ID | SMP_ID | LAT | LON | TIME | SMP_DEPTH | STATION | UNC | UNIT | DL | AREA | SPECIES | BIO_GROUP | BODY_PART | PREP_MET | COUNT_MET | VALUE | NUCLIDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.25000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.500 | 33 |
1 | 1 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.45000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.500 | 9 |
2 | 2 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.52000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.600 | 33 |
3 | 3 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.49000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.900 | 9 |
4 | 4 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.78000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 3.900 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14867 | 14867 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | 0.01440 | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.072 | 12 |
14868 | 14868 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | NaN | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.015 | 11 |
14869 | 14869 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 1.46500 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 29.300 | 33 |
14870 | 14870 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 0.00204 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 8 | 0.017 | 12 |
14871 | 14871 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 5.00000 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 113.000 | 4 |
14872 rows × 18 columns
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
UniqueIndexCB()
])
dfs_tfm = tfm()
tfm.logs
['Select columns of interest.',
'Renaming variables to MARIS standard names.',
'Convert STATION column to string type, filling any missing values with empty string',
"Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables).",
'Assign Detection Limit name to its id based on MARIS nomenclature.',
'Parse time column from MARIS dump.',
'Encode time as seconds since epoch.',
'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
'Set unique index for each group.']
get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides', 'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments', 'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality > Ocean Contaminants', 'Earth Science > Biological Classification > Animals/Vertebrates > Fish', 'Earth Science > Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks', 'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans', 'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'])
Retrieve global attributes from MARIS dump.
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
'Earth Science > Oceans > Water Quality > Ocean Contaminants',
'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
    ])()
{'geospatial_lat_min': '137.8475',
'geospatial_lat_max': '31.2575',
'geospatial_lon_min': '88.9988888888889',
'geospatial_lon_max': '-42.3086111111111',
'geospatial_bounds': 'POLYGON ((88.9988888888889 -42.3086111111111, 137.8475 -42.3086111111111, 137.8475 31.2575, 88.9988888888889 31.2575, 88.9988888888889 -42.3086111111111))',
'time_coverage_start': '1996-12-20T00:00:00',
'time_coverage_end': '1997-02-11T00:00:00',
'id': '3W354SQG',
'title': 'Radioactivity Monitoring of the Irish Marine Environment 1991 and 1992',
'summary': '',
'creator_name': '[{"creatorType": "author", "firstName": "A.", "lastName": "McGarry"}, {"creatorType": "author", "firstName": "S.", "lastName": "Lyons"}, {"creatorType": "author", "firstName": "C.", "lastName": "McEnri"}, {"creatorType": "author", "firstName": "T.", "lastName": "Ryan"}, {"creatorType": "author", "firstName": "M.", "lastName": "O\'Colmain"}, {"creatorType": "author", "firstName": "J.D.", "lastName": "Cunningham"}]',
'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
'publisher_postprocess_logs': "Select columns of interest., Renaming variables to MARIS standard names., Convert STATION column to string type, filling any missing values with empty string, Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables)., Assign Detection Limit name to its id based on MARIS nomenclature., Parse time column from MARIS dump., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator., Set unique index for each group."}
encode (fname_in:str, dir_dest:str, **kwargs)
Encode MARIS dump to NetCDF.
| Type | Details |
---|---|---|
fname_in | str | Path to the MARIS dump data in CSV format |
dir_dest | str | Path to the folder where the NetCDF output will be saved |
kwargs | VAR_KEYWORD | |
def encode(
    fname_in: str, # Path to the MARIS dump data in CSV format
    dir_dest: str, # Path to the folder where the NetCDF output will be saved
    **kwargs # Additional keyword arguments
    ):
    "Encode MARIS dump to NetCDF."
    dataloader = DataLoader(fname_in)
    ref_ids = kwargs.get('ref_ids')
    if ref_ids is None:
        ref_ids = dataloader.df.ref_id.unique()
    print('Encoding ...')
    for ref_id in tqdm(ref_ids, leave=False):
        # if ref_id == 736: continue
        dfs = dataloader(ref_id=ref_id)
        print(get_fname(dfs))
        tfm = Transformer(dfs, cbs=[
            SelectColumnsCB(cois_renaming_rules),
            RenameColumnsCB(cois_renaming_rules),
            CastStationToStringCB(),
            DropNAColumnsCB(),
            SanitizeDetectionLimitCB(),
            ParseTimeCB(),
            EncodeTimeCB(),
            SanitizeLonLatCB(),
            UniqueIndexCB(),
            ])
        tfm()
        encoder = NetCDFEncoder(tfm.dfs,
                                dest_fname=Path(dir_dest) / get_fname(dfs),
                                global_attrs=get_attrs(tfm, zotero_key=get_zotero_key(dfs), kw=kw),
                                verbose=kwargs.get('verbose', False)
                                )
        encoder.encode()
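For instance, encoding a single reference (or the entire dump when ref_ids is omitted) can be invoked as follows; ref_ids and verbose are the optional keyword arguments read by encode above:

encode(fname_in, dir_dest, ref_ids=[106], verbose=True)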