Exported source
# fname_in = Path().home() / 'pro/data/maris/2024-11-20 MARIS_QA_shapetype_id=1.txt'
fname_in = Path().home() / 'pro/data/maris/2025-06-03 MARIS_QA_shapetype_id = 1.txt'
dir_dest = '../../_data/output/dump'
This notebook contains a data pipeline, known as a “handler” in Marisco terminology, that converts the master MARIS database dump into NetCDF format. It enables batch encoding of all legacy datasets into NetCDF.
The key functions of this handler are described below. The result is a set of NetCDF files, one for each unique reference ID in the input data.
For new MARIS users, please refer to Understanding MARIS Data Formats (NetCDF and Open Refine) for detailed information.
The present notebook is an instance of Literate Programming in the sense that it is a narrative including code snippets interspersed with explanations. When a function or a class needs to be exported into a dedicated Python module (in our case marisco/handlers/helcom.py), the code snippet is added to the module using the #| exports directive provided by the wonderful nbdev library.
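For example, a notebook cell marked as below (the function name is just a placeholder) is copied verbatim into the generated module and its source is shown in the rendered documentation:

#| exports
def a_placeholder_function():
    "Cells starting with the `#| exports` directive are exported by nbdev."
    pass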
fname_in: path to the file containing the MARIS dump data in CSV format.
dir_dest: path to the folder where the NetCDF output will be saved.
sample_id | area_id | areaname | samptype_id | samptype | ref_id | displaytext | zoterourl | ref_note | datbase | ... | profile_id | sampnote | ref_fulltext | ref_yearpub | ref_sampleTypes | LongLat | shiftedcoordinates | shiftedlong | shiftedlat | id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18810 | 1904 | Indian Ocean | 1 | Seawater | 97 | ASPAMARD, 2004 | https://www.zotero.org/groups/2432820/maris/it... | P. Scotto(1975-2001);P. Morris (2003-2014) | ASPAMARD | ... | NaN | Because the date does not have month & day, an... | ASPAMARD, 2004. Asia-Pacific Marine Radioactiv... | 2004 | 1,3 | 111.983,-25.05 | 0xE6100000010CCDCCCCCCCC0C39C0F4FDD478E9FE5B40 | 111.983333 | -25.050000 | 1 |
1 | 63633 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 61.533,-52.017 | 0xE6100000010CEEEBC03923024AC0787AA52C43C44E40 | 61.533333 | -52.016667 | 2 |
2 | 63633 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 61.533,-52.017 | 0xE6100000010CEEEBC03923024AC0787AA52C43C44E40 | 61.533333 | -52.016667 | 3 |
3 | 63635 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 57.483,-44.483 | 0xE6100000010C12143FC6DC3D46C012143FC6DCBD4C40 | 57.483333 | -44.483333 | 4 |
4 | 63635 | 1904 | Indian Ocean | 1 | Seawater | 99 | Aoyama and Hirose, 2004 | https://www.zotero.org/groups/2432820/maris/it... | NaN | HAM 2008 | ... | NaN | Author: Y.Bourlat, et.al. Unknown latitude and... | Aoyama, M., Hirose, K., 2004. HAM 2008 - Histo... | 2004 | 1 | 57.483,-44.483 | 0xE6100000010C12143FC6DC3D46C012143FC6DCBD4C40 | 57.483333 | -44.483333 | 5 |
5 rows × 80 columns
Below is a utility class to load a specific MARIS dump dataset, optionally filtered by its ref_id.
DataLoader (fname:str, exclude_ref_id:Optional[List[int]]=[9999])
Load specific MARIS dataset through its ref_id.
| Type | Default | Details |
---|---|---|---|
fname | str | | Path to the MARIS global dump file |
exclude_ref_id | Optional | [9999] | List of ref_id values to exclude from the dataframe |
class DataLoader:
    "Load specific MARIS dataset through its ref_id."
    LUT = {
        'Biota': 'BIOTA',
        'Seawater': 'SEAWATER',
        'Sediment': 'SEDIMENT',
        'Suspended matter': 'SUSPENDED_MATTER'
    }

    def __init__(self,
                 fname: str, # Path to the MARIS global dump file
                 exclude_ref_id: Optional[List[int]]=[9999] # ref_id values to exclude from the dataframe
                 ):
        fc.store_attr()
        self.df = self._load_data()

    def _load_data(self):
        "Read the tab-separated global dump and drop excluded ref_ids."
        df = pd.read_csv(self.fname, sep='\t', encoding='utf-8', low_memory=False)
        return df[~df.ref_id.isin(self.exclude_ref_id)] if self.exclude_ref_id else df

    def __call__(self,
                 ref_id: int # Reference ID of interest
                 ) -> dict: # Dictionary of dataframes
        df = self.df[self.df.ref_id == ref_id].copy() if ref_id else self.df.copy()
        return {self.LUT[name]: grp for name, grp in df.groupby('samptype') if name in self.LUT}
get_zotero_key (dfs)
Retrieve Zotero key from MARIS dump.
get_fname (dfs)
Get NetCDF filename.
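The source of these two helpers is not reproduced here. Purely as a hypothetical illustration (assuming the Zotero key is taken from the zoterourl column and the filename is derived from the dataset's ref_id), they could look like:

# Hypothetical sketches only -- the exported implementations may differ.
def get_zotero_key_sketch(dfs):
    "Illustrative: take the last path segment of the zoterourl column as the Zotero key."
    df = dfs[next(iter(dfs))]
    return df.zoterourl.iloc[0].rstrip('/').split('/')[-1]

def get_fname_sketch(dfs):
    "Illustrative: name the NetCDF file after the dataset ref_id."
    ref_id = dfs[next(iter(dfs))].ref_id.iloc[0]
    return f'{ref_id}.nc'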
Below is a quick overview of the MARIS dump data structure.
dataloader = DataLoader(fname_in)
ref_id = 106 # Some other ref_id examples: OSPAR: 191, HELCOM: 100, 717 (only seawater)
dfs = dataloader(ref_id=ref_id)
print(f'keys: {dfs.keys()}')
dfs['SEAWATER'].head()
keys: dict_keys(['SEAWATER'])
sample_id | area_id | areaname | samptype_id | samptype | ref_id | displaytext | zoterourl | ref_note | datbase | ... | profile_id | sampnote | ref_fulltext | ref_yearpub | ref_sampleTypes | LongLat | shiftedcoordinates | shiftedlong | shiftedlat | id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5829 | 73703 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 92.983,-0.008 | 0xE6100000010CB289C4EB97DB7FBFA52C431CEB3E5740 | 92.983056 | -0.007778 | 5830 |
5830 | 73705 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 96.596,-4.029 | 0xE6100000010CAF5A99F04B1D10C0D95F764F1E265840 | 96.595556 | -4.028611 | 5831 |
5831 | 73711 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 101.991,-9.997 | 0xE6100000010C91D5AD9E93FE23C08195438B6C7F5940 | 101.990556 | -9.997222 | 5832 |
5832 | 73715 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 114.394,-18.496 | 0xE6100000010C575BB1BFEC7E32C0F0A7C64B37995C40 | 114.393889 | -18.495833 | 5833 |
5833 | 73717 | 1904 | Indian Ocean | 1 | Seawater | 106 | Yamada et al., 2006 | https://www.zotero.org/groups/2432820/maris/it... | NaN | NaN | ... | NaN | NaN | Yamada, M., Zheng, J., Wang, Z.-L., 2006. 137C... | 2006 | 1 | 112.552,-23.066 | 0xE6100000010C54742497FF1037C017D9CEF753235C40 | 112.551667 | -23.066389 | 5834 |
5 rows × 80 columns
The cois_renaming_rules dictionary below maps the dump columns of interest to their MARIS standard names; it is used both to select and to rename columns in the transformer pipeline:
cois_renaming_rules = {
    'sample_id': 'SMP_ID',
    'latitude': 'LAT',
    'longitude': 'LON',
    'begperiod': 'TIME',
    'sampdepth': 'SMP_DEPTH',
    'totdepth': 'TOT_DEPTH',
    'station': 'STATION',
    'uncertaint': 'UNC',
    'unit_id': 'UNIT',
    'detection': 'DL',
    'area_id': 'AREA',
    'species_id': 'SPECIES',
    'biogroup_id': 'BIO_GROUP',
    'bodypar_id': 'BODY_PART',
    'sedtype_id': 'SED_TYPE',
    'volume': 'VOL',
    'salinity': 'SAL',
    'temperatur': 'TEMP',
    'sampmet_id': 'SAMP_MET',
    'prepmet_id': 'PREP_MET',
    'counmet_id': 'COUNT_MET',
    'activity': 'VALUE',
    'nuclide_id': 'NUCLIDE',
    'sliceup': 'TOP',
    'slicedown': 'BOTTOM'
}
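SelectColumnsCB and RenameColumnsCB are provided by marisco and their source is not reproduced here; as a rough mental model only (assuming the same Callback pattern as DropNAColumnsCB shown further down), they behave roughly as follows:

# Rough sketches only -- not the exported marisco implementations.
class SelectColumnsSketchCB(Callback):
    "Select columns of interest."
    def __init__(self, cois): fc.store_attr()
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            cols = [c for c in self.cois if c in tfm.dfs[k].columns]
            tfm.dfs[k] = tfm.dfs[k][cols]

class RenameColumnsSketchCB(Callback):
    "Renaming variables to MARIS standard names."
    def __init__(self, renaming_rules): fc.store_attr()
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = tfm.dfs[k].rename(columns=self.renaming_rules)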
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules)
])
print('Keys:', tfm().keys())
print('Columns:', tfm()['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['sample_id', 'latitude', 'longitude', 'begperiod', 'sampdepth',
'totdepth', 'station', 'uncertaint', 'unit_id', 'detection', 'area_id',
'species_id', 'biogroup_id', 'bodypar_id', 'sedtype_id', 'volume',
'salinity', 'temperatur', 'sampmet_id', 'prepmet_id', 'counmet_id',
'activity', 'nuclide_id', 'sliceup', 'slicedown'],
dtype='object')
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules)
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'SMP_DEPTH', 'TOT_DEPTH', 'STATION',
'UNC', 'UNIT', 'DL', 'AREA', 'SPECIES', 'BIO_GROUP', 'BODY_PART',
'SED_TYPE', 'VOL', 'SAL', 'TEMP', 'SAMP_MET', 'PREP_MET', 'COUNT_MET',
'VALUE', 'NUCLIDE', 'TOP', 'BOTTOM'],
dtype='object')
We convert the STATION column to str type. This is required for VLEN NetCDF variables.
CastStationToStringCB ()
Convert STATION column to string type, filling any missing values with empty string
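The exported source of CastStationToStringCB is not shown above; a minimal sketch of the idea, assuming the same Callback pattern used throughout this handler (the actual implementation may differ):

# Minimal sketch only -- not the exported implementation.
class CastStationToStringSketchCB(Callback):
    "Convert STATION column to string type, filling any missing values with empty string."
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            if 'STATION' in tfm.dfs[k].columns:
                tfm.dfs[k]['STATION'] = tfm.dfs[k]['STATION'].astype('string').fillna('')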
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB()
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].dtypes)
Keys: dict_keys(['SEAWATER'])
Columns: SMP_ID int64
LAT float64
LON float64
TIME object
SMP_DEPTH float64
TOT_DEPTH float64
STATION string[python]
UNC float64
UNIT int64
DL object
AREA int64
SPECIES int64
BIO_GROUP int64
BODY_PART int64
SED_TYPE int64
VOL float64
SAL float64
TEMP float64
SAMP_MET int64
PREP_MET int64
COUNT_MET int64
VALUE float64
NUCLIDE int64
TOP float64
BOTTOM float64
dtype: object
We then remove columns containing only NaN values or ‘Not available’ (id=0 in MARIS lookup tables).
DropNAColumnsCB (na_value=0)
Drop variable containing only NaN or ‘Not available’ (id=0 in MARIS lookup tables).
class DropNAColumnsCB(Callback):
    "Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables)."
    def __init__(self, na_value=0): fc.store_attr()

    def isMarisNA(self, col):
        return len(col.unique()) == 1 and col.iloc[0] == self.na_value

    def dropMarisNA(self, df):
        na_cols = [col for col in df.columns if self.isMarisNA(df[col])]
        return df.drop(labels=na_cols, axis=1)

    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = tfm.dfs[k].dropna(axis=1, how='all')
            tfm.dfs[k] = self.dropMarisNA(tfm.dfs[k])
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB()
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['SEAWATER'].columns)
Keys: dict_keys(['SEAWATER'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'STATION', 'UNC', 'UNIT', 'DL', 'AREA',
'SAL', 'TEMP', 'VALUE', 'NUCLIDE'],
dtype='object')
Category-based NetCDF variables are encoded as integer values based on the MARIS lookup table dbo_detectlimit.xlsx. We recall that these lookup tables are included in the NetCDF file as custom enumeration types.
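For intuition only, here is how a lookup table can be attached to a NetCDF file as an enumeration type with the netCDF4 library. The type name, variable name and member values below are illustrative, not the ones produced by NetCDFEncoder or stored in dbo_detectlimit.xlsx:

import netCDF4 as nc
import numpy as np

with nc.Dataset('enum_example.nc', 'w') as ds:
    ds.createDimension('sample', 3)
    # Illustrative members only: the real ids come from the MARIS lookup table.
    dl_t = ds.createEnumType(np.uint8, 'dl_t',
                             {'Not applicable': 0, 'Detected value': 1, 'Detection limit': 2})
    dl = ds.createVariable('DL', dl_t, ('sample',))
    dl[:] = np.array([1, 1, 2], dtype=np.uint8)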
SanitizeDetectionLimitCB (fn_lut=<function <lambda>>, dl_name='DL')
Assign Detection Limit name to its id based on MARIS nomenclature.
class SanitizeDetectionLimitCB(Callback):
    "Assign Detection Limit name to its id based on MARIS nomenclature."
    def __init__(self,
                 fn_lut=dl_name_to_id,
                 dl_name='DL'):
        fc.store_attr()

    def __call__(self, tfm):
        lut = self.fn_lut()
        for k in tfm.dfs.keys():
            tfm.dfs[k][self.dl_name] = tfm.dfs[k][self.dl_name].replace(lut)
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB()
])
dfs_tfm = tfm()
print('Keys:', dfs_tfm.keys())
print('Columns:', dfs_tfm['BIOTA'].columns)
print(f'{dfs_tfm["BIOTA"]["DL"].unique()}')
print(f'{dfs_tfm["BIOTA"].head()}')
Keys: dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT'])
Columns: Index(['SMP_ID', 'LAT', 'LON', 'TIME', 'SMP_DEPTH', 'STATION', 'UNC', 'UNIT',
'DL', 'AREA', 'SPECIES', 'BIO_GROUP', 'BODY_PART', 'PREP_MET',
'COUNT_MET', 'VALUE', 'NUCLIDE'],
dtype='object')
[1 2]
SMP_ID LAT LON TIME SMP_DEPTH STATION \
603199 638133 57.25 12.083333 1986-01-31 00:00:00.000 0.0 RINGHA
603200 638133 57.25 12.083333 1986-01-31 00:00:00.000 0.0 RINGHA
603201 638134 57.25 12.083333 1986-02-28 00:00:00.000 0.0 RINGHA
603202 638134 57.25 12.083333 1986-02-28 00:00:00.000 0.0 RINGHA
603203 638134 57.25 12.083333 1986-02-28 00:00:00.000 0.0 RINGHA
UNC UNIT DL AREA SPECIES BIO_GROUP BODY_PART PREP_MET \
603199 0.25 5 1 2374 129 14 19 7
603200 0.45 5 1 2374 129 14 19 7
603201 0.52 5 1 2374 129 14 19 7
603202 0.49 5 1 2374 129 14 19 7
603203 0.78 5 1 2374 129 14 19 7
COUNT_MET VALUE NUCLIDE
603199 20 2.5 33
603200 20 4.5 9
603201 20 2.6 33
603202 20 4.9 9
603203 20 3.9 22
Recall that in NetCDF format, time needs to be encoded as an integer representing the number of seconds since a reference time. In our case we chose 1970-01-01 00:00:00.0, as defined in configs.ipynb.
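As a quick sanity check of that convention (plain pandas, independent of the marisco callbacks), the first BIOTA timestamp shown below encodes as expected:

import pandas as pd

t = pd.to_datetime('1986-01-31 00:00:00.000')
seconds = int((t - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s'))
print(seconds)  # 507513600, the TIME value of the first BIOTA rows below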
ParseTimeCB (time_name='TIME')
Parse time column from MARIS dump.
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB()
])
print(tfm()['BIOTA'])
SMP_ID LAT LON TIME SMP_DEPTH STATION UNC \
603199 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.25000
603200 638133 57.250000 12.083333 507513600 0.0 RINGHA 0.45000
603201 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.52000
603202 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.49000
603203 638134 57.250000 12.083333 509932800 0.0 RINGHA 0.78000
... ... ... ... ... ... ... ...
965909 639100 63.050000 21.616667 518572800 0.0 VAASA 0.01440
965910 639100 63.050000 21.616667 518572800 0.0 VAASA NaN
965911 639137 63.066667 21.400000 1114732800 0.0 VAASA 1.46500
965912 639137 63.066667 21.400000 1114732800 0.0 VAASA 0.00204
965913 639137 63.066667 21.400000 1114732800 0.0 VAASA 5.00000
UNIT DL AREA SPECIES BIO_GROUP BODY_PART PREP_MET COUNT_MET \
603199 5 1 2374 129 14 19 7 20
603200 5 1 2374 129 14 19 7 20
603201 5 1 2374 129 14 19 7 20
603202 5 1 2374 129 14 19 7 20
603203 5 1 2374 129 14 19 7 20
... ... .. ... ... ... ... ... ...
965909 5 1 9999 269 4 52 12 9
965910 5 1 9999 269 4 52 12 9
965911 5 1 9999 269 4 52 0 20
965912 5 1 9999 269 4 52 0 8
965913 5 1 9999 269 4 52 0 20
VALUE NUCLIDE
603199 2.500 33
603200 4.500 9
603201 2.600 33
603202 4.900 9
603203 3.900 22
... ... ...
965909 0.072 12
965910 0.015 11
965911 29.300 33
965912 0.017 12
965913 113.000 4
[14872 rows x 17 columns]
We ensure that coordinates are within the valid range.
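SanitizeLonLatCB takes care of this (and, as the transformer logs further down show, it also converts ',' decimal separators to '.'). A minimal sketch of the range check alone, assuming the renamed LON/LAT columns:

import pandas as pd

def drop_invalid_coords(df, lon='LON', lat='LAT'):
    "Keep only rows whose coordinates fall within the valid geographic range."
    mask = df[lon].between(-180, 180) & df[lat].between(-90, 90)
    return df[mask]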
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB()
])
dfs_test = tfm()
dfs_test['BIOTA']
SMP_ID | LAT | LON | TIME | SMP_DEPTH | STATION | UNC | UNIT | DL | AREA | SPECIES | BIO_GROUP | BODY_PART | PREP_MET | COUNT_MET | VALUE | NUCLIDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
603199 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.25000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.500 | 33 |
603200 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.45000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.500 | 9 |
603201 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.52000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.600 | 33 |
603202 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.49000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.900 | 9 |
603203 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.78000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 3.900 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
965909 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | 0.01440 | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.072 | 12 |
965910 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | NaN | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.015 | 11 |
965911 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 1.46500 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 29.300 | 33 |
965912 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 0.00204 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 8 | 0.017 | 12 |
965913 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 5.00000 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 113.000 | 4 |
14872 rows × 17 columns
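We finally give each group a unique integer index, exposed as an ID column, via UniqueIndexCB. A minimal sketch of that idea (the exported callback may differ) is simply:

# Minimal sketch: reset the index and expose it as an 'ID' column, as in the output below.
def add_unique_index(df):
    df = df.reset_index(drop=True)
    df.insert(0, 'ID', df.index)
    return df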
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
UniqueIndexCB()
])
dfs_test = tfm()
dfs_test['BIOTA']
ID | SMP_ID | LAT | LON | TIME | SMP_DEPTH | STATION | UNC | UNIT | DL | AREA | SPECIES | BIO_GROUP | BODY_PART | PREP_MET | COUNT_MET | VALUE | NUCLIDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.25000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.500 | 33 |
1 | 1 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.45000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.500 | 9 |
2 | 2 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.52000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.600 | 33 |
3 | 3 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.49000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.900 | 9 |
4 | 4 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.78000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 3.900 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14867 | 14867 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | 0.01440 | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.072 | 12 |
14868 | 14868 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | NaN | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.015 | 11 |
14869 | 14869 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 1.46500 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 29.300 | 33 |
14870 | 14870 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 0.00204 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 8 | 0.017 | 12 |
14871 | 14871 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 5.00000 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 113.000 | 4 |
14872 rows × 18 columns
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
UniqueIndexCB()
])
dfs_test = tfm()
dfs_test['BIOTA']
ID | SMP_ID | LAT | LON | TIME | SMP_DEPTH | STATION | UNC | UNIT | DL | AREA | SPECIES | BIO_GROUP | BODY_PART | PREP_MET | COUNT_MET | VALUE | NUCLIDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.25000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.500 | 33 |
1 | 1 | 638133 | 57.250000 | 12.083333 | 507513600 | 0.0 | RINGHA | 0.45000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.500 | 9 |
2 | 2 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.52000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 2.600 | 33 |
3 | 3 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.49000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 4.900 | 9 |
4 | 4 | 638134 | 57.250000 | 12.083333 | 509932800 | 0.0 | RINGHA | 0.78000 | 5 | 1 | 2374 | 129 | 14 | 19 | 7 | 20 | 3.900 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14867 | 14867 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | 0.01440 | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.072 | 12 |
14868 | 14868 | 639100 | 63.050000 | 21.616667 | 518572800 | 0.0 | VAASA | NaN | 5 | 1 | 9999 | 269 | 4 | 52 | 12 | 9 | 0.015 | 11 |
14869 | 14869 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 1.46500 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 29.300 | 33 |
14870 | 14870 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 0.00204 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 8 | 0.017 | 12 |
14871 | 14871 | 639137 | 63.066667 | 21.400000 | 1114732800 | 0.0 | VAASA | 5.00000 | 5 | 1 | 9999 | 269 | 4 | 52 | 0 | 20 | 113.000 | 4 |
14872 rows × 18 columns
dfs = dataloader(ref_id=ref_id)
tfm = Transformer(dfs, cbs=[
SelectColumnsCB(cois_renaming_rules),
RenameColumnsCB(cois_renaming_rules),
CastStationToStringCB(),
DropNAColumnsCB(),
SanitizeDetectionLimitCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
UniqueIndexCB()
])
dfs_tfm = tfm()
tfm.logs
['Select columns of interest.',
'Renaming variables to MARIS standard names.',
'Convert STATION column to string type, filling any missing values with empty string',
"Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables).",
'Assign Detection Limit name to its id based on MARIS nomenclature.',
'Parse time column from MARIS dump.',
'Encode time as seconds since epoch.',
'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
'Set unique index for each group.']
get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides', 'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments', 'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality > Ocean Contaminants', 'Earth Science > Biological Classification > Animals/Vertebrates > Fish', 'Earth Science > Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks', 'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans', 'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'])
Retrieve global attributes from MARIS dump.
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
'Earth Science > Oceans > Water Quality > Ocean Contaminants',
'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
    ])()
{'geospatial_lat_min': '137.8475',
'geospatial_lat_max': '31.2575',
'geospatial_lon_min': '88.9988888888889',
'geospatial_lon_max': '-42.3086111111111',
'geospatial_bounds': 'POLYGON ((88.9988888888889 -42.3086111111111, 137.8475 -42.3086111111111, 137.8475 31.2575, 88.9988888888889 31.2575, 88.9988888888889 -42.3086111111111))',
'time_coverage_start': '1996-12-20T00:00:00',
'time_coverage_end': '1997-02-11T00:00:00',
'id': '3W354SQG',
'title': 'Radioactivity Monitoring of the Irish Marine Environment 1991 and 1992',
'summary': '',
'creator_name': '[{"creatorType": "author", "firstName": "A.", "lastName": "McGarry"}, {"creatorType": "author", "firstName": "S.", "lastName": "Lyons"}, {"creatorType": "author", "firstName": "C.", "lastName": "McEnri"}, {"creatorType": "author", "firstName": "T.", "lastName": "Ryan"}, {"creatorType": "author", "firstName": "M.", "lastName": "O\'Colmain"}, {"creatorType": "author", "firstName": "J.D.", "lastName": "Cunningham"}]',
'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
'publisher_postprocess_logs': "Select columns of interest., Renaming variables to MARIS standard names., Convert STATION column to string type, filling any missing values with empty string, Drop variable containing only NaN or 'Not available' (id=0 in MARIS lookup tables)., Assign Detection Limit name to its id based on MARIS nomenclature., Parse time column from MARIS dump., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator., Set unique index for each group."}
encode (fname_in:str, dir_dest:str, **kwargs)
Encode MARIS dump to NetCDF.
| Type | Details |
---|---|---|
fname_in | str | Path to the MARIS dump data in CSV format |
dir_dest | str | Path to the folder where the NetCDF output will be saved |
kwargs | VAR_KEYWORD | |
def encode(
    fname_in: str, # Path to the MARIS dump data in CSV format
    dir_dest: str, # Path to the folder where the NetCDF output will be saved
    **kwargs # Additional keyword arguments
    ):
    "Encode MARIS dump to NetCDF."
    dataloader = DataLoader(fname_in)
    ref_ids = kwargs.get('ref_ids')
    if ref_ids is None:
        ref_ids = dataloader.df.ref_id.unique()
    print('Encoding ...')
    for ref_id in tqdm(ref_ids, leave=False):
        # if ref_id == 736: continue
        dfs = dataloader(ref_id=ref_id)
        print(get_fname(dfs))
        tfm = Transformer(dfs, cbs=[
            SelectColumnsCB(cois_renaming_rules),
            RenameColumnsCB(cois_renaming_rules),
            CastStationToStringCB(),
            DropNAColumnsCB(),
            SanitizeDetectionLimitCB(),
            ParseTimeCB(),
            EncodeTimeCB(),
            SanitizeLonLatCB(),
            UniqueIndexCB(),
            ])
        tfm()
        encoder = NetCDFEncoder(tfm.dfs,
                                dest_fname=Path(dir_dest) / get_fname(dfs),
                                global_attrs=get_attrs(tfm, zotero_key=get_zotero_key(dfs), kw=kw),
                                verbose=kwargs.get('verbose', False)
                                )
        encoder.encode()
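For instance, encoding a single reference (or the entire dump when ref_ids is omitted) can be invoked as follows; ref_ids and verbose are the optional keyword arguments read by encode above:

encode(fname_in, dir_dest, ref_ids=[106], verbose=True)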