Exported source
fname_in = '../../_data/geotraces/GEOTRACES_IDP2021_v2/seawater/ascii/GEOTRACES_IDP2021_Seawater_Discrete_Sample_Data_v2.csv'
fname_out = '../../_data/output/190-geotraces-2021.nc'
zotero_key = '97UIMEXN'
This data pipeline, known as a “handler” in Marisco terminology, is designed to clean, standardize, and encode the BODC Geotraces dataset into MARIS NetCDF format. The handler processes Geotraces data, applying various transformations and lookups to align it with MARIS data standards.
Key functions of this handler:
NetCDF format compatible with MARIS requirements
This handler is a crucial component in the Marisco data processing workflow, ensuring Geotraces data is properly integrated into the MARIS database.
For new MARIS users, please refer to Understanding MARIS Data Formats (NetCDF and Open Refine) for detailed information.
This notebook is intended as an instance of Literate Programming: a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case marisco/handlers/geotraces.py), the code snippet is added to the module using #| exports, as provided by the wonderful nbdev library.
fname_in: path to the Geotraces data file in CSV format. The path can be defined as a relative path.
fname_out: path and filename for the NetCDF output. The path can be defined as a relative path.
Zotero key: used to retrieve attributes related to the dataset from Zotero. The MARIS datasets include a library available on Zotero.
df shape: (105417, 1188)
Cruise | Station:METAVAR:INDEXED_TEXT | Type | yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | Operator's Cruise Name:METAVAR:INDEXED_TEXT | Ship Name:METAVAR:INDEXED_TEXT | Period:METAVAR:INDEXED_TEXT | ... | QV:SEADATANET.581 | Co_CELL_CONC_BOTTLE [amol/cell] | QV:SEADATANET.582 | Ni_CELL_CONC_BOTTLE [amol/cell] | QV:SEADATANET.583 | Cu_CELL_CONC_BOTTLE [amol/cell] | QV:SEADATANET.584 | Zn_CELL_CONC_BOTTLE [amol/cell] | QV:SEADATANET.585 | QV:ODV:SAMPLE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GA01 | 0 | B | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | GEOVIDE | Pourquoi pas? | 15/05/2014 - 30/06/2014 | ... | 9 | NaN | 9 | NaN | 9 | NaN | 9 | NaN | 9 | 1 |
1 | GA01 | 0 | B | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | GEOVIDE | Pourquoi pas? | 15/05/2014 - 30/06/2014 | ... | 9 | NaN | 9 | NaN | 9 | NaN | 9 | NaN | 9 | 1 |
2 | GA01 | 0 | B | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | GEOVIDE | Pourquoi pas? | 15/05/2014 - 30/06/2014 | ... | 9 | NaN | 9 | NaN | 9 | NaN | 9 | NaN | 9 | 1 |
3 | GA01 | 0 | B | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | GEOVIDE | Pourquoi pas? | 15/05/2014 - 30/06/2014 | ... | 9 | NaN | 9 | NaN | 9 | NaN | 9 | NaN | 9 | 1 |
4 | GA01 | 0 | B | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | GEOVIDE | Pourquoi pas? | 15/05/2014 - 30/06/2014 | ... | 9 | NaN | 9 | NaN | 9 | NaN | 9 | NaN | 9 | 1 |
5 rows × 1188 columns
We select the columns of interest and in particular the elements of interest, in our case radionuclides.
SelectColsOfInterestCB (common_coi, nuclides_pattern)
Select columns of interest.
common_coi = ['yyyy-mm-ddThh:mm:ss.sss', 'Longitude [degrees_east]',
'Latitude [degrees_north]', 'Bot. Depth [m]', 'DEPTH [m]', 'BODC Bottle Number:INTEGER']
nuclides_pattern = ['^TRITI', '^Th_228', '^Th_23[024]', '^Pa_231',
'^U_236_[DT]', '^Be_', '^Cs_137', '^Pb_210', '^Po_210',
'^Ra_22[3468]', 'Np_237', '^Pu_239_[D]', '^Pu_240', '^Pu_239_Pu_240',
'^I_129', '^Ac_227']
class SelectColsOfInterestCB(Callback):
    "Select columns of interest."
    def __init__(self, common_coi, nuclides_pattern): fc.store_attr()
    def __call__(self, tfm):
        nuc_of_interest = [c for c in tfm.df.columns if
                           any(re.match(pattern, c) for pattern in self.nuclides_pattern)]
        tfm.df = tfm.df[self.common_coi + nuc_of_interest]
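The selection logic can be tried in isolation on a toy dataframe. The following is a minimal sketch, assuming only pandas and the standard library; the toy frame reuses a few column names shown above, and the variable names (`toy`, `patterns`, `toy_sel`) are illustrative, not part of Marisco.

```python
import re
import pandas as pd

# Toy frame reusing a few Geotraces column names (values are made up)
toy = pd.DataFrame({
    'DEPTH [m]': [10.0, 20.0],
    'Cs_137_D_CONC_BOTTLE [uBq/kg]': [1.2, None],
    'Th_230_TP_CONC_PUMP [uBq/kg]': [None, 0.4],
    'Zn_CELL_CONC_BOTTLE [amol/cell]': [None, None],
})

patterns = ['^Cs_137', '^Th_23[024]']

# re.match only matches at the start of the string,
# so a pattern without '^' behaves the same way here
selected = [c for c in toy.columns
            if any(re.match(p, c) for p in patterns)]
toy_sel = toy[['DEPTH [m]'] + selected]
print(toy_sel.columns.tolist())
```

Columns that match none of the patterns (here the zinc column) are dropped, while the common columns are always kept.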
df_test shape: (105417, 86)
yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | DEPTH [m] | BODC Bottle Number:INTEGER | TRITIUM_D_CONC_BOTTLE [TU] | Cs_137_D_CONC_BOTTLE [uBq/kg] | I_129_D_CONC_BOTTLE [atoms/kg] | Np_237_D_CONC_BOTTLE [uBq/kg] | ... | Th_230_TP_CONC_PUMP [uBq/kg] | Th_230_SPT_CONC_PUMP [uBq/kg] | Th_230_LPT_CONC_PUMP [uBq/kg] | Th_232_TP_CONC_PUMP [pmol/kg] | Th_232_SPT_CONC_PUMP [pmol/kg] | Th_232_LPT_CONC_PUMP [pmol/kg] | Th_234_SPT_CONC_PUMP [mBq/kg] | Th_234_LPT_CONC_PUMP [mBq/kg] | Po_210_TP_CONC_UWAY [mBq/kg] | Pb_210_TP_CONC_UWAY [mBq/kg] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | 2957.1 | 1214048 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | 2957.2 | 1214039 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | 2957.2 | 1214027 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | 2957.2 | 1214018 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2014-05-17T22:29:00 | 349.29999 | 38.4329 | 4854.0 | 2957.2 | 1214036 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 86 columns
The BODC Bottle Number:INTEGER field uniquely identifies a sample, as shown below:
cols_measurements = [col for col in df_test.columns if col not in common_coi]
unique_key = ['BODC Bottle Number:INTEGER']
df_test.dropna(subset=cols_measurements, how='all', inplace=True);
print(f'df_test shape after dropping rows with no measurements: {df_test.shape}')
print(f'df_test duplicated keys: {df_test[unique_key].duplicated().sum()}')
df_test[df_test[unique_key].duplicated(keep=False)].sort_values(by=unique_key)
df_test shape after dropping rows with no measurements: (9389, 86)
df_test duplicated keys: 0
yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | DEPTH [m] | BODC Bottle Number:INTEGER | TRITIUM_D_CONC_BOTTLE [TU] | Cs_137_D_CONC_BOTTLE [uBq/kg] | I_129_D_CONC_BOTTLE [atoms/kg] | Np_237_D_CONC_BOTTLE [uBq/kg] | ... | Th_230_TP_CONC_PUMP [uBq/kg] | Th_230_SPT_CONC_PUMP [uBq/kg] | Th_230_LPT_CONC_PUMP [uBq/kg] | Th_232_TP_CONC_PUMP [pmol/kg] | Th_232_SPT_CONC_PUMP [pmol/kg] | Th_232_LPT_CONC_PUMP [pmol/kg] | Th_234_SPT_CONC_PUMP [mBq/kg] | Th_234_LPT_CONC_PUMP [mBq/kg] | Po_210_TP_CONC_UWAY [mBq/kg] | Pb_210_TP_CONC_UWAY [mBq/kg] |
---|
0 rows × 86 columns
We reshape the data from wide to long format so that we can extract the information embedded in the Geotraces nuclide names, such as sampling method, filtering status, and units.
WideToLongCB (common_coi, nuclides_pattern, var_name='NUCLIDE', value_name='VALUE')
Get Geotraces nuclide names as values not column names to extract contained information (unit, sampling method, …).
class WideToLongCB(Callback):
    """
    Get Geotraces nuclide names as values not column names
    to extract contained information (unit, sampling method, ...).
    """
    def __init__(self, common_coi, nuclides_pattern,
                 var_name='NUCLIDE', value_name='VALUE'):
        fc.store_attr()
    def __call__(self, tfm):
        nuc_of_interest = [c for c in tfm.df.columns if
                           any(re.match(pattern, c) for pattern in self.nuclides_pattern)]
        tfm.df = pd.melt(tfm.df, id_vars=self.common_coi, value_vars=nuc_of_interest,
                         var_name=self.var_name, value_name=self.value_name)
        tfm.df.dropna(subset=self.value_name, inplace=True)
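To make the reshaping concrete, here is a small standalone sketch of the same `pd.melt` plus `dropna` combination on made-up data. The frame `wide` and the `SMP_ID` key are illustrative assumptions.

```python
import pandas as pd

# Two samples, two nuclide columns; one measurement is missing
wide = pd.DataFrame({
    'SMP_ID': [1, 2],
    'Cs_137_D_CONC_BOTTLE [uBq/kg]': [1.2, float('nan')],
    'TRITIUM_D_CONC_BOTTLE [TU]': [0.7, 0.6],
})

value_cols = ['Cs_137_D_CONC_BOTTLE [uBq/kg]', 'TRITIUM_D_CONC_BOTTLE [TU]']

# Column names become values of NUCLIDE, cell contents become VALUE
long_df = pd.melt(wide, id_vars=['SMP_ID'], value_vars=value_cols,
                  var_name='NUCLIDE', value_name='VALUE')

# Keep only rows that actually carry a measurement
long_df = long_df.dropna(subset=['VALUE'])
print(long_df)
```

Each remaining row holds one measurement, which is why the long table has far fewer columns but can have more rows than the number of samples.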
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern)
])
df_test = tfm()
print(f'df_test shape: {df_test.shape}')
df_test.head()
df_test shape: (26745, 8)
yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | DEPTH [m] | BODC Bottle Number:INTEGER | NUCLIDE | VALUE | |
---|---|---|---|---|---|---|---|---|
9223 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | TRITIUM_D_CONC_BOTTLE [TU] | 0.733 |
9231 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | TRITIUM_D_CONC_BOTTLE [TU] | 0.696 |
9237 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | TRITIUM_D_CONC_BOTTLE [TU] | 0.718 |
9244 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | TRITIUM_D_CONC_BOTTLE [TU] | 0.709 |
9256 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | TRITIUM_D_CONC_BOTTLE [TU] | 0.692 |
Unit, Filtering status and Sampling method are extracted from the column names, where they are embedded in the Geotraces data source.
ExtractUnitCB (var_name='NUCLIDE')
Extract units from nuclide names.
class ExtractUnitCB(Callback):
    """
    Extract units from nuclide names.
    """
    def __init__(self, var_name='NUCLIDE'):
        fc.store_attr()
        self.unit_col_name = 'UNIT'
    def extract_unit(self, s):
        match = re.search(r'\[(.*?)\]', s)
        return match.group(1) if match else None
    def __call__(self, tfm):
        tfm.df[self.unit_col_name] = tfm.df[self.var_name].apply(self.extract_unit)
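The regular expression used for the unit can be checked on its own. This is a minimal sketch using the same pattern as above; the third name is a made-up example without brackets.

```python
import re

def extract_unit(name):
    # Non-greedy capture of whatever sits between the first '[' and ']'
    m = re.search(r'\[(.*?)\]', name)
    return m.group(1) if m else None

names = [
    'TRITIUM_D_CONC_BOTTLE [TU]',
    'Cs_137_D_CONC_BOTTLE [uBq/kg]',
    'NO_UNIT_HERE',  # made-up name with no bracketed unit
]
units = [extract_unit(n) for n in names]
print(units)
```

Names without a bracketed unit yield `None`, which later shows up as a missing value in the `UNIT` column.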
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB()
])
df_test = tfm()
df_test.head()
yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | DEPTH [m] | BODC Bottle Number:INTEGER | NUCLIDE | VALUE | UNIT | |
---|---|---|---|---|---|---|---|---|---|
9223 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | TRITIUM_D_CONC_BOTTLE [TU] | 0.733 | TU |
9231 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | TRITIUM_D_CONC_BOTTLE [TU] | 0.696 | TU |
9237 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | TRITIUM_D_CONC_BOTTLE [TU] | 0.718 | TU |
9244 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | TRITIUM_D_CONC_BOTTLE [TU] | 0.709 | TU |
9256 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | TRITIUM_D_CONC_BOTTLE [TU] | 0.692 | TU |
ExtractFilteringStatusCB (phase, var_name='NUCLIDE')
Extract filtering status from nuclide names.
class ExtractFilteringStatusCB(Callback):
    "Extract filtering status from nuclide names."
    def __init__(self, phase, var_name='NUCLIDE'):
        fc.store_attr()
        # self.filt_col_name = cdl_cfg()['vars']['suffixes']['filtered']['name']
        self.filt_col_name = 'FILT'
    def extract_filt_status(self, s):
        matched_string = self.match(s)
        return self.phase[matched_string.group(1)][self.filt_col_name] if matched_string else None
    def match(self, s):
        return re.search(r'_(' + '|'.join(self.phase.keys()) + ')_', s)
    def extract_group(self, s):
        matched_string = self.match(s)
        return self.phase[matched_string.group(1)]['group'] if matched_string else None
    def __call__(self, tfm):
        tfm.df[self.filt_col_name] = tfm.df[self.var_name].apply(self.extract_filt_status)
        tfm.df['GROUP'] = tfm.df[self.var_name].apply(self.extract_group)
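The `phase` lookup passed to this callback is not reproduced here. The sketch below assumes a hypothetical structure for it: only the `D` entry (filtering status 1, group `SEAWATER`) agrees with the output shown in this notebook, and the other codes and values are placeholders for illustration.

```python
import re

# Hypothetical lookup; only 'D' is consistent with the output in this notebook
phase = {
    'D':  {'FILT': 1, 'group': 'SEAWATER'},
    'T':  {'FILT': 2, 'group': 'SEAWATER'},          # assumed values
    'TP': {'FILT': 0, 'group': 'SUSPENDED_MATTER'},  # assumed values
}

# The phase code must be delimited by underscores on both sides
regex = re.compile(r'_(' + '|'.join(phase.keys()) + ')_')

def lookup(name, field):
    m = regex.search(name)
    return phase[m.group(1)][field] if m else None

print(lookup('TRITIUM_D_CONC_BOTTLE [TU]', 'group'))
print(lookup('Th_230_TP_CONC_PUMP [uBq/kg]', 'group'))
```

Because the trailing underscore is part of the pattern, `_TP_` is matched as `TP` rather than as `T`, even though `T` comes first in the alternation.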
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase)
])
df_test = tfm()
df_test.head()
yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | DEPTH [m] | BODC Bottle Number:INTEGER | NUCLIDE | VALUE | UNIT | FILT | GROUP | |
---|---|---|---|---|---|---|---|---|---|---|---|
9223 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | TRITIUM_D_CONC_BOTTLE [TU] | 0.733 | TU | 1 | SEAWATER |
9231 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | TRITIUM_D_CONC_BOTTLE [TU] | 0.696 | TU | 1 | SEAWATER |
9237 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | TRITIUM_D_CONC_BOTTLE [TU] | 0.718 | TU | 1 | SEAWATER |
9244 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | TRITIUM_D_CONC_BOTTLE [TU] | 0.709 | TU | 1 | SEAWATER |
9256 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | TRITIUM_D_CONC_BOTTLE [TU] | 0.692 | TU | 1 | SEAWATER |
ExtractSamplingMethodCB (smp_method:dict={'BOTTLE': 1, 'FISH': 18, 'PUMP': 14, 'UWAY': 24}, var_name='NUCLIDE', smp_method_col_name='SAMP_MET')
Extract sampling method from nuclide names.
Type | Default | Details | |
---|---|---|---|
smp_method | dict | {‘BOTTLE’: 1, ‘FISH’: 18, ‘PUMP’: 14, ‘UWAY’: 24} | Sampling method lookup table |
var_name | str | NUCLIDE | Column name containing nuclide names |
smp_method_col_name | str | SAMP_MET | Column name for sampling method in output df |
class ExtractSamplingMethodCB(Callback):
    "Extract sampling method from nuclide names."
    def __init__(self,
                 smp_method:dict = smp_method, # Sampling method lookup table
                 var_name='NUCLIDE', # Column name containing nuclide names
                 smp_method_col_name = 'SAMP_MET' # Column name for sampling method in output df
                 ):
        fc.store_attr()
    def extract_smp_method(self, s):
        match = re.search(r'_(' + '|'.join(self.smp_method.keys()) + ') ', s)
        return self.smp_method[match.group(1)] if match else None
    def __call__(self, tfm):
        tfm.df[self.smp_method_col_name] = tfm.df[self.var_name].apply(self.extract_smp_method)
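As a quick standalone check of the sampling-method extraction, the sketch below reuses the default lookup shown in the signature above and applies the same pattern to two column names from the dataset. The helper name mirrors the method above but is defined here as a plain function.

```python
import re

# Default lookup as shown in the callback signature
smp_method = {'BOTTLE': 1, 'FISH': 18, 'PUMP': 14, 'UWAY': 24}

def extract_smp_method(name):
    # The method keyword is the last token before the space preceding '[unit]'
    m = re.search(r'_(' + '|'.join(smp_method.keys()) + ') ', name)
    return smp_method[m.group(1)] if m else None

print(extract_smp_method('Po_210_TP_CONC_UWAY [mBq/kg]'))
print(extract_smp_method('Th_230_TP_CONC_PUMP [uBq/kg]'))
```

The trailing space in the pattern ties the match to the token immediately before the unit, so a keyword appearing elsewhere in the name would not be picked up.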
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method)
])
df_test = tfm()
df_test.head()
yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | DEPTH [m] | BODC Bottle Number:INTEGER | NUCLIDE | VALUE | UNIT | FILT | GROUP | SAMP_MET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
9223 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | TRITIUM_D_CONC_BOTTLE [TU] | 0.733 | TU | 1 | SEAWATER | 1 |
9231 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | TRITIUM_D_CONC_BOTTLE [TU] | 0.696 | TU | 1 | SEAWATER | 1 |
9237 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | TRITIUM_D_CONC_BOTTLE [TU] | 0.718 | TU | 1 | SEAWATER | 1 |
9244 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | TRITIUM_D_CONC_BOTTLE [TU] | 0.709 | TU | 1 | SEAWATER | 1 |
9256 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | TRITIUM_D_CONC_BOTTLE [TU] | 0.692 | TU | 1 | SEAWATER | 1 |
We normalize the nuclide names to MARIS standard for further lookup.
RenameNuclideCB (nuclides_name, var_name='NUCLIDE')
Remap nuclides name to MARIS standard.
class RenameNuclideCB(Callback):
    "Remap nuclides name to MARIS standard."
    def __init__(self, nuclides_name, var_name='NUCLIDE'):
        fc.store_attr()
        self.patterns = ['_D', '_T', '_TP', '_LPT', '_SPT']
    def extract_nuclide_name(self, s):
        match = re.search(r'(.*?)(' + '|'.join(self.patterns) + ')', s)
        return match.group(1) if match else None
    def standardize_name(self, s):
        s = self.extract_nuclide_name(s)
        return self.nuclides_name[s] if s in self.nuclides_name else s.lower().replace('_', '')
    def __call__(self, tfm):
        tfm.df[self.var_name] = tfm.df[self.var_name].apply(self.standardize_name)
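The `nuclides_name` lookup is not listed in this notebook. The sketch below assumes a two-entry subset of it, chosen so that the results agree with the standardized names visible in the output (`h3`, `pu239_240_tot`). The `None` guard is an addition made for this sketch only.

```python
import re

# Assumed subset of the lookup, consistent with names appearing in the output
nuclides_name = {'TRITIUM': 'h3', 'Pu_239_Pu_240': 'pu239_240_tot'}
suffixes = ['_D', '_T', '_TP', '_LPT', '_SPT']

def standardize_name(s):
    # Non-greedy prefix up to the first phase suffix
    m = re.search(r'(.*?)(' + '|'.join(suffixes) + ')', s)
    if m is None:
        return None  # guard added in this sketch
    base = m.group(1)
    # Explicit lookup first, otherwise lowercase and drop underscores
    return nuclides_name.get(base, base.lower().replace('_', ''))

for name in ['TRITIUM_D_CONC_BOTTLE [TU]',
             'Cs_137_D_CONC_BOTTLE [uBq/kg]',
             'Pu_239_Pu_240_D_CONC_BOTTLE [uBq/kg]']:
    print(name, '->', standardize_name(name))
```

Most names only need the generic lowercase rule; the lookup is reserved for cases where the MARIS name differs in more than formatting.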
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name)
])
df_test = tfm()
df_test.head()
yyyy-mm-ddThh:mm:ss.sss | Longitude [degrees_east] | Latitude [degrees_north] | Bot. Depth [m] | DEPTH [m] | BODC Bottle Number:INTEGER | NUCLIDE | VALUE | UNIT | FILT | GROUP | SAMP_MET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
9223 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | h3 | 0.733 | TU | 1 | SEAWATER | 1 |
9231 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | h3 | 0.696 | TU | 1 | SEAWATER | 1 |
9237 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | h3 | 0.718 | TU | 1 | SEAWATER | 1 |
9244 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | h3 | 0.709 | TU | 1 | SEAWATER | 1 |
9256 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | h3 | 0.692 | TU | 1 | SEAWATER | 1 |
array(['h3', 'cs137', 'i129', 'np237', 'pu239', 'pu239_240_tot', 'pu240',
'u236', 'pa231', 'pb210', 'po210', 'ra224', 'ra226', 'ra228',
'th230', 'th232', 'th234', 'ac227', 'be7', 'ra223', 'th228'],
dtype=object)
Note that several measurements are negative as shown below. Further clarification is needed.
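One way to list and count negative measurements in the long table is sketched below. The values are made up for illustration and do not come from the Geotraces file; only the `NUCLIDE` and `VALUE` column names follow the notebook.

```python
import pandas as pd

# Made-up values for illustration only
df_long = pd.DataFrame({
    'NUCLIDE': ['h3', 'cs137', 'cs137', 'th234'],
    'VALUE': [0.733, -0.02, 1.5, -0.4],
})

# Rows with a negative measurement, then a count per nuclide
negative = df_long[df_long['VALUE'] < 0]
neg_counts = negative.groupby('NUCLIDE').size()
print(neg_counts)
```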
The Geotraces data source uses its own unit labels. We need to remap them (and sometimes convert the values) to the MARIS standard.
StandardizeUnitCB (units_lut, unit_col_name='UNIT', var_name='VALUE')
Remap unit to MARIS standard ones and apply conversion where needed.
class StandardizeUnitCB(Callback):
    "Remap unit to MARIS standard ones and apply conversion where needed."
    def __init__(self,
                 units_lut,
                 unit_col_name='UNIT',
                 var_name='VALUE'):
        fc.store_attr()
        # self.unit_col_name = cdl_cfg()['vars']['suffixes']['unit']['name']
    def __call__(self, tfm):
        # Convert/rescale values
        tfm.df[self.var_name] *= tfm.df[self.unit_col_name].map(
            {k: v['factor'] for k, v in self.units_lut.items()})
        # Match MARIS unit id
        tfm.df[self.unit_col_name] = tfm.df[self.unit_col_name].map(
            {k: v['id'] for k, v in self.units_lut.items()})
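The `units_lut` table is not shown in this notebook. The sketch below assumes a hypothetical version of it: the factors are plain SI prefix conversions to Bq/kg, the id 7 for `TU` agrees with the output of this notebook, and the id 3 is a placeholder.

```python
import pandas as pd

# Hypothetical lookup: 'factor' rescales the value, 'id' is the target unit id
units_lut = {
    'uBq/kg': {'id': 3, 'factor': 1e-6},  # id is a placeholder
    'mBq/kg': {'id': 3, 'factor': 1e-3},  # id is a placeholder
    'TU':     {'id': 7, 'factor': 1},
}

df_units = pd.DataFrame({
    'VALUE': [500.0, 2.0, 0.7],
    'UNIT': ['uBq/kg', 'mBq/kg', 'TU'],
})

# Rescale values first, while UNIT still holds the source label
factors = {k: v['factor'] for k, v in units_lut.items()}
df_units['VALUE'] = df_units['VALUE'] * df_units['UNIT'].map(factors)

# Then replace the label by its id
ids = {k: v['id'] for k, v in units_lut.items()}
df_units['UNIT'] = df_units['UNIT'].map(ids)
print(df_units)
```

The order matters: once `UNIT` holds ids, the factor lookup by source label is no longer possible. A unit missing from the lookup maps to NaN for both the value and the id, so checking `isna()` after this step is a cheap safeguard.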
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut)
])
df_test = tfm()
print(f'df_test.UNIT.unique(): {df_test.UNIT.unique()}')
df_test.UNIT.unique(): [7 3 9]
We rename the common columns to MARIS standard names before NetCDF encoding.
RenameColumnCB (lut={'yyyy-mm-ddThh:mm:ss.sss': 'TIME', 'Longitude [degrees_east]': 'LON', 'Latitude [degrees_north]': 'LAT', 'DEPTH [m]': 'SMP_DEPTH', 'Bot. Depth [m]': 'TOT_DEPTH', 'BODC Bottle Number:INTEGER': 'SMP_ID'})
Renaming variables to MARIS standard names.
class RenameColumnCB(Callback):
    "Renaming variables to MARIS standard names."
    def __init__(self, lut=renaming_rules): fc.store_attr()
    def __call__(self, tfm):
        # lut = self.renaming_rules()
        new_col_names = [self.lut[name] if name in self.lut else name for name in tfm.df.columns]
        tfm.df.columns = new_col_names
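For reference, the same renaming can be expressed with pandas' built-in `DataFrame.rename`, which also leaves unlisted columns untouched. This is an alternative sketch, not the handler's code; the lookup is the default shown in the signature above and the empty toy frame is illustrative.

```python
import pandas as pd

renaming_rules = {
    'yyyy-mm-ddThh:mm:ss.sss': 'TIME',
    'Longitude [degrees_east]': 'LON',
    'Latitude [degrees_north]': 'LAT',
    'DEPTH [m]': 'SMP_DEPTH',
    'Bot. Depth [m]': 'TOT_DEPTH',
    'BODC Bottle Number:INTEGER': 'SMP_ID',
}

# Empty toy frame: only the column labels matter here
toy = pd.DataFrame(columns=['yyyy-mm-ddThh:mm:ss.sss', 'DEPTH [m]', 'NUCLIDE', 'VALUE'])

# Columns absent from the mapping (NUCLIDE, VALUE) keep their names
renamed = toy.rename(columns=renaming_rules)
print(renamed.columns.tolist())
```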
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules)
])
df_test = tfm()
df_test.head()
TIME | LON | LAT | TOT_DEPTH | SMP_DEPTH | SMP_ID | NUCLIDE | VALUE | UNIT | FILT | GROUP | SAMP_MET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
9223 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | h3 | 0.733 | 7 | 1 | SEAWATER | 1 |
9231 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | h3 | 0.696 | 7 | 1 | SEAWATER | 1 |
9237 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | h3 | 0.718 | 7 | 1 | SEAWATER | 1 |
9244 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | h3 | 0.709 | 7 | 1 | SEAWATER | 1 |
9256 | 2010-10-17T00:13:29 | 350.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | h3 | 0.692 | 7 | 1 | SEAWATER | 1 |
In Geotraces, longitudes are coded between 0 and 360. We rescale them to between -180 and 180 instead.
UnshiftLongitudeCB (lon_col_name='LON')
Longitudes are coded between 0 and 360 in Geotraces. We rescale it between -180 and 180 instead.
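The source of UnshiftLongitudeCB is not reproduced in this notebook. As an illustration only, the sketch below shows a common wrap-around convention for moving longitudes from the 0 to 360 range to the -180 to 180 range. `wrap_longitude` is a hypothetical helper, not a Marisco function, and it is a different operation from a uniform shift by 180 degrees.

```python
def wrap_longitude(lon):
    "Map a longitude in degrees east from [0, 360) to [-180, 180)."
    return ((lon + 180) % 360) - 180

# Values below 180 are unchanged; values from 180 upward become negative (west)
for lon in [0.0, 10.0, 180.0, 350.33792]:
    print(lon, '->', round(wrap_longitude(lon), 5))
```

Under this convention a point at 350.33792 degrees east maps to about -9.66208, that is roughly 9.66 degrees west.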
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB()
])
df_test = tfm()
df_test.head()
TIME | LON | LAT | TOT_DEPTH | SMP_DEPTH | SMP_ID | NUCLIDE | VALUE | UNIT | FILT | GROUP | SAMP_MET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
9223 | 2010-10-17T00:13:29 | 170.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | h3 | 0.733 | 7 | 1 | SEAWATER | 1 |
9231 | 2010-10-17T00:13:29 | 170.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | h3 | 0.696 | 7 | 1 | SEAWATER | 1 |
9237 | 2010-10-17T00:13:29 | 170.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | h3 | 0.718 | 7 | 1 | SEAWATER | 1 |
9244 | 2010-10-17T00:13:29 | 170.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | h3 | 0.709 | 7 | 1 | SEAWATER | 1 |
9256 | 2010-10-17T00:13:29 | 170.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | h3 | 0.692 | 7 | 1 | SEAWATER | 1 |
We dispatch each sample type (seawater, suspended matter, …) into a dedicated dataframe, since each sample type is later encoded as its own NetCDF group.
DispatchToGroupCB (group_name='GROUP')
Convert to a dictionary of dataframe with sample type (seawater,…) as keys.
class DispatchToGroupCB(Callback):
    "Convert to a dictionary of dataframe with sample type (seawater,...) as keys."
    def __init__(self, group_name='GROUP'):
        fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs = dict(tuple(tfm.df.groupby(self.group_name)))
        for key in tfm.dfs:
            tfm.dfs[key] = tfm.dfs[key].drop(self.group_name, axis=1)
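The dispatch step boils down to a `groupby` turned into a dictionary. A minimal standalone sketch on made-up data, assuming only pandas:

```python
import pandas as pd

df_groups = pd.DataFrame({
    'GROUP': ['SEAWATER', 'SUSPENDED_MATTER', 'SEAWATER'],
    'VALUE': [0.7, 0.1, 0.6],
})

# One dataframe per sample type; the GROUP column becomes redundant and is dropped
dfs = {name: grp.drop(columns='GROUP')
       for name, grp in df_groups.groupby('GROUP')}

print(list(dfs.keys()))
print(dfs['SEAWATER'])
```

Grouping by a single column label yields plain string keys, which is what allows access such as `dfs['SEAWATER']`.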
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB()
])
dfs_test = tfm()
print(f'dfs_test keys: {dfs_test.keys()}')
print(dfs_test['SEAWATER'].head())
dfs_test keys: dict_keys(['SEAWATER', 'SUSPENDED_MATTER'])
TIME LON LAT TOT_DEPTH SMP_DEPTH SMP_ID \
9223 2010-10-17T00:13:29 170.33792 38.3271 2827.0 17.8 842525
9231 2010-10-17T00:13:29 170.33792 38.3271 2827.0 34.7 842528
9237 2010-10-17T00:13:29 170.33792 38.3271 2827.0 67.5 842531
9244 2010-10-17T00:13:29 170.33792 38.3271 2827.0 91.9 842534
9256 2010-10-17T00:13:29 170.33792 38.3271 2827.0 136.6 842540
NUCLIDE VALUE UNIT FILT SAMP_MET
9223 h3 0.733 7 1 1
9231 h3 0.696 7 1 1
9237 h3 0.718 7 1 1
9244 h3 0.709 7 1 1
9256 h3 0.692 7 1 1
We parse the time column to datetime format.
ParseTimeCB ()
Base class for callbacks.
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB()
])
dfs_test = tfm()
print('time data type: ', dfs_test['SEAWATER'].TIME.dtype)
print(dfs_test['SEAWATER'].head())
time data type: datetime64[ns]
TIME LON LAT TOT_DEPTH SMP_DEPTH SMP_ID \
9223 2010-10-17 00:13:29 170.33792 38.3271 2827.0 17.8 842525
9231 2010-10-17 00:13:29 170.33792 38.3271 2827.0 34.7 842528
9237 2010-10-17 00:13:29 170.33792 38.3271 2827.0 67.5 842531
9244 2010-10-17 00:13:29 170.33792 38.3271 2827.0 91.9 842534
9256 2010-10-17 00:13:29 170.33792 38.3271 2827.0 136.6 842540
NUCLIDE VALUE UNIT FILT SAMP_MET
9223 h3 0.733 7 1 1
9231 h3 0.696 7 1 1
9237 h3 0.718 7 1 1
9244 h3 0.709 7 1 1
9256 h3 0.692 7 1 1
We then encode it as seconds since 1970-01-01, as specified in the MARIS NetCDF CDL and template.
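The ParseTimeCB and EncodeTimeCB sources are not listed here. The sketch below shows one way to perform both steps with pandas, under the assumption that the timestamps carry no timezone and are treated as UTC.

```python
import pandas as pd

times = pd.Series(['2010-10-17T00:13:29', '1970-01-01T00:00:00'])

# Step 1: parse ISO 8601 strings to datetimes
parsed = pd.to_datetime(times)

# Step 2: whole seconds elapsed since the Unix epoch
epoch = pd.Timestamp('1970-01-01')
seconds = (parsed - epoch) // pd.Timedelta(seconds=1)
print(seconds.tolist())  # [1287274409, 0]
```

The first value agrees with the encoded `TIME` shown in the notebook output for 2010-10-17T00:13:29.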
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB()
])
dfs_test = tfm()['SEAWATER']
dfs_test.head()
TIME | LON | LAT | TOT_DEPTH | SMP_DEPTH | SMP_ID | NUCLIDE | VALUE | UNIT | FILT | SAMP_MET | |
---|---|---|---|---|---|---|---|---|---|---|---|
9223 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | h3 | 0.733 | 7 | 1 | 1 |
9231 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | h3 | 0.696 | 7 | 1 | 1 |
9237 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | h3 | 0.718 | 7 | 1 | 1 |
9244 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | h3 | 0.709 | 7 | 1 | 1 |
9256 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | h3 | 0.692 | 7 | 1 | 1 |
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB()
])
dfs_test = tfm()
dfs_test['SEAWATER'].head()
TIME | LON | LAT | TOT_DEPTH | SMP_DEPTH | SMP_ID | NUCLIDE | VALUE | UNIT | FILT | SAMP_MET | |
---|---|---|---|---|---|---|---|---|---|---|---|
9223 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 17.8 | 842525 | h3 | 0.733 | 7 | 1 | 1 |
9231 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 34.7 | 842528 | h3 | 0.696 | 7 | 1 | 1 |
9237 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 67.5 | 842531 | h3 | 0.718 | 7 | 1 | 1 |
9244 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 91.9 | 842534 | h3 | 0.709 | 7 | 1 | 1 |
9256 | 1287274409 | 170.33792 | 38.3271 | 2827.0 | 136.6 | 842540 | h3 | 0.692 | 7 | 1 | 1 |
All MARIS lookup tables are embedded in the NetCDF file as enumeration types. The data itself is encoded as integers for the sake of space efficiency, so nuclide names need to be remapped to the corresponding MARIS nuclide ids.
{'NOT APPLICABLE': -1,
'NOT AVAILABLE': 0,
'h3': 1,
'be7': 2,
'c14': 3,
'k40': 4,
'cr51': 5,
'mn54': 6,
'co57': 7,
'co58': 8,
'co60': 9,
'zn65': 10,
'sr89': 11,
'sr90': 12,
'zr95': 13,
'nb95': 14,
'tc99': 15,
'ru103': 16,
'ru106': 17,
'rh106': 18,
'ag106m': 19,
'ag108': 20,
'ag108m': 21,
'ag110m': 22,
'sb124': 23,
'sb125': 24,
'te129m': 25,
'i129': 28,
'i131': 29,
'cs127': 30,
'cs134': 31,
'cs137': 33,
'ba140': 34,
'la140': 35,
'ce141': 36,
'ce144': 37,
'pm147': 38,
'eu154': 39,
'eu155': 40,
'pb210': 41,
'pb212': 42,
'pb214': 43,
'bi207': 44,
'bi211': 45,
'bi214': 46,
'po210': 47,
'rn220': 48,
'rn222': 49,
'ra223': 50,
'ra224': 51,
'ra225': 52,
'ra226': 53,
'ra228': 54,
'ac228': 55,
'th227': 56,
'th228': 57,
'th232': 59,
'th234': 60,
'pa234': 61,
'u234': 62,
'u235': 63,
'u238': 64,
'np237': 65,
'np239': 66,
'pu238': 67,
'pu239': 68,
'pu240': 69,
'pu241': 70,
'am240': 71,
'am241': 72,
'cm242': 73,
'cm243': 74,
'cm244': 75,
'cs134_137_tot': 76,
'pu239_240_tot': 77,
'pu239_240_iii_iv_tot': 78,
'pu239_240_v_vi_tot': 79,
'cm243_244_tot': 80,
'pu238_pu239_240_tot_ratio': 81,
'am241_pu239_240_tot_ratio': 82,
'cs137_134_ratio': 83,
'cd109': 84,
'eu152': 85,
'fe59': 86,
'gd153': 87,
'ir192': 88,
'pu238_240_tot': 89,
'rb86': 90,
'sc46': 91,
'sn113': 92,
'sn117m': 93,
'tl208': 94,
'mo99': 95,
'tc99m': 96,
'ru105': 97,
'te129': 98,
'te132': 99,
'i132': 100,
'i135': 101,
'cs136': 102,
'tbeta': 103,
'talpha': 104,
'i133': 105,
'th230': 106,
'pa231': 107,
'u236': 108,
'ag111': 109,
'in116m': 110,
'te123m': 111,
'sb127': 112,
'ba133': 113,
'ce139': 114,
'tl201': 116,
'hg203': 117,
'na22': 122,
'pa234m': 123,
'am243': 124,
'se75': 126,
'sr85': 127,
'y88': 128,
'ce140': 129,
'bi212': 130,
'u236_238_ratio': 131,
'i125': 132,
'ba137m': 133,
'u232': 134,
'pa233': 135,
'ru106_rh106_tot': 136,
'tu': 137,
'tbeta40k': 138,
'fe55': 139,
'ce144_pr144_tot': 140,
'pu240_pu239_ratio': 141,
'u233': 142,
'pu239_242_tot': 143,
'ac227': 144}
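Using a few entries of the lookup above, the remapping can be sketched with `Series.map`. This is an illustration and not the RemapCB implementation; the name `xx999` is a made-up value used to show what happens to names absent from the lookup.

```python
import pandas as pd

# A few entries copied from the lookup above
lut_subset = {'h3': 1, 'cs137': 33, 'i129': 28, 'pu239_240_tot': 77}

names = pd.Series(['h3', 'cs137', 'pu239_240_tot', 'xx999'])

# Names missing from the lookup become NaN
ids = names.map(lut_subset)
unmatched = names[ids.isna()]

print(ids.tolist())
print(unmatched.tolist())
```

Listing the unmatched names before encoding makes it easy to spot nuclides that still need a lookup entry.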
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
RemapCB(fn_lut=lut_nuclides, col_remap='NUCLIDE', col_src='NUCLIDE')
])
dfs_test = tfm()
dfs_test['SEAWATER'].NUCLIDE.unique()
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.
array([ 1, 33, 28, 65, 68, 77, 69, 108, 107, 41, 47, 51, 53,
54, 106, 59, 60, 144, 2, 50, 57])
df = pd.read_csv(fname_in)
tfm = Transformer(df, cbs=[
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(common_coi, nuclides_pattern),
ExtractUnitCB(),
ExtractFilteringStatusCB(phase),
ExtractSamplingMethodCB(smp_method),
RenameNuclideCB(nuclides_name),
StandardizeUnitCB(units_lut),
RenameColumnCB(renaming_rules),
UnshiftLongitudeCB(),
DispatchToGroupCB(),
ParseTimeCB(),
EncodeTimeCB(),
SanitizeLonLatCB(),
RemapCB(fn_lut=lut_nuclides, col_remap='NUCLIDE', col_src='NUCLIDE')
])
tfm();
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.
['Select columns of interest.',
'\n Get Geotraces nuclide names as values not column names \n to extract contained information (unit, sampling method, ...).\n ',
'\n Extract units from nuclide names.\n ',
'Extract filtering status from nuclide names.',
'Extract sampling method from nuclide names.',
'Remap nuclides name to MARIS standard.',
'Remap unit to MARIS standard ones and apply conversion where needed.',
'Renaming variables to MARIS standard names.',
'Longitudes are coded between 0 and 360 in Geotraces. We rescale it between -180 and 180 instead.',
'Convert to a dictionary of dataframe with sample type (seawater,...) as keys.',
'Encode time as seconds since epoch.',
'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
"Remap values from 'NUCLIDE' to 'NUCLIDE' for groups: dict_keys(['BIOTA', 'SEAWATER', 'SEDIMENT', 'SUSPENDED_MATTER'])."]
get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides', 'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments', 'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality > Ocean Contaminants', 'Earth Science > Biological Classification > Animals/Vertebrates > Fish', 'Earth Science > Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks', 'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans', 'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)'])
Retrieve global attributes from Geotraces dataset.
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from Geotraces dataset."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
    ])()
zotero_metadata = get_attrs(tfm, zotero_key=zotero_key, kw=kw)
print('Keys: ', zotero_metadata.keys())
print('Title: ', zotero_metadata['title'])
Keys: dict_keys(['geospatial_lat_min', 'geospatial_lat_max', 'geospatial_lon_min', 'geospatial_lon_max', 'geospatial_bounds', 'geospatial_vertical_max', 'geospatial_vertical_min', 'time_coverage_start', 'time_coverage_end', 'id', 'title', 'summary', 'creator_name', 'keywords', 'publisher_postprocess_logs'])
Title: The GEOTRACES Intermediate Data Product 2017
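Among the keys printed above are the geographic extent attributes. How BboxCB computes them is not shown in this notebook. As a rough sketch under that caveat, the extent over all group dataframes could be derived as follows, where `bbox_attrs` is a hypothetical stand-in and the coordinates are made up.

```python
import pandas as pd

def bbox_attrs(dfs):
    "Hypothetical helper: lat/lon extent over all group dataframes."
    lat = pd.concat([d['LAT'] for d in dfs.values()])
    lon = pd.concat([d['LON'] for d in dfs.values()])
    return {
        'geospatial_lat_min': float(lat.min()),
        'geospatial_lat_max': float(lat.max()),
        'geospatial_lon_min': float(lon.min()),
        'geospatial_lon_max': float(lon.max()),
    }

# Made-up coordinates for two groups
dfs_toy = {
    'SEAWATER': pd.DataFrame({'LAT': [38.3, 40.1], 'LON': [-9.7, -12.0]}),
    'SUSPENDED_MATTER': pd.DataFrame({'LAT': [35.0], 'LON': [-8.0]}),
}
attrs = bbox_attrs(dfs_toy)
print(attrs)
```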
encode (fname_in, fname_out, **kwargs)
Group BIOTA not found in the dataframes.
Group SEDIMENT not found in the dataframes.