The JOIS datasets comprise four years (2021-2024) of seawater radionuclide measurements from the Beaufort Sea, collected as part of the Joint Ocean Ice Study (JOIS) expeditions aboard the CCGS Louis S. St-Laurent. Samples were analysed at ETH Zurich / LIP (Nuria Casacuberta’s group). The raw data are published as Zenodo archives, each containing a single Excel file with CTD (conductivity, temperature, depth) metadata and activity concentrations in wide format.
Nuclide coverage varies by year:

Year   I-129   U-236   U-238   U-236/U-238
2021   ✓       -       -       -
2022   ✓       ✓       ✓       ✓
2023   ✓       ✓       ✓       ✓
2024   ✓       -       -       -
The column layout is similar enough across years that the same pipeline handles all of them, with a few normalisation steps to absorb the differences.
Exported source
RECORDS = {
    2021: {'url': 'https://zenodo.org/records/18880401/files/annabel-payne/BGOS-JOIS-2021-v1.0.1.zip?download=1'},
    2022: {'url': 'https://zenodo.org/records/18880777/files/annabel-payne/BGOS-JOIS-2022-v1.0.1.zip?download=1'},
    2023: {'url': 'https://zenodo.org/records/18880591/files/annabel-payne/BGOS-JOIS-2023-v1.0.zip?download=1'},
    2024: {'url': 'https://zenodo.org/records/18880497/files/annabel-payne/BGOS-JOIS-2024-v1.0.1.zip?download=1'},
}
fname_out = 'JOIS_Beaufort_Sea.nc'
src_dir = None  # remote-only, no local files
Raw data format
All four JOIS ZIP archives contain a single Excel file with CTD metadata columns and activity concentrations in wide format: each nuclide-unit pair gets its own column (e.g. I129_at_kg, I129_at_l), and each value column has a matching unc_ column for uncertainty. In some years, value column headers carry a ( x 10^N) suffix, indicating that the stored values are unscaled and must be multiplied by 10^N. The 2021 file has a stray ( x 10^6) column header that should read Cruise.
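Locating the single Excel file inside a downloaded archive can be sketched with the standard library alone. The helper name `find_excel_member` is hypothetical and not part of the handler; the sketch assumes the workbook carries an `.xlsx` or `.xls` extension, and it builds a tiny in-memory archive with placeholder members instead of downloading anything from Zenodo.

```python
import io
import zipfile


def find_excel_member(zip_bytes: bytes) -> str:
    """Return the name of the single Excel file inside a ZIP archive.

    Hypothetical helper: raises ValueError unless exactly one
    .xlsx/.xls member is present.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = [n for n in zf.namelist()
                 if n.lower().endswith(('.xlsx', '.xls'))]
    if len(names) != 1:
        raise ValueError(f"expected exactly one Excel file, found {len(names)}")
    return names[0]


def make_demo_zip() -> bytes:
    """Build a small in-memory ZIP with placeholder members (illustrative names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('README.md', 'placeholder')
        zf.writestr('data/demo.xlsx', b'not a real workbook')
    return buf.getvalue()


print(find_excel_member(make_demo_zip()))  # data/demo.xlsx
```

The returned member name could then be opened with `ZipFile.open` and handed to an Excel reader such as `pandas.read_excel`.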
load_data handles all four years by:
- Normalising column names (stripping scale-factor suffixes)
- Extracting and applying scale factors to unscaled values
- Detecting the 2023 U-236 at_kg columns that are missing their suffix but are in the same scale as their at_l counterparts
- Concatenating all years into a single DataFrame
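The first two steps can be illustrated with a dependency-free sketch. It is not the implementation of load_data or norm_cols: the names `SCALE_RE`, `split_scale_suffix` and `apply_scales` are hypothetical, the regular expression is an assumption about how the suffixes are spelled, and the sample values are invented.

```python
import re

# Matches a trailing "( x 10^7)" or "(x 10^6)" style suffix (assumed spelling).
SCALE_RE = re.compile(r'\s*\(\s*x\s*10\^(\d+)\s*\)\s*$')


def split_scale_suffix(col: str) -> tuple[str, float]:
    """Split a header into (clean_name, scale_factor).

    Headers without a suffix get a factor of 1.0.
    """
    m = SCALE_RE.search(col)
    if m is None:
        return col.strip(), 1.0
    return col[:m.start()].strip(), 10.0 ** int(m.group(1))


def apply_scales(row: dict) -> dict:
    """Rename keys to their clean names and multiply suffixed values by 10^N."""
    out = {}
    for col, value in row.items():
        name, factor = split_scale_suffix(col)
        scaled = value * factor if factor != 1.0 and value is not None else value
        out[name] = scaled
    return out


# Illustrative values only, not taken from the JOIS files.
raw = {'I129_at_kg ( x 10^7)': 2.5, 'Depth_m': 10.0}
print(apply_scales(raw))  # {'I129_at_kg': 25000000.0, 'Depth_m': 10.0}
```

Parsing the factor from the header, rather than hard-coding it per year, keeps the per-year differences in one place.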
Important: FEEDBACK TO DATA PROVIDER
The ( x 10^N) suffix in column headers signals that the stored values are unscaled and must be multiplied by 10^N to obtain the true value (atoms/kg or atoms/l depending on the column). This convention is applied inconsistently across years and columns:
I-129: The 2021-2022 files use I129_at_kg ( x 10^7), where values are unscaled raw numbers (suffix present). The 2023-2024 files use I129_at_kg with the scale already applied and no suffix. The same physical quantity therefore requires different handling per year.
U-236 in 2023: U236_at_l correctly carries ( x 10^6) (values unscaled), but U236_at_kg is missing its suffix even though its values are also unscaled. This is confirmed by the ratio between the at/l and at/kg values, which matches the expected seawater density (~1.025 kg/l, i.e. ~1025 kg/m³) only if both columns share the same scale.
The provider should clarify the intended convention and ideally publish all values with scales pre-applied, removing the need for per-year suffix parsing.
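The density argument can be turned into a simple check. The sketch below is only an illustration of the idea, not the detection logic used in load_data: `share_scale` and `DENSITY_RANGE` are hypothetical names, the accepted density range is an assumed tolerance, and the numbers are invented.

```python
from statistics import median

# Assumed plausible range for seawater density in kg/l (illustrative tolerance).
DENSITY_RANGE = (1.00, 1.05)


def share_scale(at_kg: list[float], at_l: list[float],
                density_range: tuple[float, float] = DENSITY_RANGE) -> bool:
    """True if the median at_l/at_kg ratio looks like a seawater density."""
    ratios = [per_l / per_kg for per_kg, per_l in zip(at_kg, at_l) if per_kg]
    low, high = density_range
    return low <= median(ratios) <= high


# Illustrative numbers only, not taken from the JOIS files.
kg_vals = [10.0, 12.0, 8.0]
l_vals = [v * 1.025 for v in kg_vals]        # same scale as kg_vals
l_vals_scaled = [v * 1e6 for v in l_vals]    # at_l with 10^6 already applied

print(share_scale(kg_vals, l_vals))          # True
print(share_scale(kg_vals, l_vals_scaled))   # False
```

When the check returns True for raw columns, the scale factor parsed from the at_l header can reasonably be applied to the at_kg column as well.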
def norm_cols(
    cols, # Column names to normalise
) -> list:
Normalise column names: strip scale-factor suffixes like ( x 10^7) or (x 10^6).
test_eq(norm_cols(['I129_at_kg ( x 10^7)', 'U236_at_l']), ['I129_at_kg', 'U236_at_l'])
test_eq(norm_cols(['Stn', 'Depth_m']), ['Stn', 'Depth_m'])
print("norm_cols: no-change case and suffix-stripping case pass. ✓")
norm_cols: no-change case and suffix-stripping case pass. ✓
The raw JOIS columns encode nuclide, unit, and sometimes method in a single string (e.g. I129_at_kg), with a couple of exceptions (U238_ppb, U236_U238). We map CTD column names to MARIS uppercase standards, combine separate date/time string columns, and handle the two special-case columns.
We address these in three callbacks. All run on the SEAWATER group only, since JOIS has no biota, sediment, or suspended matter.
Important: FEEDBACK TO DATA PROVIDER
Column headers in the JOIS and GEOTRACES datasets encode multiple pieces of information (nuclide, unit, measurement method) into a single string. This adds friction to data ingestion, since every new dataset with a different naming convention requires custom parsing logic. MARIS prefers a tidy data layout (Wickham 2014, doi:10.18637/jss.v059.i10) where each column holds a single variable, and metadata like nuclide, unit, and method are stored as separate columns, not baked into the header.
RenameNucColsCB: columns renamed correctly on mock data. ✓
tfm = Transformer(dfs, cbs=[RenameNucColsCB()])
tfm()
print("Columns after RenameNucColsCB:", [c for c in tfm.dfs['SEAWATER'].columns if '238' in c or '236' in c])
Most JOIS concentration columns follow a {Nuclide}_{Unit} pattern (I129_at_kg, I129_at_l, U236_at_kg). Two columns do not:
U238_ppb carries a mass-fraction unit (parts-per-billion) and has no _at_ separator, so a split on _at_ would not yield a clean NUCLIDE and UNIT pair. We rename it to U238_at_ppb so the split is clean.
U236_U238 uses an underscore as a separator between two nuclide names, not between nuclide and unit. We rename it to U236_U238_at_ratio.
The downstream MeltJOISCB splits every value column on _at_ to derive NUCLIDE and UNIT. These renames are done in anticipation of that step, ensuring every column meets the {Nuclide}_{Unit} contract so the melt can proceed consistently.
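The two renames and the split on _at_ can be sketched as follows. This is an illustration rather than the code of RenameNucColsCB or MeltJOISCB: `RENAMES` and `split_nuclide_unit` are hypothetical names, and keeping the `at_` prefix on the unit is an assumption made so that the result matches unit labels such as at_kg.

```python
# Renames that bring the two exceptions in line with the other columns.
RENAMES = {'U238_ppb': 'U238_at_ppb', 'U236_U238': 'U236_U238_at_ratio'}


def split_nuclide_unit(col: str) -> tuple[str, str]:
    """Split 'I129_at_kg' into ('I129', 'at_kg'), keeping 'at_' on the unit."""
    nuclide, sep, unit = col.partition('_at_')
    if not sep:
        raise ValueError(f"no '_at_' separator in {col!r}")
    return nuclide, 'at_' + unit


raw_cols = ['I129_at_kg', 'U236_at_l', 'U238_ppb', 'U236_U238']
for col in raw_cols:
    renamed = RENAMES.get(col, col)
    print(col, '->', split_nuclide_unit(renamed))
```

Because `partition` splits on the first `_at_` only, the underscore inside U236_U238 stays part of the nuclide name.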
def ParseDateTimeCB(
    col_date:str='Date', # Source date column name
    col_time:str='Time', # Source time column name
):
Combine JOIS Date and Time columns into a single TIME column.
# Verify ParseDateTimeCB combines Date and Time into TIME
dfs_mock = {'SEAWATER': pd.DataFrame({'Date': ['2021-08-19'], 'Time': ['08:00:00']})}
tfm = Transformer(dfs_mock, cbs=[ParseDateTimeCB()])
tfm()
test_eq('TIME' in tfm.dfs['SEAWATER'].columns, True)
test_eq('Date' not in tfm.dfs['SEAWATER'].columns, True)
print(f"ParseDateTimeCB: TIME = {tfm.dfs['SEAWATER']['TIME'].iloc[0]}. ✓")
ParseDateTimeCB: TIME = 2021-08-19 08:00:00. ✓
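For a single record, the combination of date and time strings amounts to the following standard-library sketch. The function name `combine_date_time` is hypothetical, and the two format strings are assumptions based on the mock data used in the test; the callback itself operates on DataFrame columns.

```python
from datetime import datetime


def combine_date_time(date_str: str, time_str: str) -> datetime:
    """Join 'YYYY-MM-DD' and 'HH:MM:SS' strings into one datetime.

    The format strings are assumptions based on the mock data.
    """
    return datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")


ts = combine_date_time('2021-08-19', '08:00:00')
print(ts)  # 2021-08-19 08:00:00
```

On a DataFrame, the same effect is typically obtained by concatenating the two string columns and passing the result to `pd.to_datetime`.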
tfm = Transformer(dfs, cbs=[RenameNucColsCB(), RenameColsCB(), ParseDateTimeCB()])
tfm()
print("First 3 TIME values:\n", tfm.dfs['SEAWATER']['TIME'].head(3).to_string())
First 3 TIME values:
0 2021-08-28 13:29:58
1 2021-08-28 13:29:58
2 2021-08-28 13:29:58
Note for MARIS DB team
The JOIS datasets include a Pressure_dbar column with CTD pressure values. This parameter is currently not present in the MARIS output schema. If useful for the database, it could be added alongside the other CTD metadata fields. The extra CTD columns (Conservative_Temp, Potential_Temp, Absolute_Salinity, Sigma0, Insitu_Density, unc_129_pct) and the Cruise column are not mapped in the NC_CSV dict (the central remapping from internal column names to NetCDF/CSV output names), so they are dropped after the melt. If the data team decides these fields are useful, the fix should go in NC_CSV, not in this handler.
Reshape wide to long
The raw JOIS data uses wide format: each sample has one row, and nuclide-unit concentrations are spread across separate columns (I129_at_kg, I129_at_l, U236_at_kg, etc.). MARIS requires long format (one row per measurement) with columns for NUCLIDE, UNIT, VALUE, and UNC.
MeltJOISCB melts the value columns, merges the matching uncertainty columns, then splits each column name on _at_ to derive the nuclide name and unit. Rows where VALUE is NaN are dropped.
def MeltJOISCB(
    meta_cols, # Columns to keep as identifiers
    val_cols, # Value columns to melt
    val_name:str='VALUE', # Name of melted value column
    unc_name:str='UNC', # Name of uncertainty column
):
Reshape JOIS wide nuclide columns to long format with NUCLIDE, UNIT, VALUE, UNC columns.
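The reshape can be illustrated without pandas on a list of dict rows. This is a sketch of the idea only, not the body of MeltJOISCB: `melt_rows` is a hypothetical name, the `unc_` prefix pairing follows the raw-data description, and the sample values are invented.

```python
def melt_rows(rows, meta_cols, val_cols):
    """Reshape wide dict rows to long rows with NUCLIDE, UNIT, VALUE, UNC keys.

    Assumes each value column 'X' has a matching 'unc_X' column and that
    column names split cleanly on '_at_'.
    """
    long_rows = []
    for row in rows:
        meta = {k: row[k] for k in meta_cols}
        for col in val_cols:
            value = row.get(col)
            if value is None or value != value:  # skip None and NaN
                continue
            nuclide, _, unit = col.partition('_at_')
            long_rows.append({**meta, 'NUCLIDE': nuclide, 'UNIT': 'at_' + unit,
                              'VALUE': value, 'UNC': row.get('unc_' + col)})
    return long_rows


# Illustrative values only, not taken from the JOIS files.
wide = [{'STATION': 'A', 'SMP_DEPTH': 5.0,
         'I129_at_kg': 1.0e9, 'unc_I129_at_kg': 3.0e7,
         'U236_at_kg': float('nan'), 'unc_U236_at_kg': float('nan')}]

for r in melt_rows(wide, ['STATION', 'SMP_DEPTH'], ['I129_at_kg', 'U236_at_kg']):
    print(r)
```

With pandas, the corresponding building block is `DataFrame.melt(id_vars=..., value_vars=...)`, applied once to the value columns and once to the uncertainty columns before merging the two results.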
Exported source
# Columns kept as identifiers during the wide-to-long reshape
META_COLS = ['Cruise', 'STATION', 'SMP_ID_PROVIDER', 'LAT', 'LON',
             'TIME', 'Pressure_dbar', 'SMP_DEPTH', 'TEMP', 'SAL']

# Columns to melt into VALUE/UNC long format
VAL_COLS = ['I129_at_kg', 'I129_at_l', 'U236_at_kg', 'U236_at_l',
            'U238_at_ppb', 'U236_U238_at_ratio']
Uncertainty values in 2024 are three orders of magnitude lower than in 2021-2023 (relative uncertainty ~0.0002% versus ~3-5%), with no documented change in analytical method or instrumentation. All three uncertainty representations (unc_I129_at_kg, unc_I129_at_l, unc_129_pct) are internally consistent within each year. The provider should confirm whether this reflects a genuine precision improvement or a data reporting issue.
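A consistency check between an absolute uncertainty and a reported percentage could look like the sketch below. The function names and the 1% relative tolerance are assumptions, and the numbers are invented for illustration rather than taken from the JOIS files.

```python
def rel_unc_pct(value: float, unc: float) -> float:
    """Relative uncertainty in percent."""
    return 100.0 * unc / value


def consistent(value: float, unc: float, unc_pct: float, rtol: float = 0.01) -> bool:
    """True if the reported percentage agrees with unc/value within rtol."""
    return abs(rel_unc_pct(value, unc) - unc_pct) <= rtol * abs(unc_pct)


# Illustrative numbers only, not taken from the JOIS files.
print(rel_unc_pct(1.0e9, 4.0e7))          # 4.0
print(consistent(1.0e9, 4.0e7, 4.0))      # True
print(consistent(1.0e9, 4.0e7, 0.0002))   # False
```

A check of this kind shows whether the representations agree with each other within a year; it cannot tell whether the reported precision itself is realistic.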
Convert U-238 units
JOIS reports U-238 in parts-per-billion (ppb, mass of U per mass of seawater), while MARIS requires atoms per kg. Since 1 ppb corresponds to 1 µg of U per kg of seawater (\(10^{-6}\ \text{g/kg}\)), ConvertU238CB converts the VALUE column for U-238 rows using:
\[ C_{\text{at/kg}} = C_{\text{ppb}} \times 10^{-6}\ \frac{\text{g}}{\text{kg}} \times \frac{N_A}{M_{238}} \]
where \(N_A = 6.02214076 \times 10^{23}\ \text{mol}^{-1}\) is Avogadro’s number and \(M_{238} = 238.05\ \text{g/mol}\) is the molar mass of U-238, giving a conversion factor of \(2.530 \times 10^{15}\ \text{atoms/kg per ppb}\). Rows with other nuclides are left unchanged.
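As a cross-check of the conversion factor, the arithmetic can be worked through directly. The sketch assumes that 1 ppb means 1 µg of U per kg of seawater, i.e. \(10^{-6}\) g/kg; under that assumption the factor comes out at about \(2.53 \times 10^{15}\) atoms/kg per ppb, whereas a prefactor of \(10^{-9}\) would give atoms per gram of seawater instead. The constant and function names are hypothetical and this is not the code of ConvertU238CB.

```python
N_A = 6.02214076e23      # Avogadro constant, 1/mol
M_U238 = 238.05          # molar mass of U-238 in g/mol (value from the text)
G_PER_KG_PER_PPB = 1e-6  # assumption: 1 ppb = 1 µg U per kg seawater


def ppb_to_atoms_per_kg(ppb: float) -> float:
    """Convert a U-238 mass fraction in ppb to atoms per kg of seawater."""
    return ppb * G_PER_KG_PER_PPB * N_A / M_U238


print(f"{ppb_to_atoms_per_kg(1.0):.3e}")  # 2.530e+15
```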
ConvertU238CB: I-129 unchanged, U-238 scaled, UNIT updated. ✓
tfm = Transformer(dfs, cbs=[
    RenameNucColsCB(),
    RenameColsCB(),
    ParseDateTimeCB(),
    MeltJOISCB(META_COLS, VAL_COLS),
    ConvertU238CB(),
])
tfm()
out = tfm.dfs['SEAWATER']
u238 = out[out['NUCLIDE'] == 'U238']
print(f"U-238 rows: {len(u238)}, mean VALUE = {u238['VALUE'].mean():.2e} atoms/kg")
i129 = out[out['NUCLIDE'] == 'I129']
print(f"I-129 rows: {len(i129)}, mean VALUE = {i129['VALUE'].mean():.2e} atoms/kg")
U-238 rows: 286, mean VALUE = 7.70e+12 atoms/kg
I-129 rows: 882, mean VALUE = 1.11e+09 atoms/kg
Remap nomenclatures to MARIS identifiers
The melt produces two string columns, NUCLIDE (I129, U236, U238, U236_U238) and UNIT (at_kg, at_l, at_ppb, at_ratio); the LAB and AREA columns are still missing at this point. MARIS stores all of these as integer foreign-key IDs from the central nomenclatures.
RemapCB maps source column values through a lookup table to a target column. For constants like LAB (ETH Zurich/LIP, ID 345) and AREA (Beaufort Sea, ID 4256), an empty LUT with a default_val injects the same value for every row.
Exported source
# MARIS nuclide IDs confirmed via get_lut('NUCLIDE')
NUCLIDE_LUT = {'I129': 28, 'U236': 108, 'U238': 64, 'U236_U238': 131}

# MARIS unit IDs confirmed via get_lut('UNIT')
UNIT_LUT = {'at_kg': 9, 'at_l': 12, 'at_ratio': 6}
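The lookup-with-default behaviour can be sketched in a few lines. The function `remap` below is a hypothetical stand-in and does not reproduce the signature of RemapCB; it only shows how an empty lookup table combined with a default value yields a constant column.

```python
NUCLIDE_LUT = {'I129': 28, 'U236': 108, 'U238': 64, 'U236_U238': 131}


def remap(values, lut, default_val=None):
    """Map each value through lut, falling back to default_val when absent."""
    return [lut.get(v, default_val) for v in values]


nuclides = ['I129', 'U238', 'U236_U238']
print(remap(nuclides, NUCLIDE_LUT))           # [28, 64, 131]
# Empty LUT plus a default: every row receives the same ID (LAB = 345 here).
print(remap(nuclides, {}, default_val=345))   # [345, 345, 345]
```

Values missing from a non-empty lookup table come back as the default too, so unmapped entries are worth checking for before encoding.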
The encoder wraps the full pipeline and writes the standardised data to a NetCDF4 file. Global attributes are assembled via GlobAttrsFeeder with BboxCB, DepthRangeCB, TimeRangeCB, plus keywords and processing logs.
We do not yet have an INIS entry for the JOIS datasets, so INISCB is commented out. The IAEA INIS repository will be used for bibliographic metadata once the record is created. A placeholder line is included for future use.