Joint Ocean Ice Study

The JOIS datasets comprise four years (2021-2024) of seawater radionuclide measurements from the Beaufort Sea, collected as part of the Joint Ocean Ice Study (JOIS) expeditions aboard the CCGS Louis S. St-Laurent. Samples were analysed at ETH Zurich / LIP (Nuria Casacuberta’s group). The raw data are published as Zenodo archives, each containing a single Excel file with CTD (conductivity, temperature, depth) metadata and activity concentrations in wide format.

Nuclide coverage varies by year:

Year	I-129	U-236	U-238	U-236/U-238
2021	✓
2022	✓	✓	✓	✓
2023	✓	✓	✓	✓
2024	✓

The column layout is similar enough across years that the same pipeline handles all of them, with a few normalisation steps to absorb the differences.

Exported source

RECORDS = {
    2021: {'url': 'https://zenodo.org/records/18880401/files/annabel-payne/BGOS-JOIS-2021-v1.0.1.zip?download=1'},
    2022: {'url': 'https://zenodo.org/records/18880777/files/annabel-payne/BGOS-JOIS-2022-v1.0.1.zip?download=1'},
    2023: {'url': 'https://zenodo.org/records/18880591/files/annabel-payne/BGOS-JOIS-2023-v1.0.zip?download=1'},
    2024: {'url': 'https://zenodo.org/records/18880497/files/annabel-payne/BGOS-JOIS-2024-v1.0.1.zip?download=1'},
}
fname_out = 'JOIS_Beaufort_Sea.nc'
src_dir   = None  # remote-only, no local files

Raw data format

Each of the four JOIS ZIP archives contains a single Excel file with CTD metadata columns and activity concentrations in wide format: each nuclide-unit pair gets its own column (e.g. I129_at_kg, I129_at_l), and each value column has a matching unc_ column for its uncertainty. In some years, value columns carry a ( x 10^N) suffix, meaning the stored numbers are unscaled and must be multiplied by 10^N. The 2021 file also has a stray ( x 10^6) column header that should read Cruise.

load_data handles all four years by:

- Normalising column names (stripping scale-factor suffixes)
- Extracting and applying scale factors to unscaled values
- Detecting the 2023 U-236 at_kg columns that are missing their suffix but are in the same scale as their at_l counterparts
- Concatenating all years into a single DataFrame
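The steps listed above can be sketched on toy data as follows. This is a minimal illustration, not the handler's actual code: the regular expression and the helper names strip_suffix, find_scales and load_year are assumptions made for this example, and the two toy frames only mimic the suffix conventions described in this section.

```python
import re
import pandas as pd

# Assumed pattern for suffixes such as " ( x 10^7)" or "(x10^6)"
SCALE_RE = re.compile(r"\s*\(\s*x\s*10\^(\d+)\s*\)\s*$")

def strip_suffix(col):
    "Return the column name without a trailing scale-factor suffix."
    return SCALE_RE.sub("", col)

def find_scales(cols):
    "Return {clean_name: 10**N} for suffixed columns, skipping empty names."
    out = {}
    for col in cols:
        m = SCALE_RE.search(col)
        name = strip_suffix(col)
        if m and name:
            out[name] = 10 ** int(m.group(1))
    return out

def load_year(df):
    "Normalise one year's frame: rename columns, then apply scale factors."
    scales = find_scales(df.columns)
    df = df.rename(columns=strip_suffix)
    for col, factor in scales.items():
        df[col] = df[col] * factor
    return df

# Toy frames: one year with an unscaled, suffixed column, one already scaled
y_suffixed = pd.DataFrame({"Stn": ["A"], "I129_at_kg ( x 10^7)": [64.0]})
y_scaled = pd.DataFrame({"Stn": ["B"], "I129_at_kg": [1.3e9]})
combined = pd.concat([load_year(y_suffixed), load_year(y_scaled)],
                     ignore_index=True)
```

After concatenation both rows share one I129_at_kg column in the same scale, which is the property the later callbacks rely on.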

FEEDBACK TO DATA PROVIDER

The ( x 10^N) suffix in column headers signals that the stored values are unscaled and must be multiplied by 10^N to obtain the true value (atoms/kg or atoms/l depending on the column). This convention is applied inconsistently across years and columns:

I-129: The 2021-2022 files use I129_at_kg ( x 10^7), where values are unscaled raw numbers (suffix present). The 2023-2024 files use I129_at_kg with the scale already applied and no suffix. The same physical quantity therefore requires different handling per year.
U-236 in 2023: U236_at_l correctly carries ( x 10^6) (values unscaled), but U236_at_kg is missing its suffix even though its values are also unscaled. This is confirmed by the ratio between the at/l and at/kg values, which matches the expected seawater density (~1.025 kg/l, i.e. ~1025 kg/m³) only if both columns are in the same scale.

The provider should clarify the intended convention and ideally publish all values with scales pre-applied, removing the need for per-year suffix parsing.
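The density argument used for the 2023 U-236 columns can be turned into a simple check. The sketch below is only an illustration: the helper name ratio_looks_like_density, its bounds, and the toy numbers are assumptions, not part of the handler.

```python
import pandas as pd

def ratio_looks_like_density(per_l, per_kg, lo=1.0, hi=1.05):
    "True when median(per_l / per_kg) is a plausible seawater density in kg/l."
    ratio = (per_l / per_kg).median()
    return bool(lo <= ratio <= hi)

# Toy values, both columns stored in the same (unscaled) form
toy = pd.DataFrame({"U236_at_kg": [10.0, 20.0], "U236_at_l": [10.27, 20.52]})
same_scale = ratio_looks_like_density(toy["U236_at_l"], toy["U236_at_kg"])

# If only the at_l column had its 10^6 factor applied, the ratio would be
# off by six orders of magnitude and the check would fail
mismatched = ratio_looks_like_density(toy["U236_at_l"] * 1e6, toy["U236_at_kg"])
```

A ratio near 1.03 indicates that both columns share a scale, so the factor parsed from the at_l header can be reused for the at_kg column.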


norm_cols


def norm_cols(
    cols, # Column names to normalise
)->list:

Normalise column names: strip scale-factor suffixes like ( x 10^7) or (x 10^6).

test_eq(norm_cols(['I129_at_kg ( x 10^7)', 'U236_at_l']),
        ['I129_at_kg', 'U236_at_l'])
test_eq(norm_cols(['Stn', 'Depth_m']), ['Stn', 'Depth_m'])
print("norm_cols: no-change case and suffix-stripping case pass. ✓")

norm_cols: no-change case and suffix-stripping case pass. ✓


extract_scales


def extract_scales(
    cols, # Column names to scan for scale-factor suffixes
)->dict:

Return {col: factor} for columns with ( x 10^N) suffix in original names, excluding empty-string results.

scales = extract_scales(['I129_at_kg ( x 10^7)', 'U236_at_l(x10^6)', 'Stn'])
test_eq(scales, {'I129_at_kg': 10_000_000, 'U236_at_l': 1_000_000})
print("extract_scales: two scales extracted, no spurious matches. ✓")

extract_scales: two scales extracted, no spurious matches. ✓


apply_scales


def apply_scales(
    df, # DataFrame to modify in place
    scales, # {col: factor} of scale factors to apply
)->DataFrame:

Multiply columns in df by their scale factor.

df = pd.DataFrame({'I129_at_kg': [1.0, 2.0], 'Stn': ['A', 'B']})
scales = {'I129_at_kg': 10_000_000}
result = apply_scales(df, scales)
test_eq(result['I129_at_kg'].tolist(), [10_000_000.0, 20_000_000.0])
test_eq(result['Stn'].tolist(), ['A', 'B'])
print("apply_scales: value columns scaled, non-value columns unchanged. ✓")

apply_scales: value columns scaled, non-value columns unchanged. ✓


load_data


def load_data(
    recs:NoneType=None, # Optional dict of year->record; defaults to all RECORDS
)->dict:

Fetch all JOIS records from Zenodo, align column names, apply scale factors, and return a dict holding a single SEAWATER DataFrame.

dfs = load_data()

print(dfs['SEAWATER'].describe(include='number').T[['count', 'mean', 'min', 'max']])

                   count          mean          min           max
sample_number      441.0  6.377082e+02    25.000000  1.209000e+03
Latitude_degN      441.0  7.469298e+01    70.539500  7.969150e+01
Longitude_degE     441.0 -1.452820e+02  -153.320167 -1.229452e+02
Pressure_dbar      441.0  7.458816e+02     4.223000  3.817376e+03
Depth_m            441.0  7.350311e+02     4.179505  3.743998e+03
Temperature_degC   441.0 -1.417129e-01    -1.586100  5.142400e+00
Conservative_Temp  441.0 -1.736566e-01    -1.579121  5.199772e+00
Potential_Temp     441.0 -1.787647e-01    -1.586842  5.142001e+00
Salinity_psu       441.0  3.343552e+01    24.756100  3.495950e+01
Absolute_Salinity  441.0  3.359702e+01    24.874988  3.512872e+01
Sigma0             441.0  2.685045e+01    19.866195  2.810244e+01
Insitu_Density     441.0  1.026899e+00     1.019916  1.028151e+00
I129_at_kg         441.0  1.097553e+09     0.000000  6.033279e+09
unc_I129_at_kg     441.0  4.149385e+07     0.000000  2.503235e+08
unc_129_pct        440.0  9.664254e+00     0.000021  1.232857e+03
I129_at_l          441.0  1.128016e+09     0.000000  6.200499e+09
unc_I129_at_l      441.0  4.264495e+07     0.000000  2.573090e+08
U236_at_kg         286.0  1.457382e+07  6060.553755  3.343125e+07
unc_U236_at_kg     286.0  5.442092e+05  8747.926973  7.695057e+06
U236_at_l          284.0  1.498641e+07  6231.162715  3.435749e+07
unc_U236_at_l      284.0  5.344472e+05  8994.188551  2.728110e+06
U238_ppb           286.0  3.042647e+00     2.100575  6.216479e+00
unc_U238_ppb       285.0  1.938567e-01     0.035725  3.873545e+00
U236_U238          286.0  1.879611e+03     0.765689  4.082257e+03
unc_U236_U238      286.0  1.458633e+02     1.042832  7.845516e+02

Rename and standardise columns

The raw JOIS columns encode nuclide, unit, and sometimes method in a single string (e.g. I129_at_kg), with a couple of exceptions (U238_ppb, U236_U238). We map CTD column names to MARIS uppercase standards, combine separate date/time string columns, and handle the two special-case columns.

We address these in three callbacks. All run on the SEAWATER group only, since JOIS has no biota, sediment, or suspended matter.

FEEDBACK TO DATA PROVIDER

Column headers in the JOIS and GEOTRACES datasets encode multiple pieces of information (nuclide, unit, measurement method) into a single string. This adds friction to data ingestion, since every new dataset with a different naming convention requires custom parsing logic. MARIS prefers a tidy data layout (Wickham 2014, doi:10.18637/jss.v059.i10) where each column holds a single variable, and metadata like nuclide, unit, and method are stored as separate columns, not baked into the header.


RenameNucColsCB


def RenameNucColsCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Align U238_ppb and U236_U238 column names to {Nuc}_{Unit} pattern before melting.

# Verify RenameNucColsCB renames columns correctly on mock data
dfs_mock = {'SEAWATER': pd.DataFrame({'U238_ppb': [1.0], 'U236_U238': [2.0], 'I129_at_kg': [3.0]})}
tfm = Transformer(dfs_mock, cbs=[RenameNucColsCB()])
tfm()
test_eq('U238_at_ppb' in tfm.dfs['SEAWATER'].columns, True)
test_eq('U236_U238_at_ratio' in tfm.dfs['SEAWATER'].columns, True)
test_eq('I129_at_kg' in tfm.dfs['SEAWATER'].columns, True)
print("RenameNucColsCB: columns renamed correctly on mock data. ✓")

RenameNucColsCB: columns renamed correctly on mock data. ✓

tfm = Transformer(dfs, cbs=[RenameNucColsCB()])
tfm()
print("Columns after RenameNucColsCB:", [c for c in tfm.dfs['SEAWATER'].columns if '238' in c or '236' in c])

Columns after RenameNucColsCB: ['U236_at_kg', 'unc_U236_at_kg', 'U236_at_l', 'unc_U236_at_l', 'U238_at_ppb', 'unc_U238_at_ppb', 'U236_U238_at_ratio', 'unc_U236_U238_at_ratio']

Align nuclide column names

Most JOIS concentration columns follow a {Nuclide}_{Unit} pattern (I129_at_kg, I129_at_l, U236_at_kg). Two columns do not:

U238_ppb puts the unit (ppb, parts per billion) directly after the nuclide, with no _at_ separator, so a split on _at_ could not separate NUCLIDE from UNIT. We rename it to U238_at_ppb so the split is clean.
U236_U238 uses an underscore as a separator between two nuclide names, not between nuclide and unit. We rename it to U236_U238_at_ratio.

The downstream MeltJOISCB splits every value column on _at_ to derive NUCLIDE and UNIT. These renames are done in anticipation of that step, ensuring every column meets the {Nuclide}_{Unit} contract so the melt can proceed consistently.
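To make the naming contract concrete, here is a small sketch of a split on _at_. The helper split_nuclide_unit is hypothetical and only mirrors the behaviour described in this section (the unit keeps its at_ prefix, as in the UNIT values printed after the melt); it is not the code inside MeltJOISCB.

```python
def split_nuclide_unit(col):
    "Split a value-column name at the first '_at_' into (nuclide, unit)."
    nuclide, sep, rest = col.partition("_at_")
    if not sep:
        raise ValueError(f"no '_at_' separator in {col!r}")
    return nuclide, "at_" + rest

# Renamed columns split cleanly, including the two-nuclide ratio column
examples = ["I129_at_kg", "U238_at_ppb", "U236_U238_at_ratio"]
pairs = [split_nuclide_unit(c) for c in examples]

# A raw provider name without the separator cannot be split
try:
    split_nuclide_unit("U238_ppb")
    raw_name_splits = True
except ValueError:
    raw_name_splits = False
```

Using partition on the first _at_ keeps U236_U238 intact as a nuclide label, since the underscore between the two isotopes is not followed by at_.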


RenameColsCB


def RenameColsCB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Map JOIS provider CTD and sample columns to MARIS standard names.

# Verify RenameColsCB maps provider columns to MARIS names
dfs_mock = {'SEAWATER': pd.DataFrame({'Latitude_degN': [70.5], 'Longitude_degE': [-140.0],
                                       'Depth_m': [200.0], 'sample_number': [101]})}
tfm = Transformer(dfs_mock, cbs=[RenameColsCB()])
tfm()
test_eq('LAT' in tfm.dfs['SEAWATER'].columns, True)
test_eq('LON' in tfm.dfs['SEAWATER'].columns, True)
test_eq('SMP_DEPTH' in tfm.dfs['SEAWATER'].columns, True)
test_eq('SMP_ID_PROVIDER' in tfm.dfs['SEAWATER'].columns, True)
print("RenameColsCB: provider columns mapped to MARIS names. ✓")

RenameColsCB: provider columns mapped to MARIS names. ✓

tfm = Transformer(dfs, cbs=[RenameNucColsCB(), RenameColsCB()])
tfm()
print("Sample of renamed columns:\n", tfm.dfs['SEAWATER'][['LAT', 'LON', 'STATION', 'SMP_DEPTH', 'SMP_ID_PROVIDER']].head(2).to_string())

Sample of renamed columns:
          LAT         LON STATION    SMP_DEPTH  SMP_ID_PROVIDER
0  75.000833 -150.001333     CB4  2001.727548            289.0
1  75.000833 -150.001333     CB4  1501.921299            290.0
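The renaming itself amounts to a dictionary lookup on column names. The sketch below uses only the four mappings exercised by the mock test above; COLS_MAP is a hypothetical name, the real mapping covers more columns (STATION, TEMP, SAL, and so on), and Sigma0 stands in for any unmapped column.

```python
import pandas as pd

# Partial, illustrative mapping: provider name -> MARIS name
COLS_MAP = {
    "Latitude_degN": "LAT",
    "Longitude_degE": "LON",
    "Depth_m": "SMP_DEPTH",
    "sample_number": "SMP_ID_PROVIDER",
}

raw = pd.DataFrame({"Latitude_degN": [70.5], "Longitude_degE": [-140.0],
                    "Depth_m": [200.0], "sample_number": [101],
                    "Sigma0": [27.9]})

# Columns absent from the mapping (here Sigma0) keep their provider name
renamed = raw.rename(columns=COLS_MAP)
```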


ParseDateTimeCB


def ParseDateTimeCB(
    col_date:str='Date', # Source date column name
    col_time:str='Time', # Source time column name
):

Combine JOIS Date and Time columns into a single TIME column.

# Verify ParseDateTimeCB combines Date and Time into TIME
dfs_mock = {'SEAWATER': pd.DataFrame({'Date': ['2021-08-19'], 'Time': ['08:00:00']})}
tfm = Transformer(dfs_mock, cbs=[ParseDateTimeCB()])
tfm()
test_eq('TIME' in tfm.dfs['SEAWATER'].columns, True)
test_eq('Date' not in tfm.dfs['SEAWATER'].columns, True)
print(f"ParseDateTimeCB: TIME = {tfm.dfs['SEAWATER']['TIME'].iloc[0]}. ✓")

ParseDateTimeCB: TIME = 2021-08-19 08:00:00. ✓

tfm = Transformer(dfs, cbs=[RenameNucColsCB(), RenameColsCB(), ParseDateTimeCB()])
tfm()
print("First 3 TIME values:\n", tfm.dfs['SEAWATER']['TIME'].head(3).to_string())

First 3 TIME values:
 0   2021-08-28 13:29:58
1   2021-08-28 13:29:58
2   2021-08-28 13:29:58

Note for MARIS DB team

The JOIS datasets include a Pressure_dbar column with CTD pressure values. This parameter is currently not present in the MARIS output schema. If useful for the database, it could be added alongside the other CTD metadata fields. The extra CTD columns (Conservative_Temp, Potential_Temp, Absolute_Salinity, Sigma0, Insitu_Density, unc_129_pct) and the Cruise column are not mapped in the NC_CSV dict (the central remapping from internal column names to NetCDF/CSV output names), so they are dropped after the melt. If the data team decides these fields are useful, the fix should go in NC_CSV, not in this handler.

Reshape wide to long

The raw JOIS data uses wide format: each sample has one row, and nuclide-unit concentrations are spread across separate columns (I129_at_kg, I129_at_l, U236_at_kg, etc.). MARIS requires long format (one row per measurement) with columns for NUCLIDE, UNIT, VALUE, and UNC.

MeltJOISCB melts the value columns, merges the matching uncertainty columns, then splits each column name on _at_ to derive the nuclide name and unit. Rows where VALUE is NaN are dropped.
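The reshape described here can be approximated with two melts and a merge. This is a sketch under stated assumptions, not the MeltJOISCB implementation: the helper melt_with_unc is hypothetical, and it assumes the identifier columns uniquely identify each sample row so that the merge pairs every value with exactly one uncertainty.

```python
import pandas as pd

def melt_with_unc(df, meta_cols, val_cols):
    "Melt value columns to long format and attach the matching unc_ columns."
    vals = df.melt(id_vars=meta_cols, value_vars=val_cols,
                   var_name="COL", value_name="VALUE")
    uncs = df.melt(id_vars=meta_cols,
                   value_vars=[f"unc_{c}" for c in val_cols],
                   var_name="COL", value_name="UNC")
    uncs["COL"] = uncs["COL"].str[len("unc_"):]  # unc_I129_at_kg -> I129_at_kg
    out = vals.merge(uncs, on=meta_cols + ["COL"], how="left")
    out = out.dropna(subset=["VALUE"]).reset_index(drop=True)
    # Split the former column name into nuclide and unit at the first _at_
    parts = out["COL"].str.split("_at_", n=1, expand=True)
    out["NUCLIDE"] = parts[0]
    out["UNIT"] = "at_" + parts[1]
    return out.drop(columns="COL")

# Toy data: sample 1 has no U-236 measurement
toy = pd.DataFrame({
    "SMP_ID_PROVIDER": [1, 2],
    "I129_at_kg": [6.4e8, 1.3e9], "unc_I129_at_kg": [2.0e7, 4.1e7],
    "U236_at_kg": [None, 1.5e7], "unc_U236_at_kg": [None, 5.0e5],
})
long_df = melt_with_unc(toy, ["SMP_ID_PROVIDER"], ["I129_at_kg", "U236_at_kg"])
```

Two samples with two value columns give four candidate rows; the missing U-236 value is dropped, leaving three.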


MeltJOISCB


def MeltJOISCB(
    meta_cols, # Columns to keep as identifiers
    val_cols, # Value columns to melt
    val_name:str='VALUE', # Name of melted value column
    unc_name:str='UNC', # Name of uncertainty column
):

Reshape JOIS wide nuclide columns to long format with NUCLIDE, UNIT, VALUE, UNC columns.

Exported source

# Columns kept as identifiers during the wide-to-long reshape
META_COLS = ['Cruise', 'STATION', 'SMP_ID_PROVIDER', 'LAT', 'LON',
             'TIME', 'Pressure_dbar', 'SMP_DEPTH', 'TEMP', 'SAL']

# Columns to melt into VALUE/UNC long format
VAL_COLS = ['I129_at_kg', 'I129_at_l', 'U236_at_kg', 'U236_at_l',
            'U238_at_ppb', 'U236_U238_at_ratio']

# Verify MeltJOISCB produces correct NUCLIDE, UNIT, VALUE, UNC columns on mock data
dfs_mock = {'SEAWATER': pd.DataFrame({
    'Cruise': ['2021'], 'STATION': ['CB4'], 'SMP_ID_PROVIDER': [1],
    'I129_at_kg': [6.4e8], 'unc_I129_at_kg': [2.0e7],
    'I129_at_l': [6.6e8], 'unc_I129_at_l': [2.1e7],
})}
MOCK_META = ['Cruise', 'STATION', 'SMP_ID_PROVIDER']
MOCK_VALS = ['I129_at_kg', 'I129_at_l']
tfm = Transformer(dfs_mock, cbs=[MeltJOISCB(MOCK_META, MOCK_VALS)])
tfm()
out = tfm.dfs['SEAWATER']
test_eq(len(out), 2)
test_eq(out['NUCLIDE'].tolist(), ['I129', 'I129'])
test_eq(out['UNIT'].tolist(), ['at_kg', 'at_l'])
test_eq(out['VALUE'].tolist(), [6.4e8, 6.6e8])
print("MeltJOISCB on mock data: 2 rows, correct NUCLIDE/UNIT split. ✓")

MeltJOISCB on mock data: 2 rows, correct NUCLIDE/UNIT split. ✓

tfm = Transformer(dfs, cbs=[RenameNucColsCB(), RenameColsCB(), ParseDateTimeCB(),
                             MeltJOISCB(META_COLS, VAL_COLS)])
tfm()
out = tfm.dfs['SEAWATER']
print(f"Shape: {out.shape}")
print("NUCLIDE values:", out['NUCLIDE'].unique())
print("UNIT values:", out['UNIT'].unique())
print(out[['NUCLIDE', 'UNIT', 'VALUE', 'UNC']].head(6).to_string())

Shape: (2024, 14)
NUCLIDE values: ['I129' 'U236' 'U238' 'U236_U238']
UNIT values: ['at_kg' 'at_l' 'at_ppb' 'at_ratio']
  NUCLIDE   UNIT         VALUE           UNC
0    I129  at_kg  6.398086e+08  2.027918e+07
1    I129  at_kg  1.315774e+09  4.084218e+07
2    I129  at_kg  1.409492e+09  4.401294e+07
3    I129  at_kg  1.433159e+09  4.477813e+07
4    I129  at_kg  1.267384e+09  3.931040e+07
5    I129  at_kg  1.903570e+09  5.869902e+07

FEEDBACK TO DATA PROVIDER

Uncertainty values in 2024 are about four orders of magnitude lower than in 2021-2023 (relative uncertainty ~0.0002% versus ~3-5%), with no documented change in analytical method or instrumentation. All three uncertainty representations (unc_I129_at_kg, unc_I129_at_l, unc_129_pct) are internally consistent within each year. The provider should confirm whether this reflects a genuine precision improvement or a data reporting issue.
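A per-year comparison of relative uncertainty is one way to surface this kind of discrepancy. The sketch below is illustrative only: the helper median_rel_unc_pct and the year column are assumptions, and the numbers are toy values chosen to resemble the magnitudes quoted above, not JOIS data.

```python
import pandas as pd

def median_rel_unc_pct(df, val="I129_at_kg", unc="unc_I129_at_kg", by="year"):
    "Median relative uncertainty (percent) per group, ignoring zero values."
    ok = df[df[val] > 0]
    rel = 100 * ok[unc] / ok[val]
    return rel.groupby(ok[by]).median()

# Toy values only
toy = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "I129_at_kg": [1.0e9, 2.0e9, 1.0e9, 2.0e9],
    "unc_I129_at_kg": [3.0e7, 8.0e7, 2.0e3, 4.0e3],
})
summary = median_rel_unc_pct(toy)
```

The median is used rather than the mean so that a few samples near the detection limit, with very large relative uncertainty, do not dominate the comparison.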

Convert U-238 units

JOIS reports U-238 in parts-per-billion (ppb, mass of U per mass of seawater), while MARIS requires atoms per kg. Since 1 ppb corresponds to 10^-9 g of U per g of seawater, i.e. 10^-6 g of U per kg of seawater, ConvertU238CB converts the VALUE and UNC columns for U-238 rows using:

\[\text{atoms/kg} = C_{\text{ppb}} \times 10^{-6} \times \frac{N_A}{M_{238}}\]

where \(N_A = 6.02214076 \times 10^{23}\ \text{mol}^{-1}\) is Avogadro’s number and \(M_{238} = 238.05\ \text{g/mol}\) is the molar mass of U-238, giving a conversion factor of \(2.530 \times 10^{15}\ \text{atoms/kg per ppb}\). For example, the dataset mean of 3.04 ppb corresponds to about \(7.7 \times 10^{15}\ \text{atoms/kg}\). Rows with other nuclides are left unchanged.

ConvertU238CB


def ConvertU238CB(
    grps:list=None, # Groups to process; None = all groups in `tfm.dfs`
):

Convert U-238 VALUE and UNC from ppb to atoms/kg.

Exported source

# Convert U-238 from ppb to atoms/kg: ppb * 1e-6 (g U per kg per ppb) * (1/238.05) * 6.02214076e23
U238_PPB_TO_AT_KG = 1e-6 * 6.02214076e23 / 238.05  # ~2.530e15 atoms/kg per ppb

# Verify ConvertU238CB scales VALUE and UNC for U-238 only
dfs_mock = {'SEAWATER': pd.DataFrame({
    'NUCLIDE': ['I129', 'U238'],
    'UNIT': ['at_kg', 'at_ppb'],
    'VALUE': [1.0, 2.0],
    'UNC': [0.1, 0.2],
})}
tfm = Transformer(dfs_mock, cbs=[ConvertU238CB()])
tfm()
out = tfm.dfs['SEAWATER']
test_eq(out.loc[out['NUCLIDE']=='U238', 'VALUE'].iloc[0], 2.0 * U238_PPB_TO_AT_KG)
test_eq(out.loc[out['NUCLIDE']=='U238', 'UNC'].iloc[0], 0.2 * U238_PPB_TO_AT_KG)
test_eq(out.loc[out['NUCLIDE']=='U238', 'UNIT'].iloc[0], 'at_kg')
print("ConvertU238CB: I-129 unchanged, U-238 scaled, UNIT updated. ✓")

ConvertU238CB: I-129 unchanged, U-238 scaled, UNIT updated. ✓

tfm = Transformer(dfs, cbs=[RenameNucColsCB(), RenameColsCB(), ParseDateTimeCB(),
                             MeltJOISCB(META_COLS, VAL_COLS),
                             ConvertU238CB()])
tfm()
out = tfm.dfs['SEAWATER']
u238 = out[out['NUCLIDE']=='U238']
print(f"U-238 rows: {len(u238)}, mean VALUE = {u238['VALUE'].mean():.2e} atoms/kg")
i129 = out[out['NUCLIDE']=='I129']
print(f"I-129 rows: {len(i129)}, mean VALUE = {i129['VALUE'].mean():.2e} atoms/kg")

U-238 rows: 286, mean VALUE = 7.70e+15 atoms/kg
I-129 rows: 882, mean VALUE = 1.11e+09 atoms/kg
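The row selection behind this step can be sketched with a boolean mask. The helper scale_rows is hypothetical and is not the ConvertU238CB code; the factor of 10.0 is a placeholder chosen for readability, not the physical ppb conversion factor.

```python
import pandas as pd

def scale_rows(df, nuclide, factor, new_unit):
    "Multiply VALUE and UNC by factor for one nuclide and relabel its unit."
    out = df.copy()
    mask = out["NUCLIDE"] == nuclide
    out.loc[mask, ["VALUE", "UNC"]] = out.loc[mask, ["VALUE", "UNC"]] * factor
    out.loc[mask, "UNIT"] = new_unit
    return out

toy = pd.DataFrame({"NUCLIDE": ["I129", "U238"], "UNIT": ["at_kg", "at_ppb"],
                    "VALUE": [1.0, 2.0], "UNC": [0.1, 0.2]})

# Placeholder factor: only the U238 row is touched
converted = scale_rows(toy, "U238", factor=10.0, new_unit="at_kg")
```

Scaling VALUE and UNC together keeps the relative uncertainty of each measurement unchanged by the unit conversion.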

Remap nomenclatures to MARIS identifiers

After the melt and the U-238 conversion, NUCLIDE (I129, U236, U238, U236_U238) and UNIT (at_kg, at_l, at_ratio) still hold provider strings, and the LAB and AREA columns do not exist yet. MARIS stores all four as integer foreign-key IDs from the central nomenclatures.

RemapCB maps source column values through a lookup table to a target column. For constants like LAB (ETH Zurich/LIP, ID 345) and AREA (Beaufort Sea, ID 4256), an empty LUT with a default_val injects the same value for every row.
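The lookup-with-default behaviour described above can be sketched in a few lines. The function remap is a hypothetical stand-in that borrows RemapCB's parameter names for readability; it is not the marisco implementation.

```python
import pandas as pd

def remap(df, lut, col_remap, col_src, default_val=None):
    "Write lut[value] into col_remap, or default_val when the key is missing."
    out = df.copy()
    out[col_remap] = [lut.get(v, default_val) for v in out[col_src]]
    return out

toy = pd.DataFrame({"NUCLIDE": ["I129", "U236"]})

# Regular lookup: strings replaced by integer IDs
toy = remap(toy, {"I129": 28, "U236": 108}, "NUCLIDE", "NUCLIDE")

# Empty lookup table: every row falls through to the default constant
toy = remap(toy, {}, "LAB", "NUCLIDE", default_val=345)
```

With an empty table no key ever matches, which is why the choice of source column is irrelevant when injecting a constant.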

Exported source

# MARIS nuclide IDs confirmed via get_lut('NUCLIDE')
NUCLIDE_LUT = {'I129': 28, 'U236': 108, 'U238': 64, 'U236_U238': 131}

# MARIS unit IDs confirmed via get_lut('UNIT')
UNIT_LUT = {'at_kg': 9, 'at_l': 12, 'at_ratio': 6}

# Verify RemapCB assigns correct NUCLIDE, UNIT, LAB, AREA IDs
dfs_mock = {'SEAWATER': pd.DataFrame({
    'NUCLIDE': ['I129', 'U236'],
    'UNIT': ['at_kg', 'at_l'],
    'VALUE': [1.0, 2.0],
    'UNC': [0.1, 0.2],
})}
tfm = Transformer(dfs_mock, cbs=[
    RemapCB(lut=NUCLIDE_LUT, col_remap='NUCLIDE', col_src='NUCLIDE'),
    RemapCB(lut=UNIT_LUT, col_remap='UNIT', col_src='UNIT'),
    RemapCB(lut={}, col_remap='LAB', col_src='NUCLIDE', default_val=345),
    RemapCB(lut={}, col_remap='AREA', col_src='NUCLIDE', default_val=4256),
])
tfm()
out = tfm.dfs['SEAWATER']
test_eq(out['NUCLIDE'].tolist(), [28, 108])
test_eq(out['UNIT'].tolist(), [9, 12])
test_eq(out['LAB'].tolist(), [345, 345])
test_eq(out['AREA'].tolist(), [4256, 4256])
print("RemapCB: all nomenclatures mapped to correct MARIS IDs. ✓")

RemapCB: all nomenclatures mapped to correct MARIS IDs. ✓

tfm = Transformer(dfs, cbs=[
    RenameNucColsCB(), RenameColsCB(), ParseDateTimeCB(),
    MeltJOISCB(META_COLS, VAL_COLS),
    ConvertU238CB(),
    RemapCB(lut=NUCLIDE_LUT, col_remap='NUCLIDE', col_src='NUCLIDE'),
    RemapCB(lut=UNIT_LUT, col_remap='UNIT', col_src='UNIT'),
    RemapCB(lut={}, col_remap='LAB', col_src='NUCLIDE', default_val=345),
    RemapCB(lut={}, col_remap='AREA', col_src='NUCLIDE', default_val=4256),
])
tfm()
out = tfm.dfs['SEAWATER']
print(out[['NUCLIDE', 'UNIT', 'LAB', 'AREA']].drop_duplicates().to_string())

      NUCLIDE  UNIT  LAB  AREA
0          28     9  345  4256
441        28    12  345  4256
958       108     9  345  4256
1400      108    12  345  4256
1840       64     9  345  4256
2281      131     6  345  4256

Standardise final columns

Three shared callbacks complete the pipeline:

SanitizeLonLatCB: validates longitude/latitude ranges and ensures correct sign convention
EncodeTimeCB: encodes the TIME column into the NetCDF-compatible numeric representation
AddSampleIDCB: assigns a sequential SMP_ID and preserves the provider’s SMP_ID_PROVIDER

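All three are imported from marisco.callbacks. SanitizeLonLatCB and EncodeTimeCB need no configuration for JOIS; AddSampleIDCB only needs the name of the provider ID column (col_provider='SMP_ID_PROVIDER').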
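For orientation, the time encoding and sample numbering can be sketched as below. This assumes the numeric representation is seconds since 1970-01-01, which is consistent with the TIME values in the final summary, and that SMP_ID simply numbers rows from 1; the helper to_epoch_seconds is hypothetical and not the marisco code.

```python
import pandas as pd

def to_epoch_seconds(times):
    "Convert a datetime Series to integer seconds since 1970-01-01."
    return (times - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

toy = pd.DataFrame({"TIME": pd.to_datetime(["2021-08-28 13:29:58",
                                             "2021-08-28 13:29:58"])})
toy["TIME"] = to_epoch_seconds(toy["TIME"])

# Sequential identifier, one per row, starting at 1
toy["SMP_ID"] = range(1, len(toy) + 1)
```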

tfm = Transformer(dfs, cbs=[
    RenameNucColsCB(), RenameColsCB(), ParseDateTimeCB(),
    MeltJOISCB(META_COLS, VAL_COLS),
    ConvertU238CB(),
    RemapCB(lut=NUCLIDE_LUT, col_remap='NUCLIDE', col_src='NUCLIDE'),
    RemapCB(lut=UNIT_LUT, col_remap='UNIT', col_src='UNIT'),
    RemapCB(lut={}, col_remap='LAB', col_src='NUCLIDE', default_val=345),
    RemapCB(lut={}, col_remap='AREA', col_src='NUCLIDE', default_val=4256),
    SanitizeLonLatCB(),
    EncodeTimeCB(),
    AddSampleIDCB(col_provider='SMP_ID_PROVIDER'),
])
tfm()
out = tfm.dfs['SEAWATER']
print(f"Final shape: {out.shape}")
print("Columns:", out.columns.tolist())
print(out[['SMP_ID', 'SMP_ID_PROVIDER', 'NUCLIDE', 'UNIT', 'LAB', 'AREA']].head(4).to_string())

Final shape: (2016, 17)
Columns: ['Cruise', 'STATION', 'SMP_ID_PROVIDER', 'LAT', 'LON', 'TIME', 'Pressure_dbar', 'SMP_DEPTH', 'TEMP', 'SAL', 'VALUE', 'UNC', 'NUCLIDE', 'UNIT', 'LAB', 'AREA', 'SMP_ID']
   SMP_ID SMP_ID_PROVIDER  NUCLIDE  UNIT  LAB  AREA
0       1           289.0       28     9  345  4256
1       2           290.0       28     9  345  4256
2       3           291.0       28     9  345  4256
3       4           292.0       28     9  345  4256

print("Final data summary (uppercase columns only):")
upper_cols = [c for c in out.columns if c.isupper()]
print(out[upper_cols].describe().to_string())

Final data summary (uppercase columns only):
               LAT          LON          TIME    SMP_DEPTH         TEMP          SAL         VALUE           UNC      NUCLIDE         UNIT     LAB    AREA       SMP_ID
count  2016.000000  2016.000000  2.016000e+03  2016.000000  2016.000000  2016.000000  2.016000e+03  2.015000e+03  2016.000000  2016.000000  2016.0  2016.0  2016.000000
mean     74.860128  -145.420826  1.670466e+09   750.954670    -0.133983    33.437905  1.088666e+12  6.931932e+10    70.189980     9.650298   345.0  4256.0  1008.500000
std       2.590004     7.100110  2.049542e+07   893.294636     0.835686     2.459200  2.702033e+12  2.975590e+11    41.316164     2.018581     0.0     0.0   582.113391
min      70.539500  -153.320167  1.630157e+09     4.179505    -1.586100    24.756100  0.000000e+00  0.000000e+00    28.000000     6.000000   345.0  4256.0     1.000000
25%      72.599667  -151.575833  1.664062e+09   151.645681    -0.549900    33.105800  9.085735e+06  2.997537e+05    28.000000     9.000000   345.0  4256.0   504.750000
50%      75.005167  -144.719000  1.664717e+09   393.656724    -0.249700    34.747200  1.045061e+08  1.853332e+06    64.000000     9.000000   345.0  4256.0  1008.500000
75%      77.006333  -140.014042  1.665390e+09  1002.757663     0.411200    34.879600  1.946451e+09  7.579162e+07   108.000000    12.000000   345.0  4256.0  1512.250000
max      79.691500  -122.945167  1.726664e+09  3743.997720     5.142400    34.959500  1.572581e+13  9.798895e+12   131.000000    12.000000   345.0  4256.0  2016.000000

NetCDF encoder

The encoder wraps the full pipeline and writes the standardised data to a NetCDF4 file. Global attributes are assembled via GlobAttrsFeeder with BboxCB, DepthRangeCB, TimeRangeCB, plus keywords and processing logs.

No INIS record exists yet for the JOIS datasets, so the INISCB line in get_attrs is left commented out as a placeholder. Once the record is created, the IAEA INIS repository will supply the bibliographic metadata.
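The kind of summary that BboxCB, DepthRangeCB and TimeRangeCB contribute can be pictured with a plain pandas sketch. The helper summarise_extent and its dictionary keys are assumptions for illustration and do not reflect the attribute names marisco actually writes; the values are toy numbers.

```python
import pandas as pd

def summarise_extent(df):
    "Return bounding box, depth range and time range as a flat dict."
    return {
        "lat_min": float(df["LAT"].min()), "lat_max": float(df["LAT"].max()),
        "lon_min": float(df["LON"].min()), "lon_max": float(df["LON"].max()),
        "depth_min": float(df["SMP_DEPTH"].min()),
        "depth_max": float(df["SMP_DEPTH"].max()),
        "time_min": int(df["TIME"].min()), "time_max": int(df["TIME"].max()),
    }

# Toy frame; TIME is assumed to be already encoded as epoch seconds
toy = pd.DataFrame({"LAT": [70.5, 79.7], "LON": [-153.3, -122.9],
                    "SMP_DEPTH": [4.2, 3744.0],
                    "TIME": [1630157398, 1726664000]})
attrs = summarise_extent(toy)
```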


get_attrs


def get_attrs(
    tfm
):

Retrieve global attributes for the JOIS handler.

Exported source

# NetCDF global attributes
JOIS_KEYWORDS = ['Beaufort Sea', 'JOIS', 'I-129', 'U-236', 'U-238', 'radionuclides', 'seawater', 'Arctic']

def get_attrs(tfm):
    "Retrieve global attributes for the JOIS handler."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        #INISCB('XXXXXXXX'),  # TODO: add INIS record id when available
        KeyValuePairCB('keywords', ', '.join(JOIS_KEYWORDS)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs)),
    ])()


encode


def encode(
    fname_out:NoneType=None, # Output NetCDF file path; defaults to fname_out
):

Encode JOIS data to NetCDF4.

Exported source

def encode(fname_out=None  # Output NetCDF file path; defaults to fname_out
            ):
    "Encode JOIS data to NetCDF4."
    fname_out = fname_out or globals().get('fname_out', 'JOIS_Beaufort_Sea.nc')
    dfs = load_data()
    tfm = Transformer(dfs, cbs=[
        RenameNucColsCB(), RenameColsCB(), ParseDateTimeCB(),
        MeltJOISCB(META_COLS, VAL_COLS),
        ConvertU238CB(),
        RemapCB(lut=NUCLIDE_LUT, col_remap='NUCLIDE', col_src='NUCLIDE'),
        RemapCB(lut=UNIT_LUT, col_remap='UNIT', col_src='UNIT'),
        RemapCB(lut={}, col_remap='LAB', col_src='NUCLIDE', default_val=345),
        RemapCB(lut={}, col_remap='AREA', col_src='NUCLIDE', default_val=4256),
        SanitizeLonLatCB(),
        EncodeTimeCB(),
        AddSampleIDCB(col_provider='SMP_ID_PROVIDER'),
    ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, dest_fname=fname_out,
                            global_attrs=get_attrs(tfm))
    encoder.encode()

# Encode to NetCDF
encode('../../_data/output/jois.nc')
print("JOIS NetCDF written.")

JOIS NetCDF written.

#decode(fname_in='../../_data/output/jois.nc', verbose=True)
#to_csv('../../_data/output/jois.nc')