import warnings
warnings.filterwarnings('ignore')

TEPCO
NetCDF format
Refactoring in progress. This handler is being updated to use the new marisco API (fuzzy matching, make_lut/make_lut_from, RemapCB), following the approach used in the HELCOM and GEOTRACES handlers. Exports and execution are temporarily disabled.
[newer version in progress using the RAMDAS new API]
import pandas as pd
import re
import numpy as np
import fastcore.all as fc
from tqdm import tqdm
from collections import defaultdict
from marisco.callbacks import (
    Callback,
    Transformer,
    EncodeTimeCB,
    SanitizeLonLatCB,
)
from marisco.encoders import NetCDFEncoder
from marisco.metadata import (
    GlobAttrsFeeder,
    BboxCB,
    TimeRangeCB,
    ZoteroCB,
    KeyValuePairCB
)
from marisco.netcdf2csv import decode

Configuration & file paths
fname_coastal_water = 'https://radioactivity.nra.go.jp/cont/en/results/sea/coastal_water.csv'
fname_clos1F = 'https://radioactivity.nra.go.jp/cont/en/results/sea/close1F_water.xlsx'
fname_iaea_orbs = 'https://raw.githubusercontent.com/RML-IAEA/iaea.orbs/refs/heads/main/src/iaea/orbs/stations/station_points.csv'
fname_out = '../../_data/output/tepco.nc'

Load data
We load the data from the NRA (Nuclear Regulation Authority) website. For the moment, we only process the radioactivity concentrations measured in seawater around the Fukushima Dai-ichi NPP [TEPCO], provided in the coastal_water.csv and close1F_water.xlsx files.
In the near future, MARIS will provide a dedicated handler for all ALPS-related data, including measurements provided not only by TEPCO but also by MOE, NRA, MLITT and Fukushima Prefecture.
The coastal_water.csv file contains two sections: the measurements and the locations. Below, we identify the line number where the locations begin. A single point of truth for the station locations would ease processing in the future.
def find_location_section(df,
                          col_idx=0,
                          pattern='Sampling point number'
                          ):
    "Find the line number where location data begins."
    mask = df.iloc[:, col_idx] == pattern
    indices = df[mask].index
    return indices[0] if len(indices) > 0 else -1

find_location_section(pd.read_csv(fname_coastal_water, low_memory=False))

np.int64(29252)
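To make the two-section layout concrete, here is a minimal sketch on a small synthetic file. The content of `RAW`, the helper name `find_section_start` and the column labels are assumptions made for illustration only; they are not the real NRA data. The sketch locates the header of the second section, then reads each section with `skiprows` and `nrows`.

```python
import io
import pandas as pd

# Synthetic stand-in for a two-section file (NOT the real NRA data):
# a title line, a measurements table, then a locations table whose
# header starts with 'Sampling point number' in the first column.
RAW = "\n".join([
    "Title line,,",
    "Sampling date,Sampling point number,137Cs (Bq/L)",
    "2020/1/1,T-1,0.5",
    "2020/1/2,T-2,ND",
    "Sampling point number,Latitude,Longitude",
    "T-1,37.43,141.03",
    "T-2,37.42,141.04",
])

def find_section_start(df, col_idx=0, pattern="Sampling point number"):
    "Return the index of the first row whose `col_idx` cell equals `pattern`, else -1."
    hits = df.index[df.iloc[:, col_idx] == pattern]
    return int(hits[0]) if len(hits) > 0 else -1

# Read the whole file once (the title line becomes the header) to locate section two.
locs_idx = find_section_start(pd.read_csv(io.StringIO(RAW)))

# Measurements: skip the title line and stop before the locations header.
meas = pd.read_csv(io.StringIO(RAW), skiprows=1, nrows=locs_idx - 1)

# Locations: skip every line up to the locations header.
locs = pd.read_csv(io.StringIO(RAW), skiprows=locs_idx + 1)
```

The same index drives both reads, so a layout change in the source file only has to be handled in one place.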
The time fields of the coastal_water.csv and close1F_water.xlsx files require distinct parsing. Indeed:
coastal_water.csv uses the format YYYY/MM/DD in the Sampling date column and HH:MM in the Sampling time column, and close1F_water.xlsx uses the format YYYY-MM-DD HH:MM:SS.
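As an illustration of why the two layouts need separate handling, the sketch below parses one value of each style with the standard library. The function names `pad_time`, `parse_csv_style` and `parse_xlsx_style` are hypothetical, and defaulting a missing time to midnight mirrors the choice made in this notebook rather than any rule stated by the data provider.

```python
from datetime import datetime

def pad_time(t):
    "Zero-pad 'H:MM' to 'HH:MM:00'; a missing time defaults to midnight."
    if t is None:
        return "00:00:00"
    hour, minute = t.split(":")[:2]
    return f"{hour.zfill(2)}:{minute}:00"

def parse_csv_style(date, time):
    "Parse a 'YYYY/MM/DD' date plus an 'H:MM' or 'HH:MM' time."
    return datetime.strptime(f"{date} {pad_time(time)}", "%Y/%m/%d %H:%M:%S")

def parse_xlsx_style(ts):
    "Parse a 'YYYY-MM-DD HH:MM:SS' timestamp."
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

csv_ts = parse_csv_style("2011/3/21", "9:05")
csv_no_time = parse_csv_style("2011/4/2", None)
xlsx_ts = parse_xlsx_style("2013-08-14 08:17:00")
```

Both helpers return comparable `datetime` objects, which is the property the downstream time encoding relies on.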
def fix_sampling_time(x):
    "Zero-pad the hour and append seconds; missing times default to midnight."
    if pd.isna(x):
        return '00:00:00'
    else:
        hour, minute = x.split(':')[:2]
        return f"{hour if len(hour) == 2 else '0' + hour}:{minute}:00"

def get_coastal_water_df(fname_coastal_water):
    "Get the measurements dataframe from the `coastal_water.csv` file."
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water,
                                                 skiprows=0, low_memory=False))
    df = pd.read_csv(fname_coastal_water, skiprows=1,
                     nrows=locs_idx - 1,
                     low_memory=False)
    df.dropna(subset=['Sampling point number'], inplace=True)
    df['Sampling time'] = df['Sampling time'].map(fix_sampling_time)
    # Use the string accessor so that '-' is replaced inside each date string
    df['TIME'] = df['Sampling date'].str.replace('-', '/', regex=False) + ' ' + df['Sampling time']
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df

df_coastal_water = get_coastal_water_df(fname_coastal_water)
df_coastal_water.tail()

| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 54Mn radioactivity concentration (Bq/L) | 54Mn detection limit (Bq/L) | 3H radioactivity concentration (Bq/L) | 3H detection limit (Bq/L) | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29219 | T-D5 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | ND | 6.2E+00 | NaN | NaN | NaN | NaN | NaN | 2025/7/17 07:56:00 |
| 29220 | T-S8 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | ND | 6.8E+00 | NaN | NaN | NaN | NaN | NaN | 2025/7/18 05:34:00 |
| 29221 | T-D5 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | ND | 7.9E+00 | NaN | NaN | NaN | NaN | NaN | 2025/7/21 08:05:00 |
| 29222 | T-S3 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | ND | 7.3E+00 | NaN | NaN | NaN | NaN | NaN | 2025/7/22 05:54:00 |
| 29223 | T-S4 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | ND | 7.4E+00 | NaN | NaN | NaN | NaN | NaN | 2025/7/22 06:17:00 |
5 rows × 49 columns
coi = [o for o in df_coastal_water.columns if "134Cs" in o]
df_coastal_water[coi + ['Sampling point number', 'TIME']].head(30)

| | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | Sampling point number | TIME |
|---|---|---|---|---|
| 0 | 4.8E+01 | 9.2E+00 | T-3 | 2011/3/21 23:15:00 |
| 1 | 3.1E+01 | 8.7E+00 | T-4 | 2011/3/21 23:45:00 |
| 2 | 4.6E+01 | 1.4E+01 | T-3 | 2011/3/22 14:28:00 |
| 3 | 3.9E+01 | 1.1E+01 | T-4 | 2011/3/22 15:06:00 |
| 4 | 5.1E+01 | 2.0E+01 | T-3 | 2011/3/23 13:51:00 |
| 5 | 3.3E+01 | 2.1E+01 | T-4 | 2011/3/23 14:25:00 |
| 6 | 9.9E+01 | 3.8E+01 | T-3 | 2011/3/24 09:30:00 |
| 7 | 3.5E+01 | 7.0E+00 | T-4 | 2011/3/24 08:45:00 |
| 8 | 2.6E+01 | 7.4E+00 | T-3 | 2011/3/25 10:00:00 |
| 9 | 2.0E+01 | 6.7E+00 | T-4 | 2011/3/25 09:10:00 |
| 10 | 2.6E+01 | 1.8E+01 | T-3 | 2011/3/26 15:15:00 |
| 11 | 1.3E+01 | 7.1E+00 | T-4 | 2011/3/26 15:50:00 |
| 12 | 5.4E+02 | 1.2E+01 | T-3 | 2011/3/27 14:30:00 |
| 13 | 2.0E+01 | 6.0E+00 | T-4 | 2011/3/27 08:45:00 |
| 14 | 6.1E+02 | 2.3E+01 | T-3 | 2011/3/28 09:35:00 |
| 15 | 3.3E+02 | 2.1E+01 | T-4 | 2011/3/28 08:45:00 |
| 16 | 3.2E+02 | 1.3E+01 | T-3 | 2011/3/29 10:15:00 |
| 17 | 2.3E+02 | 1.2E+01 | T-4 | 2011/3/29 09:20:00 |
| 18 | 3.6E+02 | 2.0E+01 | T-3 | 2011/3/30 10:00:00 |
| 19 | 1.8E+02 | 2.0E+01 | T-4 | 2011/3/30 09:05:00 |
| 20 | 3.6E+02 | 2.1E+01 | T-3 | 2011/3/31 10:00:00 |
| 21 | 1.6E+02 | 2.0E+01 | T-4 | 2011/3/31 09:15:00 |
| 22 | 3.0E+02 | 1.8E+01 | T-3 | 2011/4/1 09:50:00 |
| 23 | 2.0E+02 | 1.8E+01 | T-4 | 2011/4/1 09:00:00 |
| 24 | 1.9E+01 | 1.5E+01 | 8 | 2011/4/2 13:35:00 |
| 25 | 1.7E+02 | 1.7E+01 | T-3 | 2011/4/2 09:55:00 |
| 26 | 5.1E+01 | 1.7E+01 | T-4 | 2011/4/2 09:00:00 |
| 27 | 2.3E+01 | 4.9E+00 | T-5 | 2011/4/2 14:03:00 |
| 28 | NaN | NaN | T-7 | 2011/4/2 13:12:00 |
| 29 | NaN | NaN | 8 | 2011/4/3 12:20:00 |
coi

['134Cs radioactivity concentration (Bq/L)', '134Cs detection limit (Bq/L)']

df_coastal_water.dropna(subset=coi, how='any')[coi + ['Sampling point number', 'TIME']]

| | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | Sampling point number | TIME |
|---|---|---|---|---|
| 0 | 4.8E+01 | 9.2E+00 | T-3 | 2011/3/21 23:15:00 |
| 1 | 3.1E+01 | 8.7E+00 | T-4 | 2011/3/21 23:45:00 |
| 2 | 4.6E+01 | 1.4E+01 | T-3 | 2011/3/22 14:28:00 |
| 3 | 3.9E+01 | 1.1E+01 | T-4 | 2011/3/22 15:06:00 |
| 4 | 5.1E+01 | 2.0E+01 | T-3 | 2011/3/23 13:51:00 |
| ... | ... | ... | ... | ... |
| 29209 | ND | 1.1E-03 | T-11 | 2025/6/27 09:41:00 |
| 29210 | ND | 1.3E-03 | T-5 | 2025/6/27 08:09:00 |
| 29211 | ND | 1.2E-03 | T-5 | 2025/6/27 08:09:00 |
| 29212 | ND | 1.1E-03 | T-D9 | 2025/6/27 09:03:00 |
| 29213 | ND | 1.0E-03 | T-D9 | 2025/6/27 09:03:00 |
19128 rows × 4 columns
mask = df_coastal_water['134Cs radioactivity concentration (Bq/L)'] == 'ND'
df_coastal_water[mask][coi + ['Sampling point number', 'TIME']]

| | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | Sampling point number | TIME |
|---|---|---|---|---|
| 53 | ND | NaN | 5 | 2011/4/6 11:30:00 |
| 57 | ND | NaN | 8 | 2011/4/6 12:52:00 |
| 59 | ND | NaN | 10 | 2011/4/6 13:37:00 |
| 64 | ND | NaN | T-7 | 2011/4/6 12:44:00 |
| 65 | ND | NaN | T-7 | 2011/4/6 13:15:00 |
| ... | ... | ... | ... | ... |
| 29209 | ND | 1.1E-03 | T-11 | 2025/6/27 09:41:00 |
| 29210 | ND | 1.3E-03 | T-5 | 2025/6/27 08:09:00 |
| 29211 | ND | 1.2E-03 | T-5 | 2025/6/27 08:09:00 |
| 29212 | ND | 1.1E-03 | T-D9 | 2025/6/27 09:03:00 |
| 29213 | ND | 1.0E-03 | T-D9 | 2025/6/27 09:03:00 |
19215 rows × 4 columns
len(df_coastal_water)

29224
Identifying the station locations requires three distinct files:
- the second section of the coastal_water.csv file
- the R6zahyo.pdf file, further processed by https://github.com/RML-IAEA/iaea.orbs
- the second sections of all sheets of the close1F_water.xlsx file

All of these files and sheets are required to look up the location of the stations.
def get_locs_coastal_water(fname_coastal_water):
    "Get the locations dataframe from the second section of the `coastal_water.csv` file."
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water,
                                                 skiprows=0, low_memory=False))
    df = pd.read_csv(fname_coastal_water, skiprows=locs_idx+1,
                     low_memory=False).iloc[:, :3]
    # The second column holds latitudes (~37°N), the third longitudes (~141°E)
    df.columns = ['STATION', 'LAT', 'LON']
    df.dropna(subset=['LAT'], inplace=True)
    df = df[['STATION', 'LON', 'LAT']].copy()
    df['org'] = 'coastal_seawater.csv'
    return df

df_locs_coastal_water = get_locs_coastal_water(fname_coastal_water)
print(f'Nb. of stations: {len(df_locs_coastal_water)}')
df_locs_coastal_water.head()

Nb. of stations: 48

| | STATION | LON | LAT | org |
|---|---|---|---|---|
| 0 | T-0 | 141.04 | 37.42 | coastal_seawater.csv |
| 1 | T-11 | 141.05 | 37.24 | coastal_seawater.csv |
| 2 | T-12 | 141.04 | 37.15 | coastal_seawater.csv |
| 3 | T-13-1 | 141.04 | 37.64 | coastal_seawater.csv |
| 4 | T-14 | 141.06 | 37.55 | coastal_seawater.csv |
df_locs_coastal_water.STATION.unique()

array(['T-0', 'T-11', 'T-12', 'T-13-1', 'T-14', 'T-17-1', 'T-18', 'T-20',
'T-22', 'T-3', 'T-4', 'T-4-1', 'T-4-2', 'T-5', 'T-6', 'T-7', 'T-A',
'T-B', 'T-B1', 'T-B2', 'T-B3', 'T-B4', 'T-C', 'T-D', 'T-D1',
'T-D5', 'T-D9', 'T-E', 'T-E1', 'T-Z', 'T-MG6', 'T-S1', 'T-S7',
'T-H1', 'T-S2', 'T-S6', 'T-M10', 'T-MA', 'T-S3', 'T-S4', 'T-S8',
'T-MG4', 'T-G4', 'T-MG5', 'T-MG1', 'T-MG0', 'T-MG3', 'T-MG2'],
dtype=object)
The data in the close1F_water.xlsx file are spread across several sheets (one per station). Each sheet contains two sections: the measurements and the locations.
For each sheet, we identify the line number that separates the measurements from the locations. We then iterate over all sheets and concatenate the results.
def get_clos1F_df(fname_clos1F):
    "Get measurements dataframe from close1F_water.xlsx file and parse datetime."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file,
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file,
                           sheet_name=sheet_name,
                           skiprows=1,
                           nrows=locs_idx-1)
        df.dropna(subset=['Sampling point number'], inplace=True)
        # Keep the date part only and switch to the YYYY/MM/DD layout
        df['Sampling date'] = (df['Sampling date']
                               .astype(str)
                               .apply(lambda x: x.split(' ')[0].replace('-', '/')))
        dfs[sheet_name] = df
    df = pd.concat(dfs.values(), ignore_index=True)
    df.dropna(subset=['Sampling date'], inplace=True)
    df['TIME'] = df['Sampling date'] + ' ' + df['Sampling time'].astype(str)
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df

df_clos1F = get_clos1F_df(fname_clos1F)
df_clos1F.head()
| | Sampling point number | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | Total beta radioactivity concentration (Bq/L) | Total beta detection limit (Bq/L) | 3H radioactivity concentration (Bq/L) | 3H detection limit (Bq/L) | Collection layer of seawater | ... | 106Ru detection limit (Bq/L) | 60Co radioactivity concentration (Bq/L) | 60Co detection limit (Bq/L) | 95Zr radioactivity concentration (Bq/L) | 95Zr detection limit (Bq/L) | 99Mo radioactivity concentration (Bq/L) | 99Mo detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-0-1 | ND | 1.5 | ND | 1.4 | ND | 18.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/14 08:17:00 |
| 1 | T-0-1 | NaN | NaN | NaN | NaN | NaN | NaN | 4.7 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/14 08:17:00 |
| 2 | T-0-1 | ND | 1.1 | ND | 1.4 | ND | 20.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/21 08:09:00 |
| 3 | T-0-1 | NaN | NaN | NaN | NaN | NaN | NaN | ND | 2.9 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/21 08:09:00 |
| 4 | T-0-1 | ND | 0.66 | ND | 0.49 | ND | 17.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/27 08:14:00 |
5 rows × 57 columns
df_clos1F['Sampling point number'].unique()

array(['T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)
def get_locs_clos1F(fname_clos1F):
    "Get locations dataframe from the second section of each sheet of the close1F_water.xlsx file."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file,
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file,
                           sheet_name=sheet_name,
                           skiprows=locs_idx+2)
        dfs[sheet_name] = df
    df = pd.concat(dfs.values(), ignore_index=True).iloc[:, :3]
    df.dropna(subset=['Sampling coordinate North latitude (Decimal)'], inplace=True)
    # The second column holds latitudes (~37°N), the third longitudes (~141°E)
    df.columns = ['STATION', 'LAT', 'LON']
    df = df[['STATION', 'LON', 'LAT']].copy()
    df['org'] = 'close1F.csv'
    return df

df_locs_clos1F = get_locs_clos1F(fname_clos1F)
print(f'Nb. of stations: {len(df_locs_clos1F)}')
df_locs_clos1F.head()

Nb. of stations: 11

| | STATION | LON | LAT | org |
|---|---|---|---|---|
| 0 | T-0-1 | 141.04 | 37.43 | close1F.csv |
| 11 | T-0-1A | 141.05 | 37.43 | close1F.csv |
| 22 | T-0-2 | 141.05 | 37.42 | close1F.csv |
| 33 | T-0-3 | 141.04 | 37.42 | close1F.csv |
| 44 | T-0-3A | 141.05 | 37.42 | close1F.csv |
The close1F_water.xlsx file contains station locations that are not present in the coastal_water.csv dataset, as demonstrated in the comparison below:
set(df_locs_clos1F.STATION) - set(df_locs_coastal_water.STATION)

{'T-0-1',
'T-0-1A',
'T-0-2',
'T-0-3',
'T-0-3A',
'T-1',
'T-2',
'T-2-1',
'T-A1',
'T-A2',
'T-A3'}
In theory all locations are supposed to be provided in the R6zahyo.pdf file. This file is further processed by https://github.com/RML-IAEA/iaea.orbs and the result is provided in the station_points.csv file.
However, this file lacks complete coverage of locations referenced in both coastal_water.csv and close1F_water.xlsx files, while simultaneously containing additional locations not present in either (see below). A more standardized and comprehensive location reference system would significantly improve the efficiency and reliability of the data ingestion process.
def get_locs_orbs(fname_iaea_orbs):
    "Get the locations dataframe from the iaea.orbs `station_points.csv` file."
    df = pd.read_csv(fname_iaea_orbs)
    df.columns = ['org', 'STATION', 'LON', 'LAT']
    return df

df_locs_orbs = get_locs_orbs(fname_iaea_orbs)
df_locs_orbs.head()

| | org | STATION | LON | LAT |
|---|---|---|---|---|
| 0 | MOE | E-31 | 141.727667 | 39.059167 |
| 1 | MOE | E-32 | 141.635667 | 38.996000 |
| 2 | MOE | E-37 | 141.948611 | 39.259167 |
| 3 | MOE | E-38 | 141.755000 | 39.008333 |
| 4 | MOE | E-39 | 141.766667 | 38.991667 |
set(df_locs_orbs.STATION) - (set(df_locs_clos1F.STATION) | set(df_locs_coastal_water.STATION))

{'C-P1',
'C-P2',
'C-P3',
'C-P4',
'C-P5',
'C-P8',
'E-31',
'E-32',
'E-37',
'E-38',
'E-39',
'E-3A',
'E-41',
'E-42',
'E-43',
'E-44',
'E-45',
'E-46',
'E-47',
'E-48',
'E-49',
'E-4A',
'E-4B',
'E-4C',
'E-4F',
'E-4G',
'E-4H',
'E-4J',
'E-4K',
'E-4L',
'E-4M',
'E-71',
'E-72',
'E-73',
'E-74',
'E-75',
'E-76',
'E-77',
'E-78',
'E-79',
'E-7A',
'E-7B',
'E-7C',
'E-7D',
'E-7F',
'E-7G',
'E-7H',
'E-7I',
'E-7J',
'E-7K',
'E-7L',
'E-81',
'E-82',
'E-83',
'E-84',
'E-85',
'E-S1',
'E-S10',
'E-S13',
'E-S14',
'E-S15',
'E-S17',
'E-S18',
'E-S19',
'E-S20',
'E-S21',
'E-S22',
'E-S23',
'E-S24',
'E-S25',
'E-S26',
'E-S27',
'E-S28',
'E-S29',
'E-S3',
'E-S30',
'E-S31',
'E-S32',
'E-S33',
'E-S34',
'E-S35',
'E-S36',
'E-S4',
'E-S5',
'E-T1',
'E-T2',
'E-T3',
'E-T4',
'E-T5',
'E-T6',
'E-T7',
'E-T8',
'F-P01',
'F-P02',
'F-P03',
'F-P04',
'F-P05',
'F-P06',
'F-P07',
'F-P08',
'F-P09',
'F-P10',
'F-P11',
'F-P12',
'F-P13',
'F-P14',
'F-P15',
'F-P29',
'F-P30',
'F-P31',
'F-P32',
'F-P33',
'F-P34',
'F-P35',
'F-P37',
'F-P38',
'F-P39',
'F-P40',
'F-P41',
'F-P42',
'F-P43',
'F-P45',
'F-P46',
'F-P47',
'F-P48',
'F-P49',
'F-P50',
'F-P51',
'F-P52',
'F-P53',
'F-P54',
'F-P55',
'F-P56',
'F-P57',
'F-P58',
'F-P59',
'F-P60',
'F-P61',
'F-P62',
'F-P63',
'F-P64',
'F-P65',
'F-P66',
'F-P67',
'F-P68',
'F-P69',
'F-P70',
'F-P71',
'F-P72',
'F-P73',
'F-P74',
'F-P75',
'F-P76',
'F-P77',
'F-P78',
'F-P79',
'F-P80',
'F-P81',
'F-P82',
'F-P83',
'K-T1',
'K-T2',
'KK-U1',
'M-10',
'M-101',
'M-102',
'M-103',
'M-104',
'M-11',
'M-14',
'M-15',
'M-19',
'M-20',
'M-21',
'M-25',
'M-26',
'M-27',
'M-A1',
'M-A3',
'M-B1',
'M-B5',
'M-C1',
'M-C10',
'M-C2',
'M-C3',
'M-C4',
'M-C6',
'M-C7',
'M-C8',
'M-C9',
'M-D1',
'M-D3',
'M-E1',
'M-E3',
'M-E5',
'M-F1',
'M-F3',
'M-G0',
'M-G1',
'M-G3',
'M-G4',
'M-H1',
'M-H3',
'M-I0',
'M-I1',
'M-I3',
'M-IB2',
'M-IB4',
'M-J1',
'M-J3',
'M-K1',
'M-L1',
'M-L3',
'M-M1',
'M-MI4',
'T-S5',
'T-①',
'T-②',
'T-③',
'T-④',
'T-⑤',
'T-⑥',
'T-⑦',
'T-⑧',
'T-⑨',
'T-⑩',
'T-⑪',
'T-⑫',
'T-⑬'}
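The coverage gaps discussed above can also be checked in both directions with a single outer merge. The sketch below is one possible approach on synthetic data; the station values are illustrative and the variable names `no_location` and `unused_location` are hypothetical. It relies on the `indicator=True` option of pandas `merge`, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

# Synthetic measurements and locations (illustrative values only).
meas = pd.DataFrame({"Sampling point number": ["T-1", "T-1", "T-X"]})
locs = pd.DataFrame({
    "STATION": ["T-1", "E-31"],
    "LON": [141.03, 141.73],
    "LAT": [37.43, 39.06],
})

# Outer merge on the distinct station names; `_merge` flags the origin of each row.
check = meas.drop_duplicates().merge(
    locs, how="outer",
    left_on="Sampling point number", right_on="STATION",
    indicator=True,
)

# Stations measured but without coordinates, and coordinates never used.
no_location = sorted(check.loc[check["_merge"] == "left_only", "Sampling point number"])
unused_location = sorted(check.loc[check["_merge"] == "right_only", "STATION"])
```

Listing `no_location` before georeferencing shows which measurements an inner join would silently drop.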
def concat_locs(dfs):
    "Concatenate locations and drop duplicated stations, keeping iaea.orbs entries first, then coastal_seawater.csv, then close1F.csv."
    df = pd.concat(dfs)
    # Rank the sources for sorting: iaea.orbs (0), coastal_seawater.csv (1), close1F.csv (2)
    df['org_grp'] = df['org'].apply(
        lambda x: 1 if x == 'coastal_seawater.csv' else 2 if x == 'close1F.csv' else 0)
    df.sort_values('org_grp', ascending=True, inplace=True)
    # Drop duplicated stations, keeping the best-ranked source (iaea.orbs first)
    df.drop_duplicates(subset='STATION', keep='first', inplace=True)
    df.drop(columns=['org_grp'], inplace=True)
    df.sort_values('STATION', ascending=True, inplace=True)
    return df

df_locs = concat_locs([df_locs_clos1F, df_locs_coastal_water, df_locs_orbs])
df_locs.head()

| | STATION | LON | LAT | org |
|---|---|---|---|---|
| 214 | C-P1 | 139.863333 | 35.425000 | NRA |
| 215 | C-P2 | 139.863333 | 35.401667 | NRA |
| 216 | C-P3 | 139.881667 | 35.370000 | NRA |
| 217 | C-P4 | 139.846667 | 35.356667 | NRA |
| 218 | C-P5 | 139.800000 | 35.343333 | NRA |
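The source-priority rule used when dropping duplicated stations can be illustrated on a toy table. This is a sketch under stated assumptions: the coordinates are made up, the `PRIORITY` mapping and the `_rank` column are hypothetical names, and any source not listed in the mapping (here an iaea.orbs label such as 'TEPCO') is assumed to rank first.

```python
import pandas as pd

# Synthetic location records (illustrative values only).
locs = pd.DataFrame({
    "STATION": ["T-1", "T-1", "T-2", "T-2"],
    "org": ["close1F.csv", "TEPCO", "coastal_seawater.csv", "close1F.csv"],
    "LON": [141.03, 141.034, 141.05, 141.05],
    "LAT": [37.43, 37.431, 37.42, 37.42],
})

# Lower rank wins; sources absent from the mapping get rank 0.
PRIORITY = {"coastal_seawater.csv": 1, "close1F.csv": 2}

deduped = (
    locs.assign(_rank=locs["org"].map(PRIORITY).fillna(0))
    .sort_values("_rank", kind="stable")              # stable sort keeps ties in input order
    .drop_duplicates(subset="STATION", keep="first")  # best-ranked source per station
    .drop(columns="_rank")
    .sort_values("STATION")
    .reset_index(drop=True)
)
```

Keeping the ranking in a dictionary makes the precedence explicit and easy to change if another source should be trusted first.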
def align_dfs(df_from, df_to):
    "Align columns structure of df_from to df_to."
    df = {}
    for c in df_to.columns:
        # Reuse the column when present in df_from, otherwise fill it with NaN
        df[c] = df_from[c].values if c in df_from.columns else np.nan
    return pd.DataFrame(df)

align_dfs(df_clos1F, df_coastal_water).head()

| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 54Mn radioactivity concentration (Bq/L) | 54Mn detection limit (Bq/L) | 3H radioactivity concentration (Bq/L) | 3H detection limit (Bq/L) | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-0-1 | NaN | NaN | NaN | ND | 1.5 | ND | 1.4 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/14 08:17:00 |
| 1 | T-0-1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 4.7 | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/14 08:17:00 |
| 2 | T-0-1 | NaN | NaN | NaN | ND | 1.1 | ND | 1.4 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/21 08:09:00 |
| 3 | T-0-1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | ND | 2.9 | NaN | NaN | NaN | NaN | NaN | 2013/08/21 08:09:00 |
| 4 | T-0-1 | NaN | NaN | NaN | ND | 0.66 | ND | 0.49 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013/08/27 08:14:00 |
5 rows × 49 columns
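The column alignment shown above can also be expressed with pandas' built-in `reindex`. The following is a small sketch on synthetic frames, not the handler's code; the column names `a`, `b` and `c` are placeholders.

```python
import pandas as pd

# Target layout and a source frame with a different set of columns.
df_to = pd.DataFrame({"a": [1], "b": [2], "TIME": ["t0"]})
df_from = pd.DataFrame({"b": [20, 30], "c": [0, 0], "TIME": ["t1", "t2"]})

# Keep the columns of df_to, in that order; columns missing from df_from become NaN,
# and columns only present in df_from (here 'c') are dropped.
aligned = df_from.reindex(columns=df_to.columns)
```

One difference worth noting: `reindex` preserves the row index of `df_from`, whereas building the frame from `.values` starts from a fresh integer index.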
def concat_dfs(df_coastal_water, df_clos1F):
    "Align close1F_water.xlsx measurements to the coastal_water.csv columns and concatenate both."
    df_clos1F = align_dfs(df_clos1F, df_coastal_water)
    df = pd.concat([df_coastal_water, df_clos1F])
    return df

df_meas = concat_dfs(df_coastal_water, df_clos1F)
df_meas.head()

| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 54Mn radioactivity concentration (Bq/L) | 54Mn detection limit (Bq/L) | 3H radioactivity concentration (Bq/L) | 3H detection limit (Bq/L) | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-3 | NaN | 1.1E+03 | 1.3E+01 | 4.8E+01 | 9.2E+00 | 5.3E+01 | 8.8E+00 | 1.6E+02 | 44.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:15:00 |
| 1 | T-4 | NaN | 6.6E+02 | 1.2E+01 | 3.1E+01 | 8.7E+00 | 3.3E+01 | 8.3E+00 | 1.2E+02 | 41.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:45:00 |
| 2 | T-3 | NaN | 1.1E+03 | 2.0E+01 | 4.6E+01 | 1.4E+01 | 4.0E+01 | 1.4E+01 | ND | 88.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011/3/22 14:28:00 |
| 3 | T-4 | NaN | 6.7E+02 | 1.9E+01 | 3.9E+01 | 1.1E+01 | 4.4E+01 | 1.1E+01 | ND | 79.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2011/3/22 15:06:00 |
| 4 | T-3 | NaN | 7.4E+02 | 2.7E+01 | 5.1E+01 | 2.0E+01 | 5.5E+01 | 2.0E+01 | 2.0E+02 | 58.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 34.0 | 25.0 | NaN | 2011/3/23 13:51:00 |
5 rows × 49 columns
def georef_data(df_meas, df_locs):
    "Georeference measurements dataframe using locations dataframe."
    assert "Sampling point number" in df_meas.columns and "STATION" in df_locs.columns
    return pd.merge(df_meas, df_locs, how="inner",
                    left_on='Sampling point number', right_on='STATION')

df_meas_georef = georef_data(df_meas, df_locs)
df_meas_georef.head()

| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME | STATION | LON | LAT | org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-3 | NaN | 1.1E+03 | 1.3E+01 | 4.8E+01 | 9.2E+00 | 5.3E+01 | 8.8E+00 | 1.6E+02 | 44.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:15:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
| 1 | T-4 | NaN | 6.6E+02 | 1.2E+01 | 3.1E+01 | 8.7E+00 | 3.3E+01 | 8.3E+00 | 1.2E+02 | 41.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:45:00 | T-4 | 141.013889 | 37.241667 | TEPCO |
| 2 | T-3 | NaN | 1.1E+03 | 2.0E+01 | 4.6E+01 | 1.4E+01 | 4.0E+01 | 1.4E+01 | ND | 88.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/22 14:28:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
| 3 | T-4 | NaN | 6.7E+02 | 1.9E+01 | 3.9E+01 | 1.1E+01 | 4.4E+01 | 1.1E+01 | ND | 79.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/22 15:06:00 | T-4 | 141.013889 | 37.241667 | TEPCO |
| 4 | T-3 | NaN | 7.4E+02 | 2.7E+01 | 5.1E+01 | 2.0E+01 | 5.5E+01 | 2.0E+01 | 2.0E+02 | 58.0 | ... | NaN | NaN | 34.0 | 25.0 | NaN | 2011/3/23 13:51:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
5 rows × 53 columns
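Because the georeferencing step is an inner join, measurements whose station has no coordinates disappear without notice, and a duplicated station in the locations table would multiply rows. The sketch below shows one way to guard against both on synthetic data; the values are illustrative, and using `validate="many_to_one"` is a suggestion rather than something the handler does.

```python
import pandas as pd

# Synthetic measurements and locations (illustrative values only).
meas = pd.DataFrame({
    "Sampling point number": ["T-1", "T-1", "T-X"],
    "value": [1.0, 2.0, 3.0],
})
locs = pd.DataFrame({"STATION": ["T-1"], "LON": [141.03], "LAT": [37.43]})

# many_to_one: each measurement must match at most one location row.
georef = pd.merge(
    meas, locs, how="inner",
    left_on="Sampling point number", right_on="STATION",
    validate="many_to_one",
)

# Number of measurements lost because their station has no coordinates.
n_dropped = len(meas) - len(georef)

# A duplicated station in the locations table is rejected instead of duplicating rows.
try:
    pd.merge(meas, pd.concat([locs, locs]), how="inner",
             left_on="Sampling point number", right_on="STATION",
             validate="many_to_one")
    duplicate_rejected = False
except pd.errors.MergeError:
    duplicate_rejected = True
```

Reporting `n_dropped` after the merge gives a quick signal when a new station appears in the source files before its coordinates are published.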
def load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs):
    "Load, align and georeference TEPCO data"
    df_locs = concat_locs(
        [get_locs_coastal_water(fname_coastal_water),
         get_locs_clos1F(fname_clos1F),
         get_locs_orbs(fname_iaea_orbs)])
    df_meas = concat_dfs(get_coastal_water_df(fname_coastal_water), get_clos1F_df(fname_clos1F))
    df_meas.dropna(subset=['Sampling point number'], inplace=True)
    return {'SEAWATER': georef_data(df_meas, df_locs)}

dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
dfs['SEAWATER'].head()
| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME | STATION | LON | LAT | org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-3 | NaN | 1.1E+03 | 1.3E+01 | 4.8E+01 | 9.2E+00 | 5.3E+01 | 8.8E+00 | 1.6E+02 | 44.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:15:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
| 1 | T-4 | NaN | 6.6E+02 | 1.2E+01 | 3.1E+01 | 8.7E+00 | 3.3E+01 | 8.3E+00 | 1.2E+02 | 41.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:45:00 | T-4 | 141.013889 | 37.241667 | TEPCO |
| 2 | T-3 | NaN | 1.1E+03 | 2.0E+01 | 4.6E+01 | 1.4E+01 | 4.0E+01 | 1.4E+01 | ND | 88.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/22 14:28:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
| 3 | T-4 | NaN | 6.7E+02 | 1.9E+01 | 3.9E+01 | 1.1E+01 | 4.4E+01 | 1.1E+01 | ND | 79.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/22 15:06:00 | T-4 | 141.013889 | 37.241667 | TEPCO |
| 4 | T-3 | NaN | 7.4E+02 | 2.7E+01 | 5.1E+01 | 2.0E+01 | 5.5E+01 | 2.0E+01 | 2.0E+02 | 58.0 | ... | NaN | NaN | 34.0 | 25.0 | NaN | 2011/3/23 13:51:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
5 rows × 53 columns
print(f"# of rows, cols: {dfs['SEAWATER'].shape}")
dfs['SEAWATER'].head()

# of rows, cols: (49863, 53)

| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME | STATION | LON | LAT | org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-3 | NaN | 1.1E+03 | 1.3E+01 | 4.8E+01 | 9.2E+00 | 5.3E+01 | 8.8E+00 | 1.6E+02 | 44.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:15:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
| 1 | T-4 | NaN | 6.6E+02 | 1.2E+01 | 3.1E+01 | 8.7E+00 | 3.3E+01 | 8.3E+00 | 1.2E+02 | 41.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/21 23:45:00 | T-4 | 141.013889 | 37.241667 | TEPCO |
| 2 | T-3 | NaN | 1.1E+03 | 2.0E+01 | 4.6E+01 | 1.4E+01 | 4.0E+01 | 1.4E+01 | ND | 88.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/22 14:28:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
| 3 | T-4 | NaN | 6.7E+02 | 1.9E+01 | 3.9E+01 | 1.1E+01 | 4.4E+01 | 1.1E+01 | ND | 79.0 | ... | NaN | NaN | NaN | NaN | NaN | 2011/3/22 15:06:00 | T-4 | 141.013889 | 37.241667 | TEPCO |
| 4 | T-3 | NaN | 7.4E+02 | 2.7E+01 | 5.1E+01 | 2.0E+01 | 5.5E+01 | 2.0E+01 | 2.0E+02 | 58.0 | ... | NaN | NaN | 34.0 | 25.0 | NaN | 2011/3/23 13:51:00 | T-3 | 141.026389 | 37.322222 | TEPCO |
5 rows × 53 columns
dfs['SEAWATER'].STATION.unique()

array(['T-3', 'T-4', 'T-5', 'T-7', 'T-11', 'T-12', 'T-14', 'T-18', 'T-20',
'T-22', 'T-MA', 'T-M10', 'T-A', 'T-D', 'T-E', 'T-B', 'T-C',
'T-MG1', 'T-MG2', 'T-MG3', 'T-MG4', 'T-MG5', 'T-MG6', 'T-D1',
'T-D5', 'T-D9', 'T-E1', 'T-G4', 'T-H1', 'T-S5', 'T-S6', 'T-17-1',
'T-B3', 'T-13-1', 'T-S3', 'T-S4', 'T-B4', 'T-S1', 'T-S2', 'T-MG0',
'T-Z', 'T-B1', 'T-B2', 'T-S7', 'T-S8', 'T-0', 'T-4-1', 'T-4-2',
'T-6', 'T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)
np.sum(dfs['SEAWATER'] == "ND")

Sampling point number 0
Collection layer of seawater 0
131I radioactivity concentration (Bq/L) 8642
131I detection limit (Bq/L) 0
134Cs radioactivity concentration (Bq/L) 30967
134Cs detection limit (Bq/L) 0
137Cs radioactivity concentration (Bq/L) 17232
137Cs detection limit (Bq/L) 0
132I radioactivity concentration (Bq/L) 3
132I detection limit (Bq/L) 0
132Te radioactivity concentration (Bq/L) 0
132Te detection limit (Bq/L) 0
136Cs radioactivity concentration (Bq/L) 2
136Cs detection limit (Bq/L) 0
140La radioactivity concentration (Bq/L) 0
140La detection limit (Bq/L) 0
89Sr radioactivity concentration (Bq/L) 101
89Sr detection limit (Bq/L) 0
90Sr radioactivity concentration (Bq/L) 344
90Sr detection limit (Bq/L) 0
238Pu radioactivity concentration (Bq/L) 309
238Pu detection limit (Bq/L) 0
239Pu+240Pu radioactivity concentration (Bq/L) 231
239Pu+240Pu statistical error (Bq/L) 0
239Pu+240Pu detection limit (Bq/L) 0
Total alpha radioactivity concentration (Bq/L) 983
Total alpha detection limit (Bq/L) 0
Total beta radioactivity concentration (Bq/L) 4919
Total beta detection limit (Bq/L) 0
140Ba radioactivity concentration (Bq/L) 0
140Ba detection limit (Bq/L) 0
106Ru radioactivity concentration (Bq/L) 0
106Ru detection limit (Bq/L) 0
58Co radioactivity concentration (Bq/L) 3
58Co detection limit (Bq/L) 0
60Co radioactivity concentration (Bq/L) 9
60Co detection limit (Bq/L) 0
144Ce radioactivity concentration (Bq/L) 9
144Ce detection limit (Bq/L) 0
54Mn radioactivity concentration (Bq/L) 9
54Mn detection limit (Bq/L) 0
3H radioactivity concentration (Bq/L) 9657
3H detection limit (Bq/L) 0
125Sb radioactivity concentration (Bq/L) 647
125Sb detection limit (Bq/L) 0
105Ru radioactivity concentration (Bq/L) 0
105Ru detection limit (Bq/L) 0
Unnamed: 49 0
TIME 0
STATION 0
LON 0
LAT 0
org 0
dtype: int64
dfs['SEAWATER'][['TIME', '134Cs radioactivity concentration (Bq/L)', '134Cs detection limit (Bq/L)']]

| | TIME | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) |
|---|---|---|---|
| 0 | 2011/3/21 23:15:00 | 4.8E+01 | 9.2E+00 |
| 1 | 2011/3/21 23:45:00 | 3.1E+01 | 8.7E+00 |
| 2 | 2011/3/22 14:28:00 | 4.6E+01 | 1.4E+01 |
| 3 | 2011/3/22 15:06:00 | 3.9E+01 | 1.1E+01 |
| 4 | 2011/3/23 13:51:00 | 5.1E+01 | 2.0E+01 |
| ... | ... | ... | ... |
| 49858 | 2025/06/30 08:05 | ND | 0.4 |
| 49859 | 2025/07/07 08:36 | ND | 0.37 |
| 49860 | 2025/07/17 08:11 | ND | 0.29 |
| 49861 | 2025/07/21 08:20 | ND | 0.36 |
| 49862 | 2025/07/24 07:39 | NaN | NaN |
49863 rows × 3 columns
Remove 約 (about) character
We systematically strip the 約 ("about") character, which means an approximate value is then treated like any other reported value. Please confirm that this is the correct way to handle it. Reporting an explicit uncertainty instead would be less ambiguous in future.
class RemoveJapanaseCharCB(Callback):
    "Remove 約 (about) char"
    def _transform_if_about(self, value, about_char='約'):
        if pd.isna(value): return value
        return (value.replace(about_char, '') if str(value).count(about_char) != 0
                else value)
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_about)

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB()])
tfm()['SEAWATER'].sample(10)
| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME | STATION | LON | LAT | org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40630 | T-1 | 上層 | NaN | NaN | ND | 0.92 | ND | 0.74 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2024/03/07 06:45:00 | T-1 | 141.034444 | 37.431111 | TEPCO |
| 46841 | T-2-1 | NaN | ND | 1.1 | ND | 1.0 | 2 | NaN | NaN | NaN | ... | ND | NaN | NaN | NaN | NaN | 2013/06/21 07:15:00 | T-2-1 | 37.410000 | 141.030000 | close1F.csv |
| 3793 | T-MG6 | 中層 | NaN | NaN | ND | 1.9E-03 | 4.1E-03 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2012/8/24 11:45:00 | T-MG6 | 141.000000 | 38.083333 | TEPCO |
| 12293 | T-17-1 | 下層 | NaN | NaN | ND | 1.2E-03 | 8.8E-03 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2017/1/11 06:50:00 | T-17-1 | 141.006944 | 37.055556 | TEPCO |
| 48670 | T-A1 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2024/10/14 07:08:00 | T-A1 | 141.050761 | 37.440794 | TEPCO |
| 35018 | T-1 | 上層 | ND | 0.68 | ND | 0.94 | 1.4 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2012/03/12 08:55:00 | T-1 | 141.034444 | 37.431111 | TEPCO |
| 43730 | T-2 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2020/01/27 07:00:00 | T-2 | 141.033611 | 37.415833 | TEPCO |
| 11322 | T-MG4 | 上層 | NaN | NaN | ND | 1.5E-03 | 3.1E-03 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2016/6/7 09:37:00 | T-MG4 | 141.133333 | 38.250000 | TEPCO |
| 2931 | T-E | 上層 | ND | 1.1E+00 | ND | 1.4E+00 | ND | 1.3E+00 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2012/3/21 14:01:00 | T-E | 140.837222 | 35.796111 | TEPCO |
| 3617 | T-MG0 | 中層 | NaN | NaN | 1.1E-03 | NaN | 2.5E-03 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2012/7/26 10:04:00 | T-MG0 | 141.583333 | 38.633333 | TEPCO |
10 rows × 53 columns
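To make the effect of this clean-up concrete, here is a minimal standalone sketch of the same string operation on a few toy values. The helper name `strip_about` and the sample values are illustrative assumptions; they are not taken from the TEPCO files.

```python
def strip_about(value, about_char='約'):
    """Drop the 'about' marker from string values; leave everything else untouched."""
    if isinstance(value, str):
        return value.replace(about_char, '')
    return value

# Hypothetical raw cell values: an approximate value, a non-detect flag,
# a plain numeric string and a missing value.
raw = ['約1.5E+01', 'ND', '0.74', None]
cleaned = [strip_about(v) for v in raw]
print(cleaned)
```

Only the marker is removed; the numeric part stays a string at this stage and is converted later in the pipeline.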
Fix values range string
Value ranges are provided as strings (e.g. ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’). We replace each range by the mean of its two bounds. Please confirm that this is the correct way to handle it. Here again, reporting an explicit uncertainty would be less ambiguous in future.
class FixRangeValueStringCB(Callback):
    "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean"
    def _extract_and_calculate_mean(self, s):
        # Match plain decimals as well as scientific notation (e.g. 1.0 or 4.0E+00)
        float_strings = re.findall(r"[+-]?\d+\.?\d*E?[+-]?\d*", s)
        if float_strings:
            float_numbers = np.array(float_strings, dtype=float)
            return float_numbers.mean()
        return s
    def _transform_if_range(self, value):
        if pd.isna(value):
            return value
        value = str(value)
        # Check for both range patterns
        if '<&<' in value or '~' in value:
            return self._extract_and_calculate_mean(value)
        return value
    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns
                        if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            # tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range).astype(float)
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range)

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB()
])
df_test = tfm()['SEAWATER']
df_test.sample(10)
| | Sampling point number | Collection layer of seawater | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | 132I radioactivity concentration (Bq/L) | 132I detection limit (Bq/L) | ... | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) | Unnamed: 49 | TIME | STATION | LON | LAT | org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12202 | T-14 | 上層 | NaN | NaN | ND | 1.3E-03 | 6.6E-03 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2016/12/13 08:44:00 | T-14 | 141.062500 | 37.552778 | TEPCO |
| 24157 | T-18 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2023/12/25 09:21:00 | T-18 | 140.922222 | 36.905556 | TEPCO |
| 39939 | T-1 | 上層 | NaN | NaN | ND | 0.69 | ND | 0.81 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2022/10/05 07:40:00 | T-1 | 141.034444 | 37.431111 | TEPCO |
| 30050 | T-0-1A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2024/05/25 07:00:00 | T-0-1A | 141.046667 | 37.430556 | TEPCO |
| 34441 | T-0-3A | NaN | NaN | NaN | ND | 0.39 | ND | 0.34 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2024/07/15 07:45:00 | T-0-3A | 141.046667 | 37.416111 | TEPCO |
| 4229 | T-MG5 | 上層 | NaN | NaN | 5.0E-03 | NaN | 8.9E-03 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2012/11/22 09:11:00 | T-MG5 | 141.250000 | 38.166667 | TEPCO |
| 29278 | T-0-1A | NaN | NaN | NaN | ND | 0.81 | ND | 0.65 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2018/04/24 06:55:00 | T-0-1A | 141.046667 | 37.430556 | TEPCO |
| 40055 | T-1 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2023/01/02 07:57:00 | T-1 | 141.034444 | 37.431111 | TEPCO |
| 38436 | T-1 | 上層 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2019/08/19 07:55:00 | T-1 | 141.034444 | 37.431111 | TEPCO |
| 22261 | T-D9 | 上層 | NaN | NaN | ND | 1.2E-03 | 2.0E-03 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2022/11/15 09:01:00 | T-D9 | 141.072167 | 37.333333 | TEPCO |
10 rows × 53 columns
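The random samples above rarely contain a range string, so the following sketch applies the same regular expression to the two example strings directly. It uses only the standard library; the function name `range_to_mean` is a hypothetical stand-in for the callback's private methods, and unlike the callback it leaves non-range values as they are rather than casting them to strings.

```python
import re
from statistics import mean

# Same pattern as in the callback above.
NUMBER = re.compile(r"[+-]?\d+\.?\d*E?[+-]?\d*")

def range_to_mean(value):
    """Return the mean of the bounds for a range string, else the value unchanged."""
    if '<&<' in value or '~' in value:
        bounds = [float(x) for x in NUMBER.findall(value)]
        if bounds:
            return mean(bounds)
    return value

for s in ['4.0E+00<&<8.0E+00', '1.0~2.7', 'ND', '0.74']:
    print(repr(s), '->', range_to_mean(s))
```

Both range notations reduce to a single float, while "ND" and plain numbers pass through untouched.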
Select columns of interest
We select the columns of interest and in particular the elements of interest, in our case radionuclides.
common_coi = ['LON', 'LAT', 'TIME', 'STATION']
nuclides_pattern = '(Bq/L)'

class SelectColsOfInterestCB(Callback):
    "Select columns of interest."
    def __init__(self, common_coi, nuclides_pattern): fc.store_attr()
    def __call__(self, tfm):
        # Use the stored attribute rather than relying on the module-level variable
        nuc_of_interest = [c for c in tfm.dfs['SEAWATER'].columns if self.nuclides_pattern in c]
        tfm.dfs['SEAWATER'] = tfm.dfs['SEAWATER'][self.common_coi + nuc_of_interest]

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern)
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
| | LON | LAT | TIME | STATION | 131I radioactivity concentration (Bq/L) | 131I detection limit (Bq/L) | 134Cs radioactivity concentration (Bq/L) | 134Cs detection limit (Bq/L) | 137Cs radioactivity concentration (Bq/L) | 137Cs detection limit (Bq/L) | ... | 144Ce radioactivity concentration (Bq/L) | 144Ce detection limit (Bq/L) | 54Mn radioactivity concentration (Bq/L) | 54Mn detection limit (Bq/L) | 3H radioactivity concentration (Bq/L) | 3H detection limit (Bq/L) | 125Sb radioactivity concentration (Bq/L) | 125Sb detection limit (Bq/L) | 105Ru radioactivity concentration (Bq/L) | 105Ru detection limit (Bq/L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46500 | 141.033611 | 37.415833 | 2025/05/12 07:50 | T-2 | NaN | NaN | ND | 0.67 | ND | 0.82 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 34387 | 141.046667 | 37.416111 | 2024/03/04 08:22:00 | T-0-3A | NaN | NaN | ND | 0.36 | ND | 0.23 | ... | NaN | NaN | NaN | NaN | ND | 9.0 | NaN | NaN | NaN | NaN |
| 45183 | 141.033611 | 37.415833 | 2022/11/13 08:47:00 | T-2 | NaN | NaN | ND | 0.74 | ND | 0.67 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 44254 | 141.033611 | 37.415833 | 2021/01/28 07:05:00 | T-2 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN |
| 12995 | 141.666667 | 38.300000 | 2017/6/6 08:25:00 | T-MG2 | NaN | NaN | ND | 1.4E-03 | 1.6E-03 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 49 columns
Reshape: wide to long
We reshape the table from wide to long so that the information embedded in the column names (nuclide name, unit, and value type such as uncertainty or detection limit) can be extracted.
class WideToLongCB(Callback):
    """
    Get TEPCO nuclide names as values not column names
    to extract contained information (nuclide name, unc, dl, ...).
    """
    def __init__(self, id_vars=['LON', 'LAT', 'TIME', 'STATION']):
        fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'] = pd.melt(tfm.dfs['SEAWATER'], id_vars=self.id_vars)

#| eval: false
tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB()
])
df_test = tfm()['SEAWATER']
df_test.head()
| | LON | LAT | TIME | STATION | variable | value |
|---|---|---|---|---|---|---|
| 0 | 141.026389 | 37.322222 | 2011/3/21 23:15:00 | T-3 | 131I radioactivity concentration (Bq/L) | 1.1E+03 |
| 1 | 141.013889 | 37.241667 | 2011/3/21 23:45:00 | T-4 | 131I radioactivity concentration (Bq/L) | 6.6E+02 |
| 2 | 141.026389 | 37.322222 | 2011/3/22 14:28:00 | T-3 | 131I radioactivity concentration (Bq/L) | 1.1E+03 |
| 3 | 141.013889 | 37.241667 | 2011/3/22 15:06:00 | T-4 | 131I radioactivity concentration (Bq/L) | 6.7E+02 |
| 4 | 141.026389 | 37.322222 | 2011/3/23 13:51:00 | T-3 | 131I radioactivity concentration (Bq/L) | 7.4E+02 |
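As a smaller illustration of what `pd.melt` does here, the sketch below reshapes a two-row toy table. The values are invented for the example and the table keeps only two identifier columns for brevity.

```python
import pandas as pd

# Toy wide table; values are invented for illustration only.
wide = pd.DataFrame({
    'STATION': ['T-1', 'T-2'],
    'TIME': ['2011/3/21 23:15:00', '2011/3/22 14:28:00'],
    '134Cs radioactivity concentration (Bq/L)': ['4.8E+01', 'ND'],
    '134Cs detection limit (Bq/L)': ['9.2E+00', '0.4'],
})

# Every non-identifier column becomes a (variable, value) pair,
# repeated for each original row.
long = pd.melt(wide, id_vars=['STATION', 'TIME'])
print(long)
```

Two rows and two measurement columns yield four long rows, each carrying the original column name in `variable`, which is what the extraction steps operate on.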
Extract
The nuclide name, detection limit, uncertainty, etc. are extracted from the column names, where the TEPCO data source embeds them.
Nuclide name
def extract_nuclide(text: str) -> str:
    "Extract the nuclide identifier from a measurement variable name using regex."
    pattern = r'^(Total\s+(?:alpha|beta)|[^\s]+)'
    match = re.match(pattern, text, re.IGNORECASE)
    return match.group(1) if match else text

For instance:
print(extract_nuclide("Total alpha radioactivity concentration (Bq/L)"))
print(extract_nuclide("131I radioactivity concentration (Bq/L)"))
Total alpha
131I
class ExtractNuclideNameCB(Callback):
    "Extract nuclide name from TEPCO data."
    def __init__(self, src_col='variable', dest_col='NUCLIDE'): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.dest_col] = tfm.dfs['SEAWATER'][self.src_col].map(extract_nuclide)

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB()
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
| | LON | LAT | TIME | STATION | variable | value | NUCLIDE |
|---|---|---|---|---|---|---|---|
| 854848 | 141.200000 | 37.416667 | 2014/5/20 08:34:00 | T-5 | 90Sr detection limit (Bq/L) | NaN | 90Sr |
| 125473 | 141.026389 | 37.322222 | 2024/10/15 12:40:00 | T-3 | 134Cs radioactivity concentration (Bq/L) | NaN | 134Cs |
| 1684112 | 141.034444 | 37.431111 | 2020/01/15 08:00:00 | T-1 | 60Co radioactivity concentration (Bq/L) | NaN | 60Co |
| 869803 | 140.702222 | 35.987500 | 2022/10/21 13:08:00 | T-D | 90Sr detection limit (Bq/L) | NaN | 90Sr |
| 366571 | 141.250000 | 38.166667 | 2020/2/7 09:22:00 | T-MG5 | 132I detection limit (Bq/L) | NaN | 132I |
Unit
class ExtractUnitCB(Callback):
    "Extract unit from TEPCO data."
    def __init__(self, src_col='variable', dest_col='UNIT'): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.dest_col] = tfm.dfs['SEAWATER'][self.src_col].str.extract(r'\((.*?)\)')

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB()
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
| | LON | LAT | TIME | STATION | variable | value | NUCLIDE | UNIT |
|---|---|---|---|---|---|---|---|---|
| 938328 | 141.034444 | 37.431111 | 2024/07/08 07:51:00 | T-1 | 238Pu radioactivity concentration (Bq/L) | NaN | 238Pu | Bq/L |
| 2221110 | 141.072222 | 37.500000 | 2025/6/2 08:36:00 | T-D1 | 105Ru detection limit (Bq/L) | NaN | 105Ru | Bq/L |
| 1951908 | 141.078889 | 37.458333 | 2014/5/29 06:07:00 | T-S3 | 3H radioactivity concentration (Bq/L) | NaN | 3H | Bq/L |
| 1077687 | 141.046667 | 37.423333 | 2016/02/01 08:16 | T-0-2 | 239Pu+240Pu statistical error (Bq/L) | NaN | 239Pu+240Pu | Bq/L |
| 969934 | 141.233333 | 37.516667 | 2023/1/26 07:26:00 | T-B2 | 238Pu detection limit (Bq/L) | NaN | 238Pu | Bq/L |
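The unit is whatever sits between the first pair of parentheses in the column name. A minimal standard-library sketch of the same regular expression, with a hypothetical helper name `extract_unit`:

```python
import re

def extract_unit(variable):
    """Return the text inside the first pair of parentheses, or None if there is none."""
    match = re.search(r'\((.*?)\)', variable)
    return match.group(1) if match else None

print(extract_unit('238Pu detection limit (Bq/L)'))
print(extract_unit('Unnamed: 49'))
```

The non-greedy `.*?` stops at the first closing parenthesis, so a name with several parenthesised parts would only return the first one.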
Value type
Is the value a measurement, or a derived quantity such as a detection limit or an uncertainty?
class ExtractValueTypeCB(Callback):
    "Extract value type from TEPCO data."
    def __init__(self, src_col='variable', dest_col='type'): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.dest_col] = np.select(
            [
                tfm.dfs['SEAWATER'][self.src_col].str.contains('detection limit', case=False),
                tfm.dfs['SEAWATER'][self.src_col].str.contains('statistical error', case=False)],
            ['DL', 'UNC'],
            default='VALUE'
        )

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB(),
ExtractValueTypeCB()
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
| | LON | LAT | TIME | STATION | variable | value | NUCLIDE | UNIT | type |
|---|---|---|---|---|---|---|---|---|---|
| 1315127 | 141.000000 | 36.966667 | 2020/10/15 10:53:00 | T-20 | Total beta detection limit (Bq/L) | NaN | Total beta | Bq/L | DL |
| 1839912 | 141.033611 | 37.415833 | 2022/03/19 09:05:00 | T-2 | 144Ce detection limit (Bq/L) | NaN | 144Ce | Bq/L | DL |
| 1842443 | 37.410000 | 141.030000 | 2014/08/11 05:35:00 | T-2-1 | 144Ce detection limit (Bq/L) | NaN | 144Ce | Bq/L | DL |
| 36817 | 141.034444 | 37.431111 | 2016/03/20 07:45 | T-1 | 131I radioactivity concentration (Bq/L) | ND | 131I | Bq/L | VALUE |
| 171526 | 141.072222 | 37.416667 | 2022/9/5 08:09:00 | T-D5 | 134Cs detection limit (Bq/L) | 1.4E-03 | 134Cs | Bq/L | DL |
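The classification amounts to a first-match-wins lookup on the column name. The sketch below expresses the same rule as a plain function; the name `value_type` is hypothetical and it is meant as a readable equivalent of the `np.select` call, not a replacement for it.

```python
def value_type(variable):
    """Classify a TEPCO column name as 'DL', 'UNC' or 'VALUE' (case-insensitive)."""
    name = variable.lower()
    if 'detection limit' in name:
        return 'DL'
    if 'statistical error' in name:
        return 'UNC'
    return 'VALUE'  # default: the reported activity concentration

for v in ['134Cs detection limit (Bq/L)',
          '239Pu+240Pu statistical error (Bq/L)',
          '131I radioactivity concentration (Bq/L)']:
    print(v, '->', value_type(v))
```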
Reshape: long to wide
Move the values of the type column (VALUE, DL, UNC) into column names.
class LongToWideCB(Callback):
    "Reshape: long to wide"
    def __init__(self, src_col='variable', dest_col='type'): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'] = pd.pivot_table(
            tfm.dfs['SEAWATER'],
            values='value',
            index=['LON', 'LAT', 'TIME', 'STATION', 'NUCLIDE', 'UNIT'],
            columns='type',
            aggfunc='first'
        ).reset_index()
        tfm.dfs['SEAWATER'].reset_index(inplace=True)
        tfm.dfs['SEAWATER'].rename(columns={'index': 'SMP_ID'}, inplace=True)

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB(),
ExtractValueTypeCB(),
LongToWideCB()
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE |
|---|---|---|---|---|---|---|---|---|---|---|
| 83649 | 83649 | 141.200000 | 37.233333 | 2011/8/26 07:40:00 | T-7 | 134Cs | Bq/L | 1.2E+01 | NaN | ND |
| 37560 | 37560 | 141.034444 | 37.431111 | 2013/10/27 06:45:00 | T-1 | 137Cs | Bq/L | NaN | NaN | 1.4 |
| 18204 | 18204 | 141.026389 | 37.322222 | 2014/8/5 10:10:00 | T-3 | Total beta | Bq/L | 1.7E+01 | NaN | ND |
| 65718 | 65718 | 141.046667 | 37.430556 | 2022/10/03 07:10:00 | T-0-1A | 134Cs | Bq/L | 0.31 | NaN | ND |
| 69961 | 69961 | 141.050761 | 37.424686 | 2025/07/07 07:35 | T-A2 | 3H | Bq/L | 9.4 | NaN | ND |
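For reference, here is the same pivot on a four-row toy table, so the mechanics are easier to follow than on the full data set. The values are invented, and the index is reduced to two columns for brevity.

```python
import pandas as pd

# Toy long table: one VALUE row and one DL row per (station, nuclide).
long = pd.DataFrame({
    'STATION': ['T-1', 'T-1', 'T-2', 'T-2'],
    'NUCLIDE': ['134Cs'] * 4,
    'type': ['VALUE', 'DL', 'VALUE', 'DL'],
    'value': ['4.8E+01', '9.2E+00', 'ND', '0.4'],
})

wide = pd.pivot_table(
    long,
    values='value',
    index=['STATION', 'NUCLIDE'],
    columns='type',
    aggfunc='first',  # values are still strings here, so the default 'mean' is not applicable
).reset_index()
print(wide)
```

Each (station, nuclide) pair ends up on a single row with one column per value type, which is the shape the remapping steps expect.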
df_test[df_test.VALUE == 'ND'].groupby('NUCLIDE').size().sort_values(ascending=False)
NUCLIDE
134Cs 25186
137Cs 16447
3H 8976
131I 7958
Total beta 4913
Total alpha 979
125Sb 647
90Sr 342
238Pu 308
239Pu+240Pu 231
89Sr 100
144Ce 9
54Mn 9
60Co 9
58Co 3
132I 3
136Cs 2
dtype: int64
df_test[df_test.VALUE == 'ND']
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.210000 | 141.01 | 2012/10/16 07:25:00 | T-4-1 | 131I | Bq/L | 1.3E-01 | NaN | ND |
| 1 | 1 | 37.210000 | 141.01 | 2012/10/16 07:25:00 | T-4-1 | 134Cs | Bq/L | 1.9E-01 | NaN | ND |
| 2 | 2 | 37.210000 | 141.01 | 2012/10/16 07:25:00 | T-4-1 | 137Cs | Bq/L | 2.7E-01 | NaN | ND |
| 3 | 3 | 37.210000 | 141.01 | 2012/10/2 07:30:00 | T-4-1 | 131I | Bq/L | 1.1E-01 | NaN | ND |
| 4 | 4 | 37.210000 | 141.01 | 2012/10/2 07:30:00 | T-4-1 | 134Cs | Bq/L | 2.2E-01 | NaN | ND |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 93158 | 93158 | 141.666667 | 38.30 | 2025/4/8 08:20:00 | T-MG2 | 134Cs | Bq/L | 1.3E-03 | NaN | ND |
| 93160 | 93160 | 141.666667 | 38.30 | 2025/5/13 07:36:00 | T-MG2 | 134Cs | Bq/L | 1.2E-03 | NaN | ND |
| 93162 | 93162 | 141.666667 | 38.30 | 2025/5/13 07:50:00 | T-MG2 | 134Cs | Bq/L | 8.7E-04 | NaN | ND |
| 93164 | 93164 | 141.666667 | 38.30 | 2025/6/3 08:15:00 | T-MG2 | 134Cs | Bq/L | 1.1E-03 | NaN | ND |
| 93166 | 93166 | 141.666667 | 38.30 | 2025/6/3 08:24:00 | T-MG2 | 134Cs | Bq/L | 1.2E-03 | NaN | ND |
66122 rows × 10 columns
df_test.VALUE == 'ND'
0 True
1 True
2 True
3 True
4 True
...
93163 False
93164 True
93165 False
93166 True
93167 False
Name: VALUE, Length: 93168, dtype: bool
Remap UNIT name to MARIS nomenclature
Data are reported in Bq/L, whereas MARIS uses Bq/m3. We therefore map the TEPCO unit Bq/L to the MARIS unit id defined in unit_mapping below (1). Later in the processing pipeline, the values themselves are converted from Bq/L to Bq/m3 by multiplying VALUE and DLV by 1000.
unit_mapping = {'Bq/L': 1}

class RemapUnitNameCB(Callback):
    """
    Remap `UNIT` name to MARIS id.
    """
    def __init__(self, unit_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['UNIT'] = tfm.dfs['SEAWATER']['UNIT'].map(self.unit_mapping)

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB(),
ExtractValueTypeCB(),
LongToWideCB(),
RemapUnitNameCB(unit_mapping)
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE |
|---|---|---|---|---|---|---|---|---|---|---|
| 81874 | 81874 | 141.133333 | 38.250000 | 2012/7/26 10:25:00 | T-MG4 | 134Cs | 1 | NaN | NaN | 6.8E-03 |
| 56188 | 56188 | 141.040556 | 37.478889 | 2015/2/3 09:15:00 | T-6 | 3H | 1 | NaN | NaN | 4.6E-01 |
| 74609 | 74609 | 141.072167 | 37.333333 | 2024/9/18 08:21:00 | T-D9 | 137Cs | 1 | NaN | NaN | 2.9E-03 |
| 36442 | 36442 | 141.034444 | 37.431111 | 2012/11/19 08:30:00 | T-1 | 137Cs | 1 | 1.4 | NaN | ND |
| 36846 | 36846 | 141.034444 | 37.431111 | 2013/03/28 06:50:00 | T-1 | 134Cs | 1 | 1.1 | NaN | ND |
Remap NUCLIDE name to MARIS nomenclature
nuclide_mapping = {
'131I': 29,
'134Cs': 31,
'137Cs': 33,
'125Sb': 24,
'Total beta': 103,
'238Pu': 67,
'239Pu+240Pu': 77,
'3H': 1,
'89Sr': 11,
'90Sr': 12,
'Total alpha': 104,
'132I': 100,
'136Cs': 102,
'58Co': 8,
'105Ru': 97,
'106Ru': 17,
'140La': 35,
'140Ba': 34,
'132Te': 99,
'60Co': 9,
'144Ce': 37,
'54Mn': 6
}

class RemapNuclideNameCB(Callback):
    "Remap `NUCLIDE` name to MARIS id."
    def __init__(self, nuclide_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['NUCLIDE'] = tfm.dfs['SEAWATER']['NUCLIDE'].map(self.nuclide_mapping)

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB(),
ExtractValueTypeCB(),
LongToWideCB(),
RemapUnitNameCB(unit_mapping),
RemapNuclideNameCB(nuclide_mapping)
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE |
|---|---|---|---|---|---|---|---|---|---|---|
| 63425 | 63425 | 141.046667 | 37.423333 | 2024/04/01 08:15:00 | T-0-2 | 31 | 1 | 0.4 | NaN | ND |
| 83876 | 83876 | 141.200000 | 37.233333 | 2019/6/6 06:58:00 | T-7 | 31 | 1 | 1.2E-03 | NaN | ND |
| 31743 | 31743 | 141.033611 | 37.415833 | 2023/07/31 08:55:00 | T-2 | 103 | 1 | NaN | NaN | 10 |
| 68484 | 68484 | 141.047222 | 37.311111 | 2023/12/5 05:48:00 | T-S7 | 1 | 1 | NaN | NaN | 1.5E-01 |
| 19853 | 19853 | 141.026389 | 37.322222 | 2025/3/4 12:20:00 | T-3 | 1 | 1 | 3.6E-01 | NaN | ND |
df_test.dropna(subset=['DL', 'VALUE'], how='any')
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.210000 | 141.01 | 2012/10/16 07:25:00 | T-4-1 | 29 | 1 | 1.3E-01 | NaN | ND |
| 1 | 1 | 37.210000 | 141.01 | 2012/10/16 07:25:00 | T-4-1 | 31 | 1 | 1.9E-01 | NaN | ND |
| 2 | 2 | 37.210000 | 141.01 | 2012/10/16 07:25:00 | T-4-1 | 33 | 1 | 2.7E-01 | NaN | ND |
| 3 | 3 | 37.210000 | 141.01 | 2012/10/2 07:30:00 | T-4-1 | 29 | 1 | 1.1E-01 | NaN | ND |
| 4 | 4 | 37.210000 | 141.01 | 2012/10/2 07:30:00 | T-4-1 | 31 | 1 | 2.2E-01 | NaN | ND |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 93158 | 93158 | 141.666667 | 38.30 | 2025/4/8 08:20:00 | T-MG2 | 31 | 1 | 1.3E-03 | NaN | ND |
| 93160 | 93160 | 141.666667 | 38.30 | 2025/5/13 07:36:00 | T-MG2 | 31 | 1 | 1.2E-03 | NaN | ND |
| 93162 | 93162 | 141.666667 | 38.30 | 2025/5/13 07:50:00 | T-MG2 | 31 | 1 | 8.7E-04 | NaN | ND |
| 93164 | 93164 | 141.666667 | 38.30 | 2025/6/3 08:15:00 | T-MG2 | 31 | 1 | 1.1E-03 | NaN | ND |
| 93166 | 93166 | 141.666667 | 38.30 | 2025/6/3 08:24:00 | T-MG2 | 31 | 1 | 1.2E-03 | NaN | ND |
66093 rows × 10 columns
Remap VALUE, DL, DLV
We remap the DL (Detection Limit) value to MARIS ids as follows:
- First check if the activity (VALUE) is reported as “ND”, then decide based on the reported detection limit DL:
    if VALUE is "ND":
        if not DL:
            VALUE, DLV, DL = NaN, NaN, 3
        else:
            VALUE, DLV, DL = DL, DL, 2
- Then, if the activity (VALUE) is reported:
    if VALUE:
        VALUE, DLV, DL = VALUE, DL, 1
  but if it is not reported, decide based on the reported detection limit (DL):
    else:
        if DL:
            VALUE, DLV, DL = DL, DL, 2
        else:
            VALUE, DLV, DL = NaN, NaN, NaN (should be dropped)
With 1: Detected value (=), 2: Detection limit (<), 3: Not detected (ND) and where:
- VALUE is the activity reported by TEPCO
- DL is initially the detection limit as reported by TEPCO but later on remapped to MARIS detection level nomenclature (categorical)
- DLV is the detection limit value as reported by TEPCO (copied from DL)
class RemapVALUE_DL_DLV_CB(Callback):
    "Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules."
    def map_all_columns(self, row):
        """Map all three columns (VALUE, DL, DLV) at once based on TEPCO rules"""
        value, dl = row['VALUE'], row['DL']
        new_value, new_dlv, new_dl = value, dl, 1
        if value == 'ND':
            if pd.isna(dl):
                new_value, new_dlv, new_dl = np.nan, np.nan, 3
            else:
                new_value, new_dlv, new_dl = dl, dl, 2
        elif pd.isna(value):
            if pd.isna(dl):
                new_value, new_dlv, new_dl = np.nan, np.nan, np.nan
            else:
                new_value, new_dlv, new_dl = dl, dl, 2
        return pd.Series({
            'VALUE': new_value,
            'DLV': new_dlv,
            'DL': new_dl
        })
    def __call__(self, tfm):
        mapped = tfm.dfs['SEAWATER'].apply(self.map_all_columns, axis=1)
        tfm.dfs['SEAWATER'][['VALUE', 'DLV', 'DL']] = mapped
        tfm.dfs['SEAWATER']['DL'] = tfm.dfs['SEAWATER']['DL'].astype(int)
        tfm.dfs['SEAWATER']['VALUE'] = tfm.dfs['SEAWATER']['VALUE'].astype(float)

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB(),
ExtractValueTypeCB(),
LongToWideCB(),
RemapUnitNameCB(unit_mapping),
RemapNuclideNameCB(nuclide_mapping),
RemapVALUE_DL_DLV_CB()
])
df_test = tfm()['SEAWATER']
df_test.sample(20)
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE | DLV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 24313 | 24313 | 141.033611 | 37.415833 | 2017/10/19 06:50:00 | T-2 | 33 | 1 | 2 | NaN | 0.7100 | 0.71 |
| 81171 | 81171 | 141.083333 | 37.750000 | 2011/11/22 07:10:00 | T-MA | 33 | 1 | 2 | NaN | 1.1000 | 1.1E+00 |
| 62025 | 62025 | 141.046667 | 37.423333 | 2018/01/05 07:37:00 | T-0-2 | 1 | 1 | 2 | NaN | 1.7000 | 1.7 |
| 61732 | 61732 | 141.046667 | 37.423333 | 2016/08/10 07:54 | T-0-2 | 33 | 1 | 2 | NaN | 0.5300 | 0.53 |
| 52714 | 52714 | 141.040278 | 37.416111 | 2023/11/06 06:56:00 | T-0-3 | 31 | 1 | 2 | NaN | 0.3000 | 0.3 |
| 9189 | 9189 | 140.763889 | 36.713889 | 2012/2/28 07:41:00 | T-A | 29 | 1 | 2 | NaN | 1.2000 | 1.2E+00 |
| 16645 | 16645 | 141.022500 | 37.824444 | 2022/10/6 05:44:00 | T-22 | 31 | 1 | 2 | NaN | 0.0013 | 1.3E-03 |
| 24544 | 24544 | 141.033611 | 37.415833 | 2017/12/15 06:55:00 | T-2 | 31 | 1 | 2 | NaN | 0.5500 | 0.55 |
| 69344 | 69344 | 141.050761 | 37.424686 | 2023/03/27 07:26:00 | T-A2 | 33 | 1 | 2 | NaN | 0.2900 | 0.29 |
| 21783 | 21783 | 141.033611 | 37.415833 | 2012/05/29 08:15:00 | T-2 | 33 | 1 | 2 | NaN | 1.6000 | 1.6 |
| 30460 | 30460 | 141.033611 | 37.415833 | 2022/07/11 09:13:00 | T-2 | 103 | 1 | 1 | NaN | 14.0000 | NaN |
| 29837 | 29837 | 141.033611 | 37.415833 | 2022/01/03 08:15:00 | T-2 | 33 | 1 | 2 | NaN | 0.8700 | 0.87 |
| 9304 | 9304 | 140.763889 | 36.713889 | 2014/11/10 09:34:00 | T-A | 33 | 1 | 2 | NaN | 1.3000 | 1.3E+00 |
| 28727 | 28727 | 141.033611 | 37.415833 | 2021/01/28 07:05:00 | T-2 | 31 | 1 | 2 | NaN | 0.8000 | 0.8 |
| 42442 | 42442 | 141.034444 | 37.431111 | 2017/11/27 07:05:00 | T-1 | 1 | 1 | 2 | NaN | 1.9000 | 1.9 |
| 24596 | 24596 | 141.033611 | 37.415833 | 2017/12/28 06:55:00 | T-2 | 103 | 1 | 1 | NaN | 12.0000 | NaN |
| 6689 | 6689 | 140.603889 | 36.299722 | 2019/10/18 13:06:00 | T-C | 33 | 1 | 2 | NaN | 1.2000 | 1.2E+00 |
| 35904 | 35904 | 141.034444 | 37.431111 | 2012/05/28 08:55:00 | T-1 | 29 | 1 | 2 | NaN | 0.4900 | 0.49 |
| 15422 | 15422 | 141.013889 | 37.241667 | 2017/12/5 13:30:00 | T-4 | 31 | 1 | 1 | NaN | 0.0038 | NaN |
| 59538 | 59538 | 141.046667 | 37.416111 | 2018/08/28 07:22:00 | T-0-3A | 103 | 1 | 2 | NaN | 15.0000 | 15.0 |
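To check the remapping rules case by case, the sketch below restates them as a plain function and runs it on one toy input per branch. It uses only the standard library; `remap` and `is_missing` are hypothetical names and the inputs are invented, so this is an illustration of the decision table rather than the callback itself.

```python
import math

def is_missing(x):
    """True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def remap(value, dl):
    """Return (VALUE, DLV, DL) for a reported activity and detection limit."""
    if value == 'ND':
        # Not detected: fall back on the detection limit when there is one.
        return (math.nan, math.nan, 3) if is_missing(dl) else (dl, dl, 2)
    if is_missing(value):
        # Nothing reported: keep the detection limit if present, else mark for dropping.
        return (math.nan, math.nan, math.nan) if is_missing(dl) else (dl, dl, 2)
    # A detected value: keep it, carry the detection limit along as DLV.
    return (value, dl, 1)

cases = [('ND', None), ('ND', 0.4), (14.0, None), (2.0, 1.4), (None, 0.3)]
for value, dl in cases:
    print((value, dl), '->', remap(value, dl))
```

The fourth case corresponds to rows where both a detected value and a detection limit are present: VALUE is kept, DL becomes 1 and DLV retains the limit.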
Convert activity to Bq/m3
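Earlier in the pipeline, we mapped the TEPCO unit Bq/L to a MARIS unit id (see unit_mapping). We now convert the values themselves from Bq/L to Bq/m3 by multiplying VALUE and DLV by 1000. DL is left untouched because, at this stage, it holds the categorical detection-limit flag rather than a concentration.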
class ConvertToBqM3CB(Callback):
    "Convert from Bq/L to Bq/m3."
    def __call__(self, tfm, factor=1000):
        tfm.dfs['SEAWATER']['VALUE'] = tfm.dfs['SEAWATER']['VALUE'] * factor
        # Convert DLV to float, handling NaN values
        tfm.dfs['SEAWATER']['DLV'] = pd.to_numeric(tfm.dfs['SEAWATER']['DLV'], errors='coerce')
        tfm.dfs['SEAWATER']['DLV'] = tfm.dfs['SEAWATER']['DLV'] * factor

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB(),
ExtractValueTypeCB(),
LongToWideCB(),
RemapUnitNameCB(unit_mapping),
RemapNuclideNameCB(nuclide_mapping),
RemapVALUE_DL_DLV_CB(),
ConvertToBqM3CB()
])
df_test = tfm()['SEAWATER']
df_test.sample(20)
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE | DLV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11653 | 11653 | 140.922222 | 36.905556 | 2018/7/17 09:28:00 | T-18 | 33 | 1 | 1 | NaN | 3.1 | NaN |
| 56255 | 56255 | 141.040556 | 37.478889 | 2015/8/11 08:00:00 | T-6 | 33 | 1 | 1 | NaN | 57.0 | NaN |
| 4414 | 4414 | 37.410000 | 141.030000 | 2015/07/13 05:30:00 | T-2-1 | 1 | 1 | 1 | NaN | 2100.0 | NaN |
| 71352 | 71352 | 141.062500 | 37.552778 | 2015/7/27 08:36:00 | T-14 | 31 | 1 | 1 | NaN | 2.0 | 1.4 |
| 28640 | 28640 | 141.033611 | 37.415833 | 2021/01/01 07:11:00 | T-2 | 103 | 1 | 1 | NaN | 15000.0 | NaN |
| 51235 | 51235 | 141.040278 | 37.416111 | 2017/01/02 07:16:00 | T-0-3 | 31 | 1 | 2 | NaN | 680.0 | 680.0 |
| 66512 | 66512 | 141.046667 | 37.430556 | 2025/04/29 07:18 | T-0-1A | 1 | 1 | 2 | NaN | 8100.0 | 8100.0 |
| 49336 | 49336 | 141.034444 | 37.431111 | 2025/03/29 06:50 | T-1 | 33 | 1 | 2 | NaN | 660.0 | 660.0 |
| 71187 | 71187 | 141.062500 | 37.552778 | 2014/2/26 09:09:00 | T-14 | 33 | 1 | 1 | NaN | 48.0 | NaN |
| 32088 | 32088 | 141.033611 | 37.415833 | 2023/10/27 06:45:00 | T-2 | 31 | 1 | 2 | NaN | 750.0 | 750.0 |
| 34498 | 34498 | 141.034444 | 37.431111 | 2011/04/20 14:20:00 | T-1 | 29 | 1 | 1 | NaN | 47000.0 | NaN |
| 76600 | 76600 | 141.072222 | 37.416667 | 2022/6/1 08:40:00 | T-D5 | 104 | 1 | 2 | NaN | 2500.0 | 2500.0 |
| 51215 | 51215 | 141.040278 | 37.416111 | 2016/11/28 07:47:00 | T-0-3 | 31 | 1 | 2 | NaN | 790.0 | 790.0 |
| 28565 | 28565 | 141.033611 | 37.415833 | 2020/12/09 06:45:00 | T-2 | 33 | 1 | 2 | NaN | 690.0 | 690.0 |
| 28344 | 28344 | 141.033611 | 37.415833 | 2020/10/04 06:45:00 | T-2 | 103 | 1 | 1 | NaN | 13000.0 | NaN |
| 70632 | 70632 | 141.062500 | 37.552778 | 2011/10/25 09:10:00 | T-14 | 29 | 1 | 2 | NaN | 740.0 | 740.0 |
| 7440 | 7440 | 140.665556 | 36.506389 | 2014/5/26 13:38:00 | T-B | 31 | 1 | 2 | NaN | 1100.0 | 1100.0 |
| 20878 | 20878 | 141.033611 | 37.415833 | 2011/10/03 08:30:00 | T-2 | 29 | 1 | 2 | NaN | 4000.0 | 4000.0 |
| 4115 | 4115 | 37.410000 | 141.030000 | 2015/04/27 05:30:00 | T-2-1 | 31 | 1 | 2 | NaN | 770.0 | 770.0 |
| 44483 | 44483 | 141.034444 | 37.431111 | 2019/10/01 08:05:00 | T-1 | 31 | 1 | 2 | NaN | 840.0 | 840.0 |
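The factor follows from 1 m3 = 1000 L, so a concentration in Bq/L is multiplied by 1000 to express it in Bq/m3. A minimal sketch of the arithmetic, with a hypothetical helper name and toy inputs:

```python
import math

L_PER_M3 = 1000  # 1 m3 = 1000 L

def bq_per_l_to_bq_per_m3(x):
    """Convert an activity concentration from Bq/L to Bq/m3; NaN stays NaN."""
    return x * L_PER_M3

print(bq_per_l_to_bq_per_m3(0.71))
print(bq_per_l_to_bq_per_m3(math.nan))
```

Missing detection limits remain missing after the conversion, which matches the NaN entries in the DLV column above.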
Parse & encode time
class ParseTimeCB(Callback):
    "Parse time column from TEPCO."
    def __init__(self, time_name='TIME'): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.time_name] = pd.to_datetime(tfm.dfs['SEAWATER'][self.time_name],
                                                             format='%Y/%m/%d %H:%M:%S', errors='coerce')

tfm = Transformer(dfs, cbs=[
RemoveJapanaseCharCB(),
FixRangeValueStringCB(),
SelectColsOfInterestCB(common_coi, nuclides_pattern),
WideToLongCB(),
ExtractNuclideNameCB(),
ExtractUnitCB(),
ExtractValueTypeCB(),
LongToWideCB(),
RemapUnitNameCB(unit_mapping),
RemapNuclideNameCB(nuclide_mapping),
RemapVALUE_DL_DLV_CB(),
ConvertToBqM3CB(),
ParseTimeCB(),
EncodeTimeCB(),
])
df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 4831 missing time value(s) in SEAWATER
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE | DLV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 69085 | 69085 | 141.050739 | 37.409267 | 1733731260 | T-A3 | 33 | 1 | 2 | NaN | 320.0 | 320.0 |
| 44340 | 44340 | 141.034444 | 37.431111 | 1564646400 | T-1 | 33 | 1 | 2 | NaN | 720.0 | 720.0 |
| 30532 | 30532 | 141.033611 | 37.415833 | 1659345300 | T-2 | 103 | 1 | 1 | NaN | 7000.0 | NaN |
| 42965 | 42965 | 141.034444 | 37.431111 | 1525677000 | T-1 | 1 | 1 | 2 | NaN | 880.0 | 880.0 |
| 14651 | 14651 | 141.013889 | 37.241667 | 1315813500 | T-4 | 29 | 1 | 2 | NaN | 4000.0 | 4000.0 |
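The warning reports 4831 missing time values. The raw TIME column mixes timestamps with seconds (e.g. 2011/3/21 23:15:00) and without seconds (e.g. 2025/06/30 08:05), and a single fixed format combined with errors='coerce' would turn the second kind into missing values. Whether this explains some or all of the 4831 cases is an assumption that has not been checked against the data. The standard-library sketch below, with a hypothetical function name, shows one way to accept both layouts:

```python
from datetime import datetime

# Layouts seen in the TIME column: with and without seconds.
FORMATS = ('%Y/%m/%d %H:%M:%S', '%Y/%m/%d %H:%M')

def parse_tepco_time(text):
    """Try each known layout in turn; return None when none of them matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except (ValueError, TypeError):
            continue
    return None

for raw in ['2011/3/21 23:15:00', '2025/06/30 08:05', 'ND']:
    print(repr(raw), '->', parse_tepco_time(raw))
```

Trying the stricter layout first keeps the seconds when they are present and only falls back to the minute-resolution layout when needed.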
Sanitize coordinates
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])
df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 4831 missing time value(s) in SEAWATER
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE | DLV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 54074 | 54074 | 141.040278 | 37.430556 | 1515394980 | T-0-1 | 1 | 1 | 2 | NaN | 1700.0 | 1700.0 |
| 74378 | 74378 | 141.072167 | 37.333333 | 1681804920 | T-D9 | 103 | 1 | 2 | NaN | 14000.0 | 14000.0 |
| 91213 | 91213 | 141.583333 | 38.633333 | 1377003060 | T-MG0 | 31 | 1 | 2 | NaN | 2.2 | 2.2 |
| 72667 | 72667 | 141.072167 | 37.333333 | 1380362220 | T-D9 | 31 | 1 | 1 | NaN | 23.0 | NaN |
| 86114 | 86114 | 141.200000 | 37.416667 | 1638775740 | T-5 | 104 | 1 | 2 | NaN | 2300.0 | 2300.0 |
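According to its log message, `SanitizeLonLatCB` drops rows with invalid longitude and latitude values and converts `,` decimal separators to `.`. The function below is a minimal sketch of that idea, not the marisco implementation: `sanitize_lon_lat` is a hypothetical name, and the validity rule (longitude within [-180, 180], latitude within [-90, 90]) is an assumption.

```python
import pandas as pd

def sanitize_lon_lat(df, lon_col='LON', lat_col='LAT'):
    "Hypothetical stand-in for SanitizeLonLatCB, not the marisco implementation."
    out = df.copy()
    for col in (lon_col, lat_col):
        # Accept decimal commas, then coerce anything unparseable to NaN.
        as_text = out[col].astype(str).str.replace(',', '.', regex=False)
        out[col] = pd.to_numeric(as_text, errors='coerce')
    # Keep only rows whose coordinates fall inside the valid geographic ranges;
    # NaN never satisfies `between`, so unparseable rows are dropped too.
    valid = out[lon_col].between(-180, 180) & out[lat_col].between(-90, 90)
    return out[valid].reset_index(drop=True)

df = pd.DataFrame({
    'LON': ['141,034444', 141.03, 37.41, 'n/a'],
    'LAT': [37.431111, '37,41', 141.03, 37.0],
})
clean = sanitize_lon_lat(df)
print(clean)
```

In this toy frame the third row has longitude and latitude swapped, which yields a latitude of 141.03 and is therefore rejected by the range check.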
Add Sample ID
The SMP_ID_PROVIDER column stores the original sample ID from the data provider. TEPCO does not provide sample IDs, so this column is set to an empty string for all records, while SMP_ID is filled with a sequential, 1-based identifier.
class AddSampleIdCB(Callback):
    "Add `SMP_ID` (sequential, 1-based) and an empty `SMP_ID_PROVIDER` column."
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['SMP_ID'] = range(1, len(tfm.dfs['SEAWATER']) + 1)
        tfm.dfs['SEAWATER']['SMP_ID_PROVIDER'] = ""
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    AddSampleIdCB(),
    ])
df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 4831 missing time value(s) in SEAWATER
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE | DLV | SMP_ID_PROVIDER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67223 | 57586 | 141.047222 | 37.241667 | 1448351640 | T-11 | 31 | 1 | 1 | NaN | 3.5 | NaN | |
| 42868 | 35125 | 141.034444 | 37.431111 | 1523173800 | T-1 | 33 | 1 | 2 | NaN | 450.0 | 450.0 | |
| 44015 | 36272 | 141.034444 | 37.431111 | 1553155800 | T-1 | 31 | 1 | 2 | NaN | 760.0 | 760.0 | |
| 42519 | 34776 | 141.034444 | 37.431111 | 1513755900 | T-1 | 31 | 1 | 2 | NaN | 540.0 | 540.0 | |
| 13882 | 7902 | 141.013889 | 37.241667 | 1318838400 | T-4 | 29 | 1 | 2 | NaN | 4000.0 | 4000.0 | |
Encode to NetCDF
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    AddSampleIdCB(),
    ])
dfs_tfm = tfm()
tfm.logs
Warning: 4831 missing time value(s) in SEAWATER
['Remove 約 (about) char',
"Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean",
'Select columns of interest.',
'\n Get TEPCO nuclide names as values not column names \n to extract contained information (nuclide name, unc, dl, ...).\n ',
'Extract nuclide name from TEPCO data.',
'Extract unit from TEPCO data.',
'Extract value type from TEPCO data.',
'Reshape: long to wide',
'\n Remap `UNIT` name to MARIS id.\n ',
'Remap `NUCLIDE` name to MARIS id.',
'Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules.',
'Convert from Bq/L to Bq/m3.',
'Parse time column from TEPCO.',
'Encode time as seconds since epoch.',
'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
'Convert from Bq/L to Bq/m3.']
dfs_tfm['SEAWATER'].sample(10)
| type | SMP_ID | LON | LAT | TIME | STATION | NUCLIDE | UNIT | DL | UNC | VALUE | DLV | SMP_ID_PROVIDER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64028 | 54701 | 141.046667 | 37.430556 | 1408352280 | T-0-1A | 1 | 1 | 2 | NaN | 1700.0 | 1700.0 | NaN |
| 82446 | 72534 | 141.133333 | 38.250000 | 1522926600 | T-MG4 | 31 | 1 | 2 | NaN | 1.5 | 1.5 | NaN |
| 39583 | 32834 | 141.034444 | 37.431111 | 1438068000 | T-1 | 31 | 1 | 2 | NaN | 790.0 | 790.0 | NaN |
| 32941 | 26702 | 141.033611 | 37.415833 | 1719583500 | T-2 | 31 | 1 | 2 | NaN | 640.0 | 640.0 | NaN |
| 31333 | 25094 | 141.033611 | 37.415833 | 1680244980 | T-2 | 33 | 1 | 2 | NaN | 740.0 | 740.0 | NaN |
| 87087 | 77175 | 141.216667 | 37.533333 | 1659696300 | T-B1 | 33 | 1 | 1 | NaN | 1.4 | NaN | NaN |
| 83010 | 73098 | 141.148611 | 37.348333 | 1647327180 | T-B4 | 31 | 1 | 2 | NaN | 1.4 | 1.4 | NaN |
| 65263 | 55757 | 141.046667 | 37.430556 | 1594621980 | T-0-1A | 33 | 1 | 2 | NaN | 750.0 | 750.0 | NaN |
| 87045 | 77133 | 141.216667 | 37.533333 | 1614322020 | T-B1 | 31 | 1 | 2 | NaN | 1.2 | 1.2 | NaN |
| 13558 | 7578 | 141.006944 | 37.055556 | 1380524760 | T-17-1 | 33 | 1 | 1 | NaN | 8.7 | NaN | NaN |
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
    ])()
get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw)
{'geospatial_lat_min': '35.79611111',
'geospatial_lat_max': '38.63333333',
'geospatial_lon_min': '140.60388889',
'geospatial_lon_max': '141.66666667',
'geospatial_bounds': 'POLYGON ((140.60388889 35.79611111, 141.66666667 35.79611111, 141.66666667 38.63333333, 140.60388889 38.63333333, 140.60388889 35.79611111))',
'time_coverage_start': '2011-03-21T14:30:00',
'time_coverage_end': '2025-07-22T06:17:00',
'id': 'JEV6HP5A',
'title': "Readings of Sea Area Monitoring - Monitoring of sea water - Sea area close to TEPCO's Fukushima Daiichi NPS / Coastal area - Readings of Sea Area Monitoring [TEPCO]",
'summary': '',
'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "TEPCO - Tokyo Electric Power Company"}]',
'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
'publisher_postprocess_logs': "Remove 約 (about) char, Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean, Select columns of interest., \n Get TEPCO nuclide names as values not column names \n to extract contained information (nuclide name, unc, dl, ...).\n , Extract nuclide name from TEPCO data., Extract unit from TEPCO data., Extract value type from TEPCO data., Reshape: long to wide, \n Remap `UNIT` name to MARIS id.\n , Remap `NUCLIDE` name to MARIS id., Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules., Convert from Bq/L to Bq/m3., Parse time column from TEPCO., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator., Convert from Bq/L to Bq/m3."}
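The `geospatial_bounds` polygon above describes an extent of roughly 140.60 to 141.67 in longitude and 35.80 to 38.63 in latitude, and the four `geospatial_*_min/max` attributes are meant to describe the same box. As a cross-check, the sketch below derives both from point coordinates with plain pandas. `bbox_attrs` is a hypothetical helper, not part of marisco, and the toy coordinates are only chosen to reproduce the extent shown in the polygon.

```python
import pandas as pd

def bbox_attrs(df, lon_col='LON', lat_col='LAT'):
    "Hypothetical helper, not part of marisco: bounding-box attributes from point coordinates."
    lon_min, lon_max = float(df[lon_col].min()), float(df[lon_col].max())
    lat_min, lat_max = float(df[lat_col].min()), float(df[lat_col].max())
    # WKT ring in (lon lat) order, closed by repeating the first corner.
    ring = [(lon_min, lat_min), (lon_max, lat_min), (lon_max, lat_max),
            (lon_min, lat_max), (lon_min, lat_min)]
    wkt = 'POLYGON ((' + ', '.join(f'{x} {y}' for x, y in ring) + '))'
    return {
        'geospatial_lon_min': str(lon_min), 'geospatial_lon_max': str(lon_max),
        'geospatial_lat_min': str(lat_min), 'geospatial_lat_max': str(lat_max),
        'geospatial_bounds': wkt,
    }

# Toy points spanning the extent reported in geospatial_bounds.
df = pd.DataFrame({'LON': [140.60388889, 141.66666667, 141.034444],
                   'LAT': [36.29972, 38.63333333, 35.79611111]})
attrs = bbox_attrs(df)
print(attrs)
```

Comparing the individual min/max attributes against the polygon corners is a quick consistency test before writing the global attributes to NetCDF.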
def encode(
    fname_out: str, # Path to the NetCDF output file
    **kwargs # Additional keyword arguments
    ):
    "Encode TEPCO data to NetCDF."
    dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
    tfm = Transformer(dfs, cbs=[
        RemoveJapanaseCharCB(),
        FixRangeValueStringCB(),
        SelectColsOfInterestCB(common_coi, nuclides_pattern),
        WideToLongCB(),
        ExtractNuclideNameCB(),
        ExtractUnitCB(),
        ExtractValueTypeCB(),
        LongToWideCB(),
        RemapUnitNameCB(unit_mapping),
        RemapNuclideNameCB(nuclide_mapping),
        RemapVALUE_DL_DLV_CB(),
        ConvertToBqM3CB(),
        ParseTimeCB(),
        EncodeTimeCB(),
        SanitizeLonLatCB(),
        AddSampleIdCB()
        ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs,
                            dest_fname=fname_out,
                            global_attrs=get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw),
                            verbose=kwargs.get('verbose', False)
                            )
    encoder.encode()
encode(fname_out, verbose=False)
100%|██████████| 11/11 [00:05<00:00, 2.17it/s]
100%|██████████| 11/11 [00:05<00:00, 2.14it/s]
Warning: 4831 missing time value(s) in SEAWATER
decode(fname_in=fname_out, verbose=True)
Saved SEAWATER to ../../_data/output/tepco_SEAWATER.csv
df_output = pd.read_csv("../../_data/output/tepco_SEAWATER.csv")
df_output.head()
| | samplabcode | longitude | latitude | begperiod | station | nuclide_id | activity | unit_id | uncertaint | detection | detection_lim | samptype_id | ref_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 140.60388 | 36.29972 | 2011-10-13 13:21:00 | T-C | 29 | 4000.0 | 1 | NaN | < | 4000.0 | 1 | 679 |
| 1 | NaN | 140.60388 | 36.29972 | 2011-10-13 13:21:00 | T-C | 31 | 6000.0 | 1 | NaN | < | 6000.0 | 1 | 679 |
| 2 | NaN | 140.60388 | 36.29972 | 2011-10-13 13:21:00 | T-C | 33 | 9000.0 | 1 | NaN | < | 9000.0 | 1 | 679 |
| 3 | NaN | 140.60388 | 36.29972 | 2011-10-13 13:23:00 | T-C | 29 | 4000.0 | 1 | NaN | < | 4000.0 | 1 | 679 |
| 4 | NaN | 140.60388 | 36.29972 | 2011-10-13 13:23:00 | T-C | 31 | 6000.0 | 1 | NaN | < | 6000.0 | 1 | 679 |
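A few sanity checks on the decoded CSV can catch round-trip problems early. The sketch below runs them on three rows copied from the preview above rather than on the file itself, so it is self-contained. `check_decoded` is a hypothetical helper, and the rule that `<` rows report the detection limit as their activity is an assumption drawn from the preview, not a documented MARIS rule.

```python
import pandas as pd

# Rows copied from the decoded preview above (subset of columns).
df_output = pd.DataFrame({
    'longitude': [140.60388, 140.60388, 140.60388],
    'latitude': [36.29972, 36.29972, 36.29972],
    'begperiod': ['2011-10-13 13:21:00', '2011-10-13 13:21:00', '2011-10-13 13:21:00'],
    'nuclide_id': [29, 31, 33],
    'activity': [4000.0, 6000.0, 9000.0],
    'detection': ['<', '<', '<'],
    'detection_lim': [4000.0, 6000.0, 9000.0],
})

def check_decoded(df):
    "Hypothetical sanity checks; returns a dict mapping check name to a bool."
    below_dl = df['detection'] == '<'
    return {
        'lon_in_range': bool(df['longitude'].between(-180, 180).all()),
        'lat_in_range': bool(df['latitude'].between(-90, 90).all()),
        'time_parseable': bool(pd.to_datetime(df['begperiod'], errors='coerce').notna().all()),
        # Assumption from the preview: '<' rows report the detection limit as activity.
        'below_dl_matches_limit': bool(
            (df.loc[below_dl, 'activity'] == df.loc[below_dl, 'detection_lim']).all()),
    }

checks = check_decoded(df_output)
print(checks)
```

To run the same checks on the real output, pass the `df_output` frame read from `tepco_SEAWATER.csv` instead of the toy frame.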