Data pipeline (handler) to convert TEPCO dataset (Source) to NetCDF format

Configuration & file paths

Exported source
fname_coastal_water = 'https://radioactivity.nra.go.jp/cont/en/results/sea/coastal_water.csv'
fname_clos1F = 'https://radioactivity.nra.go.jp/cont/en/results/sea/close1F_water.xlsx'
fname_iaea_orbs = 'https://raw.githubusercontent.com/RML-IAEA/iaea.orbs/refs/heads/main/src/iaea/orbs/stations/station_points.csv'

fname_out = '../../_data/output/tepco.nc'

Load data

The data are loaded from the NRA (Nuclear Regulation Authority) website. For the moment, we only process radioactivity concentration data for seawater around the Fukushima Dai-ichi NPP [TEPCO], taken from the coastal_water.csv and close1F_water.xlsx files.

In the near future, MARIS will provide a dedicated handler for all ALPS-related data, including measurements provided not only by TEPCO but also by MOE, NRA, MLIT and Fukushima Prefecture.

Important: FEEDBACK TO DATA PROVIDER

The coastal_water.csv file contains two sections: the measurements and the locations. Below, we identify the line number where the locations begin. A single source of truth for station locations would ease future processing.



find_location_section

 find_location_section (df, col_idx=0, pattern='Sampling point number')

Find the line number where location data begins.

Exported source
def find_location_section(df, 
                          col_idx=0,
                          pattern='Sampling point number'
                          ):
    "Find the line number where location data begins."
    mask = df.iloc[:, col_idx] == pattern
    indices = df[mask].index
    return indices[0] if len(indices) > 0 else -1
find_location_section(pd.read_csv(fname_coastal_water, low_memory=False))
np.int64(29252)
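
As a minimal sketch of how the marker row separates the two sections, the snippet below applies the same lookup to a small in-memory frame. The frame contents and the helper name `find_marker_row` are assumptions made for illustration; they are not taken from the real file.

```python
import pandas as pd

# Toy frame mimicking the two-section layout (values are illustrative only)
df_toy = pd.DataFrame({
    0: ['S-1', 'S-2', 'Sampling point number', 'S-1', 'S-2'],
    1: ['4.8E+01', '3.1E+01', 'latitude', '37.32', '37.24'],
})

def find_marker_row(df, col_idx=0, pattern='Sampling point number'):
    "Return the index label of the first row whose `col_idx` cell equals `pattern`, else -1."
    mask = df.iloc[:, col_idx] == pattern
    indices = df[mask].index
    return indices[0] if len(indices) > 0 else -1

idx = find_marker_row(df_toy)
measurements = df_toy.iloc[:idx]     # rows before the marker
locations = df_toy.iloc[idx + 1:]    # rows after the marker
```

With the default integer index, the returned label doubles as a row position, which is why it can be reused for `skiprows` and `nrows` when re-reading the file.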
Important: FEEDBACK TO DATA PROVIDER

The sampling time must be parsed differently for the coastal_water.csv and close1F_water.xlsx files:

  • coastal_water.csv uses the format YYYY/MM/DD for the sampling date and HH:MM for the sampling time, and
  • close1F_water.xlsx uses the format YYYY-MM-DD HH:MM:SS.
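
To illustrate the two layouts, here is a small sketch that normalises both to a single `datetime`. The helper `parse_sampling_datetime` is hypothetical and is not part of the handler; it assumes a missing time is passed as `None` or an empty string and falls back to midnight.

```python
from datetime import datetime

def parse_sampling_datetime(date_str, time_str=None):
    "Hypothetical helper: normalise both date/time layouts to one datetime."
    # Keep only the date part and unify the separator ('-' -> '/')
    date_part = str(date_str).split(' ')[0].replace('-', '/')
    # Keep hours and minutes only; default to midnight when the time is missing
    time_part = ':'.join(str(time_str).split(':')[:2]) if time_str else '00:00'
    return datetime.strptime(f'{date_part} {time_part}', '%Y/%m/%d %H:%M')

csv_style = parse_sampling_datetime('2011/3/21', '23:15')
xlsx_style = parse_sampling_datetime('2013-08-14 00:00:00', '08:17:00')
```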


fix_sampling_time

 fix_sampling_time (x)
Exported source
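def fix_sampling_time(x):
    # Missing sampling times default to midnight
    if pd.isna(x): 
        return '00:00:00'
    # Keep hours and minutes only; `minute` avoids shadowing the builtin `min`
    hour, minute = x.split(':')[:2]
    # Left-pad single-digit hours (e.g. '7:56' -> '07:56:00')
    return f"{hour.zfill(2)}:{minute}:00"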


get_coastal_water_df

 get_coastal_water_df (fname_coastal_water)

Get the measurements dataframe from the coastal_water.csv file.

Exported source
def get_coastal_water_df(fname_coastal_water):
    "Get the measurements dataframe from the `coastal_water.csv` file."
    
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=1, 
                     nrows=locs_idx - 1,
                     low_memory=False)
    df.dropna(subset=['Sampling point number'], inplace=True)
    df['Sampling time'] = df['Sampling time'].map(fix_sampling_time)
    
    # Use `.str.replace` for substring replacement (`Series.replace` only matches whole values)
    df['TIME'] = df['Sampling date'].str.replace('-', '/') + ' ' + df['Sampling time']
    
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_coastal_water = get_coastal_water_df(fname_coastal_water)
df_coastal_water.tail()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
29219 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.2E+00 NaN NaN NaN NaN NaN 2025/7/17 07:56:00
29220 T-S8 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.8E+00 NaN NaN NaN NaN NaN 2025/7/18 05:34:00
29221 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 7.9E+00 NaN NaN NaN NaN NaN 2025/7/21 08:05:00
29222 T-S3 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 7.3E+00 NaN NaN NaN NaN NaN 2025/7/22 05:54:00
29223 T-S4 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 7.4E+00 NaN NaN NaN NaN NaN 2025/7/22 06:17:00

5 rows × 49 columns

coi = [o for o in df_coastal_water.columns if "134Cs" in o]

df_coastal_water[coi + ['Sampling point number', 'TIME']].head(30)
134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) Sampling point number TIME
0 4.8E+01 9.2E+00 T-3 2011/3/21 23:15:00
1 3.1E+01 8.7E+00 T-4 2011/3/21 23:45:00
2 4.6E+01 1.4E+01 T-3 2011/3/22 14:28:00
3 3.9E+01 1.1E+01 T-4 2011/3/22 15:06:00
4 5.1E+01 2.0E+01 T-3 2011/3/23 13:51:00
5 3.3E+01 2.1E+01 T-4 2011/3/23 14:25:00
6 9.9E+01 3.8E+01 T-3 2011/3/24 09:30:00
7 3.5E+01 7.0E+00 T-4 2011/3/24 08:45:00
8 2.6E+01 7.4E+00 T-3 2011/3/25 10:00:00
9 2.0E+01 6.7E+00 T-4 2011/3/25 09:10:00
10 2.6E+01 1.8E+01 T-3 2011/3/26 15:15:00
11 1.3E+01 7.1E+00 T-4 2011/3/26 15:50:00
12 5.4E+02 1.2E+01 T-3 2011/3/27 14:30:00
13 2.0E+01 6.0E+00 T-4 2011/3/27 08:45:00
14 6.1E+02 2.3E+01 T-3 2011/3/28 09:35:00
15 3.3E+02 2.1E+01 T-4 2011/3/28 08:45:00
16 3.2E+02 1.3E+01 T-3 2011/3/29 10:15:00
17 2.3E+02 1.2E+01 T-4 2011/3/29 09:20:00
18 3.6E+02 2.0E+01 T-3 2011/3/30 10:00:00
19 1.8E+02 2.0E+01 T-4 2011/3/30 09:05:00
20 3.6E+02 2.1E+01 T-3 2011/3/31 10:00:00
21 1.6E+02 2.0E+01 T-4 2011/3/31 09:15:00
22 3.0E+02 1.8E+01 T-3 2011/4/1 09:50:00
23 2.0E+02 1.8E+01 T-4 2011/4/1 09:00:00
24 1.9E+01 1.5E+01 8 2011/4/2 13:35:00
25 1.7E+02 1.7E+01 T-3 2011/4/2 09:55:00
26 5.1E+01 1.7E+01 T-4 2011/4/2 09:00:00
27 2.3E+01 4.9E+00 T-5 2011/4/2 14:03:00
28 NaN NaN T-7 2011/4/2 13:12:00
29 NaN NaN 8 2011/4/3 12:20:00
coi
['134Cs radioactivity concentration (Bq/L)', '134Cs detection limit (Bq/L)']
df_coastal_water.dropna(subset=coi, how='any')[coi + ['Sampling point number', 'TIME']]
134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) Sampling point number TIME
0 4.8E+01 9.2E+00 T-3 2011/3/21 23:15:00
1 3.1E+01 8.7E+00 T-4 2011/3/21 23:45:00
2 4.6E+01 1.4E+01 T-3 2011/3/22 14:28:00
3 3.9E+01 1.1E+01 T-4 2011/3/22 15:06:00
4 5.1E+01 2.0E+01 T-3 2011/3/23 13:51:00
... ... ... ... ...
29209 ND 1.1E-03 T-11 2025/6/27 09:41:00
29210 ND 1.3E-03 T-5 2025/6/27 08:09:00
29211 ND 1.2E-03 T-5 2025/6/27 08:09:00
29212 ND 1.1E-03 T-D9 2025/6/27 09:03:00
29213 ND 1.0E-03 T-D9 2025/6/27 09:03:00

19128 rows × 4 columns

mask = df_coastal_water['134Cs radioactivity concentration (Bq/L)'] == 'ND'
df_coastal_water[mask][coi + ['Sampling point number', 'TIME']]
134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) Sampling point number TIME
53 ND NaN 5 2011/4/6 11:30:00
57 ND NaN 8 2011/4/6 12:52:00
59 ND NaN 10 2011/4/6 13:37:00
64 ND NaN T-7 2011/4/6 12:44:00
65 ND NaN T-7 2011/4/6 13:15:00
... ... ... ... ...
29209 ND 1.1E-03 T-11 2025/6/27 09:41:00
29210 ND 1.3E-03 T-5 2025/6/27 08:09:00
29211 ND 1.2E-03 T-5 2025/6/27 08:09:00
29212 ND 1.1E-03 T-D9 2025/6/27 09:03:00
29213 ND 1.0E-03 T-D9 2025/6/27 09:03:00

19215 rows × 4 columns

len(df_coastal_water)
29224
Important: FEEDBACK TO DATA PROVIDER

Identifying station locations requires three distinct sources:

  • the second section of the coastal_water.csv file
  • the R6zahyo.pdf file, further processed by https://github.com/RML-IAEA/iaea.orbs
  • the second section of every sheet of the close1F_water.xlsx file

All of these files and sheets are required to look up the station locations.



get_locs_coastal_water

 get_locs_coastal_water (fname_coastal_water)
Exported source
def get_locs_coastal_water(fname_coastal_water):
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=locs_idx+1, 
                     low_memory=False).iloc[:, :3]
    
    # Source columns are ordered: station, latitude, longitude
    df.columns = ['STATION', 'LAT', 'LON']
    df.dropna(subset=['LON'], inplace=True)
    df['org'] = 'coastal_seawater.csv'
    return df[['STATION', 'LON', 'LAT', 'org']]
df_locs_coastal_water = get_locs_coastal_water(fname_coastal_water)
print(f'Nb. of stations: {len(df_locs_coastal_water)}')
df_locs_coastal_water.head()
Nb. of stations: 48
STATION LON LAT org
0 T-0 141.04 37.42 coastal_seawater.csv
1 T-11 141.05 37.24 coastal_seawater.csv
2 T-12 141.04 37.15 coastal_seawater.csv
3 T-13-1 141.04 37.64 coastal_seawater.csv
4 T-14 141.06 37.55 coastal_seawater.csv
df_locs_coastal_water.STATION.unique()
array(['T-0', 'T-11', 'T-12', 'T-13-1', 'T-14', 'T-17-1', 'T-18', 'T-20',
       'T-22', 'T-3', 'T-4', 'T-4-1', 'T-4-2', 'T-5', 'T-6', 'T-7', 'T-A',
       'T-B', 'T-B1', 'T-B2', 'T-B3', 'T-B4', 'T-C', 'T-D', 'T-D1',
       'T-D5', 'T-D9', 'T-E', 'T-E1', 'T-Z', 'T-MG6', 'T-S1', 'T-S7',
       'T-H1', 'T-S2', 'T-S6', 'T-M10', 'T-MA', 'T-S3', 'T-S4', 'T-S8',
       'T-MG4', 'T-G4', 'T-MG5', 'T-MG1', 'T-MG0', 'T-MG3', 'T-MG2'],
      dtype=object)
Important: FEEDBACK TO DATA PROVIDER

Data in the close1F_water.xlsx file are spread across several sheets (one per station). Each sheet contains two sections: the measurements and the locations.

For each sheet, we have to identify the line number that separates the measurements from the locations. We then iterate over all sheets and concatenate the results.
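
The per-sheet loop followed by a single concatenation can be sketched on in-memory stand-ins. The sheet names and values below are made up for illustration; only the `dropna` and `pd.concat` pattern mirrors the handler.

```python
import pandas as pd

# Stand-ins for per-sheet frames (sheet names and values are illustrative only)
sheets = {
    'sheet_a': pd.DataFrame({'Sampling point number': ['S-1', None],
                             'value': ['ND', None]}),
    'sheet_b': pd.DataFrame({'Sampling point number': ['S-2'],
                             'value': ['4.7']}),
}

# Drop rows without a station code in each sheet, then stack all sheets
cleaned = {name: df.dropna(subset=['Sampling point number'])
           for name, df in sheets.items()}
combined = pd.concat(cleaned.values(), ignore_index=True)
```

`ignore_index=True` rebuilds a continuous integer index, so row labels from different sheets do not collide.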



get_clos1F_df

 get_clos1F_df (fname_clos1F)

Get measurements dataframe from close1F_water.xlsx file and parse datetime.

Exported source
def get_clos1F_df(fname_clos1F):
    "Get measurements dataframe from close1F_water.xlsx file and parse datetime."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                   sheet_name=sheet_name, 
                   skiprows=1,
                   nrows=locs_idx-1)
        
        df.dropna(subset=['Sampling point number'], inplace=True)
        df['Sampling date'] = df['Sampling date']\
            .astype(str)\
            .apply(lambda x: x.split(' ')[0]\
            .replace('-', '/'))
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True)
    df.dropna(subset=['Sampling date'], inplace=True)
    df['TIME'] = df['Sampling date'] + ' ' + df['Sampling time'].astype(str)
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_clos1F = get_clos1F_df(fname_clos1F)
df_clos1F.head()
100%|██████████| 11/11 [00:05<00:00,  2.17it/s]
Sampling point number 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) Total beta radioactivity concentration (Bq/L) Total beta detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) Collection layer of seawater ... 106Ru detection limit (Bq/L) 60Co radioactivity concentration (Bq/L) 60Co detection limit (Bq/L) 95Zr radioactivity concentration (Bq/L) 95Zr detection limit (Bq/L) 99Mo radioactivity concentration (Bq/L) 99Mo detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) TIME
0 T-0-1 ND 1.5 ND 1.4 ND 18.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN 4.7 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 ND 1.1 ND 1.4 ND 20.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN ND 2.9 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 ND 0.66 ND 0.49 ND 17.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 57 columns

df_clos1F['Sampling point number'].unique()
array(['T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)


get_locs_clos1F

 get_locs_clos1F (fname_clos1F)

Get locations dataframe from each sheet of the close1F_water.xlsx file.

Exported source
def get_locs_clos1F(fname_clos1F):
    "Get locations dataframe from each sheet of the close1F_water.xlsx file."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                           sheet_name=sheet_name, 
                           skiprows=locs_idx+2)
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True).iloc[:, :3]
    df.dropna(subset=['Sampling coordinate North latitude (Decimal)'], inplace=True)    
    # Source columns are ordered: station, latitude, longitude
    df.columns = ['STATION', 'LAT', 'LON']
    df['org'] = 'close1F.csv'
    return df[['STATION', 'LON', 'LAT', 'org']]
df_locs_clos1F = get_locs_clos1F(fname_clos1F)
print(f'Nb. of stations: {len(df_locs_clos1F)}')
df_locs_clos1F.head()
100%|██████████| 11/11 [00:05<00:00,  1.97it/s]
Nb. of stations: 11
STATION LON LAT org
0 T-0-1 141.04 37.43 close1F.csv
11 T-0-1A 141.05 37.43 close1F.csv
22 T-0-2 141.05 37.42 close1F.csv
33 T-0-3 141.04 37.42 close1F.csv
44 T-0-3A 141.05 37.42 close1F.csv
Important: FEEDBACK TO DATA PROVIDER

The close1F_water.xlsx file contains station locations that are not present in the coastal_water.csv dataset, as demonstrated in the comparison below:

set(df_locs_clos1F.STATION) - set(df_locs_coastal_water.STATION)
{'T-0-1',
 'T-0-1A',
 'T-0-2',
 'T-0-3',
 'T-0-3A',
 'T-1',
 'T-2',
 'T-2-1',
 'T-A1',
 'T-A2',
 'T-A3'}
Important: FEEDBACK TO DATA PROVIDER

In theory, all locations should be provided in the R6zahyo.pdf file. This file is further processed by https://github.com/RML-IAEA/iaea.orbs and the result is provided in the station_points.csv file.

However, this file does not cover all locations referenced in the coastal_water.csv and close1F_water.xlsx files, and it also contains additional locations present in neither (see below). A more standardized and comprehensive location reference system would significantly improve the efficiency and reliability of the data ingestion process.
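
A two-way coverage check of this kind can be sketched with plain sets. The helper `coverage_report` and the station codes below are hypothetical and chosen only for illustration; they do not reflect the actual content of the files.

```python
def coverage_report(reference, *others):
    "Hypothetical helper comparing a reference station set with the data files."
    used = set().union(*others)
    return {
        # Stations used in the data but absent from the reference
        'missing_from_reference': sorted(used - set(reference)),
        # Stations listed in the reference but never used in the data
        'unused_in_reference': sorted(set(reference) - used),
    }

# Made-up station codes: one reference set and two data-file sets
report = coverage_report({'S-1', 'S-5', 'X-9'}, {'S-1', 'S-2'}, {'S-3'})
```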



get_locs_orbs

 get_locs_orbs (fname_iaea_orbs)
Exported source
def get_locs_orbs(fname_iaea_orbs):
    df = pd.read_csv(fname_iaea_orbs)
    df.columns = ['org', 'STATION', 'LON', 'LAT']
    return df
df_locs_orbs = get_locs_orbs(fname_iaea_orbs)
df_locs_orbs.head()
org STATION LON LAT
0 MOE E-31 141.727667 39.059167
1 MOE E-32 141.635667 38.996000
2 MOE E-37 141.948611 39.259167
3 MOE E-38 141.755000 39.008333
4 MOE E-39 141.766667 38.991667
set(df_locs_orbs.STATION) - (set(df_locs_clos1F.STATION) | set(df_locs_coastal_water.STATION))
{'C-P1',
 'C-P2',
 'C-P3',
 'C-P4',
 'C-P5',
 'C-P8',
 'E-31',
 'E-32',
 'E-37',
 'E-38',
 'E-39',
 'E-3A',
 'E-41',
 'E-42',
 'E-43',
 'E-44',
 'E-45',
 'E-46',
 'E-47',
 'E-48',
 'E-49',
 'E-4A',
 'E-4B',
 'E-4C',
 'E-4F',
 'E-4G',
 'E-4H',
 'E-4J',
 'E-4K',
 'E-4L',
 'E-4M',
 'E-71',
 'E-72',
 'E-73',
 'E-74',
 'E-75',
 'E-76',
 'E-77',
 'E-78',
 'E-79',
 'E-7A',
 'E-7B',
 'E-7C',
 'E-7D',
 'E-7F',
 'E-7G',
 'E-7H',
 'E-7I',
 'E-7J',
 'E-7K',
 'E-7L',
 'E-81',
 'E-82',
 'E-83',
 'E-84',
 'E-85',
 'E-S1',
 'E-S10',
 'E-S13',
 'E-S14',
 'E-S15',
 'E-S17',
 'E-S18',
 'E-S19',
 'E-S20',
 'E-S21',
 'E-S22',
 'E-S23',
 'E-S24',
 'E-S25',
 'E-S26',
 'E-S27',
 'E-S28',
 'E-S29',
 'E-S3',
 'E-S30',
 'E-S31',
 'E-S32',
 'E-S33',
 'E-S34',
 'E-S35',
 'E-S36',
 'E-S4',
 'E-S5',
 'E-T1',
 'E-T2',
 'E-T3',
 'E-T4',
 'E-T5',
 'E-T6',
 'E-T7',
 'E-T8',
 'F-P01',
 'F-P02',
 'F-P03',
 'F-P04',
 'F-P05',
 'F-P06',
 'F-P07',
 'F-P08',
 'F-P09',
 'F-P10',
 'F-P11',
 'F-P12',
 'F-P13',
 'F-P14',
 'F-P15',
 'F-P29',
 'F-P30',
 'F-P31',
 'F-P32',
 'F-P33',
 'F-P34',
 'F-P35',
 'F-P37',
 'F-P38',
 'F-P39',
 'F-P40',
 'F-P41',
 'F-P42',
 'F-P43',
 'F-P45',
 'F-P46',
 'F-P47',
 'F-P48',
 'F-P49',
 'F-P50',
 'F-P51',
 'F-P52',
 'F-P53',
 'F-P54',
 'F-P55',
 'F-P56',
 'F-P57',
 'F-P58',
 'F-P59',
 'F-P60',
 'F-P61',
 'F-P62',
 'F-P63',
 'F-P64',
 'F-P65',
 'F-P66',
 'F-P67',
 'F-P68',
 'F-P69',
 'F-P70',
 'F-P71',
 'F-P72',
 'F-P73',
 'F-P74',
 'F-P75',
 'F-P76',
 'F-P77',
 'F-P78',
 'F-P79',
 'F-P80',
 'F-P81',
 'F-P82',
 'F-P83',
 'K-T1',
 'K-T2',
 'KK-U1',
 'M-10',
 'M-101',
 'M-102',
 'M-103',
 'M-104',
 'M-11',
 'M-14',
 'M-15',
 'M-19',
 'M-20',
 'M-21',
 'M-25',
 'M-26',
 'M-27',
 'M-A1',
 'M-A3',
 'M-B1',
 'M-B5',
 'M-C1',
 'M-C10',
 'M-C2',
 'M-C3',
 'M-C4',
 'M-C6',
 'M-C7',
 'M-C8',
 'M-C9',
 'M-D1',
 'M-D3',
 'M-E1',
 'M-E3',
 'M-E5',
 'M-F1',
 'M-F3',
 'M-G0',
 'M-G1',
 'M-G3',
 'M-G4',
 'M-H1',
 'M-H3',
 'M-I0',
 'M-I1',
 'M-I3',
 'M-IB2',
 'M-IB4',
 'M-J1',
 'M-J3',
 'M-K1',
 'M-L1',
 'M-L3',
 'M-M1',
 'M-MI4',
 'T-S5',
 'T-①',
 'T-②',
 'T-③',
 'T-④',
 'T-⑤',
 'T-⑥',
 'T-⑦',
 'T-⑧',
 'T-⑨',
 'T-⑩',
 'T-⑪',
 'T-⑫',
 'T-⑬'}


concat_locs

 concat_locs (dfs)

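Concatenate location tables and drop duplicate stations, keeping iaea.orbs entries first, then coastal_seawater.csv, then close1F.csv.

Exported source
def concat_locs(dfs):
    "Concatenate location tables and drop duplicate stations, keeping iaea.orbs entries first, then coastal_seawater.csv, then close1F.csv."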
    df = pd.concat(dfs)
    # Group by org to be used for sorting
    df['org_grp'] = df['org'].apply(
        lambda x: 1 if x == 'coastal_seawater.csv' else 2 if x == 'close1F.csv' else 0)
    df.sort_values('org_grp', ascending=True, inplace=True)
    # Drop duplicates and keep orbs data first
    df.drop_duplicates(subset='STATION', keep='first', inplace=True)
    df.drop(columns=['org_grp'], inplace=True)
    df.sort_values('STATION', ascending=True, inplace=True)
    return df
df_locs = concat_locs([df_locs_clos1F, df_locs_coastal_water, df_locs_orbs])
df_locs.head()
STATION LON LAT org
214 C-P1 139.863333 35.425000 NRA
215 C-P2 139.863333 35.401667 NRA
216 C-P3 139.881667 35.370000 NRA
217 C-P4 139.846667 35.356667 NRA
218 C-P5 139.800000 35.343333 NRA
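
The priority rule (sort by source rank, then keep the first row per station) can be checked on a toy table. The rows, coordinates and the `priority` mapping below are assumptions for illustration; the sketch uses `map` plus a stable sort rather than the lambda shown above.

```python
import pandas as pd

# Illustrative rows only; coordinates are made up
df = pd.DataFrame({
    'STATION': ['S-1', 'S-1', 'S-2'],
    'LON': [141.03, 141.00, 141.05],
    'LAT': [37.32, 37.30, 37.43],
    'org': ['TEPCO', 'coastal_seawater.csv', 'close1F.csv'],
})

# Lower rank wins; any org not listed here (i.e. iaea.orbs entries) gets rank 0
priority = {'coastal_seawater.csv': 1, 'close1F.csv': 2}

deduped = (df.assign(org_grp=df['org'].map(priority).fillna(0))
             .sort_values('org_grp', kind='stable')
             .drop_duplicates(subset='STATION', keep='first')
             .drop(columns='org_grp')
             .sort_values('STATION'))
```

Station `S-1` appears twice, and the row whose `org` is not one of the two file names is the one that survives.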


align_dfs

 align_dfs (df_from, df_to)

Align columns structure of df_from to df_to.

Exported source
def align_dfs(df_from, df_to):
    "Align columns structure of df_from to df_to."
    df = {}  # plain dict: column name -> values (NaN when absent from df_from)
    for c in df_to.columns:
        df[c] = df_from[c].values if c in df_from.columns else np.nan
    return pd.DataFrame(df)
align_dfs(df_clos1F, df_coastal_water).head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-0-1 NaN NaN NaN ND 1.5 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 4.7 NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 NaN NaN NaN ND 1.1 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 2.9 NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 NaN NaN NaN ND 0.66 ND 0.49 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 49 columns
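
For comparison, a similar alignment can be obtained with `DataFrame.reindex`. This is an alternative sketch on made-up frames, not the function defined above: `reindex(columns=...)` keeps the target column order, fills columns missing from the source with NaN, and drops columns the target does not have.

```python
import pandas as pd

# Toy frames with partially overlapping columns (names are illustrative)
df_to = pd.DataFrame({'station': ['A'], 'cs134': ['ND'], 'h3': ['4.7']})
df_from = pd.DataFrame({'station': ['B', 'C'], 'h3': ['ND', '2.9'], 'extra': [1, 2]})

# Same column structure as df_to, same rows as df_from
aligned = df_from.reindex(columns=df_to.columns)
```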



concat_dfs

 concat_dfs (df_coastal_water, df_clos1F)

Align close1F_water.xlsx measurements to the coastal_water.csv column structure and concatenate both.

Exported source
def concat_dfs(df_coastal_water, df_clos1F):
    "Align close1F_water.xlsx measurements to the coastal_water.csv column structure and concatenate both."
    df_clos1F = align_dfs(df_clos1F, df_coastal_water)
    df = pd.concat([df_coastal_water, df_clos1F])
    return df
df_meas = concat_dfs(df_coastal_water, df_clos1F)
df_meas.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:15:00
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:45:00
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 14:28:00
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 15:06:00
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN NaN NaN NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00

5 rows × 49 columns



georef_data

 georef_data (df_meas, df_locs)

Georeference measurements dataframe using locations dataframe.

Exported source
def georef_data(df_meas, df_locs):
    "Georeference measurements dataframe using locations dataframe."
    assert "Sampling point number" in df_meas.columns and "STATION" in df_locs.columns
    return pd.merge(df_meas, df_locs, how="inner", 
                    left_on='Sampling point number', right_on='STATION')
df_meas_georef = georef_data(df_meas, df_locs)
df_meas_georef.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns
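
Because the join is an inner merge, measurements whose station has no known location are dropped silently. One way to list them, sketched here on made-up frames, is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy measurements and locations (station codes and values are illustrative)
df_meas = pd.DataFrame({'Sampling point number': ['S-1', 'S-2', 'S-9'],
                        'value': [1.0, 2.0, 3.0]})
df_locs = pd.DataFrame({'STATION': ['S-1', 'S-2'],
                        'LON': [141.03, 141.01],
                        'LAT': [37.32, 37.24]})

merged = pd.merge(df_meas, df_locs, how='left',
                  left_on='Sampling point number', right_on='STATION',
                  indicator=True)

# Rows flagged 'left_only' have no matching station in df_locs
unmatched = (merged.loc[merged['_merge'] == 'left_only', 'Sampling point number']
                   .unique().tolist())
```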



load_data

 load_data (fname_coastal_water, fname_clos1F, fname_iaea_orbs)

Load, align and georeference TEPCO data

Exported source
def load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs):
    "Load, align and georeference TEPCO data"
    df_locs = concat_locs(
        [get_locs_coastal_water(fname_coastal_water), 
         get_locs_clos1F(fname_clos1F),
         get_locs_orbs(fname_iaea_orbs)])
    df_meas = concat_dfs(get_coastal_water_df(fname_coastal_water), get_clos1F_df(fname_clos1F))
    df_meas.dropna(subset=['Sampling point number'], inplace=True)
    return {'SEAWATER': georef_data(df_meas, df_locs)}
dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
dfs['SEAWATER'].head()
100%|██████████| 11/11 [00:05<00:00,  2.14it/s]
100%|██████████| 11/11 [00:05<00:00,  2.14it/s]
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

print(f"# of rows, cols: {dfs['SEAWATER'].shape}")
dfs['SEAWATER'].head()
# of rows, cols: (49863, 53)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

dfs['SEAWATER'].STATION.unique()
array(['T-3', 'T-4', 'T-5', 'T-7', 'T-11', 'T-12', 'T-14', 'T-18', 'T-20',
       'T-22', 'T-MA', 'T-M10', 'T-A', 'T-D', 'T-E', 'T-B', 'T-C',
       'T-MG1', 'T-MG2', 'T-MG3', 'T-MG4', 'T-MG5', 'T-MG6', 'T-D1',
       'T-D5', 'T-D9', 'T-E1', 'T-G4', 'T-H1', 'T-S5', 'T-S6', 'T-17-1',
       'T-B3', 'T-13-1', 'T-S3', 'T-S4', 'T-B4', 'T-S1', 'T-S2', 'T-MG0',
       'T-Z', 'T-B1', 'T-B2', 'T-S7', 'T-S8', 'T-0', 'T-4-1', 'T-4-2',
       'T-6', 'T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)
np.sum(dfs['SEAWATER'] == "ND")
Sampling point number                                 0
Collection layer of seawater                          0
131I radioactivity concentration (Bq/L)            8642
131I detection limit (Bq/L)                           0
134Cs radioactivity concentration (Bq/L)          30967
134Cs detection limit (Bq/L)                          0
137Cs radioactivity concentration (Bq/L)          17232
137Cs detection limit (Bq/L)                          0
132I radioactivity concentration (Bq/L)               3
132I detection limit (Bq/L)                           0
132Te radioactivity concentration (Bq/L)              0
132Te detection limit (Bq/L)                          0
136Cs radioactivity concentration (Bq/L)              2
136Cs detection limit (Bq/L)                          0
140La radioactivity concentration (Bq/L)              0
140La detection limit (Bq/L)                          0
89Sr radioactivity concentration (Bq/L)             101
89Sr detection limit (Bq/L)                           0
90Sr radioactivity concentration (Bq/L)             344
90Sr detection limit (Bq/L)                           0
238Pu radioactivity concentration (Bq/L)            309
238Pu detection limit (Bq/L)                          0
239Pu+240Pu radioactivity concentration (Bq/L)      231
239Pu+240Pu statistical error (Bq/L)                  0
239Pu+240Pu detection limit (Bq/L)                    0
Total alpha radioactivity concentration (Bq/L)      983
Total alpha detection limit (Bq/L)                    0
Total beta radioactivity concentration (Bq/L)      4919
Total beta detection limit (Bq/L)                     0
140Ba radioactivity concentration (Bq/L)              0
140Ba detection limit (Bq/L)                          0
106Ru radioactivity concentration (Bq/L)              0
106Ru detection limit (Bq/L)                          0
58Co radioactivity concentration (Bq/L)               3
58Co detection limit (Bq/L)                           0
60Co radioactivity concentration (Bq/L)               9
60Co detection limit (Bq/L)                           0
144Ce radioactivity concentration (Bq/L)              9
144Ce detection limit (Bq/L)                          0
54Mn radioactivity concentration (Bq/L)               9
54Mn detection limit (Bq/L)                           0
3H radioactivity concentration (Bq/L)              9657
3H detection limit (Bq/L)                             0
125Sb radioactivity concentration (Bq/L)            647
125Sb detection limit (Bq/L)                          0
105Ru radioactivity concentration (Bq/L)              0
105Ru detection limit (Bq/L)                          0
Unnamed: 49                                           0
TIME                                                  0
STATION                                               0
LON                                                   0
LAT                                                   0
org                                                   0
dtype: int64
dfs['SEAWATER'][['TIME', '134Cs radioactivity concentration (Bq/L)', '134Cs detection limit (Bq/L)']]
TIME 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L)
0 2011/3/21 23:15:00 4.8E+01 9.2E+00
1 2011/3/21 23:45:00 3.1E+01 8.7E+00
2 2011/3/22 14:28:00 4.6E+01 1.4E+01
3 2011/3/22 15:06:00 3.9E+01 1.1E+01
4 2011/3/23 13:51:00 5.1E+01 2.0E+01
... ... ... ...
49858 2025/06/30 08:05 ND 0.4
49859 2025/07/07 08:36 ND 0.37
49860 2025/07/17 08:11 ND 0.29
49861 2025/07/21 08:20 ND 0.36
49862 2025/07/24 07:39 NaN NaN

49863 rows × 3 columns

Remove 約 (about) character

Important: FEEDBACK TO DATA PROVIDER

We systematically remove the 約 character. Please confirm that this is the correct way to handle it. Reporting an explicit uncertainty instead would be less ambiguous in the future.



RemoveJapanaseCharCB

 RemoveJapanaseCharCB ()

Remove 約 (about) char

Exported source
class RemoveJapanaseCharCB(Callback):
    "Remove 約 (about) char"
    def _transform_if_about(self, value, about_char='約'):
        # Only strings can carry the marker; NaN and numeric values pass through
        if not isinstance(value, str): return value
        return value.replace(about_char, '')
    
    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_about)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB()])

tfm()['SEAWATER'].sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
24196 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2024/1/6 08:28:00 T-D5 141.072222 37.416667 TEPCO
3988 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2012/10/16 09:40:00 T-D5 141.072222 37.416667 TEPCO
9505 T-5 上層 NaN NaN ND 1.8E-03 3.9E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2015/7/6 08:22:00 T-5 141.200000 37.416667 TEPCO
35102 T-1 上層 ND 0.49 ND 1.2 ND 1.5 NaN NaN ... NaN NaN NaN NaN NaN 2012/05/31 08:50:00 T-1 141.034444 37.431111 TEPCO
26240 T-14 下層 NaN NaN ND 1.4E-03 2.0E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2025/1/6 07:40:00 T-14 141.062500 37.552778 TEPCO
27384 T-0-1 NaN NaN NaN ND 0.71 ND 0.68 NaN NaN ... NaN NaN NaN NaN NaN 2014/06/24 09:53:00 T-0-1 141.040278 37.430556 TEPCO
39086 T-1 上層 NaN NaN ND 0.78 ND 0.58 NaN NaN ... NaN NaN NaN NaN NaN 2020/12/24 07:45:00 T-1 141.034444 37.431111 TEPCO
18880 T-D9 下層 NaN NaN ND 1.3E-03 4.2E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2020/11/24 09:38:00 T-D9 141.072167 37.333333 TEPCO
17717 T-MG1 上層 NaN NaN ND 1.2E-03 3.9E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2020/3/19 10:30:00 T-MG1 141.283333 38.333333 TEPCO
17990 T-5 下層 NaN NaN ND 1.3E-03 1.9E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2020/5/18 07:34:00 T-5 141.200000 37.416667 TEPCO

10 rows × 53 columns
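
As a hedged illustration of what this cleaning step does to a single cell, the sketch below strips the 約 marker from a string before converting it to a number. The helper name `strip_about` and the sample value '約1.2E+01' are assumptions made for this example; they are not taken from the TEPCO files.

```python
def strip_about(value, about_char='約'):
    """Return `value` without the 'about' marker; non-strings pass through."""
    if not isinstance(value, str):
        return value
    return value.replace(about_char, '')

# Hypothetical cell content, chosen for illustration only.
cleaned = strip_about('約1.2E+01')
print(cleaned)         # 1.2E+01
print(float(cleaned))  # 12.0

# Strings without the marker are left untouched.
print(strip_about('ND'))  # ND
```

Note that dropping the marker discards the information that the value was approximate, which is why the feedback above asks the provider to confirm the approach.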

Fix values range string

ImportantFEEDBACK TO DATA PROVIDER

Value ranges are provided as strings (e.g. ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’). We replace them with their mean. Please confirm that this is the correct way to handle them. Again, reporting an explicit uncertainty would be less ambiguous in future.


source

FixRangeValueStringCB

 FixRangeValueStringCB ()

Replace range values (e.g ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’) by their mean

Exported source
class FixRangeValueStringCB(Callback):
    "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean"
    
    def _extract_and_calculate_mean(self, s):
        # For scientific notation ranges
        float_strings = re.findall(r"[+-]?\d+\.?\d*E?[+-]?\d*", s)
        if float_strings:
            float_numbers = np.array(float_strings, dtype=float)
            return float_numbers.mean()
        return s
    
    def _transform_if_range(self, value):
        if pd.isna(value): 
            return value
        value = str(value)
        # Check for both range patterns
        if '<&<' in value or '~' in value:
            return self._extract_and_calculate_mean(value)
        return value

    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns 
                       if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            # tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range).astype(float)
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
12202 T-14 上層 NaN NaN ND 1.3E-03 6.6E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2016/12/13 08:44:00 T-14 141.062500 37.552778 TEPCO
24157 T-18 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2023/12/25 09:21:00 T-18 140.922222 36.905556 TEPCO
39939 T-1 上層 NaN NaN ND 0.69 ND 0.81 NaN NaN ... NaN NaN NaN NaN NaN 2022/10/05 07:40:00 T-1 141.034444 37.431111 TEPCO
30050 T-0-1A NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2024/05/25 07:00:00 T-0-1A 141.046667 37.430556 TEPCO
34441 T-0-3A NaN NaN NaN ND 0.39 ND 0.34 NaN NaN ... NaN NaN NaN NaN NaN 2024/07/15 07:45:00 T-0-3A 141.046667 37.416111 TEPCO
4229 T-MG5 上層 NaN NaN 5.0E-03 NaN 8.9E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2012/11/22 09:11:00 T-MG5 141.250000 38.166667 TEPCO
29278 T-0-1A NaN NaN NaN ND 0.81 ND 0.65 NaN NaN ... NaN NaN NaN NaN NaN 2018/04/24 06:55:00 T-0-1A 141.046667 37.430556 TEPCO
40055 T-1 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2023/01/02 07:57:00 T-1 141.034444 37.431111 TEPCO
38436 T-1 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2019/08/19 07:55:00 T-1 141.034444 37.431111 TEPCO
22261 T-D9 上層 NaN NaN ND 1.2E-03 2.0E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2022/11/15 09:01:00 T-D9 141.072167 37.333333 TEPCO

10 rows × 53 columns
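
To make the averaging concrete, here is a minimal standalone sketch that applies the same regular expression as `FixRangeValueStringCB` to the two example strings quoted above. The function name `range_mean` is an assumption for this illustration; it uses only the standard library rather than NumPy.

```python
import re

# Same number pattern as in FixRangeValueStringCB.
NUMBER = r"[+-]?\d+\.?\d*E?[+-]?\d*"

def range_mean(s):
    """Mean of all numbers found in `s`; returns `s` unchanged if none are found."""
    numbers = [float(x) for x in re.findall(NUMBER, s)]
    return sum(numbers) / len(numbers) if numbers else s

for s in ['4.0E+00<&<8.0E+00', '1.0~2.7']:
    print(s, '->', range_mean(s))
```

The first string yields the midpoint of 4.0 and 8.0, the second the midpoint of 1.0 and 2.7. Because the pattern accepts a leading sign, a range written with a hyphen (e.g. '1.0-2.7') would be read as 1.0 and -2.7, so this approach is only suitable for the two separators handled by the callback.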

Select columns of interest

We keep the common columns (location, time, station) together with the radionuclide columns, identified by the ‘(Bq/L)’ pattern in their name.


source

SelectColsOfInterestCB

 SelectColsOfInterestCB (common_coi, nuclides_pattern)

Select columns of interest.

Exported source
common_coi = ['LON', 'LAT', 'TIME', 'STATION']
nuclides_pattern = '(Bq/L)'
Exported source
class SelectColsOfInterestCB(Callback):
    "Select columns of interest."
    def __init__(self, common_coi, nuclides_pattern): fc.store_attr()
    def __call__(self, tfm):
        # Use the pattern stored on the instance rather than the module-level global
        nuc_of_interest = [c for c in tfm.dfs['SEAWATER'].columns if self.nuclides_pattern in c]
        tfm.dfs['SEAWATER'] = tfm.dfs['SEAWATER'][self.common_coi + nuc_of_interest]
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) ... 144Ce radioactivity concentration (Bq/L) 144Ce detection limit (Bq/L) 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L)
46500 141.033611 37.415833 2025/05/12 07:50 T-2 NaN NaN ND 0.67 ND 0.82 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
34387 141.046667 37.416111 2024/03/04 08:22:00 T-0-3A NaN NaN ND 0.36 ND 0.23 ... NaN NaN NaN NaN ND 9.0 NaN NaN NaN NaN
45183 141.033611 37.415833 2022/11/13 08:47:00 T-2 NaN NaN ND 0.74 ND 0.67 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
44254 141.033611 37.415833 2021/01/28 07:05:00 T-2 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN
12995 141.666667 38.300000 2017/6/6 08:25:00 T-MG2 NaN NaN ND 1.4E-03 1.6E-03 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 49 columns

Reshape: wide to long

We reshape from wide to long so that the information embedded in the column names (nuclide name, unit, and value type such as detection limit or uncertainty) can be extracted.


source

WideToLongCB

 WideToLongCB (id_vars=['LON', 'LAT', 'TIME', 'STATION'])

Get TEPCO nuclide names as values not column names to extract contained information (nuclide name, unc, dl, …).

Exported source
class WideToLongCB(Callback):
    """
    Get TEPCO nuclide names as values not column names 
    to extract contained information (nuclide name, unc, dl, ...).
    """
    def __init__(self, id_vars=['LON', 'LAT', 'TIME', 'STATION']): 
        fc.store_attr()
        
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'] = pd.melt(tfm.dfs['SEAWATER'], id_vars=self.id_vars)
#| eval: false
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.head()
LON LAT TIME STATION variable value
0 141.026389 37.322222 2011/3/21 23:15:00 T-3 131I radioactivity concentration (Bq/L) 1.1E+03
1 141.013889 37.241667 2011/3/21 23:45:00 T-4 131I radioactivity concentration (Bq/L) 6.6E+02
2 141.026389 37.322222 2011/3/22 14:28:00 T-3 131I radioactivity concentration (Bq/L) 1.1E+03
3 141.013889 37.241667 2011/3/22 15:06:00 T-4 131I radioactivity concentration (Bq/L) 6.7E+02
4 141.026389 37.322222 2011/3/23 13:51:00 T-3 131I radioactivity concentration (Bq/L) 7.4E+02
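
As a small, self-contained illustration of this reshape, the sketch below melts a toy wide table. The station names and values are made up for the example, and pandas is assumed to be installed.

```python
import pandas as pd

# Toy wide table; the values are invented for illustration.
wide = pd.DataFrame({
    'STATION': ['T-1', 'T-2'],
    '134Cs radioactivity concentration (Bq/L)': ['ND', '1.4'],
    '134Cs detection limit (Bq/L)': ['0.5', None],
})

# One row per (station, original column name) pair.
long = pd.melt(wide, id_vars=['STATION'])
print(long)
```

Each original column name ends up in the `variable` column and its content in `value`, which is what allows the nuclide name, unit and value type to be parsed from `variable` afterwards.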

Extract

The nuclide name, detection limit (DL), uncertainty (UNC), … are embedded in the column names of the TEPCO data source; we extract them below.

Nuclide name


source

extract_nuclide

 extract_nuclide (text:str)

Extract the nuclide identifier from a measurement variable name using regex.

Exported source
def extract_nuclide(text: str) -> str:
    "Extract the nuclide identifier from a measurement variable name using regex."
    pattern = r'^(Total\s+(?:alpha|beta)|[^\s]+)'
    match = re.match(pattern, text, re.IGNORECASE)
    return match.group(1) if match else text

For instance:

print(extract_nuclide("Total alpha radioactivity concentration (Bq/L)"))
print(extract_nuclide("131I radioactivity concentration (Bq/L)"))
Total alpha
131I

source

ExtractNuclideNameCB

 ExtractNuclideNameCB (src_col='variable', dest_col='NUCLIDE')

Extract nuclide name from TEPCO data.

Exported source
class ExtractNuclideNameCB(Callback):
    "Extract nuclide name from TEPCO data."
    def __init__(self, src_col='variable', dest_col='NUCLIDE'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'][self.dest_col] = tfm.dfs['SEAWATER'][self.src_col].map(extract_nuclide)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION variable value NUCLIDE
854848 141.200000 37.416667 2014/5/20 08:34:00 T-5 90Sr detection limit (Bq/L) NaN 90Sr
125473 141.026389 37.322222 2024/10/15 12:40:00 T-3 134Cs radioactivity concentration (Bq/L) NaN 134Cs
1684112 141.034444 37.431111 2020/01/15 08:00:00 T-1 60Co radioactivity concentration (Bq/L) NaN 60Co
869803 140.702222 35.987500 2022/10/21 13:08:00 T-D 90Sr detection limit (Bq/L) NaN 90Sr
366571 141.250000 38.166667 2020/2/7 09:22:00 T-MG5 132I detection limit (Bq/L) NaN 132I

Unit


source

ExtractUnitCB

 ExtractUnitCB (src_col='variable', dest_col='UNIT')

Extract unit from TEPCO data.

Exported source
class ExtractUnitCB(Callback):
    "Extract unit from TEPCO data."
    def __init__(self, src_col='variable', dest_col='UNIT'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'][self.dest_col] = tfm.dfs['SEAWATER'][self.src_col].str.extract(r'\((.*?)\)')
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION variable value NUCLIDE UNIT
938328 141.034444 37.431111 2024/07/08 07:51:00 T-1 238Pu radioactivity concentration (Bq/L) NaN 238Pu Bq/L
2221110 141.072222 37.500000 2025/6/2 08:36:00 T-D1 105Ru detection limit (Bq/L) NaN 105Ru Bq/L
1951908 141.078889 37.458333 2014/5/29 06:07:00 T-S3 3H radioactivity concentration (Bq/L) NaN 3H Bq/L
1077687 141.046667 37.423333 2016/02/01 08:16 T-0-2 239Pu+240Pu statistical error (Bq/L) NaN 239Pu+240Pu Bq/L
969934 141.233333 37.516667 2023/1/26 07:26:00 T-B2 238Pu detection limit (Bq/L) NaN 238Pu Bq/L
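
For a single column name, the unit extraction can be sketched with the standard `re` module, using the same pattern as `ExtractUnitCB`. The helper name `extract_unit` is an assumption for this example.

```python
import re

def extract_unit(name):
    """Return the text inside the first pair of parentheses, or None."""
    match = re.search(r'\((.*?)\)', name)
    return match.group(1) if match else None

print(extract_unit('238Pu detection limit (Bq/L)'))  # Bq/L
print(extract_unit('TIME'))                          # None
```

The non-greedy `.*?` stops at the first closing parenthesis, so only the first parenthesised group of a name is returned.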

Value type

Is it a measurement, or a derived quantity such as a detection limit or an uncertainty?


source

ExtractValueTypeCB

 ExtractValueTypeCB (src_col='variable', dest_col='type')

Extract value type from TEPCO data.

Exported source
class ExtractValueTypeCB(Callback):
    "Extract value type from TEPCO data."
    def __init__(self, src_col='variable', dest_col='type'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'][self.dest_col] = np.select(
            [
                tfm.dfs['SEAWATER'][self.src_col].str.contains('detection limit', case=False),
                tfm.dfs['SEAWATER'][self.src_col].str.contains('statistical error', case=False)],
            ['DL', 'UNC'],
            default='VALUE'
        )
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION variable value NUCLIDE UNIT type
1315127 141.000000 36.966667 2020/10/15 10:53:00 T-20 Total beta detection limit (Bq/L) NaN Total beta Bq/L DL
1839912 141.033611 37.415833 2022/03/19 09:05:00 T-2 144Ce detection limit (Bq/L) NaN 144Ce Bq/L DL
1842443 37.410000 141.030000 2014/08/11 05:35:00 T-2-1 144Ce detection limit (Bq/L) NaN 144Ce Bq/L DL
36817 141.034444 37.431111 2016/03/20 07:45 T-1 131I radioactivity concentration (Bq/L) ND 131I Bq/L VALUE
171526 141.072222 37.416667 2022/9/5 08:09:00 T-D5 134Cs detection limit (Bq/L) 1.4E-03 134Cs Bq/L DL
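
The classification can be sketched in plain Python for a single column name. This mirrors the order of the conditions passed to `np.select` (detection limit first, then statistical error, otherwise a value); the function name `value_type` is an assumption for this example.

```python
def value_type(name):
    """Classify a column name as 'DL', 'UNC' or 'VALUE' (case-insensitive)."""
    lowered = name.lower()
    if 'detection limit' in lowered:
        return 'DL'
    if 'statistical error' in lowered:
        return 'UNC'
    return 'VALUE'

for name in ['134Cs detection limit (Bq/L)',
             '239Pu+240Pu statistical error (Bq/L)',
             '131I radioactivity concentration (Bq/L)']:
    print(name, '->', value_type(name))
```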

Reshape: long to wide

Pivot the values of the type column (VALUE, DL, UNC) into column names.


source

LongToWideCB

 LongToWideCB (src_col='variable', dest_col='type')

Reshape: long to wide

Exported source
class LongToWideCB(Callback):
    "Reshape: long to wide"
    def __init__(self, src_col='variable', dest_col='type'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'] = pd.pivot_table(
            tfm.dfs['SEAWATER'],
            values='value',
            index=['LON', 'LAT', 'TIME', 'STATION', 'NUCLIDE', 'UNIT'],
            columns='type',
            aggfunc='first'
        ).reset_index()
        tfm.dfs['SEAWATER'].reset_index(inplace=True)
        tfm.dfs['SEAWATER'].rename(columns={'index': 'SMP_ID'}, inplace=True)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
83649 83649 141.200000 37.233333 2011/8/26 07:40:00 T-7 134Cs Bq/L 1.2E+01 NaN ND
37560 37560 141.034444 37.431111 2013/10/27 06:45:00 T-1 137Cs Bq/L NaN NaN 1.4
18204 18204 141.026389 37.322222 2014/8/5 10:10:00 T-3 Total beta Bq/L 1.7E+01 NaN ND
65718 65718 141.046667 37.430556 2022/10/03 07:10:00 T-0-1A 134Cs Bq/L 0.31 NaN ND
69961 69961 141.050761 37.424686 2025/07/07 07:35 T-A2 3H Bq/L 9.4 NaN ND
df_test[df_test.VALUE == 'ND'].groupby('NUCLIDE').size().sort_values(ascending=False)
NUCLIDE
134Cs          25186
137Cs          16447
3H              8976
131I            7958
Total beta      4913
Total alpha      979
125Sb            647
90Sr             342
238Pu            308
239Pu+240Pu      231
89Sr             100
144Ce              9
54Mn               9
60Co               9
58Co               3
132I               3
136Cs              2
dtype: int64
df_test[df_test.VALUE == 'ND']
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
0 0 37.210000 141.01 2012/10/16 07:25:00 T-4-1 131I Bq/L 1.3E-01 NaN ND
1 1 37.210000 141.01 2012/10/16 07:25:00 T-4-1 134Cs Bq/L 1.9E-01 NaN ND
2 2 37.210000 141.01 2012/10/16 07:25:00 T-4-1 137Cs Bq/L 2.7E-01 NaN ND
3 3 37.210000 141.01 2012/10/2 07:30:00 T-4-1 131I Bq/L 1.1E-01 NaN ND
4 4 37.210000 141.01 2012/10/2 07:30:00 T-4-1 134Cs Bq/L 2.2E-01 NaN ND
... ... ... ... ... ... ... ... ... ... ...
93158 93158 141.666667 38.30 2025/4/8 08:20:00 T-MG2 134Cs Bq/L 1.3E-03 NaN ND
93160 93160 141.666667 38.30 2025/5/13 07:36:00 T-MG2 134Cs Bq/L 1.2E-03 NaN ND
93162 93162 141.666667 38.30 2025/5/13 07:50:00 T-MG2 134Cs Bq/L 8.7E-04 NaN ND
93164 93164 141.666667 38.30 2025/6/3 08:15:00 T-MG2 134Cs Bq/L 1.1E-03 NaN ND
93166 93166 141.666667 38.30 2025/6/3 08:24:00 T-MG2 134Cs Bq/L 1.2E-03 NaN ND

66122 rows × 10 columns

df_test.VALUE == 'ND'
0         True
1         True
2         True
3         True
4         True
         ...  
93163    False
93164     True
93165    False
93166     True
93167    False
Name: VALUE, Length: 93168, dtype: bool

Remap UNIT name to MARIS nomenclature

Data are reported in Bq/L, whereas MARIS uses Bq/m3. We therefore assign the MARIS unit_id = 1 (Bq/m3), as set in unit_mapping below. Later in the processing pipeline, we convert the values from Bq/L to Bq/m3 by multiplying VALUE and DLV by 1000.


source

RemapUnitNameCB

 RemapUnitNameCB (unit_mapping)

Remap UNIT name to MARIS id.

Exported source
unit_mapping = {'Bq/L': 1}
Exported source
class RemapUnitNameCB(Callback):
    """
    Remap `UNIT` name to MARIS id.
    """
    def __init__(self, unit_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['UNIT'] = tfm.dfs['SEAWATER']['UNIT'].map(self.unit_mapping)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
81874 81874 141.133333 38.250000 2012/7/26 10:25:00 T-MG4 134Cs 1 NaN NaN 6.8E-03
56188 56188 141.040556 37.478889 2015/2/3 09:15:00 T-6 3H 1 NaN NaN 4.6E-01
74609 74609 141.072167 37.333333 2024/9/18 08:21:00 T-D9 137Cs 1 NaN NaN 2.9E-03
36442 36442 141.034444 37.431111 2012/11/19 08:30:00 T-1 137Cs 1 1.4 NaN ND
36846 36846 141.034444 37.431111 2013/03/28 06:50:00 T-1 134Cs 1 1.1 NaN ND

Remap NUCLIDE name to MARIS nomenclature


source

RemapNuclideNameCB

 RemapNuclideNameCB (nuclide_mapping)

Remap NUCLIDE name to MARIS id.

Exported source
nuclide_mapping = {
    '131I': 29,
    '134Cs': 31,
    '137Cs': 33,
    '125Sb': 24,
    'Total beta': 103,
    '238Pu': 67,
    '239Pu+240Pu': 77,
    '3H': 1,
    '89Sr': 11,
    '90Sr': 12,
    'Total alpha': 104,
    '132I': 100,
    '136Cs': 102,
    '58Co': 8,
    '105Ru': 97,
    '106Ru': 17,
    '140La': 35,
    '140Ba': 34,
    '132Te': 99,
    '60Co': 9,
    '144Ce': 37,
    '54Mn': 6
}
Exported source
class RemapNuclideNameCB(Callback):
    "Remap `NUCLIDE` name to MARIS id."
    def __init__(self, nuclide_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['NUCLIDE'] = tfm.dfs['SEAWATER']['NUCLIDE'].map(self.nuclide_mapping)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
63425 63425 141.046667 37.423333 2024/04/01 08:15:00 T-0-2 31 1 0.4 NaN ND
83876 83876 141.200000 37.233333 2019/6/6 06:58:00 T-7 31 1 1.2E-03 NaN ND
31743 31743 141.033611 37.415833 2023/07/31 08:55:00 T-2 103 1 NaN NaN 10
68484 68484 141.047222 37.311111 2023/12/5 05:48:00 T-S7 1 1 NaN NaN 1.5E-01
19853 19853 141.026389 37.322222 2025/3/4 12:20:00 T-3 1 1 3.6E-01 NaN ND
df_test.dropna(subset=['DL', 'VALUE'], how='any')
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
0 0 37.210000 141.01 2012/10/16 07:25:00 T-4-1 29 1 1.3E-01 NaN ND
1 1 37.210000 141.01 2012/10/16 07:25:00 T-4-1 31 1 1.9E-01 NaN ND
2 2 37.210000 141.01 2012/10/16 07:25:00 T-4-1 33 1 2.7E-01 NaN ND
3 3 37.210000 141.01 2012/10/2 07:30:00 T-4-1 29 1 1.1E-01 NaN ND
4 4 37.210000 141.01 2012/10/2 07:30:00 T-4-1 31 1 2.2E-01 NaN ND
... ... ... ... ... ... ... ... ... ... ...
93158 93158 141.666667 38.30 2025/4/8 08:20:00 T-MG2 31 1 1.3E-03 NaN ND
93160 93160 141.666667 38.30 2025/5/13 07:36:00 T-MG2 31 1 1.2E-03 NaN ND
93162 93162 141.666667 38.30 2025/5/13 07:50:00 T-MG2 31 1 8.7E-04 NaN ND
93164 93164 141.666667 38.30 2025/6/3 08:15:00 T-MG2 31 1 1.1E-03 NaN ND
93166 93166 141.666667 38.30 2025/6/3 08:24:00 T-MG2 31 1 1.2E-03 NaN ND

66093 rows × 10 columns

Remap VALUE, DL, DLV

We remap VALUE, DL (detection limit) and DLV to the MARIS conventions as follows:

  • If the activity (VALUE) is reported as “ND” (not detected), the result depends on the reported detection limit DL:
if VALUE is "ND":
    if not DL: 
        VALUE, DLV, DL = NaN, NaN, 3
    else:
        VALUE, DLV, DL = DL, DL, 2
  • Otherwise, if an activity (VALUE) is reported:
elif VALUE:
    VALUE, DLV, DL = VALUE, DL, 1
  • Otherwise (no activity reported), the result again depends on the reported detection limit DL:
else:
    if DL:
        VALUE, DLV, DL = DL, DL, 2
    else:
        VALUE, DLV, DL = NaN, NaN, NaN (should be dropped)

The DL codes are 1: Detected value (=), 2: Detection limit (<), 3: Not detected (ND), and:

  • VALUE is the activity reported by TEPCO
  • DL is initially the detection limit as reported by TEPCO, and is then remapped to the MARIS detection level nomenclature (categorical)
  • DLV is the detection limit value as reported by TEPCO (copied from DL)
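
The rules above can be sketched as a plain function on a single (VALUE, DL) pair. This is an illustrative sketch, not the exported callback: the names `remap` and `is_missing` are assumptions, and the inputs are made-up examples.

```python
import math

def is_missing(x):
    """True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def remap(value, dl):
    """Return (VALUE, DLV, DL code) following the rules listed above."""
    nan = float('nan')
    if value == 'ND':
        return (nan, nan, 3) if is_missing(dl) else (dl, dl, 2)
    if is_missing(value):
        return (nan, nan, nan) if is_missing(dl) else (dl, dl, 2)
    return (value, dl, 1)

print(remap('ND', 0.4))           # (0.4, 0.4, 2)
print(remap('ND', float('nan')))  # (nan, nan, 3)
print(remap(14.0, float('nan')))  # (14.0, nan, 1)
print(remap(float('nan'), 1.2))   # (1.2, 1.2, 2)
```

Note that a non-detect with a known detection limit is stored with VALUE equal to that limit and DL code 2, so downstream users must read VALUE together with DL to distinguish measured activities from upper bounds.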

source

RemapVALUE_DL_DLV_CB

 RemapVALUE_DL_DLV_CB ()

Remap DL, DLV, VALUE based on TEPCO -> MARIS rules.

Exported source
class RemapVALUE_DL_DLV_CB(Callback):
    "Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules."    
    def map_all_columns(self, row):
        """Map all three columns (VALUE, DL, DLV) at once based on TEPCO rules"""
        value, dl = row['VALUE'], row['DL']
        new_value, new_dlv, new_dl = value, dl, 1
        
        if value == 'ND':
            if pd.isna(dl):
                new_value, new_dlv, new_dl = np.nan, np.nan, 3
            else:
                new_value, new_dlv, new_dl = dl, dl, 2
                
        elif pd.isna(value):
            if pd.isna(dl):
                new_value, new_dlv, new_dl = np.nan, np.nan, np.nan
            else:
                new_value, new_dlv, new_dl = dl, dl, 2
                
        return pd.Series({
            'VALUE': new_value,
            'DLV': new_dlv, 
            'DL': new_dl
        })
        
    def __call__(self, tfm):
        mapped = tfm.dfs['SEAWATER'].apply(self.map_all_columns, axis=1)
        tfm.dfs['SEAWATER'][['VALUE', 'DLV', 'DL']] = mapped
        tfm.dfs['SEAWATER']['DL'] = tfm.dfs['SEAWATER']['DL'].astype(int)
        tfm.dfs['SEAWATER']['VALUE'] = tfm.dfs['SEAWATER']['VALUE'].astype(float)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(20)
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
24313 24313 141.033611 37.415833 2017/10/19 06:50:00 T-2 33 1 2 NaN 0.7100 0.71
81171 81171 141.083333 37.750000 2011/11/22 07:10:00 T-MA 33 1 2 NaN 1.1000 1.1E+00
62025 62025 141.046667 37.423333 2018/01/05 07:37:00 T-0-2 1 1 2 NaN 1.7000 1.7
61732 61732 141.046667 37.423333 2016/08/10 07:54 T-0-2 33 1 2 NaN 0.5300 0.53
52714 52714 141.040278 37.416111 2023/11/06 06:56:00 T-0-3 31 1 2 NaN 0.3000 0.3
9189 9189 140.763889 36.713889 2012/2/28 07:41:00 T-A 29 1 2 NaN 1.2000 1.2E+00
16645 16645 141.022500 37.824444 2022/10/6 05:44:00 T-22 31 1 2 NaN 0.0013 1.3E-03
24544 24544 141.033611 37.415833 2017/12/15 06:55:00 T-2 31 1 2 NaN 0.5500 0.55
69344 69344 141.050761 37.424686 2023/03/27 07:26:00 T-A2 33 1 2 NaN 0.2900 0.29
21783 21783 141.033611 37.415833 2012/05/29 08:15:00 T-2 33 1 2 NaN 1.6000 1.6
30460 30460 141.033611 37.415833 2022/07/11 09:13:00 T-2 103 1 1 NaN 14.0000 NaN
29837 29837 141.033611 37.415833 2022/01/03 08:15:00 T-2 33 1 2 NaN 0.8700 0.87
9304 9304 140.763889 36.713889 2014/11/10 09:34:00 T-A 33 1 2 NaN 1.3000 1.3E+00
28727 28727 141.033611 37.415833 2021/01/28 07:05:00 T-2 31 1 2 NaN 0.8000 0.8
42442 42442 141.034444 37.431111 2017/11/27 07:05:00 T-1 1 1 2 NaN 1.9000 1.9
24596 24596 141.033611 37.415833 2017/12/28 06:55:00 T-2 103 1 1 NaN 12.0000 NaN
6689 6689 140.603889 36.299722 2019/10/18 13:06:00 T-C 33 1 2 NaN 1.2000 1.2E+00
35904 35904 141.034444 37.431111 2012/05/28 08:55:00 T-1 29 1 2 NaN 0.4900 0.49
15422 15422 141.013889 37.241667 2017/12/5 13:30:00 T-4 31 1 1 NaN 0.0038 NaN
59538 59538 141.046667 37.416111 2018/08/28 07:22:00 T-0-3A 103 1 2 NaN 15.0000 15.0

Convert activity to Bq/m3

Earlier in the pipeline, we mapped the TEPCO UNIT = Bq/L to the MARIS unit_id = 1 (Bq/m3). Now we need to convert the values from Bq/L to Bq/m3 by multiplying VALUE and DLV by 1000. DL is a categorical code at this stage and is left unchanged.
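
The factor follows from 1 m3 = 1000 L: the same activity concentration expressed per cubic metre is 1000 times the figure per litre. A minimal worked sketch, with a hypothetical helper name and an input of 0.71 Bq/L chosen for illustration:

```python
LITRES_PER_CUBIC_METRE = 1000

def bq_per_l_to_bq_per_m3(x):
    """Convert an activity concentration from Bq/L to Bq/m3."""
    return x * LITRES_PER_CUBIC_METRE

# 0.71 Bq/L corresponds to about 710 Bq/m3.
print(bq_per_l_to_bq_per_m3(0.71))
```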


source

ConvertToBqM3CB

 ConvertToBqM3CB ()

Convert from Bq/L to Bq/m3.

Exported source
class ConvertToBqM3CB(Callback):
    "Convert from Bq/L to Bq/m3."    
    def __call__(self, tfm, factor=1000):
        tfm.dfs['SEAWATER']['VALUE'] = tfm.dfs['SEAWATER']['VALUE'] * factor
        # Convert DLV to float, handling NaN values
        tfm.dfs['SEAWATER']['DLV'] = pd.to_numeric(tfm.dfs['SEAWATER']['DLV'], errors='coerce')
        tfm.dfs['SEAWATER']['DLV'] = tfm.dfs['SEAWATER']['DLV'] * factor
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(20)
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
11653 11653 140.922222 36.905556 2018/7/17 09:28:00 T-18 33 1 1 NaN 3.1 NaN
56255 56255 141.040556 37.478889 2015/8/11 08:00:00 T-6 33 1 1 NaN 57.0 NaN
4414 4414 37.410000 141.030000 2015/07/13 05:30:00 T-2-1 1 1 1 NaN 2100.0 NaN
71352 71352 141.062500 37.552778 2015/7/27 08:36:00 T-14 31 1 1 NaN 2.0 1.4
28640 28640 141.033611 37.415833 2021/01/01 07:11:00 T-2 103 1 1 NaN 15000.0 NaN
51235 51235 141.040278 37.416111 2017/01/02 07:16:00 T-0-3 31 1 2 NaN 680.0 680.0
66512 66512 141.046667 37.430556 2025/04/29 07:18 T-0-1A 1 1 2 NaN 8100.0 8100.0
49336 49336 141.034444 37.431111 2025/03/29 06:50 T-1 33 1 2 NaN 660.0 660.0
71187 71187 141.062500 37.552778 2014/2/26 09:09:00 T-14 33 1 1 NaN 48.0 NaN
32088 32088 141.033611 37.415833 2023/10/27 06:45:00 T-2 31 1 2 NaN 750.0 750.0
34498 34498 141.034444 37.431111 2011/04/20 14:20:00 T-1 29 1 1 NaN 47000.0 NaN
76600 76600 141.072222 37.416667 2022/6/1 08:40:00 T-D5 104 1 2 NaN 2500.0 2500.0
51215 51215 141.040278 37.416111 2016/11/28 07:47:00 T-0-3 31 1 2 NaN 790.0 790.0
28565 28565 141.033611 37.415833 2020/12/09 06:45:00 T-2 33 1 2 NaN 690.0 690.0
28344 28344 141.033611 37.415833 2020/10/04 06:45:00 T-2 103 1 1 NaN 13000.0 NaN
70632 70632 141.062500 37.552778 2011/10/25 09:10:00 T-14 29 1 2 NaN 740.0 740.0
7440 7440 140.665556 36.506389 2014/5/26 13:38:00 T-B 31 1 2 NaN 1100.0 1100.0
20878 20878 141.033611 37.415833 2011/10/03 08:30:00 T-2 29 1 2 NaN 4000.0 4000.0
4115 4115 37.410000 141.030000 2015/04/27 05:30:00 T-2-1 31 1 2 NaN 770.0 770.0
44483 44483 141.034444 37.431111 2019/10/01 08:05:00 T-1 31 1 2 NaN 840.0 840.0

Parse & encode time


source

ParseTimeCB

 ParseTimeCB (time_name='TIME')

Parse time column from TEPCO.

Exported source
class ParseTimeCB(Callback):
    "Parse time column from TEPCO."
    def __init__(self, time_name='TIME'): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.time_name] = pd.to_datetime(tfm.dfs['SEAWATER'][self.time_name], 
                                                             format='%Y/%m/%d %H:%M:%S', errors='coerce')
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
Warning: 4831 missing time value(s) in SEAWATER
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
69085 69085 141.050739 37.409267 1733731260 T-A3 33 1 2 NaN 320.0 320.0
44340 44340 141.034444 37.431111 1564646400 T-1 33 1 2 NaN 720.0 720.0
30532 30532 141.033611 37.415833 1659345300 T-2 103 1 1 NaN 7000.0 NaN
42965 42965 141.034444 37.431111 1525677000 T-1 1 1 2 NaN 880.0 880.0
14651 14651 141.013889 37.241667 1315813500 T-4 29 1 2 NaN 4000.0 4000.0
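
The warning above reports missing time values. A possible contributor (an assumption, not checked against the source files here) is that some TIME strings shown earlier carry no seconds, e.g. 2025/06/30 08:05, and so may not match the single '%Y/%m/%d %H:%M:%S' format and be coerced to missing. The sketch below shows one way to accept both layouts using only the standard library; the helper names and the choice to treat naive times as UTC are assumptions for this illustration.

```python
from datetime import datetime, timezone

# Layouts seen in the TIME column samples: with and without seconds.
FORMATS = ('%Y/%m/%d %H:%M:%S', '%Y/%m/%d %H:%M')

def parse_time(text):
    """Try each format in turn; return None if none matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def to_epoch_seconds(dt):
    """Seconds since 1970-01-01, treating the naive datetime as UTC (an assumption)."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

for text in ['2011/3/21 23:15:00', '2025/06/30 08:05']:
    dt = parse_time(text)
    print(text, '->', dt, to_epoch_seconds(dt))
```

In a pandas pipeline the same idea could be expressed by parsing the column once per format with `errors='coerce'` and filling the gaps of the first result with the second.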

Sanitize coordinates

tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 4831 missing time value(s) in SEAWATER
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
54074 54074 141.040278 37.430556 1515394980 T-0-1 1 1 2 NaN 1700.0 1700.0
74378 74378 141.072167 37.333333 1681804920 T-D9 103 1 2 NaN 14000.0 14000.0
91213 91213 141.583333 38.633333 1377003060 T-MG0 31 1 2 NaN 2.2 2.2
72667 72667 141.072167 37.333333 1380362220 T-D9 31 1 1 NaN 23.0 NaN
86114 86114 141.200000 37.416667 1638775740 T-5 104 1 2 NaN 2300.0 2300.0

Add Sample ID

The SMP_ID column holds a sequential sample ID generated here, while the SMP_ID_PROVIDER column stores the original sample ID from the data provider. TEPCO does not provide sample IDs, so SMP_ID_PROVIDER is left empty for all records.


source

AddSampleIdCB

 AddSampleIdCB ()

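Add a sequential sample ID (SMP_ID) and an empty provider sample ID (SMP_ID_PROVIDER).

Exported source
class AddSampleIdCB(Callback):
    "Add a sequential sample ID (`SMP_ID`) and an empty provider sample ID (`SMP_ID_PROVIDER`)."
    def __call__(self, tfm):
        # Sequential IDs starting at 1, assigned after invalid rows have been dropped
        tfm.dfs['SEAWATER']['SMP_ID'] = range(1, len(tfm.dfs['SEAWATER']) + 1)
        tfm.dfs['SEAWATER']['SMP_ID_PROVIDER'] = ""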
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    AddSampleIdCB(),
    ])

df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 4831 missing time value(s) in SEAWATER
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV SMP_ID_PROVIDER
67223 57586 141.047222 37.241667 1448351640 T-11 31 1 1 NaN 3.5 NaN
42868 35125 141.034444 37.431111 1523173800 T-1 33 1 2 NaN 450.0 450.0
44015 36272 141.034444 37.431111 1553155800 T-1 31 1 2 NaN 760.0 760.0
42519 34776 141.034444 37.431111 1513755900 T-1 31 1 2 NaN 540.0 540.0
13882 7902 141.013889 37.241667 1318838400 T-4 29 1 2 NaN 4000.0 4000.0

Encode to NetCDF

tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ConvertToBqM3CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB(),
    AddSampleIdCB(),
    ])

dfs_tfm = tfm()
tfm.logs
Warning: 4831 missing time value(s) in SEAWATER
['Remove 約 (about) char',
 "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean",
 'Select columns of interest.',
 '\n    Get TEPCO nuclide names as values not column names \n    to extract contained information (nuclide name, unc, dl, ...).\n    ',
 'Extract nuclide name from TEPCO data.',
 'Extract unit from TEPCO data.',
 'Extract value type from TEPCO data.',
 'Reshape: long to wide',
 '\n    Remap `UNIT` name to MARIS id.\n    ',
 'Remap `NUCLIDE` name to MARIS id.',
 'Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules.',
 'Convert from Bq/L to Bq/m3.',
 'Parse time column from TEPCO.',
 'Encode time as seconds since epoch.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.',
 'Convert from Bq/L to Bq/m3.']
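
The log entries above read like the docstrings of the callbacks, in the order they were applied. A minimal sketch of that pattern follows, assuming the logs are collected from each callback's docstring; this is an inference from the output, not a description of the actual `Transformer` internals, and `Step`, `MiniTransformer`, `ToBqM3` and `AddIds` are hypothetical names.

```python
class Step:
    "Base class for a pipeline step."
    def __call__(self, tfm):
        pass

class MiniTransformer:
    "Apply steps in order and record each step's docstring as a log entry."
    def __init__(self, data, cbs):
        self.data, self.cbs, self.logs = data, cbs, []

    def __call__(self):
        for cb in self.cbs:
            cb(self)                             # the step mutates self.data
            self.logs.append(type(cb).__doc__)   # log entry = class docstring
        return self.data

class ToBqM3(Step):
    "Convert from Bq/L to Bq/m3."
    def __call__(self, tfm):
        tfm.data = [v * 1000 for v in tfm.data]

class AddIds(Step):
    "Add a sequential sample id."
    def __call__(self, tfm):
        tfm.data = list(enumerate(tfm.data, start=1))

mini = MiniTransformer([4.0, 0.5], cbs=[ToBqM3(), AddIds()])
result = mini()
print(mini.logs)
```

Under this assumption, two callbacks that share the same docstring produce identical log entries, which would explain why 'Convert from Bq/L to Bq/m3.' appears both in twelfth position and at the end of the list.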
dfs_tfm['SEAWATER'].sample(10)
type   SMP_ID         LON        LAT        TIME  STATION  NUCLIDE  UNIT  DL  UNC   VALUE     DLV  SMP_ID_PROVIDER
64028   54701  141.046667  37.430556  1408352280   T-0-1A        1     1   2  NaN  1700.0  1700.0              NaN
82446   72534  141.133333  38.250000  1522926600    T-MG4       31     1   2  NaN     1.5     1.5              NaN
39583   32834  141.034444  37.431111  1438068000      T-1       31     1   2  NaN   790.0   790.0              NaN
32941   26702  141.033611  37.415833  1719583500      T-2       31     1   2  NaN   640.0   640.0              NaN
31333   25094  141.033611  37.415833  1680244980      T-2       33     1   2  NaN   740.0   740.0              NaN
87087   77175  141.216667  37.533333  1659696300     T-B1       33     1   1  NaN     1.4     NaN              NaN
83010   73098  141.148611  37.348333  1647327180     T-B4       31     1   2  NaN     1.4     1.4              NaN
65263   55757  141.046667  37.430556  1594621980   T-0-1A       33     1   2  NaN   750.0   750.0              NaN
87045   77133  141.216667  37.533333  1614322020     T-B1       31     1   2  NaN     1.2     1.2              NaN
13558    7578  141.006944  37.055556  1380524760   T-17-1       33     1   1  NaN     8.7     NaN              NaN

get_attrs

 get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans >
            Ocean Chemistry> Radionuclides', 'Earth Science > Human
            Dimensions > Environmental Impacts > Nuclear Radiation
            Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean
            Tracers, Earth Science > Oceans > Marine Sediments', 'Earth
            Science > Oceans > Ocean Chemistry, Earth Science > Oceans >
            Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality >
            Ocean Contaminants', 'Earth Science > Biological
            Classification > Animals/Vertebrates > Fish', 'Earth Science >
            Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science >
            Biological Classification > Animals/Invertebrates > Mollusks',
            'Earth Science > Biological Classification >
            Animals/Invertebrates > Arthropods > Crustaceans', 'Earth
            Science > Biological Classification > Plants > Macroalgae
            (Seaweeds)'])

Retrieve global attributes from MARIS dump.

Exported source
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
Exported source
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw)
{'geospatial_lat_min': '141.66666667',
 'geospatial_lat_max': '38.63333333',
 'geospatial_lon_min': '140.60388889',
 'geospatial_lon_max': '35.79611111',
 'geospatial_bounds': 'POLYGON ((140.60388889 35.79611111, 141.66666667 35.79611111, 141.66666667 38.63333333, 140.60388889 38.63333333, 140.60388889 35.79611111))',
 'time_coverage_start': '2011-03-21T14:30:00',
 'time_coverage_end': '2025-07-22T06:17:00',
 'id': 'JEV6HP5A',
 'title': "Readings of Sea Area Monitoring - Monitoring of sea water - Sea area close to TEPCO's Fukushima Daiichi NPS / Coastal area - Readings of Sea Area Monitoring [TEPCO]",
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "TEPCO - Tokyo Electric Power Company"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Remove 約 (about) char, Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean, Select columns of interest., \n    Get TEPCO nuclide names as values not column names \n    to extract contained information (nuclide name, unc, dl, ...).\n    , Extract nuclide name from TEPCO data., Extract unit from TEPCO data., Extract value type from TEPCO data., Reshape: long to wide, \n    Remap `UNIT` name to MARIS id.\n    , Remap `NUCLIDE` name to MARIS id., Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules., Convert from Bq/L to Bq/m3., Parse time column from TEPCO., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator., Convert from Bq/L to Bq/m3."}
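
To make the spatial and temporal attributes concrete, here is a minimal sketch of how a bounding box, its WKT polygon and a time range could be derived from `LON`, `LAT` and epoch-second `TIME` columns. It is an illustration only: `bbox_and_time_attrs` is a hypothetical helper, not the `BboxCB` or `TimeRangeCB` implementation, and the toy frame contains just the extreme coordinates and the start time reported above.

```python
import pandas as pd

def bbox_and_time_attrs(df, lon='LON', lat='LAT', time='TIME'):
    "Hypothetical helper: summarise spatial and temporal extent as string attributes."
    lon_min, lon_max = float(df[lon].min()), float(df[lon].max())
    lat_min, lat_max = float(df[lat].min()), float(df[lat].max())
    # Closed ring in (lon lat) order, starting at the south-west corner.
    ring = [(lon_min, lat_min), (lon_max, lat_min), (lon_max, lat_max),
            (lon_min, lat_max), (lon_min, lat_min)]
    wkt = 'POLYGON ((' + ', '.join(f'{x} {y}' for x, y in ring) + '))'
    # TIME is assumed to hold seconds since the Unix epoch; missing values are skipped.
    times = pd.to_datetime(df[time].dropna(), unit='s')
    fmt = '%Y-%m-%dT%H:%M:%S'
    return {
        'geospatial_lon_min': str(lon_min), 'geospatial_lon_max': str(lon_max),
        'geospatial_lat_min': str(lat_min), 'geospatial_lat_max': str(lat_max),
        'geospatial_bounds': wkt,
        'time_coverage_start': times.min().strftime(fmt),
        'time_coverage_end': times.max().strftime(fmt),
    }

toy = pd.DataFrame({
    'LON': [140.60388889, 141.66666667],
    'LAT': [35.79611111, 38.63333333],
    'TIME': [1300717800, 1318838400],
})
attrs = bbox_and_time_attrs(toy)
print(attrs['geospatial_bounds'])
print(attrs['time_coverage_start'])
```

In this sketch the min/max keys follow the polygon: longitudes span 140.60388889 to 141.66666667 and latitudes span 35.79611111 to 38.63333333, which is the extent described by `geospatial_bounds` in the output above.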

encode

 encode (fname_out:str, **kwargs)

Encode TEPCO data to NetCDF.

           Type         Details
fname_out  str          Path to the NetCDF output file
kwargs     VAR_KEYWORD  Additional keyword arguments (e.g. `verbose`)
Exported source
def encode(
    fname_out: str, # Path to the NetCDF output file
    **kwargs # Additional keyword arguments (e.g. `verbose`)
    ):
    "Encode TEPCO data to NetCDF."
    dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
    
    tfm = Transformer(dfs, cbs=[
        RemoveJapanaseCharCB(),
        FixRangeValueStringCB(),
        SelectColsOfInterestCB(common_coi, nuclides_pattern),
        WideToLongCB(),
        ExtractNuclideNameCB(),
        ExtractUnitCB(),
        ExtractValueTypeCB(),
        LongToWideCB(),
        RemapUnitNameCB(unit_mapping),
        RemapNuclideNameCB(nuclide_mapping),
        RemapVALUE_DL_DLV_CB(),
        ConvertToBqM3CB(),
        ParseTimeCB(),
        EncodeTimeCB(),
        SanitizeLonLatCB(),
        AddSampleIdCB()
    ])        
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out, 
                            global_attrs=get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw),
                            verbose=kwargs.get('verbose', False)
                            )
    encoder.encode()
encode(fname_out, verbose=False)
Warning: 4831 missing time value(s) in SEAWATER
decode(fname_in=fname_out, verbose=True)
Saved SEAWATER to ../../_data/output/tepco_SEAWATER.csv
df_output = pd.read_csv("../../_data/output/tepco_SEAWATER.csv")
df_output.head()
   samplabcode  longitude  latitude            begperiod  station  nuclide_id  activity  unit_id  uncertaint  detection  detection_lim  samptype_id  ref_id
0          NaN  140.60388  36.29972  2011-10-13 13:21:00      T-C          29    4000.0        1         NaN          <         4000.0            1     679
1          NaN  140.60388  36.29972  2011-10-13 13:21:00      T-C          31    6000.0        1         NaN          <         6000.0            1     679
2          NaN  140.60388  36.29972  2011-10-13 13:23:00      T-C          33    9000.0        1         NaN          <         9000.0            1     679
3          NaN  140.60388  36.29972  2011-10-13 13:23:00      T-C          29    4000.0        1         NaN          <         4000.0            1     679
4          NaN  140.60388  36.29972  2011-10-13 13:23:00      T-C          31    6000.0        1         NaN          <         6000.0            1     679
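
The decoded `detection` and `detection_lim` columns can be related back to the `DL`, `VALUE` and `DLV` columns of the transformed data. The sketch below assumes that `DL` code 1 stands for '=' (detected value) and code 2 for '<' (below the detection limit); this mapping is inferred from the sample rows in this section rather than taken from the MARIS lookup tables, and `DL_SYMBOLS` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Assumption: inferred from the sample rows, not read from the MARIS lookup table.
DL_SYMBOLS = {1: '=', 2: '<'}

# Three rows copied from the transformed sample shown earlier in this section.
rows = pd.DataFrame({
    'STATION': ['T-11', 'T-1', 'T-4'],
    'DL': [1, 2, 2],
    'VALUE': [3.5, 450.0, 4000.0],
    'DLV': [np.nan, 450.0, 4000.0],
})
rows['detection'] = rows['DL'].map(DL_SYMBOLS)

# For '<' rows the reported value equals the detection limit;
# for '=' rows no detection limit is recorded.
below = rows['detection'] == '<'
consistent = bool(
    (rows.loc[below, 'VALUE'] == rows.loc[below, 'DLV']).all()
    and rows.loc[~below, 'DLV'].isna().all()
)
print(rows)
print(consistent)
```

A check of this kind, run on the decoded CSV, is a quick way to confirm that values reported as below the detection limit carry the limit itself as their activity, as in the first five decoded rows.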