Data pipeline (handler) to convert TEPCO dataset (Source) to NetCDF format

Configuration & file paths

Exported source
fname_coastal_water = 'https://radioactivity.nra.go.jp/cont/en/results/sea/coastal_water.csv'
fname_clos1F = 'https://radioactivity.nra.go.jp/cont/en/results/sea/close1F_water.xlsx'
fname_iaea_orbs = 'https://raw.githubusercontent.com/RML-IAEA/iaea.orbs/refs/heads/main/src/iaea/orbs/stations/station_points.csv'

fname_out = '../../_data/output/tepco.nc'

Load data

We load the data from the NRA (Nuclear Regulation Authority) website. For the moment, we only process radioactivity concentration data for the seawater around the Fukushima Dai-ichi NPP [TEPCO], taken from the coastal_water.csv and close1F_water.xlsx files.

In the near future, MARIS will provide a dedicated handler for all ALPS-related data, including measurements provided not only by TEPCO but also by MOE, NRA, MLITT and Fukushima Prefecture.

FEEDBACK TO DATA PROVIDER

The coastal_water.csv file contains two sections: the measurements and the locations. We identify below the line number where the locations begin. A single source of truth for the station locations would ease processing in the future.


source

find_location_section

 find_location_section (df, col_idx=0, pattern='Sampling point number')

Find the line number where location data begins.

Exported source
def find_location_section(df, 
                          col_idx=0,
                          pattern='Sampling point number'
                          ):
    "Find the line number where location data begins."
    mask = df.iloc[:, col_idx] == pattern
    indices = df[mask].index
    return indices[0] if len(indices) > 0 else -1
find_location_section(pd.read_csv(fname_coastal_water, low_memory=False))
np.int64(28039)
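To make the splitting logic easier to follow, here is a minimal sketch on an in-memory toy file. The CSV content is hypothetical and much simpler than the real coastal_water.csv (which may carry extra title or blank lines, hence the `skiprows` arguments used by the handler); `find_location_section` is copied from above so the block runs on its own.

```python
import io
import pandas as pd

def find_location_section(df, col_idx=0, pattern='Sampling point number'):
    "Same logic as the handler: first row whose `col_idx` cell equals `pattern`."
    mask = df.iloc[:, col_idx] == pattern
    indices = df[mask].index
    return indices[0] if len(indices) > 0 else -1

# Hypothetical two-section file: measurements first, then a repeated header and locations.
raw = (
    "Sampling point number,Sampling date,Sampling time\n"
    "T-3,2025/1/8,9:36\n"
    "T-4,2025/1/8,9:59\n"
    "Sampling point number,latitude,longitude\n"
    "T-3,37.32,141.03\n"
    "T-4,37.24,141.01\n"
)
toy = pd.read_csv(io.StringIO(raw))

split_at = find_location_section(toy)   # row holding the second header
meas = toy.iloc[:split_at]              # rows before the second header
locs = toy.iloc[split_at + 1:]          # rows after the second header
missing = find_location_section(meas)   # -1 when the pattern is absent
```

The `-1` sentinel is worth checking before slicing, since a negative `nrows` or `skiprows` would otherwise fail later with a less obvious error.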
FEEDBACK TO DATA PROVIDER

The sampling time has to be parsed differently for the coastal_water.csv and close1F_water.xlsx files:

  • coastal_water.csv uses the format YYYY/MM/DD for the sampling date and HH:MM for the sampling time, and
  • close1F_water.xlsx uses the format YYYY-MM-DD HH:MM:SS.

source

fix_sampling_time

 fix_sampling_time (x)
Exported source
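def fix_sampling_time(x):
    "Normalize a sampling time to HH:MM:00; missing values become 00:00:00."
    if pd.isna(x): 
        return '00:00:00'
    hour, minute = x.split(':')[:2]
    # Left-pad single-digit hours (e.g. '9:36' -> '09:36:00')
    return f"{hour if len(hour) == 2 else '0' + hour}:{minute}:00"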
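Once both files are normalized to a `/`-separated date followed by `HH:MM:SS`, a single format string should be enough to parse the resulting TIME values. The sketch below uses only the standard library; `TIME_FORMAT` and `parse_time` are hypothetical names that are not part of the handler, and the two sample strings are taken from the outputs shown on this page.

```python
from datetime import datetime

TIME_FORMAT = '%Y/%m/%d %H:%M:%S'  # assumed layout of the TIME column

def parse_time(value):
    "Hypothetical helper: parse a TIME string into a datetime."
    return datetime.strptime(value, TIME_FORMAT)

# coastal_water.csv style: month and day are not zero-padded
csv_style = parse_time('2011/3/21 23:15:00')
# close1F_water.xlsx style after replacing '-' with '/'
xlsx_style = parse_time('2013/08/14 08:17:00')
```

`strptime` accepts both padded and unpadded month and day fields, so no extra padding step is needed for the date part.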

source

get_coastal_water_df

 get_coastal_water_df (fname_coastal_water)

Get the measurements dataframe from the coastal_water.csv file.

Exported source
def get_coastal_water_df(fname_coastal_water):
    "Get the measurements dataframe from the `coastal_water.csv` file."
    
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=1, 
                     nrows=locs_idx - 1,
                     low_memory=False)
    df.dropna(subset=['Sampling point number'], inplace=True)
    df['Sampling time'] = df['Sampling time'].map(fix_sampling_time)
    
    # Series.replace only swaps whole cell values; use .str.replace for substrings
    df['TIME'] = df['Sampling date'].str.replace('-', '/', regex=False) + ' ' + df['Sampling time']
    
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_coastal_water = get_coastal_water_df(fname_coastal_water)
df_coastal_water.tail()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
28031 T-S3 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.9E+00 NaN NaN NaN NaN NaN 2025/1/8 09:36:00
28032 T-S4 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.9E+00 NaN NaN NaN NaN NaN 2025/1/8 09:59:00
28033 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.3E+00 NaN NaN NaN NaN NaN 2025/1/13 07:44:00
28034 T-S8 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.6E+00 NaN NaN NaN NaN NaN 2025/1/15 05:22:00
28035 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 7.4E+00 NaN NaN NaN NaN NaN 2025/1/20 07:57:00

5 rows × 49 columns

coi = [o for o in df_coastal_water.columns if "134Cs" in o]

df_coastal_water[coi + ['Sampling point number', 'TIME']].head(30)
134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) Sampling point number TIME
0 4.8E+01 9.2E+00 T-3 2011/3/21 23:15:00
1 3.1E+01 8.7E+00 T-4 2011/3/21 23:45:00
2 4.6E+01 1.4E+01 T-3 2011/3/22 14:28:00
3 3.9E+01 1.1E+01 T-4 2011/3/22 15:06:00
4 5.1E+01 2.0E+01 T-3 2011/3/23 13:51:00
5 3.3E+01 2.1E+01 T-4 2011/3/23 14:25:00
6 9.9E+01 3.8E+01 T-3 2011/3/24 09:30:00
7 3.5E+01 7.0E+00 T-4 2011/3/24 08:45:00
8 2.6E+01 7.4E+00 T-3 2011/3/25 10:00:00
9 2.0E+01 6.7E+00 T-4 2011/3/25 09:10:00
10 2.6E+01 1.8E+01 T-3 2011/3/26 15:15:00
11 1.3E+01 7.1E+00 T-4 2011/3/26 15:50:00
12 5.4E+02 1.2E+01 T-3 2011/3/27 14:30:00
13 2.0E+01 6.0E+00 T-4 2011/3/27 08:45:00
14 6.1E+02 2.3E+01 T-3 2011/3/28 09:35:00
15 3.3E+02 2.1E+01 T-4 2011/3/28 08:45:00
16 3.2E+02 1.3E+01 T-3 2011/3/29 10:15:00
17 2.3E+02 1.2E+01 T-4 2011/3/29 09:20:00
18 3.6E+02 2.0E+01 T-3 2011/3/30 10:00:00
19 1.8E+02 2.0E+01 T-4 2011/3/30 09:05:00
20 3.6E+02 2.1E+01 T-3 2011/3/31 10:00:00
21 1.6E+02 2.0E+01 T-4 2011/3/31 09:15:00
22 3.0E+02 1.8E+01 T-3 2011/4/1 09:50:00
23 2.0E+02 1.8E+01 T-4 2011/4/1 09:00:00
24 1.9E+01 1.5E+01 8 2011/4/2 13:35:00
25 1.7E+02 1.7E+01 T-3 2011/4/2 09:55:00
26 5.1E+01 1.7E+01 T-4 2011/4/2 09:00:00
27 2.3E+01 4.9E+00 T-5 2011/4/2 14:03:00
28 NaN NaN T-7 2011/4/2 13:12:00
29 NaN NaN 8 2011/4/3 12:20:00
coi
['134Cs radioactivity concentration (Bq/L)', '134Cs detection limit (Bq/L)']
df_coastal_water.dropna(subset=coi, how='any')[coi + ['Sampling point number', 'TIME']]
134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) Sampling point number TIME
0 4.8E+01 9.2E+00 T-3 2011/3/21 23:15:00
1 3.1E+01 8.7E+00 T-4 2011/3/21 23:45:00
2 4.6E+01 1.4E+01 T-3 2011/3/22 14:28:00
3 3.9E+01 1.1E+01 T-4 2011/3/22 15:06:00
4 5.1E+01 2.0E+01 T-3 2011/3/23 13:51:00
... ... ... ... ...
28023 ND 1.0E-03 T-MG3 2024/12/17 10:19:00
28024 ND 9.7E-04 T-MG3 2024/12/17 10:12:00
28026 ND 1.2E-03 T-3 2024/12/24 11:45:00
28027 ND 1.3E-03 T-4 2024/12/24 09:48:00
28028 ND 1.3E-03 T-6 2024/12/24 13:00:00

18317 rows × 4 columns

mask = df_coastal_water['134Cs radioactivity concentration (Bq/L)'] == 'ND'
df_coastal_water[mask][coi + ['Sampling point number', 'TIME']]
134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) Sampling point number TIME
53 ND NaN 5 2011/4/6 11:30:00
57 ND NaN 8 2011/4/6 12:52:00
59 ND NaN 10 2011/4/6 13:37:00
64 ND NaN T-7 2011/4/6 12:44:00
65 ND NaN T-7 2011/4/6 13:15:00
... ... ... ... ...
28023 ND 1.0E-03 T-MG3 2024/12/17 10:19:00
28024 ND 9.7E-04 T-MG3 2024/12/17 10:12:00
28026 ND 1.2E-03 T-3 2024/12/24 11:45:00
28027 ND 1.3E-03 T-4 2024/12/24 09:48:00
28028 ND 1.3E-03 T-6 2024/12/24 13:00:00

18404 rows × 4 columns

len(df_coastal_water)
28036
FEEDBACK TO DATA PROVIDER

Identifying the station locations requires three distinct sources:

  • the second section of the coastal_water.csv file
  • the R6zahyo.pdf file, further processed by https://github.com/RML-IAEA/iaea.orbs
  • the second section of every sheet of the close1F_water.xlsx file

All of these files and sheets are required to look up the station locations.


source

get_locs_coastal_water

 get_locs_coastal_water (fname_coastal_water)
Exported source
def get_locs_coastal_water(fname_coastal_water):
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=locs_idx+1, 
                     low_memory=False).iloc[:, :3]
    
    # The file lists latitude (~37°N) before longitude (~141°E)
    df.columns = ['STATION', 'LAT', 'LON']
    df.dropna(subset=['LON'], inplace=True)
    df['org'] = 'coastal_seawater.csv'
    return df
df_locs_coastal_water = get_locs_coastal_water(fname_coastal_water)
print(f'Nb. of stations: {len(df_locs_coastal_water)}')
df_locs_coastal_water.head()
Nb. of stations: 48
STATION LAT LON org
0 T-0 37.42 141.04 coastal_seawater.csv
1 T-11 37.24 141.05 coastal_seawater.csv
2 T-12 37.15 141.04 coastal_seawater.csv
3 T-13-1 37.64 141.04 coastal_seawater.csv
4 T-14 37.55 141.06 coastal_seawater.csv
df_locs_coastal_water.STATION.unique()
array(['T-0', 'T-11', 'T-12', 'T-13-1', 'T-14', 'T-17-1', 'T-18', 'T-20',
       'T-22', 'T-3', 'T-4', 'T-4-1', 'T-4-2', 'T-5', 'T-6', 'T-7', 'T-A',
       'T-B', 'T-B1', 'T-B2', 'T-B3', 'T-B4', 'T-C', 'T-D', 'T-D1',
       'T-D5', 'T-D9', 'T-E', 'T-E1', 'T-Z', 'T-MG6', 'T-S1', 'T-S7',
       'T-H1', 'T-S2', 'T-S6', 'T-M10', 'T-MA', 'T-S3', 'T-S4', 'T-S8',
       'T-MG4', 'T-G4', 'T-MG5', 'T-MG1', 'T-MG0', 'T-MG3', 'T-MG2'],
      dtype=object)
FEEDBACK TO DATA PROVIDER

Data contained in the close1F_water.xlsx file are spread across several sheets (one per station). Each sheet further contains two sections: the measurements and the locations.

For each sheet, we have to identify the line number that separates the measurements from the locations. We then iterate over all sheets and concatenate the results.


source

get_clos1F_df

 get_clos1F_df (fname_clos1F)

Get measurements dataframe from close1F_water.xlsx file and parse datetime.

Exported source
def get_clos1F_df(fname_clos1F):
    "Get measurements dataframe from close1F_water.xlsx file and parse datetime."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                   sheet_name=sheet_name, 
                   skiprows=1,
                   nrows=locs_idx-1)
        
        df.dropna(subset=['Sampling point number'], inplace=True)
        # Keep only the date part and switch to '/' separators
        df['Sampling date'] = (df['Sampling date']
                               .astype(str)
                               .apply(lambda x: x.split(' ')[0].replace('-', '/')))
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True)
    df.dropna(subset=['Sampling date'], inplace=True)
    df['TIME'] = df['Sampling date'] + ' ' + df['Sampling time'].astype(str)
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_clos1F = get_clos1F_df(fname_clos1F)
df_clos1F.head()
100%|██████████| 11/11 [00:07<00:00,  1.44it/s]
Sampling point number 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) Total beta radioactivity concentration (Bq/L) Total beta detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) Collection layer of seawater ... 106Ru detection limit (Bq/L) 60Co radioactivity concentration (Bq/L) 60Co detection limit (Bq/L) 95Zr radioactivity concentration (Bq/L) 95Zr detection limit (Bq/L) 99Mo radioactivity concentration (Bq/L) 99Mo detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) TIME
0 T-0-1 ND 1.5 ND 1.4 ND 18.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN 4.7 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 ND 1.1 ND 1.4 ND 20.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN ND 2.9 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 ND 0.66 ND 0.49 ND 17.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 57 columns

df_clos1F['Sampling point number'].unique()
array(['T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)

source

get_locs_clos1F

 get_locs_clos1F (fname_clos1F)

Get locations dataframe from each sheet of the close1F_water.xlsx file.

Exported source
def get_locs_clos1F(fname_clos1F):
    "Get locations dataframe from each sheet of the close1F_water.xlsx file."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                           sheet_name=sheet_name, 
                           skiprows=locs_idx+2)
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True).iloc[:, :3]
    df.dropna(subset=['Sampling coordinate North latitude (Decimal)'], inplace=True)    
    # Each sheet lists latitude (~37°N) before longitude (~141°E)
    df.columns = ['STATION', 'LAT', 'LON']
    df['org'] = 'close1F.csv'
    return df
df_locs_clos1F = get_locs_clos1F(fname_clos1F)
print(f'Nb. of stations: {len(df_locs_clos1F)}')
df_locs_clos1F.head()
100%|██████████| 11/11 [00:04<00:00,  2.31it/s]
Nb. of stations: 11
STATION LAT LON org
0 T-0-1 37.43 141.04 close1F.csv
11 T-0-1A 37.43 141.05 close1F.csv
22 T-0-2 37.42 141.05 close1F.csv
33 T-0-3 37.42 141.04 close1F.csv
44 T-0-3A 37.42 141.05 close1F.csv
FEEDBACK TO DATA PROVIDER

The close1F_water.xlsx file contains station locations that are not present in the coastal_water.csv dataset, as demonstrated in the comparison below:

set(df_locs_clos1F.STATION) - set(df_locs_coastal_water.STATION)
{'T-0-1',
 'T-0-1A',
 'T-0-2',
 'T-0-3',
 'T-0-3A',
 'T-1',
 'T-2',
 'T-2-1',
 'T-A1',
 'T-A2',
 'T-A3'}
FEEDBACK TO DATA PROVIDER

In theory all locations are supposed to be provided in the R6zahyo.pdf file. This file is further processed by https://github.com/RML-IAEA/iaea.orbs and the result is provided in the station_points.csv file.

However, this file lacks complete coverage of locations referenced in both coastal_water.csv and close1F_water.xlsx files, while simultaneously containing additional locations not present in either (see below). A more standardized and comprehensive location reference system would significantly improve the efficiency and reliability of the data ingestion process.


source

get_locs_orbs

 get_locs_orbs (fname_iaea_orbs)
Exported source
def get_locs_orbs(fname_iaea_orbs):
    df = pd.read_csv(fname_iaea_orbs)
    df.columns = ['org', 'STATION', 'LON', 'LAT']
    return df
df_locs_orbs = get_locs_orbs(fname_iaea_orbs)
df_locs_orbs.head()
org STATION LON LAT
0 MOE E-31 141.727667 39.059167
1 MOE E-32 141.635667 38.996000
2 MOE E-37 141.948611 39.259167
3 MOE E-38 141.755000 39.008333
4 MOE E-39 141.766667 38.991667
set(df_locs_orbs.STATION) - (set(df_locs_clos1F.STATION) | set(df_locs_coastal_water.STATION))
{'C-P1',
 'C-P2',
 'C-P3',
 'C-P4',
 'C-P5',
 'C-P8',
 'E-31',
 'E-32',
 'E-37',
 'E-38',
 'E-39',
 'E-3A',
 'E-41',
 'E-42',
 'E-43',
 'E-44',
 'E-45',
 'E-46',
 'E-47',
 'E-48',
 'E-49',
 'E-4A',
 'E-4B',
 'E-4C',
 'E-4F',
 'E-4G',
 'E-4H',
 'E-4J',
 'E-4K',
 'E-4L',
 'E-4M',
 'E-71',
 'E-72',
 'E-73',
 'E-74',
 'E-75',
 'E-76',
 'E-77',
 'E-78',
 'E-79',
 'E-7A',
 'E-7B',
 'E-7C',
 'E-7D',
 'E-7F',
 'E-7G',
 'E-7H',
 'E-7I',
 'E-7J',
 'E-7K',
 'E-7L',
 'E-81',
 'E-82',
 'E-83',
 'E-84',
 'E-85',
 'E-S1',
 'E-S10',
 'E-S13',
 'E-S14',
 'E-S15',
 'E-S17',
 'E-S18',
 'E-S19',
 'E-S20',
 'E-S21',
 'E-S22',
 'E-S23',
 'E-S24',
 'E-S25',
 'E-S26',
 'E-S27',
 'E-S28',
 'E-S29',
 'E-S3',
 'E-S30',
 'E-S31',
 'E-S32',
 'E-S33',
 'E-S34',
 'E-S35',
 'E-S36',
 'E-S4',
 'E-S5',
 'E-T1',
 'E-T2',
 'E-T3',
 'E-T4',
 'E-T5',
 'E-T6',
 'E-T7',
 'E-T8',
 'F-P01',
 'F-P02',
 'F-P03',
 'F-P04',
 'F-P05',
 'F-P06',
 'F-P07',
 'F-P08',
 'F-P09',
 'F-P10',
 'F-P11',
 'F-P12',
 'F-P13',
 'F-P14',
 'F-P15',
 'F-P29',
 'F-P30',
 'F-P31',
 'F-P32',
 'F-P33',
 'F-P34',
 'F-P35',
 'F-P37',
 'F-P38',
 'F-P39',
 'F-P40',
 'F-P41',
 'F-P42',
 'F-P43',
 'F-P45',
 'F-P46',
 'F-P47',
 'F-P48',
 'F-P49',
 'F-P50',
 'F-P51',
 'F-P52',
 'F-P53',
 'F-P54',
 'F-P55',
 'F-P56',
 'F-P57',
 'F-P58',
 'F-P59',
 'F-P60',
 'F-P61',
 'F-P62',
 'F-P63',
 'F-P64',
 'F-P65',
 'F-P66',
 'F-P67',
 'F-P68',
 'F-P69',
 'F-P70',
 'F-P71',
 'F-P72',
 'F-P73',
 'F-P74',
 'F-P75',
 'F-P76',
 'F-P77',
 'F-P78',
 'F-P79',
 'F-P80',
 'F-P81',
 'F-P82',
 'F-P83',
 'K-T1',
 'K-T2',
 'KK-U1',
 'M-10',
 'M-101',
 'M-102',
 'M-103',
 'M-104',
 'M-11',
 'M-14',
 'M-15',
 'M-19',
 'M-20',
 'M-21',
 'M-25',
 'M-26',
 'M-27',
 'M-A1',
 'M-A3',
 'M-B1',
 'M-B5',
 'M-C1',
 'M-C10',
 'M-C2',
 'M-C3',
 'M-C4',
 'M-C6',
 'M-C7',
 'M-C8',
 'M-C9',
 'M-D1',
 'M-D3',
 'M-E1',
 'M-E3',
 'M-E5',
 'M-F1',
 'M-F3',
 'M-G0',
 'M-G1',
 'M-G3',
 'M-G4',
 'M-H1',
 'M-H3',
 'M-I0',
 'M-I1',
 'M-I3',
 'M-IB2',
 'M-IB4',
 'M-J1',
 'M-J3',
 'M-K1',
 'M-L1',
 'M-L3',
 'M-M1',
 'M-MI4',
 'T-S5',
 'T-①',
 'T-②',
 'T-③',
 'T-④',
 'T-⑤',
 'T-⑥',
 'T-⑦',
 'T-⑧',
 'T-⑨',
 'T-⑩',
 'T-⑪',
 'T-⑫',
 'T-⑬'}

source

concat_locs

 concat_locs (dfs)

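Concatenate location tables and drop duplicated stations, keeping iaea.orbs entries first, then coastal_seawater.csv, then close1F.csv.

Exported source
def concat_locs(dfs):
    "Concatenate location tables and drop duplicated stations, keeping iaea.orbs entries first, then coastal_seawater.csv, then close1F.csv."
    df = pd.concat(dfs)
    # Rank sources for sorting: 0 = iaea.orbs (any other org), 1 = coastal_seawater.csv, 2 = close1F.csv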
    df['org_grp'] = df['org'].apply(
        lambda x: 1 if x == 'coastal_seawater.csv' else 2 if x == 'close1F.csv' else 0)
    df.sort_values('org_grp', ascending=True, inplace=True)
    # Drop duplicates and keep orbs data first
    df.drop_duplicates(subset='STATION', keep='first', inplace=True)
    df.drop(columns=['org_grp'], inplace=True)
    df.sort_values('STATION', ascending=True, inplace=True)
    return df
df_locs = concat_locs([df_locs_clos1F, df_locs_coastal_water, df_locs_orbs])
df_locs.head()
STATION LON LAT org
214 C-P1 139.863333 35.425000 NRA
215 C-P2 139.863333 35.401667 NRA
216 C-P3 139.881667 35.370000 NRA
217 C-P4 139.846667 35.356667 NRA
218 C-P5 139.800000 35.343333 NRA
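The precedence rule used when stations are duplicated can be checked on a small example. This is only a sketch under stated assumptions: the `priority` mapping is a hypothetical rewrite of the ranking lambda, `T-X` is an invented station code, and the coordinates in the `coastal` frame are made up for illustration.

```python
import pandas as pd

# iaea.orbs-style entry (org is the data provider, here 'TEPCO')
orbs = pd.DataFrame({'org': ['TEPCO'], 'STATION': ['T-3'],
                     'LON': [141.026389], 'LAT': [37.322222]})
# coastal_water.csv-style entries; 'T-X' and all coordinates are hypothetical
coastal = pd.DataFrame({'STATION': ['T-3', 'T-X'],
                        'LON': [141.03, 141.10], 'LAT': [37.32, 37.50],
                        'org': 'coastal_seawater.csv'})

# Lower rank wins; any org not listed here (i.e. iaea.orbs entries) gets rank 0
priority = {'coastal_seawater.csv': 1, 'close1F.csv': 2}

df = pd.concat([coastal, orbs], ignore_index=True)
df['org_grp'] = df['org'].map(priority).fillna(0)
df = (df.sort_values('org_grp', kind='stable')
        .drop_duplicates(subset='STATION', keep='first')
        .drop(columns='org_grp')
        .sort_values('STATION'))
```

Here `T-3` keeps its iaea.orbs coordinates, while `T-X`, which only exists in the coastal table, is retained from that table.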

source

align_dfs

 align_dfs (df_from, df_to)

Align columns structure of df_from to df_to.

Exported source
def align_dfs(df_from, df_to):
    "Align columns structure of df_from to df_to."
    df = {}  # column name -> values taken from df_from, or NaN when absent
    for c in df_to.columns:
        df[c] = df_from[c].values if c in df_from.columns else np.nan
    return pd.DataFrame(df)
align_dfs(df_clos1F, df_coastal_water).head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-0-1 NaN NaN NaN ND 1.5 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 4.7 NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 NaN NaN NaN ND 1.1 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 2.9 NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 NaN NaN NaN ND 0.66 ND 0.49 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 49 columns
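As a possible alternative to the explicit loop, pandas' `DataFrame.reindex` can produce the same column structure in one call. The frames below are hypothetical stand-ins for `df_clos1F` (source) and `df_coastal_water` (target); this is a sketch, not the handler's implementation.

```python
import pandas as pd

# Hypothetical toy frames: 'extra' only exists in the source, 'b' only in the target
df_from = pd.DataFrame({'a': [1, 2], 'extra': ['x', 'y']})
df_to = pd.DataFrame({'a': [0], 'b': [0.0]})

# Keep only the target's columns, in the target's order; missing ones are filled with NaN
aligned = df_from.reindex(columns=df_to.columns)
```

One difference to keep in mind: `reindex` preserves the row index of `df_from`, whereas `align_dfs` builds a new frame from raw values and therefore starts again from a default integer index.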


source

concat_dfs

 concat_dfs (df_coastal_water, df_clos1F)

Align close1F_water.xlsx measurements to the coastal_water.csv columns and concatenate both.

Exported source
def concat_dfs(df_coastal_water, df_clos1F):
    "Align close1F_water.xlsx measurements to the coastal_water.csv columns and concatenate both."
    df_clos1F = align_dfs(df_clos1F, df_coastal_water)
    df = pd.concat([df_coastal_water, df_clos1F])
    return df
df_meas = concat_dfs(df_coastal_water, df_clos1F)
df_meas.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:15:00
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:45:00
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 14:28:00
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 15:06:00
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN NaN NaN NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00

5 rows × 49 columns


source

georef_data

 georef_data (df_meas, df_locs)

Georeference measurements dataframe using locations dataframe.

Exported source
def georef_data(df_meas, df_locs):
    "Georeference measurements dataframe using locations dataframe."
    assert "Sampling point number" in df_meas.columns and "STATION" in df_locs.columns
    return pd.merge(df_meas, df_locs, how="inner", 
                    left_on='Sampling point number', right_on='STATION')
df_meas_georef = georef_data(df_meas, df_locs)
df_meas_georef.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns
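Because `georef_data` uses an inner join, measurements whose station code has no match in the locations table are dropped silently. One way to list such codes beforehand is a left join with pandas' `indicator=True` option. The sketch below uses toy frames; `T-?` is an invented code and the `value` column is hypothetical.

```python
import pandas as pd

meas = pd.DataFrame({'Sampling point number': ['T-3', 'T-4', 'T-?'],
                     'value': [1.0, 2.0, 3.0]})
locs = pd.DataFrame({'STATION': ['T-3', 'T-4'],
                     'LON': [141.026389, 141.013889],
                     'LAT': [37.322222, 37.241667]})

# A left join keeps every measurement; '_merge' tells where each row was found
check = pd.merge(meas, locs, how='left',
                 left_on='Sampling point number', right_on='STATION',
                 indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only',
                      'Sampling point number'].unique()
```

An empty `unmatched` array would indicate that the inner join loses no measurement.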


source

load_data

 load_data (fname_coastal_water, fname_clos1F, fname_iaea_orbs)

Load, align and georeference TEPCO data

Exported source
def load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs):
    "Load, align and georeference TEPCO data"
    df_locs = concat_locs(
        [get_locs_coastal_water(fname_coastal_water), 
         get_locs_clos1F(fname_clos1F),
         get_locs_orbs(fname_iaea_orbs)])
    df_meas = concat_dfs(get_coastal_water_df(fname_coastal_water), get_clos1F_df(fname_clos1F))
    df_meas.dropna(subset=['Sampling point number'], inplace=True)
    return {'SEAWATER': georef_data(df_meas, df_locs)}
dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
dfs['SEAWATER'].head()
100%|██████████| 11/11 [00:04<00:00,  2.31it/s]
100%|██████████| 11/11 [00:06<00:00,  1.77it/s]
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

print(f"# of rows, cols: {dfs['SEAWATER'].shape}")
dfs['SEAWATER'].head()
# of rows, cols: (47526, 53)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

dfs['SEAWATER'].STATION.unique()
array(['T-3', 'T-4', 'T-5', 'T-7', 'T-11', 'T-12', 'T-14', 'T-18', 'T-20',
       'T-22', 'T-MA', 'T-M10', 'T-A', 'T-D', 'T-E', 'T-B', 'T-C',
       'T-MG1', 'T-MG2', 'T-MG3', 'T-MG4', 'T-MG5', 'T-MG6', 'T-D1',
       'T-D5', 'T-D9', 'T-E1', 'T-G4', 'T-H1', 'T-S5', 'T-S6', 'T-17-1',
       'T-B3', 'T-13-1', 'T-S3', 'T-S4', 'T-B4', 'T-S1', 'T-S2', 'T-MG0',
       'T-Z', 'T-B1', 'T-B2', 'T-S7', 'T-S8', 'T-0', 'T-4-1', 'T-4-2',
       'T-6', 'T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)
np.sum(dfs['SEAWATER'] == "ND")
Sampling point number                                 0
Collection layer of seawater                          0
131I radioactivity concentration (Bq/L)            8642
131I detection limit (Bq/L)                           0
134Cs radioactivity concentration (Bq/L)          29544
134Cs detection limit (Bq/L)                          0
137Cs radioactivity concentration (Bq/L)          16587
137Cs detection limit (Bq/L)                          0
132I radioactivity concentration (Bq/L)               3
132I detection limit (Bq/L)                           0
132Te radioactivity concentration (Bq/L)              0
132Te detection limit (Bq/L)                          0
136Cs radioactivity concentration (Bq/L)              2
136Cs detection limit (Bq/L)                          0
140La radioactivity concentration (Bq/L)              0
140La detection limit (Bq/L)                          0
89Sr radioactivity concentration (Bq/L)             101
89Sr detection limit (Bq/L)                           0
90Sr radioactivity concentration (Bq/L)             339
90Sr detection limit (Bq/L)                           0
238Pu radioactivity concentration (Bq/L)            303
238Pu detection limit (Bq/L)                          0
239Pu+240Pu radioactivity concentration (Bq/L)      225
239Pu+240Pu statistical error (Bq/L)                  0
239Pu+240Pu detection limit (Bq/L)                    0
Total alpha radioactivity concentration (Bq/L)      946
Total alpha detection limit (Bq/L)                    0
Total beta radioactivity concentration (Bq/L)      4770
Total beta detection limit (Bq/L)                     0
140Ba radioactivity concentration (Bq/L)              0
140Ba detection limit (Bq/L)                          0
106Ru radioactivity concentration (Bq/L)              0
106Ru detection limit (Bq/L)                          0
58Co radioactivity concentration (Bq/L)               3
58Co detection limit (Bq/L)                           0
60Co radioactivity concentration (Bq/L)               9
60Co detection limit (Bq/L)                           0
144Ce radioactivity concentration (Bq/L)              9
144Ce detection limit (Bq/L)                          0
54Mn radioactivity concentration (Bq/L)               9
54Mn detection limit (Bq/L)                           0
3H radioactivity concentration (Bq/L)              8762
3H detection limit (Bq/L)                             0
125Sb radioactivity concentration (Bq/L)            647
125Sb detection limit (Bq/L)                          0
105Ru radioactivity concentration (Bq/L)              0
105Ru detection limit (Bq/L)                          0
Unnamed: 49                                           0
TIME                                                  0
STATION                                               0
LON                                                   0
LAT                                                   0
org                                                   0
dtype: int64
dfs['SEAWATER'][['TIME', '134Cs radioactivity concentration (Bq/L)', '134Cs detection limit (Bq/L)']]
TIME 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L)
0 2011/3/21 23:15:00 4.8E+01 9.2E+00
1 2011/3/21 23:45:00 3.1E+01 8.7E+00
2 2011/3/22 14:28:00 4.6E+01 1.4E+01
3 2011/3/22 15:06:00 3.9E+01 1.1E+01
4 2011/3/23 13:51:00 5.1E+01 2.0E+01
... ... ... ...
47521 2024/12/23 07:36:00 NaN NaN
47522 2024/12/30 07:51:00 ND 0.27
47523 2025/01/06 07:43:00 ND 0.27
47524 2025/01/13 07:55:00 ND 0.3
47525 2025/01/20 08:08:00 ND 0.29

47526 rows × 3 columns

Remove 約 (about) character

FEEDBACK TO DATA PROVIDER

We systematically remove the 約 character. Please confirm that this is the correct way to handle it. Reporting the uncertainty explicitly would be less ambiguous in the future.


source

RemoveJapanaseCharCB

 RemoveJapanaseCharCB ()

Remove 約 (about) char

Exported source
class RemoveJapanaseCharCB(Callback):
    "Remove 約 (about) char"
    def _transform_if_about(self, value, about_char='約'):
        if pd.isna(value): return value
        return (value.replace(about_char, '') if str(value).count(about_char) != 0 
                else value)
    
    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_about)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB()])

tfm()['SEAWATER'].sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
38687 T-1 上層 NaN NaN ND 0.71 ND 0.85 NaN NaN ... NaN NaN NaN NaN NaN 2023/07/03 07:30:00 T-1 141.034444 37.431111 TEPCO
12847 T-MG2 下層 NaN NaN ND 1.5E-03 1.3E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2017/5/9 08:21:00 T-MG2 141.666667 38.300000 TEPCO
36678 T-1 上層 NaN NaN ND 0.39 ND 0.52 NaN NaN ... NaN NaN NaN NaN NaN 2019/04/26 08:17:00 T-1 141.034444 37.431111 TEPCO
1654 T-MA 上層 ND 4.0E+00 ND 6.0E+00 ND 9.0E+00 NaN NaN ... NaN NaN NaN NaN NaN 2011/10/18 06:45:00 T-MA 141.083333 37.750000 TEPCO
31288 T-0-3 NaN NaN NaN ND 0.28 ND 0.31 NaN NaN ... NaN NaN NaN NaN NaN 2022/09/05 07:26:00 T-0-3 141.040278 37.416111 TEPCO
5136 T-A 上層 NaN NaN ND 1.2E+00 ND 1.2E+00 NaN NaN ... NaN NaN NaN NaN NaN 2013/5/14 07:19:00 T-A 140.763889 36.713889 TEPCO
36680 T-1 上層 NaN NaN ND 0.6 ND 0.66 NaN NaN ... NaN NaN NaN NaN NaN 2019/04/28 07:55:00 T-1 141.034444 37.431111 TEPCO
11466 T-A 上層 NaN NaN ND 8.9E-01 ND 1.2E+00 NaN NaN ... NaN NaN NaN NaN NaN 2016/7/11 09:48:00 T-A 140.763889 36.713889 TEPCO
2382 T-4 上層 ND 6.7E-01 ND 9.1E-01 ND 1.0E+00 NaN NaN ... NaN NaN NaN NaN NaN 2012/1/12 08:00:00 T-4 141.013889 37.241667 TEPCO
1086 T-MG5 上層 ND 4.0E+00 ND 6.0E+00 ND 9.0E+00 NaN NaN ... NaN NaN NaN NaN NaN 2011/8/9 08:34:00 T-MG5 141.250000 38.166667 TEPCO

10 rows × 53 columns
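As a standalone illustration of the clean-up performed by the callback above, here is a minimal sketch on a toy column. The values are invented for the example and are not taken from the TEPCO files; the sketch uses the vectorised `Series.str.replace` rather than the callback's element-wise mapping.

```python
import pandas as pd

# Toy '(Bq/L)' column stored as strings (illustrative values only).
s = pd.Series(['約1.2E+01', 'ND', None, '0.27'], dtype='object')

# Strip the 約 marker wherever it occurs; missing values stay missing.
cleaned = s.str.replace('約', '', regex=False)

print(cleaned.tolist())
```

Note that the "about" qualifier is lost after this step, which is why confirmation from the data provider is requested.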

Fix values range string

FEEDBACK TO DATA PROVIDER

Value ranges are provided as strings (e.g. ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’). We replace each range by the mean of its two bounds. Please confirm that this is the correct way to handle them. Again, reporting an explicit uncertainty would be less ambiguous in future.


source

FixRangeValueStringCB

 FixRangeValueStringCB ()

Replace range values (e.g ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’) by their mean

Exported source
class FixRangeValueStringCB(Callback):
    "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean"
    
    def _extract_and_calculate_mean(self, s):
        # Extract every number (plain or scientific notation) and average them
        float_strings = re.findall(r"[+-]?\d+\.?\d*E?[+-]?\d*", s)
        if float_strings:
            float_numbers = np.array(float_strings, dtype=float)
            return float_numbers.mean()
        return s
    
    def _transform_if_range(self, value):
        if pd.isna(value): 
            return value
        value = str(value)
        # Check for both range patterns
        if '<&<' in value or '~' in value:
            return self._extract_and_calculate_mean(value)
        return value

    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns 
                       if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            # tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range).astype(float)
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME STATION LON LAT org
5799 T-A 上層 NaN NaN ND 1.2E+00 ND 1.3E+00 NaN NaN ... NaN NaN NaN NaN NaN 2013/9/9 07:22:00 T-A 140.763889 36.713889 TEPCO
27769 T-0-1A NaN NaN NaN ND 0.67 ND 0.73 NaN NaN ... NaN NaN NaN NaN NaN 2016/02/22 08:13 T-0-1A 141.046667 37.430556 TEPCO
17986 T-11 下層 NaN NaN ND 1.4E-03 1.9E-02 NaN NaN NaN ... NaN NaN NaN NaN NaN 2020/5/18 08:57:00 T-11 141.047222 37.241667 TEPCO
12433 T-5 上層 NaN NaN ND 1.2E-03 2.6E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2017/2/9 08:22:00 T-5 141.200000 37.416667 TEPCO
18480 T-D1 上層 NaN NaN ND 1.1E-03 4.9E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2020/9/4 07:54:00 T-D1 141.072222 37.500000 TEPCO
6019 T-4-2 上層 NaN NaN 1.3E-01 NaN 3.1E-01 NaN NaN NaN ... NaN NaN NaN NaN NaN 2013/10/24 10:50:00 T-4-2 37.210000 141.010000 coastal_seawater.csv
43266 T-2 上層 NaN NaN ND 0.58 ND 0.71 NaN NaN ... NaN NaN NaN NaN NaN 2022/09/27 08:25:00 T-2 141.033611 37.415833 TEPCO
31559 T-0-3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2024/04/08 08:41:00 T-0-3 141.040278 37.416111 TEPCO
9845 T-MG5 中層 NaN NaN ND 1.4E-03 3.6E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2015/9/2 09:03:00 T-MG5 141.250000 38.166667 TEPCO
25570 T-S3 上層 NaN NaN ND 1.4E-03 2.1E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2024/9/10 06:44:00 T-S3 141.078889 37.458333 TEPCO

10 rows × 53 columns
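To make the averaging rule concrete, the following sketch applies it to the two range formats quoted above. The helper name `range_to_mean` and the regular expression are assumptions of this example (the pattern is slightly stricter than the one used in the callback); it is not the callback itself.

```python
import re
import numpy as np

# Plain or scientific-notation floats (pattern chosen for this sketch).
FLOAT_RE = re.compile(r'[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?')

def range_to_mean(value: str) -> float:
    "Return the mean of all numbers found in a range string."
    bounds = [float(x) for x in FLOAT_RE.findall(value)]
    return float(np.mean(bounds))

mean_sci = range_to_mean('4.0E+00<&<8.0E+00')  # bounds 4.0 and 8.0
mean_tilde = range_to_mean('1.0~2.7')          # bounds 1.0 and 2.7
print(mean_sci, mean_tilde)
```

Replacing a range by its midpoint discards the width of the interval, so downstream users cannot tell such values apart from point measurements.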

Select columns of interest

We keep only the columns of interest: the common columns (location, time and station) and the radionuclide columns, identified by the ‘(Bq/L)’ pattern in their names.


source

SelectColsOfInterestCB

 SelectColsOfInterestCB (common_coi, nuclides_pattern)

Select columns of interest.

Exported source
common_coi = ['LON', 'LAT', 'TIME', 'STATION']
nuclides_pattern = '(Bq/L)'
Exported source
class SelectColsOfInterestCB(Callback):
    "Select columns of interest."
    def __init__(self, common_coi, nuclides_pattern): fc.store_attr()
    def __call__(self, tfm):
        nuc_of_interest = [c for c in tfm.dfs['SEAWATER'].columns if self.nuclides_pattern in c]
        tfm.dfs['SEAWATER'] = tfm.dfs['SEAWATER'][self.common_coi + nuc_of_interest]
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) ... 144Ce radioactivity concentration (Bq/L) 144Ce detection limit (Bq/L) 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L)
46278 141.050761 37.440794 2023/05/15 07:26:00 T-A1 NaN NaN ND 0.31 ND 0.37 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
44708 37.410000 141.030000 2013/06/04 07:05:00 T-2-1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25033 141.200000 37.416667 2024/6/12 07:21:00 T-5 NaN NaN ND 1.4E-03 2.1E-03 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
45837 37.410000 141.030000 2015/12/09 06:10:00 T-2-1 ND 0.63 ND 0.77 ND 0.86 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
45341 37.410000 141.030000 2014/10/18 05:35:00 T-2-1 ND 0.87 ND 0.47 ND 0.83 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 49 columns

Reshape: wide to long

We reshape the table from wide to long so that the information embedded in the column names (nuclide name, unit, and value type such as detection limit or uncertainty) can be extracted.


source

WideToLongCB

 WideToLongCB (id_vars=['LON', 'LAT', 'TIME', 'STATION'])

Get TEPCO nuclide names as values not column names to extract contained information (nuclide name, unc, dl, …).

Exported source
class WideToLongCB(Callback):
    """
    Get TEPCO nuclide names as values not column names 
    to extract contained information (nuclide name, unc, dl, ...).
    """
    def __init__(self, id_vars=['LON', 'LAT', 'TIME', 'STATION']): 
        fc.store_attr()
        
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'] = pd.melt(tfm.dfs['SEAWATER'], id_vars=self.id_vars)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.head()
LON LAT TIME STATION variable value
0 141.026389 37.322222 2011/3/21 23:15:00 T-3 131I radioactivity concentration (Bq/L) 1.1E+03
1 141.013889 37.241667 2011/3/21 23:45:00 T-4 131I radioactivity concentration (Bq/L) 6.6E+02
2 141.026389 37.322222 2011/3/22 14:28:00 T-3 131I radioactivity concentration (Bq/L) 1.1E+03
3 141.013889 37.241667 2011/3/22 15:06:00 T-4 131I radioactivity concentration (Bq/L) 6.7E+02
4 141.026389 37.322222 2011/3/23 13:51:00 T-3 131I radioactivity concentration (Bq/L) 7.4E+02
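The effect of `pd.melt` is easier to see on a small table. The sketch below uses a toy frame with two measurement columns; the station names and values are illustrative only.

```python
import pandas as pd

wide = pd.DataFrame({
    'STATION': ['T-1', 'T-2'],
    '134Cs radioactivity concentration (Bq/L)': ['ND', '1.3E-01'],
    '134Cs detection limit (Bq/L)': ['0.27', None],
})

# Each non-id column becomes a (variable, value) pair per original row.
long = pd.melt(wide, id_vars=['STATION'])
print(long)
```

Two rows and two measurement columns yield four long-format rows, with the former column names available as plain strings in `variable`.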

Extract

The nuclide name, unit and value type (measurement, detection limit or uncertainty) are extracted from the column names, where the TEPCO data source embeds them.

Nuclide name


source

extract_nuclide

 extract_nuclide (text:str)

Extract the nuclide identifier from a measurement variable name using regex.

Exported source
def extract_nuclide(text: str) -> str:
    "Extract the nuclide identifier from a measurement variable name using regex."
    pattern = r'^(Total\s+(?:alpha|beta)|[^\s]+)'
    match = re.match(pattern, text, re.IGNORECASE)
    return match.group(1) if match else text

For instance:

print(extract_nuclide("Total alpha radioactivity concentration (Bq/L)"))
print(extract_nuclide("131I radioactivity concentration (Bq/L)"))
Total alpha
131I

source

ExtractNuclideNameCB

 ExtractNuclideNameCB (src_col='variable', dest_col='NUCLIDE')

Extract nuclide name from TEPCO data.

Exported source
class ExtractNuclideNameCB(Callback):
    "Extract nuclide name from TEPCO data."
    def __init__(self, src_col='variable', dest_col='NUCLIDE'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'][self.dest_col] = tfm.dfs['SEAWATER'][self.src_col].map(extract_nuclide)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION variable value NUCLIDE
1477205 141.200000 37.233333 2012/9/14 07:20:00 T-7 58Co radioactivity concentration (Bq/L) NaN 58Co
1798737 141.033611 37.415833 2016/10/25 07:15 T-2 54Mn radioactivity concentration (Bq/L) NaN 54Mn
618505 141.200000 37.416667 2011/6/30 09:00:00 T-5 140La detection limit (Bq/L) NaN 140La
580255 141.200000 37.416667 2015/9/24 08:20:00 T-5 140La radioactivity concentration (Bq/L) NaN 140La
928568 141.082500 37.428611 2024/9/10 07:14:00 T-S4 238Pu detection limit (Bq/L) NaN 238Pu

Unit


source

ExtractUnitCB

 ExtractUnitCB (src_col='variable', dest_col='UNIT')

Extract unit from TEPCO data.

Exported source
class ExtractUnitCB(Callback):
    "Extract unit from TEPCO data."
    def __init__(self, src_col='variable', dest_col='UNIT'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'][self.dest_col] = tfm.dfs['SEAWATER'][self.src_col].str.extract(r'\((.*?)\)')
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION variable value NUCLIDE UNIT
559025 141.034444 37.431111 2018/05/25 08:10:00 T-1 136Cs detection limit (Bq/L) NaN 136Cs Bq/L
1290274 141.082500 37.428611 2014/4/24 06:11:00 T-S4 140Ba radioactivity concentration (Bq/L) NaN 140Ba Bq/L
1085206 141.033611 37.415833 2011/06/24 13:45:00 T-2 239Pu+240Pu detection limit (Bq/L) NaN 239Pu+240Pu Bq/L
1577066 141.047222 37.241667 2015/2/17 10:52:00 T-11 60Co radioactivity concentration (Bq/L) NaN 60Co Bq/L
1435763 141.583333 38.233333 2015/10/1 08:39:00 T-MG3 106Ru detection limit (Bq/L) NaN 106Ru Bq/L
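For reference, the unit extraction can be tried in isolation. This sketch applies the same regular expression to two variable names of the kind shown above; `expand=False` is used here so that a Series is returned directly, which is a choice of this example rather than of the callback.

```python
import pandas as pd

variables = pd.Series([
    '131I radioactivity concentration (Bq/L)',
    'Total beta detection limit (Bq/L)',
])

# Capture whatever sits between the first pair of parentheses.
units = variables.str.extract(r'\((.*?)\)', expand=False)
print(units.tolist())
```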

Value type

Is the value a measurement, or a derived quantity such as a detection limit or an uncertainty?


source

ExtractValueTypeCB

 ExtractValueTypeCB (src_col='variable', dest_col='type')

Extract value type from TEPCO data.

Exported source
class ExtractValueTypeCB(Callback):
    "Extract value type from TEPCO data."
    def __init__(self, src_col='variable', dest_col='type'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'][self.dest_col] = np.select(
            [
                tfm.dfs['SEAWATER'][self.src_col].str.contains('detection limit', case=False),
                tfm.dfs['SEAWATER'][self.src_col].str.contains('statistical error', case=False)],
            ['DL', 'UNC'],
            default='VALUE'
        )
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
LON LAT TIME STATION variable value NUCLIDE UNIT type
893047 141.034444 37.431111 2021/03/09 07:25:00 T-1 238Pu radioactivity concentration (Bq/L) NaN 238Pu Bq/L VALUE
1151734 141.148611 37.348333 2016/4/25 06:42:00 T-B4 Total alpha detection limit (Bq/L) NaN Total alpha Bq/L DL
2092536 141.000000 38.083333 2011/9/12 07:26:00 T-MG6 105Ru detection limit (Bq/L) NaN 105Ru Bq/L DL
60592 141.062500 37.552778 2017/6/23 08:09:00 T-14 131I detection limit (Bq/L) NaN 131I Bq/L DL
908911 141.200000 37.416667 2013/10/4 09:37:00 T-5 238Pu detection limit (Bq/L) 4.5E-06 238Pu Bq/L DL
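The classification logic can be checked on a handful of names. In this sketch the third variable name is hypothetical: it is built only to exercise the 'statistical error' branch that the callback looks for, and may not match an actual TEPCO column.

```python
import numpy as np
import pandas as pd

variables = pd.Series([
    '134Cs radioactivity concentration (Bq/L)',
    '134Cs detection limit (Bq/L)',
    '134Cs statistical error (Bq/L)',  # hypothetical name for illustration
])

# Conditions are evaluated in order; names matching neither are measurements.
types = np.select(
    [variables.str.contains('detection limit', case=False),
     variables.str.contains('statistical error', case=False)],
    ['DL', 'UNC'],
    default='VALUE',
)
print(types.tolist())
```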

Reshape: long to wide

Pivot the values of the type column (VALUE, DL, UNC) into column names, so that each sample has one row per nuclide.


source

LongToWideCB

 LongToWideCB (src_col='variable', dest_col='type')

Reshape: long to wide

Exported source
class LongToWideCB(Callback):
    "Reshape: long to wide"
    def __init__(self, src_col='variable', dest_col='type'): fc.store_attr()
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER'] = pd.pivot_table(
            tfm.dfs['SEAWATER'],
            values='value',
            index=['LON', 'LAT', 'TIME', 'STATION', 'NUCLIDE', 'UNIT'],
            columns='type',
            aggfunc='first'
        ).reset_index()
        # Expose the row index as a sample identifier column (SMP_ID)
        tfm.dfs['SEAWATER'].reset_index(inplace=True)
        tfm.dfs['SEAWATER'].rename(columns={'index': 'SMP_ID'}, inplace=True)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
67298 67298 141.050761 37.424686 2024/10/21 07:27:00 T-A2 134Cs Bq/L 0.4 NaN ND
77038 77038 141.078889 37.458333 2021/10/12 06:18:00 T-S3 134Cs Bq/L 1.4E-03 NaN ND
35118 35118 141.034444 37.431111 2012/08/13 07:40:00 T-1 Total alpha Bq/L 0.087 NaN ND
24378 24378 141.033611 37.415833 2018/02/09 06:45:00 T-2 137Cs Bq/L 0.58 NaN ND
21398 21398 141.033611 37.415833 2012/05/31 08:30:00 T-2 137Cs Bq/L 1.6 NaN ND
df_test[df_test.VALUE == 'ND'].groupby('NUCLIDE').size().sort_values(ascending=False)
NUCLIDE
134Cs          24098
137Cs          15806
3H              8233
131I            7958
Total beta      4764
Total alpha      946
125Sb            647
90Sr             338
238Pu            302
239Pu+240Pu      225
89Sr             100
144Ce              9
54Mn               9
60Co               9
58Co               3
132I               3
136Cs              2
dtype: int64
df_test[df_test.VALUE == 'ND']
type ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
0 0 37.210000 141.01 2012/10/16 07:25:00 T-4-1 131I Bq/L 1.3E-01 NaN ND
1 1 37.210000 141.01 2012/10/16 07:25:00 T-4-1 134Cs Bq/L 1.9E-01 NaN ND
2 2 37.210000 141.01 2012/10/16 07:25:00 T-4-1 137Cs Bq/L 2.7E-01 NaN ND
3 3 37.210000 141.01 2012/10/2 07:30:00 T-4-1 131I Bq/L 1.1E-01 NaN ND
4 4 37.210000 141.01 2012/10/2 07:30:00 T-4-1 134Cs Bq/L 2.2E-01 NaN ND
... ... ... ... ... ... ... ... ... ... ...
89626 89626 141.666667 38.30 2024/7/9 08:19:00 T-MG2 134Cs Bq/L 1.4E-03 NaN ND
89628 89628 141.666667 38.30 2024/8/6 07:38:00 T-MG2 134Cs Bq/L 1.1E-03 NaN ND
89630 89630 141.666667 38.30 2024/8/6 07:51:00 T-MG2 134Cs Bq/L 1.3E-03 NaN ND
89632 89632 141.666667 38.30 2024/9/10 08:19:00 T-MG2 134Cs Bq/L 1.1E-03 NaN ND
89634 89634 141.666667 38.30 2024/9/10 08:32:00 T-MG2 134Cs Bq/L 1.1E-03 NaN ND

63452 rows × 10 columns

df_test.VALUE == 'ND'
0         True
1         True
2         True
3         True
4         True
         ...  
89631    False
89632     True
89633    False
89634     True
89635    False
Name: VALUE, Length: 89636, dtype: bool
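The pivot step can be reproduced on a toy long-format table. The sketch below is a simplified version with fewer index columns and illustrative values; it shows how `aggfunc='first'` lets `pivot_table` work on string values such as 'ND'.

```python
import pandas as pd

long = pd.DataFrame({
    'STATION': ['T-1', 'T-1', 'T-2'],
    'NUCLIDE': ['134Cs', '134Cs', '137Cs'],
    'type': ['VALUE', 'DL', 'VALUE'],
    'value': ['ND', '0.27', '1.3E-01'],
})

# One row per (STATION, NUCLIDE); one column per value type.
wide = pd.pivot_table(
    long,
    values='value',
    index=['STATION', 'NUCLIDE'],
    columns='type',
    aggfunc='first',
).reset_index()
print(wide)
```

Because 'first' keeps a single value per group, any duplicated (index, type) combination in the real data would be silently reduced to its first occurrence.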

Remap UNIT name to MARIS nomenclature


source

RemapUnitNameCB

 RemapUnitNameCB (unit_mapping)

Remap UNIT name to MARIS id.

Exported source
unit_mapping = {'Bq/L': 3}
Exported source
class RemapUnitNameCB(Callback):
    """
    Remap `UNIT` name to MARIS id.
    """
    def __init__(self, unit_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['UNIT'] = tfm.dfs['SEAWATER']['UNIT'].map(self.unit_mapping)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
49628 49628 141.040278 37.416111 2019/10/29 07:51:00 T-0-3 3H 3 0.88 NaN NaN
76353 76353 141.078889 37.458333 2022/6/22 10:03:00 T-S3 3H 3 NaN NaN 1.4E-01
7924 7924 140.702222 35.987500 2017/1/25 12:47:00 T-D 134Cs 3 9.7E-01 NaN NaN
185 185 37.410000 141.030000 2012/12/02 07:30:00 T-2-1 137Cs 3 1.3 NaN NaN
33266 33266 141.034444 37.431111 2011/06/23 13:50:00 T-1 134Cs 3 NaN NaN 19

Remap NUCLIDE name to MARIS nomenclature


source

RemapNuclideNameCB

 RemapNuclideNameCB (nuclide_mapping)

Remap NUCLIDE name to MARIS id.

Exported source
nuclide_mapping = {
    '131I': 29,
    '134Cs': 31,
    '137Cs': 33,
    '125Sb': 24,
    'Total beta': 103,
    '238Pu': 67,
    '239Pu+240Pu': 77,
    '3H': 1,
    '89Sr': 11,
    '90Sr': 12,
    'Total alpha': 104,
    '132I': 100,
    '136Cs': 102,
    '58Co': 8,
    '105Ru': 97,
    '106Ru': 17,
    '140La': 35,
    '140Ba': 34,
    '132Te': 99,
    '60Co': 9,
    '144Ce': 37,
    '54Mn': 6
}
Exported source
class RemapNuclideNameCB(Callback):
    "Remap `NUCLIDE` name to MARIS id."
    def __init__(self, nuclide_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['NUCLIDE'] = tfm.dfs['SEAWATER']['NUCLIDE'].map(self.nuclide_mapping)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
20803 20803 141.033611 37.415833 2012/06/12 08:20:00 T-2 31 3 1.2 NaN NaN
58045 58045 141.046667 37.416111 2023/10/23 07:53:00 T-0-3A 33 3 0.32 NaN NaN
81066 81066 141.200000 37.416667 2015/6/23 07:54:00 T-5 31 3 1.9E-03 NaN NaN
74961 74961 141.072222 37.500000 2020/4/8 08:20:00 T-D1 31 3 1.2E-03 NaN NaN
63729 63729 141.047222 37.241667 2011/6/3 07:40:00 T-11 31 3 NaN NaN 1.1E+01
df_test.dropna(subset=['DL', 'VALUE'], how='any')
type ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE
154 154 37.410000 141.030000 2012/11/26 08:10:00 T-2-1 31 3 1.1 NaN 0.62
155 155 37.410000 141.030000 2012/11/26 08:10:00 T-2-1 33 3 1.5 NaN 1.1
213 213 37.410000 141.030000 2012/12/10 08:00:00 T-2-1 31 3 1.1 NaN 0.44
214 214 37.410000 141.030000 2012/12/10 08:00:00 T-2-1 33 3 1.5 NaN 0.69
239 239 37.410000 141.030000 2012/12/17 09:10:00 T-2-1 31 3 1.1 NaN 0.32
... ... ... ... ... ... ... ... ... ... ...
83047 83047 141.233333 37.516667 2015/10/23 06:50:00 T-B2 31 3 1.4E-03 NaN 2.0E-03
83055 83055 141.233333 37.516667 2015/3/18 06:18:00 T-B2 31 3 1.4E-03 NaN 2.0E-03
83059 83059 141.233333 37.516667 2015/5/26 06:14:00 T-B2 31 3 1.4E-03 NaN 1.6E-03
83906 83906 141.250000 38.166667 2015/11/5 08:50:00 T-MG5 31 3 1.8E-03 NaN 1.5E-03
83958 83958 141.250000 38.166667 2015/5/22 08:10:00 T-MG5 31 3 1.7E-03 NaN 1.7E-03

3451 rows × 10 columns

Remap VALUE, DL, DLV

We remap VALUE, DL and DLV to the MARIS conventions as follows:

  • First, check whether the activity (VALUE) is reported as “ND” and, if so, use the reported detection limit (DL) when there is one:
if VALUE is "ND":
    if not DL: 
        VALUE, DLV, DL = NaN, NaN, 3
    else:
        VALUE, DLV, DL = DL, DL, 2
  • Otherwise, if an activity (VALUE) is reported, keep it:
elif VALUE:
    VALUE, DLV, DL = VALUE, DL, 1
  • Finally, if no activity is reported, fall back on the reported detection limit (DL):
else:
    if DL:
        VALUE, DLV, DL = DL, DL, 2
    else:
        VALUE, DLV, DL = NaN, NaN, NaN (should be dropped)

The resulting DL codes are 1: detected value (=), 2: detection limit (<) and 3: not detected (ND), where:

  • VALUE is the activity reported by TEPCO
  • DL is initially the detection limit reported by TEPCO and is then remapped to the MARIS detection limit nomenclature (categorical)
  • DLV is the detection limit value as reported by TEPCO (copied from DL)
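The rules above can be written as a small pure function, which makes each branch easy to test. This is a sketch only: the function name `remap` is hypothetical, and missing entries are represented by `None` in the results rather than NaN to keep the example dependency-free.

```python
import math

def remap(value, dl):
    "Return (VALUE, DLV, DL code) for one record, following the rules above."
    def missing(x):
        # Treat both None and float NaN as "not reported".
        return x is None or (isinstance(x, float) and math.isnan(x))

    if value == 'ND':
        return (None, None, 3) if missing(dl) else (dl, dl, 2)
    if missing(value):
        return (None, None, None) if missing(dl) else (dl, dl, 2)
    return (value, dl, 1)

print(remap('ND', '0.27'))    # not detected, detection limit known
print(remap('ND', None))      # not detected, no detection limit
print(remap('1.3E-01', None)) # detected value
print(remap(None, '0.88'))    # only a detection limit is reported
```

Keeping the decision in one function that returns all three fields avoids the intermediate states that arise when the columns are updated one after another.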

source

RemapVALUE_DL_DLV_CB

 RemapVALUE_DL_DLV_CB ()

Remap DL, DLV, VALUE based on TEPCO -> MARIS rules.

Exported source
class RemapVALUE_DL_DLV_CB(Callback):
    "Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules."    
    def map_all_columns(self, row):
        """Map all three columns (VALUE, DL, DLV) at once based on TEPCO rules"""
        value, dl = row['VALUE'], row['DL']
        new_value, new_dlv, new_dl = value, dl, 1
        
        if value == 'ND':
            if pd.isna(dl):
                new_value, new_dlv, new_dl = np.nan, np.nan, 3
            else:
                new_value, new_dlv, new_dl = dl, dl, 2
                
        elif pd.isna(value):
            if pd.isna(dl):
                new_value, new_dlv, new_dl = np.nan, np.nan, np.nan
            else:
                new_value, new_dlv, new_dl = dl, dl, 2
                
        return pd.Series({
            'VALUE': new_value,
            'DLV': new_dlv, 
            'DL': new_dl
        })
        
    def __call__(self, tfm):
        mapped = tfm.dfs['SEAWATER'].apply(self.map_all_columns, axis=1)
        tfm.dfs['SEAWATER'][['VALUE', 'DLV', 'DL']] = mapped
        tfm.dfs['SEAWATER']['DL'] = tfm.dfs['SEAWATER']['DL'].astype(int)
        tfm.dfs['SEAWATER']['VALUE'] = tfm.dfs['SEAWATER']['VALUE'].astype(float)
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(20)
type ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
28281 28281 141.033611 37.415833 2021/01/12 07:03:00 T-2 31 3 2 NaN 0.8100 0.81
3621 3621 37.410000 141.030000 2014/12/17 06:05:00 T-2-1 103 3 1 NaN 15.0000 NaN
8443 8443 140.702222 35.987500 2016/7/13 13:20:00 T-D 33 3 2 NaN 1.2000 1.2E+00
86787 86787 141.583333 38.233333 2014/1/24 08:53:00 T-MG3 31 3 2 NaN 0.0022 2.2E-03
16846 16846 141.026389 37.322222 2011/4/27 08:40:00 T-3 29 3 1 NaN 13.0000 NaN
41804 41804 141.034444 37.431111 2018/03/29 07:00:00 T-1 29 3 2 NaN 0.5500 0.55
53540 53540 141.040278 37.430556 2023/05/01 06:54:00 T-0-1 33 3 2 NaN 0.3100 0.31
78995 78995 141.133333 38.250000 2015/10/1 08:59:00 T-MG4 33 3 1 NaN 0.0065 NaN
87337 87337 141.583333 38.233333 2021/11/17 08:55:00 T-MG3 31 3 2 NaN 0.0014 1.4E-03
81259 81259 141.200000 37.416667 2012/4/13 10:20:00 T-5 31 3 1 NaN 0.0320 NaN
41616 41616 141.034444 37.431111 2018/01/29 06:55:00 T-1 33 3 2 NaN 0.5900 0.59
42921 42921 141.034444 37.431111 2019/03/01 08:20:00 T-1 33 3 2 NaN 0.7500 0.75
78661 78661 141.133333 38.250000 2012/6/27 08:41:00 T-MG4 33 3 1 NaN 0.0220 NaN
55120 55120 141.040556 37.478889 2020/10/6 11:40:00 T-6 1 3 2 NaN 0.3100 3.1E-01
41257 41257 141.034444 37.431111 2017/10/10 07:40:00 T-1 29 3 2 NaN 0.5500 0.55
60895 60895 141.046667 37.423333 2022/05/30 07:47:00 T-0-2 33 3 2 NaN 0.3100 0.31
47455 47455 141.034444 37.431111 2024/04/15 07:25:00 T-1 1 3 1 NaN 0.3300 NaN
73130 73130 141.072222 37.416667 2019/12/3 09:12:00 T-D5 12 3 1 NaN 0.0013 NaN
66260 66260 141.050739 37.409267 2023/01/09 07:34:00 T-A3 31 3 2 NaN 0.3700 0.37
27296 27296 141.033611 37.415833 2020/03/21 06:35:00 T-2 33 3 2 NaN 0.6300 0.63

Parse & encode time


source

ParseTimeCB

 ParseTimeCB (time_name='TIME')

Parse time column from TEPCO.

Exported source
class ParseTimeCB(Callback):
    "Parse time column from TEPCO."
    def __init__(self, time_name='TIME'): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.time_name] = pd.to_datetime(tfm.dfs['SEAWATER'][self.time_name], 
                                                             format='%Y/%m/%d %H:%M:%S', errors='coerce')
tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ParseTimeCB(),
    EncodeTimeCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
Warning: 3058 missing time value(s) in SEAWATER
type ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
37478 37478 141.034444 37.431111 1407568620 T-1 33 3 2 NaN 0.570 0.57
54973 54973 141.040556 37.478889 1573551900 T-6 33 3 1 NaN 0.093 NaN
20390 20390 141.033611 37.415833 1314547500 T-2 31 3 2 NaN 21.000 21
71367 71367 141.072167 37.333333 1668502860 T-D9 103 3 1 NaN 16.000 NaN
85332 85332 141.283333 38.333333 1312886100 T-MG1 33 3 2 NaN 9.000 9.0E+00
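One sampled row earlier on this page carries a timestamp without seconds (2016/02/22 08:13). With a strict `'%Y/%m/%d %H:%M:%S'` format and `errors='coerce'`, such values may be turned into missing times, which could account for part of the warning above. Whether that is the actual cause has not been verified here. The sketch below shows one possible fallback; the helper name `parse_times` is hypothetical, and timestamps are treated as timezone-naive.

```python
import pandas as pd

def parse_times(times: pd.Series) -> pd.Series:
    "Parse 'YYYY/MM/DD HH:MM:SS' strings, falling back to 'YYYY/MM/DD HH:MM'."
    with_seconds = pd.to_datetime(times, format='%Y/%m/%d %H:%M:%S', errors='coerce')
    without_seconds = pd.to_datetime(times, format='%Y/%m/%d %H:%M', errors='coerce')
    # Keep the first successful parse; values failing both formats stay NaT.
    return with_seconds.fillna(without_seconds)

times = pd.Series(['2011/3/21 23:15:00', '2016/02/22 08:13', 'not a date'])
parsed = parse_times(times)

# Seconds since the Unix epoch; unparsed entries become NaN.
epoch_seconds = (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
print(parsed)
print(epoch_seconds)
```

Comparing the number of missing times before and after such a fallback would show how many values are affected by the format difference.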

Sanitize coordinates

tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 3058 missing time value(s) in SEAWATER
type ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
19352 19352 141.026389 37.322222 1729597920 T-3 33 3 1 NaN 0.0160 NaN
85591 85591 141.283333 38.333333 1365157920 T-MG1 33 3 1 NaN 0.0087 NaN
10898 10898 140.837222 35.796111 1564667220 T-E 33 3 2 NaN 1.2000 1.2E+00
9473 9473 140.763889 36.713889 1586160600 T-A 31 3 2 NaN 0.8200 8.2E-01
76909 76909 141.078889 37.458333 1436249940 T-S3 33 3 1 NaN 0.0100 NaN
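Several sampled rows earlier on this page (station T-2-1, for instance) show LON ≈ 37.41 and LAT ≈ 141.03, which suggests swapped coordinates, since a latitude cannot exceed 90. The sketch below shows one way to detect and swap such rows on a toy frame. It is not the implementation of `SanitizeLonLatCB`, and whether swapping or dropping is appropriate remains a decision for the handler's maintainers.

```python
import pandas as pd

df = pd.DataFrame({
    'STATION': ['T-1', 'T-2-1'],
    'LON': [141.034444, 37.41],
    'LAT': [37.431111, 141.03],
})

# A latitude outside [-90, 90] paired with a longitude inside that range
# is taken here as a sign that the two columns were swapped.
swapped = df['LAT'].abs().gt(90) & df['LON'].abs().le(90)
df.loc[swapped, ['LON', 'LAT']] = df.loc[swapped, ['LAT', 'LON']].to_numpy()
print(df)
```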

Encode to NetCDF

tfm = Transformer(dfs, cbs=[
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    ExtractNuclideNameCB(),
    ExtractUnitCB(),
    ExtractValueTypeCB(),
    LongToWideCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapVALUE_DL_DLV_CB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

dfs_tfm = tfm()
tfm.logs
Warning: 3058 missing time value(s) in SEAWATER
['Remove 約 (about) char',
 "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean",
 'Select columns of interest.',
 '\n    Get TEPCO nuclide names as values not column names \n    to extract contained information (nuclide name, unc, dl, ...).\n    ',
 'Extract nuclide name from TEPCO data.',
 'Extract unit from TEPCO data.',
 'Extract value type from TEPCO data.',
 'Reshape: long to wide',
 '\n    Remap `UNIT` name to MARIS id.\n    ',
 'Remap `NUCLIDE` name to MARIS id.',
 'Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules.',
 'Parse time column from TEPCO.',
 'Encode time as seconds since epoch.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.']
dfs_tfm['SEAWATER'].sample(10)
type SMP_ID LON LAT TIME STATION NUCLIDE UNIT DL UNC VALUE DLV
88358 88358 141.583333 38.633333 1665132900 T-MG0 33 3 1 NaN 0.0015 NaN
60946 60946 141.046667 37.423333 1661759100 T-0-2 31 3 2 NaN 0.2800 0.28
68188 68188 141.062500 37.552778 1353316500 T-14 31 3 1 NaN 0.0240 NaN
78004 78004 141.083333 37.750000 1324968900 T-MA 31 3 2 NaN 0.9300 9.3E-01
45108 45108 141.034444 37.431111 1630914900 T-1 33 3 2 NaN 0.7500 0.75
63506 63506 141.046667 37.430556 1661151900 T-0-1A 103 3 1 NaN 17.0000 NaN
55903 55903 141.040556 37.478889 1720519500 T-6 1 3 1 NaN 0.4300 NaN
54473 54473 141.040556 37.478889 1438677000 T-6 103 3 2 NaN 16.0000 1.6E+01
77300 77300 141.082500 37.428611 1513144020 T-S4 33 3 1 NaN 0.0042 NaN
70225 70225 141.072167 37.333333 1435657200 T-D9 33 3 1 NaN 0.0072 NaN

source

get_attrs

 get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans >
            Ocean Chemistry> Radionuclides', 'Earth Science > Human
            Dimensions > Environmental Impacts > Nuclear Radiation
            Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean
            Tracers, Earth Science > Oceans > Marine Sediments', 'Earth
            Science > Oceans > Ocean Chemistry, Earth Science > Oceans >
            Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality >
            Ocean Contaminants', 'Earth Science > Biological
            Classification > Animals/Vertebrates > Fish', 'Earth Science >
            Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science >
            Biological Classification > Animals/Invertebrates > Mollusks',
            'Earth Science > Biological Classification >
            Animals/Invertebrates > Arthropods > Crustaceans', 'Earth
            Science > Biological Classification > Plants > Macroalgae
            (Seaweeds)'])

Retrieve global attributes from MARIS dump.

Exported source
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
Exported source
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw)
{'geospatial_lat_min': '141.66666667',
 'geospatial_lat_max': '38.63333333',
 'geospatial_lon_min': '140.60388889',
 'geospatial_lon_max': '35.79611111',
 'geospatial_bounds': 'POLYGON ((140.60388889 35.79611111, 141.66666667 35.79611111, 141.66666667 38.63333333, 140.60388889 38.63333333, 140.60388889 35.79611111))',
 'time_coverage_start': '2011-03-21T14:30:00',
 'time_coverage_end': '2025-01-25T07:24:00',
 'id': 'JEV6HP5A',
 'title': "Readings of Sea Area Monitoring - Monitoring of sea water - Sea area close to TEPCO's Fukushima Daiichi NPS / Coastal area - Readings of Sea Area Monitoring [TEPCO]",
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "TEPCO - Tokyo Electric Power Company"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Remove 約 (about) char, Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean, Select columns of interest., \n    Get TEPCO nuclide names as values not column names \n    to extract contained information (nuclide name, unc, dl, ...).\n    , Extract nuclide name from TEPCO data., Extract unit from TEPCO data., Extract value type from TEPCO data., Reshape: long to wide, \n    Remap `UNIT` name to MARIS id.\n    , Remap `NUCLIDE` name to MARIS id., Remap `DL`, `DLV`, `VALUE` based on TEPCO -> MARIS rules., Parse time column from TEPCO., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}
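
The bounding-box attributes above can be cross-checked by hand. The following is a minimal sketch, not the `BboxCB` implementation: `bbox_attrs` is a hypothetical helper, written in plain Python so it runs without marisco, and the two corner points are taken from the `geospatial_bounds` polygon printed above. WKT lists longitude before latitude, so that polygon implies longitudes of roughly 140.60 to 141.67 and latitudes of roughly 35.80 to 38.63; the printed values for `geospatial_lat_min` and `geospatial_lon_max` do not match that reading and look swapped.

```python
def bbox_attrs(points):
    """Build ACDD-style bounding-box attributes from (lon, lat) pairs.

    Hypothetical helper for illustration only; not part of marisco.
    """
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    lon_min, lon_max = min(lons), max(lons)
    lat_min, lat_max = min(lats), max(lats)
    # WKT uses "x y" order, i.e. longitude first, then latitude.
    ring = [
        (lon_min, lat_min),
        (lon_max, lat_min),
        (lon_max, lat_max),
        (lon_min, lat_max),
        (lon_min, lat_min),  # close the ring
    ]
    wkt = "POLYGON ((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"
    return {
        "geospatial_lat_min": str(lat_min),
        "geospatial_lat_max": str(lat_max),
        "geospatial_lon_min": str(lon_min),
        "geospatial_lon_max": str(lon_max),
        "geospatial_bounds": wkt,
    }


# Opposite corners of the polygon printed above, as (lon, lat).
points = [(140.60388889, 35.79611111), (141.66666667, 38.63333333)]
attrs = bbox_attrs(points)
print(attrs["geospatial_bounds"])
```

Keeping the (lon, lat) order explicit in one place makes it easier to spot a latitude key that carries a longitude value.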

source

encode

 encode (fname_out:str, **kwargs)

Encode TEPCO data to NetCDF.

           Type         Details
fname_out  str          Path of the NetCDF output file
kwargs     VAR_KEYWORD  Additional keyword arguments
Exported source
def encode(
    fname_out: str, # Path of the NetCDF output file
    **kwargs # Additional keyword arguments
    ):
    "Encode TEPCO data to NetCDF."
    dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
    
    tfm = Transformer(dfs, cbs=[
        RemoveJapanaseCharCB(),
        FixRangeValueStringCB(),
        SelectColsOfInterestCB(common_coi, nuclides_pattern),
        WideToLongCB(),
        ExtractNuclideNameCB(),
        ExtractUnitCB(),
        ExtractValueTypeCB(),
        LongToWideCB(),
        RemapUnitNameCB(unit_mapping),
        RemapNuclideNameCB(nuclide_mapping),
        RemapVALUE_DL_DLV_CB(),
        ParseTimeCB(),
        EncodeTimeCB(),
        SanitizeLonLatCB()
    ])        
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out, 
                            global_attrs=get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw),
                            verbose=kwargs.get('verbose', False)
                            )
    encoder.encode()
encode(fname_out, verbose=False)
100%|██████████| 11/11 [00:04<00:00,  2.26it/s]
100%|██████████| 11/11 [00:04<00:00,  2.28it/s]
Warning: 3058 missing time value(s) in SEAWATER
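
The warning above reports missing time values in the SEAWATER group. As a minimal sketch of how such values could be counted while encoding time as seconds since epoch, the hypothetical `encode_time` helper below uses only the standard library. It is not the `EncodeTimeCB` implementation, and treating the timestamps as UTC is an assumption made for illustration.

```python
from datetime import datetime, timezone


def encode_time(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert timestamp strings to integer seconds since the Unix epoch.

    Hypothetical helper: missing or unparseable entries become None
    and are counted, so the caller can report them.
    """
    encoded, n_missing = [], 0
    for v in values:
        try:
            # Assumption: timestamps are interpreted as UTC.
            dt = datetime.strptime(v, fmt).replace(tzinfo=timezone.utc)
            encoded.append(int(dt.timestamp()))
        except (TypeError, ValueError):
            # TypeError for None, ValueError for a string in another format.
            encoded.append(None)
            n_missing += 1
    return encoded, n_missing


times = ["1970-01-01 00:00:00", None, "2011-03-21 14:30:00"]
encoded, n_missing = encode_time(times)
if n_missing:
    print(f"Warning: {n_missing} missing time value(s)")
```

Counting the missing entries instead of dropping them silently keeps the amount of affected data visible in the processing logs.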
decode(fname_in=fname_out, verbose=True)
Saved SEAWATER to ../../_data/output/tepco_SEAWATER.csv
df_output = pd.read_csv("../../_data/output/tepco_SEAWATER.csv")
df_output.head()
   longitude  latitude  begperiod            station  samplabcode  nuclide_id  activity  unit_id  uncertaint  detection  detection_val  samptype_id  ref_id
0  140.60388  36.29972  2011-10-13 13:21:00  T-C      5981         29          4.0       3        NaN         <          4.0            1            679
1  140.60388  36.29972  2011-10-13 13:21:00  T-C      5982         31          6.0       3        NaN         <          6.0            1            679
2  140.60388  36.29972  2011-10-13 13:21:00  T-C      5983         33          9.0       3        NaN         <          9.0            1            679
3  140.60388  36.29972  2011-10-13 13:23:00  T-C      5984         29          4.0       3        NaN         <          4.0            1            679
4  140.60388  36.29972  2011-10-13 13:23:00  T-C      5985         31          6.0       3        NaN         <          6.0            1            679
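
In the rows shown above, `detection` is `<` and `activity` equals `detection_val`, i.e. the reported activity is the detection limit itself. The following minimal sketch shows one way to isolate such rows. It uses an inline copy of a subset of those five rows so that it runs without the output file; the variable names are illustrative only.

```python
import csv
import io

# Inline copy of the first rows shown above (subset of columns).
sample = """station,begperiod,nuclide_id,activity,detection,detection_val
T-C,2011-10-13 13:21:00,29,4.0,<,4.0
T-C,2011-10-13 13:21:00,31,6.0,<,6.0
T-C,2011-10-13 13:21:00,33,9.0,<,9.0
T-C,2011-10-13 13:23:00,29,4.0,<,4.0
T-C,2011-10-13 13:23:00,31,6.0,<,6.0
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Rows flagged '<' are reported as below the detection limit.
below_dl = [r for r in rows if r["detection"] == "<"]

# Check that, for those rows, the activity equals the detection limit value.
consistent = all(
    float(r["activity"]) == float(r["detection_val"]) for r in below_dl
)

print(len(below_dl), consistent)
```

The same filter could be applied to the full `tepco_SEAWATER.csv` file, for instance to exclude below-detection-limit rows before computing statistics on measured activities.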