Data pipeline (handler) to convert TEPCO dataset (Source) to NetCDF format

Configuration & file paths

Exported source
fname_coastal_water = 'https://radioactivity.nra.go.jp/cont/en/results/sea/coastal_water.csv'
fname_clos1F = 'https://radioactivity.nra.go.jp/cont/en/results/sea/close1F_water.xlsx'
fname_iaea_orbs = 'https://raw.githubusercontent.com/RML-IAEA/iaea.orbs/refs/heads/main/src/iaea/orbs/stations/station_points.csv'

fname_out = '../../_data/output/tepco.nc'

Load data

We load the data published on the NRA (Nuclear Regulation Authority) website. For now, we only process radioactivity concentrations measured in the seawater around the Fukushima Dai-ichi NPP [TEPCO], i.e. the coastal_water.csv and close1F_water.xlsx files.

In the near future, MARIS will provide a dedicated handler for all ALPS-related data, including measurements provided not only by TEPCO but also by MOE, NRA, MLITT and Fukushima Prefecture.

FEEDBACK TO DATA PROVIDER

The coastal_water.csv file contains two sections: the measurements and the locations. We identify below the row index at which the locations section begins. A single source of truth for the station locations would simplify future processing.


source

find_location_section

 find_location_section (df, col_idx=0, pattern='Sampling point number')

Find the line number where location data begins.

Exported source
def find_location_section(df, 
                          col_idx=0,
                          pattern='Sampling point number'
                          ):
    "Find the line number where location data begins."
    mask = df.iloc[:, col_idx] == pattern
    indices = df[mask].index
    return indices[0] if len(indices) > 0 else -1
find_location_section(pd.read_csv(fname_coastal_water, low_memory=False))
27844
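As an illustration of the two-section layout, here is a minimal sketch on a hypothetical in-memory file (the column names, values and the `promote_header` helper are assumptions made for the example, not the actual NRA layout). Unlike the handler above, it reads the file once without a header and splits it at the second occurrence of the marker:

```python
import io
import pandas as pd

# Hypothetical miniature of a two-section file: measurements first,
# then a second header row introducing the station coordinates.
raw = io.StringIO(
    "Sampling point number,Sampling date,137Cs\n"
    "T-1,2024/12/9,0.5\n"
    "T-2,2024/12/9,ND\n"
    "Sampling point number,Latitude,Longitude\n"
    "T-1,37.43,141.03\n"
    "T-2,37.42,141.03\n"
)
df_raw = pd.read_csv(raw, header=None, dtype=str)

# Rows whose first cell equals the marker: the first one is the top header,
# the second one opens the locations section.
marker_rows = df_raw.index[df_raw.iloc[:, 0] == "Sampling point number"]
split_at = marker_rows[1]

def promote_header(df):
    "Use the first row as column names and drop it from the data."
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out

toy_meas = promote_header(df_raw.iloc[:split_at])
toy_locs = promote_header(df_raw.iloc[split_at:])
```

Reading the file once avoids the second `pd.read_csv` call, at the cost of handling the header rows by hand.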
FEEDBACK TO DATA PROVIDER

The sampling time must be parsed differently for the coastal_water.csv and close1F_water.xlsx files. Indeed:

  • coastal_water.csv stores the sampling date as YYYY/MM/DD and the sampling time as HH:MM, while
  • close1F_water.xlsx stores the sampling date as YYYY-MM-DD HH:MM:SS.

source

fix_sampling_time

 fix_sampling_time (x)
Exported source
def fix_sampling_time(x):
    # Normalize `H:MM` or `HH:MM` to `HH:MM:00`; missing times become midnight
    if pd.isna(x): 
        return '00:00:00'
    hour, minute = x.split(':')[:2]
    return f"{hour.zfill(2)}:{minute}:00"
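To make the two conventions concrete, the following sketch normalizes both styles into a Python `datetime`. The `to_timestamp` helper and the sample strings are hypothetical and only illustrate the idea; the handler itself keeps `TIME` as a string at this stage:

```python
from datetime import datetime

def to_timestamp(date_str, time_str):
    "Hypothetical helper: combine a date and an H:MM[:SS] time into a datetime."
    # Keep only the date part and use '/' as the single separator
    date_str = date_str.split(" ")[0].replace("-", "/")
    hour, minute = time_str.split(":")[:2]
    return datetime.strptime(f"{date_str} {int(hour):02d}:{minute}", "%Y/%m/%d %H:%M")

a = to_timestamp("2024/12/9", "7:51")                # coastal_water.csv style
b = to_timestamp("2013-08-14 00:00:00", "08:17:00")  # close1F_water.xlsx style
```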

source

get_coastal_water_df

 get_coastal_water_df (fname_coastal_water)

Get the measurements dataframe from the coastal_water.csv file.

Exported source
def get_coastal_water_df(fname_coastal_water):
    "Get the measurements dataframe from the `coastal_water.csv` file."
    
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=1, 
                     nrows=locs_idx - 1,
                     low_memory=False)
    df.dropna(subset=['Sampling point number'], inplace=True)
    df['Sampling time'] = df['Sampling time'].map(fix_sampling_time)
    
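    # `.str.replace` substitutes inside each string; `Series.replace` would only match whole values
    df['TIME'] = df['Sampling date'].str.replace('-', '/', regex=False) + ' ' + df['Sampling time']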
    
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_coastal_water = get_coastal_water_df(fname_coastal_water)
df_coastal_water.tail()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
27836 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.9E+00 NaN NaN NaN NaN NaN 2024/12/9 07:51:00
27837 T-S8 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.9E+00 NaN NaN NaN NaN NaN 2024/12/9 10:07:00
27838 T-S3 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.9E+00 NaN NaN NaN NaN NaN 2024/12/11 10:37:00
27839 T-S4 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.8E+00 NaN NaN NaN NaN NaN 2024/12/11 11:06:00
27840 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 8.6E+00 NaN NaN NaN NaN NaN 2024/12/16 07:24:00

5 rows × 49 columns

FEEDBACK TO DATA PROVIDER

Identifying the station locations requires three distinct sources:

  • the second section of the coastal_water.csv file
  • the R6zahyo.pdf file further processed by https://github.com/RML-IAEA/iaea.orbs
  • the second section of every sheet of the close1F_water.xlsx file

All of these files and sheets are required to look up the location of every station.


source

get_locs_coastal_water

 get_locs_coastal_water (fname_coastal_water)
Exported source
def get_locs_coastal_water(fname_coastal_water):
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=locs_idx+1, 
                     low_memory=False).iloc[:, :3]
    
    # The source lists latitude before longitude (e.g. 37.42, 141.04 for T-0)
    df.columns = ['station', 'LAT', 'LON']
    df.dropna(subset=['LAT', 'LON'], inplace=True)
    df['org'] = 'coastal_seawater.csv'
    return df[['station', 'LON', 'LAT', 'org']]
df_locs_coastal_water = get_locs_coastal_water(fname_coastal_water)
df_locs_coastal_water.head()
station LON LAT org
0 T-0 141.04 37.42 coastal_seawater.csv
1 T-11 141.05 37.24 coastal_seawater.csv
2 T-12 141.04 37.15 coastal_seawater.csv
3 T-13-1 141.04 37.64 coastal_seawater.csv
4 T-14 141.06 37.55 coastal_seawater.csv
df_locs_coastal_water['station'].unique()
array(['T-0', 'T-11', 'T-12', 'T-13-1', 'T-14', 'T-17-1', 'T-18', 'T-20',
       'T-22', 'T-3', 'T-4', 'T-4-1', 'T-4-2', 'T-5', 'T-6', 'T-7', 'T-A',
       'T-B', 'T-B1', 'T-B2', 'T-B3', 'T-B4', 'T-C', 'T-D', 'T-D1',
       'T-D5', 'T-D9', 'T-E', 'T-E1', 'T-Z', 'T-MG6', 'T-S1', 'T-S7',
       'T-H1', 'T-S2', 'T-S6', 'T-M10', 'T-MA', 'T-S3', 'T-S4', 'T-S8',
       'T-MG4', 'T-G4', 'T-MG5', 'T-MG1', 'T-MG0', 'T-MG3', 'T-MG2'],
      dtype=object)
FEEDBACK TO DATA PROVIDER

The data in the close1F_water.xlsx file are spread across several sheets (one per station). Each sheet further contains two sections: the measurements and the locations.

For each sheet, we have to identify the row at which the measurements end and the locations begin. We then iterate over all sheets and concatenate the results.


source

get_clos1F_df

 get_clos1F_df (fname_clos1F)

Get measurements dataframe from close1F_water.xlsx file and parse datetime.

Exported source
def get_clos1F_df(fname_clos1F):
    "Get measurements dataframe from close1F_water.xlsx file and parse datetime."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                   sheet_name=sheet_name, 
                   skiprows=1,
                   nrows=locs_idx-1)
        
        df.dropna(subset=['Sampling point number'], inplace=True)
        df['Sampling date'] = df['Sampling date']\
            .astype(str)\
            .apply(lambda x: x.split(' ')[0]\
            .replace('-', '/'))
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True)
    df.dropna(subset=['Sampling date'], inplace=True)
    df['TIME'] = df['Sampling date'] + ' ' + df['Sampling time'].astype(str)
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_clos1F = get_clos1F_df(fname_clos1F); df_clos1F.head()
100%|██████████| 11/11 [00:06<00:00,  1.80it/s]
Sampling point number 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) Total beta radioactivity concentration (Bq/L) Total beta detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) Collection layer of seawater ... 106Ru detection limit (Bq/L) 60Co radioactivity concentration (Bq/L) 60Co detection limit (Bq/L) 95Zr radioactivity concentration (Bq/L) 95Zr detection limit (Bq/L) 99Mo radioactivity concentration (Bq/L) 99Mo detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) TIME
0 T-0-1 ND 1.5 ND 1.4 ND 18.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN 4.7 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 ND 1.1 ND 1.4 ND 20.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN ND 2.9 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 ND 0.66 ND 0.49 ND 17.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 57 columns

df_clos1F['Sampling point number'].unique()
array(['T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)

source

get_locs_clos1F

 get_locs_clos1F (fname_clos1F)

Get locations dataframe from each sheet of the close1F_water.xlsx file.

Exported source
def get_locs_clos1F(fname_clos1F):
    "Get locations dataframe from each sheet of the close1F_water.xlsx file."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                           sheet_name=sheet_name, 
                           skiprows=locs_idx+2)
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True).iloc[:, :3]
    df.dropna(subset=['Sampling coordinate North latitude (Decimal)'], inplace=True)    
    # The source lists latitude before longitude (e.g. 37.43, 141.04 for T-0-1)
    df.columns = ['station', 'LAT', 'LON']
    df['org'] = 'close1F.csv'
    return df[['station', 'LON', 'LAT', 'org']]
df_locs_clos1F = get_locs_clos1F(fname_clos1F)
df_locs_clos1F.head()
100%|██████████| 11/11 [00:06<00:00,  1.80it/s]
station LON LAT org
0 T-0-1 141.04 37.43 close1F.csv
11 T-0-1A 141.05 37.43 close1F.csv
22 T-0-2 141.05 37.42 close1F.csv
33 T-0-3 141.04 37.42 close1F.csv
44 T-0-3A 141.05 37.42 close1F.csv
FEEDBACK TO DATA PROVIDER

In theory all locations are supposed to be provided in the R6zahyo.pdf file. This file is further processed by https://github.com/RML-IAEA/iaea.orbs and the result is provided in the station_points.csv file.

However, this file does not contain all the locations referred to in the coastal_water.csv and close1F_water.xlsx files.


source

get_locs_orbs

 get_locs_orbs (fname_iaea_orbs)
Exported source
def get_locs_orbs(fname_iaea_orbs):
    df = pd.read_csv(fname_iaea_orbs)
    df.columns = ['org', 'station', 'LON', 'LAT']
    return df
df_locs_orbs = get_locs_orbs(fname_iaea_orbs)
df_locs_orbs.head()
org station LON LAT
0 MOE E-31 141.727667 39.059167
1 MOE E-32 141.635667 38.996000
2 MOE E-37 141.948611 39.259167
3 MOE E-38 141.755000 39.008333
4 MOE E-39 141.766667 38.991667

source

concat_locs

 concat_locs (dfs)

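Concatenate location dataframes and drop duplicated stations, keeping the iaea.orbs entry first, then coastal_water.csv, then close1F_water.xlsx

Exported source
def concat_locs(dfs):
    "Concatenate location dataframes and drop duplicated stations, keeping the iaea.orbs entry first, then coastal_water.csv, then close1F_water.xlsx"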
    df = pd.concat(dfs)
    # Group by org to be used for sorting
    df['org_grp'] = df['org'].apply(
        lambda x: 1 if x == 'coastal_seawater.csv' else 2 if x == 'close1F.csv' else 0)
    df.sort_values('org_grp', ascending=True, inplace=True)
    # Drop duplicates and keep orbs data first
    df.drop_duplicates(subset='station', keep='first', inplace=True)
    df.drop(columns=['org_grp'], inplace=True)
    df.sort_values('station', ascending=True, inplace=True)
    return df
# df_locs = concat_locs(df_locs_coastal_water, df_locs_orbs)
df_locs = concat_locs([df_locs_clos1F, df_locs_coastal_water, df_locs_orbs])
df_locs.head()
station LON LAT org
214 C-P1 139.863333 35.425000 NRA
215 C-P2 139.863333 35.401667 NRA
216 C-P3 139.881667 35.370000 NRA
217 C-P4 139.846667 35.356667 NRA
218 C-P5 139.800000 35.343333 NRA
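The precedence rule used when a station appears in several sources can be sketched on a toy frame as follows. The close1F and T-9 coordinates and the `prio` column are hypothetical; any `org` value not listed in `priority` is treated as coming from iaea.orbs and wins:

```python
import pandas as pd

toy = pd.DataFrame({
    "station": ["T-1", "T-1", "T-9"],
    "LON": [141.03, 141.034444, 141.05],
    "LAT": [37.43, 37.431111, 37.40],
    "org": ["close1F.csv", "TEPCO", "coastal_seawater.csv"],
})

# Lower rank wins; sources absent from the mapping get rank 0
priority = {"coastal_seawater.csv": 1, "close1F.csv": 2}

toy_dedup = (
    toy.assign(prio=toy["org"].map(priority).fillna(0))
       .sort_values("prio", kind="stable")
       .drop_duplicates(subset="station", keep="first")
       .drop(columns="prio")
       .sort_values("station")
       .reset_index(drop=True)
)
```

Here `T-1` keeps its `TEPCO` (iaea.orbs) coordinates, while `T-9`, which is only known from one file, is kept as is.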

source

align_dfs

 align_dfs (df_from, df_to)

Align columns structure of df_from to df_to.

Exported source
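def align_dfs(df_from, df_to):
    "Align columns structure of df_from to df_to."
    # Columns missing from `df_from` are filled with NaN
    # (`np.nan` is used because the `np.NAN` alias was removed in NumPy 2.0)
    data = {}
    for c in df_to.columns:
        data[c] = df_from[c].values if c in df_from.columns else np.nan
    return pd.DataFrame(data)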
align_dfs(df_clos1F, df_coastal_water).head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-0-1 NaN NaN NaN ND 1.5 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 4.7 NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 NaN NaN NaN ND 1.1 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 2.9 NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 NaN NaN NaN ND 0.66 ND 0.49 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 49 columns
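For reference, the same column alignment can be sketched with pandas' built-in `DataFrame.reindex`, shown here on hypothetical toy frames. This is an alternative formulation, not the handler's implementation:

```python
import pandas as pd

src = pd.DataFrame({"station": ["T-1"], "3H": [4.7], "extra": ["x"]})
template = pd.DataFrame(columns=["station", "137Cs", "3H"])

# Keep only the template's columns, in its order; missing ones are filled with NaN
aligned = src.reindex(columns=template.columns)
```

Columns that exist only in `src` (here `extra`) are dropped, which mirrors how columns specific to close1F_water.xlsx disappear in the output above.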


source

concat_dfs

 concat_dfs (df_coastal_water, df_clos1F)

Align close1F_water.xlsx measurements to the coastal_water.csv columns and concatenate both

Exported source
def concat_dfs(df_coastal_water, df_clos1F):
    "Align close1F_water.xlsx measurements to the coastal_water.csv columns and concatenate both"
    df_clos1F = align_dfs(df_clos1F, df_coastal_water)
    df = pd.concat([df_coastal_water, df_clos1F])
    return df
df_meas = concat_dfs(df_coastal_water, df_clos1F)
df_meas.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:15:00
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:45:00
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 14:28:00
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 15:06:00
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN NaN NaN NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00

5 rows × 49 columns


source

georef_data

 georef_data (df_meas, df_locs)

Georeference measurements dataframe using locations dataframe.

Exported source
def georef_data(df_meas, df_locs):
    "Georeference measurements dataframe using locations dataframe."
    assert "Sampling point number" in df_meas.columns and "station" in df_locs.columns
    return pd.merge(df_meas, df_locs, how="inner", 
                    left_on='Sampling point number', right_on='station')
df_meas_georef = georef_data(df_meas, df_locs)
df_meas_georef.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns
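Because the merge is an inner join, measurements whose station has no known location are silently dropped. A minimal sketch of how such stations could be listed beforehand, using hypothetical toy frames and the `indicator` option of `pd.merge`:

```python
import pandas as pd

toy_meas = pd.DataFrame({
    "Sampling point number": ["T-1", "T-2", "T-X"],
    "value": [1.0, 2.0, 3.0],
})
toy_locs = pd.DataFrame({
    "station": ["T-1", "T-2"],
    "LON": [141.03, 141.03],
    "LAT": [37.43, 37.42],
})

# A left join keeps every measurement; `_merge` tells whether a location was found
checked = pd.merge(toy_meas, toy_locs, how="left",
                   left_on="Sampling point number", right_on="station",
                   indicator=True)
unmatched = (checked.loc[checked["_merge"] == "left_only", "Sampling point number"]
             .unique().tolist())
```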


source

load_data

 load_data (fname_coastal_water, fname_clos1F, fname_iaea_orbs)

Load, align and georeference TEPCO data

Exported source
def load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs):
    "Load, align and georeference TEPCO data"
    df_locs = concat_locs(
        [get_locs_coastal_water(fname_coastal_water), 
         get_locs_clos1F(fname_clos1F),
         get_locs_orbs(fname_iaea_orbs)])
    df_meas = concat_dfs(get_coastal_water_df(fname_coastal_water), get_clos1F_df(fname_clos1F))
    df_meas.dropna(subset=['Sampling point number'], inplace=True)
    return {'SEAWATER': georef_data(df_meas, df_locs)}
dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
dfs['SEAWATER'].head()
100%|██████████| 11/11 [00:06<00:00,  1.79it/s]
100%|██████████| 11/11 [00:06<00:00,  1.81it/s]
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

print(f"# of rows, cols: {dfs['SEAWATER'].shape}")
dfs['SEAWATER'].head()
# of rows, cols: (47148, 53)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

dfs['SEAWATER']['Sampling point number'].unique()
array(['T-3', 'T-4', 'T-5', 'T-7', 'T-11', 'T-12', 'T-14', 'T-18', 'T-20',
       'T-22', 'T-MA', 'T-M10', 'T-A', 'T-D', 'T-E', 'T-B', 'T-C',
       'T-MG1', 'T-MG2', 'T-MG3', 'T-MG4', 'T-MG5', 'T-MG6', 'T-D1',
       'T-D5', 'T-D9', 'T-E1', 'T-G4', 'T-H1', 'T-S5', 'T-S6', 'T-17-1',
       'T-B3', 'T-13-1', 'T-S3', 'T-S4', 'T-B4', 'T-S1', 'T-S2', 'T-MG0',
       'T-Z', 'T-B1', 'T-B2', 'T-S7', 'T-S8', 'T-0', 'T-4-1', 'T-4-2',
       'T-6', 'T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)

Fix missing values

FEEDBACK TO DATA PROVIDER

We remap ND (not detected) values to NaN. Please confirm that this is the correct way to handle them.


source

FixMissingValuesCB

 FixMissingValuesCB ()

Assign NaN to values equal to ND (not detected) - to be confirmed

Exported source
class FixMissingValuesCB(Callback):
    "Assign `NaN` to values equal to `ND` (not detected) - to be confirmed "
    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            predicate = tfm.dfs[k] == 'ND'
            tfm.dfs[k][predicate] = np.nan
tfm = Transformer(dfs, cbs=[FixMissingValuesCB()])
tfm()['SEAWATER'].head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 NaN 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 NaN 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns
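Replacing ND with NaN discards the fact that a measurement was made but fell below the detection limit. One possible way to keep that information, sketched on a hypothetical toy frame (the `below_dl` column is an assumption and is not part of the handler):

```python
import numpy as np
import pandas as pd

value_col = "137Cs radioactivity concentration (Bq/L)"
toy = pd.DataFrame({
    value_col: ["ND", "5.3E+01"],
    "137Cs detection limit (Bq/L)": ["1.4", "8.8E+00"],
})

# Hypothetical flag column recording which values were reported as ND
toy["below_dl"] = toy[value_col] == "ND"

# Same remapping as the callback: ND becomes NaN, other values are kept
toy[value_col] = toy[value_col].mask(toy["below_dl"], np.nan).astype(float)
```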

Remove 約 (about) character

FEEDBACK TO DATA PROVIDER

We systematically remove the 約 (about) character. Please confirm that this is the correct way to handle it. Reporting an explicit uncertainty would be less ambiguous in future.


source

RemoveJapanaseCharCB

 RemoveJapanaseCharCB ()

Remove 約 (about) char

Exported source
class RemoveJapanaseCharCB(Callback):
    "Remove 約 (about) char"
    def _transform_if_about(self, value, about_char='約'):
        if pd.isna(value): return value
        return (value.replace(about_char, '') if str(value).count(about_char) != 0 
                else value)
    
    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_about)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB()])

tfm()['SEAWATER'].sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
29536 T-0-2 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2021/07/26 06:20:00 T-0-2 141.046667 37.423333 TEPCO
35606 T-1 上層 NaN 0.6 NaN 0.72 NaN 0.52 NaN NaN ... NaN NaN NaN NaN NaN 2017/07/29 07:10:00 T-1 141.034444 37.431111 TEPCO
18648 T-MG1 上層 NaN NaN NaN 1.3E-03 3.2E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2020/10/7 10:03:00 T-MG1 141.283333 38.333333 TEPCO
10357 T-11 上層 NaN NaN 1.9E-03 NaN 1.2E-02 NaN NaN NaN ... NaN NaN NaN NaN NaN 2015/12/7 10:16:00 T-11 141.047222 37.241667 TEPCO
20208 T-14 上層 NaN NaN NaN 1.4E-03 6.0E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2021/9/13 07:46:00 T-14 141.062500 37.552778 TEPCO
32077 T-0-3A NaN NaN NaN NaN 0.7 NaN 0.6 NaN NaN ... NaN NaN NaN NaN NaN 2020/01/06 07:02:00 T-0-3A 141.046667 37.416111 TEPCO
42671 T-2 上層 NaN NaN NaN 0.47 NaN 0.56 NaN NaN ... NaN NaN NaN NaN NaN 2022/02/27 09:15:00 T-2 141.033611 37.415833 TEPCO
32875 T-1 上層 NaN 7 24 NaN 22 NaN NaN NaN ... NaN NaN NaN NaN NaN 2011/06/10 09:20:00 T-1 141.034444 37.431111 TEPCO
34009 T-1 上層 NaN 0.82 NaN 0.66 1.1 NaN NaN NaN ... NaN NaN NaN NaN NaN 2014/02/18 07:50:00 T-1 141.034444 37.431111 TEPCO
43715 T-2 上層 NaN NaN NaN 0.55 NaN 0.7 NaN NaN ... NaN NaN NaN NaN NaN 2024/02/25 06:56:00 T-2 141.033611 37.415833 TEPCO

10 rows × 53 columns
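The effect of this step on a single column can be sketched as follows, on hypothetical values. Only strings are touched, so numeric cells and missing values in a mixed `object` column pass through unchanged:

```python
import pandas as pd

toy = pd.Series(["約1.2E+01", "3.4", 5.6, None], dtype="object")

# Strip the 約 prefix from strings only; leave other cell types as they are
cleaned = toy.map(lambda v: v.replace("約", "") if isinstance(v, str) else v)

# Once the prefix is gone the column can be converted to numbers
numeric = pd.to_numeric(cleaned)
```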

Fix values range string

FEEDBACK TO DATA PROVIDER

Value ranges are provided as strings (e.g. ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’). We replace each range with the mean of its two bounds. Please confirm that this is the correct way to handle them. Again, reporting an explicit uncertainty would be less ambiguous in future.


source

FixRangeValueStringCB

 FixRangeValueStringCB ()

Replace range values (e.g ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’) by their mean

Exported source
class FixRangeValueStringCB(Callback):
    "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean"
    
    def _extract_and_calculate_mean(self, s):
        # For scientific notation ranges
        float_strings = re.findall(r"[+-]?\d+\.?\d*E?[+-]?\d*", s)
        if float_strings:
            float_numbers = np.array(float_strings, dtype=float)
            return float_numbers.mean()
        return s
    
    def _transform_if_range(self, value):
        if pd.isna(value): 
            return value
        value = str(value)
        # Check for both range patterns
        if '<&<' in value or '~' in value:
            return self._extract_and_calculate_mean(value)
        return value

    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns 
                       if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range).astype(float)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
28705 T-0-1A NaN NaN NaN NaN 0.3300 NaN 0.31 NaN NaN ... NaN NaN NaN NaN NaN 2024/12/02 07:19:00 T-0-1A 141.046667 37.430556 TEPCO
41527 T-2 上層 NaN NaN NaN 0.6300 NaN 0.75 NaN NaN ... NaN NaN NaN NaN NaN 2019/12/17 06:50:00 T-2 141.033611 37.415833 TEPCO
40137 T-2 上層 NaN 0.55 NaN 0.6200 NaN 0.53 NaN NaN ... NaN NaN NaN NaN NaN 2017/02/26 07:10:00 T-2 141.033611 37.415833 TEPCO
46806 T-A3 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2022/12/12 07:33:00 T-A3 141.050739 37.409267 TEPCO
39560 T-2 上層 NaN 0.77 2.7000 NaN 3.5000 NaN NaN NaN ... NaN NaN NaN NaN NaN 2012/01/07 08:15:00 T-2 141.033611 37.415833 TEPCO
39204 T-2 NaN 180.0 NaN 530.0000 NaN 540.0000 NaN NaN NaN ... NaN NaN NaN NaN NaN 2011/04/18 08:40:00 T-2 141.033611 37.415833 TEPCO
35238 T-1 上層 NaN 0.71 NaN 0.5400 NaN 0.72 NaN NaN ... NaN NaN NaN NaN NaN 2016/10/17 07:40 T-1 141.034444 37.431111 TEPCO
15957 T-5 下層 NaN NaN NaN 0.0014 0.0032 NaN NaN NaN ... NaN NaN NaN NaN NaN 2019/3/13 08:14:00 T-5 141.200000 37.416667 TEPCO
1308 T-4 上層 NaN 8.00 NaN 21.0000 NaN 24.00 NaN NaN ... NaN NaN NaN NaN NaN 2011/8/30 08:00:00 T-4 141.013889 37.241667 TEPCO
18213 T-3 上層 NaN NaN 0.0016 NaN 0.0310 NaN NaN NaN ... NaN NaN NaN NaN NaN 2020/7/7 14:50:00 T-3 141.026389 37.322222 TEPCO

10 rows × 53 columns
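As a worked example of the range handling, the sketch below computes the mean of the bounds for the two patterns mentioned above. The `range_mean` helper and its regular expression are illustrative assumptions, written slightly more strictly than the pattern used in the callback:

```python
import re

# A number with an optional decimal part and an optional exponent
RANGE_NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")

def range_mean(text):
    "Hypothetical helper: mean of all numbers found in a range string."
    numbers = [float(m) for m in RANGE_NUMBER.findall(text)]
    return sum(numbers) / len(numbers)

mean_sci = range_mean("4.0E+00<&<8.0E+00")  # (4.0 + 8.0) / 2
mean_dec = range_mean("1.0~2.7")            # (1.0 + 2.7) / 2
```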

Select columns of interest

We select the columns of interest and in particular the elements of interest, in our case radionuclides.


source

SelectColsOfInterestCB

 SelectColsOfInterestCB (common_coi, nuclides_pattern)

Select columns of interest.

Exported source
common_coi = ['org', 'LON', 'LAT', 'TIME', 'station']
nuclides_pattern = '(Bq/L)'
Exported source
class SelectColsOfInterestCB(Callback):
    "Select columns of interest."
    def __init__(self, common_coi, nuclides_pattern): fc.store_attr()
    def __call__(self, tfm):
        # Use the stored attribute rather than the module-level `nuclides_pattern`
        nuc_of_interest = [c for c in tfm.dfs['SEAWATER'].columns if self.nuclides_pattern in c]
        tfm.dfs['SEAWATER'] = tfm.dfs['SEAWATER'][self.common_coi + nuc_of_interest]
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
org LON LAT TIME station 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) ... 144Ce radioactivity concentration (Bq/L) 144Ce detection limit (Bq/L) 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L)
27526 TEPCO 141.046667 37.430556 2015/10/12 08:10:00 T-0-1A NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 1.5 NaN NaN NaN NaN
11530 TEPCO 141.006944 37.055556 2016/7/21 08:01:00 T-17-1 NaN NaN 0.0019 NaN 0.0100 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32069 TEPCO 141.046667 37.416111 2019/12/09 07:00:00 T-0-3A NaN NaN NaN 0.7600 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16230 TEPCO 141.072222 37.416667 2019/5/8 08:43:00 T-D5 NaN NaN NaN 0.0012 0.0028 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
39178 TEPCO 141.033611 37.415833 2011/04/05 14:10:00 T-2 11000.0 42.0 5300.0000 39.0000 5400.0000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 50 columns

Reshape: wide to long

This step is necessary to extract information such as the nuclide name, detection limit and uncertainty from the column names.
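A minimal sketch of the idea on a hypothetical one-row frame: after melting, each original column name becomes a value of `variable`, from which the nuclide, the unit and the kind of quantity can be parsed. The `kind_of` helper is an assumption made for the example:

```python
import pandas as pd

wide = pd.DataFrame({
    "station": ["T-1"],
    "TIME": ["2013/08/14 08:17:00"],
    "137Cs radioactivity concentration (Bq/L)": [0.5],
    "137Cs detection limit (Bq/L)": [1.4],
})

# One row per (station, TIME, original column)
long_df = wide.melt(id_vars=["station", "TIME"])

def kind_of(column_name):
    "Hypothetical helper: classify a measurement column by its wording."
    if "radioactivity concentration" in column_name:
        return "VALUE"
    if "detection limit" in column_name:
        return "DL"
    return "UNC"

long_df["NUCLIDE"] = long_df["variable"].str.split(" ").str[0]
long_df["UNIT"] = long_df["variable"].str.extract(r"\((.*?)\)", expand=False)
long_df["type"] = long_df["variable"].map(kind_of)
```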


source

WideToLongCB

 WideToLongCB ()

Parse TEPCO measurement columns to extract nuclide name, measurement value, detection limit and uncertainty
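To make the header parsing concrete, here is a minimal standalone sketch that splits one column header into nuclide, quantity and unit. `parse_header` is a hypothetical helper written for this example, and the sample headers are illustrative; it only mirrors the word-splitting and regular-expression logic used by the callback.

```python
import re

def parse_header(header):
    "Split a measurement column header into (nuclide, quantity, unit)."
    words = header.split(' ')
    # 'Total alpha' / 'Total beta' span two words; other nuclides span one.
    n = 2 if len(words) >= 2 and words[1].lower() in ('alpha', 'beta') else 1
    nuclide = ' '.join(words[:n])
    # The unit is whatever sits between the first pair of parentheses.
    unit_match = re.search(r'\((.*?)\)', header)
    unit = unit_match.group(1) if unit_match else None
    quantity = ' '.join(words[n:]).replace(f'({unit})', '').strip()
    return nuclide, quantity, unit

print(parse_header('137Cs detection limit (Bq/L)'))
print(parse_header('Total beta radioactivity concentration (Bq/L)'))
```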

Exported source
class WideToLongCB(Callback):
    """
    Parse TEPCO measurement columns to extract nuclide name, measurement value, 
    detection limit and uncertainty
    """
    def __init__(self): fc.store_attr()
    
    
    def _melt(self, df):
        "Melt dataframe to long format."
        return df.melt(id_vars=['LON', 'LAT', 'TIME', 'station'])
        
    def _extract_nuclide(self, text):
        words = text.split(' ')
        # Handle special cases for alpha/beta
        if len(words) >= 2 and words[1].lower() in ['alpha', 'beta']:
            return f"{words[0]} {words[1]}"
        return words[0]
    
    def _nuclide_name(self, df):
        "Extract nuclide name from the melted column names."
        df['NUCLIDE'] = df['variable'].map(self._extract_nuclide)
        return df
    
    def _type_indicator(self, df):
        "Create type indicators."
        df['is_concentration'] = df['variable'].str.contains('radioactivity concentration')
        df['is_dl'] = df['variable'].str.contains('detection limit')
        df['is_unc'] = df['variable'].str.contains('statistical error')
        return df
    
    def _unit(self, df):
        "Extract unit (text in parentheses) from the melted column names."
        df['UNIT'] = df['variable'].str.extract(r'\((.*?)\)')
        return df
    
    def _type_column(self, df):
        "Create type column."
        conditions = [
            df['is_concentration'],
            df['is_dl'],
            df['is_unc']
        ]
        choices = ['VALUE', 'DL', 'UNC']
        df['type'] = np.select(conditions, choices)
        df = df.drop(['is_concentration', 'is_dl', 'is_unc'], axis=1)
        return df
    
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'] = self._melt(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._nuclide_name(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._type_indicator(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._unit(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._type_column(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = pd.pivot_table(
            tfm.dfs['SEAWATER'],
            values='value',
            index=['LON', 'LAT', 'TIME', 'station', 'NUCLIDE', 'UNIT'],
            columns='type',
            aggfunc='first'
        ).reset_index()
        # reset the index and rename it ID
        tfm.dfs['SEAWATER'].reset_index(inplace=True)
        tfm.dfs['SEAWATER'].rename(columns={'index': 'ID'}, inplace=True)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.head()
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
0 0 37.21 141.01 2012/10/16 07:25:00 T-4-1 131I Bq/L 0.13 NaN NaN
1 1 37.21 141.01 2012/10/16 07:25:00 T-4-1 134Cs Bq/L 0.19 NaN NaN
2 2 37.21 141.01 2012/10/16 07:25:00 T-4-1 137Cs Bq/L 0.27 NaN NaN
3 3 37.21 141.01 2012/10/2 07:30:00 T-4-1 131I Bq/L 0.11 NaN NaN
4 4 37.21 141.01 2012/10/2 07:30:00 T-4-1 134Cs Bq/L 0.22 NaN NaN

Remap UNIT name to MARIS nomenclature


source

RemapUnitNameCB

 RemapUnitNameCB (unit_mapping)

Remap UNIT name to MARIS id.
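One property of a dictionary-based remap worth keeping in mind: `Series.map` returns `NaN` for any key missing from the mapping, without raising. The sketch below shows a possible check for unmapped units; the `'Bq/kg'` entry is a made-up value used only to trigger the case.

```python
import pandas as pd

unit_mapping = {'Bq/L': 3}

# 'Bq/kg' is a hypothetical unit that is absent from the mapping.
units = pd.Series(['Bq/L', 'Bq/L', 'Bq/kg'])
unit_ids = units.map(unit_mapping)

# Unmapped entries become NaN; list them so they can be reviewed.
unmapped = units[unit_ids.isna()].unique().tolist()
print(unmapped)
```

The same check applies to any other lookup-based remapping in the pipeline.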

Exported source
unit_mapping = {'Bq/L': 3}
Exported source
class RemapUnitNameCB(Callback):
    """
    Remap `UNIT` name to MARIS id.
    """
    def __init__(self, unit_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['UNIT'] = tfm.dfs['SEAWATER']['UNIT'].map(self.unit_mapping)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
26226 26226 141.033611 37.415833 2019/12/07 06:50:00 T-2 134Cs 3 0.68 NaN NaN
42239 42239 141.034444 37.431111 2019/04/19 07:55:00 T-1 137Cs 3 0.59 NaN NaN
6965 6965 140.665556 36.506389 2016/2/18 08:15:00 T-B 137Cs 3 1.1 NaN NaN
37355 37355 141.034444 37.431111 2015/03/20 07:20:00 T-1 137Cs 3 0.57 NaN NaN
3462 3462 37.410000 141.030000 2015/03/29 05:30:00 T-2-1 131I 3 0.77 NaN NaN

Remap NUCLIDE name to MARIS nomenclature


source

RemapNuclideNameCB

 RemapNuclideNameCB (nuclide_mapping)

Remap NUCLIDE name to MARIS id.

Exported source
nuclide_mapping = {
    '131I': 29,
    '134Cs': 31,
    '137Cs': 33,
    '125Sb': 24,
    'Total beta': 103,
    '238Pu': 67,
    '239Pu+240Pu': 77,
    '3H': 1,
    '89Sr': 11,
    '90Sr': 12,
    'Total alpha': 104,
    '132I': 100,
    '136Cs': 102,
    '58Co': 8,
    '105Ru': 97,
    '106Ru': 17,
    '140La': 35,
    '140Ba': 34,
    '132Te': 99,
    '60Co': 9,
    '144Ce': 37,
    '54Mn': 6
}
Exported source
class RemapNuclideNameCB(Callback):
    """
    Remap `NUCLIDE` name to MARIS id.
    """
    def __init__(self, nuclide_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['NUCLIDE'] = tfm.dfs['SEAWATER']['NUCLIDE'].map(self.nuclide_mapping)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
64330 64330 141.047222 37.241667 2020/4/23 09:02:00 T-11 31 3 0.0013 NaN NaN
14273 14273 141.013889 37.241667 2014/5/27 16:10:00 T-4 31 3 NaN NaN 0.034
54131 54131 141.040556 37.478889 2020/11/4 10:20:00 T-6 1 3 NaN NaN 0.43
4012 4012 37.410000 141.030000 2015/08/13 04:25:00 T-2-1 33 3 0.53 NaN NaN
16214 16214 141.026389 37.322222 2011/4/3 09:35:00 T-3 29 3 15.0 NaN 280.0

Remap DL value to MARIS nomenclature

We remap the DL (Detection Limit) column to MARIS ids as follows:

  • if a DL value is reported, we assign 2 (Detection limit or ‘<’)
  • if no DL value is reported, we assign 1 (Detected value or ‘=’)
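The rule can be sketched on a toy series as follows (the values are made up for illustration):

```python
import pandas as pd

# Toy DL column: a reported limit, a missing limit, another reported limit.
dl = pd.Series([0.59, float('nan'), 15.0])

# 2 when a detection limit is reported, 1 otherwise.
dl_flag = dl.map(lambda v: 1 if pd.isna(v) else 2)
print(dl_flag.tolist())
```

Under this rule, a row that reports both a detection limit and a measured value is flagged 2, since only the presence of the DL value is tested.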

source

RemapDLCB

 RemapDLCB ()

Remap DL name to MARIS id.

Exported source
class RemapDLCB(Callback):
    """
    Remap `DL` name to MARIS id.
    """
    def __init__(self): fc.store_attr()
    def dl_mapping(self, value): return 1 if pd.isna(value) else 2
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER']['DL'] = tfm.dfs['SEAWATER']['DL'].map(self.dl_mapping)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(10)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
30183 30183 141.033611 37.415833 2023/03/21 06:40:00 T-2 33 3 2 NaN NaN
62836 62836 141.046667 37.430556 2023/12/25 07:25:00 T-0-1A 33 3 2 NaN NaN
29187 29187 141.033611 37.415833 2022/05/24 08:20:00 T-2 31 3 2 NaN NaN
39737 39737 141.034444 37.431111 2017/03/07 07:07:00 T-1 33 3 2 NaN NaN
76161 76161 141.082500 37.428611 2022/12/21 10:09:00 T-S4 31 3 2 NaN NaN
68140 68140 141.062500 37.552778 2022/2/22 07:31:00 T-14 33 3 1 NaN 0.0032
10487 10487 140.837222 35.796111 2022/7/15 12:37:00 T-E 31 3 2 NaN NaN
44378 44378 141.034444 37.431111 2021/10/22 08:05:00 T-1 31 3 2 NaN NaN
28852 28852 141.033611 37.415833 2022/02/12 09:00:00 T-2 33 3 2 NaN NaN
81902 81902 141.200000 37.416667 2023/5/22 07:18:00 T-5 1 3 2 NaN NaN

Parse & encode time


source

ParseTimeCB

 ParseTimeCB (time_name='TIME')

Parse time column from TEPCO.
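Since the two source files use different time formats, a parser restricted to a single format leaves the other file's timestamps missing. The following standard-library sketch tries both formats in turn and returns seconds since the epoch. `to_epoch_seconds` and `FORMATS` are hypothetical names, and treating the naive timestamps as UTC is an assumption of this example, not a statement about the source data.

```python
from datetime import datetime, timezone

# Formats reported for coastal_water.csv and close1F_water.xlsx respectively.
FORMATS = ('%Y/%m/%d %H:%M:%S', '%Y-%m-%d %H:%M:%S')

def to_epoch_seconds(text, formats=FORMATS):
    "Return seconds since 1970-01-01 for the first matching format, else None."
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Assumption: naive timestamps are interpreted as UTC.
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return None

print(to_epoch_seconds('2018/10/13 08:05:00'))
print(to_epoch_seconds('2018-10-13 08:05:00'))
print(to_epoch_seconds('not a date'))
```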

Exported source
class ParseTimeCB(Callback):
    "Parse time column from TEPCO."
    def __init__(self,
                 time_name='TIME'):
        fc.store_attr()
        
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.time_name] = pd.to_datetime(tfm.dfs['SEAWATER'][self.time_name], 
                                                             format='%Y/%m/%d %H:%M:%S', errors='coerce')
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB(),
    ParseTimeCB(),
    EncodeTimeCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
Warning: 3054 missing time value(s) in SEAWATER
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
41631 41631 141.034444 37.431111 1539417900 T-1 33 3 2 NaN NaN
52479 52479 141.040278 37.430556 1669014720 T-0-1 31 3 2 NaN NaN
77918 77918 141.133333 38.250000 1513332540 T-MG4 33 3 1 NaN 0.0025
65250 65250 141.050739 37.409267 1691997120 T-A3 1 3 2 NaN NaN
76386 76386 141.083333 37.000000 1331799300 T-M10 31 3 2 NaN NaN

Sanitize coordinates
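According to its log message, this step drops rows with invalid longitude and latitude values and converts `,` decimal separators to `.`. The sketch below shows one way such a step could work. `sanitize_lon_lat` is a hypothetical function written for this example, and the range check it uses is an assumption rather than the behaviour of `SanitizeLonLatCB`.

```python
import pandas as pd

def sanitize_lon_lat(df, lon='LON', lat='LAT'):
    "Convert ',' decimal separators and drop rows with out-of-range coordinates."
    out = df.copy()
    for col in (lon, lat):
        # Accept ',' as decimal separator; non-numeric entries become NaN.
        as_text = out[col].astype(str).str.replace(',', '.', regex=False)
        out[col] = pd.to_numeric(as_text, errors='coerce')
    # NaN never satisfies `between`, so unparsable rows are dropped as well.
    valid = out[lon].between(-180, 180) & out[lat].between(-90, 90)
    return out[valid].reset_index(drop=True)

# Toy rows: comma separators, swapped lon/lat, and a non-numeric entry.
df = pd.DataFrame({'LON': ['141,03', '37.41', 'n/a'],
                   'LAT': ['37,42', '141.03', '37.0']})
clean = sanitize_lon_lat(df)
print(clean)
```

With such a range check, a row whose latitude exceeds 90 (for example, swapped longitude and latitude) would be discarded rather than encoded.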

tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 3054 missing time value(s) in SEAWATER
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
71629 71629 141.072222 37.416667 1492680000 T-D5 12 3 1 NaN 0.00079
78746 78746 141.154167 37.407778 1574143320 T-B3 31 3 2 NaN NaN
62476 62476 141.046667 37.430556 1662966120 T-0-1A 31 3 2 NaN NaN
26047 26047 141.033611 37.415833 1570863300 T-2 103 3 1 NaN 10.0
85057 85057 141.583333 38.233333 1322471280 T-MG3 29 3 2 NaN NaN

Encode to NetCDF

tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

dfs_tfm = tfm()
tfm.logs
Warning: 3054 missing time value(s) in SEAWATER
['Assign `NaN` to values equal to `ND` (not detected) - to be confirmed ',
 'Remove 約 (about) char',
 "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean",
 'Select columns of interest.',
 '\n    Parse TEPCO measurement columns to extract nuclide name, measurement value, \n    detection limit and uncertainty\n    ',
 '\n    Remap `UNIT` name to MARIS id.\n    ',
 '\n    Remap `NUCLIDE` name to MARIS id.\n    ',
 '\n    Remap `DL` name to MARIS id.\n    ',
 'Parse time column from TEPCO.',
 'Encode time as seconds since epoch.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.']
dfs_tfm['SEAWATER'].sample(10)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
50903 50903 141.040278 37.430556 1426497540 T-0-1 31 3 2 NaN NaN
80089 80089 141.200000 37.416667 1387788780 T-5 104 3 2 NaN NaN
87638 87638 141.666667 38.300000 1405326000 T-MG2 33 3 1 NaN 0.0024
83708 83708 141.250000 38.166667 1623230100 T-MG5 31 3 2 NaN NaN
7671 7671 140.702222 35.987500 1334756580 T-D 33 3 2 NaN NaN
64007 64007 141.047222 37.241667 1487581680 T-11 33 3 1 NaN 0.0062
16230 16230 141.026389 37.322222 1302170100 T-3 31 3 2 NaN 980.0
82691 82691 141.233333 37.516667 1694499420 T-B2 1 3 1 NaN 0.12
75966 75966 141.082500 37.428611 1415857740 T-S4 33 3 1 NaN 0.0092
48042 48042 141.039444 37.265000 1722493080 T-S5 31 3 2 NaN NaN

source

get_attrs

 get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans >
            Ocean Chemistry> Radionuclides', 'Earth Science > Human
            Dimensions > Environmental Impacts > Nuclear Radiation
            Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean
            Tracers, Earth Science > Oceans > Marine Sediments', 'Earth
            Science > Oceans > Ocean Chemistry, Earth Science > Oceans >
            Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality >
            Ocean Contaminants', 'Earth Science > Biological
            Classification > Animals/Vertebrates > Fish', 'Earth Science >
            Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science >
            Biological Classification > Animals/Invertebrates > Mollusks',
            'Earth Science > Biological Classification >
            Animals/Invertebrates > Arthropods > Crustaceans', 'Earth
            Science > Biological Classification > Plants > Macroalgae
            (Seaweeds)'])

Retrieve global attributes from MARIS dump.
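Among the global attributes, the bounding box is derived from the coordinate extremes. As a minimal sketch, the hypothetical helper `bbox_wkt` below builds a closed WKT polygon from lists of longitudes and latitudes, using the same corner order as the `geospatial_bounds` attribute. It is not the `BboxCB` implementation.

```python
def bbox_wkt(lons, lats):
    "Return the bounding box of the points as a closed WKT polygon."
    lon_min, lon_max = min(lons), max(lons)
    lat_min, lat_max = min(lats), max(lats)
    # Counter-clockwise ring, repeating the first corner to close it.
    corners = [(lon_min, lat_min), (lon_max, lat_min),
               (lon_max, lat_max), (lon_min, lat_max),
               (lon_min, lat_min)]
    return 'POLYGON ((' + ', '.join(f'{x} {y}' for x, y in corners) + '))'

print(bbox_wkt([140.5, 141.5], [35.5, 38.5]))
```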

Exported source
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
Exported source
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw)
{'geospatial_lat_min': '141.66666667',
 'geospatial_lat_max': '38.63333333',
 'geospatial_lon_min': '140.60388889',
 'geospatial_lon_max': '35.79611111',
 'geospatial_bounds': 'POLYGON ((140.60388889 35.79611111, 141.66666667 35.79611111, 141.66666667 38.63333333, 140.60388889 38.63333333, 140.60388889 35.79611111))',
 'time_coverage_start': '2011-03-21T14:30:00',
 'time_coverage_end': '2024-12-21T08:03:00',
 'id': 'JEV6HP5A',
 'title': "Readings of Sea Area Monitoring - Monitoring of sea water - Sea area close to TEPCO's Fukushima Daiichi NPS / Coastal area - Readings of Sea Area Monitoring [TEPCO]",
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "TEPCO - Tokyo Electric Power Company"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Assign `NaN` to values equal to `ND` (not detected) - to be confirmed , Remove 約 (about) char, Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean, Select columns of interest., \n    Parse TEPCO measurement columns to extract nuclide name, measurement value, \n    detection limit and uncertainty\n    , \n    Remap `UNIT` name to MARIS id.\n    , \n    Remap `NUCLIDE` name to MARIS id.\n    , \n    Remap `DL` name to MARIS id.\n    , Parse time column from TEPCO., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}

source

encode

 encode (fname_out:str, **kwargs)

Encode TEPCO data to NetCDF.

Type Details
fname_out str Path to the NetCDF output file
kwargs VAR_KEYWORD Additional keyword arguments
Exported source
def encode(
    fname_out: str, # Path to the NetCDF output file
    **kwargs # Additional keyword arguments
    ):
    "Encode TEPCO data to NetCDF."
    dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
    
    tfm = Transformer(dfs, cbs=[
        FixMissingValuesCB(),
        RemoveJapanaseCharCB(),
        FixRangeValueStringCB(),
        SelectColsOfInterestCB(common_coi, nuclides_pattern),
        WideToLongCB(),
        RemapUnitNameCB(unit_mapping),
        RemapNuclideNameCB(nuclide_mapping),
        RemapDLCB(),
        ParseTimeCB(),
        EncodeTimeCB(),
        SanitizeLonLatCB()
    ])        
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out, 
                            global_attrs=get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw),
                            verbose=kwargs.get('verbose', False)
                            )
    encoder.encode()
encode(fname_out, verbose=False)
100%|██████████| 11/11 [00:06<00:00,  1.82it/s]
100%|██████████| 11/11 [00:06<00:00,  1.80it/s]
Warning: 3054 missing time value(s) in SEAWATER
decode(fname_in=fname_out, verbose=True)
Saved SEAWATER to ../../_data/output/tepco_SEAWATER.csv