Data pipeline (handler) to convert TEPCO dataset (Source) to NetCDF format

Configuration & file paths

Exported source
import re
from collections import defaultdict

import numpy as np
import pandas as pd
import fastcore.all as fc
from tqdm import tqdm
# Callback and Transformer come from the MARIS handler framework (import not shown in this section)

fname_coastal_water = 'https://radioactivity.nra.go.jp/cont/en/results/sea/coastal_water.csv'
fname_clos1F = 'https://radioactivity.nra.go.jp/cont/en/results/sea/close1F_water.xlsx'
fname_iaea_orbs = 'https://raw.githubusercontent.com/RML-IAEA/iaea.orbs/refs/heads/main/stations/station_points.csv'

fname_out = '../../_data/output/tepco.nc'

Load data

The data are loaded from the NRA (Nuclear Regulation Authority) website. For now, we only process radioactivity concentrations in seawater around the Fukushima Dai-ichi NPP [TEPCO], provided in the coastal_water.csv and close1F_water.xlsx files.

In the near future, MARIS will provide a dedicated handler for all ALPS-related data, including measurements provided not only by TEPCO but also by MOE, NRA, MLITT and Fukushima Prefecture.

Tip

FEEDBACK TO DATA PROVIDER: The coastal_water.csv file contains two sections: the measurements and the locations. Below, we identify the line number where the locations begin. A single source of truth for the station locations would ease future processing.


source

find_location_section

 find_location_section (df, col_idx=0, pattern='Sampling point number')

Find the line number where location data begins.

Exported source
def find_location_section(df, 
                          col_idx=0,
                          pattern='Sampling point number'
                          ):
    "Find the line number where location data begins."
    mask = df.iloc[:, col_idx] == pattern
    indices = df[mask].index
    return indices[0] if len(indices) > 0 else -1
find_location_section(pd.read_csv(fname_coastal_water, low_memory=False))
27483
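
The two-section layout can be reproduced on a small scale. The sketch below is only an illustration: the in-memory CSV content and the variable names are hypothetical, and only the marker string `Sampling point number` is taken from the handler above.

```python
import io
import pandas as pd

# Hypothetical two-section file: measurements first, then a second header
# introducing the station locations.
raw = (
    "Sampling point number,Sampling date,3H (Bq/L)\n"
    "T-1,2024/10/14,ND\n"
    "T-2,2024/10/21,4.7\n"
    "Sampling point number,lat,lon\n"
    "T-1,37.43,141.03\n"
    "T-2,37.42,141.03\n"
)

# Locate the row whose first cell repeats the header marker.
df_all = pd.read_csv(io.StringIO(raw))
marker = df_all.index[df_all.iloc[:, 0] == "Sampling point number"]
split_at = int(marker[0])

# Measurements: the rows before the marker.
df_meas_toy = pd.read_csv(io.StringIO(raw), nrows=split_at)
# Locations: skip the first header plus the measurement rows, so that the
# marker row becomes the new header.
df_locs_toy = pd.read_csv(io.StringIO(raw), skiprows=split_at + 1).iloc[:, :3]
```

Reading the file twice keeps each section's column types independent, at the cost of a second pass over the data.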
Tip

FEEDBACK TO DATA PROVIDER: The coastal_water.csv and close1F_water.xlsx files require distinct time parsing. Indeed:

  • coastal_water.csv stores the sampling date as YYYY/MM/DD and the sampling time as HH:MM, and
  • close1F_water.xlsx uses the format YYYY-MM-DD HH:MM:SS.
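
As a minimal sketch of what this difference implies, the snippet below brings one value of each convention to a common timestamp. The sample strings are hypothetical and only follow the two formats listed above; the standard-library `datetime` module is used here for illustration and is not part of the handler.

```python
from datetime import datetime

# Hypothetical samples in the two conventions.
coastal_date, coastal_time = "2011/3/21", "9:05"   # separate date and time, no seconds
close1f_stamp = "2013-08-14 08:17:00"              # single string with seconds

# Pad the hour and append seconds so both sources share HH:MM:SS.
hour, minute = coastal_time.split(":")[:2]
coastal_stamp = f"{coastal_date} {hour.zfill(2)}:{minute}:00"

t1 = datetime.strptime(coastal_stamp, "%Y/%m/%d %H:%M:%S")
t2 = datetime.strptime(close1f_stamp, "%Y-%m-%d %H:%M:%S")
```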

source

fix_sampling_time

 fix_sampling_time (x)
Exported source
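def fix_sampling_time(x):
    "Normalize a sampling time (`H:MM` or `HH:MM`) to `HH:MM:SS`; missing values become `00:00:00`."
    if pd.isna(x): 
        return '00:00:00'
    # `minute` avoids shadowing the built-in `min`
    hour, minute = x.split(':')[:2]
    return f"{hour.zfill(2)}:{minute}:00"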

source

get_coastal_water_df

 get_coastal_water_df (fname_coastal_water)

Get the measurements dataframe from the coastal_water.csv file.

Exported source
def get_coastal_water_df(fname_coastal_water):
    "Get the measurements dataframe from the `coastal_water.csv` file."
    
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=1, 
                     nrows=locs_idx - 1,
                     low_memory=False)
    df.dropna(subset=['Sampling point number'], inplace=True)
    df['Sampling time'] = df['Sampling time'].map(fix_sampling_time)
    
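    # `.str.replace` substitutes within each string (`Series.replace` only matches whole values)
    df['TIME'] = df['Sampling date'].str.replace('-', '/', regex=False) + ' ' + df['Sampling time']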
    
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_coastal_water = get_coastal_water_df(fname_coastal_water)
df_coastal_water.tail()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
27475 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 8.0E+00 NaN NaN NaN NaN NaN 2024/10/14 07:42:00
27476 T-D5 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 8.3E+00 NaN NaN NaN NaN NaN 2024/10/21 07:44:00
27477 T-S8 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 7.5E+00 NaN NaN NaN NaN NaN 2024/10/21 07:38:00
27478 T-S3 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.8E+00 NaN NaN NaN NaN NaN 2024/10/25 09:42:00
27479 T-S4 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 6.8E+00 NaN NaN NaN NaN NaN 2024/10/25 10:18:00

5 rows × 49 columns

Tip

FEEDBACK TO DATA PROVIDER

Identifying the station locations requires three distinct sources:

- the second section of the coastal_water.csv file
- the R6zahyo.pdf file further processed by https://github.com/RML-IAEA/iaea.orbs
- the second sections of all sheets of the close1F_water.xlsx file

All of these files and sheets are needed to look up the station locations.


source

get_locs_coastal_water

 get_locs_coastal_water (fname_coastal_water)
Exported source
def get_locs_coastal_water(fname_coastal_water):
    "Get the locations dataframe from the second section of the `coastal_water.csv` file."
    locs_idx = find_location_section(pd.read_csv(fname_coastal_water, 
                                      skiprows=0, low_memory=False))
    
    df = pd.read_csv(fname_coastal_water, skiprows=locs_idx+1, 
                     low_memory=False).iloc[:, :3]
    
    # The file lists latitude (~37°N) before longitude (~141°E)
    df.columns = ['station', 'LAT', 'LON']
    df.dropna(subset=['LAT', 'LON'], inplace=True)
    df['org'] = 'coastal_seawater.csv'
    return df[['station', 'LON', 'LAT', 'org']]
df_locs_coastal_water = get_locs_coastal_water(fname_coastal_water)
df_locs_coastal_water.head()
station LON LAT org
0 T-0 141.04 37.42 coastal_seawater.csv
1 T-11 141.05 37.24 coastal_seawater.csv
2 T-12 141.04 37.15 coastal_seawater.csv
3 T-13-1 141.04 37.64 coastal_seawater.csv
4 T-14 141.06 37.55 coastal_seawater.csv
df_locs_coastal_water['station'].unique()
array(['T-0', 'T-11', 'T-12', 'T-13-1', 'T-14', 'T-17-1', 'T-18', 'T-20',
       'T-22', 'T-3', 'T-4', 'T-4-1', 'T-4-2', 'T-5', 'T-6', 'T-7', 'T-A',
       'T-B', 'T-B1', 'T-B2', 'T-B3', 'T-B4', 'T-C', 'T-D', 'T-D1',
       'T-D5', 'T-D9', 'T-E', 'T-E1', 'T-Z', 'T-MG6', 'T-S1', 'T-S7',
       'T-H1', 'T-S2', 'T-S6', 'T-M10', 'T-MA', 'T-S3', 'T-S4', 'T-S8',
       'T-MG4', 'T-G4', 'T-MG5', 'T-MG1', 'T-MG0', 'T-MG3', 'T-MG2'],
      dtype=object)
Tip

FEEDBACK TO DATA PROVIDER: The data in the close1F_water.xlsx file are spread across several sheets (one per station). Each sheet in turn contains two sections: the measurements and the locations.

For each sheet, we have to identify the line number that separates the measurements from the locations. We then iterate over all sheets and concatenate the results.
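
The concatenation step can be sketched without the workbook itself. In the example below, a dictionary of small DataFrames stands in for the sheets; the station names, column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the workbook: one small DataFrame per sheet (station).
sheets = {
    "T-1": pd.DataFrame({"Sampling point number": ["T-1", "T-1"], "3H (Bq/L)": [4.7, 5.1]}),
    "T-2": pd.DataFrame({"Sampling point number": ["T-2"], "137Cs (Bq/L)": [0.6]}),
}

# Stack all sheets; columns absent from a sheet are filled with NaN.
stacked = pd.concat(list(sheets.values()), ignore_index=True)
```

Because sheets do not necessarily report the same nuclides, the stacked table is the union of all columns, which explains the many NaN cells in the outputs of this page.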


source

get_clos1F_df

 get_clos1F_df (fname_clos1F)

Get measurements dataframe from close1F_water.xlsx file and parse datetime.

Exported source
def get_clos1F_df(fname_clos1F):
    "Get measurements dataframe from close1F_water.xlsx file and parse datetime."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                   sheet_name=sheet_name, 
                   skiprows=1,
                   nrows=locs_idx-1)
        
        df.dropna(subset=['Sampling point number'], inplace=True)
        df['Sampling date'] = df['Sampling date']\
            .astype(str)\
            .apply(lambda x: x.split(' ')[0]\
            .replace('-', '/'))
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True)
    df.dropna(subset=['Sampling date'], inplace=True)
    df['TIME'] = df['Sampling date'] + ' ' + df['Sampling time'].astype(str)
    df = df.drop(columns=['Sampling date', 'Sampling time'])
    return df
df_clos1F = get_clos1F_df(fname_clos1F); df_clos1F.head()
100%|██████████| 11/11 [00:06<00:00,  1.83it/s]
Sampling point number 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) Total beta radioactivity concentration (Bq/L) Total beta detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) Collection layer of seawater ... 106Ru detection limit (Bq/L) 60Co radioactivity concentration (Bq/L) 60Co detection limit (Bq/L) 95Zr radioactivity concentration (Bq/L) 95Zr detection limit (Bq/L) 99Mo radioactivity concentration (Bq/L) 99Mo detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) TIME
0 T-0-1 ND 1.5 ND 1.4 ND 18.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN 4.7 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 ND 1.1 ND 1.4 ND 20.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN ND 2.9 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 ND 0.66 ND 0.49 ND 17.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 57 columns

df_clos1F['Sampling point number'].unique()
array(['T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)

source

get_locs_clos1F

 get_locs_clos1F (fname_clos1F)

Get the locations dataframe from each sheet of the close1F_water.xlsx file.

Exported source
def get_locs_clos1F(fname_clos1F):
    "Get the locations dataframe from each sheet of the close1F_water.xlsx file."
    excel_file = pd.ExcelFile(fname_clos1F)
    dfs = {}
    
    for sheet_name in tqdm(excel_file.sheet_names):
        locs_idx = find_location_section(pd.read_excel(excel_file, 
                                                       sheet_name=sheet_name,
                                                       skiprows=1))
        df = pd.read_excel(excel_file, 
                           sheet_name=sheet_name, 
                           skiprows=locs_idx+2)
            
        dfs[sheet_name] = df
    
    df = pd.concat(dfs.values(), ignore_index=True).iloc[:, :3]
    df.dropna(subset=['Sampling coordinate North latitude (Decimal)'], inplace=True)    
    # Each sheet lists latitude (~37°N) before longitude (~141°E)
    df.columns = ['station', 'LAT', 'LON']
    df['org'] = 'close1F.csv'
    return df[['station', 'LON', 'LAT', 'org']]
df_locs_clos1F = get_locs_clos1F(fname_clos1F)
df_locs_clos1F.head()
100%|██████████| 11/11 [00:05<00:00,  1.86it/s]
station LON LAT org
0 T-0-1 141.04 37.43 close1F.csv
11 T-0-1A 141.05 37.43 close1F.csv
22 T-0-2 141.05 37.42 close1F.csv
33 T-0-3 141.04 37.42 close1F.csv
44 T-0-3A 141.05 37.42 close1F.csv
Tip

FEEDBACK TO DATA PROVIDER: In theory, all locations are supposed to be provided in the R6zahyo.pdf file. This file is further processed by https://github.com/RML-IAEA/iaea.orbs and the result is provided in the station_points.csv file.

However, this file does not contain all the locations referred to in the coastal_water.csv and close1F_water.xlsx files.


source

get_locs_orbs

 get_locs_orbs (fname_iaea_orbs)
Exported source
def get_locs_orbs(fname_iaea_orbs):
    df = pd.read_csv(fname_iaea_orbs)
    df.columns = ['org', 'station', 'LON', 'LAT']
    return df
df_locs_orbs = get_locs_orbs(fname_iaea_orbs)
df_locs_orbs.head()
org station LON LAT
0 MOE E-31 141.727667 39.059167
1 MOE E-32 141.635667 38.996000
2 MOE E-37 141.948611 39.259167
3 MOE E-38 141.755000 39.008333
4 MOE E-39 141.766667 38.991667

source

concat_locs

 concat_locs (dfs)

Concatenate and drop duplicates from coastal_seawater.csv and iaea_orbs.csv (kept)

Exported source
def concat_locs(dfs):
    "Concatenate and drop duplicates from coastal_seawater.csv and iaea_orbs.csv (kept)"
    df = pd.concat(dfs)
    # Group by org to be used for sorting
    df['org_grp'] = df['org'].apply(
        lambda x: 1 if x == 'coastal_seawater.csv' else 2 if x == 'close1F.csv' else 0)
    df.sort_values('org_grp', ascending=True, inplace=True)
    # Drop duplicates and keep orbs data first
    df.drop_duplicates(subset='station', keep='first', inplace=True)
    df.drop(columns=['org_grp'], inplace=True)
    df.sort_values('station', ascending=True, inplace=True)
    return df
# df_locs = concat_locs(df_locs_coastal_water, df_locs_orbs)
df_locs = concat_locs([df_locs_clos1F, df_locs_coastal_water, df_locs_orbs])
df_locs.head()
station LON LAT org
214 C-P1 139.863333 35.425000 NRA
215 C-P2 139.863333 35.401667 NRA
216 C-P3 139.881667 35.370000 NRA
217 C-P4 139.846667 35.356667 NRA
218 C-P5 139.800000 35.343333 NRA
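
The precedence rule used when the same station appears in several sources can be illustrated on toy data. This is a minimal sketch: the station tables and the `priority` mapping are hypothetical, and only the `org` labels mirror those assigned by the loaders above.

```python
import pandas as pd

# Hypothetical station tables: "T-1" appears in both sources with different precision.
orbs = pd.DataFrame({"org": ["TEPCO"], "station": ["T-1"],
                     "LON": [141.034444], "LAT": [37.431111]})
sheet = pd.DataFrame({"org": ["close1F.csv", "close1F.csv"], "station": ["T-1", "T-9"],
                      "LON": [141.03, 141.05], "LAT": [37.43, 37.40]})

# Lower rank wins; any org not listed here (the ORBS rows) gets rank 0.
priority = {"coastal_seawater.csv": 1, "close1F.csv": 2}

merged = pd.concat([sheet, orbs], ignore_index=True)
merged["rank"] = merged["org"].map(priority).fillna(0)
merged = (merged.sort_values("rank", kind="stable")
                .drop_duplicates(subset="station", keep="first")
                .drop(columns="rank")
                .sort_values("station")
                .reset_index(drop=True))
```

Here `T-1` keeps the higher-precision ORBS coordinates, while `T-9`, which is missing from ORBS, falls back to the sheet values.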

source

align_dfs

 align_dfs (df_from, df_to)

Align columns structure of df_from to df_to.

Exported source
def align_dfs(df_from, df_to):
    "Align columns structure of df_from to df_to."
    df = {}
    for c in df_to.columns:
        # Columns missing from df_from are filled with NaN (`np.NAN` was removed in NumPy 2.0)
        df[c] = df_from[c].values if c in df_from.columns else np.nan
    return pd.DataFrame(df)
align_dfs(df_clos1F, df_coastal_water).head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-0-1 NaN NaN NaN ND 1.5 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
1 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 4.7 NaN NaN NaN NaN NaN NaN 2013/08/14 08:17:00
2 T-0-1 NaN NaN NaN ND 1.1 ND 1.4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/21 08:09:00
3 T-0-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN ND 2.9 NaN NaN NaN NaN NaN 2013/08/21 08:09:00
4 T-0-1 NaN NaN NaN ND 0.66 ND 0.49 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013/08/27 08:14:00

5 rows × 49 columns
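
For comparison, the same alignment can be sketched with `DataFrame.reindex`, which is a different technique from the loop in `align_dfs`. The column names and values below are hypothetical.

```python
import pandas as pd

# Target layout (only the column names matter) and a source with a different set of columns.
df_to = pd.DataFrame(columns=["station", "131I (Bq/L)", "3H (Bq/L)", "TIME"])
df_from = pd.DataFrame({
    "station": ["T-0-1"],
    "3H (Bq/L)": [4.7],
    "60Co (Bq/L)": [0.1],            # not in the target layout: dropped
    "TIME": ["2013/08/14 08:17:00"],
})

# Keep the target's columns, in the target's order; missing ones become NaN.
aligned = df_from.reindex(columns=df_to.columns)
```

Note that both approaches silently drop the columns that exist only in the source frame.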


source

concat_dfs

 concat_dfs (df_coastal_water, df_clos1F)

Align close1F_water.xlsx measurements to the coastal_water.csv columns and concatenate both.

Exported source
def concat_dfs(df_coastal_water, df_clos1F):
    "Align `close1F_water.xlsx` measurements to the `coastal_water.csv` columns and concatenate both."
    df_clos1F = align_dfs(df_clos1F, df_coastal_water)
    df = pd.concat([df_coastal_water, df_clos1F])
    return df
df_meas = concat_dfs(df_coastal_water, df_clos1F)
df_meas.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:15:00
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/21 23:45:00
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 14:28:00
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011/3/22 15:06:00
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN NaN NaN NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00

5 rows × 49 columns


source

georef_data

 georef_data (df_meas, df_locs)

Georeference measurements dataframe using locations dataframe.

Exported source
def georef_data(df_meas, df_locs):
    "Georeference measurements dataframe using locations dataframe."
    assert "Sampling point number" in df_meas.columns and "station" in df_locs.columns
    return pd.merge(df_meas, df_locs, how="inner", 
                    left_on='Sampling point number', right_on='station')
df_meas_georef = georef_data(df_meas, df_locs)
df_meas_georef.head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns
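
Because the merge uses `how="inner"`, measurements whose station has no known location are dropped without notice. A minimal sketch of how such stations could be listed beforehand, using hypothetical toy data and the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Hypothetical measurements and locations; "T-X" has no known location.
meas = pd.DataFrame({"Sampling point number": ["T-1", "T-1", "T-X"],
                     "value": [1.0, 2.0, 3.0]})
locs = pd.DataFrame({"station": ["T-1"], "LON": [141.034444], "LAT": [37.431111]})

# Inner join, as in georef_data: unmatched measurements disappear.
inner = pd.merge(meas, locs, how="inner",
                 left_on="Sampling point number", right_on="station")

# Left join with an indicator column to list the stations that would be lost.
check = pd.merge(meas, locs, how="left",
                 left_on="Sampling point number", right_on="station",
                 indicator=True)
unmatched = sorted(check.loc[check["_merge"] == "left_only",
                             "Sampling point number"].unique())
```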


source

load_data

 load_data (fname_coastal_water, fname_clos1F, fname_iaea_orbs)

Load, align and georeference TEPCO data

Exported source
def load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs):
    "Load, align and georeference TEPCO data"
    df_locs = concat_locs(
        [get_locs_coastal_water(fname_coastal_water), 
         get_locs_clos1F(fname_clos1F),
         get_locs_orbs(fname_iaea_orbs)])
    df_meas = concat_dfs(get_coastal_water_df(fname_coastal_water), get_clos1F_df(fname_clos1F))
    df_meas.dropna(subset=['Sampling point number'], inplace=True)
    return {'SEAWATER': georef_data(df_meas, df_locs)}
dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
dfs['SEAWATER'].head()
100%|██████████| 11/11 [00:05<00:00,  1.86it/s]
100%|██████████| 11/11 [00:05<00:00,  1.85it/s]
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 ND 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 ND 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

print(f"# of rows, cols: {dfs['SEAWATER'].shape}")
# of rows, cols: (46421, 53)

dfs['SEAWATER']['Sampling point number'].unique()
array(['T-3', 'T-4', 'T-5', 'T-7', 'T-11', 'T-12', 'T-14', 'T-18', 'T-20',
       'T-22', 'T-MA', 'T-M10', 'T-A', 'T-D', 'T-E', 'T-B', 'T-C',
       'T-MG1', 'T-MG2', 'T-MG3', 'T-MG4', 'T-MG5', 'T-MG6', 'T-D1',
       'T-D5', 'T-D9', 'T-E1', 'T-G4', 'T-H1', 'T-S5', 'T-S6', 'T-17-1',
       'T-B3', 'T-13-1', 'T-S3', 'T-S4', 'T-B4', 'T-S1', 'T-S2', 'T-MG0',
       'T-Z', 'T-B1', 'T-B2', 'T-S7', 'T-S8', 'T-0', 'T-4-1', 'T-4-2',
       'T-6', 'T-0-1', 'T-0-1A', 'T-0-2', 'T-0-3', 'T-0-3A', 'T-1', 'T-2',
       'T-2-1', 'T-A1', 'T-A2', 'T-A3'], dtype=object)

Fix missing values

Tip

FEEDBACK TO DATA PROVIDER: We remap ND (not detected) values to NaN. Please confirm that this is the correct way to handle them.
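
The effect of this remapping can be sketched on a toy table. The values below are hypothetical and `DataFrame.mask` is used here for brevity; it is not the handler's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical pair of columns: a non-detect in the first row, a detection in the second.
df = pd.DataFrame({
    "3H radioactivity concentration (Bq/L)": ["ND", "4.7"],
    "3H detection limit (Bq/L)": ["8.0E+00", np.nan],
})

# Blank out every cell equal to "ND", then convert the remaining strings to floats.
cleaned = df.mask(df == "ND").astype(float)
```

After the remapping, a non-detect row carries NaN as its concentration while the detection limit remains available in its own column.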


source

FixMissingValuesCB

 FixMissingValuesCB ()

Assign NaN to values equal to ND (not detected) - to be confirmed

Exported source
class FixMissingValuesCB(Callback):
    "Assign `NaN` to values equal to `ND` (not detected) - to be confirmed "
    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            predicate = tfm.dfs[k] == 'ND'
            tfm.dfs[k][predicate] = np.nan
tfm = Transformer(dfs, cbs=[FixMissingValuesCB()])
tfm()['SEAWATER'].head()
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
0 T-3 NaN 1.1E+03 1.3E+01 4.8E+01 9.2E+00 5.3E+01 8.8E+00 1.6E+02 44.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:15:00 T-3 141.026389 37.322222 TEPCO
1 T-4 NaN 6.6E+02 1.2E+01 3.1E+01 8.7E+00 3.3E+01 8.3E+00 1.2E+02 41.0 ... NaN NaN NaN NaN NaN 2011/3/21 23:45:00 T-4 141.013889 37.241667 TEPCO
2 T-3 NaN 1.1E+03 2.0E+01 4.6E+01 1.4E+01 4.0E+01 1.4E+01 NaN 88.0 ... NaN NaN NaN NaN NaN 2011/3/22 14:28:00 T-3 141.026389 37.322222 TEPCO
3 T-4 NaN 6.7E+02 1.9E+01 3.9E+01 1.1E+01 4.4E+01 1.1E+01 NaN 79.0 ... NaN NaN NaN NaN NaN 2011/3/22 15:06:00 T-4 141.013889 37.241667 TEPCO
4 T-3 NaN 7.4E+02 2.7E+01 5.1E+01 2.0E+01 5.5E+01 2.0E+01 2.0E+02 58.0 ... NaN NaN 34.0 25.0 NaN 2011/3/23 13:51:00 T-3 141.026389 37.322222 TEPCO

5 rows × 53 columns

Remove 約 (about) character

Tip

FEEDBACK TO DATA PROVIDER: We systematically remove the 約 (about) character. Please confirm that this is the correct way to handle it. Reporting an explicit uncertainty would be less ambiguous in future.
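
A minimal sketch of the removal on hypothetical values, using the vectorized string methods of pandas rather than the element-wise callback defined below:

```python
import pandas as pd

# Hypothetical cell contents: one approximate value, one plain value, one missing.
s = pd.Series(["約1.5E+01", "2.0E+00", None])

# Strip the marker, then parse what is left as numbers.
stripped = s.str.replace("約", "", regex=False)
values = pd.to_numeric(stripped)
```

The information that the first value was approximate is lost in the process, which is why an explicit uncertainty would be preferable.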


source

RemoveJapanaseCharCB

 RemoveJapanaseCharCB ()

Remove 約 (about) char

Exported source
class RemoveJapanaseCharCB(Callback):
    "Remove 約 (about) char"
    def _transform_if_about(self, value, about_char='約'):
        if pd.isna(value): return value
        return (value.replace(about_char, '') if str(value).count(about_char) != 0 
                else value)
    
    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_about)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB()])

tfm()['SEAWATER'].sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
6639 T-6 上層 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2014/2/14 09:30:00 T-6 141.040556 37.478889 TEPCO
8437 T-12 下層 NaN NaN 1.8E-03 NaN 8.8E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2014/12/23 06:26:00 T-12 141.037500 37.150000 TEPCO
34322 T-1 上層 NaN 0.62 NaN 0.76 NaN 0.58 NaN NaN ... NaN NaN NaN NaN NaN 2015/11/23 08:40:00 T-1 141.034444 37.431111 TEPCO
42599 T-2 上層 NaN NaN NaN 0.68 NaN 0.75 NaN NaN ... NaN NaN NaN NaN NaN 2023/02/12 08:30:00 T-2 141.033611 37.415833 TEPCO
37557 T-1 上層 NaN NaN NaN 0.68 NaN 0.69 NaN NaN ... NaN NaN NaN NaN NaN 2022/09/11 08:00:00 T-1 141.034444 37.431111 TEPCO
33013 T-1 上層 NaN 0.49 NaN 1.1 NaN 1.4 NaN NaN ... NaN NaN NaN NaN NaN 2013/01/18 12:25:00 T-1 141.034444 37.431111 TEPCO
43329 T-2 上層 NaN NaN NaN 0.91 NaN 0.64 NaN NaN ... NaN NaN NaN NaN NaN 2024/07/02 06:45:00 T-2 141.033611 37.415833 TEPCO
15953 T-11 下層 NaN NaN NaN 1.3E-03 7.2E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2019/3/13 10:05:00 T-11 141.047222 37.241667 TEPCO
20966 T-D5 上層 NaN NaN NaN 1.2E-03 3.4E-03 NaN NaN NaN ... NaN NaN NaN NaN NaN 2022/2/22 08:17:00 T-D5 141.072222 37.416667 TEPCO
34532 T-1 上層 NaN 0.71 NaN 0.5 NaN 0.57 NaN NaN ... NaN NaN NaN NaN NaN 2016/05/08 07:55 T-1 141.034444 37.431111 TEPCO

10 rows × 53 columns

Fix values range string

Tip

FEEDBACK TO DATA PROVIDER: Value ranges are provided as strings (e.g. ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’). We replace them with the mean of their bounds. Please confirm that this is the correct way to handle them. Again, reporting an explicit uncertainty would be less ambiguous in future.
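
The replacement can be sketched as follows on the two example strings quoted above. The `range_mean` helper and its regular expression are hypothetical and slightly stricter than the pattern used by the callback below.

```python
import re
import numpy as np

# A number with an optional exponent, e.g. "4.0E+00" or "2.7".
NUMBER = re.compile(r"[+-]?\d+\.?\d*(?:[Ee][+-]?\d+)?")

def range_mean(text):
    "Mean of all numbers found in a range string (hypothetical helper)."
    bounds = [float(x) for x in NUMBER.findall(text)]
    return float(np.mean(bounds))

a = range_mean("4.0E+00<&<8.0E+00")
b = range_mean("1.0~2.7")
```

With these inputs, `a` is the midpoint of 4 and 8, and `b` the midpoint of 1.0 and 2.7.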


source

FixRangeValueStringCB

 FixRangeValueStringCB ()

Replace range values (e.g. ‘4.0E+00<&<8.0E+00’ or ‘1.0~2.7’) with the mean of their bounds

Exported source
class FixRangeValueStringCB(Callback):
    "Replace range values (e.g. '4.0E+00<&<8.0E+00' or '1.0~2.7') with the mean of their bounds"
    
    def _extract_and_calculate_mean(self, s):
        # Extract every number (plain or scientific notation) found in the string
        float_strings = re.findall(r"[+-]?\d+\.?\d*E?[+-]?\d*", s)
        if float_strings:
            float_numbers = np.array(float_strings, dtype=float)
            return float_numbers.mean()
        return s
    
    def _transform_if_range(self, value):
        if pd.isna(value): 
            return value
        value = str(value)
        # Check for both range patterns
        if '<&<' in value or '~' in value:
            return self._extract_and_calculate_mean(value)
        return value

    def __call__(self, tfm): 
        for k in tfm.dfs.keys():
            cols_rdn = [c for c in tfm.dfs[k].columns 
                       if ('(Bq/L)' in c) and (tfm.dfs[k][c].dtype == 'object')]
            tfm.dfs[k][cols_rdn] = tfm.dfs[k][cols_rdn].map(self._transform_if_range).astype(float)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(10)
Sampling point number Collection layer of seawater 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) 137Cs detection limit (Bq/L) 132I radioactivity concentration (Bq/L) 132I detection limit (Bq/L) ... 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L) Unnamed: 49 TIME station LON LAT org
37611 T-1 上層 NaN NaN NaN 0.5900 NaN 0.71 NaN NaN ... NaN NaN NaN NaN NaN 2022/10/22 08:20:00 T-1 141.034444 37.431111 TEPCO
19691 T-B4 上層 NaN NaN NaN 0.0015 0.0024 NaN NaN NaN ... NaN NaN NaN NaN NaN 2021/5/25 06:41:00 T-B4 141.148611 37.348333 TEPCO
33488 T-1 上層 NaN 0.73 NaN 0.7600 NaN 0.68 NaN NaN ... NaN NaN NaN NaN NaN 2014/01/28 08:00:00 T-1 141.034444 37.431111 TEPCO
17019 T-MG4 上層 NaN NaN NaN 0.0013 0.0052 NaN NaN NaN ... NaN NaN NaN NaN NaN 2019/10/28 10:49:00 T-MG4 141.133333 38.250000 TEPCO
16040 T-5 下層 NaN NaN NaN 0.0012 0.0033 NaN NaN NaN ... NaN NaN NaN NaN NaN 2019/3/25 06:16:00 T-5 141.200000 37.416667 TEPCO
2152 T-4 上層 NaN 0.71 NaN 0.9400 NaN 1.00 NaN NaN ... NaN NaN NaN NaN NaN 2011/12/16 07:55:00 T-4 141.013889 37.241667 TEPCO
37438 T-1 上層 NaN NaN NaN 0.7600 NaN 0.53 NaN NaN ... NaN NaN NaN NaN NaN 2022/06/13 08:40:00 T-1 141.034444 37.431111 TEPCO
6329 T-D5 上層 NaN NaN 0.0064 NaN 0.0180 NaN NaN NaN ... NaN NaN NaN NaN NaN 2013/12/17 10:29:00 T-D5 141.072222 37.416667 TEPCO
2507 T-4 上層 NaN 0.80 NaN 0.8500 1.2000 NaN NaN NaN ... NaN NaN NaN NaN NaN 2012/1/27 08:00:00 T-4 141.013889 37.241667 TEPCO
2860 T-12 上層 NaN 0.86 NaN 1.0000 NaN 1.10 NaN NaN ... NaN NaN NaN NaN NaN 2012/3/15 07:30:00 T-12 141.037500 37.150000 TEPCO

10 rows × 53 columns

Select columns of interest

We select the columns of interest and in particular the elements of interest, in our case radionuclides.


source

SelectColsOfInterestCB

 SelectColsOfInterestCB (common_coi, nuclides_pattern)

Select columns of interest.

Exported source
common_coi = ['org', 'LON', 'LAT', 'TIME', 'station']
nuclides_pattern = '(Bq/L)'
Exported source
class SelectColsOfInterestCB(Callback):
    "Select columns of interest."
    def __init__(self, common_coi, nuclides_pattern): fc.store_attr()
    def __call__(self, tfm):
        # Use the pattern stored on the instance rather than the module-level variable
        nuc_of_interest = [c for c in tfm.dfs['SEAWATER'].columns if self.nuclides_pattern in c]
        tfm.dfs['SEAWATER'] = tfm.dfs['SEAWATER'][self.common_coi + nuc_of_interest]
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
org LON LAT TIME station 131I radioactivity concentration (Bq/L) 131I detection limit (Bq/L) 134Cs radioactivity concentration (Bq/L) 134Cs detection limit (Bq/L) 137Cs radioactivity concentration (Bq/L) ... 144Ce radioactivity concentration (Bq/L) 144Ce detection limit (Bq/L) 54Mn radioactivity concentration (Bq/L) 54Mn detection limit (Bq/L) 3H radioactivity concentration (Bq/L) 3H detection limit (Bq/L) 125Sb radioactivity concentration (Bq/L) 125Sb detection limit (Bq/L) 105Ru radioactivity concentration (Bq/L) 105Ru detection limit (Bq/L)
6857 TEPCO 141.000000 38.083333 2014/3/18 10:16:00 T-MG6 NaN NaN 0.0032 NaN 0.0120 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13993 TEPCO 141.000000 38.083333 2018/1/11 10:16:00 T-MG6 NaN NaN NaN 0.0014 0.0035 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32155 TEPCO 141.046667 37.416111 2024/04/29 07:47:00 T-0-3A NaN NaN NaN 0.3300 NaN ... NaN NaN NaN NaN NaN 6.3 NaN NaN NaN NaN
1367 TEPCO 141.000000 36.966667 2011/9/12 05:30:00 T-20 NaN 4.0 NaN 6.0000 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7755 TEPCO 141.042500 37.640833 2014/8/29 05:58:00 T-13-1 NaN NaN 0.0021 NaN 0.0071 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 50 columns

Reshape: wide to long

This step is necessary to extract information such as the nuclide name, detection limit, uncertainty, …
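
A minimal sketch of the reshaping on a one-row toy table. The values and the `kind` helper are hypothetical; the column naming follows the `<nuclide> <quantity> (<unit>)` convention seen in the outputs above.

```python
import pandas as pd

# Hypothetical wide table: one row, two measurement columns for the same nuclide.
wide = pd.DataFrame({
    "station": ["T-1"],
    "TIME": ["2024/10/14 07:42:00"],
    "3H radioactivity concentration (Bq/L)": [4.7],
    "3H detection limit (Bq/L)": [2.9],
})

def kind(col):
    "Classify a measurement column by the quantity it reports (hypothetical helper)."
    if "radioactivity concentration" in col: return "VALUE"
    if "detection limit" in col: return "DL"
    if "statistical error" in col: return "UNC"
    return None

# One row per (station, time, measurement column).
long = wide.melt(id_vars=["station", "TIME"])
long["NUCLIDE"] = long["variable"].str.split(" ").str[0]
long["UNIT"] = long["variable"].str.extract(r"\((.*?)\)", expand=False)
long["type"] = long["variable"].map(kind)
```

Each original column name is thereby decomposed into a nuclide, a unit and a quantity type, which the wide layout only encoded implicitly.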


source

WideToLongCB

 WideToLongCB ()

Parse TEPCO measurement columns to extract nuclide name, measurement value, detection limit and uncertainty

Exported source
class WideToLongCB(Callback):
    """
    Parse TEPCO measurement columns to extract nuclide name, measurement value, 
    detection limit and uncertainty
    """
    def __init__(self): fc.store_attr()
    
    
    def _melt(self, df):
        "Melt dataframe to long format."
        return df.melt(id_vars=['LON', 'LAT', 'TIME', 'station'])
        
    def _extract_nuclide(self, text):
        words = text.split(' ')
        # Handle special cases for alpha/beta
        if len(words) >= 2 and words[1].lower() in ['alpha', 'beta']:
            return f"{words[0]} {words[1]}"
        return words[0]
    
    def _nuclide_name(self, df):
        "Extract nuclide name from the melted column names."
        df['NUCLIDE'] = df['variable'].map(self._extract_nuclide)
        return df
    
    def _type_indicator(self, df):
        "Create type indicators."
        df['is_concentration'] = df['variable'].str.contains('radioactivity concentration')
        df['is_dl'] = df['variable'].str.contains('detection limit')
        df['is_unc'] = df['variable'].str.contains('statistical error')
        return df
    
    def _unit(self, df):
        "Extract unit from the melted column names."
        df['UNIT'] = df['variable'].str.extract(r'\((.*?)\)')
        return df
    
    def _type_column(self, df):
        "Create type column."
        conditions = [
            df['is_concentration'],
            df['is_dl'],
            df['is_unc']
        ]
        choices = ['VALUE', 'DL', 'UNC']
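        # Explicit string default: recent NumPy versions reject the implicit
        # integer default (0) when the choices are strings.
        df['type'] = np.select(conditions, choices, default='UNKNOWN')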
        df = df.drop(['is_concentration', 'is_dl', 'is_unc'], axis=1)
        return df
    
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'] = self._melt(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._nuclide_name(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._type_indicator(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._unit(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = self._type_column(tfm.dfs['SEAWATER'])
        tfm.dfs['SEAWATER'] = pd.pivot_table(
            tfm.dfs['SEAWATER'],
            values='value',
            index=['LON', 'LAT', 'TIME', 'station', 'NUCLIDE', 'UNIT'],
            columns='type',
            aggfunc='first'
        ).reset_index()
        # reset the index and rename it ID
        tfm.dfs['SEAWATER'].reset_index(inplace=True)
        tfm.dfs['SEAWATER'].rename(columns={'index': 'ID'}, inplace=True)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.head()
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
0 0 37.21 141.01 2012/10/16 07:25:00 T-4-1 131I Bq/L 0.13 NaN NaN
1 1 37.21 141.01 2012/10/16 07:25:00 T-4-1 134Cs Bq/L 0.19 NaN NaN
2 2 37.21 141.01 2012/10/16 07:25:00 T-4-1 137Cs Bq/L 0.27 NaN NaN
3 3 37.21 141.01 2012/10/2 07:30:00 T-4-1 131I Bq/L 0.11 NaN NaN
4 4 37.21 141.01 2012/10/2 07:30:00 T-4-1 134Cs Bq/L 0.22 NaN NaN
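
To make the reshaping easier to follow, here is a minimal sketch of the same melt, parse and pivot sequence on a two-row toy frame. The values are made up, the alpha/beta special case is left out, and `default='OTHER'` is an assumption of this sketch rather than something the callback does.

```python
import numpy as np
import pandas as pd

# Toy wide frame: one nuclide, two sampling times (values are illustrative).
wide = pd.DataFrame({
    'LON': [141.03, 141.03],
    'LAT': [37.42, 37.42],
    'TIME': ['2011/04/04 14:20:00', '2012/10/16 07:25:00'],
    'station': ['T-2', 'T-2'],
    '134Cs radioactivity concentration (Bq/L)': [19000.0, np.nan],
    '134Cs detection limit (Bq/L)': [68.0, 0.19],
})

# 1. One row per (sample, original column).
long_df = wide.melt(id_vars=['LON', 'LAT', 'TIME', 'station'])

# 2. Parse the original column name: first word is the nuclide,
#    the text in parentheses is the unit.
long_df['NUCLIDE'] = long_df['variable'].str.split(' ').str[0]
long_df['UNIT'] = long_df['variable'].str.extract(r'\((.*?)\)', expand=False)

# 3. Label each row by the kind of quantity it carries.
long_df['type'] = np.select(
    [long_df['variable'].str.contains('radioactivity concentration'),
     long_df['variable'].str.contains('detection limit')],
    ['VALUE', 'DL'],
    default='OTHER',
)

# 4. Pivot so that VALUE and DL become columns of a single measurement row.
tidy = pd.pivot_table(
    long_df, values='value',
    index=['LON', 'LAT', 'TIME', 'station', 'NUCLIDE', 'UNIT'],
    columns='type', aggfunc='first',
).reset_index()
print(tidy)
```

The result has one row per sample and nuclide, with `DL` and `VALUE` side by side; a sample that only reports a detection limit keeps `NaN` in `VALUE`.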

Remap UNIT name to MARIS nomenclature


source

RemapUnitNameCB

 RemapUnitNameCB (unit_mapping)

Remap UNIT name to MARIS id.

Exported source
unit_mapping = {'Bq/L': 3}
Exported source
class RemapUnitNameCB(Callback):
    """
    Remap `UNIT` name to MARIS id.
    """
    def __init__(self, unit_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['UNIT'] = tfm.dfs['SEAWATER']['UNIT'].map(self.unit_mapping)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
18784 18784 141.033611 37.415833 2011/03/26 14:30:00 T-2 136Cs 3 52.0 NaN 1300.0
59680 59680 141.046667 37.423333 2024/01/24 08:07:00 T-0-2 3H 3 7.7 NaN NaN
45767 45767 141.034444 37.431111 2023/10/08 07:00:00 T-1 137Cs 3 0.78 NaN NaN
86664 86664 141.666667 38.300000 2015/3/26 08:01:00 T-MG2 134Cs 3 0.0017 NaN NaN
14504 14504 141.013889 37.241667 2017/4/25 13:50:00 T-4 137Cs 3 NaN NaN 0.033

Remap NUCLIDE name to MARIS nomenclature


source

RemapNuclideNameCB

 RemapNuclideNameCB (nuclide_mapping)

Remap NUCLIDE name to MARIS id.

Exported source
nuclide_mapping = {
    '131I': 29,
    '134Cs': 31,
    '137Cs': 33,
    '125Sb': 24,
    'Total beta': 103,
    '238Pu': 67,
    '239Pu+240Pu': 77,
    '3H': 1,
    '89Sr': 11,
    '90Sr': 12,
    'Total alpha': 104,
    '132I': 100,
    '136Cs': 102,
    '58Co': 8,
    '105Ru': 97,
    '106Ru': 17,
    '140La': 35,
    '140Ba': 34,
    '132Te': 99,
    '60Co': 9,
    '144Ce': 37,
    '54Mn': 6
}
Exported source
class RemapNuclideNameCB(Callback):
    """
    Remap `NUCLIDE` name to MARIS id.
    """
    def __init__(self, nuclide_mapping): fc.store_attr()
    def __call__(self, tfm):
        tfm.dfs['SEAWATER']['NUCLIDE'] = tfm.dfs['SEAWATER']['NUCLIDE'].map(self.nuclide_mapping)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping)
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
7337 7337 140.665556 36.506389 2024/4/17 08:30:00 T-B 33 3 1.1 NaN NaN
27402 27402 141.033611 37.415833 2021/01/02 06:55:00 T-2 33 3 0.77 NaN NaN
24020 24020 141.033611 37.415833 2018/06/08 07:10:00 T-2 33 3 0.71 NaN NaN
24764 24764 141.033611 37.415833 2018/12/02 07:20:00 T-2 29 3 0.68 NaN NaN
18866 18866 141.033611 37.415833 2011/04/04 14:20:00 T-2 31 3 68.0 NaN 19000.0
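
`Series.map` with a dictionary returns `NaN` for any key that is missing from the mapping, so a nuclide absent from `nuclide_mapping` would lose its identity without raising an error. A small check such as the one below can surface those cases before remapping. The helper name `unmapped_values` and the extra nuclide `99Tc` are hypothetical and only serve the illustration.

```python
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> list:
    "Return the sorted unique non-null values of `series` that have no key in `mapping`."
    present = series.dropna().unique()
    return sorted(v for v in present if v not in mapping)

# '99Tc' is a made-up entry that is deliberately absent from the mapping.
nuclides = pd.Series(['137Cs', '3H', '137Cs', '99Tc'])
partial_mapping = {'137Cs': 33, '3H': 1}  # subset of the mapping shown above

missing = unmapped_values(nuclides, partial_mapping)
remapped = nuclides.map(partial_mapping)

print(missing)
print(remapped.isna().sum(), 'value(s) would become NaN after remapping')
```

The same check applies unchanged to the `UNIT` column and `unit_mapping`.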

Remap DL value to MARIS nomenclature

We remap the DL (Detection Limit) column to MARIS ids as follows:

  • if a DL value is reported, we assign 2 (Detection limit or ‘<’)
  • if no DL value is reported, we assign 1 (Detected value or ‘=’)
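
The rule can be sketched on a toy series. The vectorized `np.where` form below is an alternative written for illustration, not the callback's implementation, and it is compared against the element-wise form of the same rule.

```python
import numpy as np
import pandas as pd

# Toy DL column: a reported limit, a missing one, another reported limit.
dl = pd.Series([0.13, np.nan, 68.0])

# Vectorized form of the rule: 1 where no DL is reported, 2 otherwise.
dl_flag = pd.Series(np.where(dl.isna(), 1, 2), index=dl.index)

# Element-wise form of the same rule, for comparison.
dl_flag_elementwise = dl.map(lambda v: 1 if pd.isna(v) else 2)

print(dl_flag.tolist())
```

The rule looks only at the `DL` column, so a row that reports both a detection limit and a measured value is flagged 2.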

source

RemapDLCB

 RemapDLCB ()

Remap DL name to MARIS id.

Exported source
class RemapDLCB(Callback):
    """
    Remap `DL` name to MARIS id.
    """
    def __init__(self): fc.store_attr()
    def dl_mapping(self, value): return 1 if pd.isna(value) else 2
    def __call__(self, tfm): 
        tfm.dfs['SEAWATER']['DL'] = tfm.dfs['SEAWATER']['DL'].map(self.dl_mapping)
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(10)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
53677 53677 141.040556 37.478889 2020/7/28 10:10:00 T-6 33 3 1 NaN 0.013
27688 27688 141.033611 37.415833 2021/03/29 07:00:00 T-2 103 3 1 NaN 10.0
24009 24009 141.033611 37.415833 2018/06/05 06:55:00 T-2 103 3 1 NaN 15.0
14111 14111 141.013889 37.241667 2012/9/11 07:35:00 T-4 33 3 1 NaN 0.28
78923 78923 141.200000 37.416667 2012/2/5 09:00:00 T-5 33 3 2 NaN NaN
78237 78237 141.200000 37.233333 2015/9/25 07:38:00 T-7 31 3 2 NaN NaN
71999 71999 141.072222 37.416667 2024/6/24 08:34:00 T-D5 31 3 2 NaN NaN
5053 5053 37.410000 141.030000 2016/5/23 06:05 T-2-1 103 3 1 NaN 12.0
64018 64018 141.047222 37.311111 2012/8/10 06:43:00 T-S7 33 3 1 NaN 0.07
8601 8601 140.763889 36.713889 2012/4/3 08:06:00 T-A 33 3 2 NaN NaN

Parse & encode time


source

ParseTimeCB

 ParseTimeCB (time_name='TIME')

Parse time column from TEPCO.

Exported source
class ParseTimeCB(Callback):
    "Parse time column from TEPCO."
    def __init__(self,
                 time_name='TIME'):
        fc.store_attr()
        
    def __call__(self, tfm):
        tfm.dfs['SEAWATER'][self.time_name] = pd.to_datetime(tfm.dfs['SEAWATER'][self.time_name], 
                                                             format='%Y/%m/%d %H:%M:%S', errors='coerce')
tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB(),
    ParseTimeCB(),
    EncodeTimeCB()
    ])

df_test = tfm()['SEAWATER'] 
df_test.sample(5)
Warning: 3054 missing time value(s) in SEAWATER
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
58363 58363 141.046667 37.423333 1514185620 T-0-2 103 3 2 NaN NaN
30242 30242 141.033611 37.415833 1684140000 T-2 31 3 2 NaN 0.0016
65319 65319 141.050761 37.424686 1718609100 T-A2 1 3 2 NaN NaN
37274 37274 141.034444 37.431111 1433141640 T-1 31 3 2 NaN 0.063
44857 44857 141.034444 37.431111 1664611800 T-1 33 3 2 NaN NaN
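
The warning above reports time values that could not be parsed. One possible cause, not verified here, is timestamps without a seconds field, such as the `2016/5/23 06:05` entry visible in an earlier sample. The sketch below shows one way to parse both layouts and then encode the result as seconds since the Unix epoch. The function names are hypothetical, and the epoch conversion is an assumption about what "seconds since epoch" means in `EncodeTimeCB`.

```python
import pandas as pd

def parse_mixed_time(s: pd.Series) -> pd.Series:
    "Try the format with seconds first, then fall back to the format without."
    with_seconds = pd.to_datetime(s, format='%Y/%m/%d %H:%M:%S', errors='coerce')
    without_seconds = pd.to_datetime(s, format='%Y/%m/%d %H:%M', errors='coerce')
    return with_seconds.fillna(without_seconds)

def to_epoch_seconds(parsed: pd.Series) -> pd.Series:
    "Whole seconds elapsed since 1970-01-01 00:00:00 (timestamps taken as naive)."
    return (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

raw = pd.Series(['2014/3/18 10:16:00', '2016/5/23 06:05', 'not a date'])
parsed = parse_mixed_time(raw)

# Unparseable entries stay NaT; drop them before encoding.
epoch = to_epoch_seconds(parsed.dropna())
print(parsed)
print(epoch)
```

Whether the remaining unparsed rows should be dropped or repaired upstream is a data-quality decision that this sketch does not settle.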

Sanitize coordinates

tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

df_test = tfm()['SEAWATER']
df_test.sample(5)
Warning: 3054 missing time value(s) in SEAWATER
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
70811 70811 141.072222 37.416667 1499763240 T-D5 33 3 1 NaN 0.0054
74769 74769 141.078889 37.458333 1492579380 T-S3 31 3 2 NaN 0.0014
10660 10660 140.922222 36.905556 1313992500 T-18 31 3 2 NaN NaN
79869 79869 141.200000 37.416667 1494315840 T-5 31 3 2 NaN NaN
52102 52102 141.040278 37.430556 1687159980 T-0-1 103 3 2 NaN NaN
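
`SanitizeLonLatCB` is imported from the library and its source is not shown on this page. According to its log message it converts `,` decimal separators to `.` and drops rows with invalid coordinates. A rough equivalent, written as a sketch under that description and not as the library's implementation, could look like this (`sanitize_lon_lat` is a hypothetical name):

```python
import pandas as pd

def sanitize_lon_lat(df: pd.DataFrame, lon: str = 'LON', lat: str = 'LAT') -> pd.DataFrame:
    "Normalize decimal separators, then keep rows with in-range coordinates."
    out = df.copy()
    for col in (lon, lat):
        # '141,03' -> '141.03'; anything non-numeric becomes NaN.
        as_text = out[col].astype(str).str.replace(',', '.', regex=False)
        out[col] = pd.to_numeric(as_text, errors='coerce')
    # NaN fails both range checks, so unparseable rows are dropped as well.
    valid = out[lon].between(-180, 180) & out[lat].between(-90, 90)
    return out[valid].reset_index(drop=True)

df = pd.DataFrame({
    'LON': ['141,03', 141.2, 37.41],
    'LAT': [37.42, '37.23', 141.03],   # last row: latitude out of range
})
clean = sanitize_lon_lat(df)
print(clean)
```

In this sketch, a row whose longitude and latitude have been swapped fails the latitude range check and is removed rather than corrected.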

Encode to NetCDF

tfm = Transformer(dfs, cbs=[
    FixMissingValuesCB(),
    RemoveJapanaseCharCB(),
    FixRangeValueStringCB(),
    SelectColsOfInterestCB(common_coi, nuclides_pattern),
    WideToLongCB(),
    RemapUnitNameCB(unit_mapping),
    RemapNuclideNameCB(nuclide_mapping),
    RemapDLCB(),
    ParseTimeCB(),
    EncodeTimeCB(),
    SanitizeLonLatCB()
    ])

dfs_tfm = tfm()
tfm.logs
Warning: 3054 missing time value(s) in SEAWATER
['Assign `NaN` to values equal to `ND` (not detected) - to be confirmed ',
 'Remove 約 (about) char',
 "Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean",
 'Select columns of interest.',
 '\n    Parse TEPCO measurement columns to extract nuclide name, measurement value, \n    detection limit and uncertainty\n    ',
 '\n    Remap `UNIT` name to MARIS id.\n    ',
 '\n    Remap `NUCLIDE` name to MARIS id.\n    ',
 '\n    Remap `DL` name to MARIS id.\n    ',
 'Parse time column from TEPCO.',
 'Encode time as seconds since epoch.',
 'Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator.']
dfs_tfm['SEAWATER'].sample(10)
type ID LON LAT TIME station NUCLIDE UNIT DL UNC VALUE
22476 22476 141.033611 37.415833 1495781700 T-2 33 3 2 NaN NaN
53348 53348 141.040556 37.478889 1527587400 T-6 31 3 1 NaN 0.0016
78326 78326 141.200000 37.233333 1562136360 T-7 33 3 1 NaN 0.0023
75414 75414 141.083333 37.000000 1326265200 T-M10 29 3 2 NaN NaN
83669 83669 141.283333 38.333333 1529401560 T-MG1 31 3 2 NaN NaN
8348 8348 140.763889 36.713889 1307695680 T-A 31 3 2 NaN NaN
46219 46219 141.034444 37.431111 1710835200 T-1 1 3 2 NaN NaN
22785 22785 141.033611 37.415833 1502091300 T-2 12 3 1 NaN 0.0022
69223 69223 141.072167 37.333333 1614590460 T-D9 33 3 1 NaN 0.004
82314 82314 141.250000 38.166667 1449046800 T-MG5 33 3 1 NaN 0.0042

source

get_attrs

 get_attrs (tfm, zotero_key, kw=['oceanography', 'Earth Science > Oceans >
            Ocean Chemistry> Radionuclides', 'Earth Science > Human
            Dimensions > Environmental Impacts > Nuclear Radiation
            Exposure', 'Earth Science > Oceans > Ocean Chemistry > Ocean
            Tracers, Earth Science > Oceans > Marine Sediments', 'Earth
            Science > Oceans > Ocean Chemistry, Earth Science > Oceans >
            Sea Ice > Isotopes', 'Earth Science > Oceans > Water Quality >
            Ocean Contaminants', 'Earth Science > Biological
            Classification > Animals/Vertebrates > Fish', 'Earth Science >
            Biosphere > Ecosystems > Marine Ecosystems', 'Earth Science >
            Biological Classification > Animals/Invertebrates > Mollusks',
            'Earth Science > Biological Classification >
            Animals/Invertebrates > Arthropods > Crustaceans', 'Earth
            Science > Biological Classification > Plants > Macroalgae
            (Seaweeds)'])

Retrieve global attributes from MARIS dump.

Exported source
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']
Exported source
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve global attributes from MARIS dump."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw)
{'geospatial_lat_min': '141.66666667',
 'geospatial_lat_max': '38.63333333',
 'geospatial_lon_min': '140.60388889',
 'geospatial_lon_max': '35.79611111',
 'geospatial_bounds': 'POLYGON ((140.60388889 35.79611111, 141.66666667 35.79611111, 141.66666667 38.63333333, 140.60388889 38.63333333, 140.60388889 35.79611111))',
 'time_coverage_start': '2011-03-21T14:30:00',
 'time_coverage_end': '2024-10-26T07:32:00',
 'title': "Readings of Sea Area Monitoring - Monitoring of sea water - Sea area close to TEPCO's Fukushima Daiichi NPS / Coastal area - Readings of Sea Area Monitoring [TEPCO]",
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "TEPCO - Tokyo Electric Power Company"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)',
 'publisher_postprocess_logs': "Assign `NaN` to values equal to `ND` (not detected) - to be confirmed , Remove 約 (about) char, Replace range values (e.g '4.0E+00<&<8.0E+00' or '1.0~2.7') by their mean, Select columns of interest., \n    Parse TEPCO measurement columns to extract nuclide name, measurement value, \n    detection limit and uncertainty\n    , \n    Remap `UNIT` name to MARIS id.\n    , \n    Remap `NUCLIDE` name to MARIS id.\n    , \n    Remap `DL` name to MARIS id.\n    , Parse time column from TEPCO., Encode time as seconds since epoch., Drop rows with invalid longitude & latitude values. Convert `,` separator to `.` separator."}
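
The `geospatial_bounds` attribute is a WKT polygon describing the bounding box of the samples. `BboxCB` is not shown on this page, so the following is only a sketch of how such a polygon can be derived from the `LON` and `LAT` columns. The function name `bbox_wkt` and its `ndigits` parameter are hypothetical.

```python
import pandas as pd

def bbox_wkt(df: pd.DataFrame, lon: str = 'LON', lat: str = 'LAT', ndigits: int = 8) -> str:
    "Build a closed WKT polygon (lon lat order) around the data extent."
    lon_min, lon_max = df[lon].min(), df[lon].max()
    lat_min, lat_max = df[lat].min(), df[lat].max()
    # Counter-clockwise ring; the first corner is repeated to close it.
    corners = [
        (lon_min, lat_min), (lon_max, lat_min),
        (lon_max, lat_max), (lon_min, lat_max),
        (lon_min, lat_min),
    ]
    ring = ', '.join(f'{x:.{ndigits}f} {y:.{ndigits}f}' for x, y in corners)
    return f'POLYGON (({ring}))'

df = pd.DataFrame({'LON': [140.5, 141.25], 'LAT': [36.0, 38.5]})
wkt = bbox_wkt(df, ndigits=2)
print(wkt)
```

In the output above, the polygon corners imply longitudes between 140.60 and 141.67 and latitudes between 35.80 and 38.63. The scalar `geospatial_lat_*` and `geospatial_lon_*` values may be worth checking against those ranges.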

source

encode

 encode (fname_out:str, **kwargs)

Encode TEPCO data to NetCDF.

Type Details
fname_out str Path to the NetCDF output file
kwargs
Exported source
def encode(
    fname_out: str, # Path to the NetCDF output file
    **kwargs # Additional keyword arguments
    ):
    "Encode TEPCO data to NetCDF."
    dfs = load_data(fname_coastal_water, fname_clos1F, fname_iaea_orbs)
    
    tfm = Transformer(dfs, cbs=[
        FixMissingValuesCB(),
        RemoveJapanaseCharCB(),
        FixRangeValueStringCB(),
        SelectColsOfInterestCB(common_coi, nuclides_pattern),
        WideToLongCB(),
        RemapUnitNameCB(unit_mapping),
        RemapNuclideNameCB(nuclide_mapping),
        RemapDLCB(),
        ParseTimeCB(),
        EncodeTimeCB(),
        SanitizeLonLatCB()
    ])        
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out, 
                            global_attrs=get_attrs(tfm, zotero_key='JEV6HP5A', kw=kw),
                            verbose=kwargs.get('verbose', False)
                            )
    encoder.encode()
encode(fname_out, verbose=False)
100%|██████████| 11/11 [00:06<00:00,  1.82it/s]
100%|██████████| 11/11 [00:05<00:00,  1.86it/s]
Warning: 3054 missing time value(s) in SEAWATER