import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import numpy as np
import pandas as pd
import rasterio
import geopandas as gpd
from trufl.utils import gridder
from trufl.sampler import Sampler, rank_to_sample
from trufl.collector import DataCollector
from trufl.callbacks import (State, MaxCB, MinCB, StdCB, CountCB, MoranICB, PriorCB)
from trufl.optimizer import Optimizer
red, black = '#BF360C', '#263238'
Trufl
Trufl was initiated in the context of the IAEA (International Atomic Energy Agency) Coordinated Research Project (CRP) titled “Monitoring and Predicting Radionuclide Uptake and Dynamics for Optimizing Remediation of Radioactive Contamination in Agriculture”.
While Trufl was originally developed to address the remediation of farmland affected by nuclear accidents, its approach and algorithms are applicable to a wide range of application domains. This includes managing legacy contaminants or monitoring phenomena that require consideration of multiple decision criteria over time, taking into account a wide range of factors and contexts.
This package leverages the work done by Floris Abrams in the context of his PhD, a collaboration between SCK CEN and KU Leuven, and by Franck Albinet, International Consultant in Geospatial Data Science and currently PhD researcher in AI applied to nuclear remediation at KU Leuven.
Install
pip install trufl
Getting started
In highly sensitive and high-stakes situations, it is essential that decision making is informed, transparent, and accountable, with decisions being based on a thorough and objective analysis of the available data and the needs and concerns of affected communities being taken into account.
Given the time constraints and limited budgets that are often associated with data surveys (in particular ones intended to inform highly sensitive situations), it is crucial to make informed decisions about how to allocate resources. This is even more important when considering the many variables that can be taken into account, such as prior knowledge of the area, health and economic impacts, land use, whether remediation has already taken place, population density, and more. Our approach leverages multiple-criteria decision-making (MCDM) methods to optimize the data survey workflow.
In this demo, we will walk you through a typical workflow using the Trufl package. To help illustrate the process, we will use a “toy” dataset that represents a typical spatial pattern of soil contaminants.
- We assume that we have access to the ground truth, which is a raster file that shows the spatial distribution of a soil contaminant;
- We will make decisions about how to optimally sample the administrative units (polygons), which in this case are simulated as a grid (using the gridder utilities function);
- Based on prior knowledge, such as prior airborne surveys or other data, an Optimizer will rank each administrative unit (grid cell) according to its priority for sampling;
- We will then perform random sampling on the designated units (grid cells) (using a Sampler). To simulate the measurement process, we will use the ground truth to emulate measurements at each location (using a DataCollector);
- We will evaluate the new state of each unit based on the measurements and pass it to a new round of optimization. This process will be repeated iteratively to refine the sampling strategy.
Imports
Our simulated ground truth
The assumed ground truth reveals a typical spatial pattern of a contaminant such as Cs137 after a nuclear accident, for instance:
fname_raster = './files/ground-truth-01-4326-simulated.tif'
with rasterio.open(fname_raster) as src:
    plt.axis('off')
    plt.imshow(src.read(1))
    plt.title('Simulated Ground Truth')
Simulate administrative units
The sampling strategy will be determined on a per-grid-cell basis within the administrative unit. We define below a 10 x 10 grid over the area of interest:
gdf_grid = gridder(fname_raster, nrows=10, ncols=10)
gdf_grid.head()
loc_id | geometry |
---|---|
0 | POLYGON ((-1.20830 43.26950, -1.20830 43.26042... |
1 | POLYGON ((-1.20830 43.27858, -1.20830 43.26950... |
2 | POLYGON ((-1.20830 43.28766, -1.20830 43.27858... |
3 | POLYGON ((-1.20830 43.29673, -1.20830 43.28766... |
4 | POLYGON ((-1.20830 43.30581, -1.20830 43.29673... |
Note how each administrative unit is uniquely identified by its loc_id.
gdf_grid.boundary.plot(color=black, lw=0.5)
plt.axis('off')
plt.title('Simulated Administrative Units');
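To make the idea of simulated administrative units concrete, here is a minimal sketch of how a bounding box can be split into a regular grid of rectangular cells, each with its own identifier. The function `make_grid` and the id ordering (column by column, south to north) are illustrative assumptions and are not taken from the gridder implementation.

```python
def make_grid(bounds, nrows, ncols):
    """Split a bounding box into nrows x ncols rectangular cells.

    bounds is (xmin, ymin, xmax, ymax). Returns {loc_id: (x0, y0, x1, y1)}.
    The id ordering is an illustrative choice, not necessarily the one
    used by trufl's gridder.
    """
    xmin, ymin, xmax, ymax = bounds
    dx = (xmax - xmin) / ncols
    dy = (ymax - ymin) / nrows
    cells = {}
    loc_id = 0
    for col in range(ncols):
        for row in range(nrows):
            x0 = xmin + col * dx
            y0 = ymin + row * dy
            cells[loc_id] = (x0, y0, x0 + dx, y0 + dy)
            loc_id += 1
    return cells


# Toy extent: 10 units wide, 5 units high, split into 5 rows and 10 columns.
cells = make_grid((0.0, 0.0, 10.0, 5.0), nrows=5, ncols=10)
print(len(cells))  # number of cells
print(cells[0])    # bounds of the first cell
```

Real administrative units are usually irregular polygons; a regular grid is only a convenient stand-in for the demo.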
Round I: Optimize sampling based on prior at \(t_0\)
What prior knowledge do we have?
At the initial time \(t_0\), data sampling has not yet begun, but we can often leverage existing prior knowledge of our phenomenon of interest to inform our sampling strategy/policy. In the context of nuclear remediation, this prior knowledge can often be obtained through mobile surveys, such as airborne or carborne surveys, which can provide a coarse estimation of soil contamination levels.
In the example below, we simulate prior information about the soil property of interest by calculating the average value of the property over each grid cell.
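As a rough illustration of what "averaging the property over each grid cell" means, the sketch below computes block means over a small 2D array. The helper `cell_means` is hypothetical and assumes the raster dimensions are exact multiples of the grid size; it is not how PriorCB is implemented.

```python
def cell_means(raster, nrows, ncols):
    """Mean of a 2D list of numbers over an nrows x ncols grid of blocks.

    Assumes len(raster) is a multiple of nrows and len(raster[0]) is a
    multiple of ncols (a simplification made for this sketch).
    """
    height, width = len(raster), len(raster[0])
    block_h, block_w = height // nrows, width // ncols
    means = []
    for r in range(nrows):
        row_means = []
        for c in range(ncols):
            block = [raster[i][j]
                     for i in range(r * block_h, (r + 1) * block_h)
                     for j in range(c * block_w, (c + 1) * block_w)]
            row_means.append(sum(block) / len(block))
        means.append(row_means)
    return means


# A 4 x 4 toy raster summarized on a 2 x 2 grid.
raster = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 8],
]
prior = cell_means(raster, nrows=2, ncols=2)
print(prior)  # [[1.0, 2.0], [3.0, 5.0]]
```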
At this stage, we have no measurements, so we simply create an empty Geopandas GeoDataFrame.
samples_t0 = gpd.GeoDataFrame(index=pd.Index([], name='loc_id'),
                              geometry=None, data={'value': None})
We need to set an index loc_id and have geometry and value columns.
Now we get/“sense” the state of our grid cells based on the simulated prior (mean over each grid cell, computed by PriorCB):
state = State(samples_t0, gdf_grid, cbs=[PriorCB(fname_raster)])

# You have to call the instance
state_t0 = state(); state_t0.head()
loc_id | Prior |
---|---|
0 | 0.102492 |
1 | 0.125727 |
2 | 0.161802 |
3 | 0.184432 |
4 | 0.201405 |
gdf_grid.join(state_t0, how='left').plot(column='Prior',
                                         cmap='viridis',
                                         legend_kwds={'label': 'Value'},
                                         legend=True)
plt.axis('off')
plt.title('Prior: Mean value at Administrative Unit level');
Sampling priority ranks
benefit_criteria = [True]
optimizer = Optimizer(state=state_t0)
df_rank = optimizer.get_rank(is_benefit_x=benefit_criteria, w_vector=[1],
                             n_method=None, c_method=None,
                             w_method=None, s_method="CP")

df_rank.head()
loc_id | rank |
---|---|
92 | 1 |
93 | 2 |
91 | 3 |
94 | 4 |
82 | 5 |
For more information on how the Optimizer operates, please see the section Delving deeper into the optimization process.
gdf_grid.join(df_rank, how='left').plot(column='rank',
                                        cmap='viridis_r',
                                        legend_kwds={'label': 'Rank'},
                                        legend=True)
plt.axis('off')
plt.title('Sampling Priority Rank');
Informed random sampling
It’s worth noting that in the absence of any prior knowledge, a uniform sampling strategy over the area of interest may be used. However, this approach may not be the most efficient use of the available data collection and analysis budget.
Based on the ranks (sampling priority) calculated by the Optimizer and a given sampling budget, let’s calculate the number of samples to be collected for each administrative unit (loc_id). Different sampling policies can be used (Weighted, Quantiles, …):
budget_t0 = 600
n = rank_to_sample(df_rank['rank'].sort_index().values,
                   budget=budget_t0, min=1, policy="quantiles"); n
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 4, 4,
4, 4, 4, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 1, 1, 1, 4,
4, 7, 7, 12, 7, 7, 1, 1, 4, 4, 7, 12, 12, 12, 12, 7, 1,
4, 4, 7, 7, 12, 12, 12, 7, 4, 4, 4, 7, 7, 7, 7, 7, 7,
4, 4, 4, 7, 12, 7, 12, 12, 7, 7, 4, 4, 7, 12, 12, 12, 12,
12, 12, 12, 7, 7, 12, 12, 12, 12, 12, 12, 12, 7, 7, 7])
We can now decide where to sample based on this sampling schema:
sampler = Sampler(gdf_grid)
sample_locs_t0 = sampler.sample(n, method='uniform')

print(sample_locs_t0.head())
ax = sample_locs_t0.plot(markersize=2, color=red)

gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax)
plt.axis('off')
plt.title('Ranked Random Samples Location');
geometry
loc_id
0 POINT (-1.21727 43.26778)
1 POINT (-1.22102 43.27806)
2 POINT (-1.21712 43.27979)
3 POINT (-1.22145 43.29287)
4 POINT (-1.21036 43.30109)
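The idea behind uniform random sampling within each unit can be sketched as follows: for every cell, draw the requested number of points uniformly inside its bounds. The function `sample_in_cells` is a hypothetical helper that assumes rectangular cells; it does not reproduce the Sampler class.

```python
import random


def sample_in_cells(cells, n_per_cell, seed=0):
    """Draw n_per_cell[loc_id] uniform random points inside each rectangle.

    cells maps loc_id -> (x0, y0, x1, y1). Returns a list of
    (loc_id, x, y) tuples. Rectangular cells are an assumption of this
    sketch; irregular polygons would need e.g. rejection sampling.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    points = []
    for loc_id, (x0, y0, x1, y1) in cells.items():
        for _ in range(n_per_cell[loc_id]):
            points.append((loc_id, rng.uniform(x0, x1), rng.uniform(y0, y1)))
    return points


# Two toy cells: the second has a higher sampling priority.
cells = {0: (0.0, 0.0, 1.0, 1.0), 1: (1.0, 0.0, 2.0, 1.0)}
points = sample_in_cells(cells, n_per_cell={0: 1, 1: 4})
print(len(points))  # 5
```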
Emulating measurement campaign
The data collector collects measurements at the random sampling locations in the field. In our case, we emulate this process by extracting measurements from the provided raster file.
“Measuring” variable of interest from a given raster:
dc_emulator = DataCollector(fname_raster)
measurements_t0 = dc_emulator.collect(sample_locs_t0)

print(measurements_t0.head())
ax = measurements_t0.plot(column='value', s=2, legend=True)
gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax);
plt.axis('off')
plt.title('Measurements at Random Sampling Points');
geometry value
loc_id
0 POINT (-1.21727 43.26778) 0.137188
1 POINT (-1.22102 43.27806) 0.151005
2 POINT (-1.21712 43.27979) 0.164272
3 POINT (-1.22145 43.29287) 0.181001
4 POINT (-1.21036 43.30109) 0.168969
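Emulating a measurement amounts to looking up the raster pixel that contains each sampling point. The sketch below shows that lookup for a north-up raster stored as a nested list; `read_value`, `origin` and `pixel_size` are names invented for this illustration and are not part of DataCollector.

```python
import math


def read_value(raster, origin, pixel_size, x, y):
    """Return the raster value at map coordinates (x, y).

    raster is a 2D list whose first row is the northernmost one.
    origin is (x_left, y_top), the map coordinates of the top-left corner.
    Square pixels and a north-up grid are assumptions of this sketch.
    """
    x_left, y_top = origin
    col = math.floor((x - x_left) / pixel_size)
    row = math.floor((y_top - y) / pixel_size)
    if not (0 <= row < len(raster) and 0 <= col < len(raster[0])):
        raise ValueError("point falls outside the raster")
    return raster[row][col]


# 2 x 2 toy raster covering x in [0, 2] and y in [0, 2].
raster = [[1, 2],
          [3, 4]]
top_left = read_value(raster, origin=(0.0, 2.0), pixel_size=1.0, x=0.5, y=1.5)
bottom_right = read_value(raster, origin=(0.0, 2.0), pixel_size=1.0, x=1.5, y=0.5)
print(top_left, bottom_right)  # 1 4
```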
This marks the end of our initial measurement efforts, based on our prior knowledge of the phenomenon. Going forward, we can use the additional insights gained during this phase to enhance our future measurements.
Round II: Optimize sampling with additional insights at \(t_1\)
For each administrative unit, we now have additional knowledge acquired during the previous campaign, in addition to our prior knowledge. In the current round, the optimization of the sampling will be carried out based on the maximum, minimum, standard deviation, number of measurements already conducted, our prior knowledge, and an estimate of the presence of spatial trends or spatial correlations (Moran’s I).
It’s worth noting that you can use any quantitative or qualitative secondary geographical information as a variable in the state, such as population, whether any previous remediation actions have taken place, the economic impact of the contamination, and so on.
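To illustrate how per-unit summary statistics can be derived from point measurements, the sketch below groups values by loc_id and computes the maximum, minimum, standard deviation and count. The function `summarize` is hypothetical, and the use of the population standard deviation (so that a unit with a single measurement gets 0.0) is an assumption of this sketch rather than a statement about the callbacks.

```python
from collections import defaultdict
from statistics import pstdev


def summarize(measurements):
    """Aggregate (loc_id, value) pairs into per-unit summary statistics."""
    by_unit = defaultdict(list)
    for loc_id, value in measurements:
        by_unit[loc_id].append(value)
    return {
        loc_id: {
            "Max": max(values),
            "Min": min(values),
            # Population standard deviation: 0.0 for a single measurement.
            "Standard Deviation": pstdev(values),
            "Count": len(values),
        }
        for loc_id, values in sorted(by_unit.items())
    }


measurements = [(0, 0.10), (1, 0.20), (1, 0.40), (1, 0.30)]
unit_state = summarize(measurements)
print(unit_state[0])
print(unit_state[1])
```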
Getting administrative units new state
state = State(measurements_t0, gdf_grid, cbs=[
    MaxCB(), MinCB(), StdCB(), CountCB(), MoranICB(k=5), PriorCB(fname_raster)])
state().head()
loc_id | Max | Min | Standard Deviation | Count | Moran.I | Prior |
---|---|---|---|---|---|---|
0 | 0.137188 | 0.137188 | 0.0 | 1 | NaN | 0.102492 |
1 | 0.151005 | 0.151005 | 0.0 | 1 | NaN | 0.125727 |
2 | 0.164272 | 0.164272 | 0.0 | 1 | NaN | 0.161802 |
3 | 0.181001 | 0.181001 | 0.0 | 1 | NaN | 0.184432 |
4 | 0.168969 | 0.168969 | 0.0 | 1 | NaN | 0.201405 |
The Moran’s I index is a statistical method used to determine if there is a spatial correlation/trend within each area of interest. For example, a random field would have a Moran’s I index close to 0, while a clear gradient of low to high values, such as from south to north, would be characterized by a Moran’s I index close to 1.
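The behaviour described above can be checked with a small, self-contained sketch of the global Moran's I statistic using binary k-nearest-neighbour weights. This is a textbook formulation written for illustration: `morans_i` is a hypothetical function, and both the neighbour definition and the choice to return NaN when there are too few points are assumptions, not a description of MoranICB.

```python
import math


def morans_i(points, values, k=5):
    """Global Moran's I with binary k-nearest-neighbour weights.

    I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where w_ij = 1 if j is one of the k nearest neighbours of i, and W is
    the sum of all weights (N * k here).
    """
    n = len(points)
    if n <= k:
        return float("nan")  # not enough points to find k neighbours
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return float("nan")  # constant field: the index is undefined
    num = 0.0
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            num += dev[i] * dev[j]
    return (n / (n * k)) * num / denom


# Points on a 6 x 6 lattice.
points = [(x, y) for x in range(6) for y in range(6)]
gradient = [y for _, y in points]            # smooth south-to-north trend
checker = [(x + y) % 2 for x, y in points]   # alternating pattern

i_gradient = morans_i(points, gradient, k=4)
i_checker = morans_i(points, checker, k=4)
print(i_gradient)  # clearly positive: neighbours have similar values
print(i_checker)   # negative: neighbours have opposite values
```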
Finding optimal number of samples to be collected
- We first decide whether each variable of the State is to be maximized (benefit) or minimized (cost):
benefit_criteria = [True, True, True, False, False, True]
- Then we assign an importance weight to each of the variables of the State (Min, Max, …):
optimizer = Optimizer(state=state())
df_rank = optimizer.get_rank(is_benefit_x=benefit_criteria,
                             w_vector=[0.2, 0.1, 0.1, 0.2, 0.2, 0.2],
                             n_method="LINEAR1", c_method=None, w_method=None, s_method="CP")
gdf_grid.join(df_rank, how='left').plot(column='rank',
                                        cmap='viridis_r',
                                        legend_kwds={'label': 'Rank'},
                                        legend=True)
plt.axis('off')
plt.title('Sampling Priority Rank');
df_rank.head()
loc_id | rank |
---|---|
26 | 1 |
73 | 2 |
27 | 3 |
78 | 4 |
24 | 5 |
Based on this rank we can again: 1. calculate the number of samples to be collected for each administrative unit, given the ranks (sampling priority) and the sampling budget; 2. perform the random sampling; 3. carry out the measurements.
Informed random sampling
budget_t1 = 400
n = rank_to_sample(df_rank['rank'].sort_index().values,
                   budget=budget_t1, min=1, policy="quantiles"); n
array([1, 1, 1, 1, 1, 1, 3, 4, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 8, 1, 1,
1, 1, 8, 8, 8, 8, 4, 8, 1, 1, 1, 3, 8, 4, 4, 4, 8, 3, 1, 1, 3, 4,
4, 4, 4, 4, 4, 3, 1, 3, 4, 4, 8, 3, 3, 4, 8, 8, 1, 8, 3, 3, 4, 4,
4, 3, 4, 8, 8, 8, 8, 8, 4, 3, 4, 8, 8, 8, 8, 4, 8, 8, 4, 3, 3, 3,
3, 3, 8, 4, 3, 3, 4, 4, 3, 8, 3, 3])
sampler = Sampler(gdf_grid)
sample_locs_t1 = sampler.sample(n, method='uniform')

ax = sample_locs_t1.plot(markersize=2, color=red)
gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax)
plt.axis('off')
plt.title('Ranked Random Samples Location');
Second measurement campaign
dc_emulator = DataCollector(fname_raster)
measurements_t1 = dc_emulator.collect(sample_locs_t1)

ax = measurements_t1.plot(column='value', s=2, legend=True)
gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax);
plt.axis('off')
plt.title('Measurements at Random Sampling Points');
measurements_sofar = pd.concat([measurements_t0, measurements_t1])

ax = measurements_sofar.plot(column='value', s=2, legend=True)
gdf_grid.boundary.plot(color=black, lw=0.5, ax=ax);
plt.axis('off')
plt.title('Measurements after \n 2 informed measurement campaigns');
Delving deeper into the optimization process
Determine the ranking of the administrative polygons
The ranking is based on the importance of increasing sampling in each polygon. A multi-criteria decision-making methodology is used to rank the polygons from most important to least important, with lower ranks indicating a higher priority for sampling.
Criteria
The state of the polygons will be used as criteria to determine the rank:
Criteria | State variable | Criteria Type |
---|---|---|
Estimated value | PriorCB() | Benefit |
Maximum sample value | MaxCB() | Benefit |
Minimal sample value | MinCB() | Benefit |
Sample count | CountCB() | Cost |
Standard deviation | StdCB() | Benefit |
Moran I index | MoranICB(k=5) | Cost |
Criteria type
Criteria can be of the type benefit or cost:
- Benefit: high values mean a high importance to sample more;
- Cost: low values mean a high importance to sample more.
Weights
A weight vector is used to determine the importance of criteria in comparison with each other.
MCDM techniques
- CP (Compromise programming):
  - Distance-based measure, where the distance to the optimal point is used (with lower values indicating better alternatives).
- TOPSIS (Technique for Order Preference by Similarity to Ideal Solution):
  - Distance-based measure, where the closeness to the optimal and anti-optimal points is assessed (with higher values indicating better alternatives).
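The two distance-based ideas listed above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the Optimizer's implementation: criteria are min-max scaled so that 1 is always the best value (which may differ from what n_method="LINEAR1" does), the compromise-programming score is a weighted Euclidean distance to the ideal point, and the TOPSIS-style score is the relative closeness to the ideal versus the anti-ideal point. All function names are hypothetical.

```python
import math


def normalize(matrix, is_benefit):
    """Min-max scale each criterion column to [0, 1] so that 1 is best."""
    scaled_cols = []
    for col, benefit in zip(zip(*matrix), is_benefit):
        lo, hi = min(col), max(col)
        span = hi - lo
        if span == 0:
            scaled = [1.0] * len(col)  # criterion does not discriminate
        elif benefit:
            scaled = [(v - lo) / span for v in col]
        else:
            scaled = [(hi - v) / span for v in col]
        scaled_cols.append(scaled)
    return [list(row) for row in zip(*scaled_cols)]


def cp_scores(matrix, is_benefit, weights, p=2):
    """Weighted L_p distance to the ideal point (lower is better)."""
    z = normalize(matrix, is_benefit)
    return [sum((w * (1.0 - v)) ** p for w, v in zip(weights, row)) ** (1.0 / p)
            for row in z]


def topsis_scores(matrix, is_benefit, weights):
    """Relative closeness to the ideal point (higher is better)."""
    scores = []
    for row in normalize(matrix, is_benefit):
        d_best = math.sqrt(sum((w * (1.0 - v)) ** 2 for w, v in zip(weights, row)))
        d_worst = math.sqrt(sum((w * v) ** 2 for w, v in zip(weights, row)))
        scores.append(d_worst / (d_best + d_worst))
    return scores


def to_ranks(scores, lower_is_better):
    """Convert scores to ranks, 1 being the highest sampling priority."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=not lower_is_better)
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


# Three units, two criteria: maximum value (benefit) and sample count (cost).
matrix = [[0.9, 10], [0.8, 2], [0.1, 1]]
is_benefit = [True, False]
weights = [0.6, 0.4]

cp_rank = to_ranks(cp_scores(matrix, is_benefit, weights), lower_is_better=True)
topsis_rank = to_ranks(topsis_scores(matrix, is_benefit, weights), lower_is_better=False)
print(cp_rank)      # [2, 1, 3]
print(topsis_rank)  # [2, 1, 3]
```

In this toy case the second unit comes first with both methods: its measured maximum is almost the highest, and it has been sampled far less than the first unit.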
Rank
Based on the MCDM value, a ranking of the polygons is created.
Start by using equal weights for all the criteria; later you will explore the impact of changing the weight vector. Make sure the sum of the weight vector is 1.
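One simple way to guarantee that a weight vector sums to 1 is to rescale raw importance scores. The helper `normalize_weights` below is a hypothetical convenience function written for this illustration; it is not part of the package.

```python
def normalize_weights(raw_weights):
    """Rescale non-negative importance scores so that they sum to 1."""
    if any(w < 0 for w in raw_weights):
        raise ValueError("weights must be non-negative")
    total = sum(raw_weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in raw_weights]


equal_weights = normalize_weights([1, 1, 1])   # same importance everywhere
custom_weights = normalize_weights([3, 3, 4])  # slightly favour the third criterion
print(equal_weights)
print(custom_weights)
```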
Ranking of administrative units based on three criteria:
benefit_criteria = [True, True, True]
state = State(measurements_sofar, gdf_grid, cbs=[MaxCB(), MinCB(), StdCB()])
weight_vector = [0.3, 0.3, 0.4]

optimizer = Optimizer(state=state())
df = optimizer.get_rank(is_benefit_x=benefit_criteria, w_vector=weight_vector,
                        n_method="LINEAR1", c_method=None, w_method=None, s_method="CP")

df.head()
loc_id | rank |
---|---|
71 | 1 |
72 | 2 |
69 | 3 |
59 | 4 |
38 | 5 |
Based on the ranking of the administrative units, an optimized sampling strategy for \(t_1\) can be determined.
combined_df = pd.merge(df, gdf_grid[['geometry']], left_index=True, right_index=True, how='inner')
combined_gdf = gpd.GeoDataFrame(combined_df)

fig, ax = plt.subplots(1, 1, figsize=(10, 8))
cax = combined_gdf.plot(column='rank', cmap='Reds_r', legend=True, ax=ax)
measurements_sofar.plot(column='value', ax=ax, cmap='viridis', s=1.5, legend=True)

cbar = cax.get_figure().get_axes()[1]
cbar.invert_yaxis()

rank_legend = mlines.Line2D([], [], color='Red', marker='o', linestyle='None',
                            markersize=10, label='High Rank')
value_legend = mlines.Line2D([], [], color='Yellow', marker='o', linestyle='None',
                             markersize=10, label='High prior value')

ax.legend(handles=[rank_legend, value_legend], loc='upper left', bbox_to_anchor=(1.5, 1.25))
plt.show()
Multi-year Adaptive sampling approach
- Sampling in year 0 will be done based on the prior;
- Sampling in year t will be done based on 6 state variables:
  - [Max value, Min value, Standard deviation, sample count, Moran I, Prior value]
  - with weights [0.2, 0.1, 0.1, 0.2, 0.2, 0.2]
- Sampling policy will be based on the point budget and the quantile in which the unit ranks:
  - 1st: 50% of point budget
  - 2nd: 30% of point budget
  - 3rd: 20% of point budget
  - 4th: no extra sample points
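The quantile policy described above can be sketched as follows: units are split into four equally sized groups by rank, each group receives its share of the budget, and the share is divided evenly (rounded down) among the units of the group, with at least a minimum number of samples per unit. The function `quantile_allocation` is hypothetical and the rounding rule is an assumption. Under these assumptions, a budget of 600 over 100 units yields 12, 7, 4 and 1 samples per unit, which are the values that appear in the Round I allocation; the actual logic of rank_to_sample may still differ.

```python
def quantile_allocation(ranks, budget, shares=(0.5, 0.3, 0.2, 0.0), minimum=1):
    """Number of samples per unit from its rank (1 = highest priority).

    ranks is given in loc_id order. Each quantile group gets `share` of the
    budget, split evenly and rounded down, with at least `minimum` per unit.
    """
    n_units = len(ranks)
    n_groups = len(shares)
    # Group index of each unit: 0 for the top-ranked quantile, and so on.
    groups = [(rank - 1) * n_groups // n_units for rank in ranks]
    sizes = [groups.count(g) for g in range(n_groups)]
    per_unit = [
        max(minimum, int(budget * share / size)) if size else minimum
        for share, size in zip(shares, sizes)
    ]
    return [per_unit[g] for g in groups]


# 100 units whose rank happens to equal loc_id + 1.
ranks = list(range(1, 101))
allocation = quantile_allocation(ranks, budget=600)
print(sorted(set(allocation)))  # [1, 4, 7, 12]
print(sum(allocation))          # 600
```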
number_of_years = 4
yearly_sample_budget = 150

fig, axs = plt.subplots(1, number_of_years, figsize=(12, 8))  # Adjust figsize as needed
axs = axs.flatten()

sampler = Sampler(gdf_grid)
dc_emulator = DataCollector(fname_raster)

# Samples
samples_t_0 = gpd.GeoDataFrame(index=pd.Index([], name='loc_id'), geometry=None, data={'value': None})
samples_t = []

state = State(samples_t_0, gdf_grid, cbs=[PriorCB(fname_raster)])

# You have to call the instance
state_t0 = state()

benefit_criteria = [True]
optimizer = Optimizer(state=state_t0)
df_rank = optimizer.get_rank(is_benefit_x=benefit_criteria, w_vector=[1],
                             n_method=None, c_method=None,
                             w_method=None, s_method="CP")

# Year 0: map the ranking derived from the prior only
combined_df = pd.merge(df_rank, gdf_grid[['geometry']], left_index=True, right_index=True, how='inner')
combined_gdf = gpd.GeoDataFrame(combined_df)
combined_gdf.plot(column='rank', cmap='Reds_r', legend_kwds={'label': 'Rank'}, ax=axs[0])
for fig_n, ax in zip(range(1, number_of_years+1), axs[1:]):
    # 1. sample and measure according to the current ranking
    n = rank_to_sample(combined_gdf['rank'].sort_index().values,
                       budget=yearly_sample_budget, min=1, policy="quantiles")
    sample_locs_t = sampler.sample(n, method='uniform')
    samples = dc_emulator.collect(sample_locs_t)
    # First year: samples_t is still empty, so there is nothing to concatenate yet
    samples_t = pd.concat([samples_t, samples]) if len(samples_t) else samples

    # plot points versus rank of polygon
    ax = combined_gdf.plot(column='rank', cmap='Reds_r', ax=ax)
    samples_t.plot(column='value', ax=ax, cmap='viridis', s=1)
    ax.title.set_text(f"Year {fig_n} (number of samples: {len(samples_t)})")

    # new state
    state = State(samples_t, gdf_grid, cbs=[
        MaxCB(), MinCB(), StdCB(), CountCB(), MoranICB(k=5), PriorCB(fname_raster)])

    optimizer = Optimizer(state=state())

    # 2. rank polygons
    benefit_criteria = [True, True, True, False, False, True]
    df = optimizer.get_rank(is_benefit_x=benefit_criteria, w_vector=[0.2, 0.1, 0.1, 0.2, 0.2, 0.2],
                            n_method="LINEAR1", c_method=None, w_method=None, s_method="CP")

    # 3. map ranking
    combined_df = pd.merge(df, gdf_grid[['geometry']], left_index=True, right_index=True, how='inner')
    combined_gdf = gpd.GeoDataFrame(combined_df)

plt.tight_layout()
plt.show()