get_cache_path()
Path('/Users/franckalbinet/.soilspecdata')
The official OSSL documentation provides more details on the dataset and its variables.
OSSLData (df:pandas.core.frame.DataFrame)
OSSL (Open Soil Spectral Library) data container
Name | Type | Details |
---|---|---|
df | DataFrame | dataframe containing OSSL data |
OSSLData._parse_columns ()
Parse columns into visnir, mir and properties
get_cache_path (dest_dir:str='.soilspecdata')
Get cache path for OSSL data
Name | Type | Default | Details |
---|---|---|---|
dest_dir | str | .soilspecdata | Name of the cache directory |
Returns | Path | | Path to the cache directory (~/dest_dir) |
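A helper with this behaviour can be sketched with `pathlib`. This is a minimal sketch, not the library's implementation: the name `cache_path_sketch` is hypothetical, and since the table above does not say whether the real function creates the directory, the sketch only builds the path.

```python
from pathlib import Path


def cache_path_sketch(dest_dir: str = '.soilspecdata') -> Path:
    """Hypothetical stand-in: build ~/dest_dir without touching the disk."""
    return Path.home() / dest_dir


# The result depends on the current user's home directory.
print(cache_path_sketch())
```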
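For instance, with the default argument, `get_cache_path()` returns a path such as `Path('/Users/franckalbinet/.soilspecdata')`.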
The default gzipped file is downloaded from the following URL: https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz
get_ossl (url='https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz', force_download=False)
Load OSSL data from cache or download it
Name | Type | Default | Details |
---|---|---|---|
url | str | https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz | OSSL data gzipped file URL |
force_download | bool | False | if True, force download |
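The "load from cache or download" behaviour described above can be illustrated with a small stdlib sketch. Everything here is an assumption made for illustration: the name `fetch_cached_sketch`, the injectable `download` callable (added so the logic can be exercised without network access), and the choice to name the cached file after the last URL segment. It is not the code used by `get_ossl`.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def _default_download(url: str, dest: Path) -> None:
    # Plain HTTP(S) download to a local file.
    urlretrieve(url, str(dest))


def fetch_cached_sketch(
    url: str,
    cache_dir: Path,
    force_download: bool = False,
    download: Callable[[str, Path], None] = _default_download,
) -> Path:
    """Return the local copy of `url`, downloading only when needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the cached file keeps the name of the remote file.
    dest = cache_dir / url.rsplit('/', 1)[-1]
    if force_download or not dest.exists():
        download(url, dest)
    return dest
```

With this design, a second call with the same arguments returns the cached file immediately, while `force_download=True` refreshes it.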
How to use it:
(['scan_visnir.350_ref', 'scan_visnir.352_ref'],
['scan_mir.600_abs', 'scan_mir.602_abs'],
['dataset.code_ascii_txt', 'id.layer_local_c'])
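The three lists above suggest that columns are split by name prefix. The following is a minimal sketch of such a split, assuming that every column starting with `scan_visnir.` or `scan_mir.` is spectral and that everything else counts as a property; `parse_columns_sketch` is a hypothetical name, not the library's `_parse_columns`.

```python
from typing import List, Tuple


def parse_columns_sketch(columns: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split column names into (visnir, mir, properties), keeping their order."""
    visnir = [c for c in columns if c.startswith('scan_visnir.')]
    mir = [c for c in columns if c.startswith('scan_mir.')]
    spectral = set(visnir) | set(mir)
    # Assumption: any non-spectral column is treated as a property/metadata column.
    properties = [c for c in columns if c not in spectral]
    return visnir, mir, properties


cols = ['dataset.code_ascii_txt', 'id.layer_local_c',
        'scan_visnir.350_ref', 'scan_mir.600_abs']
print(parse_columns_sketch(cols))
```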
OSSLData._get_valid_spectra_mask (spectra_cols:List[str])
Return mask for samples with all non-null values in spectra
Name | Type | Details |
---|---|---|
spectra_cols | List | Spectra column names |
Returns | ndarray | Mask |
The OSSL gzip archive is stored in wide format, with metadata, soil properties, VISNIR and MIR spectra as columns. Note that, depending on the data source/provider, not every sample has been scanned with both VisNIR and MIR instruments.
As a result, selecting a subset of columns, e.g. `ossl.mir_cols`, returns a dataframe with many missing values (`NaN`). The function above returns a mask selecting the samples whose spectra contain no null values.
scan_mir.600_abs | scan_mir.602_abs | scan_mir.604_abs | scan_mir.606_abs | scan_mir.608_abs | scan_mir.610_abs | scan_mir.612_abs | scan_mir.614_abs | scan_mir.616_abs | scan_mir.618_abs | ... | scan_mir.3982_abs | scan_mir.3984_abs | scan_mir.3986_abs | scan_mir.3988_abs | scan_mir.3990_abs | scan_mir.3992_abs | scan_mir.3994_abs | scan_mir.3996_abs | scan_mir.3998_abs | scan_mir.4000_abs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.527853 | 1.531908 | 1.532084 | 1.530892 | 1.530645 | 1.531506 | 1.531582 | 1.531413 | 1.532904 | 1.535459 | ... | 0.356776 | 0.356642 | 0.355784 | 0.354743 | 0.354104 | 0.353663 | 0.353237 | 0.352923 | 0.352548 | 0.352053 |
1 | 1.538449 | 1.543622 | 1.545751 | 1.546997 | 1.549450 | 1.553714 | 1.557981 | 1.561652 | 1.566082 | 1.571555 | ... | 0.358399 | 0.358142 | 0.357144 | 0.355980 | 0.355242 | 0.354722 | 0.354217 | 0.353825 | 0.353376 | 0.352798 |
2 | 1.619721 | 1.614226 | 1.615612 | 1.620649 | 1.626406 | 1.631747 | 1.636411 | 1.639527 | 1.642449 | 1.646890 | ... | 0.372522 | 0.372338 | 0.371425 | 0.370337 | 0.369679 | 0.369245 | 0.368808 | 0.368469 | 0.368084 | 0.367563 |
3 | 1.570129 | 1.567954 | 1.573055 | 1.580834 | 1.586880 | 1.590397 | 1.595117 | 1.600492 | 1.603847 | 1.606447 | ... | 0.357992 | 0.357734 | 0.356713 | 0.355480 | 0.354681 | 0.354137 | 0.353619 | 0.353217 | 0.352756 | 0.352158 |
4 | 1.484832 | 1.484367 | 1.484977 | 1.486258 | 1.488400 | 1.492040 | 1.495075 | 1.496595 | 1.498354 | 1.501437 | ... | 0.316249 | 0.316089 | 0.315098 | 0.313910 | 0.313210 | 0.312758 | 0.312312 | 0.311971 | 0.311568 | 0.311044 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
135646 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
135647 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
135648 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
135649 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
135650 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
135651 rows × 1701 columns
mask = ossl.df[ossl.mir_cols].notna().all(axis=1)
print(mask.sum(), 'samples with all non-null values in mir spectra out of the total', len(mask))
85684 samples with all non-null values in mir spectra out of the total 135651
OSSLData._extract_wavenumbers (cols:List[str])
Extract wavenumbers from spectral column names
Name | Type | Details |
---|---|---|
cols | List | column names |
For instance, to retrieve the wavenumbers from the MIR columns:
array([ 600, 602, 604, ..., 3996, 3998, 4000], shape=(1701,))
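Since the number is embedded in each column name (for example `scan_mir.600_abs`), the extraction can be sketched with a regular expression. This is an illustrative sketch only: `extract_numbers_sketch` and the pattern are assumptions, and the function returns a plain list rather than a NumPy array. For MIR columns the embedded number is already the wavenumber, as the output above shows.

```python
import re
from typing import List

# Assumed naming scheme: scan_<instrument>.<number>_<abs|ref>
_SPECTRAL_COL = re.compile(r'^scan_(?:visnir|mir)\.(\d+)_(?:abs|ref)$')


def extract_numbers_sketch(cols: List[str]) -> List[int]:
    """Return the integer embedded in each spectral column name."""
    numbers = []
    for col in cols:
        match = _SPECTRAL_COL.match(col)
        if match is None:
            raise ValueError(f'Not a spectral column: {col!r}')
        numbers.append(int(match.group(1)))
    return numbers


print(extract_numbers_sketch(['scan_mir.600_abs', 'scan_mir.602_abs', 'scan_mir.4000_abs']))
# [600, 602, 4000]
```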
OSSLData._extract_measurement_type (cols:List[str])
Extract measurement type from column names
Name | Type | Details |
---|---|---|
cols | List | Spectral column names |
Returns | str | abs (Absorbance) or ref (Reflectance) |
For instance, to retrieve the measurement type from the MIR or VISNIR columns:
('ref', 'abs')
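The measurement type is the suffix of the column name, so a sketch of the lookup only needs string splitting. The name `measurement_type_sketch` is hypothetical, and requiring all columns to share a single suffix is an assumption of this sketch rather than documented behaviour.

```python
from typing import List


def measurement_type_sketch(cols: List[str]) -> str:
    """Return 'abs' or 'ref' from column names such as 'scan_mir.600_abs'."""
    suffixes = {col.rsplit('_', 1)[-1] for col in cols}
    # Assumption: one instrument block uses a single measurement type.
    if len(suffixes) != 1:
        raise ValueError(f'Expected one measurement type, found: {sorted(suffixes)}')
    suffix = suffixes.pop()
    if suffix not in {'abs', 'ref'}:
        raise ValueError(f'Unknown measurement type: {suffix!r}')
    return suffix


visnir_type = measurement_type_sketch(['scan_visnir.350_ref', 'scan_visnir.352_ref'])
mir_type = measurement_type_sketch(['scan_mir.600_abs', 'scan_mir.602_abs'])
print((visnir_type, mir_type))
# ('ref', 'abs')
```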
OSSLData._filter_wavelength_range (wavenumbers:numpy.ndarray, spectra:numpy.ndarray, cols:List[str], wmin:Optional[int]=None, wmax:Optional[int]=None)
Filter spectra based on wavenumber range
Name | Type | Default | Details |
---|---|---|---|
wavenumbers | ndarray | | Wavenumbers |
spectra | ndarray | | Spectra |
cols | List | | Column names |
wmin | Optional | None | Min wavenumber |
wmax | Optional | None | Max wavenumber |
Returns | Tuple | | Filtered wavenumbers, spectra, columns |
wavenumbers, spectra, cols = ossl._filter_wavelength_range(
    wavenumbers=ossl._extract_wavenumbers(ossl.visnir_cols),
    spectra=ossl.df[ossl.visnir_cols].values,
    cols=ossl.visnir_cols,
    wmin=4000, wmax=25000
)
print(f'Original wavenumbers: {ossl._extract_wavenumbers(ossl.visnir_cols).min()} - {ossl._extract_wavenumbers(ossl.visnir_cols).max()}')
print(f'Filtered wavenumbers: {wavenumbers.min()} - {wavenumbers.max()}')
print(f'Spectra shape: {spectra.shape}')
print(f'Filtered columns. From: {cols[0]} to: {cols[-1]}')
Original wavenumbers: 4000 - 28571
Filtered wavenumbers: 4000 - 25000
Spectra shape: (135651, 1051)
Filtered columns. From: scan_visnir.400_ref to: scan_visnir.2500_ref
IMPORTANT: Note that by default, both VISNIR and MIR spectra are converted to wavenumbers.
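The numbers printed above are consistent with the standard conversion from wavelength in nm to wavenumber in cm⁻¹, namely 10⁷ / wavelength. The sketch below reproduces them under stated assumptions: VISNIR columns run from 350 to 2500 nm in 2 nm steps (inferred from the column names and the 1051 retained columns), and values are rounded to the nearest integer. The library may round or truncate differently; the function name is hypothetical.

```python
def nm_to_wavenumber(wavelength_nm: float) -> int:
    """Convert a wavelength in nm to a wavenumber in cm^-1 (rounded)."""
    return round(1e7 / wavelength_nm)


# Assumed VISNIR grid: 350 to 2500 nm, every 2 nm.
visnir_cols = [f'scan_visnir.{nm}_ref' for nm in range(350, 2501, 2)]
wavelengths = [int(c.split('.')[1].split('_')[0]) for c in visnir_cols]
wavenumbers = [nm_to_wavenumber(nm) for nm in wavelengths]

# Keep only columns whose wavenumber lies in [4000, 25000] cm^-1.
kept = [(w, c) for w, c in zip(wavenumbers, visnir_cols) if 4000 <= w <= 25000]

print(min(wavenumbers), max(wavenumbers))  # 4000 28571
print(len(kept))                           # 1051
print(kept[0][1], kept[-1][1])             # scan_visnir.400_ref scan_visnir.2500_ref
```

A 25000 cm⁻¹ upper bound therefore corresponds to a 400 nm lower bound on wavelength, which is why the filtered columns start at `scan_visnir.400_ref`.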
OSSLData.get_visnir (wmin:Optional[int]=4000, wmax:Optional[int]=25000)
Get VISNIR spectra within specified wavenumber range
Name | Type | Default | Details |
---|---|---|---|
wmin | Optional | 4000 | Min wavenumber |
wmax | Optional | 25000 | Max wavenumber |
Returns | SpectraData | | VISNIR data |
For instance, to retrieve the VISNIR spectra between 8000 and 25000 cm⁻¹, call `ossl.get_visnir(wmin=8000, wmax=25000)`.
OSSLData.get_mir (wmin:Optional[int]=600, wmax:Optional[int]=4000)
Get MIR spectra within specified wavenumber range
Name | Type | Default | Details |
---|---|---|---|
wmin | Optional | 600 | Min wavenumber |
wmax | Optional | 4000 | Max wavenumber |
Returns | SpectraData | | MIR data |
For instance, to retrieve the MIR spectra between 600 and 4000 cm⁻¹ (the default range):
mir_data = ossl.get_mir()
mir_data.spectra.shape, mir_data.wavenumbers.min(), mir_data.wavenumbers.max()
((85684, 1701), np.int64(600), np.int64(4000))
OSSLData.get_properties (properties=None, require_complete:bool=False)
Get properties data with sample IDs
Name | Type | Default | Details |
---|---|---|---|
properties | NoneType | None | Properties |
require_complete | bool | False | if True, only return samples with no null values |
Returns | DataFrame | | Selected properties data |
Get only complete MIR spectra:
Get properties needed as ML targets (must be complete):
targets = ossl.get_properties(['cec_usda.a723_cmolc.kg'], require_complete=True)
targets.shape, targets.head()
((57064, 1),
cec_usda.a723_cmolc.kg
id
S40857 6.633217
S40858 3.822628
S40859 3.427324
S40860 1.906545
S40861 13.403203)
Get optional metadata (can have `NaN` values):
metadata = ossl.get_properties(['longitude.point_wgs84_dd', 'latitude.point_wgs84_dd'], require_complete=False)
metadata.shape, metadata.head()
((135651, 2),
longitude.point_wgs84_dd latitude.point_wgs84_dd
id
icr072246 15.687492 -7.377750
icr072247 15.687492 -7.377750
icr072266 15.687817 -7.351243
icr072267 15.687817 -7.351243
icr072286 15.687965 -7.331673)
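The effect of `require_complete` can be illustrated without pandas. The sketch below uses toy records with made-up sample IDs and values, and a hypothetical function name; it only mirrors the documented rule that `require_complete=True` drops samples with any null value among the selected properties.

```python
import math
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def _is_missing(value: Optional[float]) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def select_properties_sketch(records: Dict[str, Row], properties: List[str],
                             require_complete: bool = False) -> Dict[str, Row]:
    """Keep the requested properties per sample ID; optionally drop incomplete rows."""
    selected = {}
    for sample_id, row in records.items():
        values = {p: row.get(p) for p in properties}
        if require_complete and any(_is_missing(v) for v in values.values()):
            continue
        selected[sample_id] = values
    return selected


# Toy data (not OSSL values): sample 'b' has no CEC measurement.
toy = {
    'a': {'cec': 6.6, 'lon': 15.7},
    'b': {'cec': float('nan'), 'lon': 15.7},
    'c': {'cec': 3.4, 'lon': None},
}
print(list(select_properties_sketch(toy, ['cec'], require_complete=True)))
# ['a', 'c']
```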
OSSLData.get_aligned_data (spectra_data:soilspecdata.types.SpectraData, target_cols:Union[str,List[str]])
Get aligned spectra and target data for ML, along with their sample IDs
Name | Type | Details |
---|---|---|
spectra_data | SpectraData | Spectra data |
target_cols | Union | Target columns |
Returns | Tuple | Aligned spectra, targets, sample IDs |
For instance, to retrieve the MIR spectra and the corresponding CEC values in a form suitable for a machine learning or deep learning pipeline:
X, y, ids = ossl.get_aligned_data(
spectra_data=mir_data,
target_cols='cec_usda.a723_cmolc.kg'
)
X.shape, y.shape, ids.shape
((3, 3), (3, 1), (3,))
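Conceptually, alignment means keeping only the samples that have both a spectrum and a target value, in matching order. The following sketch assumes an inner join on sample ID that preserves the order of the spectra; the function name, the plain-list inputs and the toy values are all hypothetical, and the library's own implementation may differ.

```python
from typing import List, Sequence, Tuple


def align_sketch(
    spectra_ids: Sequence[str],
    spectra: Sequence[Sequence[float]],
    target_ids: Sequence[str],
    targets: Sequence[float],
) -> Tuple[List[Sequence[float]], List[float], List[str]]:
    """Return (X, y, ids) restricted to sample IDs present on both sides."""
    target_by_id = dict(zip(target_ids, targets))
    X, y, ids = [], [], []
    for sample_id, spectrum in zip(spectra_ids, spectra):
        if sample_id in target_by_id:
            X.append(spectrum)
            y.append(target_by_id[sample_id])
            ids.append(sample_id)
    return X, y, ids


# Toy data: 's2' has no target and 's4' has no spectrum.
X, y, ids = align_sketch(
    spectra_ids=['s1', 's2', 's3'],
    spectra=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    target_ids=['s3', 's1', 's4'],
    targets=[13.4, 6.6, 1.9],
)
print(ids)  # ['s1', 's3']
```

Returning the IDs alongside `X` and `y` makes it possible to join other columns back onto the same samples afterwards.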
Later, if you need metadata for these samples: