OSSL datasets

Data loading for the OSSL dataset

The official OSSL documentation provides more details on the dataset and its variables.


OSSLData

 OSSLData (df:pandas.core.frame.DataFrame)

OSSL (Open Soil Spectral Library) data container

    Type       Details
df  DataFrame  dataframe containing OSSL data

OSSLData._parse_columns

 OSSLData._parse_columns ()

Parse columns into visnir, mir and properties
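
As a rough illustration of what such a split could look like, here is a minimal sketch that groups column names by prefix. The helper name `split_columns` is hypothetical, and the rule that every column without a `scan_visnir.` or `scan_mir.` prefix counts as a property is an assumption inferred from the column names shown on this page, not the library's actual implementation.

```python
from typing import Dict, List


def split_columns(columns: List[str]) -> Dict[str, List[str]]:
    """Group column names by prefix (hypothetical helper, for illustration only)."""
    groups: Dict[str, List[str]] = {"visnir": [], "mir": [], "properties": []}
    for col in columns:
        if col.startswith("scan_visnir."):
            groups["visnir"].append(col)
        elif col.startswith("scan_mir."):
            groups["mir"].append(col)
        else:
            # Assumption: anything that is not a spectral column is a property/metadata column.
            groups["properties"].append(col)
    return groups


cols = ["dataset.code_ascii_txt", "scan_visnir.350_ref", "scan_mir.600_abs", "id.layer_local_c"]
groups = split_columns(cols)
print(groups["visnir"], groups["mir"], groups["properties"])
```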


get_cache_path

 get_cache_path (dest_dir:str='.soilspecdata')

Get cache path for OSSL data

          Type  Default        Details
dest_dir  str   .soilspecdata  Name of the cache directory
Returns   Path                 Path to the cache directory (~/dest_dir)

For instance:

get_cache_path()
Path('/Users/franckalbinet/.soilspecdata')
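
The returned path is described above as `~/dest_dir`. A minimal sketch of how such a path can be built with the standard library follows; the function name `cache_path` is hypothetical, and whether the real function also creates the directory is not stated here, so this sketch does not.

```python
from pathlib import Path


def cache_path(dest_dir: str = ".soilspecdata") -> Path:
    """Resolve `dest_dir` under the user's home directory (illustrative re-implementation)."""
    return Path.home() / dest_dir


p = cache_path()
print(p)
```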

The default gzipped file is downloaded from the following URL: https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz


get_ossl

 get_ossl (url='https://storage.googleapis.com/soilspec4gg-
           public/ossl_all_L0_v1.2.csv.gz', force_download=False)

Load OSSL data from cache or download it

                Type  Default                                                                    Details
url             str   https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz  OSSL data gzipped file URL
force_download  bool  False                                                                      if True, force download

How to use it:

ossl = get_ossl(force_download=False)
ossl.visnir_cols[:2], ossl.mir_cols[:2], ossl.properties_cols[:2]
(['scan_visnir.350_ref', 'scan_visnir.352_ref'],
 ['scan_mir.600_abs', 'scan_mir.602_abs'],
 ['dataset.code_ascii_txt', 'id.layer_local_c'])
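
The "load from cache or download" behaviour can be pictured with the following sketch. It is not the library's code: `fetch_cached` and its `downloader` parameter are hypothetical names, and the caching rule (reuse the local file unless `force_download` is set or the file is missing) is an assumption based on the parameter description above.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def fetch_cached(url: str, cache_dir: Path, force_download: bool = False,
                 downloader: Callable[[str, str], object] = urlretrieve) -> Path:
    """Return the local copy of `url`, downloading only when needed (illustrative)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached file after the last component of the URL.
    target = cache_dir / url.rsplit("/", 1)[-1]
    if force_download or not target.exists():
        downloader(url, str(target))
    return target
```

Passing the downloader as an argument keeps the function easy to exercise without network access; the returned gzipped CSV path could then be handed to `pandas.read_csv`, which infers gzip compression from the `.gz` extension.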

OSSLData._get_valid_spectra_mask

 OSSLData._get_valid_spectra_mask (spectra_cols:List[str])

Return mask for samples with all non-null values in spectra

              Type     Details
spectra_cols  List     Spectra column names
Returns       ndarray  Mask

The OSSL gzip archive is stored in wide format, with metadata, soil properties, VisNIR and MIR spectra as columns. Note that, according to the data providers, not all samples have been scanned with both VisNIR and MIR instruments.

As a result, selecting a subset of columns, e.g. ossl.mir_cols, returns a dataframe with many missing values (NaN). The function above returns a mask selecting the samples whose spectra contain no null values.

ossl.df[ossl.mir_cols]
scan_mir.600_abs scan_mir.602_abs scan_mir.604_abs scan_mir.606_abs scan_mir.608_abs scan_mir.610_abs scan_mir.612_abs scan_mir.614_abs scan_mir.616_abs scan_mir.618_abs ... scan_mir.3982_abs scan_mir.3984_abs scan_mir.3986_abs scan_mir.3988_abs scan_mir.3990_abs scan_mir.3992_abs scan_mir.3994_abs scan_mir.3996_abs scan_mir.3998_abs scan_mir.4000_abs
0 1.527853 1.531908 1.532084 1.530892 1.530645 1.531506 1.531582 1.531413 1.532904 1.535459 ... 0.356776 0.356642 0.355784 0.354743 0.354104 0.353663 0.353237 0.352923 0.352548 0.352053
1 1.538449 1.543622 1.545751 1.546997 1.549450 1.553714 1.557981 1.561652 1.566082 1.571555 ... 0.358399 0.358142 0.357144 0.355980 0.355242 0.354722 0.354217 0.353825 0.353376 0.352798
2 1.619721 1.614226 1.615612 1.620649 1.626406 1.631747 1.636411 1.639527 1.642449 1.646890 ... 0.372522 0.372338 0.371425 0.370337 0.369679 0.369245 0.368808 0.368469 0.368084 0.367563
3 1.570129 1.567954 1.573055 1.580834 1.586880 1.590397 1.595117 1.600492 1.603847 1.606447 ... 0.357992 0.357734 0.356713 0.355480 0.354681 0.354137 0.353619 0.353217 0.352756 0.352158
4 1.484832 1.484367 1.484977 1.486258 1.488400 1.492040 1.495075 1.496595 1.498354 1.501437 ... 0.316249 0.316089 0.315098 0.313910 0.313210 0.312758 0.312312 0.311971 0.311568 0.311044
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
135646 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
135647 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
135648 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
135649 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
135650 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

135651 rows × 1701 columns

mask = ossl.df[ossl.mir_cols].notna().all(axis=1)
print(mask.sum(), 'samples with all non-null values in mir spectra out of the total', len(mask))
85684 samples with all non-null values in mir spectra out of the total 135651
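
The same idea can be expressed on a plain NumPy array, which may help when the spectra have already been extracted with `.values`. This is a small sketch on made-up numbers, not OSSL data: a row is kept only if none of its values is NaN.

```python
import numpy as np

# Toy spectra: 3 samples x 3 bands (values are made up for illustration).
spectra = np.array([
    [1.5, 1.6, 1.7],           # fully scanned sample
    [np.nan, np.nan, np.nan],  # sample never scanned with this instrument
    [1.4, np.nan, 1.2],        # partially missing sample
])

# True for rows where every band is non-null.
mask = ~np.isnan(spectra).any(axis=1)
complete = spectra[mask]
print(mask.sum(), "complete rows out of", len(mask))
```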

OSSLData._extract_wavenumbers

 OSSLData._extract_wavenumbers (cols:List[str])

Extract wavenumbers from spectral column names

      Type  Details
cols  List  column names

For instance, to retrieve the wavenumbers from the MIR columns:

ossl._extract_wavenumbers(ossl.mir_cols)
array([ 600,  602,  604, ..., 3996, 3998, 4000], shape=(1701,))
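
One way to pull the numeric part out of names such as `scan_mir.600_abs` is a regular expression. The sketch below is an assumption about the naming pattern (`<prefix>.<number>_<type>`) based on the columns shown on this page; `parse_spectral_positions` is a hypothetical helper, and it returns the number exactly as written in the column name.

```python
import re
from typing import List

# Matches the integer between the dot and the trailing underscore, e.g. ".600_".
_POSITION = re.compile(r"\.(\d+)_")


def parse_spectral_positions(cols: List[str]) -> List[int]:
    """Extract the integer embedded in each spectral column name (illustrative)."""
    return [int(_POSITION.search(col).group(1)) for col in cols]


positions = parse_spectral_positions(["scan_mir.600_abs", "scan_mir.602_abs", "scan_mir.4000_abs"])
print(positions)
```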

OSSLData._extract_measurement_type

 OSSLData._extract_measurement_type (cols:List[str])

Extract measurement type from column names

         Type  Details
cols     List  Spectral column names
Returns  str   abs (Absorbance) or ref (Reflectance)

For instance, to retrieve the measurement type from the MIR or VISNIR columns:

ossl._extract_measurement_type(ossl.visnir_cols), ossl._extract_measurement_type(ossl.mir_cols)
('ref', 'abs')
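
Since the measurement type is the suffix after the last underscore in each column name, it can be recovered with simple string handling. The following sketch uses a hypothetical helper, `measurement_type`, and additionally raises an error when the columns mix several suffixes; that check is an illustrative choice, not documented library behaviour.

```python
from typing import List


def measurement_type(cols: List[str]) -> str:
    """Return the shared suffix ('abs' or 'ref') of spectral column names (illustrative)."""
    suffixes = {col.rsplit("_", 1)[-1] for col in cols}
    if len(suffixes) != 1:
        raise ValueError(f"Expected a single measurement type, found: {sorted(suffixes)}")
    return suffixes.pop()


visnir_type = measurement_type(["scan_visnir.350_ref", "scan_visnir.352_ref"])
mir_type = measurement_type(["scan_mir.600_abs", "scan_mir.602_abs"])
print(visnir_type, mir_type)
```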

OSSLData._filter_wavelength_range

 OSSLData._filter_wavelength_range (wavenumbers:numpy.ndarray,
                                    spectra:numpy.ndarray, cols:List[str],
                                    wmin:Optional[int]=None,
                                    wmax:Optional[int]=None)

Filter spectra based on wavenumber range

             Type      Default  Details
wavenumbers  ndarray            Wavenumbers
spectra      ndarray            Spectra
cols         List               Column names
wmin         Optional  None     Min wavenumber
wmax         Optional  None     Max wavenumber
Returns      Tuple              Filtered wavenumbers, spectra, columns

wavenumbers, spectra, cols = ossl._filter_wavelength_range(
    wavenumbers=ossl._extract_wavenumbers(ossl.visnir_cols), 
    spectra=ossl.df[ossl.visnir_cols].values, 
    cols=ossl.visnir_cols, 
    wmin=4000, wmax=25000
)

print(f'Original wavenumbers: {ossl._extract_wavenumbers(ossl.visnir_cols).min()} - {ossl._extract_wavenumbers(ossl.visnir_cols).max()}')
print(f'Filtered wavenumbers: {wavenumbers.min()} - {wavenumbers.max()}')
print(f'Spectra shape: {spectra.shape}')
print(f'Filtered columns. From: {cols[0]} to: {cols[-1]}')
Original wavenumbers: 4000 - 28571
Filtered wavenumbers: 4000 - 25000
Spectra shape: (135651, 1051)
Filtered columns. From: scan_visnir.400_ref to: scan_visnir.2500_ref
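
Conceptually, this kind of range filter is a boolean mask applied to the wavenumber axis, the spectra columns and the column names at once. Here is a self-contained sketch on toy data; treating both bounds as inclusive is an assumption, although it is consistent with the 4000 - 25000 result printed above.

```python
import numpy as np

# Toy inputs: 4 bands, 2 samples (values are made up for illustration).
wavenumbers = np.array([4000, 10000, 25000, 28571])
spectra = np.arange(8, dtype=float).reshape(2, 4)
cols = ["scan_visnir.2500_ref", "scan_visnir.1000_ref",
        "scan_visnir.400_ref", "scan_visnir.350_ref"]

wmin, wmax = 4000, 25000
# Assumption: both bounds are inclusive.
keep = (wavenumbers >= wmin) & (wavenumbers <= wmax)

filtered_wavenumbers = wavenumbers[keep]
filtered_spectra = spectra[:, keep]  # drop the same bands from every sample
filtered_cols = [col for col, flag in zip(cols, keep) if flag]

print(filtered_wavenumbers, filtered_spectra.shape, filtered_cols[-1])
```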

IMPORTANT: Note that by default, both VISNIR and MIR spectra are expressed in wavenumbers (cm⁻¹). The VISNIR column names carry wavelengths in nm, which are converted accordingly.
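
For reference, the standard relation between a wavelength in nm and a wavenumber in cm⁻¹ is wavenumber = 10⁷ / wavelength. The short sketch below applies it to the VISNIR bounds seen on this page; truncating to integers is an assumption made here to match the printed values, and the library may round differently.

```python
def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nanometres to a wavenumber in cm^-1."""
    return 1e7 / wavelength_nm


for nm in (350, 400, 2500):
    # int() truncates: 350 nm gives 28571, 400 nm gives 25000, 2500 nm gives 4000.
    print(nm, "nm ->", int(nm_to_wavenumber(nm)), "cm-1")
```

Because the relation is reciprocal, the shortest wavelength (350 nm) maps to the largest wavenumber, so a VISNIR axis sorted by wavelength is reversed once expressed in wavenumbers.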


OSSLData.get_visnir

 OSSLData.get_visnir (wmin:Optional[int]=4000, wmax:Optional[int]=25000)

Get VISNIR spectra within specified wavenumber range

         Type         Default  Details
wmin     Optional     4000     Min wavenumber
wmax     Optional     25000    Max wavenumber
Returns  SpectraData           VISNIR data

For instance, to retrieve the VISNIR spectra between 8000 and 25000 cm⁻¹:

visnir_data = ossl.get_visnir(wmin=8000, wmax=25000)
visnir_data.spectra.shape
(64644, 426)

OSSLData.get_mir

 OSSLData.get_mir (wmin:Optional[int]=600, wmax:Optional[int]=4000)

Get MIR spectra within specified wavenumber range

         Type         Default  Details
wmin     Optional     600      Min wavenumber
wmax     Optional     4000     Max wavenumber
Returns  SpectraData           MIR data

For instance, to retrieve the MIR spectra between 600 and 4000 cm⁻¹ (the default range):

mir_data = ossl.get_mir()
mir_data.spectra.shape, mir_data.wavenumbers.min(), mir_data.wavenumbers.max()
((85684, 1701), np.int64(600), np.int64(4000))

OSSLData.get_properties

 OSSLData.get_properties (properties=None, require_complete:bool=False)

Get properties data with sample IDs

                  Type       Default  Details
properties        NoneType   None     Properties
require_complete  bool       False    if True, only return samples with no null values
Returns           DataFrame           Selected properties data

Get only complete MIR spectra:

mir_data = ossl.get_mir()

Get properties needed as ML targets (must be complete):

targets = ossl.get_properties(['cec_usda.a723_cmolc.kg'], require_complete=True)
targets.shape, targets.head()
((57064, 1),
         cec_usda.a723_cmolc.kg
 id                            
 S40857                6.633217
 S40858                3.822628
 S40859                3.427324
 S40860                1.906545
 S40861               13.403203)

Get optional metadata (can have NaN values):

metadata = ossl.get_properties(['longitude.point_wgs84_dd', 'latitude.point_wgs84_dd'], require_complete=False)
metadata.shape, metadata.head()
((135651, 2),
            longitude.point_wgs84_dd  latitude.point_wgs84_dd
 id                                                          
 icr072246                 15.687492                -7.377750
 icr072247                 15.687492                -7.377750
 icr072266                 15.687817                -7.351243
 icr072267                 15.687817                -7.351243
 icr072286                 15.687965                -7.331673)

OSSLData.get_aligned_data

 OSSLData.get_aligned_data (spectra_data:soilspecdata.types.SpectraData,
                            target_cols:Union[str,List[str]])

Get aligned spectra and target data for ML, along with their sample IDs

              Type         Details
spectra_data  SpectraData  Spectra data
target_cols   Union        Target columns
Returns       Tuple        Aligned spectra, targets, sample IDs

For instance, to retrieve the MIR spectra and the corresponding CEC values in a form suitable for a machine learning or deep learning pipeline:

X, y, ids = ossl.get_aligned_data(
    spectra_data=mir_data,
    target_cols='cec_usda.a723_cmolc.kg'
)

X.shape, y.shape, ids.shape
((3, 3), (3, 1), (3,))
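
Aligning spectra and targets essentially means keeping the samples whose ID appears in both sets, in the same order. A minimal sketch of that idea with NumPy is given below; it uses made-up IDs and values, assumes IDs are unique within each set, and is not a description of how the library performs the alignment internally.

```python
import numpy as np

# Toy spectra for 4 samples and targets for 3 samples (values are made up).
spectra_ids = np.array(["a", "b", "c", "d"])
spectra = np.arange(12, dtype=float).reshape(4, 3)
target_ids = np.array(["b", "d", "e"])
targets = np.array([[6.6], [3.8], [1.9]])

# IDs present in both sets, plus their positions in each array.
ids, spectra_idx, target_idx = np.intersect1d(spectra_ids, target_ids, return_indices=True)

X = spectra[spectra_idx]  # rows of spectra for the shared samples
y = targets[target_idx]   # matching target rows, in the same order
print(X.shape, y.shape, ids)
```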

Later, if you need metadata for these samples:

metadata = ossl.get_properties(['longitude.point_wgs84_dd', 'latitude.point_wgs84_dd']).loc[ids]
metadata.head()
        longitude.point_wgs84_dd  latitude.point_wgs84_dd
id
173693                       NaN                      NaN
172161               -120.354407                42.207350
181527               -107.274835                47.434481
176683                       NaN                      NaN
212508               -103.316306                46.488522