OSSL datasets

Data loading for the OSSL dataset

The official OSSL documentation provides more details on the dataset and its variables.

OSSLData

 OSSLData (df:pandas.core.frame.DataFrame)

OSSL (Open Soil Spectral Library) data container

	Type	Details
df	DataFrame	dataframe containing OSSL data

OSSLData._parse_columns

 OSSLData._parse_columns ()

Parse columns into visnir, mir and properties

get_cache_path

 get_cache_path (dest_dir:str='.soilspecdata')

Get cache path for OSSL data

	Type	Default	Details
dest_dir	str	.soilspecdata	Name of the cache directory
Returns	Path		Path to the cache directory (~/dest_dir)

For instance:

get_cache_path()

Path('/Users/franckalbinet/.soilspecdata')
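The returned path is the user's home directory joined with dest_dir. A minimal sketch of that behaviour, where cache_path_sketch is a hypothetical stand-in and not the library's implementation (whether the real function also creates the directory is not stated here):

```python
from pathlib import Path

def cache_path_sketch(dest_dir: str = ".soilspecdata") -> Path:
    """Hypothetical stand-in for get_cache_path: resolve ~/dest_dir."""
    # Path.home() returns the current user's home directory.
    return Path.home() / dest_dir

print(cache_path_sketch())
```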

The default gzipped file is downloaded from the following URL: https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz

get_ossl

 get_ossl (url='https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz', force_download=False)

Load OSSL data from cache or download it

	Type	Default	Details
url	str	https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz	OSSL data gzipped file URL
force_download	bool	False	if True, force download
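The cache-or-download behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: load_or_fetch and fake_fetch are hypothetical names, and the download step is injected as a callable so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable

def load_or_fetch(url: str, cache_dir: Path,
                  fetch: Callable[[str, Path], None],
                  force_download: bool = False) -> Path:
    """Hypothetical helper: return the cached file, fetching it only if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]  # file name taken from the URL
    if force_download or not target.exists():
        fetch(url, target)  # e.g. a thin wrapper around urllib.request.urlretrieve
    return target

# Offline demo: a fake fetcher records calls instead of hitting the network.
calls = []

def fake_fetch(url: str, dest: Path) -> None:
    calls.append(url)
    dest.write_bytes(b"placeholder")

URL = "https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz"
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / ".soilspecdata"
    first = load_or_fetch(URL, cache, fake_fetch)               # downloads
    load_or_fetch(URL, cache, fake_fetch)                       # served from cache
    load_or_fetch(URL, cache, fake_fetch, force_download=True)  # downloads again
    cached_name = first.name

n_calls = len(calls)
print(cached_name, n_calls)
```

The fetcher runs on the first call and on the forced call only; the second call finds the file already in the cache directory.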

How to use it:

ossl = get_ossl(force_download=False)

ossl.visnir_cols[:2], ossl.mir_cols[:2], ossl.properties_cols[:2]

(['scan_visnir.350_ref', 'scan_visnir.352_ref'],
 ['scan_mir.600_abs', 'scan_mir.602_abs'],
 ['dataset.code_ascii_txt', 'id.layer_local_c'])
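The three column groups shown above can be reproduced with a prefix-based split. The sketch below assumes that spectral columns are identified by the scan_visnir. and scan_mir. prefixes and that every other column counts as a property; split_columns is a hypothetical helper, not the library's _parse_columns.

```python
from typing import List, Tuple

def split_columns(columns: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Hypothetical helper: split columns into VISNIR, MIR and property groups."""
    visnir = [c for c in columns if c.startswith("scan_visnir.")]
    mir = [c for c in columns if c.startswith("scan_mir.")]
    spectral = set(visnir) | set(mir)
    # Assumption: anything that is not a spectral band is a property/metadata column.
    properties = [c for c in columns if c not in spectral]
    return visnir, mir, properties

cols = ["dataset.code_ascii_txt", "scan_visnir.350_ref",
        "scan_mir.600_abs", "id.layer_local_c"]
visnir, mir, props = split_columns(cols)
print(visnir, mir, props)
```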

OSSLData._get_valid_spectra_mask

 OSSLData._get_valid_spectra_mask (spectra_cols:List[str])

Return mask for samples with all non-null values in spectra

	Type	Details
spectra_cols	List	Spectra column names
Returns	ndarray	Mask

The OSSL gzip archive is stored in wide format, with metadata, soil properties, and VISNIR and MIR spectra as columns. Note that not all samples have been scanned with both VISNIR and MIR instruments; coverage depends on the data source/provider.

As a result, selecting a subset of columns, e.g. ossl.mir_cols, returns a dataframe with many missing values (NaN). The function above returns a mask that selects the samples whose spectra contain no null values.

ossl.df[ossl.mir_cols]

	scan_mir.600_abs	scan_mir.602_abs	scan_mir.604_abs	scan_mir.606_abs	scan_mir.608_abs	scan_mir.610_abs	scan_mir.612_abs	scan_mir.614_abs	scan_mir.616_abs	scan_mir.618_abs	...	scan_mir.3982_abs	scan_mir.3984_abs	scan_mir.3986_abs	scan_mir.3988_abs	scan_mir.3990_abs	scan_mir.3992_abs	scan_mir.3994_abs	scan_mir.3996_abs	scan_mir.3998_abs	scan_mir.4000_abs
0	1.527853	1.531908	1.532084	1.530892	1.530645	1.531506	1.531582	1.531413	1.532904	1.535459	...	0.356776	0.356642	0.355784	0.354743	0.354104	0.353663	0.353237	0.352923	0.352548	0.352053
1	1.538449	1.543622	1.545751	1.546997	1.549450	1.553714	1.557981	1.561652	1.566082	1.571555	...	0.358399	0.358142	0.357144	0.355980	0.355242	0.354722	0.354217	0.353825	0.353376	0.352798
2	1.619721	1.614226	1.615612	1.620649	1.626406	1.631747	1.636411	1.639527	1.642449	1.646890	...	0.372522	0.372338	0.371425	0.370337	0.369679	0.369245	0.368808	0.368469	0.368084	0.367563
3	1.570129	1.567954	1.573055	1.580834	1.586880	1.590397	1.595117	1.600492	1.603847	1.606447	...	0.357992	0.357734	0.356713	0.355480	0.354681	0.354137	0.353619	0.353217	0.352756	0.352158
4	1.484832	1.484367	1.484977	1.486258	1.488400	1.492040	1.495075	1.496595	1.498354	1.501437	...	0.316249	0.316089	0.315098	0.313910	0.313210	0.312758	0.312312	0.311971	0.311568	0.311044
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
135646	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
135647	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
135648	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
135649	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
135650	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

135651 rows × 1701 columns

mask = ossl.df[ossl.mir_cols].notna().all(axis=1)
print(mask.sum(), 'samples with all non-null values in mir spectra out of the total', len(mask))

85684 samples with all non-null values in mir spectra out of the total 135651
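The same masking logic on a toy dataframe makes the row-wise behaviour easier to see. The values below are made up for illustration; only the column naming follows the OSSL convention.

```python
import numpy as np
import pandas as pd

# Toy data: 3 samples x 2 MIR bands (illustrative values only).
toy = pd.DataFrame({
    "scan_mir.600_abs": [1.5, np.nan, 1.6],
    "scan_mir.602_abs": [1.4, np.nan, np.nan],
})

# A sample is kept only if every band is non-null.
mask = toy.notna().all(axis=1).to_numpy()
complete = toy[mask]
print(mask, len(complete))
```

Only the first sample survives: the second has no MIR scan at all and the third is missing one band.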

OSSLData._extract_wavenumbers

 OSSLData._extract_wavenumbers (cols:List[str])

Extract wavenumbers from spectral column names

	Type	Details
cols	List	column names

For instance, to retrieve the wavenumbers from the MIR columns:

ossl._extract_wavenumbers(ossl.mir_cols)

array([ 600,  602,  604, ..., 3996, 3998, 4000], shape=(1701,))
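One way to obtain such an array is to parse the number embedded in each column name. This is a sketch that assumes names of the form scan_mir.<number>_abs; parse_axis_values is a hypothetical helper and only reads the number as written in the name, without any unit conversion.

```python
import re
from typing import List

import numpy as np

def parse_axis_values(cols: List[str]) -> np.ndarray:
    """Hypothetical helper: pull the integer between '.' and '_' from each name."""
    pattern = re.compile(r"\.(\d+)_")
    return np.array([int(pattern.search(c).group(1)) for c in cols])

values = parse_axis_values(["scan_mir.600_abs", "scan_mir.602_abs", "scan_mir.4000_abs"])
print(values)
```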

OSSLData._extract_measurement_type

 OSSLData._extract_measurement_type (cols:List[str])

Extract measurement type from column names

	Type	Details
cols	List	Spectral column names
Returns	str	`abs` (Absorbance) or `ref` (Reflectance)

For instance, to retrieve the measurement type from the MIR or VISNIR columns:

ossl._extract_measurement_type(ossl.visnir_cols), ossl._extract_measurement_type(ossl.mir_cols)

('ref', 'abs')
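The measurement type is visible as the suffix of each column name. A minimal sketch, assuming the suffix after the last underscore is the type and that all columns in a group share it (measurement_type is a hypothetical helper, not the library's method):

```python
from typing import List

def measurement_type(cols: List[str]) -> str:
    """Hypothetical helper: return the shared suffix ('abs' or 'ref') of the columns."""
    suffixes = {c.rsplit("_", 1)[-1] for c in cols}
    if len(suffixes) != 1:
        raise ValueError(f"Expected a single measurement type, got {sorted(suffixes)}")
    return suffixes.pop()

visnir_type = measurement_type(["scan_visnir.350_ref", "scan_visnir.352_ref"])
mir_type = measurement_type(["scan_mir.600_abs", "scan_mir.602_abs"])
print(visnir_type, mir_type)
```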

OSSLData._filter_wavelength_range

 OSSLData._filter_wavelength_range (wavenumbers:numpy.ndarray,
                                    spectra:numpy.ndarray, cols:List[str],
                                    wmin:Optional[int]=None,
                                    wmax:Optional[int]=None)

Filter spectra based on wavenumber range

	Type	Default	Details
wavenumbers	ndarray		Wavenumbers
spectra	ndarray		Spectra
cols	List		Column names
wmin	Optional	None	Min wavenumber
wmax	Optional	None	Max wavenumber
Returns	Tuple		Filtered wavenumbers, spectra, columns

wavenumbers, spectra, cols = ossl._filter_wavelength_range(
    wavenumbers=ossl._extract_wavenumbers(ossl.visnir_cols), 
    spectra=ossl.df[ossl.visnir_cols].values, 
    cols=ossl.visnir_cols, 
    wmin=4000, wmax=25000
)

print(f'Original wavenumbers: {ossl._extract_wavenumbers(ossl.visnir_cols).min()} - {ossl._extract_wavenumbers(ossl.visnir_cols).max()}')
print(f'Filtered wavenumbers: {wavenumbers.min()} - {wavenumbers.max()}')
print(f'Spectra shape: {spectra.shape}')
print(f'Filtered columns. From: {cols[0]} to: {cols[-1]}')

Original wavenumbers: 4000 - 28571
Filtered wavenumbers: 4000 - 25000
Spectra shape: (135651, 1051)
Filtered columns. From: scan_visnir.400_ref to: scan_visnir.2500_ref

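IMPORTANT: Note that, by default, both VISNIR and MIR spectra are converted to wavenumbers (cm⁻¹). VISNIR column names keep their wavelength in nm, which is why the 4000 - 25000 range above spans scan_visnir.400_ref to scan_visnir.2500_ref.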
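The relation between the two axes is wavenumber (cm⁻¹) = 10^7 / wavelength (nm), so 400 nm corresponds to 25000 cm⁻¹ and 2500 nm to 4000 cm⁻¹. The sketch below applies that conversion and a range filter to toy arrays; it illustrates the idea and is not the library's _filter_wavelength_range.

```python
import numpy as np

wavelengths_nm = np.array([350, 400, 1000, 2500])
wavenumbers = 1e7 / wavelengths_nm  # cm^-1 = 1e7 / nm

# Toy spectra: 2 samples x 4 bands (illustrative values only).
spectra = np.arange(8, dtype=float).reshape(2, 4)

# Keep the bands whose wavenumber falls inside [wmin, wmax].
wmin, wmax = 4000, 25000
keep = (wavenumbers >= wmin) & (wavenumbers <= wmax)
filtered_wn = wavenumbers[keep]
filtered_spectra = spectra[:, keep]
print(filtered_wn, filtered_spectra.shape)
```

The 350 nm band (about 28571 cm⁻¹) falls outside the range and is dropped, while the bounds themselves are inclusive.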

OSSLData.get_visnir

 OSSLData.get_visnir (wmin:Optional[int]=4000, wmax:Optional[int]=25000)

Get VISNIR spectra within specified wavenumber range

	Type	Default	Details
wmin	Optional	4000	Min wavenumber
wmax	Optional	25000	Max wavenumber
Returns	SpectraData		VISNIR data

For instance, to retrieve the VISNIR spectra with wavenumbers between 8000 and 25000 cm⁻¹:

visnir_data = ossl.get_visnir(wmin=8000, wmax=25000)
visnir_data.spectra.shape

(64644, 426)

OSSLData.get_mir

 OSSLData.get_mir (wmin:Optional[int]=600, wmax:Optional[int]=4000)

Get MIR spectra within specified wavenumber range

	Type	Default	Details
wmin	Optional	600	Min wavenumber
wmax	Optional	4000	Max wavenumber
Returns	SpectraData		MIR data

For instance, to retrieve the MIR spectra with wavenumbers between 600 and 4000 cm⁻¹ (default range):

mir_data = ossl.get_mir()
mir_data.spectra.shape, mir_data.wavenumbers.min(), mir_data.wavenumbers.max()

((85684, 1701), np.int64(600), np.int64(4000))

OSSLData.get_properties

 OSSLData.get_properties (properties=None, require_complete:bool=False)

Get properties data with sample IDs

	Type	Default	Details
properties	NoneType	None	Properties
require_complete	bool	False	if True, only return samples with no null values
Returns	DataFrame		Selected properties data

Get only complete MIR spectra:

mir_data = ossl.get_mir()

Get properties needed as ML targets (must be complete):

targets = ossl.get_properties(['cec_usda.a723_cmolc.kg'], require_complete=True)
targets.shape, targets.head()

((57064, 1),
         cec_usda.a723_cmolc.kg
 id                            
 S40857                6.633217
 S40858                3.822628
 S40859                3.427324
 S40860                1.906545
 S40861               13.403203)

Get optional metadata (can have NaN values):

metadata = ossl.get_properties(['longitude.point_wgs84_dd', 'latitude.point_wgs84_dd'], require_complete=False)
metadata.shape, metadata.head()

((135651, 2),
            longitude.point_wgs84_dd  latitude.point_wgs84_dd
 id                                                          
 icr072246                 15.687492                -7.377750
 icr072247                 15.687492                -7.377750
 icr072266                 15.687817                -7.351243
 icr072267                 15.687817                -7.351243
 icr072286                 15.687965                -7.331673)
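The effect of require_complete can be pictured as a row-wise dropna on the selected columns. A sketch on toy data, assuming that is how completeness is enforced (select_properties and the short column names are hypothetical):

```python
from typing import List

import numpy as np
import pandas as pd

# Toy properties table indexed by sample id (illustrative values only).
props = pd.DataFrame(
    {"cec": [6.6, np.nan, 3.4], "lat": [np.nan, -7.3, -7.4]},
    index=pd.Index(["a", "b", "c"], name="id"),
)

def select_properties(df: pd.DataFrame, names: List[str],
                      require_complete: bool = False) -> pd.DataFrame:
    """Hypothetical helper: select columns, optionally dropping rows with any NaN."""
    subset = df[names]
    return subset.dropna() if require_complete else subset

targets = select_properties(props, ["cec"], require_complete=True)
both = select_properties(props, ["cec", "lat"], require_complete=True)
everything = select_properties(props, ["cec"])
print(targets.index.tolist(), both.index.tolist(), len(everything))
```

Requesting several properties with require_complete=True keeps only the samples where all of them are present, which can shrink the result considerably.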

OSSLData.get_aligned_data

 OSSLData.get_aligned_data (spectra_data:soilspecdata.types.SpectraData,
                            target_cols:Union[str,List[str]])

Get aligned spectra and target data for ML, along with their sample IDs

	Type	Details
spectra_data	SpectraData	Spectra data
target_cols	Union	Target columns
Returns	Tuple	Aligned spectra, targets, sample IDs

For instance, to retrieve the MIR spectra and the corresponding CEC values in a form suitable for a machine/deep learning pipeline:

X, y, ids = ossl.get_aligned_data(
    spectra_data=mir_data,
    target_cols='cec_usda.a723_cmolc.kg'
)

X.shape, y.shape, ids.shape

((3, 3), (3, 1), (3,))
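Conceptually, alignment means keeping only the samples that have both a spectrum and a non-null target, in the same order on both sides. The sketch below shows one way to do this on toy data; the variable names and the intersection strategy are assumptions, not the library's implementation.

```python
import numpy as np
import pandas as pd

# Toy spectra: 4 samples x 3 bands, with their sample ids.
spectra_ids = ["s1", "s2", "s3", "s4"]
spectra = np.arange(12, dtype=float).reshape(4, 3)

# Toy targets: some ids overlap with the spectra, one target is missing.
targets = pd.DataFrame(
    {"cec": [1.0, np.nan, 3.0, 4.0]},
    index=pd.Index(["s2", "s3", "s4", "s5"], name="id"),
)

complete = targets.dropna()                                # targets must be non-null
common = [i for i in spectra_ids if i in complete.index]   # ids present on both sides
row_of = {sid: k for k, sid in enumerate(spectra_ids)}     # id -> spectra row number

X = spectra[[row_of[i] for i in common]]
y = complete.loc[common].to_numpy()
ids = np.array(common)
print(X.shape, y.shape, ids.shape)
```

Here only s2 and s4 have both a spectrum and a target, so X, y and ids each describe the same two samples row by row.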

Later, if you need metadata for these samples:

metadata = ossl.get_properties(['longitude.point_wgs84_dd', 'latitude.point_wgs84_dd']).loc[ids]
metadata.head()

	longitude.point_wgs84_dd	latitude.point_wgs84_dd
id
173693	NaN	NaN
172161	-120.354407	42.207350
181527	-107.274835	47.434481
176683	NaN	NaN
212508	-103.316306	46.488522