get_cache_path()
Path('/Users/franckalbinet/.soilspecdata')
The official OSSL documentation provides more details on the dataset and its variables.
OSSLData (df:pandas.core.frame.DataFrame)
OSSL (Open Soil Spectral Library) data container
| | Type | Details |
|---|---|---|
| df | DataFrame | DataFrame containing OSSL data |
OSSLData._parse_columns ()
Parse columns into visnir, mir and properties
get_cache_path (dest_dir:str='.soilspecdata')
Get cache path for OSSL data
| | Type | Default | Details |
|---|---|---|---|
| dest_dir | str | .soilspecdata | Name of the cache directory |
| Returns | Path | | Path to the cache directory (~/dest_dir) |
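The `~/dest_dir` convention in the table can be pictured with a few lines of `pathlib`. This is only a sketch under assumptions, not the library's implementation: `cache_path_sketch` is a hypothetical name, and whether the real function also creates the directory is not stated here.

```python
from pathlib import Path

def cache_path_sketch(dest_dir: str = '.soilspecdata') -> Path:
    """Hypothetical stand-in: build ~/dest_dir without touching the filesystem."""
    return Path.home() / dest_dir

print(cache_path_sketch())
```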
For instance:
The default gzipped file is downloaded from the following URL: https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz
get_ossl (url='https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz', force_download=False)
Load OSSL data from cache or download it
| | Type | Default | Details |
|---|---|---|---|
| url | str | https://storage.googleapis.com/soilspec4gg-public/ossl_all_L0_v1.2.csv.gz | OSSL data gzipped file URL |
| force_download | bool | False | If True, force download |
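The "load from cache or download" behaviour controlled by `force_download` can be illustrated with a generic caching pattern. The sketch below is an assumption about the general idea only: `cached_download_sketch`, its `fetch` parameter, and the placeholder URL are hypothetical and not part of the library, and the demonstration runs offline with a fake fetcher.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable

def cached_download_sketch(
    url: str,
    cache_dir: Path,
    force_download: bool = False,
    fetch: Callable[[str, Path], object] = urllib.request.urlretrieve,
) -> Path:
    """Hypothetical helper: return the local copy of `url`, fetching it only when needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_file = cache_dir / url.rsplit('/', 1)[-1]  # file name taken from the URL
    if force_download or not local_file.exists():
        fetch(url, local_file)
    return local_file

# Offline demonstration with a fake fetcher that records each call.
calls = []

def fake_fetch(url: str, dest: Path) -> None:
    calls.append(url)
    dest.write_bytes(b'placeholder')

demo_dir = Path(tempfile.mkdtemp())
demo_url = 'https://example.org/data.csv.gz'  # placeholder URL, not the OSSL one
cached_download_sketch(demo_url, demo_dir, fetch=fake_fetch)                       # downloads
cached_download_sketch(demo_url, demo_dir, fetch=fake_fetch)                       # cache hit
cached_download_sketch(demo_url, demo_dir, force_download=True, fetch=fake_fetch)  # forced
print(len(calls))
```

Only the first and the forced call reach the fetcher; the second call is served from the cached file.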
How to use it:
(['scan_visnir.350_ref', 'scan_visnir.352_ref'],
 ['scan_mir.600_abs', 'scan_mir.602_abs'],
 ['dataset.code_ascii_txt', 'id.layer_local_c'])
OSSLData._get_valid_spectra_mask (spectra_cols:List[str])
Return mask for samples with all non-null values in spectra
| | Type | Details |
|---|---|---|
| spectra_cols | List | Spectra column names |
| Returns | ndarray | Mask |
The OSSL gzip archive is stored in wide format, with metadata, soil properties, VISNIR and MIR spectra as columns. Note that not every sample has been scanned with both VISNIR and MIR instruments; coverage depends on the data source/provider.
As a result, when selecting a subset of columns, e.g. ossl.mir_cols, the returned dataframe contains many missing values (NaN). The function above returns a mask selecting the samples whose spectra contain no null values.
| | scan_mir.600_abs | scan_mir.602_abs | scan_mir.604_abs | scan_mir.606_abs | scan_mir.608_abs | scan_mir.610_abs | scan_mir.612_abs | scan_mir.614_abs | scan_mir.616_abs | scan_mir.618_abs | ... | scan_mir.3982_abs | scan_mir.3984_abs | scan_mir.3986_abs | scan_mir.3988_abs | scan_mir.3990_abs | scan_mir.3992_abs | scan_mir.3994_abs | scan_mir.3996_abs | scan_mir.3998_abs | scan_mir.4000_abs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.527853 | 1.531908 | 1.532084 | 1.530892 | 1.530645 | 1.531506 | 1.531582 | 1.531413 | 1.532904 | 1.535459 | ... | 0.356776 | 0.356642 | 0.355784 | 0.354743 | 0.354104 | 0.353663 | 0.353237 | 0.352923 | 0.352548 | 0.352053 | 
| 1 | 1.538449 | 1.543622 | 1.545751 | 1.546997 | 1.549450 | 1.553714 | 1.557981 | 1.561652 | 1.566082 | 1.571555 | ... | 0.358399 | 0.358142 | 0.357144 | 0.355980 | 0.355242 | 0.354722 | 0.354217 | 0.353825 | 0.353376 | 0.352798 | 
| 2 | 1.619721 | 1.614226 | 1.615612 | 1.620649 | 1.626406 | 1.631747 | 1.636411 | 1.639527 | 1.642449 | 1.646890 | ... | 0.372522 | 0.372338 | 0.371425 | 0.370337 | 0.369679 | 0.369245 | 0.368808 | 0.368469 | 0.368084 | 0.367563 | 
| 3 | 1.570129 | 1.567954 | 1.573055 | 1.580834 | 1.586880 | 1.590397 | 1.595117 | 1.600492 | 1.603847 | 1.606447 | ... | 0.357992 | 0.357734 | 0.356713 | 0.355480 | 0.354681 | 0.354137 | 0.353619 | 0.353217 | 0.352756 | 0.352158 | 
| 4 | 1.484832 | 1.484367 | 1.484977 | 1.486258 | 1.488400 | 1.492040 | 1.495075 | 1.496595 | 1.498354 | 1.501437 | ... | 0.316249 | 0.316089 | 0.315098 | 0.313910 | 0.313210 | 0.312758 | 0.312312 | 0.311971 | 0.311568 | 0.311044 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 135646 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 
| 135647 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 
| 135648 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 
| 135649 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 
| 135650 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 
135651 rows × 1701 columns
mask = ossl.df[ossl.mir_cols].notna().all(axis=1)
print(mask.sum(), 'samples with all non-null values in mir spectra out of the total', len(mask))
85684 samples with all non-null values in mir spectra out of the total 135651
OSSLData._extract_wavenumbers (cols:List[str])
Extract wavenumbers from spectral column names
| | Type | Details |
|---|---|---|
| cols | List | Column names |
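Column names such as `scan_mir.600_abs` embed the spectral position between the `.` and the trailing `_`. A minimal sketch of how that number could be parsed is given below; `extract_numbers_sketch` is a hypothetical helper, not the library's method. It returns a plain list rather than a NumPy array and performs no unit conversion, whereas the outputs on this page suggest the real method reports VISNIR positions as wavenumbers.

```python
import re
from typing import List

def extract_numbers_sketch(cols: List[str]) -> List[int]:
    """Hypothetical helper: pull the integer between '.' and '_' in each column name."""
    pattern = re.compile(r'\.(\d+)_')
    return [int(pattern.search(col).group(1)) for col in cols]

print(extract_numbers_sketch(['scan_mir.600_abs', 'scan_mir.602_abs']))
```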
For instance, to retrieve the wavenumbers from the MIR columns:
array([ 600,  602,  604, ..., 3996, 3998, 4000], shape=(1701,))
OSSLData._extract_measurement_type (cols:List[str])
Extract measurement type from column names
| | Type | Details |
|---|---|---|
| cols | List | Spectral column names |
| Returns | str | `abs` (Absorbance) or `ref` (Reflectance) |
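The measurement type is carried by the suffix of each spectral column name (`_abs` or `_ref`). The sketch below shows one way to read it; `measurement_type_sketch` is a hypothetical helper, and raising an error on mixed suffixes is an assumption made for illustration rather than documented behaviour.

```python
from typing import List

def measurement_type_sketch(cols: List[str]) -> str:
    """Hypothetical helper: return the shared suffix ('abs' or 'ref') of spectral columns."""
    suffixes = {col.rsplit('_', 1)[-1] for col in cols}
    if len(suffixes) != 1:
        raise ValueError(f'Expected one measurement type, found: {sorted(suffixes)}')
    return suffixes.pop()

print(measurement_type_sketch(['scan_visnir.350_ref', 'scan_visnir.352_ref']),
      measurement_type_sketch(['scan_mir.600_abs', 'scan_mir.602_abs']))
```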
For instance, to retrieve the measurement type from the MIR or VISNIR columns:
('ref', 'abs')
OSSLData._filter_wavelength_range (wavenumbers:numpy.ndarray, spectra:numpy.ndarray, cols:List[str], wmin:Optional[int]=None, wmax:Optional[int]=None)
Filter spectra based on wavenumber range
| | Type | Default | Details |
|---|---|---|---|
| wavenumbers | ndarray | | Wavenumbers |
| spectra | ndarray | | Spectra |
| cols | List | | Column names |
| wmin | Optional | None | Min wavenumber |
| wmax | Optional | None | Max wavenumber |
| Returns | Tuple | | Filtered wavenumbers, spectra, columns |
wavenumbers, spectra, cols = ossl._filter_wavelength_range(
    wavenumbers=ossl._extract_wavenumbers(ossl.visnir_cols), 
    spectra=ossl.df[ossl.visnir_cols].values, 
    cols=ossl.visnir_cols, 
    wmin=4000, wmax=25000
)
print(f'Original wavenumbers: {ossl._extract_wavenumbers(ossl.visnir_cols).min()} - {ossl._extract_wavenumbers(ossl.visnir_cols).max()}')
print(f'Filtered wavenumbers: {wavenumbers.min()} - {wavenumbers.max()}')
print(f'Spectra shape: {spectra.shape}')
print(f'Filtered columns. From: {cols[0]} to: {cols[-1]}')
Original wavenumbers: 4000 - 28571
Filtered wavenumbers: 4000 - 25000
Spectra shape: (135651, 1051)
Filtered columns. From: scan_visnir.400_ref to: scan_visnir.2500_ref
IMPORTANT: Note that by default, both VISNIR and MIR spectra are converted to wavenumbers.
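The output above pairs the columns `scan_visnir.400_ref` and `scan_visnir.2500_ref` with 25000 and 4000, which matches the standard relation between wavelength in nm and wavenumber in cm⁻¹ (wavenumber = 10⁷ / wavelength). The sketch below assumes the VISNIR column names carry wavelengths in nm; `nm_to_wavenumber` is a hypothetical helper, and truncating to an integer is done here for display only, since how the library rounds is not shown.

```python
def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nm to a wavenumber in cm^-1 (1 cm = 1e7 nm)."""
    return 1e7 / wavelength_nm

for nm in (350, 400, 2500):
    print(nm, 'nm ->', int(nm_to_wavenumber(nm)), 'cm^-1')
```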
OSSLData.get_visnir (wmin:Optional[int]=4000, wmax:Optional[int]=25000)
Get VISNIR spectra within specified wavenumber range
| | Type | Default | Details |
|---|---|---|---|
| wmin | Optional | 4000 | Min wavenumber |
| wmax | Optional | 25000 | Max wavenumber |
| Returns | SpectraData | | VISNIR data |
For instance, to retrieve the VISNIR spectra between 8000 and 25000 cm⁻¹:
OSSLData.get_mir (wmin:Optional[int]=600, wmax:Optional[int]=4000)
Get MIR spectra within specified wavenumber range
| | Type | Default | Details |
|---|---|---|---|
| wmin | Optional | 600 | Min wavenumber |
| wmax | Optional | 4000 | Max wavenumber |
| Returns | SpectraData | | MIR data |
For instance, to retrieve the MIR spectra between 600 and 4000 cm⁻¹ (the default range):
mir_data = ossl.get_mir()
mir_data.spectra.shape, mir_data.wavenumbers.min(), mir_data.wavenumbers.max()
((85684, 1701), np.int64(600), np.int64(4000))
OSSLData.get_properties (properties=None, require_complete:bool=False)
Get properties data with sample IDs
| | Type | Default | Details |
|---|---|---|---|
| properties | NoneType | None | Properties |
| require_complete | bool | False | If True, only return samples with no null values |
| Returns | DataFrame | | Selected properties data |
Get only complete MIR spectra:
Get properties needed as ML targets (must be complete):
targets = ossl.get_properties(['cec_usda.a723_cmolc.kg'], require_complete=True)
targets.shape, targets.head()
((57064, 1),
         cec_usda.a723_cmolc.kg
 id                            
 S40857                6.633217
 S40858                3.822628
 S40859                3.427324
 S40860                1.906545
 S40861               13.403203)
Get optional metadata (can have NaN values):
metadata = ossl.get_properties(['longitude.point_wgs84_dd', 'latitude.point_wgs84_dd'], require_complete=False)
metadata.shape, metadata.head()
((135651, 2),
            longitude.point_wgs84_dd  latitude.point_wgs84_dd
 id                                                          
 icr072246                 15.687492                -7.377750
 icr072247                 15.687492                -7.377750
 icr072266                 15.687817                -7.351243
 icr072267                 15.687817                -7.351243
 icr072286                 15.687965                -7.331673)
OSSLData.get_aligned_data (spectra_data:soilspecdata.types.SpectraData, target_cols:Union[str,List[str]])
Get aligned spectra and target data for ML, along with their sample IDs
| | Type | Details |
|---|---|---|
| spectra_data | SpectraData | Spectra data |
| target_cols | Union | Target columns |
| Returns | Tuple | Aligned spectra, targets, sample IDs |
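"Aligned" here means that row i of the spectra, row i of the targets, and entry i of the sample IDs all describe the same sample. The idea can be sketched in plain Python as keeping only the samples that have both a spectrum and a non-null target. `align_sketch` and the toy dictionaries are hypothetical, and the library itself works on arrays and DataFrames rather than dictionaries.

```python
from typing import Dict, List, Optional, Tuple

def align_sketch(
    spectra: Dict[str, List[float]],
    targets: Dict[str, Optional[float]],
) -> Tuple[List[List[float]], List[float], List[str]]:
    """Hypothetical helper: keep samples that have both a spectrum and a non-null target."""
    ids = [sid for sid in spectra if targets.get(sid) is not None]
    X = [spectra[sid] for sid in ids]
    y = [targets[sid] for sid in ids]
    return X, y, ids

toy_spectra = {'s1': [0.1, 0.2], 's2': [0.3, 0.4], 's3': [0.5, 0.6]}
toy_targets = {'s1': 6.6, 's2': None, 's4': 1.9}
X, y, ids = align_sketch(toy_spectra, toy_targets)
print(ids, X, y)
```

In this toy case only `s1` survives: `s2` has a null target, `s3` has no target at all, and `s4` has no spectrum.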
For instance, to retrieve the MIR spectra and the corresponding CEC values in a form suitable for a machine learning or deep learning pipeline:
X, y, ids = ossl.get_aligned_data(
    spectra_data=mir_data,
    target_cols='cec_usda.a723_cmolc.kg'
)
X.shape, y.shape, ids.shape
((3, 3), (3, 1), (3,))
Later, if you need metadata for these samples: