Sample uniqueness

What constitutes a single sample in the context of MARIS database?

As defined in OpenRefine MARIS currently, unique sample IDs are defined as the concatenation of the following columns:

ref_id
latitude
longitude
begperiod
samptype_id
salinity (if available)
sliceup (if available)
slicedown (if available)
sampdepth (if available)
samplabcode (if available)
species_id (if available)
bodypar_id (if available)
station (if available)
SedRepName (if available)

Rule 1

We also use station and samplabcode, when available. Station is a name given to a sampling location, samplabcode is the data provider’s unique ID. TO BE CLARIFIED

Seawater

As you can see in most cases for seawater we can use the required information – lat, lon, time, sample depth – to define a unique sample ID to which we can link measurements. If sample depth is not provided we assume surface and indicate this using a value = -1 beforehand.

Questions: - cases where sample depth is not provided several times at the same location and time?

Sediment

For sediment we extend this a bit using top, bottom (sliceup and slicedown), SedRepName (mud, clay, …?).

If top and bottom are missing for a sediment sample we assume that it is a grab sample from the surface. In this case I think we set top: -1 to indicate this (I will check for sure).

Questions: - so in essence same as above

Biota

For biota it is extended using the taxon and tissue IDs (species_id, bodypar_id).

Item 1 examples:

in geotraces when we have the same (e.g for seawater) lon, lat, time, smp_depth for several nuclides measurement in a given rosette (at least that’s what I understand);

We ignore rosette (cast?) and bottle IDs. We assume that all measurements with the same lon, lat, time, depth are the same sample. Salinity is used to provide further confidence. Geotraces is more or less a big edge case. Normally data is not provided with such detail.

Questions: - so it means that we have to include nuclide type to get unicity

or in OSPAR sediment when we have records where top, bottom is NaN for a given lon, lat, time. In that case our compound index would be (lon, lat, time, top, bottom);

See above, if missing top = -1. There should not be multiple grab sediment samples for the same lat, lon and time. If there are then it probably indicates that a core was taken but slice top and bottom is missing. In this case the records should be ignored until this information is provided.

In the HELCOM example where top and bottom are NaN and actual values for the same lat, lon, time then I would assume a grab sample and a core were taken simultaneously so they are all different samples. If there are multiple records with NaN for top and bottom then it could be multiple grab samples at the same location and time but this would be unusual and should be queried with the data provider.

or (not sure it happens sometimes but that’s a point Niall mentioned) when we have replicates at the same location, time, depth, …

Sometimes there are replicates, yes. E.g. sometimes TEPCO report quick results for certain samples and then reanalyse them for a longer time for more precision – they report both. So in this case replicates are valid.

Question: - what do we do in such case?

Rule 2: Location must be inferred.

Another example is when we do not have detailed information about sampling location or time and are forced to make general assumptions (e.g. the location of a port where multiple samples of the same species are landed and/or the sampling date for such samples is reported simply as a year or a quarter and we are forced to assume the mid-point) then, unless samplabcode is provided, there can be replicates. Currently we can live with this (though if we spot it we can force unique sample IDs by temporarily injecting dummy values for samplabcode which are removed after the sample IDs are generated).

Niall’s situation

in geotraces when we have the same (e.g for seawater) lon, lat, time, smp_depth for several nuclides measurement in a given rosette (at least that’s what I understand);
in OSPAR sediment when we have records where top, bottom is NaN for a given lon, lat, time. In that case our compound index would be (lon, lat, time, top, bottom);
In situations where a nuclide is measured for a sample using more than one method
In situations where rapid analysis and detailed analysis is reported (rapid -while arriving at lab- vs detailed measurement afterwards);
In a situation where a sample is collected and split into two or more sub-samples. For this sample the compound index would be the same. Sometimes this type of sample is sent to several laboratories (ring trial/inter-lab comparison);
In situations where a nuclide is measured for a sample using more than one method (e.g. Am241 normally measured by alpha and gamma spectrometry);