Spatial prediction and assessment of Soil Organic Carbon

Prepared as a supplementary material for the following research articles:

This step-by-step tutorial explains how to map Soil Organic Carbon Stock (OCS) using soil samples (point data). We demonstrate derivation of values both at site level (per profile) and by using raster calculus (per pixel). We also show how to estimate total OCS for an area of interest (which can be a field plot, a farm and/or an administrative region). The R script can be downloaded from github. Instructions on how to install and set up all software used in this example can be found here. For an introduction to soil mapping using Machine Learning Algorithms refer to this tutorial. To download global soil organic carbon (content, density and stock) maps at 250 m resolution visit ftp.soilgrids.org/data/recent/. To access ISRIC's global compilation of soil profiles please refer to: http://www.isric.org/explore/wosis.

Carbon below ground can be organic or non-organic/mineral (usually carbonates and bicarbonates, e.g. CaCO3 in the parent material). Organic carbon stock below ground (0–2 m) in terrestrial ecosystems consists of two major components:

  1. Living organism biomass i.e. mainly:
    • Plant roots,
    • Microbial biomass (Xu et al., 2012),
  2. Plant and animal residues at various stages of decomposition (organic matter).

Xu et al. (2013) estimated the global microbial biomass at about 17 Pg C, which is only about 2% of the total organic matter; hence the amount of C in microbial biomass can usually be neglected in comparison to the total stock. If one were to include all living organisms, however, and especially tree roots, the portion of C in living organisms could be more significant, especially in areas under dense forest.

Soil Organic Carbon Stock (OCS) is the mass of soil organic carbon per standard area and for a specific depth interval, usually expressed in kg/m2 or t/ha. It can be derived from (laboratory and/or field) measurements of soil organic carbon content (ORC; expressed in g/kg of the <2 mm mineral earth), taking into account bulk density (BLD), thickness of the soil layer (HOT), and the volume percentage of coarse fragments (CRF) (Nelson and Sommers, 1982; Poeplau et al., 2017):

OCS [kg/m2] = ORC [kg/kg] × BLD [kg/m3] × (1 − CRF [m3/m3]) × HOT [m]

(ORC reported in g/kg is divided by 1000, and CRF reported in % is divided by 100, to obtain the fractions used in the formula)

The correction for gravel content is necessary because only material less than 2 mm is analyzed for ORC concentration. Note also that OCS always refers to a specific depth interval or horizon thickness (HOT), e.g.:

kg/m2 for the depth interval 0–30 cm (IPCC; Berhongaray and Alvarez, 2013).

Values of OCS in kg/m2 can also be expressed in tons/ha, in which case a simple conversion formula can be applied:

1 × kg/m2 = 10 × tons/ha

Total OCS for an area of interest can be derived by multiplying OCS by total area e.g.:

120 tons/ha × 1 km2 = 120 × 100 = 12,000 tons
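
The formula and the unit conversions above can be sketched in base R; the horizon values below are made-up illustration numbers, not from any profile in this tutorial:

```r
# Made-up example horizon:
ORC <- 10 / 1000    # 10 g/kg organic carbon, converted to a mass fraction (kg/kg)
BLD <- 1300         # bulk density in kg/m3
CRF <- 0.10         # 10% coarse fragments, as a volume fraction
HOT <- 0.30         # 0-30 cm depth interval, i.e. thickness of 0.3 m

OCS <- ORC * BLD * (1 - CRF) * HOT  # stock in kg/m2
OCS.tha <- OCS * 10                 # 1 kg/m2 = 10 tons/ha
total.t <- OCS.tha * 100            # total stock for 1 km2 (= 100 ha), in tonnes
```

This gives an OCS of about 3.5 kg/m2, i.e. 35 tons/ha, or some 3,510 tonnes over 1 km2.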

Example of OCS calculus and OCS propagated error for a given volume of soil using organic carbon content (ORC), bulk density (BLD), thickness of horizon (HOT), and percentage of coarse fragments (CRF). After Hengl et al. (2014). OCSKGM function available via the GSIF package.

Another way to express soil organic carbon is through soil organic carbon density (OCD in kg/m3), which is in fact equivalent to OCS divided by the horizon thickness:

OCD [kg/m3] = ORC [kg/kg] × BLD [kg/m3] × (1 − CRF [m3/m3]) = OCS / HOT

While OCS is a summary measure of SOC, always associated with a specific depth interval, OCD is a relative measure of soil organic carbon distribution and can be associated with any support size, i.e. with an arbitrary depth. In principle, OCD (kg/m3) is strongly correlated with ORC (g/kg), as indicated in the figure below; however, depending on soil mineralogy and coarse fragment content, OCD can be lower or higher than what the smoothed line indicates (notice that the range of values around the smoothed line is relatively wide). It is important to understand, however, that as long as ORC, BLD and CRF are known, one can convert between ORC, OCD and OCS values without losing any information about the soil organic carbon stock.
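
The equivalence OCD = OCS / HOT can be verified numerically; the horizon values here are made-up purely for illustration:

```r
# Made-up horizon values (same conventions as the formulas above):
ORC <- 10 / 1000; BLD <- 1300; CRF <- 0.10; HOT <- 0.30
OCD <- ORC * BLD * (1 - CRF)  # organic carbon density in kg/m3
OCS <- OCD * HOT              # stock for the 0-30 cm interval in kg/m2
```

Dividing the resulting OCS by the thickness HOT recovers the same OCD, whatever depth interval is chosen.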

Correlation between soil organic carbon density and soil organic carbon content (displayed on a log-scale), created using a global compilation of soil profile data. Values 1, 2, 3, 4, 5 and 6 on the plot (log scale) correspond to values 2, 6, 19, 54, 147 and 402. Note that for ORC ≫ 12%, the OCD line flattens, which means that organic carbon density practically stops increasing with further increases in ORC content.

In summary, there are four main variables to represent soil organic carbon:

  1. Soil Organic Carbon fraction or content (ORC) in g/kg (permille) or dg/kg (percent),
  2. Soil Organic Carbon Density (OCD) in kg/m3,
  3. Soil Organic Carbon Stock (OCS) in kg/m2 or in tons/ha and for the given soil depth interval,
  4. Total Soil Organic Carbon Stock (TOCS) in million tonnes or Pg, i.e. OCS multiplied by the surface area.

Global estimates of the total soil organic carbon stock are highly variable (Scharlemann et al., 2014): current estimates of the total stock range between 800–2100 Pg C (for 0–100 cm), with a median estimate of about 1500 Pg C. This means that the average OCS for the 0–100 cm depth interval over the land mask (148,940,000 km2) is about 11 kg/m2 or 110 tons/ha, and that the average soil organic carbon density (OCD) is about 11 kg/m3 (compare to a standard bulk density of fine earth of 1250 kg/m3); the corresponding OCS for the 0–30 cm depth interval is 7 kg/m2, i.e. the average OCD is about 23 kg/m3.
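
As a back-of-envelope check of these averages (1 Pg = 10^12 kg; land mask area as given above):

```r
TOCS <- 1500e12            # median estimate of the global 0-100 cm stock: 1500 Pg C, in kg
area <- 148940000 * 1e6    # land mask, converted from km2 to m2
OCS.avg <- TOCS / area     # average OCS for 0-100 cm, in kg/m2
```

This gives roughly 10 kg/m2, the same order of magnitude as the ~11 kg/m2 quoted above; over a 1 m depth interval the corresponding average OCD is numerically the same in kg/m3.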

The distribution of soil organic carbon in the world is, however, highly patchy, with large areas with OCS ≪ 100 tons/ha, and some 'pockets' of accumulated organic material, i.e. organic soil types (Histosols), with OCS up to 850 tons/ha (for the 0–30 cm depth interval). The world's soil organic matter accumulation areas typically fall in the following biomes / land cover classes: wetlands and peatlands, mangroves, tundra and taiga.

Land use, and agriculture in particular, has led to dramatic decreases in soil carbon stocks over the last 200+ years (the agricultural and industrial revolutions). Lal (2004) estimated that approximately 54 Pg C have been added to the atmosphere due to agricultural activities, with another 26 Pg C lost from soils due to erosion. Wei et al. (2014) estimated that, on average, conversion from forest to various agricultural land uses results in a 30–50% decrease of SOCS. Modelling and monitoring of soil organic carbon dynamics is therefore of increasing importance (see e.g. the FAO report "Unlocking the Potential of Soil Organic Carbon").

As mentioned previously, OCS is most commonly derived from measurements of organic carbon content (ORC), soil bulk density (BLD) and the volume fraction of gravel (CRF). These are usually sampled per soil layer or soil horizon (a sequence of horizons makes a soil profile), which can refer to variable, i.e. non-standard, depth intervals. That means that, before OCS can be determined for standard fixed depth intervals (e.g. 0–30 cm or 0–100 cm), the values of ORC, BLD and CRF need to be standardized so that they refer to common depth intervals.

Consider, for example, the following two real-life examples of soil profile data, one for a standard agricultural soil and one for an organic soil. In the first example, a profile from Australia, the soil profile data show:

upper limit   lower limit   organic carbon     bulk density   CF    SOCS
(cm)          (cm)          content (g/kg)     (kg/m3)        (%)   (kg/m2)
0             10            8.2                1340*          6*    1.1
10            20            7.5                1367*          6*    1.0
20            55            6.1                1382*          7*    3.0
55            90            3.3                1433*          8*    1.7
90            116           1.6                1465*          8*    0.6

(* estimated values)

Note that the BLD variable was not available for the described horizons (the original soil profile description / laboratory data indicates that no BLD was observed for this profile), hence we use BLD estimated from SoilGrids250m data (the values marked with * in the table above). It (unfortunately) commonly happens that soil profile observations lack BLD measurements, in which case BLD needs to be generated using a Pedo-Transfer Function (PTF) or extracted from soil maps.

To determine OCS for the standard depth intervals 0–30, 0–100 and 0–200 cm, we first fit mass-preserving splines:

> library(GSIF)
> library(aqp)
> library(plyr)
> lon = 149.73; lat = -30.09;
> id = "399_EDGEROI_ed079"; TIMESTRR = "1987-01-05"
> top = c(0, 10, 20, 55, 90)
> bottom = c(10, 20, 55, 90, 116)
> ORC = c(8.2, 7.5, 6.1, 3.3, 1.6)
> BLD = c(1340, 1367, 1382, 1433, 1465)
> CRF = c(6, 6, 7, 8, 8)
> #OCS = OCSKGM(ORC, BLD, CRF, HSIZE=bottom-top)
> prof1 <- join(data.frame(id, top, bottom, ORC, BLD, CRF),
+               data.frame(id, lon, lat, TIMESTRR), type='inner')
Joining by: id
> depths(prof1) <- id ~ top + bottom
> site(prof1) <- ~ lon + lat + TIMESTRR
> coordinates(prof1) <- ~ lon + lat
> proj4string(prof1) <- CRS("+proj=longlat +datum=WGS84")
> ORC.s <- mpspline(prof1, var.name="ORC", d=t(c(0,30,100,200)), vhigh = 2200)
Fitting mass preserving splines per profile...
  |=====================================================================================| 100%
> BLD.s <- mpspline(prof1, var.name="BLD", d=t(c(0,30,100,200)), vhigh = 2200)
Fitting mass preserving splines per profile...
  |=====================================================================================| 100%
> CRF.s <- mpspline(prof1, var.name="CRF", d=t(c(0,30,100,200)), vhigh = 2200)
Fitting mass preserving splines per profile...
  |=====================================================================================| 100%

Now we can derive OCS for each standard depth interval by using:

> OCSKGM(ORC.s$var.std$`0-30 cm`, BLD.s$var.std$`0-30 cm`, CRF.s$var.std$`0-30 cm`, HSIZE=30)
[1] 2.875408
attr(,"measurementError")
[1] 3.84
attr(,"units")
[1] "kilograms per square-meter"
> OCSKGM(ORC.s$var.std$`30-100 cm`, BLD.s$var.std$`30-100 cm`, CRF.s$var.std$`30-100 cm`, HSIZE=70)
[1] 3.616302
attr(,"measurementError")
[1] 9.18
attr(,"units")
[1] "kilograms per square-meter"

This gives the following estimates of OCS for standard depth intervals:

  • 0–30 cm: 2.9 kg/m2
  • 0–100 cm: 6.5 kg/m2
  • 0–200 cm: 8.5 kg/m2 (85 tonnes/ha)
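
Since OCS is additive over depth intervals, the 0–100 cm estimate is simply the sum of the two OCSKGM results above:

```r
ocs.030 <- 2.875408     # kg/m2 for 0-30 cm, from OCSKGM above
ocs.30100 <- 3.616302   # kg/m2 for 30-100 cm, from OCSKGM above
ocs.0100 <- ocs.030 + ocs.30100
round(ocs.0100, 1)      # ~6.5 kg/m2, i.e. the 0-100 cm value listed above
```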

Values of OCS between 5–35 kg/m2 for 0–100 cm are most common for a variety of mineral soils with e.g. 1–3% soil organic carbon.

Determination of soil organic carbon density and stock for standard depth intervals: example of a mineral soil profile from Australia.

In the second example we look at a profile from Canada (a histosol with >40% of organic carbon):

upper limit   lower limit   organic carbon     bulk density   CF    SOCS
(cm)          (cm)          content (g/kg)     (kg/m3)        (%)   (kg/m2)
0             31            472                179*           5*    24.9
31            61            492                166*           6*    23.0
61            91            487                169*           6*    23.2
91            122           502                160*           6*    23.4
122           130           59                 787*           6*    3.5

(* estimated values)

Here also the BLD values were missing and hence needed to be estimated. For this we can use a simple Pedo-Transfer rule, e.g. from Köchy et al. (2015):

BLD.f = (-0.31 * log1p(ORC/10) + 1.38)*1000
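
Applying this rule to the first horizon of the Canadian profile (ORC = 472 g/kg) reproduces the starred value in the table:

```r
# Koechy et al. (2015) rule: ORC in g/kg in, estimated BLD in kg/m3 out
BLD.f <- function(ORC){ (-0.31 * log1p(ORC / 10) + 1.38) * 1000 }
round(BLD.f(472))   # first horizon (ORC = 472 g/kg) -> 179
```

The remaining starred BLD values in the table can be reproduced in the same way (the last horizon, with ORC = 59 g/kg, gives roughly 780 kg/m3, close to the tabulated 787*).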

We can again fit mass-preserving splines and determine OCS for standard depth intervals using the same functions as applied to profile 1. This finally gives the following estimates:

  • 0–30 cm: 24.0 kg/m2
  • 0–100 cm: 72.8 kg/m2
  • 0–200 cm: 129.4 kg/m2 (1294 tonnes/ha)

Determination of soil organic carbon density and stock for standard depth intervals: example of an organic soil profile from Canada.

Note that only 3–4% of all soil profiles in the world have an organic carbon content above 8% (soils with ORC >12% are often classified as organic soils or Histosols in the USDA and/or WRB classification systems and are even less frequent), hence soil-depth functions of organic carbon content, and derivation of OCS for organic soils, are specific to the patches of organic soils. On the other hand, organic soils carry much more total OCS. Precise processing and mapping of organic soils is often crucial for accurate estimation of total OCS for large areas, and hence it is fairly important to use a good PTF to fill in missing BLD values for organic soils. As a rule of thumb, an organic soil will rarely have a bulk density above some 120 kg/m3: even though SOC content can be >50%, the bulk density of such soil gets proportionally lower, since bulk density is physically bound to how the material is organized in the soil (unless the soil is artificially compacted). Also, getting correct estimates of coarse fragments is important, as otherwise (if CRF is ignored) the total stock can be over-estimated by >100% (Poeplau et al., 2017).

In cases where bulk density is missing and no local PTF exists, WoSIS points (a global compilation of soil profiles) can be used to fit a PTF that fills the gaps in bulk density measurements globally. A regression matrix extracted on 15 May 2017 (containing harmonized values for BLD, organic carbon content, pH, sand and clay content, depth of horizon and USDA soil type at some 20,000 soil profiles world-wide) can be fitted using a random forest model (Ramcharan et al., 2017):

> dfs_tbl = readRDS("wosis_tbl.rds")
> ind.tax = readRDS("ov_taxousda.rds")
> library(ranger)
> fm.BLD = as.formula(paste("BLD ~ ORCDRC + CLYPPT + SNDPPT + PHIHOX + DEPTH.f +", paste(names(ind.tax), collapse="+")))
> m.BLD_PTF <- ranger(fm.BLD, dfs_tbl, num.trees = 85, importance='impurity')
> m.BLD_PTF
...
Type:                             Regression
Number of trees:                  85
Sample size:                      98650
Number of independent variables:  70
Mtry:                             8
Target node size:                 5
Variable importance mode:         impurity
OOB prediction error:             32782.78
R squared:                        0.5431644

This shows somewhat limited accuracy, i.e. an RMSE of about ±180 kg/m3, but is still probably better than dropping all observations without bulk density from the SOC assessment. A disadvantage of this model is that, in order to predict BLD at new locations, we also need measurements of texture fractions, pH and, of course, organic carbon. For example, an Udalf (TAXOUSDA84) with 1.1% organic carbon, 22% clay, pH of 6.5, 35% sand and at a depth of 5 cm would result in a bulk density of:
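
The ±180 kg/m3 figure is simply the square root of the OOB prediction error (a mean squared error) reported by ranger above:

```r
oob.mse <- 32782.78   # OOB prediction error (MSE) reported by ranger above
round(sqrt(oob.mse))  # corresponding RMSE in kg/m3 -> 181
```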

> ind.tax.new = ind.tax[which(ind.tax$TAXOUSDA84==1)[1],]
> predict(m.BLD_PTF, cbind(data.frame(ORCDRC=11, CLYPPT=22, PHIHOX=6.5, SNDPPT=35, DEPTH.f=5), ind.tax.new))$predictions
[1] 1532.635

Note also that the PTF above requires USDA suborder values per point location, following the SoilGrids legend and formatted as in the ind.tax object. Unfortunately, the model above probably over-estimates bulk density for organic soils, as these are usually under-represented in the training data, i.e. often not available (consider a Saprist with 32% organic carbon):

> ind.tax.new = ind.tax[which(ind.tax$TAXOUSDA13==1)[1],]
> predict(m.BLD_PTF, cbind(data.frame(ORCDRC=320, CLYPPT=8, PHIHOX=5.5, SNDPPT=45, DEPTH.f=10), ind.tax.new))$predictions
[1] 766.1135

An alternative to estimating BLD is to use the ORC values only, e.g. by fitting a loess model through the global regression matrix (the object ovA; see the plot below):

> m.BLD_ls = loess(BLD ~ ORCDRC, ovA, span=1/18)
> predict(m.BLD_ls, data.frame(ORCDRC=220))
       1
329.2059

This gives a value almost two times lower than the random forest-based PTF above. Over-estimating BLD by a factor of two would also result in a two times higher OCS, hence accurate information on BLD can clearly be crucial for any OCS monitoring project. The PTF fitted using random forest above is likely over-estimating BLD values, mainly because there are not enough training points in organic soils that have measurements of ORC, BLD, soil pH and texture fractions at the same time (if ANY of the calibration measurements is missing, the whole horizon is taken out of calibration, and hence entire ranges of BLD can be misrepresented).

Correlation plot between soil organic carbon density and bulk density (fine earth), created using the global compilations of soil profile data. Black line indicates fitted loess polynomial surface. Note that there is still quite some scatter around the fitted line: many combinations of BLD and ORC, that do not fall close to the correlation line, can still be observed.

To fill-in missing values for BLD, SoilGrids project uses a combination of the two global Pedo-Transfer functions: (1) PTF fitted using random forest model that locally predicts BLD as a function of organic carbon content, clay and sand content, pH and coarse fragments, and (2) simpler model that predicts BLD just based on ORC. The average RMSE of these PTFs for BLD is about ±150 kg/m3.

For mineral soils, the relationship between soil organic carbon and soil depth follows a log-log relationship which can be approximated with the following (global) model (R-square: 0.36; see figure below):

ORC(depth) = exp[4.1517 − 0.60934 × log(depth)]
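
This depth function can be evaluated directly; a minimal sketch (assuming, following the accompanying figure, depth in cm and ORC in g/kg):

```r
# Global depth function for ORC in mineral soils (units assumed: depth in cm, ORC in g/kg)
orc.d <- function(depth){ exp(4.1517 - 0.60934 * log(depth)) }
orc.d(10)    # topsoil
orc.d(100)   # around 1 m depth; note the decay of ORC with depth
```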

This also illustrates that any organic carbon spatial prediction model can significantly profit from including depth into the statistical modelling.

Globally fitted regression model for predicting soil organic carbon using depth only (log-log regression) and (a) individual soil profile from the ISRIC soil monolith collection. After Hengl et al. (2014). Dots shown on the plot on the right are the actual ORC values for the profile on the left.

In summary, PTFs can be efficiently used to fill gaps in BLD values (BLD is usually highly correlated with organic carbon content and depth; texture fractions, soil classification and soil pH can also help improve the accuracy of the PTFs); however, for organic soils there is in general less calibration data, and hence the errors are potentially higher. Mistakes in estimating BLD can result in systematic and significant over- or under-estimation of the actual stock; on the other hand, removing all soil horizons without BLD measurements from the OCS assessment also leads to poorer accuracy, as fewer points are included in training the spatial prediction models. Especially for organic soils (>12% organic carbon), there is no easy solution for filling in missing BLD values, and collecting additional (local) calibration points might be unavoidable. Lobsey and Viscarra Rossel (2016) have recently proposed a method that combines gamma-ray attenuation and visible–near infrared (vis–NIR) spectroscopy to measure bulk density ex situ, using samples that are fresh, wet and under field conditions. Hopefully missing BLD measurements will be less and less of a problem in the future.

Most projects focused on monitoring OCS require that an estimate of OCS is provided for the whole area of interest, so that users can also visually explore the spatial patterns of OCS. In this tutorial we demonstrate how to generate maps of OCS using point samples and RS-based covariates. The output of this process is usually a gridded map (SpatialPixelsDataFrame) covering the area of interest (plot, farm, administrative unit or similar). Once OCS is mapped, we can multiply the OCS value of each pixel by the pixel area and sum over all pixels to compute the total OCS in tonnes, using the formula above. Predicted OCS values can also be aggregated per land cover group or similar. If a series of OCS maps is produced for the same area of interest (a time-series of OCS), these can be used to derive OCS change per pixel.

In principle, there are three main approaches to estimating total OCS for an area of interest:

  • By directly predicting OCS, here called the 2D approach to OCS mapping (this often requires vertical aggregation / modeling of soil variable depth curves, as indicated above),
  • By predicting ORC, BLD and CRF, and then deriving OCS per layer, here called the 3D approach to OCS mapping with ORC, BLD and CRF mapped separately,
  • By deriving OCD (organic carbon density) and then directly predicting OCD and converting it to OCS, here called the 3D approach to OCS mapping via direct modeling of OCD.

Three main computational paths (2D and 3D) to producing maps of organic carbon stock. In this tutorial we in general recommend using path 3 (direct modeling of OCD) for modeling organic carbon stock.

Although 2D prediction of OCS from point data seems more straightforward, many soil profiles contain measurements at non-standard depth intervals (with varying support sizes), and hence 2D modeling of OCS can often be cumbersome. In most situations where legacy soil profile data are used, 3D modeling of OCD is probably the most elegant solution to mapping OCS because:

  • No vertical aggregation of values via spline fitting or similar is needed to standardize values per standard depths,
  • No additional uncertainty is introduced (in the case of the 2D approach splines likely introduce some extra uncertainty in the model),
  • Predictions of OCD/OCS can be generated for any depth interval using the same model (i.e. predictions are based on a single 3D model),

A disadvantage of 3D modeling of OCD, however, is that the correlation with covariate layers can be less clear than if separate models are built for ORC, BLD and CRF: because OCD is a composite variable, it can often be difficult to distinguish whether values are lower or higher due to differences in ORC, BLD or CRF. We leave it to users to compare the various approaches to OCS mapping and then select the method that achieves the best accuracy and/or is most fit for use in their applications.

The purpose of spatial prediction is (a) to produce a map showing the spatial distribution of the variable of interest for the area of interest, and (b) to do so in an unbiased way. A comprehensive framework for evaluating spatial predictions is the caret approach (Kuhn and Johnson, 2013), which wraps up many of the standard processes such as model training and validation, method comparison and visualization. Consider, for example, the meuse data set, often used to demonstrate geostatistical modeling steps, which contains 155 measurements of organic matter % in topsoil. We can quickly compare the performance of GLM vs random forest vs no model for predicting organic matter (om):

> library(caret); library(rgdal)
> demo(meuse, echo=FALSE)
> fitControl <- trainControl(method="repeatedcv", number=2, repeats=2)
> meuse.ov <- cbind(over(meuse, meuse.grid), meuse@data)
> mFit0 <- train(om~1, data=meuse.ov, method="glm", family=gaussian(link=log), trControl=fitControl, na.action=na.omit)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
> mFit1 <- train(om~soil, data=meuse.ov, method="glm", family=gaussian(link=log), trControl=fitControl, na.action=na.omit)
> mFit2 <- train(om~dist+soil+ffreq, data=meuse.ov, method="glm", family=gaussian(link=log), trControl=fitControl, na.action=na.omit)
> mFit3 <- train(om~dist+soil+ffreq, data=meuse.ov, method="ranger", trControl=fitControl, na.action=na.omit)

so that we can compare the performance of the fitted models by using:

> resamps <- resamples(list(Mean=mFit0, Soilmap=mFit1, GLM=mFit2, RF=mFit3))
> bwplot(resamps, layout = c(3, 1))

Comparison of spatial prediction accuracy (RMSE at cross-validation points) for simple averaging (Mean), GLM with only the soil map as covariate (Soilmap), and GLM and random forest (RF) models with all possible covariates. Error bars indicate the range of RMSE values for repeated CV.

In the case above, it appears that random forest (the ranger package) decreases the mean RMSE of predicting organic matter by about 32%:

> round((1-min(mFit3$results$RMSE)/min(mFit0$results$RMSE))*100)
[1] 32

There is thus certainly added value in using spatial covariates (here: distance to water and flooding frequency maps) and in using machine learning for spatial prediction, even with smaller data sets.

Note also that the assessment of spatial prediction accuracy for the three models based on the train function above is model-free, i.e. cross-validation of the models is independent of the models used, because at each cross-validation subset the model fitting is repeated and the validation points are kept away from model training. Subsetting point samples is not always trivial, however: for cross-validation to be considered completely reliable, the samples ought to be representative of the study area and preferably collected using an objective design such as simple random sampling (Brus et al., 2011). If the sampling locations are clustered in geographical space, i.e. if some parts of the study area are completely omitted from sampling, then the results of cross-validation will also reflect that sampling bias / poor representation. In all the following examples we will assume that cross-validation gives a reliable measure of mapping accuracy, and we will use it as the basis of accuracy assessment, i.e. mapping efficiency. In reality, cross-validation can be tricky to implement and can lead to somewhat over-optimistic results if sampling bias exists and/or if there are too few points for model validation. For example, in the case of soil profile data, it is highly recommended that whole profiles are taken out of the CV subsets, because soil horizons within a profile are too strongly correlated (as discussed in detail in Gasch et al., 2015).
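
The last point — keeping whole profiles together during cross-validation — amounts to splitting on profile IDs rather than on individual horizons; a minimal base-R sketch with made-up profile IDs:

```r
# Made-up horizon table: 3 profiles with several horizons each
h <- data.frame(profile.id = c("P1","P1","P1","P2","P2","P3","P3","P3"),
                ORC = c(8.2, 7.5, 6.1, 12.1, 9.3, 47.2, 49.2, 5.9))
set.seed(1)
ids <- unique(h$profile.id)
test.ids <- sample(ids, size = 1)              # hold out whole profiles, not horizons
test <- h[h$profile.id %in% test.ids, ]
train <- h[!(h$profile.id %in% test.ids), ]
intersect(train$profile.id, test$profile.id)   # character(0): no profile in both sets
```

Because the split is done on IDs, correlated horizons from the same profile can never appear on both sides of the split.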

The whole process of spatial prediction of soil properties could be summarized in 5 steps:

  1. Initial model comparison (comparison of prediction accuracy and computing time).
  2. Selection of applicable model(s) and estimation of model parameters i.e. model fitting.
  3. Predictions i.e. generation of maps for all areas of interest.
  4. Objective accuracy assessment using independent (cross-)validation.
  5. Export and sharing of maps and summary documentation explaining all processing steps.

Studying the caret package tutorial is highly recommended for anyone looking for a systematic introduction to predictive modelling.

As we have shown in the previous example, adding relevant covariates that can explain the distribution of soil organic carbon increases the accuracy of spatial predictions. Hence, prior to generating predictions of OCS, it is a good idea to invest in preparing a list of Remote Sensing (RS), geomorphological/lithologic and DEM-based covariates that could potentially help explain the spatial distribution of OCS. Since 2016, many high resolution (30–250 m) covariates with global coverage are publicly available without restrictions. The spatial detail, accessibility and accuracy of RS-based products have been growing exponentially, and there is no evidence that this trend is going to slow down in the coming decades (Herold et al., 2016). The most relevant publicly available remote sensing-based covariates that can be downloaded and used to improve predictive soil mapping at high spatial resolutions are, for example:

Note that the download time for 30 m global RS data can be significant if the data are needed for a larger area (hence you might consider using an RS data processing hub such as Sentinel Hub, Google Earth Engine and/or Amazon Web Services instead of trying to download large mosaics yourself). A number of covariates used for generating SoilGrids can also be accessed from Geonode.isric.org.

The geospt package contains 125 samples of OCS from Colombia, already aggregated to standard depth intervals, hence this data set is ready for 2D mapping of OCS. The data set consists of tabular values for points and a raster map containing the borders of the study area:

> load("COSha10.rda")
> load("COSha30.rda")
> str(COSha30)
'data.frame':    118 obs. of  10 variables:
 $ ID        : Factor w/ 118 levels "S1","S10","S100",..: 1 44 61 89 100 110 2 9 15 21 ...
 $ x         : int  669030 669330 670292 669709 671321 670881 670548 671340 671082 670862 ...
 $ y         : int  448722 448734 448697 448952 448700 448699 448700 448969 448966 448968 ...
 $ DA30      : num  1.65 1.6 1.5 1.32 1.41 1.39 1.51 1.39 1.55 1.63 ...
 $ CO30      : num  0.99 1.33 1.33 1.09 1.04 1.19 1.21 1.36 1.09 1.19 ...
 $ COB1r     : Factor w/ 6 levels "Az","Ci","Cpf",..: 5 5 2 5 2 5 2 2 2 5 ...
 $ S_UDS     : Factor w/ 19 levels "BJa1","BQa1",..: 12 5 12 5 11 12 12 12 12 12 ...
 $ COSha30   : num  49.2 64 59.8 43.1 44.2 ...
 $ Cor4DAidep: num  43.3 56.3 54 37.9 39.9 ...
 $ CorT      : num  1.37 1.39 1.38 1.36 1.36
> load("COSha30map.rda")
> proj4string(COSha30map) = "+proj=utm +zone=18 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"

COSha10 (OCS for 0–10 cm) and COSha30 (OCS for 0–30 cm), both in tons/ha, are values aggregated to standard soil depth intervals, so there is no need for any spline fitting and/or vertical aggregation.

Example of a data set with OCS samples (for 2D prediction). Case study available via the geospt package (Colombia).

We can import a number of RS-based covariates (derived from the global 30 m layers listed previously) into R by:

> covs30m = readRDS("covs30m.rds")

From the DEM layer, we can derive some 8–10 additional DEM derivatives (read more about how to prepare DEM derivatives using SAGA GIS) that could potentially help with mapping OCS.

We can also derive buffer distances from the observation points:

> classes = cut(COSha30$COSha30, breaks=seq(0, 100, length=10))
> covs30mdist = buffer.dist(COSha30["COSha30"], covs30m[1], classes)

and finally convert all these to Principal Components to help separate noise from the main signals:

> covs30m@data = cbind(covs30m@data, covs30mdist@data)
> fm.spc = as.formula(paste(" ~ ", paste(names(covs30m), collapse = "+")))
> fm.spc
~SRTMGL1_SRTMGL1.2_cprof + SRTMGL1_SRTMGL1.2_devmean + SRTMGL1_SRTMGL1.2_openn +
    SRTMGL1_SRTMGL1.2_openp + SRTMGL1_SRTMGL1.2_slope + SRTMGL1_SRTMGL1.2_twi +
    SRTMGL1_SRTMGL1.2_vbf + SRTMGL1_SRTMGL1.2_vdepth + SRTMGL1_SRTMGL1.2 +
    GlobalForestChange2000.2014_first_NIRL00 + GlobalForestChange2000.2014_first_REDL00 +
    GlobalForestChange2000.2014_first_SW1L00 + GlobalForestChange2000.2014_first_SW2L00 +
    GlobalForestChange2000.2014_treecover2000 + layer.1 + layer.2 +
    layer.3 + layer.4 + layer.5 + layer.6 + layer.7 + layer.8
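
The conversion to principal components itself (in this tutorial typically done with GSIF's spc function on the SpatialPixelsDataFrame) can be illustrated with base R's prcomp on a small synthetic covariate table; the variable names below are made up for the illustration:

```r
set.seed(1)
# Synthetic stand-in for the covariate table (covs30m@data in the tutorial)
X <- data.frame(dem = rnorm(100, 500, 50))
X$slope <- 0.02 * X$dem + rnorm(100, 0, 1)   # correlated with dem
X$ndvi  <- rnorm(100, 0.5, 0.1)
pcs <- prcomp(X, scale. = TRUE)   # center and scale, then rotate to principal components
PC <- as.data.frame(pcs$x)        # columns PC1, PC2, PC3 replace the raw covariates
cor(PC$PC1, PC$PC2)               # effectively zero: PCs are uncorrelated by construction
```

Because the components are uncorrelated and ordered by explained variance, the leading PCs carry the main signals while the trailing ones tend to capture noise.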

Using the covariates listed above, we can fit a 2D spatial prediction model with any of the available methods, such as ranger, xgboost and gamboost. We model the target variable as a function of the PCs:

> fm.COSha30
COSha30 ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +
    PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC17 + PC18 +
    PC19 + PC20 + PC21

Comparison of random forest, GLMboost and xgboost shows that (unfortunately) no model stands out, with the amount of variation explained ranging between 10–20%. It is very common in soil mapping projects that the amount of variation the models can explain is low, and hence the average prediction error and/or the prediction intervals are wide. This can happen because measurement errors were high and/or because covariates are missing, but also simply because the natural complexity of soils in the area is high.

Note that our predictions of OCS are somewhat different from the predictions produced by the geospt package authors, although the main patterns are comparable.

Comparison of predictions generated using ordinary kriging (left) and machine learning with 30 m resolution covariates (right).

Assuming that measurements of ORC have also been temporally referenced (at least the year of sampling), the points can be used to build spatiotemporal models of soil organic carbon. Consider, for example, the soil profile data available for the conterminous USA:

> OCD_stN <- readRDS("usa48.OCD_spacetime_matrix.rds")

These data show that there are enough observations spread over time (the last 60+ years) to fit a spatiotemporal model:

> hist(OCD_stN$YEAR, main="Distribution of soil observations based on sampling year", xlab="Year", col="darkgrey")

Histogram of sampling years for the conterminous USA. Data source: the USA National Cooperative Soil Survey Soil Characterization database.

In fact, because the data set above represents values of OCD at variable depths, we can use this data to fit a full 3D+T spatiotemporal model in the form:

OCD(xydt) = d + X1(xyt) + X2(xyt) + … + Xp(xyt)

where d is the depth, X1…Xp are (static or dynamic) covariates, and xyt are spatiotemporal coordinates. Here we can assume that static covariates are mainly landform and lithology: these have probably not changed much in the last 100 years. Land cover, land use and climate, on the other hand, have probably changed drastically in the last 100 years and have to be represented with time-series of images. There are, indeed, several time-series data sets now available that can be used to represent land cover dynamics:

  • HYDE 3.2 Historic land use data set (Klein et al. 2011): contains the distribution of main agricultural systems from 10,000 BC (pre-historic no landuse condition) to present time. 10 categories of land use have been represented: total cropping, total grazing, pasture (improved grazingland), rangeland (unimproved grazingland), total rainfed cropping, total irrigated cropping with further subdivisions for rice and non-rice cropping systems for both rainfed and irrigated cropping.
  • CRU TS2.1 climatic surfaces for period 1960–1990 (Harris et al. 2014).

All of these are unfortunately available only at a relatively coarse resolution of 10 km. Note also that, because these are time-series of images, spatiotemporal overlay can take time: the spatial overlay must be repeated for each time period. The spatiotemporal matrix file already contains the results of this overlay, so that we can focus directly on building spatiotemporal models of OCD, e.g.:
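Had the spatiotemporal matrix not been provided, the per-period overlay could be organized as a small helper that is called once per sampling year. A sketch (the file naming convention `covs_<year>.tif` is hypothetical):

```r
## sketch: repeat the spatial overlay for one time period;
## points sampled in year y are overlaid on the covariate stack of that year
overlay_year <- function(pts.xy, y){
  cov.y <- raster::stack(paste0("covs_", y, ".tif"))  # hypothetical file naming
  raster::extract(cov.y, pts.xy)                      # covariate values at the points
}
```

Looping such a function over all sampling years and binding the results produces the space-time regression matrix used above.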

> library(ranger)
> fm0.st <- as.formula(paste('OCDENS ~ DEPTH.f + ', paste(pr.lst, collapse="+")))
> sel0.m = complete.cases(OCD_stN[,all.vars(fm0.st)])
> rf0.OCD_st <- ranger(fm0.st, data=OCD_stN[sel0.m,all.vars(fm0.st)], importance="impurity", write.forest=TRUE, num.trees=120)

The model fitting results show that the model explains almost 60% of the variation in OCD values:

Type:                             Regression
Number of trees:                  120
Sample size:                      249025
Number of independent variables:  127
Mtry:                             11
Target node size:                 5
Variable importance mode:         impurity
OOB prediction error:             102.5703
R squared:                        0.5911552
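Note that ranger reports the OOB prediction error for regression as a mean squared error; taking its square root expresses the average error in the units of the target variable:

```r
## the OOB prediction error above is an MSE; in kg/m3 the average error is:
sqrt(102.5703)
## i.e. about 10.1 kg/m3
```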

The most important covariates are the following:

               [,1]
DEPTH.f  16242618.7
DEM        766114.8
GRAZING    716857.0
MODFC09    677867.9
VBF        663850.9
MODFC06    643100.1
MODFC03    619556.6
MODFC10    615712.3
CROPLAND   611457.7
MODFC07    610335.4
TWI        603085.5
MODFC08    599495.6
TRAINFED   590347.6
RFNORICE   584568.0
MODFC05    576126.9

which shows that by far the most important covariate is soil depth, followed by elevation, grazing, the MODIS cloud fraction images, cropland and similar. For a full description of the codes please refer to this table.

Finally, based on this model, we can generate predictions for 3–4 specific time periods and for some arbitrary depth, e.g. 10 cm. The maps below clearly show that ca 8% of the soil organic carbon has been lost in the last 90 years, most likely due to the expansion of grazing land and cropland. The maps also show, however, that some areas in the northern latitudes are experiencing an increase in SOC, possibly due to higher rainfall, as indicated by the CRU data set.
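Generating such a map for a given period amounts to fixing DEPTH.f at the target depth and supplying the covariate values of that year to predict(). A sketch (the preparation of the `covs.y` data frame is hypothetical):

```r
## sketch: predict OCD at a fixed depth for one time period using the
## fitted ranger model; covs.y holds covariate values for the target year
predict_ocd_year <- function(model, covs.y, depth=10){
  covs.y$DEPTH.f <- depth             # fix depth, e.g. 10 cm
  predict(model, covs.y)$predictions  # ranger returns a list with $predictions
}
```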

Predicted OCD (in kg/m3) at 10 cm depth for the year 2014. Blue colors indicate low values, red high values. Download map.
Predicted OCD (in kg/m3) at 10 cm depth for the year 1925. Download map.

This demonstrates that, as long as there is enough training data spread through time, and as long as covariates are available for the corresponding time range, machine learning can also be used to fit full 3D+T spatiotemporal prediction models (Gasch et al. 2015). Once we produce a time-series of images for some target soil variable of interest, the next step would be to apply time-series analysis methods to e.g. detect temporal trends and areas of strongest soil degradation. An R package that is fairly useful for such analysis is greenbrown, primarily used to map and quantify degradation of land cover (Forkel et al. 2015).

We can focus on the time-series of predicted organic carbon density for Texas:

> library(greenbrown)
Loading required package: strucchange
Loading required package: zoo
 
Attaching package: ‘zoo’
 
The following objects are masked from ‘package:base’:
 
    as.Date, as.Date.numeric
 
Loading required package: sandwich
> library(raster)
> setwd("./USA48")
> library(rgdal)
> tif.lst <- list.files(pattern="_10km.tif")
> g10km <- as(readGDAL(tif.lst[1]), "SpatialPixelsDataFrame")
usa48.OCD_10cm_year1935_10km.tif has GDAL driver GTiff
and has 2160 rows and 4320 columns
> for(i in 2:length(tif.lst)){ g10km@data[,i] = readGDAL(tif.lst[i], silent=TRUE)$band1[g10km@grid.index] }
> names(g10km) = basename(tif.lst)
> g10km = as.data.frame(g10km)
> gridded(g10km) = ~x+y
> proj4string(g10km) = "+proj=longlat +datum=WGS84"
> library(maps)
> library(maptools)
Checking rgeos availability: TRUE
> states <- map('state', plot=FALSE, fill=TRUE)
> states = SpatialPolygonsDataFrame(map2SpatialPolygons(states, IDs=1:length(states$names)), data.frame(names=states$names))
> proj4string(states) = "+proj=longlat +datum=WGS84"
> ov.g10km = over(y=states, x=g10km)
> txg10km = g10km[which(ov.g10km$names=="texas"),]
> txg10km = as.data.frame(txg10km)
> gridded(txg10km) = ~x+y
> proj4string(txg10km) = "+proj=longlat +datum=WGS84"
> spplot(log1p(stack(txg10km)), col.regions=SAGA_pal[[1]])
> g10km.b = raster::brick(txg10km)
Time-series of predictions of organic carbon density for Texas.

We can analyze this time-series data to see where the decrease of organic carbon is most significant, for example by looking at the slope of change:

> trendmap <- TrendRaster(g10km.b, start=c(1935, 1), freq=1, breaks=1) ## can be computationally intensive
> plot(trendmap[["SlopeSEG1"]], col=rev(SAGA_pal[["SG_COLORS_GREEN_GREY_RED"]]), zlim=c(-1.5,1.5), main="Slope SEG1")
> cellStats(trendmap[["SlopeSEG1"]], stat="mean", na.rm=TRUE) ## mean slope over all pixels
Predicted slope of change of soil organic carbon density for Texas for period 1935–2014. Negative values indicate loss of soil organic carbon.

which shows that the loss of soil organic carbon is especially distinct in the southern part of Texas. The slope coefficient map is on average negative, which indicates that most of the state has lost organic carbon over the period of interest. Note that running such time-series analysis is not trivial: enough observations in time (if possible: repetitions) are needed to be able to extract significant patterns. Also, the TrendRaster function can be quite computationally intensive, hence some careful planning of the processing steps and infrastructure is usually a good idea.

Based on all the examples and discussion above, the following key points can be emphasized:

  1. OCS for an area of interest can be derived using either a 2D or a 3D approach. The 3D approach typically involves modeling ORC, BLD and CRF separately (and then deriving OCS per pixel), or modeling OCD for standard depths and then converting it to OCS.
  2. Publicly available RS-based covariates (SRTM / ALOS DEM, Landsat, Sentinel satellites) can be used to improve the mapping accuracy of OCS. Given the increasing availability of RS data, improving the accuracy of OCS maps is relatively inexpensive.
  3. PT (Pedo-Transfer) rules can be used to fill in missing BLD values and to estimate ORC for deeper soil depths.
  4. Machine learning techniques such as Random Forest, neural nets, gradient boosting and similar, can be used to predict soil organic carbon in 2D, 3D and in spatiotemporal modeling frameworks. Accuracy of these predictions improves (in comparison to linear statistical models) especially where the relationship between soil organic carbon distribution and climatic, land cover, hydrological, relief and similar covariates is complex (non-linear).
  5. Global estimates of ORC, BLD and CRF (SoilGrids.org) can be used as covariates so that consistent predictions can be produced globally.
  6. By producing spatial predictions of OCS for specific time periods, one can derive estimates of OCS change (loss or gain).
  7. Most of the statistical tools required for spatial analysis, time-series analysis, and the export and visualization of data are available in R, especially thanks to the contributed packages aqp, caret, ranger, xgboost, GSIF, greenbrown and similar.
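For illustration of point 1, the conversion from OCD to OCS is simply a multiplication by the thickness of the depth interval (a minimal sketch; OCD in kg/m3, thickness in m):

```r
## OCS (kg/m2) = OCD (kg/m3) x layer thickness (m)
ocd_to_ocs <- function(ocd, thickness.m){ ocd * thickness.m }
## e.g. an OCD of 30 kg/m3 over the standard 0-30 cm layer:
ocd_to_ocs(30, 0.3)  ## 9 kg/m2, i.e. 90 t/ha
```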
