3D soil property mapping using the GSIF package

The purpose of this tutorial is to demonstrate the major processing steps used within the GSIF framework for generating soil property and soil class maps from point data, with the help of multi-scale covariates. The GSIF (R package) project summary page can be found here. To learn more about the Global Soil Information Facilities (GSIF), visit the main project page. See also the complete list of functions available via the GSIF package. More sophisticated modelling techniques based on machine learning can be found here.

Download this tutorial as R script.

For demonstration purposes we use the Ebergötzen data set, which has been used by the SAGA GIS development team (Böhner et al., 2006; Hengl et al., 2012). To start the exercise, first install and load all required packages (see also: GSIF installation instructions). Note that GSIF depends on the plotKML package; you should always install both from R-forge first, as they are both experimental packages:

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
> library(aqp)
> library(raster)
> library(plotKML)
> library(GSIF)
Loading required package: RCurl
Loading required package: bitops
GSIF version 0.3-0 (2013-03-07)
URL: http://gsif.r-forge.r-project.org/

Load the input data:

> data(eberg)
> data(eberg_grid)
> data(eberg_grid25)

Profile data

The Ebergötzen data set covers a 10 by 10 km study area in central Germany. The eberg object contains the soil profile observations, while eberg_grid and eberg_grid25 contain covariate layers at 100 m and 25 m resolution, respectively. Before we proceed with the geostatistical analysis, we can run some initial data screening. We first look at the data structure:

> str(eberg)
'data.frame':   3670 obs. of  30 variables:
 $ ID      : Factor w/ 3692 levels "id0001","id0002",..: 3302 93 2827 1858 3539 3540 2828 94 1859 95 ...
 $ soiltype: Factor w/ 13 levels "A","B","D","G",..: NA NA NA NA NA NA NA NA NA NA ...
 $ TAXGRSC : Factor w/ 13 levels "Auenboden","Braunerde",..: NA NA NA NA NA NA NA NA NA NA ...
 $ X       : int  3569323 3569328 3569328 3569335 3569336 3569340 3569343 3569344 3569348 3569349 ...
 $ Y       : int  5716216 5715647 5716024 5715770 5716095 5716233 5716325 5715447 5714342 5716281 ...
 $ UHDICM_A: num  0 0 0 0 0 0 0 0 0 0 ...
 $ LHDICM_A: num  10 10 10 10 10 10 10 10 10 10 ...
 $ SNDMHT_A: num  20 20 18.8 20 16.6 12.5 20 20 20 16.5 ...
 $ SLTMHT_A: num  40 40 37.6 40 33.2 25 40 40 40 66.5 ...
 $ CLYMHT_A: num  40 40 37.6 40 33.2 25 40 40 40 21 ...
 $ UHDICM_B: num  10 10 10 10 10 10 10 10 10 10 ...
 $ LHDICM_B: num  30 30 30 30 30 30 30 30 30 30 ...
 $ SNDMHT_B: num  NA 20 NA NA NA NA NA 20 18.8 16.5 ...
 $ SLTMHT_B: num  NA 40 NA NA NA NA NA 40 37.6 66.5 ...
 $ CLYMHT_B: num  NA 40 NA NA NA NA NA 40 37.6 21 ...
 $ UHDICM_C: num  30 30 30 30 30 30 30 30 30 30 ...
 $ LHDICM_C: num  50 50 50 50 50 50 50 50 50 50 ...
 $ SNDMHT_C: num  NA 18.8 NA NA NA NA NA 18.8 NA NA ...
 $ SLTMHT_C: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ CLYMHT_C: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ UHDICM_D: num  50 50 50 50 50 50 50 50 50 50 ...
 $ LHDICM_D: num  70 70 70 70 70 70 70 70 70 70 ...
 $ SNDMHT_D: num  NA 18.8 NA NA NA NA NA 18.8 NA NA ...
 $ SLTMHT_D: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ CLYMHT_D: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ UHDICM_E: num  70 70 70 70 70 70 70 70 70 70 ...
 $ LHDICM_E: num  90 90 90 90 90 90 90 90 90 90 ...
 $ SNDMHT_E: num  NA 18.8 NA NA NA NA NA 18.8 NA NA ...
 $ SLTMHT_E: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ CLYMHT_E: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...

This shows that the soil texture fractions SNDMHT, SLTMHT and CLYMHT have been sampled at fixed depth intervals (0-10, 10-30, 30-50, 50-70 and 70-90 cm), and that soiltype contains 13 soil type classes.
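
For a quick look at how the 13 soil type classes are distributed among the profiles, a plain base-R summary is sufficient (class counts will of course differ if you later subset the data):

> summary(eberg$soiltype)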

Histogram plotting

Our focus in this exercise is spatial prediction of sand content over the whole volume of soil. Let us look at some summary information for sand content:

> summary(eberg$SNDMHT_A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  4.062  18.800  20.000  30.260  42.300  92.500   3.000
> library(StatDA)
> par(mar=c(2.5,2.5,0.5,0.5), oma=c(0,0,0,0))
> edaplot(eberg$SNDMHT_A[!is.na(eberg$SNDMHT_A)], H.freq=TRUE, box=FALSE, S.pch=3, S.cex=0.5,
+   D.lwd=1.5, P.ylab="", P.log=FALSE, P.logfine=c(5,10), P.main="", P.xlab="", B.pch=3, B.cex=0.5)
Histogram for sand content.

From the plot above, we can also observe that many values overlap: only a limited number of distinct values of SNDMHT actually occurs. Going back to the origin of this data set, we can see that the sand, silt and clay values have been determined using the so-called texture-by-hand method, i.e. via texture classes. The literature (Skaggs et al., 2001) shows that this technique can determine the content of texture fractions only to an accuracy of about ±5-10%. This means that we should not plan to map any of the texture fractions to a precision better than ±5% (the detection limit), because that would exceed the measurement error.

This is a relatively large point set (3670 points), so it might be a good idea to subset it (e.g. take only 30% of the samples) to speed up processing:

> eberg.xy <- eberg[runif(nrow(eberg)) < .3,]
> coordinates(eberg.xy) <- ~X+Y
> proj4string(eberg.xy) <- CRS("+init=epsg:31467")
> gridded(eberg_grid) <- ~x+y
> proj4string(eberg_grid) <- CRS("+init=epsg:31467")
> gridded(eberg_grid25) <- ~x+y
> proj4string(eberg_grid25) <- CRS("+init=epsg:31467")

Testing spatial randomness and point clustering

Prior to the geostatistical analysis, it is useful to assess how representative and consistent the soil profile data are. To achieve this, we can run some basic analysis of the point geometry and then overlay the points with the predictors to see how well the environmental features are represented. First, we can test whether the points represent the geographical space in an unbiased way (see chapter 7 in Bivand et al., 2008):

> library(spatstat)
> mg_owin <- as.owin(data.frame(x=as.data.frame(eberg_grid)[,"x"], y=as.data.frame(eberg_grid)[,"y"], window = TRUE))
> eberg.ppp <- ppp(x=coordinates(eberg.xy)[,1], y=coordinates(eberg.xy)[,2], marks=eberg.xy$SNDMHT_A, window=mg_owin)
> env.eberg.xy <- envelope(eberg.ppp, fun=Gest)
Generating 99 simulations of CSR  ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99.
Done.
> par(mar=c(4.5,4.5,0.5,0.5), oma=c(0,0,0,0))
> plot(env.eberg.xy, lwd=list(3,1,1,1), main="")
     lty col  key      label                                           meaning
obs    1   1  obs  G[obs](r)           observed value of G(r) for data pattern
theo   2   2 theo G[theo](r)                 theoretical value of G(r) for CSR
hi     1   8   hi   G[hi](r) upper pointwise envelope of G(r) from simulations
lo     1   8   lo   G[lo](r) lower pointwise envelope of G(r) from simulations
The results of the Complete Spatial Randomness test (spatstat package) for the Ebergötzen case study. In this case the observed G-curve does not fit into the 95% range for the CSR estimated for this study area.

which shows that the point samples do not fully satisfy the Complete Spatial Randomness (CSR) test (spatstat package): samples at larger distances have a lower spatial density than a completely random design, which usually means that some parts of the study area are under-sampled (see also the figure below). On the other hand, points at shorter distances (<75 m) would pass the CSR test, which indicates that, at least at shorter distances, there is no clustering in geographical space.

Testing the feature space coverage

To see how representative the point data are of the feature space (i.e. the coverage of environmental conditions), we use the MaxEnt function, which extends methods available via the dismo package:

> jar <- paste(system.file(package="dismo"), "/java/maxent.jar", sep='')
> if (file.exists(jar)) {
+  me.eberg <- MaxEnt(occurrences=eberg.ppp, covariates=eberg_grid)
+  par(mfrow=c(1,2), mar=c(0.5,0.5,0.5,0.5), oma=c(0,0,0,0))
+  image(x=as(me.eberg@predicted, "SpatialPixelsDataFrame"), col=rev(heat.colors(25)), xlab="", ylab="")
+  points(me.eberg@occurrences, pch="+", cex=.7)
+  image(me.eberg@sp.domain, col="grey", xlab="", ylab="")
+  }
Feature space analysis: sampling likelihood in feature space based on MaxEnt (left; dark red indicates higher values), and areas fully represented by the current samples following the cross-validation (right). See code examples for more details.

This shows that some parts of the study area (higher elevations, specific land cover types) have been systematically omitted from sampling. The map on the right shows which pixels (grey) are actually valid for running predictions, as the probability of occurrence of sampling points at these locations (at least based on the feature space defined by the covariates) is statistically significant. Ideally, the whole map should be grey, which would mean that no areas in feature space are under-represented. This is nothing that should worry us too much, but it is something to be aware of when interpreting the produced maps. Note: to run MaxEnt, you first need to obtain maxent.jar and place it somewhere where dismo can find it, e.g. under library/dismo/java/maxent.jar. To learn more about MaxEnt, refer to the dismo package vignette (Hijmans and Elith, 2012).

A much better way to design sampling would be to use some type of probability or statistical sampling. For example, to optimally allocate samples in feature space (so that all environmental conditions are equally represented) we can use the clhs package, developed and maintained by Pierre Roudier. This package implements conditioned Latin hypercube sampling, as published by Minasny and McBratney (2006). Here is an example for the Ebergötzen data set:

> install.packages("clhs")
> library(clhs)
> library(plotKML)
> library(GSIF)
> library(sp)
 
> data(eberg_grid)
> gridded(eberg_grid) <- ~x+y
> proj4string(eberg_grid) <- CRS("+init=epsg:31467")
> formulaString <- ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
> eberg_spc <- spc(eberg_grid, formulaString)
Converting PRMGEO6 to indicators...
Converting covariates to principal components...
 
> sample100 <- clhs(eberg_spc@predicted@data[,1:7], size=100)
 |================================================| 100%
## Takes ca 3 minutes!
> sample100.sp <- SpatialPointsDataFrame(eberg_grid@coords[sample100,], data.frame(x=rep(1, length(sample100))))
> proj4string(sample100.sp) <- CRS("+init=epsg:31467")
> plotKML(sample100.sp)
100 points allocated using Latin Hypercube sampling as implemented in the clhs package.

Note that the points are now more evenly spread, so that forest patches are also represented in the sampling design. Consequently, spatial modelling based on this sampling would be more representative of all environmental conditions. Note also that the clhs package allows you to specify a sampling cost per pixel, so that a conditional sampling design can be generated as a trade-off between good representation and cost minimization.
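
A sketch of how such a cost could be passed on (this assumes the installed clhs version exposes the cost argument as documented; check ?clhs). The cost column added below is purely hypothetical, using elevation as a stand-in for accessibility cost:

> spc.df <- eberg_spc@predicted@data[,1:7]
> spc.df$cost <- eberg_grid$DEMSRT6   ## hypothetical per-pixel cost (elevation as a proxy for access cost)
> sample100c <- clhs(spc.df, size=100, cost="cost")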

Any geostatistical mapping boils down to two major processes: (a) model fitting (model estimation), and (b) generation of spatial predictions. In the GSIF package, too, geostatistical mapping is implemented via two main functions: fit.gstatModel and predict.gstatModel. In the following sections we demonstrate how to fit models and generate predictions using soil profile data and a diversity of covariate layers typical for soil mapping applications. The functions fit.gstatModel and predict.gstatModel largely extend the functionality of the stats, rpart and gstat packages. The gstat package (see chapter 8 in Bivand et al., 2008) is still considered by many to be one of the most comprehensive packages for geostatistical modelling.

Preparing point data for geostatistical analysis

Before we build a regression-kriging model, we need to prepare the soil observations in a format suitable for fitting 3D models. We start by converting the eberg data frame to the SoilProfileCollection class (aqp package) and then to the geosamples class (GSIF package):

> # select columns of interest:
> s.lst <- c("ID", "soiltype", "TAXGRSC", "X", "Y")
> h.lst <- c("UHDICM","LHDICM","SNDMHT","SLTMHT","CLYMHT")
> sites <- eberg[,s.lst]
> horizons <- getHorizons(eberg, idcol="ID", sel=h.lst)
> library(plyr)  # the 'join' function used below comes from the plyr package
> eberg.spc <- join(horizons, sites, type='inner')
 Joining by: ID
> depths(eberg.spc) <- ID ~ UHDICM + LHDICM
> site(eberg.spc) <- as.formula(paste("~", paste(s.lst[-1], collapse="+"), sep=""))
Warning message:
converting IDs from factor to character
> coordinates(eberg.spc) <- ~X+Y
> proj4string(eberg.spc) <- CRS("+init=epsg:31467")

The GSIF package by default works with the geosamples class as the main class for field observations (points). Conversion of the SPC class data to geosamples is straightforward:

> eberg.spc@horizons$SNDMHT.t <- log((eberg.spc@horizons$SNDMHT/100) / (1-eberg.spc@horizons$SNDMHT/100))
> eberg.geo <- as.geosamples(eberg.spc)
Reprojecting to +proj=longlat +datum=WGS84 ...
> str(eberg.geo)
Formal class 'geosamples' [package "GSIF"] with 3 slots
  ..@ registry: chr NA
  ..@ methods :'data.frame':    6 obs. of  4 variables:
  .. ..$ methodid      : Factor w/ 6 levels "CLYMHT","SLTMHT",..: 1 2 3 4 5 6
  .. ..$ description   : logi [1:6] NA NA NA NA NA NA
  .. ..$ units         : logi [1:6] NA NA NA NA NA NA
  .. ..$ detectionLimit: logi [1:6] NA NA NA NA NA NA
  ..@ data    :'data.frame':    7854 obs. of  14 variables:
  .. ..$ observationid   : chr [1:7854] NA NA NA NA ...
  .. ..$ sampleid        : chr [1:7854] "id0009" "id0017" "id0047" "id0082" ...
  .. ..$ longitude       : num [1:7854] 10.1 10.1 10.2 10.2 10.2 ...
  .. ..$ latitude        : num [1:7854] 51.5 51.6 51.6 51.6 51.6 ...
  .. ..$ locationError   : num [1:7854] NA NA NA NA NA NA NA NA NA NA ...
  .. ..$ TimeSpan.begin  : POSIXct[1:7854], format:  ...
  .. ..$ TimeSpan.end    : POSIXct[1:7854], format:  ...
  .. ..$ altitude        : num [1:7854] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..$ altitudeMode    : chr [1:7854] "relativeToGround" "relativeToGround" "relativeToGround" "relativeToGround" ...
  .. ..$ sampleArea      : num [1:7854] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ sampleThickness : num [1:7854] 2 2 2 2 2 2 2 2 2 2 ...
  .. ..$ observedValue   : chr [1:7854] "B" "L" NA "K" ...
  .. ..$ methodid        : Factor w/ 6 levels "CLYMHT","SLTMHT",..: 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ measurementError: num [1:7854] NA NA NA NA NA NA NA NA NA NA ...
> levels(eberg.geo@data$methodid)
[1] "CLYMHT"   "SLTMHT"   "SNDMHT"
[4] "SNDMHT.t" "soiltype" "TAXGRSC"

The geosamples class can be considered the most plain (standard) format for any space-time observations and, as such, is highly suitable for storing free-form geosamples. The main advantage of using this class in R is that it can easily be manipulated and converted to spatial and aqp classes. Likewise, the column names in the @data slot correspond to the tag names used in the KML schema, which makes it easier to export such data to a GIS or to Google Earth. Note also that the idea of using geosamples is to allow users to extend the geostatistical analysis with parameters such as horizontal and vertical support (sampleArea and sampleThickness) and uncertainty measures (locationError and measurementError).
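
As an illustration of how easily geosamples can be manipulated with standard R tools (a minimal sketch using only base R and sp; no dedicated GSIF methods are assumed), the logit-transformed sand observations can, for example, be pulled out of the @data slot and converted back to a SpatialPointsDataFrame:

> snd.xy <- eberg.geo@data[eberg.geo@data$methodid == "SNDMHT.t",]  ## all logit-sand observations
> snd.xy$observedValue <- as.numeric(snd.xy$observedValue)
> coordinates(snd.xy) <- ~ longitude + latitude
> proj4string(snd.xy) <- CRS("+proj=longlat +datum=WGS84")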

Generating soil predictive components

Prior to geostatistical modelling, it is usually a good idea to convert all covariates to independent components. This makes it easier to subset to the optimal number of predictors during the regression analysis, and PCA helps reduce the prediction bias that might occur if the covariates are cross-correlated. The wrapper function spc converts all factor variables to indicators and runs PCA on a stack of grids:

> formulaString <- ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
> eberg_spc <- spc(eberg_grid, formulaString)
Converting PRMGEO6 to indicators...
Converting covariates to principal components...

The output of this operation is a stack of independent components, all numeric and all centred around zero. To see which inputs define each component, we can inspect the rotation table or plot the images:

> eberg_spc@pca$rotation
                  PC1         PC2         PC3         PC4         PC5         PC6           PC7         PC8
DEMSRT6   -0.55711093  0.20819653 -0.03119766  0.06485905 -0.04919198  0.04583550 -2.166881e-02 -0.11850187
TWISRT6    0.45829444  0.23870553 -0.22013669 -0.05930392 -0.05772050  0.05289886  4.749496e-02  0.16709268
TIRAST6    0.31720111  0.23411688  0.21194747  0.22670979 -0.06911383  0.01606945 -5.050788e-02 -0.86045157
PRMGEO6_1 -0.14135533 -0.03765647  0.60672337  0.34711053  0.49254843  0.02022393  1.512718e-01  0.12762671
PRMGEO6_2 -0.43416755  0.28698148 -0.47280406  0.01002071  0.02500768 -0.10026014  3.879591e-02 -0.21151385
PRMGEO6_3  0.07971922  0.06542432 -0.07672724 -0.04496406 -0.06289324  0.35365015  8.983826e-01 -0.02270445
PRMGEO6_4  0.36654086  0.41740804 -0.16408783 -0.03037150  0.41383444 -0.11009869 -1.480861e-01  0.18050170
PRMGEO6_5 -0.02765866  0.09015600  0.01774654  0.06338709 -0.08345800  0.89976730 -3.708044e-01  0.07892063
PRMGEO6_6 -0.01705511 -0.01720833  0.38970849 -0.80790529 -0.16623181 -0.01556903 -3.792439e-02 -0.11209884
PRMGEO6_7  0.16045128 -0.72535140 -0.26755675  0.17295113 -0.06543613  0.01762360 -4.878972e-02 -0.14541877
PRMGEO6_8  0.05464832  0.21913031  0.23958662  0.36153522 -0.72948460 -0.19250233 -6.541899e-05  0.28913001
 
                   PC9        PC10          PC11
DEMSRT6   -0.062531842  0.78623052  1.419406e-15
TWISRT6   -0.773340893  0.21597904 -7.371827e-16
TIRAST6    0.004370069  0.01648279  3.410741e-16
PRMGEO6_1 -0.311806003 -0.06650597  3.258389e-01
PRMGEO6_2 -0.210767393 -0.44338968  4.555854e-01
PRMGEO6_3  0.186242372  0.05142616  8.656808e-02
PRMGEO6_4  0.452668089  0.23662602  4.059308e-01
PRMGEO6_5  0.035319735 -0.10118848  1.225425e-01
PRMGEO6_6 -0.067946642  0.04174463  3.828709e-01
PRMGEO6_7 -0.024254803  0.25057179  5.051019e-01
PRMGEO6_8  0.101594844 -0.02238362  3.118692e-01
> pal = rev(rainbow(65)[1:48])
> rd = range(eberg_spc@predicted@data[,1], na.rm=TRUE)
> spplot(eberg_spc@predicted[1:4], at=seq(rd[1], rd[2], length.out=48), col.regions=pal)
First four components derived using the eberg_grid data.

Model fitting

In the GSIF package, a regression-kriging model can be fitted in one step using the fit.gstatModel function. First, we need to specify a model explaining the distribution of the soil property as a function of the covariates:

> logits = function(x){log((x/100)/(1-x/100))}
> glm.formulaString = as.formula(paste("SNDMHT.t ~ ", paste(names(eberg_spc@predicted), collapse="+"), "+ ns(altitude, df=4)")) ## Note: the logit transform prevents back-transformed predictions from falling outside the 0-1 range;
> glm.formulaString
SNDMHT.t ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +
    PC10 + PC11 + ns(altitude, df = 4)

In other words, the observed values are modelled as a function of the PCs and altitude (natural splines via the ns function). The ns function is only one of several functions in R that can be used to fit soil-depth relationships. Another possibility is, for example, the restricted cubic splines function (rcs) from the rms package. Note that the logit-transformed value of the target variable is used so that back-transformed predictions cannot fall outside the 0-100% range.
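
For illustration, the same trend model could be specified with restricted cubic splines instead of ns(). This is only a sketch (it assumes the rms package is installed and that four knots are adequate); it is not fitted further in this tutorial:

> library(rms)
> glm.formulaString.rcs <- as.formula(paste("SNDMHT.t ~ ",
+    paste(names(eberg_spc@predicted), collapse="+"), "+ rcs(altitude, 4)"))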

In the GSIF package, the 3D GLM-kriging model can be fitted at once by running:

> SNDMHT.m <- fit.gstatModel(observations=eberg.geo, glm.formulaString, covariates=eberg_spc@predicted)
Fitting a linear model...
Warning: Shapiro-Wilk normality test and Anderson-Darling normality test report probability of < .05 indicating lack of normal distribution for residuals
Fitting a 3D variogram...
Warning: singular model in variogram fit
Saving an object of class 'gstatModel'...
> summary(SNDMHT.m@regModel)
Call:
glm(formula = SNDMHT.t ~ PC1 + PC4 + PC5 + PC7 + PC8 + PC9 +
    PC10 + PC11 + ns(altitude, df = 4), family = fit.family,
    data = rmatrix)
 
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7863  -0.4729  -0.0251   0.3674   3.6106
 
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)           -1.231e+00  6.966e-02 -17.671  < 2e-16 ***
PC1                   -3.040e-01  3.567e-02  -8.522  < 2e-16 ***
PC4                   -4.412e-01  2.218e-02 -19.889  < 2e-16 ***
PC5                   -1.566e-01  2.256e-02  -6.943 6.33e-12 ***
PC7                   -1.420e-01  2.462e-02  -5.769 1.02e-08 ***
PC8                   -2.449e-01  2.789e-02  -8.782  < 2e-16 ***
PC9                   -1.502e-01  3.967e-02  -3.786 0.000161 ***
PC10                   3.631e-01  1.008e-01   3.602 0.000329 ***
PC11                  -1.429e+14  2.304e+13  -6.200 7.82e-10 ***
ns(altitude, df = 4)1  1.063e-01  1.163e-01   0.914 0.361084
ns(altitude, df = 4)2  1.563e-01  9.725e-02   1.607 0.108338
ns(altitude, df = 4)3  2.334e-01  1.549e-01   1.506 0.132295
ns(altitude, df = 4)4  2.488e-01  7.869e-02   3.162 0.001606 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for gaussian family taken to be 0.7236981)
 
    Null deviance: 1361.91  on 1190  degrees of freedom
Residual deviance:  852.52  on 1178  degrees of freedom
  (594 observations deleted due to missingness)
AIC: 3009.7
 
Number of Fisher Scoring iterations: 2
> SNDMHT.m@vgmModel
  model     psill    range kappa ang1 ang2 ang3 anis1   anis2
1   Nug 0.0000000  0.00000   0.0    0    0    0     1 1.00000
2   Exp 0.8202215 75.26947   0.5    0    0    0     1 0.00015

These results show that the model is significant, both for the regression (linear model) part and for the variogram part. Note, however, that this is not a complete 3D regression-kriging, as we do not actually have values of the PCs at different depths (in fact, most of the PCs relate only to the surface), so the values of the covariates at the land surface are basically copied to all depths. This is not a problem for the GLM modelling as such, but you should be aware that, because the covariate values are fixed across depths, the resulting 3D patterns in the target variable will be mainly controlled by the surface patterns.

The GSIF package also allows you to fit regression-tree and random forest models for the regression part of the model. These are implemented via the rpart and randomForest packages. For example, to fit a regression tree, we set the method argument:

> SNDMHT.m2 <- fit.gstatModel(observations=eberg.geo, glm.formulaString, covariates=eberg_spc@predicted, method="rpart")
Fitting a regression tree model...
Estimated Complexity Parameter (for prunning): 0.01352
Warning: Shapiro-Wilk normality test and Anderson-Darling normality test report probability of < .05 indicating lack of normal distribution for residuals
Fitting a 3D variogram...
Saving an object of class 'gstatModel'...
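
By analogy (not run here), a random forest could be requested for the regression part of the model in the same way; this sketch assumes that the randomForest package is installed and that fit.gstatModel accepts method="randomForest" as documented:

> SNDMHT.mrf <- fit.gstatModel(observations=eberg.geo, glm.formulaString,
+    covariates=eberg_spc@predicted, method="randomForest")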

Spatial predictions

Once we have fitted a gstatModel, we can generate predictions and estimate the associated uncertainty at any depth. In the next step, we need to prepare the 3D prediction locations, i.e. the grid cells that need to be mapped. In the GSIF package, this can be done using the sp3D function:

> new3D <- sp3D(eberg_spc@predicted)

This prepares the existing SPCs (soil predictive components) for 3D prediction (six standard depths, at a block support corresponding to the thickness of the standard layers) by adding an altitude column:

> str(new3D[[1]]@grid)
  Formal class 'GridTopology' [package "sp"] with 3 slots
   ..@ cellcentre.offset: Named num [1:3] 3.57e+06 5.71e+06 -2.50e-02
   .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
   ..@ cellsize         : Named num [1:3] 1e+02 1e+02 5e-02
   .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
   ..@ cells.dim        : Named int [1:3] 100 100 1
   .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"

We can now run predictions for each standard depth in a loop:

> sd.l <- lapply(new3D, FUN=function(x){predict(SNDMHT.m, predictionLocations=x, nfold=0)})
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
...

This operation can take time depending on the size of the grids and number of 3D points used to generate predictions.

The final predictions can be back-transformed to the 0-100% scale following Diggle and Ribeiro (2007, p. 148):

> invlogit = function(x){exp(x)/(1+exp(x))*100}
> invlogit.m = function(x, v){((1+exp(-x))^(-1)-.5*v*exp(-x)*(1-exp(-x))*(1+exp(-x))^(-3) )*100}
> for(j in 1:length(sd.l)){
    sd.l[[j]]@predicted$M <- round(invlogit.m(sd.l[[j]]@predicted$SNDMHT.t, sd.l[[j]]@predicted$var1.var))
    sd.l[[j]]@predicted$L <- round(invlogit(sd.l[[j]]@predicted$SNDMHT.t - 1.645*sqrt(sd.l[[j]]@predicted$var1.var)))
    sd.l[[j]]@predicted$U <- round(invlogit(sd.l[[j]]@predicted$SNDMHT.t + 1.645*sqrt(sd.l[[j]]@predicted$var1.var)))
}
> str(sd.l[[1]]@predicted@data)
'data.frame':   10000 obs. of  7 variables:
 $ var1.pred    : num  -1.94 -1.94 -1.94 -1.64 -1.61 ...
 $ var1.var     : num  0.279 0.278 0.278 0.252 0.252 ...
 $ SNDMHT.t.svar: num  0.244 0.243 0.243 0.221 0.22 ...
 $ SNDMHT.t     : num  -1.94 -1.94 -1.94 -1.64 -1.61 ...
 $ M            : num  14 14 14 17 18 18 19 19 20 20 ...
 $ L            : num  6 6 6 8 8 8 8 9 9 10 ...
 $ U            : num  26 26 25 31 31 31 33 33 33 33 ...

In this case we predict the mean value (M) as well as the lower (L) and upper (U) limits of the 90% confidence interval.
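
For reference, the invlogit.m function used above implements the approximate expected value of the inverse-logit of a normally distributed prediction (a second-order correction following Diggle and Ribeiro, 2007). With \mu the prediction on the logit scale and v its prediction variance:

  \mathrm{E}[p] \approx 100 \cdot \left[ (1+e^{-\mu})^{-1} - \tfrac{1}{2}\, v\, e^{-\mu}\,(1-e^{-\mu})\,(1+e^{-\mu})^{-3} \right]

while L and U are obtained by applying the plain inverse-logit to \mu - 1.645\sqrt{v} and \mu + 1.645\sqrt{v}, respectively.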

Finally, we can prepare the produced predictions and export them as a GlobalSoilMap-class object. Before we can save the output object as a GlobalSoilMap-class object, we need to reproject the produced predictions to geographical coordinates (WGS84), which can be done using the make.3Dgrid function:

> p = get("cellsize", envir = GSIF.opts)[2]
> s = get("stdepths", envir = GSIF.opts)
> s
[1] -0.025 -0.100 -0.225 -0.450 -0.800 -1.500
> sd.ll <- sapply(1:length(sd.l), FUN=function(x){make.3Dgrid(sd.l[[x]]@predicted[c("L","M","U")], pixelsize=p, stdepths=s[x])})
Resampling 3 layers to +proj=longlat +datum=WGS84 with grid cell size of 0.000833333333333333 ...

The object sd.ll can now be saved as a GlobalSoilMap-class object:

> SNDMHT.gsm <- SoilGrids(obj=sd.ll, varname="SNDPPT", TimeSpan=list(begin="1999-02-01", end="2001-07-01"))
> str(SNDMHT.gsm, max.level=2)
Formal class 'SoilGrids' [package "GSIF"] with 8 slots
  ..@ varname : chr "SNDPPT"
  ..@ TimeSpan:List of 2
  ..@ sd1     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd2     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd3     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd4     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd5     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd6     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
> save(SNDMHT.gsm, file="SNDMHT.rda", compress="xz")

and visualized in Google Earth by using the plotKML package functionality:

> z0 = mean(eberg_grid$DEMSRT6, na.rm=TRUE)
> for(j in 1:length(sd.ll)){
+  kml(slot(SNDMHT.gsm, paste("sd", j, sep="")),
+     folder.name = paste("eberg_sd", j, sep=""),
+     file = paste("SNDMHT_sd", j, ".kml", sep=""),
+     colour = M, z.lim=c(10,85),
+     raster_name = paste("SNDMHT_sd", j, ".png", sep=""),
+     altitude = z0+5000+(s[j]*2500))
+ }
KML file header opened for parsing...
Parsing to KML...
Closing  SNDMHT_sd1.kml
KML file header opened for parsing...
...
Parsing to KML...
Closing  SNDMHT_sd6.kml
Predicted sand content using 3D GLM-kriging as visualized in Google Earth (zip). When visualizing this data, make sure that the relief exaggeration in Google Earth has been set at 1.

To check how accurate these maps are, we can use the validate method. This iteratively re-fits the gstatModel and then derives the residual error at the validation points. The validate method is a complete cross-validation, in the sense that the validation points are masked out from the model fitting, unlike krige.cv in gstat, which uses a single model fitted with all data.

> rk.cv <- validate(SNDMHT.m)
Estimating the predictionDomain...
Estimated nearest neighbour distance (point pattern): 126
Running 5-fold cross validation...
Generating predictions using the trend model (KED method)...
[using universal kriging]
100% done
Generating predictions using the trend model (KED method)...
[using universal kriging]
...
> tvar <- 1-var(rk.cv[[1]]$residual, na.rm=T)/var(rk.cv[[1]]$observed, na.rm=T)
> signif(tvar*100, 3)
[1] 61.2

This shows that the model is able to explain about 61% of the variability in the target variable at the validation points.

Visualizing the uncertainty of predictions

According to the GlobalSoilMap.net specifications, a requirement for submitting predicted soil property maps is to estimate the 90% upper and lower confidence limits and attach them to the predictions (thus, two extra maps). Using the methodology described above (i.e. gstatModel class objects), one can derive confidence intervals for any arbitrary point in the 3D continuum. Here is an example of the calculation for an arbitrary 3D point:

> loc <- eberg_spc@predicted[1200:1201,]
> new3D.loc <- sp3D(loc)
> str(new3D.loc[[1]])
Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
   ..@ data       :'data.frame': 2 obs. of  11 variables:
   .. ..$ PC1 : num [1:2] 3.09 -2.42
   .. ..$ PC2 : num [1:2] 1.601 0.823
   .. ..$ PC3 : num [1:2] -0.576 -1.176
   .. ..$ PC4 : num [1:2] -0.09268 -0.00844
   .. ..$ PC5 : num [1:2] 1.0204 0.0536
   .. ..$ PC6 : num [1:2] -0.23 -0.189
   .. ..$ PC7 : num [1:2] -0.312 0.0839
   .. ..$ PC8 : num [1:2] 0.221 -0.15
   .. ..$ PC9 : num [1:2] -0.218 -0.222
   .. ..$ PC10: num [1:2] -0.1249 0.0608
   .. ..$ PC11: num [1:2] -4.66e-15 7.88e-15
   ..@ coords.nrs : num(0)
   ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
   .. .. ..@ cellcentre.offset: Named num [1:3] 3.57e+06 5.72e+06 -2.50e-02
   .. .. .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
   .. .. ..@ cellsize         : Named num [1:3] 9900 100 0.05
   .. .. .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
   .. .. ..@ cells.dim        : Named int [1:3] 2 2 1
   .. .. .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
   ..@ grid.index : int [1:2] 2 3
   ..@ coords     : num [1:2, 1:3] 3.58e+06 3.57e+06 5.72e+06 5.72e+06 -2.50e-02 ...
   .. ..- attr(*, "dimnames")=List of 2
   .. .. ..$ : NULL
   .. .. ..$ : chr [1:3] "longitude" "latitude" "altitude"
   ..@ bbox       : num [1:3, 1:2] 3565100 5716700 -4950 3584900 5716900 ...
   .. ..- attr(*, "dimnames")=List of 2
   .. .. ..$ : chr [1:3] "longitude" "latitude" "altitude"
   .. .. ..$ : chr [1:2] "min" "max"
   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
   .. .. ..@ projargs: chr " +init=epsg:31467 +proj=tmerc +lat_0=0 +lon_0=9 +k=1 +x_0=3500000 +y_0=0 +ellps=bessel __truncated__
> sd.loc <- predict(SNDMHT.m, predictionLocations=new3D.loc[[1]], nfold=0, subset.observations = SNDMHT.m@sp@coords[,1])
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
100% done
> int90 <- sd.loc@predicted$SNDMHT.t[1] + c(-1.645, 1.645)*sqrt(sd.loc@predicted$var1.var[1])
> invlogit(int90)
[1] 10.97902 26.44757
> new3D.loc[[1]]@coords[1,]
longitude    latitude    altitude
3579950.000 5716850.000      -0.025

This means that the 90% confidence interval of the predictions at this 3D location is 11-26% sand. Because the gstatModel is basically a global 3D model for this soil property, we can estimate the uncertainty at any new 3D location; the only requirements are that the values of the covariates at that location are known and that the location is submitted as a SpatialPixelsDataFrame object.

To visualize uncertainty, you can consider using geostatistical simulations:

> SNDMHT.xy <- spTransform(SNDMHT.geo, CRS("+init=epsg:31467"))  # SNDMHT.geo: the sand observations as spatial points in WGS84
> sel.stripe <- eberg_spc@predicted@coords[,2] > min(SNDMHT.xy@coords[,2])  # 2400 locations
> loc <- eberg_spc@predicted[sel.stripe,]
> new3D.loc <- sp3D(loc)
> sd.loc <- predict(SNDMHT.m, predictionLocations=new3D.loc[[1]], nsim=20, block=c(0,0,0))
Generating 20 conditional simulations using the trend model (RK method)...
drawing 20 GLS realisations of beta...
[using conditional Gaussian simulation]
  100% done
> sd.loc@realizations <- calc(sd.loc@realizations, fun=invlogit)

This function creates an object of class RasterBrickSimulations, which can then be visualized using the plotKML package:

> str(sd.loc, max.level=2)
Formal class 'RasterBrickSimulations' [package "plotKML"] with 3 slots
  ..@ variable    : chr "SNDMHT"
  ..@ sampled     :Formal class 'SpatialLines' [package "sp"] with 3 slots
  ..@ realizations:Formal class 'RasterBrick' [package "raster"] with 12 slots
> plotKML(sd.loc, file="SNDMHT_sims.kml", zlim=c(10,85))
KML file header opened for parsing...
Reprojecting to +proj=longlat +datum=WGS84 ...
...
Parsing to KML...
Reprojecting to +proj=longlat +datum=WGS84 ...
Parsing to KML...
Closing  SNDMHT_sims.kml

The resulting simulations are shown in the figure below. Note that geostatistical simulations at block support can be very time consuming, so running this code for millions of locations is probably not a good idea. One of the purposes of the GSIF package for R is to allow for faster methods to run geostatistical simulations (further development will likely go via Python).

10 simulations of sand content (at point support) and a cross-section showing the changes in values in vertical direction.

Predicting soil classes

The GSIF package also provides functionality for pedometric mapping of soil classes. Soil types can be mapped using the wrapper functions spmultinom and spfkm, which run supervised classification (multinomial logistic regression and supervised fuzzy k-means, respectively) using a list of covariate layers provided as a SpatialPixelsDataFrame object and, optionally, class centres and class variances. As in the case of continuous/numeric variables, the process consists of model fitting and prediction. In the GSIF package, the two are wrapped into a single function:

> formulaString = soiltype ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10
> eberg_sm <- spmultinom(formulaString, eberg.xy, eberg_spc@predicted)
Fitting a multinomial logistic regression model...
# weights:  144 (121 variable)
initial  value 1853.740361
iter  10 value 1200.197532
iter  20 value 1167.479984
iter  30 value 1156.517152
iter  40 value 1150.287170
iter  50 value 1146.787227
iter  60 value 1144.369824
iter  70 value 1144.156270
iter  80 value 1143.913263
iter  90 value 1143.907377
final  value 1143.907359
converged
Estimated Cohen Kappa (weighted): 0.3116
Warning message:
In multinom(formulaString, ov, ...) : group ‘Hw’ is empty

The spmultinom function estimates the distribution of soil types by fitting a multinomial logistic regression model (the multinom function from the nnet package; Venables and Ripley, 2002). The output is an object of class SpatialMemberships, which contains the predictions, the model parameters, the class centres and variances, and membership maps derived per class. The resulting Kappa statistic shows that the predicted classes match about 31% of the observed classes, which is a typically low value for soil taxonomic data.

Class centres and variances can also be passed directly via the class.c and class.sd arguments of the spfkm function (an alternative to spmultinom), in which case spfkm will directly derive the supervised fuzzy k-means memberships.
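
For example (a sketch, not run here), the class centres and standard deviations estimated by spmultinom could be passed on directly, assuming the fitted SpatialMemberships object stores them in the class.c and class.sd slots:

> eberg_fkm <- spfkm(formulaString, eberg.xy, eberg_spc@predicted,
+      class.c=eberg_sm@class.c, class.sd=eberg_sm@class.sd)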

To see the estimated class centers:

> eberg_sm@class.c
          PC1         PC2        PC3         PC4         PC5         PC6         PC7          PC8         PC9
A  -1.9136970  0.42290575 -0.8199637  0.06929519  0.14254910 -0.29501667  0.04445353 -0.569246722  0.59515829
B   0.1451428  0.05149812  1.0232911 -1.38193489 -0.88743587 -0.15377161 -0.07627217 -0.077118762 -0.05831949
D  -0.4773107  0.33677264  1.5855751  1.21311891 -0.20838896 -0.24014369  0.29626729  0.816515294 -0.34327845
G   3.1222982  1.48675002 -0.9355987 -0.37516028  1.06098275 -0.21828971 -0.22613231  1.241218926 -0.71454825
Ha -2.0722482  0.87009506 -1.0655838  0.03860135 -0.02849083  0.41575214  0.20629221 -0.233234399 -0.16586679
L   1.8885517  0.03098325 -0.4946663  0.10999082  0.51145092 -0.12456134 -0.21485459  0.006779172  0.06215298
Q   0.4974576 -1.57456661 -0.2174762  0.53124815 -0.14682206  0.01991976 -0.15147949 -0.951987328  0.45050045
R  -2.1043180  0.50564145 -1.0560584 -0.05346994  0.12978516 -0.25752102  0.08547869 -0.075669733  0.18523146
S   0.7410611 -1.00398803 -0.6111388  0.14103309  0.15076192 -0.03602664 -0.11831015  0.349219067  0.05584245
Z  -1.4125863  0.03343204  1.7252433  1.14667200  0.72181439 -0.13599257  0.29550948  0.118988806  0.29309998
          PC10
A  -1.23773329
B  -0.01363216
D   0.03916544
G  -0.09389767
Ha -0.02458035
L  -0.11880316
Q   0.12662896
R  -0.79474759
S   0.26756947
Z  -0.15280722
> row.names(eberg_sm@class.c)
[1] "A"  "B"  "D"  "G"  "Ha" "L"  "Q"  "R"  "S"  "Z"
Predicted soil types for the Ebergötzen case study. See spmultinom for more details. This can also be considered a downscaling or disaggregation method.

Mapped soil class memberships can also be used to map soil properties, which is then equivalent to per-class weighted averaging. We can derive the memberships using fuzzy k-means, which ensures that they sum to 1:

> eberg_sm <- spfkm(formulaString, eberg.xy, eberg_spc@predicted)
Trying to estimate the class centres using the 'multinom' method...
Fitting a multinomial logistic regression model...
...
Predicted soil memberships using the spfkm method.

The regression model changes to e.g.:

> glm.formulaString2 = as.formula(paste("SNDMHT_A ~ ", paste(names(eberg_sm@mu), collapse="+"), "-1"))
> glm.formulaString2
SNDMHT_A ~ A + B + D + G + Hw + K + L + Q + R + S + Z - 1
> SNDMHT.m2 <- fit.gstatModel(observations=eberg.xy, glm.formulaString2, covariates=eberg_sm@mu)
> summary(SNDMHT.m2@regModel)
Call:
glm(formula = SNDMHT_A ~ B + D + G + Hw + K + L + Q + R + S +
    Z - 1, family = family, data = rmatrix)
 
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-51.603   -8.522   -1.022    8.397   55.466
 
Coefficients:
   Estimate Std. Error t value Pr(>|t|)
B    58.103      1.492  38.955  < 2e-16 ***
D    21.022      1.961  10.717  < 2e-16 ***
G    13.500      7.090   1.904  0.05742 .
Hw   15.879      5.225   3.039  0.00248 **
K    20.159      3.965   5.084 5.06e-07 ***
L    24.465      1.720  14.224  < 2e-16 ***
Q    32.552      2.535  12.839  < 2e-16 ***
R    17.559      5.991   2.931  0.00352 **
S    32.760      1.628  20.124  < 2e-16 ***
Z    17.072      1.988   8.589  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for gaussian family taken to be 251.3532)
 
    Null deviance: 839197  on 567  degrees of freedom
Residual deviance: 140004  on 557  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 4754.7
 
Number of Fisher Scoring iterations: 2

Note that the intercept needs to be taken out, so that the best predictor of sand content for a given soil type is basically the mean sand content of that soil type (for example, class "B" is expected to have an average sand content of about 58%). If you compare the model based on soil classes with the model fitted in the previous sections (SNDMHT.m), you can see that fitting the data using the 3D model results in a slightly better fit. Nevertheless, the soil classes are, in this case study, significant estimators of sand content (especially classes B, D, L, Q, S and Z).
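
As a quick sanity check (not part of the original workflow), the fitted coefficients can be compared with the per-class mean topsoil sand contents computed directly from the point data; the two sets of values should roughly correspond:

> aggregate(eberg$SNDMHT_A, by=list(soiltype=eberg$soiltype), FUN=mean, na.rm=TRUE)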

In the next examples we demonstrate how to produce predictions using multi-scale and/or multi-source data. By multi-scale data we mean covariates that are available at two or more (distinctly different) resolutions but cover the same area of interest (see also: RasterStack in the raster package). In other words, we fit models and generate predictions with covariates at different resolutions but with the same extent. In the case of multi-source data, the covariates can be of any scale and can have variable extent and variable accuracy. As a general strategy, for multi-scale data we propose fitting a single model to the combined covariates, while for multi-source data we propose using data assimilation methods (merging of predictions).

Multi-scale versus multi-source approaches to soil mapping.

Regression-kriging using multiscale data

The Ebergötzen case study contains covariates at 100 m and 25 m resolution. Both lists of covariates have the same extent, which allows us to fit a model using the same fit.gstatModel function as above:

> eberg_grids <- list(eberg_grid, eberg_grid25)
> unlist(sapply(eberg_grids, names))
[1] "PRMGEO6" "DEMSRT6" "TWISRT6" "TIRAST6" "LNCCOR6" "DEMTOPx" "HBTSOLx" "TWITOPx" "NVILANx"
> glm.formulaString3 = observedValue ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6+LNCCOR6+TWITOPx+NVILANx+ns(altitude, df=4)
> SNDMHT.m3 <- fit.gstatModel(observations=eberg.geo, glm.formulaString3, covariates=eberg_grids, methodid="SNDMHT.t")

In this case, the fit.gstatModel function recognizes that the input is a list of grids and hence overlays the points and grids in a loop. Note, however, that model fitting can be time consuming, as the function needs to overlay points and grids and then merge the resulting outputs in a loop.

The model above (SNDMHT.m3) would be more difficult to use for prediction, because some land cover classes or soil mapping units may not contain any points. The model would then not be able to predict values at new locations and R would report an error, e.g.:

"error in model... factor 'LNCCOR6' has new level(s)"

Thus, a more robust way to generate predictions with multi-scale covariates is, again, to convert the original predictors to principal components, which are all continuous and independent. We first need to downscale all predictors from 100 m to 25 m resolution using the gdalwarp function:

> eberg_grid25@data <- cbind(eberg_grid25@data, gdalwarp(eberg_grid, pixsize=eberg_grid25@grid@cellsize[1],
+    GridTopology=eberg_grid25@bbox, resampling_method="cubicspline")@data)
  Resampling 5 layers to "+init=epsg:31467" with grid cell size of 25 ...

Now we can derive the soil predictive components via the spc function:

> formulaString2 <- ~ TWITOPx+NVILANx+PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
> eberg_spc25 <- spc(eberg_grid25, formulaString2)
First four components derived using the combined 100 m and 25 m resolution data. Compare with the PCs derived using the 100 m resolution covariates only.

which allows us to model the spatial distribution of sand content as a function of 14 components, at a four times finer resolution:

> glm.formulaString3 = as.formula(paste("logits(SNDMHT) ~ ", paste(names(eberg_spc25@predicted), collapse="+"), "+ ns(altitude, df=4)"))
> glm.formulaString3
logits(SNDMHT) ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 +
    PC9 + PC10 + PC11 + PC12 + PC13 + PC14 + ns(altitude, df = 4)
> SNDMHT.m3 <- fit.gstatModel(observations=eberg.geo, glm.formulaString3,
+    covariates=eberg_spc25@predicted, methodid="SNDMHT.t")

Note that the resulting model (SNDMHT.m3) is only slightly better than the model fitted using the 100 m resolution data. For mapping sand content in this study area, there seems to be little gain in using the 25 m resolution covariates as far as model significance is concerned. The difference is, however, that the predictions based on the 25 m resolution covariates show four times more detail than the model based on the 100 m resolution grids (see the small topographic features in the figure below), so in the end there is a significant gain in using the higher resolution data.

Comparison of predictions created using 25 m and 100 m resolution covariates. The missing values indicate areas where the predictions were not significant (and hence were masked out). Values from 4% to 96% sand (blue to red).

It is worth mentioning that the preparation of prediction locations for multi-scale data is slightly more complicated, as the covariates have to be resampled from 100 m to 25 m resolution. This operation can be time consuming and can lead to artifacts. GSIF, by default, uses FWTools (the gdalwarp function) for resampling and reprojection of grids. FWTools is highly suitable for processing large data sets and offers many options for the downscaling algorithm. For example, via FWTools you can pass the cubicspline resampling method (bicubic spline; Keys, 1981), which is effective for downscaling relatively smooth grids such as a DEM or DEM parameters. The drawback is that R needs to read and write the spatial grids exchanged with FWTools, so processing grids via FWTools can still take a significant amount of time / RAM when working with large grids.

Merging multi-source data

Imagine that the covariates for the Ebergötzen case study did not share the same extent, so that by overlaying the points we would lose significant parts of the calibration data because the covariates would not match. To avoid this, we can instead produce a separate model for each extent, and then use some type of data assimilation method to merge the predictions produced by the various models. A sensible approach to merging multiple predictions is to use the per-pixel accuracy to assign the weights, so that more accurate predictions receive more weight (Heuvelink and Bierkens, 1992).
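
In its simplest form, such merging is an inverse-variance weighted average of the individual predictions (a sketch of the underlying idea; the merge function used below derives the weights from cross-validation results):

  \hat{z}(s) = \frac{\sum_j \hat{z}_j(s) / \hat{\sigma}_j^2(s)}{\sum_j 1 / \hat{\sigma}_j^2(s)}

where \hat{z}_j(s) is the prediction produced by source j at location s and \hat{\sigma}_j^2(s) is its prediction error variance.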

In practice, merging multi-source predictions takes only a few more steps than in the case of multi-scale data. First, we fit models for the two scales independently:

> formulaString.l <- list(~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6, ~ DEMTOPx+TWITOPx+NVILANx)
> eberg_grids_spc <- spc(eberg_grids, formulaString.l)
Converting PRMGEO6 to indicators...
Converting covariates to principal components...
Converting covariates to principal components...
> glm.formulaString.l <- lapply(eberg_grids_spc,
+   FUN=function(x){as.formula(paste("logits(SNDMHT.t) ~ ", paste(names(x@predicted), collapse="+"),
+   "+ ns(altitude, df=4)"))})
> glm.formulaString.l
[[1]]
logits(SNDMHT.t) ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 +
    PC8 + PC9 + PC10 + PC11 + ns(altitude, df = 4)
<environment: 0x000000002a46cc58>
 
[[2]]
logits(SNDMHT.t) ~ PC1 + PC2 + PC3 + ns(altitude, df = 4)
<environment: 0x000000002ac47918>
> SNDMHT.ml <- fit.gstatModel(observations=eberg.geo,
+   glm.formulaString.l, lapply(eberg_grids_spc, slot, "predicted"),
+   methodid="SNDMHT.t", rvgm=NULL)

Second, we can produce predictions using a list of covariates at different scales:

> new3D.ml <- sapply(eberg_grids_spc, FUN=function(x){sp3D(x@predicted, stdepths=-0.025)})
> sd.ml <- predict(SNDMHT.ml, predictionLocations=new3D.ml, nfold=2, verbose=TRUE, mask.extra=FALSE)
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
Running 2-fold cross validation...
|==============================================================================================| 100%
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
Running 2-fold cross validation...
|==============================================================================================| 100%

In the third step, we can sum up the predictions using weighted averaging. The merge function will by default use the results of cross-validation to scale the prediction variances, which are then used as weights:

> sd.mlc <- merge(sd.ml[[1]], sd.ml[[2]], silent=FALSE)
Resampling 2 layers to " +init=epsg:31467 +p " with grid cell size: 25 ...
|==============================================================================================| 100%
Cross-validation RMSE (type = link):
observedValue_1 observedValue_2
 0.5713          0.6340

In this case the predictions based on the 100 m resolution data will receive slightly higher weights.

Predictions using multi-source data (produced using the function merge). Values from 4% to 96% sand (blue to red).

The results presented in the figure above confirm that this method is equally valid and leads to almost identical predictions as when all covariates are used together (compare with the previous section). The final map again mainly reflects the patterns of the coarser predictors, as these are more dominant. Assuming that the two predictions use completely independent covariates, the output should indeed be very similar to that of a method which combines the two models; in fact, also in regression-kriging, the predictions based on the covariates are summed with the kriging predictions of the residuals. In practice, however, the covariates may be correlated and the models may overlap, so the output of the merge function can be biased. Merging multi-source data is highly sensitive to the type of models used to generate the predictions and the associated assumptions (for a discussion see e.g. Gotway and Young, 2002; Gotway and Young, 2007). Nevertheless, this approach is attractive for global soil mapping applications where predictions can be produced for areas of various extents, or using completely independent methods that cannot be combined mathematically. For example, following this approach, one could build models and produce predictions of soil properties for the whole world at a coarse resolution of e.g. 1 km, and then use these predictions to fill in the gaps of a regional model at a resolution of 250 m or 100 m. It is important, however, to be aware that the covariates, even if they are at different scales, can be cross-correlated, and hence this approach might lead to biased predictions (over- or under-estimation).

Finally, the propagated prediction uncertainty of merged predictions is more difficult to estimate, as one cannot simply take the mean of the prediction (kriging) variances. Instead, geostatistical simulations can be used to produce multiple realizations (see the figure above). The propagated prediction error can then be derived by aggregating the multiple realizations per pixel (see e.g. the raster package).
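
A minimal sketch of such per-pixel aggregation, using the 20 realizations generated earlier (raster::calc applies a summary function across the layers of the RasterBrick):

> sim.sd <- calc(sd.loc@realizations, fun=sd)      ## per-pixel standard deviation across realizations
> sim.mean <- calc(sd.loc@realizations, fun=mean)  ## per-pixel mean of the realizations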

