# 3D soil property mapping using the GSIF package

*Edited by*: T. Hengl, B. Kempen and G.B.M. Heuvelink. Part of the
**ISRIC Spring School**.

The purpose of this tutorial is to demonstrate the major processing steps used within the GSIF framework for generating soil property and soil class maps from point data, with the help of multi-scale covariates. The project summary page of the GSIF R package can be found here. To learn more about the Global Soil Information Facilities (GSIF), visit the main project page. See also the complete list of **functions** available via the GSIF package. Some more sophisticated modelling techniques based on Machine Learning can be found here.

Download this tutorial as **R script**.

## Data screening

For demonstration purposes we use the Ebergötzen data set, which has been used by the SAGA GIS development team (Böhner et al., 2006; Hengl et al., 2012). To start the exercise, first install and load all required packages (see also: GSIF installation instructions). Note that GSIF is dependent on the plotKML package, which you should always first install from R-forge as these are both experimental packages:

> sessionInfo()

```
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
```

```
> library(aqp)
> library(raster)
> library(plotKML)
> library(GSIF)
```

```
Loading required package: RCurl
Loading required package: bitops
GSIF version 0.3-0 (2013-03-07)
URL: http://gsif.r-forge.r-project.org/
```

Load the input data:

```
> data(eberg)
> data(eberg_grid)
> data(eberg_grid25)
```

### Profile data

The Ebergötzen data set covers a 10 by 10 km case study area in central Germany. The `eberg` object contains soil profile observations, while `eberg_grid` and `eberg_grid25` contain covariate layers at 100 m and 25 m resolution. Before we proceed with the geostatistical analysis, we can run some initial data screening. We first look at the data structure:

> str(eberg)

```
'data.frame':   3670 obs. of  30 variables:
 $ ID      : Factor w/ 3692 levels "id0001","id0002",..: 3302 93 2827 1858 3539 3540 2828 94 1859 95 ...
 $ soiltype: Factor w/ 13 levels "A","B","D","G",..: NA NA NA NA NA NA NA NA NA NA ...
 $ TAXGRSC : Factor w/ 13 levels "Auenboden","Braunerde",..: NA NA NA NA NA NA NA NA NA NA ...
 $ X       : int  3569323 3569328 3569328 3569335 3569336 3569340 3569343 3569344 3569348 3569349 ...
 $ Y       : int  5716216 5715647 5716024 5715770 5716095 5716233 5716325 5715447 5714342 5716281 ...
 $ UHDICM_A: num  0 0 0 0 0 0 0 0 0 0 ...
 $ LHDICM_A: num  10 10 10 10 10 10 10 10 10 10 ...
 $ SNDMHT_A: num  20 20 18.8 20 16.6 12.5 20 20 20 16.5 ...
 $ SLTMHT_A: num  40 40 37.6 40 33.2 25 40 40 40 66.5 ...
 $ CLYMHT_A: num  40 40 37.6 40 33.2 25 40 40 40 21 ...
 $ UHDICM_B: num  10 10 10 10 10 10 10 10 10 10 ...
 $ LHDICM_B: num  30 30 30 30 30 30 30 30 30 30 ...
 $ SNDMHT_B: num  NA 20 NA NA NA NA NA 20 18.8 16.5 ...
 $ SLTMHT_B: num  NA 40 NA NA NA NA NA 40 37.6 66.5 ...
 $ CLYMHT_B: num  NA 40 NA NA NA NA NA 40 37.6 21 ...
 $ UHDICM_C: num  30 30 30 30 30 30 30 30 30 30 ...
 $ LHDICM_C: num  50 50 50 50 50 50 50 50 50 50 ...
 $ SNDMHT_C: num  NA 18.8 NA NA NA NA NA 18.8 NA NA ...
 $ SLTMHT_C: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ CLYMHT_C: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ UHDICM_D: num  50 50 50 50 50 50 50 50 50 50 ...
 $ LHDICM_D: num  70 70 70 70 70 70 70 70 70 70 ...
 $ SNDMHT_D: num  NA 18.8 NA NA NA NA NA 18.8 NA NA ...
 $ SLTMHT_D: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ CLYMHT_D: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ UHDICM_E: num  70 70 70 70 70 70 70 70 70 70 ...
 $ LHDICM_E: num  90 90 90 90 90 90 90 90 90 90 ...
 $ SNDMHT_E: num  NA 18.8 NA NA NA NA NA 18.8 NA NA ...
 $ SLTMHT_E: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
 $ CLYMHT_E: num  NA 37.6 NA NA NA NA NA 37.6 NA NA ...
```

This shows that the soil texture fractions SNDMHT, SLTMHT and CLYMHT have been sampled at fixed depth intervals (0–10, 10–30, 30–50, 50–70 and 70–90 cm; horizons A–E), and that `soiltype` contains 13 classes of soil types.

### Histogram plotting

Our focus in this exercise is spatial prediction of sand content over the whole volume of soil. Let us look at some summary statistics for the topsoil sand content:

> summary(eberg$SNDMHT_A)

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  4.062  18.800  20.000  30.260  42.300  92.500   3.000
```

```
> library(StatDA)
> par(mar=c(2.5,2.5,0.5,0.5), oma=c(0,0,0,0))
> edaplot(eberg$SNDMHT_A[!is.na(eberg$SNDMHT_A)], H.freq=TRUE, box=FALSE, S.pch=3, S.cex=0.5,
+    D.lwd=1.5, P.ylab="", P.log=FALSE, P.logfine=c(5,10), P.main="", P.xlab="", B.pch=3, B.cex=0.5)
```

From the plot above, we can also observe that many values overlap: only a limited number of clusters of values seems possible for `SNDMHT`. Going back to the origin of this data set, we notice that the sand, silt and clay values have been determined using the so-called texture-by-hand method, i.e. via texture classes. The literature (Skaggs et al., 2001) shows that this technique can determine the content of soil texture fractions only to an accuracy of ±5–10%. This means that we should not plan to map any of the texture fractions to a precision better than ±5% (the detection limit), because that would exceed the measurement error.
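As a hypothetical illustration of this detection limit, predictions could be reported at a 5% step rather than with spurious decimal precision (the `pred` values below are invented):

```r
# Hypothetical sand-content predictions (%); values invented for illustration
pred <- c(17.3, 21.9, 44.6)

# Snap to the nearest 5% step, matching the ~5% accuracy of texture-by-hand
round(pred / 5) * 5
```

This mirrors how the measured values themselves cluster around texture-class centres.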

This is a relatively large point set (3670 points), so it might be a good idea to subset it (e.g. take only 30% of the samples) to speed up processing:

```
> eberg.xy <- eberg[runif(nrow(eberg)) < .3,]
> coordinates(eberg.xy) <- ~X+Y
> proj4string(eberg.xy) <- CRS("+init=epsg:31467")
> gridded(eberg_grid) <- ~x+y
> proj4string(eberg_grid) <- CRS("+init=epsg:31467")
> gridded(eberg_grid25) <- ~x+y
> proj4string(eberg_grid25) <- CRS("+init=epsg:31467")
```

### Testing spatial randomness and point clustering

Prior to the geostatistical analysis, it is useful to assess how representative and consistent the soil profile data are. To achieve this, we can run some basic analysis of the point geometry and then overlay the points with the predictors to see how well the environmental features are represented. First, we can test whether the points represent the geographical space in an unbiased way (see chapter 7 in Bivand et al., 2008):

```
> library(spatstat)
> mg_owin <- as.owin(data.frame(x=as.data.frame(eberg_grid)[,"x"],
+    y=as.data.frame(eberg_grid)[,"y"], window = TRUE))
> ## note: marks set to sand content (the original code referred to a
> ## non-existing "zinc" column); marks are not used by Gest anyway
> eberg.ppp <- ppp(x=coordinates(eberg.xy)[,1], y=coordinates(eberg.xy)[,2],
+    marks=eberg.xy$SNDMHT_A, window=mg_owin)
> env.eberg.xy <- envelope(eberg.ppp, fun=Gest)
```

```
Generating 99 simulations of CSR ...
1, 2, 3, ..., 98, 99.
Done.
```

```
> par(mar=c(4.5,4.5,0.5,0.5), oma=c(0,0,0,0))
> plot(env.eberg.xy, lwd=list(3,1,1,1), main="")
```

```
     lty col  key      label                                           meaning
obs    1   1  obs  G[obs](r)          observed value of G(r) for data pattern
theo   2   2 theo G[theo](r)               theoretical value of G(r) for CSR
hi     1   8   hi   G[hi](r) upper pointwise envelope of G(r) from simulations
lo     1   8   lo   G[lo](r) lower pointwise envelope of G(r) from simulations
```

which shows that the point samples do not exactly satisfy the Complete Spatial Randomness test (spatstat package) — samples at larger distances have a lower spatial density than a completely random design, which usually means that some parts of the case study are under-sampled (see also figure below). On the other hand, points at shorter distances (<75 m) would pass the CSR test, which indicates that, at least at shorter distances, there is no clustering in geographical space.

### Testing the feature space coverage

To see how representative the point data are of the feature space covered by the covariates, we use the MaxEnt function, which extends methods available via the dismo package:

```
> jar <- paste(system.file(package="dismo"), "/java/maxent.jar", sep='')
> if (file.exists(jar)) {
+   me.eberg <- MaxEnt(occurrences=eberg.ppp, covariates=eberg_grid)
+   par(mfrow=c(1,2), mar=c(0.5,0.5,0.5,0.5), oma=c(0,0,0,0))
+   image(as(me.eberg@predicted, "SpatialPixelsDataFrame"),
+      col=rev(heat.colors(25)), xlab="", ylab="")
+   points(me.eberg@occurrences, pch="+", cex=.7)
+   image(me.eberg@sp.domain, col="grey", xlab="", ylab="")
+ }
```

This shows that some parts of the study area (higher elevations, specific land cover types) have been systematically omitted from sampling. The map on the right shows which pixels (grey) are actually valid for running predictions, i.e. where the probability of occurrence of sampling points (at least based on the feature space defined by the covariates) is statistically significant. Ideally, the whole map should be grey, meaning that no part of the feature space has been under-represented. This is nothing that should worry us too much, but something we need to be aware of when interpreting the produced maps. Note: to run MaxEnt, you first need to obtain maxent.jar and place it somewhere where dismo can find it, e.g. under library/dismo/java/maxent.jar. To learn more about MaxEnt, refer to the dismo package vignette (Hijmans and Elith, 2012).

A much better way to design sampling would be to use some type of probability or statistical sampling. For example, to optimally allocate samples in feature space (to equally represent all environmental conditions) we can use the clhs package, developed and maintained by Pierre Roudier. This package implements conditioned Latin hypercube sampling, as published by Minasny and McBratney (2006). Here is an example for the Ebergötzen data set:

```
> install.packages("clhs")
> library(clhs)
> library(plotKML)
> library(GSIF)
> library(sp)
> data(eberg_grid)
> gridded(eberg_grid) <- ~x+y
> proj4string(eberg_grid) <- CRS("+init=epsg:31467")
> formulaString <- ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
> eberg_spc <- spc(eberg_grid, formulaString)
Converting PRMGEO6 to indicators...
Converting covariates to principal components...
> sample100 <- clhs(eberg_spc@predicted@data[,1:7], size=100) ## takes ca 3 minutes!
  |================================================| 100%
> sample100.sp <- SpatialPointsDataFrame(eberg_grid@coords[sample100,],
+    data.frame(x=rep(1, length(sample100))))
> proj4string(sample100.sp) <- CRS("+init=epsg:31467")
> plotKML(sample100.sp)
```

Note that the points are now more evenly spread, so that forest patches are also represented in the sampling design. Consequently, spatial modelling based on this sampling would be more representative of all environmental conditions. Note also that the clhs package allows you to specify sampling costs per pixel, so that a conditional sampling design can be generated as a trade-off between good representation and cost minimization.

## Model fitting and spatial predictions

Any geostatistical mapping boils down to two major processes: (a) model fitting (or model estimation), and (b) generation of spatial predictions. In the GSIF package, geostatistical mapping is likewise implemented via two main functions: fit.gstatModel and predict.gstatModel. In the following sections we demonstrate how to fit models and generate predictions using soil profile data and a diversity of covariate layers typical for soil mapping applications. The functions fit.gstatModel and predict.gstatModel largely extend the functionality of the stats, rpart and gstat packages. The gstat package (see chapter 8 in Bivand et al., 2008) is still considered by many to be one of the most comprehensive packages for geostatistical modelling.

### Preparing point data for geostatistical analysis

Before we build a regression-kriging model, we need to prepare the soil observations in a format suitable for fitting 3D models. We start by converting the `eberg` data frame to the SoilProfileCollection class (aqp package) and then to the geosamples class (GSIF package):

```
> # select columns of interest:
> s.lst <- c("ID", "soiltype", "TAXGRSC", "X", "Y")
> h.lst <- c("UHDICM","LHDICM","SNDMHT","SLTMHT","CLYMHT")
> sites <- eberg[,s.lst]
> horizons <- getHorizons(eberg, idcol="ID", sel=h.lst)
> eberg.spc <- join(horizons, sites, type='inner')
```

Joining by: ID

```
> depths(eberg.spc) <- ID ~ UHDICM + LHDICM
> site(eberg.spc) <- as.formula(paste("~", paste(s.lst[-1], collapse="+"), sep=""))
```

Warning message: converting IDs from factor to character

```
> coordinates(eberg.spc) <- ~X+Y
> proj4string(eberg.spc) <- CRS("+init=epsg:31467")
```

The GSIF package by default works with the geosamples class as the main class for field observations (points). Conversion of the SoilProfileCollection data to geosamples is straightforward:

```
> eberg.spc@horizons$SNDMHT.t <- log((eberg.spc@horizons$SNDMHT/100) /
+    (1-eberg.spc@horizons$SNDMHT/100))
> eberg.geo <- as.geosamples(eberg.spc)
```

Reprojecting to +proj=longlat +datum=WGS84 ...

> str(eberg.geo)

```
Formal class 'geosamples' [package "GSIF"] with 3 slots
  ..@ registry: chr NA
  ..@ methods :'data.frame':   6 obs. of  4 variables:
  .. ..$ methodid      : Factor w/ 6 levels "CLYMHT","SLTMHT",..: 1 2 3 4 5 6
  .. ..$ description   : logi [1:6] NA NA NA NA NA NA
  .. ..$ units         : logi [1:6] NA NA NA NA NA NA
  .. ..$ detectionLimit: logi [1:6] NA NA NA NA NA NA
  ..@ data    :'data.frame':   7854 obs. of  14 variables:
  .. ..$ observationid   : chr [1:7854] NA NA NA NA ...
  .. ..$ sampleid        : chr [1:7854] "id0009" "id0017" "id0047" "id0082" ...
  .. ..$ longitude       : num [1:7854] 10.1 10.1 10.2 10.2 10.2 ...
  .. ..$ latitude        : num [1:7854] 51.5 51.6 51.6 51.6 51.6 ...
  .. ..$ locationError   : num [1:7854] NA NA NA NA NA NA NA NA NA NA ...
  .. ..$ TimeSpan.begin  : POSIXct[1:7854], format: ...
  .. ..$ TimeSpan.end    : POSIXct[1:7854], format: ...
  .. ..$ altitude        : num [1:7854] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..$ altitudeMode    : chr [1:7854] "relativeToGround" "relativeToGround" ...
  .. ..$ sampleArea      : num [1:7854] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ sampleThickness : num [1:7854] 2 2 2 2 2 2 2 2 2 2 ...
  .. ..$ observedValue   : chr [1:7854] "B" "L" NA "K" ...
  .. ..$ methodid        : Factor w/ 6 levels "CLYMHT","SLTMHT",..: 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ measurementError: num [1:7854] NA NA NA NA NA NA NA NA NA NA ...
```

> levels(eberg.geo@data$methodid)

```
[1] "CLYMHT"   "SLTMHT"   "SNDMHT"
[4] "SNDMHT.t" "soiltype" "TAXGRSC"
```

The geosamples class can be considered the most plain (standard) format for any space-time observations and, as such, is highly suitable for storing free-form geosamples. The main advantage of using this class in R is that it can be easily manipulated and converted to spatial and aqp classes. Likewise, the column names in the `@data` slot correspond to the tag names used in the KML schema, which makes it easier to export such data to a GIS or Google Earth. Note also that the idea of using geosamples is to allow users to extend the possibilities of geostatistical analysis with parameters such as horizontal and vertical support (`sampleArea` and `sampleThickness`) and uncertainty measures (`locationError` and `measurementError`).

### Generating soil predictive components

Prior to geostatistical modelling, it is probably also a good idea to convert all covariates to independent components. This way, it will be easier to subset to the optimal number of predictors during the regression analysis. PCA also helps reduce the prediction bias that can arise when covariates are cross-correlated. The wrapper function spc converts all factor variables to indicators and runs PCA on a stack of grids:
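A minimal base-R sketch of the idea behind spc, using simulated covariates (the names and coefficients below are made up): cross-correlated inputs become mutually uncorrelated principal components.

```r
set.seed(1)
# Simulate three cross-correlated covariates (hypothetical stand-ins for
# elevation, wetness index and surface temperature):
elev <- rnorm(500)
twi  <- -0.8*elev + rnorm(500, sd=0.3)
temp <-  0.5*elev + rnorm(500, sd=0.5)
X    <- cbind(elev, twi, temp)

# PCA on the scaled covariates, conceptually what spc() does after
# converting factors to indicators:
pcs <- prcomp(X, scale.=TRUE)$x

cor(X)["elev","twi"]                      # strong raw correlation
max(abs(cor(pcs)[upper.tri(cor(pcs))]))   # PC scores are uncorrelated (~0)
```

Because the PCs are orthogonal, dropping the weakest components during regression does not discard information shared with the retained ones.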

```
> formulaString <- ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
> eberg_spc <- spc(eberg_grid, formulaString)
```

```
Converting PRMGEO6 to indicators...
Converting covariates to principal components...
```

The output of this operation is a stack of independent components, all numeric and all scaled around the 0 value. To see which inputs define each component, we can inspect the rotation table or plot the images:

> eberg_spc@pca$rotation

```
                  PC1         PC2         PC3         PC4         PC5         PC6           PC7         PC8
DEMSRT6   -0.55711093  0.20819653 -0.03119766  0.06485905 -0.04919198  0.04583550 -2.166881e-02 -0.11850187
TWISRT6    0.45829444  0.23870553 -0.22013669 -0.05930392 -0.05772050  0.05289886  4.749496e-02  0.16709268
TIRAST6    0.31720111  0.23411688  0.21194747  0.22670979 -0.06911383  0.01606945 -5.050788e-02 -0.86045157
PRMGEO6_1 -0.14135533 -0.03765647  0.60672337  0.34711053  0.49254843  0.02022393  1.512718e-01  0.12762671
PRMGEO6_2 -0.43416755  0.28698148 -0.47280406  0.01002071  0.02500768 -0.10026014  3.879591e-02 -0.21151385
PRMGEO6_3  0.07971922  0.06542432 -0.07672724 -0.04496406 -0.06289324  0.35365015  8.983826e-01 -0.02270445
PRMGEO6_4  0.36654086  0.41740804 -0.16408783 -0.03037150  0.41383444 -0.11009869 -1.480861e-01  0.18050170
PRMGEO6_5 -0.02765866  0.09015600  0.01774654  0.06338709 -0.08345800  0.89976730 -3.708044e-01  0.07892063
PRMGEO6_6 -0.01705511 -0.01720833  0.38970849 -0.80790529 -0.16623181 -0.01556903 -3.792439e-02 -0.11209884
PRMGEO6_7  0.16045128 -0.72535140 -0.26755675  0.17295113 -0.06543613  0.01762360 -4.878972e-02 -0.14541877
PRMGEO6_8  0.05464832  0.21913031  0.23958662  0.36153522 -0.72948460 -0.19250233 -6.541899e-05  0.28913001
                   PC9        PC10          PC11
DEMSRT6   -0.062531842  0.78623052  1.419406e-15
TWISRT6   -0.773340893  0.21597904 -7.371827e-16
TIRAST6    0.004370069  0.01648279  3.410741e-16
PRMGEO6_1 -0.311806003 -0.06650597  3.258389e-01
PRMGEO6_2 -0.210767393 -0.44338968  4.555854e-01
PRMGEO6_3  0.186242372  0.05142616  8.656808e-02
PRMGEO6_4  0.452668089  0.23662602  4.059308e-01
PRMGEO6_5  0.035319735 -0.10118848  1.225425e-01
PRMGEO6_6 -0.067946642  0.04174463  3.828709e-01
PRMGEO6_7 -0.024254803  0.25057179  5.051019e-01
PRMGEO6_8  0.101594844 -0.02238362  3.118692e-01
```

```
> pal = rev(rainbow(65)[1:48])
> rd = range(eberg_spc@predicted@data[,1], na.rm=TRUE)
> spplot(eberg_spc@predicted[1:4], at=seq(rd[1], rd[2], length.out=48), col.regions=pal)
```

### Model fitting

In the GSIF package, a regression-kriging model can be fitted in one step using the fit.gstatModel function. First, we need to specify the model that explains the distribution of the soil property as a function of the covariates:

```
> logits = function(x){log((x/100)/(1-x/100))}
> ## Note: we use the logit transform to prevent the output predictions
> ## from falling outside the physical range 0-100%;
> glm.formulaString = as.formula(paste("SNDMHT.t ~ ",
+    paste(names(eberg_spc@predicted), collapse="+"), "+ ns(altitude, df=4)"))
> glm.formulaString
```

SNDMHT.t ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + ns(altitude, df = 4)

In other words, the observed values are modelled as a function of the PCs and altitude (natural splines via the ns function). The ns function is only one of several functions in R suitable for fitting soil-depth relationships; another option is, for example, the restricted cubic splines (rcs) function in the rms package. Note that the logit-transformed value of the target variable is required to prevent predictions from falling outside the 0-100% range.
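To see why the logit transform achieves this, note that it maps (0, 100)% one-to-one onto the whole real line, so any value predicted on the transformed scale back-transforms into the valid range. A small self-contained check:

```r
# Forward (logit) and inverse (invlogit) transforms for percentages
logits   <- function(x){ log((x/100)/(1 - x/100)) }
invlogit <- function(x){ 100*exp(x)/(1 + exp(x)) }

sand <- c(4.062, 20, 50, 92.5)   # sand contents within the observed range
z <- logits(sand)

all.equal(invlogit(z), sand)     # the round-trip recovers the original values
invlogit(max(z) + 10)            # even a large upward shift stays below 100
```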

In the GSIF package, the 3D GLM-kriging model can be fitted at once by running:

> SNDMHT.m <- fit.gstatModel(observations=eberg.geo, glm.formulaString, covariates=eberg_spc@predicted)

```
Fitting a linear model...
Warning: Shapiro-Wilk normality test and Anderson-Darling normality test report probability of < .05 indicating lack of normal distribution for residuals
Fitting a 3D variogram...
Warning: singular model in variogram fit
Saving an object of class 'gstatModel'...
```

> summary(SNDMHT.m@regModel)

```
Call:
glm(formula = SNDMHT.t ~ PC1 + PC4 + PC5 + PC7 + PC8 + PC9 +
    PC10 + PC11 + ns(altitude, df = 4), family = fit.family, data = rmatrix)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.7863 -0.4729 -0.0251  0.3674  3.6106

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)           -1.231e+00  6.966e-02 -17.671  < 2e-16 ***
PC1                   -3.040e-01  3.567e-02  -8.522  < 2e-16 ***
PC4                   -4.412e-01  2.218e-02 -19.889  < 2e-16 ***
PC5                   -1.566e-01  2.256e-02  -6.943 6.33e-12 ***
PC7                   -1.420e-01  2.462e-02  -5.769 1.02e-08 ***
PC8                   -2.449e-01  2.789e-02  -8.782  < 2e-16 ***
PC9                   -1.502e-01  3.967e-02  -3.786 0.000161 ***
PC10                   3.631e-01  1.008e-01   3.602 0.000329 ***
PC11                  -1.429e+14  2.304e+13  -6.200 7.82e-10 ***
ns(altitude, df = 4)1  1.063e-01  1.163e-01   0.914 0.361084
ns(altitude, df = 4)2  1.563e-01  9.725e-02   1.607 0.108338
ns(altitude, df = 4)3  2.334e-01  1.549e-01   1.506 0.132295
ns(altitude, df = 4)4  2.488e-01  7.869e-02   3.162 0.001606 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.7236981)

    Null deviance: 1361.91  on 1190  degrees of freedom
Residual deviance:  852.52  on 1178  degrees of freedom
  (594 observations deleted due to missingness)
AIC: 3009.7

Number of Fisher Scoring iterations: 2
```

> SNDMHT.m@vgmModel

```
  model     psill    range kappa ang1 ang2 ang3 anis1   anis2
1   Nug 0.0000000  0.00000   0.0    0    0    0     1 1.00000
2   Exp 0.8202215 75.26947   0.5    0    0    0     1 0.00015
```

These results show that the model is significant, both for the regression (linear model) part and for the variogram part. Note, however, that this is not a complete 3D regression-kriging, because we do not actually have values of the PCs at different depths (in fact, most of the PCs relate only to the surface), so the values of the covariates at the land surface are basically copied to all depths. This does not pose any problem for the GLM modelling, but you should be aware that, because the values of the covariates are fixed across depths, the resulting 3D patterns in the target variable will be mainly controlled by the surface patterns.

The GSIF package also allows you to fit regression tree and random forest models for the regression part of the model. These are implemented via the rpart and randomForest packages. For example, to fit a regression tree, we set the `method` argument as follows:

> SNDMHT.m2 <- fit.gstatModel(observations=eberg.geo, glm.formulaString, covariates=eberg_spc@predicted, method="rpart")

```
Fitting a regression tree model...
Estimated Complexity Parameter (for prunning): 0.01352
Warning: Shapiro-Wilk normality test and Anderson-Darling normality test
report probability of < .05 indicating lack of normal distribution for residuals
Fitting a 3D variogram...
Saving an object of class 'gstatModel'...
```

### Spatial predictions

Once we have fitted a gstatModel, we can generate predictions and estimate the associated uncertainty at any depth. The next step is therefore to prepare the 3D prediction locations, i.e. the grid cells that need to be mapped. In the GSIF package, this is done using the sp3D function:

> new3D <- sp3D(eberg_spc@predicted)

This prepares the existing SPCs for 3D predictions (six standard depths, at a block support corresponding to the size of standard horizons) by adding an altitude column:

> str(new3D[[1]]@grid)

```
Formal class 'GridTopology' [package "sp"] with 3 slots
  ..@ cellcentre.offset: Named num [1:3] 3.57e+06 5.71e+06 -2.50e-02
  .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
  ..@ cellsize         : Named num [1:3] 1e+02 1e+02 5e-02
  .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
  ..@ cells.dim        : Named int [1:3] 100 100 1
  .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
```

We can now run predictions for each standard depth in a loop:

> sd.l <- lapply(new3D, FUN=function(x){predict(SNDMHT.m, predictionLocations=x, nfold=0)})

```
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
...
```

This operation can take time depending on the size of the grids and number of 3D points used to generate predictions.

The final predictions can be back-transformed to the 0-100% scale by (Diggle and Ribeiro, 2007, p. 148):

```
> invlogit = function(x){exp(x)/(1+exp(x))*100}
> invlogit.m = function(x, v){((1+exp(-x))^(-1)-.5*v*exp(-x)*(1-exp(-x))*(1+exp(-x))^(-3))*100}
> for(j in 1:length(sd.l)){
+   sd.l[[j]]@predicted$M <- round(invlogit.m(sd.l[[j]]@predicted$SNDMHT.t,
+      sd.l[[j]]@predicted$var1.var))
+   sd.l[[j]]@predicted$L <- round(invlogit(sd.l[[j]]@predicted$SNDMHT.t -
+      1.645*sqrt(sd.l[[j]]@predicted$var1.var)))
+   sd.l[[j]]@predicted$U <- round(invlogit(sd.l[[j]]@predicted$SNDMHT.t +
+      1.645*sqrt(sd.l[[j]]@predicted$var1.var)))
+ }
> str(sd.l[[1]]@predicted@data)
```

```
'data.frame':   10000 obs. of  7 variables:
 $ var1.pred    : num  -1.94 -1.94 -1.94 -1.64 -1.61 ...
 $ var1.var     : num  0.279 0.278 0.278 0.252 0.252 ...
 $ SNDMHT.t.svar: num  0.244 0.243 0.243 0.221 0.22 ...
 $ SNDMHT.t     : num  -1.94 -1.94 -1.94 -1.64 -1.61 ...
 $ M            : num  14 14 14 17 18 18 19 19 20 20 ...
 $ L            : num  6 6 6 8 8 8 8 9 9 10 ...
 $ U            : num  26 26 25 31 31 31 33 33 33 33 ...
```

In this case we predict the mean value (M) as well as the lower (L) and upper (U) limits of the 90% confidence interval.
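The multiplier 1.645 used above is the standard normal quantile for a two-sided 90% interval. A small stand-alone check with a hypothetical prediction on the logit scale (`pred` and `pred.var` are invented values, not taken from the model output):

```r
invlogit <- function(x){ 100*exp(x)/(1 + exp(x)) }

z90 <- qnorm(0.95)          # ~1.645, so P(L < value < U) is about 0.90
pred <- -1.4                # hypothetical kriging prediction (logit scale)
pred.var <- 0.25            # hypothetical kriging variance

L <- invlogit(pred - z90*sqrt(pred.var))
U <- invlogit(pred + z90*sqrt(pred.var))
c(L=L, U=U)                 # 90% limits back-transformed to the 0-100% scale
```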

Finally, we can prepare the produced predictions and export them as a GlobalSoilMap-class object. Before we can save the output object as a GlobalSoilMap-class object, we need to reproject the produced predictions to geographical coordinates (WGS84), which is possible using the make.3Dgrid function:

```
> p = get("cellsize", envir = GSIF.opts)[2]
> s = get("stdepths", envir = GSIF.opts)
> s
```

[1] -0.025 -0.100 -0.225 -0.450 -0.800 -1.500

> sd.ll <- sapply(1:length(sd.l), FUN=function(x){make.3Dgrid(sd.l[[x]]@predicted[c("L","M","U")], pixelsize=p, stdepths=s[x])})

Resampling 3 layers to +proj=longlat +datum=WGS84 with grid cell size of 0.000833333333333333 ...

The object sd.ll can now be saved as a GlobalSoilMap-class object:

```
> SNDMHT.gsm <- SoilGrids(obj=sd.ll, varname="SNDPPT",
+    TimeSpan=list(begin="1999-02-01", end="2001-07-01"))
> str(SNDMHT.gsm, max.level=2)
```

```
Formal class 'SoilGrids' [package "GSIF"] with 8 slots
  ..@ varname : chr "SNDPPT"
  ..@ TimeSpan:List of 2
  ..@ sd1     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd2     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd3     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd4     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd5     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ sd6     :Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
```

> save(SNDMHT.gsm, file="SNDMHT.rda", compress="xz")

and visualized in Google Earth by using the plotKML package functionality:

```
> z0 = mean(eberg_grid$DEMSRT6, na.rm=TRUE)
> for(j in 1:length(sd.ll)){
+   kml(slot(SNDMHT.gsm, paste("sd", j, sep="")),
+      folder.name = paste("eberg_sd", j, sep=""),
+      file = paste("SNDMHT_sd", j, ".kml", sep=""),
+      colour = M, z.lim=c(10,85),
+      raster_name = paste("SNDMHT_sd", j, ".png", sep=""),
+      altitude = z0+5000+(s[j]*2500))
+ }
```

```
KML file header opened for parsing...
Parsing to KML...
Closing SNDMHT_sd1.kml
...
KML file header opened for parsing...
Parsing to KML...
Closing SNDMHT_sd6.kml
```

To check how accurate these maps are, we can use the validate method. This will iteratively refit the gstatModel and derive the residual error at the validation points. The validate method is a true cross-validation, as the validation points are masked out from the model fitting, unlike gstat's krige.cv, which uses a constant model fitted on all data.
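The masking can be sketched in plain base R: points are split into k folds, and the fold being validated never enters the fit (a conceptual sketch only; validate handles this internally):

```r
set.seed(42)
n <- 20; k <- 5
folds <- sample(rep(1:k, length.out=n))   # assign each point to one of k folds

held_out <- which(folds == 1)             # validation points for fold 1
training <- which(folds != 1)             # the model is refitted on these only

length(intersect(held_out, training))     # 0: no leakage into the fit
```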

> rk.cv <- validate(SNDMHT.m)

```
Estimating the predictionDomain...
Estimated nearest neighbour distance (point pattern): 126
Running 5-fold cross validation...
Generating predictions using the trend model (KED method)...
[using universal kriging]
100% done
Generating predictions using the trend model (KED method)...
[using universal kriging]
...
```

```
> tvar <- 1-var(rk.cv[[1]]$residual, na.rm=T)/var(rk.cv[[1]]$observed, na.rm=T)
> signif(tvar*100, 3)
```

[1] 61.2

This shows that the model explains about 61% of the variability in the target variable at the validation points.

### Visualizing the uncertainty of predictions

According to the GlobalSoilMap.net specifications, a requirement for submitting predicted soil property maps is to estimate the 90% upper and lower confidence limits and attach them to the predictions (thus, two extra maps). Using the methodology described above (i.e. the gstatModel class objects), one can derive confidence intervals for any arbitrary point in the 3D continuum. Here is an example of this calculation for an arbitrary 3D point:

```
> loc <- eberg_spc@predicted[1200:1201,]
> new3D.loc <- sp3D(loc)
> str(new3D.loc[[1]])
```

```
Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ data       :'data.frame':   2 obs. of  11 variables:
  .. ..$ PC1 : num [1:2] 3.09 -2.42
  .. ..$ PC2 : num [1:2] 1.601 0.823
  .. ..$ PC3 : num [1:2] -0.576 -1.176
  .. ..$ PC4 : num [1:2] -0.09268 -0.00844
  .. ..$ PC5 : num [1:2] 1.0204 0.0536
  .. ..$ PC6 : num [1:2] -0.23 -0.189
  .. ..$ PC7 : num [1:2] -0.312 0.0839
  .. ..$ PC8 : num [1:2] 0.221 -0.15
  .. ..$ PC9 : num [1:2] -0.218 -0.222
  .. ..$ PC10: num [1:2] -0.1249 0.0608
  .. ..$ PC11: num [1:2] -4.66e-15 7.88e-15
  ..@ coords.nrs : num(0)
  ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
  .. .. ..@ cellcentre.offset: Named num [1:3] 3.57e+06 5.72e+06 -2.50e-02
  .. .. .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
  .. .. ..@ cellsize         : Named num [1:3] 9900 100 0.05
  .. .. .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
  .. .. ..@ cells.dim        : Named int [1:3] 2 2 1
  .. .. .. ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "altitude"
  ..@ grid.index : int [1:2] 2 3
  ..@ coords     : num [1:2, 1:3] 3.58e+06 3.57e+06 5.72e+06 5.72e+06 -2.50e-02 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:3] "longitude" "latitude" "altitude"
  ..@ bbox       : num [1:3, 1:2] 3565100 5716700 -4950 3584900 5716900 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:3] "longitude" "latitude" "altitude"
  .. .. ..$ : chr [1:2] "min" "max"
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
  .. .. ..@ projargs: chr " +init=epsg:31467 +proj=tmerc +lat_0=0 +lon_0=9 +k=1 +x_0=3500000 +y_0=0 +ellps=bessel __truncated__
```

> sd.loc <- predict(SNDMHT.m, predictionLocations=new3D.loc[[1]], nfold=0, subset.observations = SNDMHT.m@sp@coords[,1])

```
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
100% done
```

```
> int90 <- sd.loc@predicted$SNDMHT.t[1] +
+    c(-1.645, 1.645)*sqrt(sd.loc@predicted$var1.var[1])
> invlogit(int90)
```

[1] 10.97902 26.44757

> new3D.loc[[1]]@coords[1,]

```
  longitude    latitude    altitude
3579950.000 5716850.000      -0.025
```

This means that the 90% confidence interval of the predictions at this 3D location is 11-26% sand. Because the gstatModel is essentially a global 3D model for this soil property, we can estimate the uncertainty at any new 3D location; the only requirements are that the values of the covariates at that location are known and that the location is submitted as a `SpatialPixelsDataFrame` object.

To visualize uncertainty, you can consider using geostatistical simulations:

```
> ## note: using eberg.geo created above (the original code referred to SNDMHT.geo)
> SNDMHT.xy <- spTransform(eberg.geo, CRS("+init=epsg:31467"))
> sel.stripe <- eberg_spc@predicted@coords[,2] > min(SNDMHT.xy@coords[,2]) # 2400 locations
> loc <- eberg_spc@predicted[sel.stripe,]
> new3D.loc <- sp3D(loc)
> sd.loc <- predict(SNDMHT.m, predictionLocations=new3D.loc[[1]], nsim=20, block=c(0,0,0))
```

```
Generating 20 conditional simulations using the trend model (RK method)...
drawing 20 GLS realisations of beta...
[using conditional Gaussian simulation]
100% done
```

> sd.loc@realizations <- calc(sd.loc@realizations, fun=invlogit)

This produces an object of class `RasterBrickSimulations`, which can then be visualized using the plotKML package:

> str(sd.loc, max.level=2)

```
Formal class 'RasterBrickSimulations' [package "plotKML"] with 3 slots
  ..@ variable    : chr "SNDMHT"
  ..@ sampled     :Formal class 'SpatialLines' [package "sp"] with 3 slots
  ..@ realizations:Formal class 'RasterBrick' [package "raster"] with 12 slots
```

> plotKML(sd.loc, file="SNDMHT_sims.kml", zlim=c(10,85))

```
KML file header opened for parsing...
Reprojecting to +proj=longlat +datum=WGS84 ...
...
Parsing to KML...
Reprojecting to +proj=longlat +datum=WGS84 ...
Parsing to KML...
Closing SNDMHT_sims.kml
```

The resulting simulations are shown in the figure below. Note that geostatistical simulations at block support can be very time consuming, so running this code for millions of locations is probably not a good idea. One of the purposes of the GSIF package for R is to allow for faster methods to run geostatistical simulations (further development will likely go via Python).

### Predicting soil classes

The GSIF package also provides functionality for pedometric mapping of soil classes. Soil types can be mapped using the wrapper functions spmultinom and spfkm. These run supervised fuzzy k-means using a list of covariate layers provided as a `SpatialPixelsDataFrame` object, plus optional class centres and class variances. As in the case of continuous/numeric variables, the process consists of model fitting and prediction; in the GSIF package the two are wrapped into a single function:

> formulaString = soiltype ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10
> eberg_sm <- spmultinom(formulaString, eberg.xy, eberg_spc@predicted)

```
Fitting a multinomial logistic regression model...
# weights:  144 (121 variable)
initial  value 1853.740361
iter  10 value 1200.197532
iter  20 value 1167.479984
iter  30 value 1156.517152
iter  40 value 1150.287170
iter  50 value 1146.787227
iter  60 value 1144.369824
iter  70 value 1144.156270
iter  80 value 1143.913263
iter  90 value 1143.907377
final  value 1143.907359
converged
Estimated Cohen Kappa (weighted): 0.3116
Warning message:
In multinom(formulaString, ov, ...) : group ‘Hw’ is empty
```

The spmultinom function estimates the distribution of soil types by fitting a multinomial logistic regression model (the multinom function from the nnet package; Venables and Ripley, 2002). The output is an object of class SpatialMemberships, which contains the predictions, the model parameters, the class centres and variances, and the membership maps derived per class. The resulting kappa statistic shows that the predicted classes match ca 31% of the observed classes, which is a typically low value for soil taxonomic data.
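For intuition on what a kappa of this size means, the (unweighted) Cohen's kappa can be computed from a confusion matrix of observed versus predicted classes. The sketch below uses a hypothetical 2-class confusion matrix, not the Ebergötzen data:

```r
# Hypothetical confusion matrix: rows = observed, columns = predicted
tb <- matrix(c(30, 10,
               15, 45), nrow = 2, byrow = TRUE)

po <- sum(diag(tb)) / sum(tb)                     # observed agreement
pe <- sum(rowSums(tb) * colSums(tb)) / sum(tb)^2  # agreement expected by chance
kappa <- (po - pe) / (1 - pe)                     # chance-corrected agreement
```

Kappa corrects the raw percentage of matching classes for agreement expected by chance, which is why a kappa of 0.31 corresponds to a "ca 31% match" only loosely.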

Class centres and variances can also be passed via the `class.c` and `class.sd` arguments of the spfkm function (an alternative to spmultinom), in which case spfkm will directly derive the supervised fuzzy k-means.

To see the estimated class centers:

> eberg_sm@class.c

```
          PC1         PC2        PC3         PC4         PC5         PC6         PC7          PC8         PC9
A  -1.9136970  0.42290575 -0.8199637  0.06929519  0.14254910 -0.29501667  0.04445353 -0.569246722  0.59515829
B   0.1451428  0.05149812  1.0232911 -1.38193489 -0.88743587 -0.15377161 -0.07627217 -0.077118762 -0.05831949
D  -0.4773107  0.33677264  1.5855751  1.21311891 -0.20838896 -0.24014369  0.29626729  0.816515294 -0.34327845
G   3.1222982  1.48675002 -0.9355987 -0.37516028  1.06098275 -0.21828971 -0.22613231  1.241218926 -0.71454825
Ha -2.0722482  0.87009506 -1.0655838  0.03860135 -0.02849083  0.41575214  0.20629221 -0.233234399 -0.16586679
L   1.8885517  0.03098325 -0.4946663  0.10999082  0.51145092 -0.12456134 -0.21485459  0.006779172  0.06215298
Q   0.4974576 -1.57456661 -0.2174762  0.53124815 -0.14682206  0.01991976 -0.15147949 -0.951987328  0.45050045
R  -2.1043180  0.50564145 -1.0560584 -0.05346994  0.12978516 -0.25752102  0.08547869 -0.075669733  0.18523146
S   0.7410611 -1.00398803 -0.6111388  0.14103309  0.15076192 -0.03602664 -0.11831015  0.349219067  0.05584245
Z  -1.4125863  0.03343204  1.7252433  1.14667200  0.72181439 -0.13599257  0.29550948  0.118988806  0.29309998
          PC10
A  -1.23773329
B  -0.01363216
D   0.03916544
G  -0.09389767
Ha -0.02458035
L  -0.11880316
Q   0.12662896
R  -0.79474759
S   0.26756947
Z  -0.15280722
```

> row.names(eberg_sm@class.c)

[1] "A" "B" "D" "G" "Ha" "L" "Q" "R" "S" "Z"

Mapped soil class memberships can also be used to map soil properties, which is then equivalent to weighted averaging. We can derive the memberships using the fuzzy k-means to ensure that they sum up to 1:

> eberg_sm <- spfkm(formulaString, eberg.xy, eberg_spc@predicted)

```
Trying to estimate the class centres using the 'multinom' method...
Fitting a multinomial logistic regression model...
...
```

The regression model changes to e.g.:

> glm.formulaString2 = as.formula(paste("SNDMHT_A ~ ", paste(names(eberg_sm@mu), collapse="+"), "-1"))
> glm.formulaString2

SNDMHT_A ~ A + B + D + G + Hw + K + L + Q + R + S + Z - 1

> SNDMHT.m2 <- fit.gstatModel(observations=eberg.xy, glm.formulaString2, covariates=eberg_sm@mu)
> summary(SNDMHT.m2@regModel)

```
Call:
glm(formula = SNDMHT_A ~ B + D + G + Hw + K + L + Q + R + S + Z - 1, family = family, data = rmatrix)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-51.603   -8.522   -1.022    8.397   55.466

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
B    58.103      1.492  38.955  < 2e-16 ***
D    21.022      1.961  10.717  < 2e-16 ***
G    13.500      7.090   1.904  0.05742 .
Hw   15.879      5.225   3.039  0.00248 **
K    20.159      3.965   5.084 5.06e-07 ***
L    24.465      1.720  14.224  < 2e-16 ***
Q    32.552      2.535  12.839  < 2e-16 ***
R    17.559      5.991   2.931  0.00352 **
S    32.760      1.628  20.124  < 2e-16 ***
Z    17.072      1.988   8.589  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 251.3532)

    Null deviance: 839197  on 567  degrees of freedom
Residual deviance: 140004  on 557  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 4754.7

Number of Fisher Scoring iterations: 2
```

Note that the intercept needs to be taken out, so that the best predictor of the sand content for a given soil type is basically the mean sand content for that soil type (for example, class `B` is expected to have an average sand content of about 58%). If you compare this model with the model fitted in the previous section (`SNDMHT.m`), you can see that fitting the data using a 3D model results in a slightly better fit. Nevertheless, soil classes are significant estimators of sand content in this case study (especially classes `B`, `D`, `L`, `Q`, `S` and `Z`).
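The weighted-averaging interpretation can be made explicit with a small base-R sketch: a prediction per pixel is the membership-weighted average of the per-class means. The memberships and the restriction to three classes below are hypothetical; the class means echo the regression coefficients shown above:

```r
# Hypothetical memberships for 2 pixels and 3 classes (rows sum to 1):
mu <- matrix(c(0.7, 0.2, 0.1,
               0.1, 0.8, 0.1), nrow = 2, byrow = TRUE,
             dimnames = list(NULL, c("B", "D", "L")))

# Mean sand content per class (cf. the regression coefficients above):
class.mean <- c(B = 58.1, D = 21.0, L = 24.5)

# Predicted sand content = membership-weighted average of class means:
SNDMHT.pred <- as.vector(mu %*% class.mean)
```

A pixel dominated by class `B` is thus pulled towards 58% sand, while a pixel dominated by class `D` is pulled towards 21%.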

## Predicting with multi-scale data

In the next examples we demonstrate how to produce predictions using multi-scale and/or multi-source data. By multi-scale data we mean covariates available at two or more (distinctly different) resolutions that cover the same area of interest (see also: RasterStack in the raster package). In other words, we fit models and generate predictions with covariates at different resolutions but with the same extent. Multi-source data, on the other hand, can be of any scale, and of variable extent and accuracy. As a general strategy, for multi-scale data we propose fitting a single model to the combined covariates, while for multi-source data we propose using data assimilation methods (merging of predictions).

### Regression-kriging using multiscale data

The Ebergötzen case study contains covariates at 100 m and 25 m resolution. Both lists of covariates have the same extent, which allows us to fit a model using the same fit.gstatModel function as above:

> eberg_grids <- list(eberg_grid, eberg_grid25)
> unlist(sapply(eberg_grids, names))

[1] "PRMGEO6" "DEMSRT6" "TWISRT6" "TIRAST6" "LNCCOR6" "DEMTOPx" "HBTSOLx" "TWITOPx" "NVILANx"

> glm.formulaString3 = observedValue ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6+LNCCOR6+TWITOPx+NVILANx+ns(altitude, df=4)
> SNDMHT.m3 <- fit.gstatModel(observations=eberg.geo, glm.formulaString3, covariates=eberg_grids, methodid="SNDMHT.t")

In this case, the fit.gstatModel function recognizes that the input is a list of grids, and hence overlays the points with each grid in a loop. Note, however, that model fitting can be time consuming, as the function needs to overlay points and grids and then merge the resulting outputs.

The model above (`SNDMHT.m3`) would be more difficult to use for prediction because some land cover classes or soil mapping units might not contain any points. In that case the model will not be able to predict values at new locations, and R will report an error, e.g. `"error in model... factor 'LNCCOR6' has new level(s)"`.

Thus, a more robust way to generate predictions with multiscale covariates is to, again, convert the original predictors to principal components that are all continuous and independent. We first need to downscale all predictors from 100 m to 25 m resolution using the gdalwarp function:

> eberg_grid25@data <- cbind(eberg_grid25@data, gdalwarp(eberg_grid, pixsize=eberg_grid25@grid@cellsize[1], + GridTopology=eberg_grid25@bbox, resampling_method="cubicspline")@data)

Resampling 5 layers to "+init=epsg:31467" with grid cell size of 25 ...

This allows us to derive soil predictive components via the spc function:

> formulaString2 <- ~ TWITOPx+NVILANx+PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
> eberg_spc25 <- spc(eberg_grid25, formulaString2)

which now allows us to model the spatial distribution of sand content as a function of 14 components, at a four times finer resolution:

> glm.formulaString3 = as.formula(paste("logits(SNDMHT) ~ ", paste(names(eberg_spc25@predicted), collapse="+"), "+ ns(altitude, df=4)"))
> glm.formulaString3

logits(SNDMHT) ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + PC12 + PC13 + PC14 + ns(altitude, df = 4)

> SNDMHT.m3 <- fit.gstatModel(observations=eberg.geo, glm.formulaString3, + covariates=eberg_spc25@predicted, methodid="SNDMHT.t")

Note that the resulting model (`glm.formulaString3`) is only slightly better than the model fitted using the 100 m resolution data. Judging by model significance alone, there is not much to gain from using 25 m resolution covariates for mapping sand content in this study area. The difference is, however, that the predictions based on the 25 m covariates show four times more detail than those based on the 100 m grids (see the small topographic features in the figure below), so in the end there is still a significant gain in using the higher resolution data.

It is worth mentioning that the preparation of prediction locations for multi-scale data is slightly more complicated, as the covariates have to be resampled from 100 m to 25 m resolution. This operation can be time consuming and can lead to artifacts. GSIF, by default, uses FWTools (the gdalwarp function) for resampling and reprojection of grids. FWTools is highly suitable for processing large data and offers a range of downscaling algorithms. For example, via FWTools you can use the resampling method cubicspline (Keys, 1981), which is effective for downscaling relatively smooth grids such as a DEM or DEM parameters. The drawback is that R needs to read and write spatial grids to exchange them with FWTools, so processing large grids this way can still take a significant amount of time and RAM.
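To see the resolution change in isolation, the 100 m to 25 m step splits every coarse cell into a 4 x 4 block of fine cells. The base-R sketch below illustrates this with nearest-neighbour disaggregation of a tiny hypothetical grid (gdalwarp's cubicspline would instead fit a smooth surface through the coarse values):

```r
# Hypothetical 2 x 2 grid of 100 m cells:
coarse <- matrix(1:4, nrow = 2)

# Nearest-neighbour disaggregation by a factor of 4 (100 m -> 25 m):
# each coarse cell is repeated over a 4 x 4 block of fine cells.
fine <- kronecker(coarse, matrix(1, 4, 4))
```

The blocky output of nearest-neighbour resampling is exactly the kind of artifact a spline-based method avoids, at the cost of extra computation.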

### Merging multi-source data

Imagine if the covariates for the Ebergötzen case study did not have the same extent, so that by overlaying the points we would lose significant parts of the calibration data where the covariates do not match. To avoid this, we can instead produce a separate model for each extent, then use some type of data assimilation method to merge the predictions produced by the various models. A sensible approach to merging multiple predictions is to use the per-pixel accuracy to assign the weights, so that more accurate predictions receive more weight (Heuvelink and Bierkens, 1992).

In practice, merging multi-source predictions takes only a few more steps than the multi-scale case. First, we fit models for the two scales independently:

> formulaString.l <- list(~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6, ~ DEMTOPx+TWITOPx+NVILANx)
> eberg_grids_spc <- spc(eberg_grids, formulaString.l)

```
Converting PRMGEO6 to indicators...
Converting covariates to principal components...
Converting covariates to principal components...
```

> glm.formulaString.l <- lapply(eberg_grids_spc,
+    FUN=function(x){as.formula(paste("logits(SNDMHT.t) ~ ",
+    paste(names(x@predicted), collapse="+"), "+ ns(altitude, df=4)"))})
> glm.formulaString.l

```
[[1]]
logits(SNDMHT.t) ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + ns(altitude, df = 4)
<environment: 0x000000002a46cc58>

[[2]]
logits(SNDMHT.t) ~ PC1 + PC2 + PC3 + ns(altitude, df = 4)
<environment: 0x000000002ac47918>
```

> SNDMHT.ml <- fit.gstatModel(observations=eberg.geo,
+      glm.formulaString.l, lapply(eberg_grids_spc, slot, "predicted"),
+      methodid="SNDMHT.t", rvgm=NULL)

Second, we can produce predictions using a list of covariates at different scales:

> new3D.ml <- sapply(eberg_grids_spc, FUN=function(x){sp3D(x@predicted, stdepths=-0.025)})
> sd.ml <- predict(SNDMHT.ml, predictionLocations=new3D.ml, nfold=2, verbose=TRUE, mask.extra=FALSE)

```
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
Running 2-fold cross validation...
  |==============================================================================================| 100%
Generating predictions using the trend model (RK method)...
[using ordinary kriging]
Running 2-fold cross validation...
  |==============================================================================================| 100%
```

In the third step, we can combine the predictions using weighted averaging. By default, the merge function uses the cross-validation results to scale the prediction variances, which are then used as weights:

> sd.mlc <- merge(sd.ml[[1]], sd.ml[[2]], silent=FALSE)

```
Resampling 2 layers to " +init=epsg:31467 +p " with grid cell size: 25 ...
  |==============================================================================================| 100%
Cross-validation RMSE (type = link):
observedValue_1 observedValue_2
         0.5713          0.6340
```

In this case the predictions based on the 100 m resolution data will receive slightly higher weights.
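The weighting step can be sketched in a few lines of base R. This is only an illustration, assuming weights proportional to the inverse squared cross-validation RMSE (the two per-pixel predictions below are hypothetical, while the RMSE values are those reported above):

```r
rmse <- c(0.5713, 0.6340)  # cross-validation RMSE of the 100 m and 25 m models
pred <- c(31.5, 28.0)      # hypothetical predictions at one pixel (% sand)

w <- 1 / rmse^2            # more accurate model gets the larger weight
merged <- sum(w * pred) / sum(w)
```

Because the first model has the lower RMSE, the merged value ends up closer to its prediction.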

The results presented in the figure above confirm that this method is equally valid and leads to almost identical predictions as when all covariates are used together (compare with the previous section). The final map again mainly reflects the patterns of the coarser predictors, as these are more dominant. Assuming that the two predictions use completely independent covariates, the output should indeed be very similar to a method that combines the two models. In fact, in regression-kriging too, predictions based on covariates are summed with kriging predictions (of the residuals). In practice, the covariates might be correlated and the models could overlap, so that the output of the merge function might be biased. Merging multi-source data is, in fact, highly sensitive to the type of models used to generate the predictions and their associated assumptions (for a discussion see e.g. Gotway and Young, 2002; Gotway and Young, 2007). Nevertheless, this approach is attractive for global soil mapping applications where predictions can be produced for areas of various extents, or using completely independent methods that cannot be combined mathematically. For example, by following this approach, one could build models and produce predictions of soil properties for the whole world at a coarse resolution of e.g. 1 km, then use these predictions to fill in the gaps in a regional model at a resolution of 250 m or 100 m. It is important, however, to be aware that the covariates, even if they are at different scales, may be cross-correlated, and hence this approach might lead to biased predictions (over- or under-estimation).

Finally, the propagated prediction uncertainty of merged predictions is more difficult to estimate, as one cannot simply take the mean of the prediction (kriging) variances. Instead, geostatistical simulations can be used to produce multiple realizations (see the figure above). The propagated prediction error can then be derived by aggregating the multiple realizations per pixel (see e.g. the raster package).
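The per-pixel aggregation of realizations can be sketched in base R without any raster objects. The matrix of simulated values below is hypothetical (5 pixels by 20 realizations); on real output one would apply the same summary per pixel across the layers of the RasterBrick:

```r
# Hypothetical simulations: 5 pixels (rows) x 20 realizations (columns)
set.seed(1)
sims <- matrix(rnorm(5 * 20, mean = 30, sd = 4), nrow = 5)

pixel.mean <- apply(sims, 1, mean)  # per-pixel expected value
pixel.sd   <- apply(sims, 1, sd)    # per-pixel propagated prediction error
```

The per-pixel standard deviation of the realizations is a direct estimate of the propagated uncertainty, and the same idea extends to quantiles or prediction intervals.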

## References

- Bivand, R.S., Pebesma, E.J., Gómez-Rubio, V. (2008) **Applied Spatial Data Analysis with R**. Springer, 378 p.
- Böhner, J., McCloy, K.R., Strobl, J. (Eds) (2006) **SAGA — Analysis and Modelling Applications**. Göttinger Geographische Abhandlungen, Heft 115. Verlag Erich Goltze GmbH, Göttingen, 117 p.
- Diggle, P.J., Ribeiro Jr, P.J. (2006) **Model-based Geostatistics**. Springer Series in Statistics, 230 p.
- Gotway, C.A., Young, L.J. (2002) Combining Incompatible Spatial Data. Journal of the American Statistical Association, 97 (458): 632-648.
- Gotway, C.A., Young, L.J. (2007) A Geostatistical Approach to Linking Geographically Aggregated Data From Different Sources. Journal of Computational and Graphical Statistics, 16 (1): 115-135.
- Hengl, T., Nikolic, M., MacMillan, R.A. (2013) Mapping efficiency and information content. International Journal of Applied Earth Observation and Geoinformation, special issue Spatial Statistics Conference, 22: 127-138.
- Heuvelink, G.B.M., Bierkens, M.F.P. (1992) Combining soil maps with interpolations from point observations to predict quantitative soil properties. Geoderma, 55 (1-2): 1-15.
- Hijmans, R.J., Elith, J. (2012) Species distribution modeling with R. CRAN, vignette for the dismo package, 72 p.
- Keys, R. (1981) Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29 (6): 1153-1160.
- Pebesma, E. (2003) **Gstat user's manual**. Dept. of Physical Geography, Utrecht University, 100 p., www.gstat.org
- Skaggs, T.H., Arya, L.M., Shouse, P.J., Mohanty, B.P. (2001) Estimating Particle-Size Distribution from Limited Soil Texture Data. Soil Science Society of America Journal, 65 (4): 1038-1044.
- Venables, W.N., Ripley, B.D. (2002) **Modern Applied Statistics with S**. 4th edition. Springer.