# Machine Learning Algorithms for soil mapping

*Edited by*: T. Hengl- Part of:
**ISRIC Spring School**

This tutorial looks at some common Machine learning algorithms (MLA's) that are potentially of interest to soil mapping projects i.e. for generating spatial prediction. We put especial focus on using the tree-based algorithms such as random forest and/or gradient boosting. For a more in-depth overview of machine learning algorithms used in statistics refer to the CRAN Task View on Machine Learning & Statistical Learning. Some other examples of how MLA's can be used to fit Pedo-Transfer-Functions can be found here.

The code used in this tutorial can be downloaded from here.

## Loading the packages and data

We start by loading all required packages:

> library(plotKML)

plotKML version 0.5-5 (2015-12-24) URL: http://plotkml.r-forge.r-project.org/

> library(sp) > library(randomForest)

randomForest 4.6-12 Type rfNews() to see new features/changes/bug fixes.

> library(nnet) > library(e1071) > library(GSIF) > library(plyr) > library(raster) > library(caret) > library(Cubist) > library(GSIF) > library(xgboost)

Next, we load the (Ebergotzen) data set which consists of soil augers and stack of rasters containing all covariates:

> data(eberg) > data(eberg_grid) > coordinates(eberg) <- ~X+Y > proj4string(eberg) <- CRS("+init=epsg:31467") > gridded(eberg_grid) <- ~x+y > proj4string(eberg_grid) <- CRS("+init=epsg:31467")

The covariates can be further converted to principal components:

> eberg_spc <- spc(eberg_grid, ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6)

Converting PRMGEO6 to indicators... Converting covariates to principal components...

> eberg_grid@data <- cbind(eberg_grid@data, eberg_spc@predicted@data)

All further analysis is run using the regression matrix (produced using overlay of points and grids), which contains values of the target variable and all covariates for all training points:

> ov <- over(eberg, eberg_grid) > m <- cbind(ov, eberg@data) > dim(m)

[1] 3670 44

In this case the regression matrix consists of 3670 observations and has 44 columns.

## Spatial prediction of soil classes using MLA's

In the first example we focus on mapping soil types using the auger data. First, we need to filter out some classes that do not come frequently enough to allow for statistical modelling. As a rule of thumb, a class should have at least 5 observations:

> xg = summary(m$TAXGRSC, maxsum=length(levels(m$TAXGRSC))) > str(xg)

Named int [1:13] 790 704 487 376 252 215 186 86 71 23 ... - attr(*, "names")= chr [1:13] "Braunerde" "Parabraunerde" "Pseudogley" "Regosol" ...

> selg.levs = attr(xg, "names")[xg > 5] > m$soiltype <- m$TAXGRSC > m$soiltype[which(!m$TAXGRSC %in% selg.levs)] <- NA > m$soiltype <- droplevels(m$soiltype) > str(summary(m$soiltype, maxsum=length(levels(m$soiltype))))

Named int [1:11] 790 704 487 376 252 215 186 86 71 43 ... - attr(*, "names")= chr [1:11] "Braunerde" "Parabraunerde" "Pseudogley" "Regosol" ...

In this case 2 classes had less than 5 observations and have been removed from further analysis. We should also remove all points that contain missing values for any combination of covariates and target variable:

> m <- m[complete.cases(m[,1:(ncol(eberg_grid)+2)]),] > m$soiltype <- as.factor(m$soiltype)

We can now test fitting a MLA i.e. a random forest model using four covariate layers (parent material map, elevation, TWI and Aster thermal band):

> s <- sample.int(nrow(m), 500) > TAXGRSC.rf <- randomForest(x=m[-s,paste0("PC",1:10)], y=m$soiltype[-s], xtest=m[s,paste0("PC",1:10)], ytest=m$soiltype[s]) > TAXGRSC.rf$test$confusion[,"class.error"] <code rd> Auenboden Braunerde Gley 0.9000000 0.4817518 0.7692308 Kolluvisol Parabraunerde Pararendzina 0.6923077 0.5454545 0.6818182 Pelosol Pseudogley Ranker 0.5882353 0.6842105 1.0000000 Regosol Rendzina 0.6833333 0.6666667

By specifying `xtest`

and `ytest`

we run both model fitting and cross-validation with 500 excluded points. The results show relatively high prediction error of about 60% i.e. relative classification accuracy of about 40%.

We can also test some other MLA's that are suited for this data — multinom from the nnet package, and svm (Support Vector Machine) from the e1071 package:

> TAXGRSC.rf <- randomForest(x=m[,paste0("PC",1:10)], y=m$soiltype) > fm = as.formula(paste("soiltype~", paste(paste0("PC",1:10), collapse="+"))) > fm soiltype ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 > TAXGRSC.mn <- multinom(fm, m)

# weights: 132 (110 variable) initial value 6119.428736 iter 10 value 4161.338634 iter 20 value 4118.296050 iter 30 value 4054.454486 iter 40 value 4020.653949 iter 50 value 3995.113270 iter 60 value 3980.172669 iter 70 value 3975.188371 iter 80 value 3973.743572 iter 90 value 3973.073564 iter 100 value 3973.064186 final value 3973.064186 stopped after 100 iterations

> TAXGRSC.svm <- svm(fm, m, probability=TRUE, cross=5) > TAXGRSC.svm$tot.accuracy

[1] 40.12539

This shows about the same accuracy levels as for random forest. Because all three methods produce comparable accuracy, we can also merge predictions by calculating a simple average (an alternative would be to use the caretEnsemble package to optimize merging of multiple models):

> probs1 <- predict(TAXGRSC.mn, eberg_grid@data, type="probs", na.action = na.pass) > probs2 <- predict(TAXGRSC.rf, eberg_grid@data, type="prob", na.action = na.pass) > probs3 <- attr(predict(TAXGRSC.svm, eberg_grid@data, probability=TRUE, na.action = na.pass), "probabilities")

Derive an average:

> leg <- levels(m$soiltype) > lt <- list(probs1[,leg], probs2[,leg], probs3[,leg]) > probs <- Reduce("+", lt) / length(lt) > eberg_soiltype = eberg_grid > eberg_soiltype@data <- data.frame(probs)

Check that all predictions sum up to 100%:

> ch <- rowSums(eberg_soiltype@data) > summary(ch)

Min. 1st Qu. Median Mean 3rd Qu. 1 1 1 1 1 Max. 1

To plot the result we can use the raster package:

> plot(raster::stack(eberg_soiltype), col=SAGA_pal[[1]], zlim=c(0,1))

By using the produced predictions we can further derive Confusion Index (to map thematic uncertainty) and see if some classes could be aggregated. We can also generate factor-type map by selecting for each pixel class which is most probable, by using e.g.:

> eberg_soiltype$cl <- apply(eberg_soiltype@data,1,which.max)

## Modelling numeric soil properties using h2o

Random forest is suited for both classification and regression problems (it is one of the most popular MLA's for soil mapping), hence we can use it also for modelling numeric soil properties i.e. to fit models and generate predictions. However, because randomForest package in R is not suited for large data sets, we can also use some parallelized version of random forest (or more scalable) i.e. the one implemented in the h2o package (Richter et al. 2015). h2o is a Java-based implementation hence installing the package requires Java libraries (size of package is about: 51MB) and all computing is in principle run outside of R i.e. within the JVM (Java Virtual Machine).

In the following example we look at mapping sand content for top horizons. To initiate h2o we run:

> library(h2o) > localH2O = h2o.init()

java version "1.7.0_79" Java(TM) SE Runtime Environment (build 1.7.0_79-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode) Starting H2O JVM and connecting: .... Connection successful! R is connected to the H2O cluster: H2O cluster uptime: 9 seconds 596 milliseconds H2O cluster version: 3.8.1.3 H2O cluster total nodes: 1 H2O cluster total memory: 1.71 GB H2O cluster total cores: 4 H2O cluster allowed cores: 2 H2O cluster healthy: TRUE\ H2O Connection ip: localhost H2O Connection port: 54321 H2O Connection proxy: NA\ R Version: R version 3.2.4 (2016-03-16) Note: As started, H2O is limited to the CRAN default of 2 CPUs. Shut down and restart H2O as shown below to use all your CPUs. > h2o.shutdown() > h2o.init(nthreads = -1)

This shows that only 2 cores will be used for computing (to control the number of cores you can use the `nthreads`

argument). Next, we need to prepare regression matrix and prediction locations using the `as.h2o`

function so that they are visible to h2o:

> eberg.hex = as.h2o(m, destination_frame = "eberg.hex")

` |=====================================| 100%`

> eberg.grid = as.h2o(eberg_grid@data, destination_frame = "eberg.grid")

` |=====================================| 100%`

We can now fit a random forest model by using all computing power we have:

> RF.m = h2o.randomForest(y = which(names(m)=="SNDMHT_A"), x = which(names(m) %in% paste0("PC",1:10)), training_frame = eberg.hex, ntree = 50)

` |=====================================| 100%`

> RF.m

Model Details: ============== H2ORegressionModel: drf Model ID: DRF_model_R_1462738948867_1 Model Summary: number_of_trees model_size_in_bytes 1 50 594057 min_depth max_depth mean_depth 1 20 20 20.00000 min_leaves max_leaves mean_leaves 1 972 1077 1023.34000 H2ORegressionMetrics: drf ** Reported on training data. ** Description: Metrics reported on Out-Of-Bag training samples MSE: 223.0514 R2 : 0.5087279 Mean Residual Deviance : 223.0514

This shows that the model fitting R-square is about 50%. This is also visible from the predicted vs observed plot:

> plot(m$SNDMHT_A, as.data.frame(h2o.predict(RF.m, eberg.hex, na.action=na.pass))$predict, asp=1, xlab="measured", ylab="predicted")

` |=====================================| 100%`

To produce a map based on predictions we use:

> eberg_grid$RFx <- as.data.frame(h2o.predict(RF.m, eberg.grid, na.action=na.pass))$predict

` |=====================================| 100%`

> plot(raster(eberg_grid["RFx"]), col=SAGA_pal[[1]], zlim=c(10,90)) > points(eberg, pch="+")

h2o has another MLA of interest to soil mapping called “deep learning” (a feed-forward multilayer artificial neural network). Fitting the model is equivalent to using random forest:

> DL.m <- h2o.deeplearning(y = which(names(m)=="SNDMHT_A"), x = which(names(m) %in% paste0("PC",1:10)), training_frame = eberg.hex)

` |=====================================| 100%`

> DL.m

Model Details: ============== H2ORegressionModel: deeplearning Model ID: DeepLearning_model_R_1462738948867_2 Status of Neuron Layers: predicting SNDMHT_A, regression, gaussian distribution, Quadratic loss, 42.601 weights/biases, 508,0 KB, 25.520 training samples, mini-batch size 1 layer units type dropout l1 1 1 10 Input 0.00 % 2 2 200 Rectifier 0.00 % 0.000000 3 3 200 Rectifier 0.00 % 0.000000 4 4 1 Linear 0.000000 l2 mean_rate rate_RMS momentum 1 2 0.000000 0.025641 0.017642 0.000000 3 0.000000 0.227769 0.253055 0.000000 4 0.000000 0.002558 0.001907 0.000000 mean_weight weight_RMS mean_bias 1 2 0.005033 0.105783 0.355502 3 -0.017964 0.071243 0.954336 4 0.000362 0.053619 0.115043 bias_RMS 1 2 0.061916 3 0.020658 4 0.000000 H2ORegressionMetrics: deeplearning ** Reported on training data. ** Description: Metrics reported on full training frame MSE: 225.3349 R2 : 0.5036984 Mean Residual Deviance : 225.3349

Which shows a performance comparable to random forest model. The output predictions (map) does look somewhat different from the random forest predictions:

> eberg_grid$DLx <- as.data.frame(h2o.predict(DL.m, eberg.grid, na.action=na.pass))$predict

` |=====================================| 100%`

> plot(raster(eberg_grid["DLx"]), col=SAGA_pal[[1]], zlim=c(10,90)) > points(eberg, pch="+")

Which of the two methods should we use? Since they both have comparable performance, the most logical option is to generate ensemble (merged) predictions i.e. to produce a map that shows patterns between the two methods (note: many sophisticated MLA such as random forest, neural nets, SVM and similar will often produce comparable results i.e. they are often equally applicable and there is no clear *winner*). We can use weighted average i.e. R-square as a simple approach to produce merged predictions:

> rf.R2 = RF.m@model$training_metrics@metrics$r2 > dl.R2 = DL.m@model$training_metrics@metrics$r2 > eberg_grid$SNDMHT_A <- rowSums(cbind(eberg_grid$RFx*rf.R2, eberg_grid$DLx*dl.R2), na.rm=TRUE)/(rf.R2+dl.R2) > plot(raster(eberg_grid["SNDMHT_A"]), col=SAGA_pal[[1]], zlim=c(10,90)) > points(eberg, pch="+")

Indeed, the output map now shows patterns of both methods and is more likely slightly more accurate than any of the individual MLA's (Krogh, 1996).

## Spatial prediction of 3D (numeric) variables

In the last exercise we look at another two ML-based packages that are also of interest for soil mapping projects — cubist (Kuhn et al, 2012) and xgboost (Chen and Guestrin, 2016). To fine-tune some of the models we will also use the caret package, which is highly recommended for optimizing model fitting and cross-validation.

We look at another soil data set from Australia called "Edgeroi". We can load the profile data and covariates by using:

> data(edgeroi) > edgeroi.spc <- join(edgeroi$sites, edgeroi$horizons, type='inner')

Joining by: SOURCEID

> coordinates(edgeroi.spc) <- ~ LONGDA94 + LATGDA94 > proj4string(edgeroi.spc) <- CRS("+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs") > edgeroi.spc <- spTransform(edgeroi.spc, CRS("+init=epsg:28355")) > con <- url("http://gsif.isric.org/lib/exe/fetch.php?media=edgeroi.grids.rda") > load(con) > gridded(edgeroi.grids) <- ~x+y > proj4string(edgeroi.grids) <- CRS("+init=epsg:28355")

We are interested in modelling soil organic carbon content in g/kg for different depths. We again start by producing the regression matrix:

> ov2 <- over(edgeroi.spc, edgeroi.grids) > m2 <- cbind(ov2, edgeroi.spc@data)

Because we will run 3D modelling, we also need to add depth of horizon, here set at the center of horizon:

> m2$DEPTH <- m2$UHDICM + (m2$LHDICM - m2$UHDICM)/2

The model of interest is:

> formulaStringP2 = ORCDRC ~ DEMSRT5+TWISRT5+PMTGEO5+EV1MOD5+EV2MOD5+EV3MOD5+DEPTH > mP2 <- m2[complete.cases(m2[,all.vars(formulaStringP2)]),]

Note that `DEPTH`

is used as a covariate, which makes this model 3D as one can predict anywhere in 3D space. To improve random forest modelling, we use the caret package that tries to pick up also the optimal mtry parameter based on the cross-validation performance:

> ctrl <- trainControl(method="repeatedcv", number=5, repeats=1) > sel <- sample.int(nrow(mP2), 500) > tr.ORCDRC.rf <- train(formulaStringP2, data=mP2[sel,], method = "rf", trControl = ctrl, tuneLength = 3) > tr.ORCDRC.rf

Random Forest 500 samples 18 predictor No pre-processing Resampling: Cross-Validated (5 fold, repeated 1 times) Summary of sample sizes: 399, 401, 398, 401, 401 Resampling results across tuning parameters: mtry RMSE Rsquared RMSE SD 2 4.500389 0.5331524 0.8175690 7 4.171750 0.5497456 0.8331998 12 4.209349 0.5439468 0.8250320 Rsquared SD 0.10781622 0.09216059 0.09272870 RMSE was used to select the optimal model using the smallest value. The final value used for the model was mtry = 7.

In this case mtry = 7 seems to achieve best performance. Note that we sub-set the initial matrix to speed up fine-tuning of the parameters (otherwise the computing time could easily blow up). Next, we can fit the final model using:

> ORCDRC.rf <- train(formulaStringP2, data=mP2, method = "rf", tuneGrid=data.frame(mtry=7), trControl=trainControl(method="none"))

Variable importance plot shows that DEPTH is definitively the most important predictor:

> varImpPlot(ORCDRC.rf$finalModel)

We can also try fitting models using the xgboost package and the cubist package. On the end of the statistical modelling process, we can merge the predictions by using the R-square estimated for each technique:

> edgeroi.grids$DEPTH = 2.5 > edgeroi.grids$Random_forest <- predict(ORCDRC.rf, edgeroi.grids@data, na.action = na.pass) > edgeroi.grids$Cubist <- predict(ORCDRC.cb, edgeroi.grids@data, na.action = na.pass) > edgeroi.grids$XGBoost <- predict(ORCDRC.gb, edgeroi.grids@data, na.action = na.pass) > edgeroi.grids$ORCDRC_5cm <- (edgeroi.grids$Random_forest*w1+edgeroi.grids$Cubist*w2+edgeroi.grids$XGBoost*w3)/(w1+w2+w3) > plot(stack(edgeroi.grids[c("Random_forest","Cubist","XGBoost","ORCDRC_5cm")]), col=SAGA_pal[[1]], zlim=c(5,65))

The final plot shows that xgboost possibly overpredicts and that cubist possibly underpredicts values, while random forest is somewhere in-between the two. Again, merged predictions are probably the safest option considering that all three MLA's have a similar performance.

We can quickly test the overall performance using a script on github prepared for testing performance of merged predictions:

> source_https <- function(url, ...) { + require(RCurl) + cat(getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")), file = basename(url)) + source(basename(url)) + } > source_https("https://raw.githubusercontent.com/ISRICWorldSoil/SoilGrids250m/master/grids/cv/cv_functions.R")

We can hence run 5-fold cross validation:

> test.ORC <- cv_numeric(formulaStringP2, rmatrix=mP2, nfold=5, idcol="SOURCEID", Log=TRUE)

Running 5-fold cross validation with model re-fitting method ranger ... snowfall 1.84-6.1 initialized (using snow 0.4-1): parallel execution on 2 CPUs.

> str(test.ORC)

List of 2 $ CV_residuals:'data.frame': 2076 obs. of 4 variables: ..$ Observed : num [1:2076] 10.75 4.8 7.95 1.6 0.4 ... ..$ Predicted: num [1:2076] 12.21 5.06 9.55 1.36 1.09 ... ..$ SOURCEID : Factor w/ 359 levels "199_CAN_CP111_1",..: 16 16 17 17 17 18 18 19 20 20 ... ..$ fold : int [1:2076] 1 1 1 1 1 1 1 1 1 1 ... $ Summary :'data.frame': 1 obs. of 6 variables: ..$ ME : num 0.0407 ..$ MAE : num 2.2 ..$ RMSE : num 4.09 ..$ R.squared : num 0.651 ..$ logRMSE : num 0.413 ..$ logR.squared: num 0.777

Which shows that the R-squared based on cross-validation is about 65% i.e. the average error of predicting soil organic carbon content using ensemble method is about +/-4 g/kg. The final observed-vs-predict plot shows that the model is unbiased and that the predictions generally match cross-validation points:

> plt0 <- xyplot(test.ORC[[1]]$Predicted~test.ORC[[1]]$Observed, asp=1, par.settings=list(plot.symbol = list(col=alpha("black", 0.6), fill=alpha("red", 0.6), pch=21, cex=0.9)), scales=list(x=list(log=TRUE, equispaced.log=FALSE), y=list(log=TRUE, equispaced.log=FALSE)), xlab="measured", ylab="predicted (machine learning)") > plt0

## Summary points

In summary, MLA's are fairly attractive for soil mapping and soil modelling problems in general as they often perform better than standard linear models (already recognized by Moran and Bui in 2002). This is possible due to the following three reasons:

- Non-linear relationships between soil forming factors and soil properties can be better fitted using MLA's,
- Tree-based models are more efficient in representing
*local*relationships, which is often the case with soil data, - In the case of MLA, data problems such as multicolinearity and non-Gaussian distribution are secondary, which simplifies preparation of the models,

On the other hand MLA's can be computationally very intensive and hence require careful planning, especially as the number of points goes beyond few thousand and number of covariates beyond a dozen. Note also that svm is computationally very intensive and is probably not suited for large data sets.

## References

- Chen, T. and Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System. arXiv preprint arXiv:1603.02754.
- Krogh, P.S.A., 1996. Learning with ensembles: How over-fitting can be useful. In Proceedings of the 1995 Conference (Vol. 8, p. 190).
- Kuhn, M., Weston, S., Keefer, C. and Coulter, N., 2012. Cubist Models For Regression. R package Vignette R package, 18 pp.
- Kuhn, M., 2008. Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5).
- Moran, C.J. and Bui, E.N., 2002. Spatial data mining for enhanced soil map modelling. International Journal of Geographical Information Science, 16(6), pp.533-549.
- Richter, A.N., Khoshgoftaar, T.M., Landset, S. and Hasanin, T., 2015, August. A Multi-Dimensional Comparison of Toolkits for Machine Learning with Big Data. In Information Reuse and Integration (IRI), 2015 IEEE International Conference on (pp. 1-8). IEEE.
- Wright, M.N. and Ziegler, A., 2015. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. arXiv preprint arXiv:1508.04409.