Preparation of soil covariates

Before we are able to fit spatial prediction models and generate soil maps, a significant amount of time is often spent on preparing covariate 'layers' that can be used as independent variables ('predictors') in the statistical modelling. Typical operations used to generate soil covariate layers include:

  • Converting polygon maps to rasters,
  • Downscaling or upscaling (aggregating) rasters to match the target resolution (i.e. preparing a stack),
  • Filtering out missing pixels / reducing noise and multicolinearity (data overlap) problems,
  • Overlaying and subsetting raster stacks and points,

The following examples should give you some ideas about how to program those steps using the shortest possible syntax, running the fastest and most robust algorithms. Raster data can often be very large (millions of pixels) so processing large stacks of remote sensing scenes in R needs to be planned carefully. The complete R tutorial you can download from github. Instructions on how to install and set-up all software used in this example you can find here.

Before we can attach a polygon map to other stacks of covariates, this needs to be rasterized i.e. converted to a raster layer defined with its bounding box (extent) and spatial resolution. Consider for example the Ebergötzen data set polygon map from the plotKML package:

> library(plotKML)
> data(eberg_zones)
> spplot(eberg_zones[1])
spplot eberg zones.

We can convert this object to a raster by using the raster package. Note that before we can run the operation, we need to know the target grid system i.e. extent of the grid and spatial resolution. We can use this from an existing layer:

> data("eberg_grid25")
> gridded(eberg_grid25) <- ~x+y
> proj4string(eberg_grid25) <- CRS("+init=epsg:31467")
> r <- raster(eberg_grid25)
> r
class       : RasterLayer
dimensions  : 400, 400, 160000  (nrow, ncol, ncell)
resolution  : 25, 25  (x, y)
extent      : 3570000, 3580000, 5708000, 5718000  (xmin, xmax, ymin, ymax)
coord. ref. : +init=epsg:31467 +proj=tmerc +lat_0=0 +lon_0=9 +k=1 +x_0=3500000 +y_0=0 +datum=potsdam +units=m +no_defs +ellps=bessel +towgs84=598.1,73.7,418.2,0.202,0.045,-2.455,6.7
data source : in memory
names       : DEMTOPx
values      : 159, 428  (min, max)

The “eberg_grids25” object is a SpatialPixelsDataFrame, which is a spatial gridded data structure of the sp package package. The raster package also offers data structures for spatial (gridded) data, and stores such data as 'RasterLayer' class. Gridded data can be converted from class SpatialPixelsDataFrame to Raster layer with the ''raster'' command. The ''CRS'' command of the sp package can be used to set a spatial projection. EPSG projection 31467 is a German coordinate system. (Each coordinate system has an associated EPSG number that can be obtained from this website).

Conversion from polygon to raster is now possible via the ''rasterize'' command:

> names(eberg_zones)
[1] "ZONES"
> eberg_zones_r <- rasterize(eberg_zones, r, field="ZONES")
> plot(eberg_zones_r)

Converting large polygons in R using the raster package could be very time-consuming, hence often a more practical approach is to use SAGA GIS (which can handle large data and is easy to run in parallel). First, you need to export the polygon map to shapefile format that can be done with commands of the rgdal package package:

> library(rgdal)
> eberg_zones$ZONES_int <- as.integer(eberg_zones$ZONES)
> writeOGR(eberg_zones["ZONES_int"], "eberg_zones.shp", ".", "ESRI Shapefile")

The writeOGR() command writes a SpatialPolygonsDataFrame (the data structure for polygon data in R) to an ESRI shapefile. Here we only writing the attribute “ZONES_int” to the shapefile. It is, however, also possible to write all attributes of the SpatialPolygonsDataFrame to a shapefile.

Next, you can locate the (previously installed) SAGA GIS command line program:

> saga_cmd = "C:/Progra~1/SAGA-GIS/saga_cmd.exe"

and finally use the module ''grid_gridding'' to convert the shapefile to a grid:

> pix = 25
> system(paste0(saga_cmd, ' grid_gridding 0
  -INPUT \"eberg_zones.shp\"
  -FIELD \"ZONES_int\"
  -GRID \"eberg_zones.sgrd\"
  -TARGET_USER_SIZE ', pix, '
  -TARGET_USER_XMIN ', extent(r)[1]+pix/2,'
  -TARGET_USER_XMAX ', extent(r)[2]-pix/2, '
  -TARGET_USER_YMIN ', extent(r)[3]+pix/2,'
  -TARGET_USER_YMAX ', extent(r)[4]-pix/2))
   #####   ##   #####    ##
  ###     ###  ##       ###
   ###   # ## ##  #### # ##
    ### ##### ##    # #####
 ##### #   ##  ##### #   ##
SAGA Version: 2.2.7 (64 bit)
library path: C:\Progra~1\SAGA-GIS\modules\
library name: grid_gridding
library     : Gridding
tool        : Shapes to Grid
author      : O.Conrad (c) 2003
processors  : 12 [12]
Load shapes: eberg_zones.shp...
Shapes: eberg_zones
Attribute: ZONES
Output Values: attribute
Method for Multiple Values: last
Polygon: cell
Preferred Target Grid Type: Integer (1 byte)
Target Grid System: user defined
Cellsize: 25.000000
Left: 3570012.500000
Right: 3579987.500000
Bottom: 5708012.500000
Top: 5717987.500000
Fit: nodes
> eberg_zones_r2 <- readGDAL("eberg_zones.sdat")
eberg_zones.sdat has GDAL driver SAGA
and has 400 rows and 400 columns

With the system() command we can invoke an operating system (OS) command, here we use it to run the saga_cmd.exe file from R. The paste0 function is used to paste together a string that is passed to the system() command. The string starts with the OS command we would like to invoke (here saga_cmd.exe) followed by input required for the running the OS command.

Note that the bounding box (in SAGA GIS) needs to be defined using the center of the corner pixel and not the corners, hence we take half of the pixel size for extent coordinates from raster package. Also note that the class names have been lost during rasterization (we work with integers in SAGA GIS), but we can attach them back by using e.g.:

> levels(eberg_zones$ZONES)
[1] "Clay_and_loess"  "Clayey_derivats" "Sandy_material"
[4] "Silt_and_sand"
> eberg_zones_r2$ZONES <- as.factor(eberg_zones_r2$band1)
> levels(eberg_zones_r2$ZONES) = levels(eberg_zones$ZONES)
> summary(eberg_zones_r2$ZONES)
 Clay_and_loess Clayey_derivats  Sandy_material   Silt_and_sand
          28667           35992           21971           73370
Eberg zones rasterized to 25 m resolution.

In order for all covariates to perfectly stack, we also need to adjust resolution of some covariates that have either too coarse or too fine resolution than the target resolution. Process of bringing raster layers to target grid is also known as resampling. Consider the following example from the Ebergotzen case study:

> data(eberg_grid)
> gridded(eberg_grid) <- ~x+y
> proj4string(eberg_grid) <- CRS("+init=epsg:31467")
> names(eberg_grid)

In this case we have few layers that we would like to use for spatial prediction in combination with the maps produced in the previous sections, but their resolution is 100 m i.e. about 16 times coarser. Probably the most robust way to resample rasters is to use the ''gdalwarp'' function from the GDAL software. Assuming that you have already installed GDAL, you only need to locate the program on your system, and then you can again run ''gdalwarp'' via the system command:

> writeGDAL(eberg_grid["TWISRT6"], "eberg_grid_TWISRT6.tif")
> gdalwarp = "C:/Progra~1/GDAL/gdalwarp.exe"
> system(paste0(gdalwarp, ' eberg_grid_TWISRT6.tif eberg_grid_TWISRT6_25m.tif -r \"cubicspline\" -te ', paste(as.vector(extent(r))[c(1,3,2,4)], collapse=" "),' -tr ', pix, ' ', pix, ' -overwrite'))
Creating output file that is 400P x 400L.
Processing input file eberg_grid_TWISRT6.tif.
0...10...20...30...40...50...60...70...80...90...100 - done.

The writeGDAL commands writes the TWISRT6 grid, that is stored in the eberg_grid grid stack, to a TIFF file. This TIFF is subsequently read by the gdalwarp function and resampled to a 25m TIFF file using cubicspline, which will fill in values between original grid using smooth surfaces. Note that the paste0 function in the system() command pastes together the following string: “C:/Progra~1/GDAL/gdalwarp.exe eberg_grid_TWISRT6.tif eberg_grid_TWISRT6_25m.tif -r \”cubicspline\“ -te 3570000 5708000 3580000 5718000 -tr 25 25 -overwrite”.

We can compare the two maps (the original and the downscaled) next to each other by using:

> par(mfrow=c(1,2))
> zlim = range(eberg_grid$TWISRT6, na.rm=TRUE)
> image(raster(eberg_grid["TWISRT6"]), col=SAGA_pal[[1]], zlim=zlim, main="Original", asp=1)
> image(raster("eberg_grid_TWISRT6_25m.tif"), col=SAGA_pal[[1]], zlim=zlim, main="Downscaled", asp=1)
Original TWI vs downscaled map from 100 m to 25 m.

The map on the right looks much smoother of course (assuming that this variable varies continuously in space, this could very well be an accurate picture), but it is important to realize that downscaling can only be implemented up to certain target resolution i.e. only for certain features. For example, downscaling TWI from 100 to 25 m is not much of problem, but to go beyond 10 m would probably result in large differences from a TWI calculated at 10 m resolution (in other words: be careful with downscaling because it is often not trivial).

The opposite process from downscaling is upscaling or aggregation. Although this one can also potentially be tricky, it is much more straight process than downscaling. We recommend using the average method in GDAL for aggregating values e.g.:

> system(paste0(gdalwarp, ' eberg_grid_TWISRT6.tif eberg_grid_TWISRT6_250m.tif -r \"average\" -te ', paste(as.vector(extent(r))[c(1,3,2,4)], collapse=" "),' -tr 250 250 -overwrite'))
Creating output file that is 40P x 40L.
Processing input file eberg_grid_TWISRT6.tif.
0...10...20...30...40...50...60...70...80...90...100 - done.
Original TWI vs aggregated map from 100 m to 250 m.

Now that we have established connection between R and SAGA GIS, we can also use SAGA GIS to derive some standard DEM parameters of interest to soil mapping. To automate further processing, we make the following function:

saga_DEM_derivatives <- function(INPUT, MASK=NULL, sel=c("SLP","TWI","CRV","VBF","VDP","OPN","DVM")){
    ## Fill in missing DEM pixels:
    suppressWarnings( system(paste0(saga_cmd, ' grid_tools 25 -GRID=\"', INPUT, '\" -MASK=\"', MASK, '\" -CLOSED=\"', INPUT, '\"')) )
  ## Slope:
  if(any(sel %in% "SLP")){
    try( suppressWarnings( system(paste0(saga_cmd, ' ta_morphometry 0 -ELEVATION=\"', INPUT, '\" -SLOPE=\"', gsub(".sgrd", "_slope.sgrd", INPUT), '\" -C_PROF=\"', gsub(".sgrd", "_cprof.sgrd", INPUT), '\"') ) ) )
  ## TWI:
  if(any(sel %in% "TWI")){
    try( suppressWarnings( system(paste0(saga_cmd, ' ta_hydrology 15 -DEM=\"', INPUT, '\" -TWI=\"', gsub(".sgrd", "_twi.sgrd", INPUT), '\"') ) ) )
  ## MrVBF:
  if(any(sel %in% "VBF")){
    try( suppressWarnings( system(paste0(saga_cmd, ' ta_morphometry 8 -DEM=\"', INPUT, '\" -MRVBF=\"', gsub(".sgrd", "_vbf.sgrd", INPUT), '\" -T_SLOPE=10 -P_SLOPE=3') ) ) )
  ## Valley depth:
  if(any(sel %in% "VDP")){
    try( suppressWarnings( system(paste0(saga_cmd, ' ta_channels 7 -ELEVATION=\"', INPUT, '\" -VALLEY_DEPTH=\"', gsub(".sgrd", "_vdepth.sgrd", INPUT), '\"') ) ) )
  ## Openess:
  if(any(sel %in% "OPN")){
    try( suppressWarnings( system(paste0(saga_cmd, ' ta_lighting 5 -DEM=\"', INPUT, '\" -POS=\"', gsub(".sgrd", "_openp.sgrd", INPUT), '\" -NEG=\"', gsub(".sgrd", "_openn.sgrd", INPUT), '\" -METHOD=0' ) ) ) )
  ## Deviation from Mean Value:
  if(any(sel %in% "DVM")){
    suppressWarnings( system(paste0(saga_cmd, ' statistics_grid 1 -GRID=\"', INPUT, '\" -DEVMEAN=\"', gsub(".sgrd", "_devmean.sgrd", INPUT), '\" -RADIUS=11' ) ) )

To run this function we only need DEM as input:

> writeGDAL(eberg_grid["DEMSRT6"], "DEMSRT6.sdat", "SAGA")
> saga_DEM_derivatives("DEMSRT6.sgrd")
> dem.lst <- list.files(pattern=glob2rx("^DEMSRT6_*.sdat"))
> plot(stack(dem.lst), col=SAGA_pal[[1]])

This function can now be used with any DEM to derive the standard 7-8 DEM parameters such as slope and curvature, TWI and MrVBF, positive and negative openess, valley depth and deviation from mean value. You could easily add more parameters to this function and then test if some of the other DEM derivatives can help improve mapping soil properties and classes. Note that SAGA GIS will by default optimize computing of DEM derivatives by using most of available cores to compute (parallelization is turned on automatically).

Some standard DEM derivatives.

After we have brought all covariates to the same grid definition, what can still represent a problem for using covariates in spatial modelling are:

  • Missing pixels,
  • Artifacts and noise,
  • Multicolinearity (i.e. data overlap),

In a stack with tens of rasters, the weakest layer could cause serious problems for producing soil maps as the missing pixels and artifacts would propagate to predictions. Missing pixels happen due to various reasons: in the case of remote sensing, missing pixels can be due to clouds or similar; noise is often due to atmospheric conditions. Missing pixels (as long as we are dealing with few patches of missing pixels) can be efficiently filtered by using gap filling functionality available in the SAGA GIS e.g.:

> par(mfrow=c(1,2))
> image(raster(eberg_grid["test"]), col=SAGA_pal[[1]], zlim=zlim, main="Original", asp=1)
> image(raster("test.sdat"), col=SAGA_pal[[1]], zlim=zlim, main="Filtered", asp=1)

In this example we use the same input and output file for filling in gaps. There are several other gap filling possibilities in SAGA GIS including Close Gaps with Spline, Close Gaps with Stepwise Resampling and Close One Cell Gaps. Note all of these are equally applicable to all missing pixel problems, but having <10% of missing pixels is often not much of a problem for soil mapping.

Another elegant way to filter the missing pixels, reduce noise and reduce data overlap is to use Principal Components transformation of original data. This is available also via the GSIF function ''spc'':

> data(eberg_grid)
> gridded(eberg_grid) <- ~x+y
> proj4string(eberg_grid) <- CRS("+init=epsg:31467")
> formulaString <- ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
> eberg_spc <- spc(eberg_grid, formulaString)
Converting PRMGEO6 to indicators...
Converting covariates to principal components...
> names(eberg_spc@predicted) # 11 components on the end;
 [1] "PC1"  "PC2"  "PC3"  "PC4"  "PC5"  "PC6"  "PC7"  "PC8"  "PC9"  "PC10"
[11] "PC11"
> rd = range(eberg_spc@predicted@data[,1], na.rm=TRUE)
> plot(stack(eberg_spc@predicted[1:11]), zlim=rd, col=rev(rainbow(65)[1:48]))
11 PCs derived using eberg covariates.

The advantages of using the ''spc'' function are:

  • All output soil covariates are numeric (and not a mixture of factors and numeric),
  • Last 1-2 PCs often contain signal noise and could be excluded from modelling,
  • In further analysis it becomes easier to remove covariates that do not help in modelling (e.g. by using step-wise selection and similar),

A disadvantage of using SPCs (spatial predictive components) is that the components are often abstract so that interpretation of correlations can become cumbersome. Also, if one of the layers contains many factor levels, then the number of output covariates might blow up, which become impractical as we should then have at least 10 observations per covariate to avoid overfitting.

Now that we have prepared all covariates (resampled them to the same grid and filtered out all problems), we can proceed with running overlays and fitting statistical models. Assuming that we deal with large number of files, an elegant way to read all those to R is by using the raster package, especially the ''stack'' and ''raster'' commands. In the following example we can list all files of interest, and then read them all at once:

> grd.lst <- list.files(pattern="25m")
> grd.lst
[1] "DEMTOPx_25m.tif"            "eberg_grid_TWISRT6_25m.tif"
[3] "eberg_zones_25m.tif"
> grid25m <- stack(grd.lst)
> grid25m <- as(grid25m, "SpatialGridDataFrame")
> str(grid25m)
Formal class 'SpatialGridDataFrame' [package "sp"] with 4 slots
  ..@ data       :'data.frame': 160000 obs. of  3 variables:
  .. ..$ DEMTOPx_25m           : num [1:160000] 383 380 375 364 350 341 337 335 332 326 ...
  .. ..$ eberg_grid_TWISRT6_25m: num [1:160000] 10.9 10.9 11.1 11.2 11.4 ...
  .. ..$ eberg_zones_25m       : num [1:160000] 2 2 2 2 2 2 2 2 2 2 ...
  ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
  .. .. ..@ cellcentre.offset: Named num [1:2] 3570013 5708013
  .. .. .. ..- attr(*, "names")= chr [1:2] "s1" "s2"
  .. .. ..@ cellsize         : num [1:2] 25 25
  .. .. ..@ cells.dim        : int [1:2] 400 400
  ..@ bbox       : num [1:2, 1:2] 3570000 5708000 3580000 5718000
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2] "s1" "s2"
  .. .. ..$ : chr [1:2] "min" "max"
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
  .. .. ..@ projargs: chr "+proj=tmerc +lat_0=0 +lon_0=9 +k=1 +x_0=3500000 +y_0=0 +ellps=bessel +towgs84=598.1,73.7,418.2,0.202,0.045,-2.455,6.7 +units=m "| __truncated__

One could now save a the prepared covariates stored in SpatialGridDataFrame as an RDATA object for future use.

> save(grid25m, file = "covariates25m.RDATA")

To overlay rasters and points and prepare a regression matrix, we can either use the ''over'' function from the sp package, or ''extract'' function from the raster package. By using raster package, one can run overlay even without reading the rasters to memory:

> library(sp)
> data(eberg)
> coordinates(eberg) <- ~X+Y
> proj4string(eberg) <- CRS("+init=epsg:31467")
> ov =, eberg))
> str(ov[complete.cases(ov),])
'data.frame':   2778 obs. of  3 variables:
 $ DEMTOPx_25m           : num  270 357 354 328 363 358 328 340 353 311 ...
 $ eberg_grid_TWISRT6_25m: num  16.1 14.9 14.7 16 13.6 ...
 $ eberg_zones_25m       : num  2 2 2 2 2 2 2 2 2 2 ...

If the raster layers can not be stacked and if each layer is available in a different projection system, you can also create a function that reprojects points to the target raster layer projection system: = function(i, y){raster::extract(raster(i), na.rm=FALSE, spTransform(y, proj4string(raster(i))))}

which can also be run in parallel for example by using the parallel package:

ov = data.frame(mclapply(grd.lst,, y=eberg))
names(ov) = basename(grd.lst)

In similar way one could also make wrapper functions that downscale/upscale grids, then filter missing values and stack all data together so that it become available in the working memory (sp grid or pixels object). Overlay and model fitting is also implemented directly in the GSIF package, so any attempt to fit models will automatically perform overlay.

As R is often inefficient in handling large objects in memory (such as large raster images), a good strategy to run raster processing in R is to consider using for example the clusterR function from the raster package, which automatically parallelizes use of raster functions. To have full control over parallelization, you can alternatively tile large rasters using the getSpatialTiles function from the GSIF package and process them as separate objects in parallel. The following examples shows how to run a simple function in parallel on tiles and then mosaic those after all processing has been completed. Consider for example the GeoTiff from the rgdal package:

> fn = system.file("pictures/SP27GTIF.TIF", package = "rgdal")
> obj <- GDALinfo(fn)
Warning message:
statistics not supported by this driver

We can split that object in 35 tiles, each of 5 x 5 km in size by running:

> tiles <- GSIF::getSpatialTiles(obj, block.x=5000, return.SpatialPolygons = FALSE)
Generating 35 tiles...
Returning a list of tiles for an object of class GDALobj with 0 percent overlap
> tiles.pol <- GSIF::getSpatialTiles(obj, block.x=5000, return.SpatialPolygons = TRUE)
Generating 35 tiles...
Returning a list of tiles for an object of class GDALobj with 0 percent overlap
> tile.pol = SpatialPolygonsDataFrame(tiles.pol, tiles)
> plot(raster(fn), col=bpy.colors(20))
> lines(tile.pol, lwd=2)
Example of a tiling system derived using the GSIF::getSpatialTiles function.

rgdal further allows us to read only a single tile of the GeoTiff by using the offset and region.dim arguments:

> x = readGDAL(fn, offset=unlist(tiles[1,c("offset.y","offset.x")]), region.dim=unlist(tiles[1,c("region.dim.y","region.dim.x")]), output.dim=unlist(tiles[1,c("region.dim.y","region.dim.x")]), silent = TRUE)
> spplot(x)
A tile of a satellite image.

We would like to run a function on this raster in parallel, for example a simple function that converts values to 0/1 values based on a threshold:

fun_mask <- function(i, tiles, dir="./tiled/", threshold=190){
  out.tif = paste0(dir, "T", i, ".tif")
    x = readGDAL(fn, offset=unlist(tiles[i,c("offset.y","offset.x")]), region.dim=unlist(tiles[i,c("region.dim.y","region.dim.x")]), output.dim=unlist(tiles[i,c("region.dim.y","region.dim.x")]), silent = TRUE)
    x$mask = ifelse(x$band1>threshold, 1, 0)
    writeGDAL(x["mask"], type="Byte", mvFlag = 255, out.tif, options=c("COMPRESS=DEFLATE"))

This can now be run through mclapply function from the parallel package (which automatically employs all available cores):

> x0 = mclapply(1:nrow(tiles), FUN=fun_mask, tiles=tiles)

We can look in the the tiles folder, and this should shows 35 produced GeoTiffs. These can be further used to construct a virtual mosaic by using:

> t.lst <- list.files(path="./tiled", pattern=glob2rx("^T*.tif$"), full.names=TRUE, recursive=TRUE)
> cat(t.lst, sep="\n", file="SP27GTIF_tiles.txt")
> system('gdalbuildvrt -input_file_list SP27GTIF_tiles.txt SP27GTIF.vrt')
0...10...20...30...40...50...60...70...80...90...100 - done.
> system('gdalwarp SP27GTIF.vrt SP27GTIF_mask.tif -ot \"Byte\" -dstnodata 255 -co \"BIGTIFF=YES\" -r \"near\" -overwrite -co \"COMPRESS=DEFLATE\"')
Creating output file that is 699P x 929L.
Processing input file SP27GTIF.vrt.
Using internal nodata values (e.g. 255) for image SP27GTIF.vrt.
0...10...20...30...40...50...60...70...80...90...100 - done.

Note we use few important settings here for GDAL e.g. -overwrite -co “COMPRESS=DEFLATE” to overwrite the GeoTiff and internally compress it to save space and -r “near” basically no resampling just binding tiles together. Also, if the output GeoTiff is HUGE, you will most likely have to turn on -co “BIGTIFF=YES” otherwise gdalwarp would not run through. The output mosaic looks like this:

Final processed output.

So in summary, an efficient way to compute with R, when it comes to HUGE rasters, is to (1) design a tiling system that optimizes use of RAM and read/write spead of a disk, (2) prepare and test a function that can be then run in parallel, and (3) stitch back all tiles to a large raster using gdalwarp. Tiling and and stitching can not be applied universally to all problems (e.g. functions that require global geographical search or all data in the raster), in which cases tiling should be applied with overlap (to minimize boundary effects) or to irregular tiling systems (e.g. per watershed). Once an optimal tiling system and function is prepared, R is not any more limit to running efficient computing, but only how much RAM and cores you have available i.e. it becomes purely a hardware problem.

wiki soil covariates
 stars  from 1 votes (Details)