Importing and formatting soil data in R

This tutorial reviews soil data classes and functions for R: how to organize and reformat soil data in R for spatial analysis, how to import soil data into R, and how to export data and plot it in Google Earth. To learn more about the Global Soil Information Facilities (GSIF), visit the main project page. Some of the functions listed below are described in detail in the GSIF package for R.

Download this tutorial as an R script.

In the following example we look at how to convert texture-by-hand class estimates to texture fractions, i.e. sand, silt and clay content in %. We focus on the USDA texture-by-hand classes. These classes are implemented in the soiltexture package, kindly contributed by Julien Moeys. The USDA texture triangle can be accessed by:

> library(soiltexture)
Loading required package: sp
Loading required package: MASS
'soiltexture' loaded.
> TT.plot(class.sys = "USDA.TT")
Soil texture triangle based on the USDA system. Generated using the soiltexture package in R.

We can also print out a table with all class names and the vertex numbers that define each class:

> TT.classes.tbl(class.sys="USDA.TT", collapse=", ")
      abbr     name
 [1,] "Cl"     "clay"
 [2,] "SiCl"   "silty clay"
 [3,] "SaCl"   "sandy clay"
 [4,] "ClLo"   "clay loam"
 [5,] "SiClLo" "silty clay loam"
 [6,] "SaClLo" "sandy clay loam"
 [7,] "Lo"     "loam"
 [8,] "SiLo"   "silty loam"
 [9,] "SaLo"   "sandy loam"
[10,] "Si"     "silt"
[11,] "LoSa"   "loamy sand"
[12,] "Sa"     "sand"
      points
 [1,] "24, 1, 5, 6, 2"
 [2,] "2, 6, 7"
 [3,] "1, 3, 4, 5"
 [4,] "5, 4, 10, 11, 12, 6"
 [5,] "6, 12, 13, 7"
 [6,] "3, 8, 9, 10, 4"
 [7,] "10, 9, 16, 17, 11"
 [8,] "11, 17, 22, 23, 18, 19, 13, 12"
 [9,] "8, 14, 21, 22, 17, 16, 9"
[10,] "18, 23, 26, 19"
[11,] "14, 15, 20, 21"
[12,] "15, 25, 20"

Knowing that the soil texture classes are defined geometrically, a logical estimate of the texture fractions for a class is to take the geometric centre of its polygon in the texture triangle. To find the geometric centre, we can for example use the functionality of the sp package. We start by creating a SpatialPolygons object, for which we have to calculate coordinates in xy space and bind the polygons one by one:

> vert <- TT.vertices.tbl(class.sys = "USDA.TT")
> vert$x <- 1-vert$SAND+(vert$SAND-(1-vert$SILT))*0.5
> vert$y <- vert$CLAY*sin(pi/3)
> USDA.TT <- data.frame(TT.classes.tbl(class.sys = "USDA.TT", collapse = ", "))
> TT.pnt <- as.list(rep(NA, length(USDA.TT$name)))
> poly.lst <- as.list(rep(NA, length(USDA.TT$name)))
> for(i in 1:length(USDA.TT$name)){
+   ## strip the vertices numbers:
+   TT.pnt[[i]] <- as.integer(strsplit(unclass(paste(USDA.TT[i, "points"])), ", ")[[1]])
+   ## create a list of polygons:
+   poly.lst[[i]] <- vert[TT.pnt[[i]],c("x","y")]
+   ## add extra point:
+   poly.lst[[i]] <- Polygons(list(Polygon(rbind(poly.lst[[i]], poly.lst[[i]][1,]))), ID=i)
+ }
> ## convert texture triangle to a spatial object:
> poly.sp <- SpatialPolygons(poly.lst, proj4string=CRS(as.character(NA)))
> poly.USDA.TT <- SpatialPolygonsDataFrame(poly.sp, data.frame(ID=USDA.TT$name), match.ID=FALSE)

The resulting object now also contains slots of type labpt, the geometric centre (label point) of each polygon, automatically derived by the SpatialPolygons function. For the first polygon:

> slot(slot(poly.USDA.TT, "polygons")[[1]], "labpt")
[1] 0.4901261 0.5449653

Next we need to create a function that converts the xy coordinates (columns) in a texture triangle to texture fraction values. Let's call this function get.TF.from.XY:

> get.TF.from.XY <- function(df, xcoord, ycoord) {
+     df$CLAY <- df[,ycoord]/sin(pi/3)
+     df$SAND <- (2 - df$CLAY - 2 * df[,xcoord]) * 0.5
+     df$SILT <- 1 - (df$SAND + df$CLAY)
+     return(df)
+ }

and now everything is ready to estimate the soil fractions based on a system of classes. For the case of the USDA classification system we get:

> USDA.TT.cnt <- data.frame(t(sapply(slot(poly.USDA.TT, "polygons"), slot, "labpt")))
> USDA.TT.cnt$name <- poly.USDA.TT$ID
> USDA.TT.cnt <- get.TF.from.XY(USDA.TT.cnt, "X1", "X2")
> USDA.TT.cnt[,c("SAND","SILT","CLAY")] <- signif(USDA.TT.cnt[,c("SAND","SILT","CLAY")], 2)
> USDA.TT.cnt[,c("name","SAND","SILT","CLAY")]
              name  SAND  SILT  CLAY
1             clay 0.200 0.180 0.630
2       silty clay 0.067 0.470 0.470
3       sandy clay 0.520 0.067 0.420
4        clay loam 0.320 0.340 0.340
5  silty clay loam 0.100 0.560 0.340
6  sandy clay loam 0.600 0.130 0.270
7             loam 0.410 0.400 0.190
8       silty loam 0.210 0.650 0.130
9       sandy loam 0.650 0.250 0.100
10            silt 0.073 0.870 0.053
11      loamy sand 0.820 0.120 0.058
12            sand 0.920 0.050 0.033
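
To double-check the derived centres we can plot them over the triangle geometry (a minimal sketch in base R, using the objects created above):

> ## sketch: display the class centres on the triangle geometry
> plot(poly.sp)
> text(t(sapply(slot(poly.USDA.TT, "polygons"), slot, "labpt")), labels=paste(poly.USDA.TT$ID), cex=.7)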

Now that we have created a function that converts positions in the texture triangle to texture fractions, we can go further and estimate the uncertainty of each texture fraction derived from a class. For this we can use simulations, i.e. randomly sample 100 points within some texture class and then derive standard deviations for each texture fraction. Note that, although this sounds like a complicated operation, we can run it in two lines of code. For example, to estimate the uncertainty of converting the class Cl (clay) to texture fractions, we can simulate 100 random points within the class polygon using the spsample function from the sp package (Bivand et al. 2007):

> sim.Cl <- data.frame(spsample(poly.USDA.TT[poly.USDA.TT$ID=="clay",], type="random", n=100))
> sim.Cl <- get.TF.from.XY(sim.Cl, "x", "y")
> sd(sim.Cl$SAND); sd(sim.Cl$SILT); sd(sim.Cl$CLAY)
[1] 0.1298167
[1] 0.1119373
[1] 0.1529671

which means that we should not expect the precision of estimating the clay content based on the class Cl alone to be better than about ±15%.
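
The same simulation can be extended to all classes at once, for example via sapply (a minimal sketch reusing the objects created above):

> ## sketch: standard deviation of the clay fraction per class
> sapply(paste(poly.USDA.TT$ID), function(i){
+   sim.i <- data.frame(spsample(poly.USDA.TT[poly.USDA.TT$ID==i,], type="random", n=100))
+   round(sd(get.TF.from.XY(sim.i, "x", "y")$CLAY), 3)
+ })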

For a real soil profile data set we can also plot all texture fractions in the texture triangle to see how frequently each soil class can be expected to appear:

> require(GSIF)
> data(afsp)
> tdf <- afsp$horizons[,c("CLYPPT", "SLTPPT", "SNDPPT")]
> tdf <- tdf[!is.na(tdf$SNDPPT)&!is.na(tdf$SLTPPT)&!is.na(tdf$CLYPPT),]
> tdf <- tdf[runif(nrow(tdf))<.15,]
> tdf$Sum = rowSums(tdf)
> for(i in c("CLYPPT", "SLTPPT", "SNDPPT")) { tdf[,i] <- tdf[,i]/tdf$Sum * 100 }
> names(tdf)[1:3] <- c("CLAY", "SILT", "SAND")
> TT.plot(class.sys = "USDA.TT", tri.data = tdf, grid.show = FALSE, pch="+", cex=.4, col="red")
Distribution of observed soil textures for the Africa Soil Profiles.

This shows that not all positions in the triangle have the same prior probability. A more sensible way to estimate the uncertainty of converting soil texture classes to fractions would therefore be to run simulations using a density image showing the actual distribution of textures; by using the rpoint function from the spatstat package we could then derive even more realistic conversions from texture-by-hand classes to texture fractions.
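
A minimal sketch of that idea, assuming the tdf table and the get.TF.from.XY function created above (the resulting points could further be clipped to a single class polygon):

> library(spatstat)
> ## convert texture fractions (%) to triangle coordinates:
> tdf$x <- 1-tdf$SAND/100+(tdf$SAND/100-(1-tdf$SILT/100))*0.5
> tdf$y <- tdf$CLAY/100*sin(pi/3)
> ## density image of the observed textures:
> dens <- density(ppp(tdf$x, tdf$y, window=owin(c(0,1), c(0,sin(pi/3)))))
> ## simulate points with probability proportional to the observed density:
> sim.d <- get.TF.from.XY(data.frame(rpoint(100, f=dens)), "x", "y")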

In the next example we look at Munsell color codes and algorithms for converting a code to RGB and other color spaces. Munsell color codes can be matched with RGB values via the Munsell color codes conversion table. You can load a table with 2350 entries from the GSIF dokuwiki:

> load(file("http://gsif.isric.org/lib/exe/fetch.php?media=munsell_rgb.rdata"))
> library(colorspace)
> munsell.rgb[round(runif(1)*2350, 0),]
    Munsell R  G   B
81 10B_2_12 0 60 123
> as(RGB(R=munsell.rgb[1007,"R"]/255, G=munsell.rgb[1007,"G"]/255, B=munsell.rgb[1007,"B"]/255), "HSV")
            H          S         V
[1,] 3.529412 0.07981221 0.8352941

This shows that, for any given Munsell color code, it is relatively easy to convert it to any other color space available in R.
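
For example, the same RGB coordinates can be translated to the CIE-LAB space (a sketch using another color class from the colorspace package):

> as(RGB(R=munsell.rgb[1007,"R"]/255, G=munsell.rgb[1007,"G"]/255, B=munsell.rgb[1007,"B"]/255), "LAB")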

Within the R package aqp one can directly transform Munsell color codes into R color classes (Beaudette et al., 2013). For example, to convert the Munsell color code from the example above we would run:

> aqp::munsell2rgb(the_hue = "10B", the_value = 2, the_chroma = 12)
[1] "#003A7CFF"

The colors are now coded in hexadecimal format, which is quite abstract but can easily be browsed via some web color table. To get the actual RGB values we would run:

> col2rgb("#003A7CFF")
      [,1]
red      0
green   58
blue   124

The hex triplet format is also very similar to the color format used in KML:

> plotKML::col2kml("#003A7CFF")
[1] "#ff7c3a00"

To plot actual colors based on a real soil profile database we often need to prepare the color codes before we can run the conversion (Viscarra Rossel et al. 2006). In the case of the Africa Soil Profiles Database:

> data(afsp)
> paste(afsp$horizons[1:10,"MCOMNS"])
 [1] "NA"       "NA"       "NA"       "NA"
 [5] "10YR3/3"  "7.5YR5/2" "7.5YR5/4" "10YR6/4"
 [9] "10YR7/4"  "7.5YR7/4"

the Munsell color codes have been entered as plain text. Hence we need to spend some effort on separating hue from value and chroma before we can derive and plot the actual colors. We start by merging the tables of interest so that both the coordinates and the Munsell color codes are available in the same table (the join function comes from the plyr package):

> library(plyr)
> mcol <- join(afsp$horizons[,c("SOURCEID","MCOMNS","UHDICM","LHDICM")], afsp$sites[,c("SOURCEID","LONWGS84","LATWGS84")], type='inner')
Joining by: SOURCEID
> mcol <- mcol[!is.na(mcol$MCOMNS),]
> str(mcol)
'data.frame':   29525 obs. of  8 variables:
 $ SOURCEID: Factor w/ 16706 levels "AO 1445_P100/56",..: 888 888 888 888 888 888 889 889 889 889 ...
 $ MCOMNS  : Factor w/ 634 levels "<Null>","1;5YR4/6",..: 88 564 566 143 153 583 60 104 104 536 ...
 $ UHDICM  : num  0 8 25 50 81 133 0 8 19 30 ...
 $ LHDICM  : num  8 25 50 81 133 160 8 19 30 50 ...
 $ LONWGS84: num  17.6 17.6 17.6 17.6 17.6 ...
 $ LATWGS84: num  -11 -11 -11 -11 -11 ...

Next we need to convert all Munsell color codes to the Hue_Value_Chroma format used in the conversion table. We can replace the existing codes step by step until all of them can be matched with the RGB table:

> mcol$Munsell <- sub(" ", "", sub("/", "_", mcol$MCOMNS))
> hue.lst <- expand.grid(c("2.5", "5", "7.5", "10"), c("YR","GY","BG","YE","YN","YY","R","Y","B","G"))
> hue.lst$mhue <- paste(hue.lst$Var1, hue.lst$Var2, sep="")
> for(j in hue.lst$mhue[1:28]){
+   mcol$Munsell <- sub(j, paste(j, "_", sep=""), mcol$Munsell, fixed=TRUE)
+ }
> mcol$depth <- mcol$UHDICM + (mcol$LHDICM-mcol$UHDICM)/2
> mcol.RGB <- merge(mcol, munsell.rgb, by="Munsell")
> str(mcol.RGB)
'data.frame':   11404 obs. of  11 variables:
 $ Munsell : chr  "10R_2_2" "10R_2_2" "10R_2_2" "10R_3_2" ...
 $ SOURCEID: Factor w/ 16706 levels "AO 1445_P100/56",..: 11730 11730 13232 4026 7665 14862 4035 1185 7294 7287 ...
 $ MCOMNS  : Factor w/ 634 levels "<Null>","1;5YR4/6",..: 9 9 9 12 12 12 12 12 12 12 ...
 $ UHDICM  : num  10 0 0 0 98 0 20 0 10 45 ...
 $ LHDICM  : num  35 10 30 15 150 5 53 11 31 82 ...
 $ LONWGS84: num  32.23 32.23 4.76 14.84 39.83 ...
 $ LATWGS84: num  -26.15 -26.15 8.79 -5.08 -3.62 ...
 $ depth   : num  22.5 5 15 7.5 124 2.5 36.5 5.5 20.5 63.5 ...
 $ R       : int  67 67 67 91 91 91 91 91 91 91 ...
 $ G       : int  48 48 48 68 68 68 68 68 68 68 ...
 $ B       : int  45 45 45 63 63 63 63 63 63 63 ...

This allows us to plot the actual observed colors of the topsoil (0-30 cm) for the whole of Africa:

> mcol.RGB <- mcol.RGB[!is.na(mcol.RGB$R),]
> mcol.RGB$Rc <- round(mcol.RGB$R/255, 3)
> mcol.RGB$Gc <- round(mcol.RGB$G/255, 3)
> mcol.RGB$Bc <- round(mcol.RGB$B/255, 3)
> mcol.RGB$col <- rgb(mcol.RGB$Rc, mcol.RGB$Gc, mcol.RGB$Bc)
> mcol.RGB <- mcol.RGB[mcol.RGB$depth>0 & mcol.RGB$depth<30 & !is.na(mcol.RGB$col),]
> coordinates(mcol.RGB) <- ~ LONWGS84+LATWGS84
> download.file("http://gsif.isric.org/lib/exe/fetch.php?media=admin.af.rda", "admin.af.rda")
trying URL 'http://gsif.isric.org/lib/exe/fetch.php?media=admin.af.rda'
Content type 'application/r-core' length 435084 bytes (424 Kb)
opened URL
downloaded 424 Kb
> load("admin.af.rda")
> proj4string(admin.af) <- "+proj=latlong +datum=WGS84"
> country <- as(admin.af, "SpatialLines")
> par(mar=c(.0,.0,.0,.0))
> plot(country, col="darkgrey", asp=1)
> points(mcol.RGB, pch=21, bg=mcol.RGB$col, col="black")
Actual observed soil colors (moist) for the top soil based on the Africa Soil Profiles Database.

Finally, via the plotKML package you can also plot the actual colors of horizons by converting tables to the SoilProfileCollection class of the aqp package. Consider this soil profile from Nigeria:

> lon = 3.90; lat = 7.50; id = "ISRIC:NG0017"; FAO1988 = "LXp"
> top = c(0, 18, 36, 65, 87, 127)
> bottom = c(18, 36, 65, 87, 127, 181)
> ORCDRC = c(18.4, 4.4, 3.6, 3.6, 3.2, 1.2)
> hue = c("7.5YR", "7.5YR", "2.5YR", "5YR", "5YR", "10YR")
> value = c(3, 4, 5, 5, 5, 7); chroma = c(2, 4, 6, 8, 4, 3)
> library(aqp)
> ## prepare a SoilProfileCollection:
> prof1 <- join(data.frame(id, top, bottom, ORCDRC, hue, value, chroma),  data.frame(id, lon, lat, FAO1988), type='inner')
Joining by: id
> prof1$soil_color <- with(prof1, munsell2rgb(hue, value, chroma))
Notice: converting hue to character
> depths(prof1) <- id ~ top + bottom
Warning message:
converting IDs from factor to character
> site(prof1) <- ~ lon + lat + FAO1988
> coordinates(prof1) <- ~ lon + lat
> proj4string(prof1) <- CRS("+proj=longlat +datum=WGS84")
> prof1
Object of class SoilProfileCollection
Number of profiles: 1
 
Horizon attributes:
            id top bottom ORCDRC   hue value chroma soil_color
1 ISRIC:NG0017   0     18   18.4 7.5YR     3      2  #584537FF
2 ISRIC:NG0017  18     36    4.4 7.5YR     4      4  #7E5A3BFF
3 ISRIC:NG0017  36     65    3.6 2.5YR     5      6  #A96C4FFF
4 ISRIC:NG0017  65     87    3.6   5YR     5      8  #B06A32FF
5 ISRIC:NG0017  87    127    3.2   5YR     5      4  #9A7359FF
6 ISRIC:NG0017 127    181    1.2  10YR     7      3  #C4AC8CFF
 
Sampling site attributes:
            id FAO1988
1 ISRIC:NG0017     LXp
 
Spatial Data:
    min max
lon 3.9 3.9
lat 7.5 7.5
CRS arguments: +proj=longlat +datum=WGS84

Once an object is formatted as a SoilProfileCollection, it can be plotted directly in Google Earth via the generic plotKML command:

> library(plotKML)
> plotKML(prof1, var.name="ORCDRC", color.name="soil_color")
KML file opened for writing...
Writing KML...
Closing  prof1.kml
Soil profile from Nigeria plotted in Google Earth with actual observed colors.

In the following examples we look at possibilities for using Machine Learning to predict soil properties and classes from other soil properties and classes. In the first example, we try to build a Pedo-Transfer Function (PTF) to predict bulk density from soil properties such as organic carbon content, soil texture and coarse fragments. Bulk density is often available for only a part of the soil profiles, so if we can use a PTF to fill the gaps, we most likely do not need to drop bulk density (BD) from further analysis. For testing PTFs that predict bulk density from other soil properties we will use a subset of the ISRIC WISE soil profile data set (Batjes, 2006), which can be loaded from:

> library(randomForestSRC)
> library(ggRandomForests)
> library(ggplot2)
> library(scales)
 
> load(file("http://gsif.isric.org/lib/exe/fetch.php?media=sprops.wise.rda"))
> str(SPROPS.WISE)
'data.frame':    47833 obs. of  17 variables:
 $ SOURCEID: Factor w/ 10253 levels "AF0001","AF0002",..: 1 1 1 2 2 2 2 3 3 3 ...
 $ SAMPLEID: chr  "AF0001_1" "AF0001_2" "AF0001_3" "AF0002_1" ...
 $ UHDICM  : int  0 15 60 0 20 60 110 0 20 50 ...
 $ LHDICM  : int  15 60 150 20 60 110 170 20 50 110 ...
 $ DEPTH   : num  7.5 37.5 105 10 40 85 140 10 35 80 ...
 $ CRFVOL  : int  20 NA NA NA NA NA NA NA NA NA ...
 $ CECSUM  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ SNDPPT  : int  40 10 10 40 15 10 40 40 65 60 ...
 $ CLYPPT  : int  20 35 35 20 20 35 20 20 10 25 ...
 $ BLD     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ SLTPPT  : int  40 55 55 40 65 55 40 40 25 15 ...
 $ PHIHOX  : num  7.9 7.9 7.9 8.5 8.6 8.5 8.8 8.8 9.2 8.9 ...
 $ PHIKCL  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ORCDRC  : num  7.6 2.3 0.9 12.8 6 3.9 2.7 5.9 2.4 NA ...
 $ LONWGS84: num  69.2 69.2 69.2 69.2 69.2 ...
 $ LATWGS84: num  34.5 34.5 34.5 34.5 34.5 34.5 34.5 34.5 34.5 34.5 ...
 $ SOURCEDB: chr  "WISE" "WISE" "WISE" "WISE" ...

For model fitting we will use the randomForestSRC package, a robust implementation of the random forest algorithm with options for parallelization and visualization of model outputs:

> rfsrc_BD <- rfsrc(BLD ~ ORCDRC + PHIHOX + SNDPPT + CLYPPT + CRFVOL + DEPTH, data=SPROPS.WISE)
> rfsrc_BD
                         Sample size: 3330
                     Number of trees: 1000
          Minimum terminal node size: 5
       Average no. of terminal nodes: 688.877
No. of variables tried at each split: 2
              Total no. of variables: 6
                            Analysis: RF-R
                              Family: regr
                      Splitting rule: mse
                % variance explained: 39.88
                          Error rate: 46197.83

which shows that the model explains only about 40% of the variance (RMSE of about ±215), but it could at least be used to fill in the missing values for BD, which can be numerous. We can plot the partial plots between the target variable and all covariates:
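
The plotting call itself is not shown in the transcript; a minimal sketch using the plot.variable function from the randomForestSRC package would be:

> ## sketch: partial dependence plots for all six covariates
> plot.variable(rfsrc_BD, partial=TRUE, smooth.lines=TRUE)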

Bulk density as a function of organic carbon, pH, sand and clay content, coarse fragments and depth.

Obviously the key covariate for explaining bulk density is soil organic carbon, while depth and pH are the second and third most important covariates. Using this MLA-based model we can predict bulk density for various combinations of soil properties:

> predict(rfsrc_BD, data.frame(ORCDRC=1.2, PHIHOX=7.6, SNDPPT=45, CLYPPT=12, CRFVOL=0, DEPTH=20))$predicted
[1] 1544.766
> predict(rfsrc_BD, data.frame(ORCDRC=150, PHIHOX=4.6, SNDPPT=25, CLYPPT=35, CRFVOL=0, DEPTH=20))$predicted
[1] 903.9599
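
To actually fill the gaps in the WISE table we could then predict BD for all horizons where it is missing but the covariates are complete (a minimal sketch):

> ## sketch: fill in missing bulk density values using the fitted PTF
> covs <- c("ORCDRC","PHIHOX","SNDPPT","CLYPPT","CRFVOL","DEPTH")
> sel <- is.na(SPROPS.WISE$BLD) & complete.cases(SPROPS.WISE[,covs])
> SPROPS.WISE$BLD[sel] <- predict(rfsrc_BD, SPROPS.WISE[sel,])$predicted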

In the second example we use the ISRIC WISE data set to build a correlation function that translates soil classes from one classification system to another. The training data can be loaded from:

> load(file("http://gsif.isric.org/lib/exe/fetch.php?media=wise_tax.rda"))
> str(WISE_tax)
'data.frame':    8189 obs. of  7 variables:
 $ SOURCEID: Factor w/ 8189 levels "AF0001","AF0002",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ LATWGS84: num  34.5 34.5 34.5 34.3 32.4 ...
 $ LONWGS84: num  69.2 69.2 69.2 61.4 62.1 ...
 $ TAXNWRB : Factor w/ 146 levels "#N/A","Albic Arenosol",..: 104 9 9 72 17 16 122 49 8 9 ...
 $ TAXOUSDA: Factor w/ 1728 levels ""," Calciorthid",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ LFORM   : chr  "LV" "LV" "LV" "LV" ...
 $ LANDUS  : chr  "AA4" "AA6" "AA6" "AA4" ...

We also need to get the cleaned legend for the USDA classification:

> leg <- read.csv(file("http://gsif.isric.org/lib/exe/fetch.php?media=taxousda_greatgroups.csv"))
> str(leg)
'data.frame':    434 obs. of  4 variables:
 $ Great_Group: Factor w/ 434 levels "Acraquox","Acrohumox",..: 9 57 77 112 121 145 170 259 286 301 ...
 $ Suborder   : Factor w/ 79 levels "Albolls","Andepts",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Order      : Factor w/ 12 levels "Alfisols","Andisols",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ TAX        : Factor w/ 434 levels "Alfisols_Aqualfs_Albaqualfs",..: 1 2 3 4 5 6 7 8 9 10 ...

Our objective is to develop a function that translates WRB classes into USDA classes with the help of some soil properties. We can try adding soil pH and clay content to increase the accuracy of the model:

> x.PHIHOX <- aggregate(SPROPS.WISE$PHIHOX, by=list(SPROPS.WISE$SOURCEID), FUN=mean, na.rm=TRUE); names(x.PHIHOX)[1] = "SOURCEID"
> x.CLYPPT <- aggregate(SPROPS.WISE$CLYPPT, by=list(SPROPS.WISE$SOURCEID), FUN=mean, na.rm=TRUE); names(x.CLYPPT)[1] = "SOURCEID"
> WISE_tax$PHIHOX <- join(WISE_tax, x.PHIHOX, type="left")$x
Joining by: SOURCEID
> WISE_tax$CLYPPT <- join(WISE_tax, x.CLYPPT, type="left")$x
Joining by: SOURCEID

After that we need to clean up the classes so that we can focus on USDA suborders only:

> WISE_tax.sites <- WISE_tax[complete.cases(WISE_tax[,c("TAXNWRB","PHIHOX","CLYPPT","TAXOUSDA")]),]
> WISE_tax.sites$TAXOUSDA.f <- NA
> for(j in leg$Suborder){
+   sel <- grep(j, WISE_tax.sites$TAXOUSDA, ignore.case=TRUE)
+   WISE_tax.sites$TAXOUSDA.f[sel] = j
+ }
> WISE_tax.sites$TAXOUSDA.f <- as.factor(WISE_tax.sites$TAXOUSDA.f)
> WISE_tax.sites$TAXNWRB <- as.factor(paste(WISE_tax.sites$TAXNWRB))

and finally we can fit a model to translate WRB profiles to USDA suborders:

> TAXNUSDA.rf <- rfsrc(TAXOUSDA.f ~ TAXNWRB + PHIHOX + CLYPPT, data=WISE_tax.sites)
> TAXNUSDA.rf
                        Sample size: 962
           Frequency of class labels: 17, 25, 10, 44, 3, 15, 10, 27, 16, 1, 3, 1, 11, 1, 3, 20, 15, 10, 4, 13, 9, 95, 13, 19, 4, 1, 3, 48, 33, 7, 16, 56, 56, 3, 71, 46, 47, 127, 40, 4, 2, 6, 7
                     Number of trees: 1000
          Minimum terminal node size: 1
       Average no. of terminal nodes: 359.212
No. of variables tried at each split: 2
              Total no. of variables: 3
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini
              Normalized Brier score: 73.88
                          Error rate: 0.55, 0.53, 0.48, 0.7, 0.14, 1, 0.2, 0.8, 0.26, 0.56, 1, 1, 1, 0.55, 1, 0.67, 0.65, 1, 1, 1, 0.54, 0.89, 0.59, 1, 0.53, 0.5, 1, 1, 0.69, 0.76, 1, 0.94, 0.91, 0.57, 1, 0.39, 0.07, 0.45, 0.35, 0.8, 1, 1, 0.83, 0.71
 
...
    Overall error rate: 55.09%

which shows that the average accuracy is about 45%. We can test converting some classes with the help of additional soil properties:

> newdata = data.frame(TAXNWRB=factor("Calcaric Cambisol", levels=levels(WISE_tax.sites$TAXNWRB)), PHIHOX=7.8, CLYPPT=12)
> x <- data.frame(predict(TAXNUSDA.rf, newdata, type="prob")$predicted)
> x[,order(1/x)[1:2]]
  Ochrepts Orthids
1   0.3205   0.153

So, for example, the two most likely USDA counterparts of Calcaric Cambisols seem to be Ochrepts and Orthids, which in fact is not much different from expert opinion (Krasilnikov, 2009).
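
For repeated use, the lookup above can be wrapped into a small helper (a hypothetical sketch; translate.WRB2USDA is not part of any package):

> translate.WRB2USDA <- function(WRB, PHIHOX, CLYPPT, n=2){
+   newdata <- data.frame(TAXNWRB=factor(WRB, levels=levels(WISE_tax.sites$TAXNWRB)), PHIHOX=PHIHOX, CLYPPT=CLYPPT)
+   x <- data.frame(predict(TAXNUSDA.rf, newdata, type="prob")$predicted)
+   x[,order(1/x)[1:n]]
+ }
> translate.WRB2USDA("Calcaric Cambisol", PHIHOX=7.8, CLYPPT=12)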

