Supporting information
Supporting information for Valavi, R., Elith, J., Lahoz-Monfort, J., Guillera-Arroita, G. (2021) Modelling species presence-only data with random forests. Ecography, 44: 1–12. DOI: 10.1111/ecog.05615
To account for the known sampling bias in the NCEAS dataset (Phillips et al., 2009), we generate two types of background samples. The first is the target-group background (TGB) samples introduced by Phillips et al. (2009). In this method, the presence sites from each biological group (including the target species itself) are used as the background samples for every species in the same biological group. The idea is that the set of sites for the biological group indicates sampling effort across the region, and hence informs the model about the sampling effort that produced the data for any one species. Consistent with the presence samples, the TGB data are reduced to one record per grid cell. For details, see Phillips et al. (2009) and the vignette "Modeling NCEAS data" in the disdat package (Elith et al., 2020). The results for these background samples are presented in the next section.
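As an illustration of how TGB samples can be assembled, here is a minimal sketch using the disdat package. This is not the exact script used for the paper; the group column follows the disdat presence data, and the group label used here is an assumption (check the output of disPo()):
library(disdat)
library(dplyr)
# presences of all species in one biological group form the TGB
# for every species of that group (here: the SWI tree group, as an example)
swi_po <- disPo("SWI")
tgb <- swi_po %>%
  dplyr::filter(group == "tree") %>% # group label assumed
  dplyr::distinct(x, y, .keep_all = TRUE) # keep one record per site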
As the number of TGB samples is not very high (see the next section), to keep a ratio of background to presence samples similar to that used for the virtual species, we implemented a second approach. In this method, 50,000 background samples are taken from a bias layer. The bias layer is generated by a kernel density estimation (KDE) of the TGB samples, following the approach implemented in Kujala et al. (2015). For the bandwidth of the KDE, we used 1/5 of the largest extent of each region. The KDE is fitted with the spatialEco R package (Evans, 2020). The R code below shows how we generated the samples. Both the KDE and TGB samples, and the code to fit the models on the NCEAS dataset, are provided in the supplementary materials archived on the Open Science Framework (OSF).
# loading required libraries
library(spatialEco)
library(dismo)
library(disdat)
library(sf)
# read TGB data
tgbsample <- read.csv("background_samples/tgbs/swi_tree.bak.csv")
# convert to spatial object
tgbsp <- sf::st_as_sf(tgbsample, coords = c("x", "y"), crs = disdat::disCRS("SWI"))
plot(tgbsp)
# read a raster mask for the region
rs <- raster::raster("background_samples/tgbs_mask/SWI.tif")
# get the largest extent of the region
yext <- (raster::extent(rs)[4] - raster::extent(rs)[3]) / 5 # 1/5 of the y range
xext <- (raster::extent(rs)[2] - raster::extent(rs)[1]) / 5 # 1/5 of the x range
max_ext <- max(c(yext, xext))
tgb_kde <- spatialEco::sp.kde(x = sf::as_Spatial(tgbsp),
                              bw = round(max_ext, 2),
                              newdata = rs,
                              standardize = TRUE,
                              scale.factor = 10000)
plot(tgb_kde)
# generate 50k random samples from the KDE raster file
kde_samples <- dismo::randomPoints(tgb_kde,
                                   n = 50000,
                                   prob = TRUE)
The performance of the models fitted with the NCEAS presence-TGB samples is shown in Fig. S1. Their performance is similar to that of the models fitted on the KDE samples (presented in the main text). In both evaluation metrics, the default classification RF performed relatively poorly in all regions (except SWI), followed by the regression and weighted RFs. The SWI dataset gained the highest overall AUC; this could be because SWI has more occurrence points (Elith et al., 2020). The weighted RF has a higher COR than the default and regression RFs in both the SA and AWT regions. The down-sample, equal-sample and both shallow RFs achieved higher AUC and COR in all regions. Overall, the AUC and COR for the TGB samples are higher than those for the KDE samples.
As we discussed in the main paper, the problem for RF is not only the high number of background samples but also the environmental overlap between the presence and background/absence sites. This is evident in the results of the models fitted on the TGB samples. The ratio of presences to TGB samples is larger than the ratio of presences to the 50,000 KDE samples (see the number of TGB samples below). Even though the presence-TGB samples are not extremely imbalanced, we still see improvement in the models using the sampling approaches (equal-sample and down-sample) and the shallow methods.
Number of TGB samples:
[1] 147 379 530 569 714 1221 1351 2503 3298 11429
Figure S1: Performance of different RF models on different regions of the NCEAS data with TGB samples. The values in the cells are AUC and COR for the left and right panels, respectively. Colour varies with the values, from yellow (low) to dark navy (high).
The Friedman's Aligned Rank test was highly significant, indicating a statistically significant difference between the models in both metrics. The statistical significance of the models based on the adjusted p-values is visualised in Fig. S2. This figure shows that: 1) shallow, shallow-tuned, down-sample and equal-sample are not significantly different; 2) all four of these models are significantly better than the default, weighted and regression RF models; and 3) the default, weighted and regression RFs are also not significantly different from each other.
Friedman's rank sum test
data: auc_wide
Friedman's chi-squared = 298.52, df = 6, p-value < 2.2e-16
Figure S2: Statistical significance of AUC for the different models fitted on presence-TGB data.
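For reference, the printed output above corresponds to the base-R friedman.test. A minimal sketch of the test and the pairwise post-hoc comparisons follows; auc_wide is assumed to be a species-by-model matrix of AUC values, and the Wilcoxon signed-rank test with Holm correction is used here as one possible choice for the adjusted p-values:
# a sketch, assuming auc_wide is a species-by-model matrix of AUC values
friedman.test(as.matrix(auc_wide))
# pairwise post-hoc comparisons with adjusted p-values
auc_long <- data.frame(auc   = as.vector(as.matrix(auc_wide)),
                       model = rep(colnames(auc_wide), each = nrow(auc_wide)))
pairwise.wilcox.test(auc_long$auc, auc_long$model,
                     paired = TRUE, p.adjust.method = "holm")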
Each box in Figs S2 and S3 indicates a model, and boxes connected by a line are those with no statistically significant difference. The green box shows the model with the highest rank in that metric and dataset.
Friedman's rank sum test
data: auc_wide
Friedman's chi-squared = 91.989, df = 6, p-value < 2.2e-16
Figure S3: Statistical significance of COR for the different models fitted on presence-TGB data.
Notice that only models with a direct link are not statistically different (in that specific metric and dataset). Although all models are connected in Fig. S3, not all of them are directly connected. Fig. S3 shows that the differences in the ranks of equal-sample, down-sample, shallow and shallow-tuned are statistically insignificant. The rank of shallow is not significantly different from weighted, but the ranks of the other three (equal-sample, down-sample and shallow-tuned) are. The regression RF and the default are not statistically different from the weighted classification RF, but they are different from the shallow RFs, equal-sample and down-sample. All the following figures can be interpreted in the same way.
Friedman's rank sum test
data: auc_wide
Friedman's chi-squared = 431.72, df = 6, p-value < 2.2e-16
Figure S4: Statistical significance of AUC for the different models fitted on presence-KDE data.
Friedman's rank sum test
data: auc_wide
Friedman's chi-squared = 243.89, df = 6, p-value < 2.2e-16
Figure S5: Statistical significance of COR for the different models fitted on presence-KDE data.
Figure S6: The average rank of AUC vs. COR for the models fitted on presence-KDE data. The x and y axes are reversed so that the lowest rank (better models) lies in the top-right corner and vice versa.
Fig. S6 shows the mean rank of the models on the presence-KDE dataset. Down-sample has the lowest rank (the best rank; top-right) in both AUC and COR.
Friedman's rank sum test
data: auc_wide
Friedman's chi-squared = 504.08, df = 6, p-value < 2.2e-16
Figure S7: Statistical significance of AUC for the different models fitted to the simulated species.
Figure S8: The average rank of AUC vs. COR for the models fitted to the simulated species. The x and y axes are reversed so that the lowest rank (better models) lies in the top-right corner and vice versa.
For the simulated species, to test whether there is variability in the predictions of each modelling approach, we fitted each model 10 times to the 100 virtual species and took the standard deviation of AUC and COR (Fig. S9). Shallow and down-sample showed the lowest variability in their AUC and COR. In contrast, equal-sample and shallow-tuned had the highest variability of all. The variability of equal-sample arises because it does not take many background samples, so the model produces a slightly different result each time. Shallow-tuned uses random cross-validation to tune its parameters, so a slightly different set of parameters can be selected each time. Note that the level of variability is relatively small overall.
Figure S9: Standard deviation of AUC and COR over ten runs on each virtual species.
We implemented six strategies for modelling presence-background data with RF (plus the default classification RF model). Here, we provide code examples showing how we chose the parameters.
This example is run on a virtual species generated from the raster predictors provided in the dismo R package, available from CRAN. This is different from the Australian virtual species used in the main manuscript. Note that this example uses only 10,000 random background samples (close to all the cells available in the raster layers). This is enough to demonstrate how the models are fitted, and it can be repeated by readers (for the complete code of our analysis, see the code archived on OSF).
# loading required packages
library(dismo)
library(dplyr)
library(virtualspecies)
# reading raster files
fnames <- list.files(path = paste0(system.file(package = "dismo"), "/ex"),
                     pattern = "grd",
                     full.names = TRUE)[-9] # remove the last covariate
predictors <- raster::stack(fnames)
# generate a virtual species
virtual_sp <- generateSpFromPCA(raster.stack = predictors,
                                means = c(-3, -1),
                                sds = c(0.6, 1.5),
                                plot = FALSE)
virtual_sp <- virtual_sp$suitab.raster
plot(virtual_sp)
# sample some points for generating presence samples
set.seed(100)
rnd_points <- randomPoints(virtual_sp, 150, prob = TRUE)
# generate presence samples with binomial trials using
# suitability of virtual species as success rate
set.seed(100)
occurrence <- rnd_points %>%
  as.data.frame() %>%
  mutate(prob = raster::extract(virtual_sp, .),
         occ = rbinom(nrow(.), 1, prob = prob)) %>%
  dplyr::filter(occ == 1) %>%
  dplyr::select(x, y)
# plot occurrence points
plot(virtual_sp)
points(occurrence$x, occurrence$y)
# generating background points
bg <- randomPoints(predictors, n = 10000) %>%
  as.data.frame()
# extract raster values and create training data
training <- bind_rows(occurrence, bg) %>% # combine occurrence and background points
  raster::extract(predictors, .) %>% # extract the values of the raster covariates
  as.data.frame() %>%
  mutate(occ = c(rep(1, nrow(occurrence)), rep(0, nrow(bg)))) %>% # create the 0/1 response
  tidyr::drop_na() # drop rows with NA values
# print the first few rows and columns
head(training)
bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 occ
1 267 2351 961 244 331 213 118 264 1
2 255 3416 1491 194 330 190 139 248 1
3 257 2782 1302 161 335 199 136 245 1
4 266 2090 982 156 326 214 112 262 1
5 248 2911 1389 148 327 184 144 238 1
6 263 2402 857 270 319 212 107 260 1
randomForest package:
Four of the strategies presented in the main text, plus the default classification RF, are implemented with the randomForest package (Liaw & Wiener, 2002): the down-sample, equal-sample, weighted and regression RFs. The other two solutions, namely shallow and shallow-tuned, are fitted with the ranger package in the results of the main text, because ranger provides better options (namely the Hellinger distance splitting criterion, and direct control over tree depth). Thus we do not fit the shallow RF with the randomForest package.
RF fits 500 trees by default; here we used 1000 trees for consistency across all the methods implemented in this package. Although RF is not very sensitive to increases in the ntree parameter (Probst et al., 2019a), a high number of trees is recommended to achieve more stable predictions and variable importance (Probst et al., 2019b).
# loading the library
library(randomForest)
# convert the response to a factor for classification
training$occ <- as.factor(training$occ)
# default classification RF
rf_def <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000) # number of trees
# predict to raster layers
pred_def <- predict(predictors, rf_def, type = "prob", index = 2)
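As a quick check that 1000 trees is sufficient, the OOB error can be plotted against the number of trees; a flat curve towards the right of the plot suggests that ntree is large enough. A quick sketch using the model fitted above:
# OOB error as a function of the number of trees
plot(rf_def, main = "OOB error vs. number of trees")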
For the weighted classification RF, we calculated weights that are inversely proportional to the class sizes (the numbers of presence and background samples) (Albon, 2018). This creates one weight for each class, also known as class weights.
As an aside, weighting can work differently in different software implementations of RF. For instance, the weights in the RF implemented in the R package ranger (Wright & Ziegler, 2017), and in the cforest model from the party package (Zeileis et al., 2006), operate by changing the proportion of each class in the bootstrap samples, i.e. over/under-sampling of the classes. In the randomForest package, by contrast, the weights (costs) are applied to the Gini index (denoted \(Gini_w\) here) as follows (Breiman et al., 1984; Kuhn & Johnson, 2013):
\[ Gini_w = C(0|1)p_0 (1- p_0 ) + C(1|0) p_1 (1- p_1 )=[C(0|1)+C(1|0)] p_0 p_1, \]
where \(p_0\) and \(p_1\) are the proportions of the background and presence classes in the partition, respectively; \(C(0|1)\) is the cost of mistakenly predicting a class 1 (presence) sample as class 0 (background), and \(C(1|0)\) is the opposite.
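To make the formula concrete, here is a small numeric sketch (illustrative numbers only, not taken from the paper). With inverse-proportion class weights, a node that mixes the rare presence class with the background incurs a much larger impurity, so splits that isolate the presences are preferred:
# illustrative numbers: 1% presences overall, so the inverse-proportion costs are
c01 <- 1 / 0.01 # cost of predicting a presence (1) as background (0)
c10 <- 1 / 0.99 # cost of predicting a background (0) as presence (1)
# a candidate node containing 5% presences
p1 <- 0.05; p0 <- 0.95
2 * p0 * p1           # unweighted Gini: 0.095
(c01 + c10) * p0 * p1 # weighted Gini: ~4.8
With that in mind, the class weights are calculated and passed to the classwt argument below.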
# calculate the weights
cwt <- 1 / prop.table(table(training$occ))
# weighted classification RF
rf_wtd <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000,
                       classwt = cwt)
# predict to raster layers
pred_wtd <- predict(predictors, rf_wtd, type = "prob", index = 2)
For the equal-sample approach (Barbet-Massin et al., 2012), the model is fitted on 10 datasets, each with the same number of presence and background samples. The same procedure can also be done with the ranger package by replacing the model-fitting and prediction code (see the sketch after the code below).
# equal number of presences and background samples, with 10 repeats
prNum <- as.numeric(table(training$occ)["1"]) # number of presences
rf_eql_models <- list()
for(t in seq_len(10)){
  eqsamp <- c(which(training$occ == 1), sample(which(training$occ == 0), prNum))
  rf_eql <- randomForest(occ ~ .,
                         data = training[eqsamp, ],
                         ntree = 1000)
  rf_eql_models[[t]] <- rf_eql
}
# predict to the raster layers with the 10 fitted models and average the predictions
pred_eql <- lapply(rf_eql_models, function(x) predict(predictors, x, type = "prob", index = 2)) %>%
  raster::stack() %>%
  mean()
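As noted above, each equal-sample iteration can equally be run with ranger by swapping the fitting and prediction calls. A minimal sketch of one iteration, using the training object and prNum from above:
# one equal-sample iteration with ranger (repeat 10 times as above)
library(ranger)
eqsamp <- c(which(training$occ == 1), sample(which(training$occ == 0), prNum))
rng_eql <- ranger(formula = occ ~ .,
                  data = training[eqsamp, ],
                  num.trees = 1000,
                  probability = TRUE) # requires a factor response
pred_rng_eql <- predict(predictors, rng_eql,
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])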
The down-sampling method uses the same number of presence and background samples in the bootstrap sample of each tree. This is done with the sampsize argument. This argument accepts a named vector containing the number of samples to be taken from each class (presences and backgrounds here) when each tree is grown. Although equal numbers of presence and background samples are taken for each tree, the final forest will have seen many background samples, as each tree takes a different random sample.
# calculate the sub-sample sizes
prNum <- as.numeric(table(training$occ)["1"]) # number of presence records
spsize <- c("0" = prNum, "1" = prNum) # sample size for both classes
# RF with down-sampling
rf_dws <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000,
                       sampsize = spsize,
                       replace = TRUE) # sample with replacement (the default)
# predict to raster layers
pred_dws <- predict(predictors, rf_dws, type = "prob", index = 2)
To fit RF with regression trees (regression-RF), the response variable (the 0s and 1s of the presence-background data) should be numeric rather than categorical (a factor). Here, the response (occ) was converted to a factor for fitting the classification RFs, so we convert it back to numeric to fit a regression RF model.
# convert back to numeric for regression model
training$occ <- as.numeric(as.character(training$occ))
# regression RF
rf_reg <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000)
# predict to raster layers
pred_reg <- predict(predictors, rf_reg)
ranger package:
For our study, we only fitted the shallow and shallow-tuned models with the ranger package (Wright & Ziegler, 2017); however, all the other methods can be fitted with ranger as well. Here we present the code for modelling the virtual species data with this package. In the un-tuned version of the shallow model (i.e., shallow-RF), we use a constant tree depth of 2 and the Hellinger distance splitting criterion. We do not repeat the equal-sample approach here; readers should be able to do it easily by replacing the code from the previous examples (see also the sketch in the equal-sample section above).
We first fit a regression RF model with the ranger package. To speed up the computation, you can specify the number of CPU cores used for parallel processing with the num.threads argument.
# loading the library
library(ranger)
# fit a regression RF with the ranger package
rng_reg <- ranger(formula = occ ~ .,
                  data = training,
                  num.trees = 1000,
                  num.threads = 6) # number of CPU cores to use
# predict to the raster layers
pred_rng_reg <- predict(predictors, rng_reg,
                        fun = function(model, ...) predict(model, ...)$predictions)
Notice that predicting with a ranger model returns a list of different information, including the predictions. So, for predicting on raster layers, you need to provide a function (as shown above) that extracts the correct element inside the predict function of the raster package.
For fitting an RF with classification trees, you need to change the response column to a factor variable. You also need to set the probability argument to TRUE to get probability predictions (note that with presence-background data the output is interpreted as a relative likelihood rather than a probability).
# convert the response to a factor for classification
training$occ <- as.factor(training$occ)
# fitting a default classification RF in ranger
rng_def <- ranger(formula = occ ~ .,
                  data = training,
                  num.trees = 1000,
                  probability = TRUE, # fit a probability forest
                  num.threads = 6) # number of CPU cores to use
# predict to the raster layers
pred_rng_def <- predict(predictors, rng_def,
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])
For fitting a weighted RF, we use the same weights as applied in the randomForest package.
# calculate the weights
cwt <- 1 / prop.table(table(training$occ))
# fitting a weighted RF in ranger
rng_wtd <- ranger(formula = occ ~ .,
                  data = training,
                  num.trees = 1000,
                  probability = TRUE, # fit a probability forest
                  class.weights = cwt,
                  num.threads = 6) # number of CPU cores to use
# predict to the raster layers
pred_rng_wtd <- predict(predictors, rng_wtd,
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])
Fitting the down-sampling approach with the ranger package is not as straightforward as with the randomForest package. However, we can set the case weights and sampling fraction so that the model takes a roughly equal number of bootstrap samples from both classes (presence and background, i.e. 1 and 0) to fit each tree. This results in balanced data within the trees and produces results similar to down-sampling in the randomForest package (the sketch after the code below checks the per-tree class balance).
# down-sampling with the ranger package
# calculating the case weights (equal total weight per class)
prNum <- as.numeric(table(training$occ)["1"]) # number of presences
bgNum <- as.numeric(table(training$occ)["0"]) # number of background samples
casewts <- ifelse(training$occ == 1, 1, prNum / bgNum)
# fitting RF with ranger
rng_dws <- ranger(formula = occ ~ .,
                  data = training,
                  num.trees = 1000,
                  probability = TRUE, # fit a probability forest
                  sample.fraction = prNum / bgNum,
                  case.weights = casewts,
                  num.threads = 6) # number of CPU cores to use
# predict to the raster layers
pred_rng_dws <- predict(predictors, rng_dws,
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])
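To check that this case-weight trick does balance the per-tree bootstrap samples, you can refit a few trees with keep.inbag = TRUE and compare the total in-bag counts per class. A quick sketch:
# refit a handful of trees, keeping the per-tree bootstrap counts
rng_chk <- ranger(formula = occ ~ .,
                  data = training,
                  num.trees = 5,
                  probability = TRUE,
                  sample.fraction = prNum / bgNum,
                  case.weights = casewts,
                  keep.inbag = TRUE)
# one row per training sample, one column per tree
inbag <- simplify2array(rng_chk$inbag.counts)
# total in-bag draws per class; these should be roughly equal
tapply(rowSums(inbag), training$occ, sum)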
For fitting a shallow RF, we set max.depth = 2 and splitrule = "hellinger". This forces the model to fit shallow trees.
# fitting a shallow RF with ranger
rf_rng_shallow <- ranger(formula = occ ~ .,
                         data = training,
                         num.trees = 2000,
                         probability = TRUE, # fit a probability forest
                         splitrule = "hellinger",
                         max.depth = 2, # fit shallow trees
                         num.threads = 6) # number of CPU cores to use
# predict to the raster layers
pred_rng_shallow <- predict(predictors, rf_rng_shallow,
                            type = "response",
                            fun = function(model, ...) predict(model, ...)$predictions[,"1"])
Here we provide code for tuning the parameters of the shallow model in the ranger package. This is done by searching over a grid of parameters to find the best combination. There are other packages (e.g. caret and mlr) that do parameter tuning for different machine-learning models, but their implementations (at the time of publication) had limited options for tuning ranger, so we provide a tuning function here.
# function for tuning the parameters of the shallow model
tune_ranger <- function(data,
                        y = "occ", # name of the presence-background column
                        k = 5, # number of cross-validation folds
                        max.depth = 2:5, # this can be a vector of numbers
                        splitrule = c("gini", "extratrees", "hellinger"), # only allowed options
                        num.trees = 1000, # this can be a vector of numbers
                        mtry = NULL, # this can be a vector of numbers; NULL = default
                        threads = 4){ # number of CPU cores
  require(ranger)
  splitrule <- match.arg(splitrule)
  names(data)[which(names(data) == y)] <- "po"
  data$po <- as.factor(data$po)
  if(is.null(mtry)){
    mtry <- floor(sqrt(ncol(data) - 1))
  }
  grid <- expand.grid(max.depth = max.depth,
                      splitrule = splitrule,
                      num.trees = num.trees,
                      mtry = mtry,
                      stringsAsFactors = FALSE)
  # create balanced CV folds
  folds <- caret::createFolds(y = as.factor(data$po), k = k)
  evalmodel <- data.frame(depth = rep(NA, nrow(grid)), split = NA)
  for(i in seq_along(grid[,1])){
    modauc <- c()
    for(j in seq_along(folds)){ # j indexes the CV folds
      trainSet <- unlist(folds[-j])
      testSet <- unlist(folds[j])
      mod <- ranger::ranger(formula = po ~ .,
                            data = data[trainSet, ],
                            probability = TRUE,
                            num.trees = grid$num.trees[i],
                            splitrule = grid$splitrule[i],
                            max.depth = grid$max.depth[i],
                            mtry = grid$mtry[i],
                            num.threads = threads,
                            replace = TRUE)
      pred <- predict(mod, data[testSet, ], type = "response")$predictions[,"1"]
      modauc[j] <- precrec::auc(precrec::evalmod(scores = pred,
                                                 labels = data$po[testSet]))[1, 4]
    }
    # record the parameter combination and its mean CV AUC
    evalmodel$depth[i] <- grid$max.depth[i]
    evalmodel$split[i] <- grid$splitrule[i]
    evalmodel$ntrees[i] <- grid$num.trees[i]
    evalmodel$mtry[i] <- grid$mtry[i]
    evalmodel$aucme[i] <- mean(modauc)
  }
  # refit on the full data with the best parameter combination
  bestparam <- which.max(evalmodel$aucme)
  print(evalmodel[bestparam, ])
  finalmod <- ranger::ranger(formula = po ~ .,
                             data = data,
                             probability = TRUE,
                             num.trees = evalmodel$ntrees[bestparam],
                             splitrule = evalmodel$split[bestparam],
                             max.depth = evalmodel$depth[bestparam],
                             num.threads = threads,
                             replace = TRUE)
  return(finalmod)
}
# fitting the ranger-tuned model
rf_shallow_tuned <- tune_ranger(data = training,
                                y = "occ",
                                max.depth = 1:4,
                                splitrule = c("hellinger"),
                                num.trees = 2000,
                                threads = 6)
  depth     split ntrees mtry     aucme
4     4 hellinger   2000    2 0.9260329
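The tuned model is then predicted to the raster layers in the same way as the other ranger models; this creates the pred_shall_tune object used in the mapping code below:
# predict the tuned shallow model to the raster layers
pred_shall_tune <- predict(predictors, rf_shallow_tuned,
                           fun = function(model, ...) predict(model, ...)$predictions[,"1"])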
The spatial predictions of the different implementations of RF in the previous examples (Fig. S10) further illustrate the results. Here we only present the models we used in the main text (see Figs S11 and S12 for all methods in both packages). To help visualisation, the predictions are rescaled from 0 to 1, so if there are a few high predictions and the rest are low, the map will look pale (i.e., overfitting). It is the general spatial pattern of the predictions that is important. The default, weighted and regression RFs only predicted high values at the presence sites and locations close to them (the very small dots in their maps) and did not follow the general pattern of the virtual species' suitability, implying an extreme case of overfitting. On the other hand, equal-sample, down-sample and shallow closely follow the spatial pattern of the virtual species (the truth).
library(tmap)
library(scales)
# rescale and stack the predicted maps
maps <- list("default" = pred_def,
             "weighted" = pred_wtd,
             "regression" = pred_reg,
             "downsample" = pred_dws,
             "equal_sample" = pred_eql,
             "shallow_tuned" = pred_shall_tune,
             "shallow" = pred_rng_shallow,
             "truth" = virtual_sp) %>%
  lapply(function(x) calc(x, function(y) rescale(y, c(0, 1)))) %>%
  raster::stack()
# plot the maps
tm_shape(maps) +
  tm_raster(title = "Likelihood",
            palette = rev(terrain.colors(30)),
            style = "cont",
            breaks = c(0, 0.25, 0.5, 0.75, 1))
Figure S10: Spatial prediction of the different RF implementations on the virtual species (truth). The training data had 102 presences and 9775 random background points. For better comparison, the values in all maps were linearly rescaled to between 0 and 1.
Here, you can see the correlation coefficient of each predicted map with the virtual species (the truth).
# extracting the values of the truth
truth <- as.data.frame(maps, na.rm = TRUE) %>%
  pull(truth)
# calculate the correlation with the truth
as.data.frame(maps, na.rm = TRUE) %>%
  dplyr::select(-truth) %>%
  apply(2, function(x) cor(x, truth)) %>%
  round(3)
default weighted regression downsample equal_sample
0.355 0.334 0.409 0.942 0.928
shallow_tuned shallow
0.935 0.892
The spatial predictions of the models fitted with the randomForest package are presented in Fig. S11. To make the differences clear, we rescaled all the values to between 0 and 1 (for the predictions of both packages). Notice that we did not fit the shallow model with the randomForest package.
Figure S11: Spatial prediction of the different RF implementations with the randomForest package.
The spatial predictions of the models fitted with the ranger package are presented in Fig. S12. The truth is the same as in the figure above.
Figure S12: Spatial prediction of the different RF implementations with the ranger package.
As we discussed in the main text, RF (with default parameters) overfits the overlapping presence-background data by fitting overly deep trees. Sampling methods, i.e. down-sampling and equal-sampling, reduce the chance of class overlap by balancing the response data (presence and background samples) and thus avoid overfitting the training data. This should result in trees that are less deep than those of RF models with default parameters. To test whether the sampling methods fit shallower trees than the other variants (default classification-RF, regression-RF and weighted-RF), we calculated the number of terminal nodes (averaged over all trees of each RF model) for the models fitted with the randomForest package on all NCEAS species (Fig. S13). The number of terminal nodes is an indication of the depth of the trees, with a higher number of terminal nodes indicating deeper trees. We calculated the number of terminal nodes with the ck37r::rf_count_terminal_nodes() function.
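For the example models fitted earlier in this document, a similar quantity can be computed with randomForest::treesize(), which returns the number of terminal nodes of each tree. A quick sketch, similar in spirit to ck37r::rf_count_terminal_nodes():
# average number of terminal nodes per tree for the example models
mean(randomForest::treesize(rf_def)) # default classification RF
mean(randomForest::treesize(rf_dws)) # down-sampled RF
mean(randomForest::treesize(rf_reg)) # regression RF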
Figure S13: The average number of terminal nodes for the 212 species of the NCEAS dataset fitted with KDE background samples. For the sake of presentation, we removed 13 species that had more than 5000 terminal nodes in the regression model.
Fig. S13 shows a clear reduction in the average number of terminal nodes for down-sampling and equal-sampling compared to the other models. To further highlight the pairwise differences, we show the difference in the number of terminal nodes between the default model and each of the others (Fig. S14).
Figure S14: The difference in the number of terminal nodes from the default (classification-RF) model. Positive numbers show a higher number of terminal nodes than the default model and vice versa.
You can see in Fig. S14 and the following summary that the sampling methods always have a lower number of terminal nodes than the default classification-RF model. The weighted model had an equal or greater number of terminal nodes, and regression-RF always had a higher average number of terminal nodes.
   regression         weighted          downsample        equalsample
 Min.   :  82.44   Min.   :  0.528   Min.   :-858.31   Min.   :-924.01
 1st Qu.: 326.39   1st Qu.: 18.733   1st Qu.:-189.28   1st Qu.:-200.45
 Median : 550.34   Median : 33.760   Median : -84.22   Median : -88.59
 Mean   : 766.92   Mean   : 53.939   Mean   :-146.70   Mean   :-155.00
 3rd Qu.: 947.86   3rd Qu.: 71.775   3rd Qu.: -47.58   3rd Qu.: -49.36
 Max.   :2709.49   Max.   :265.466   Max.   : -12.43   Max.   : -12.78
Here is the summary of the number of terminal nodes in the default RF model (for ease of comparison).
default
Min. : 15.19
1st Qu.: 59.46
Median : 105.75
Mean : 194.93
3rd Qu.: 247.03
Max. :1210.87
Albon, C. (2018) Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning. O'Reilly Media, Inc.
Barbet-Massin, M., Jiguet, F., Albert, C.H. and Thuiller, W. (2012) Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3, 327–338.
Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A. (1984) Classification and Regression Trees. CRC Press.
Elith, J., Graham, C.H., Valavi, R., Abegg, M., Bruce, C., Ferrier, S., Ford, A., Guisan, A., Hijmans, R.J., Huettmann, F., Lohmann, L., Loiselle, B., Moritz, C., Overton, J., Peterson, A.T., Phillips, S., Richardson, K., Williams, S.E., Wiser, S.K., Wohlgemuth, T. and Zimmermann, N.E. (2020) Presence-only and presence-absence data for comparing species distribution modeling methods. Biodiversity Informatics, 15, 69–80.
Evans, J.S. (2020) spatialEco. R package version 1.3-1. URL: https://github.com/jeffreyevans/spatialEco.
Kujala, H., Whitehead, A.L. and Wintle, B.A. (2015) Identifying conservation priorities and assessing impacts and trade‐offs of potential future development in the Lower Hunter Valley in New South Wales. 106.
Kuhn, M. and Johnson, K. (2013) Applied predictive modeling, Springer.
Liaw, A. and Wiener, M. (2002) Classification and regression by randomForest. R News, 2, 18–22.
Phillips, S.J., Dudík, M., Elith, J., Graham, C.H., Lehmann, A., Leathwick, J. and Ferrier, S. (2009) Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications, 19, 181–197.
Probst, P., Boulesteix, A. and Bischl, B. (2019a) Tunability: Importance of Hyperparameters of Machine Learning Algorithms. Journal of Machine Learning Research, 20, 1–32.
Probst, P., Wright, M.N. and Boulesteix, A. (2019b) Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9.
Wright, M.N. and Ziegler, A. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77.
Zeileis, A., Hothorn, T. and Hornik, K. (2006) party: A Laboratory for Recursive Part(y)itioning. 16.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".