Modelling species presence-only data with random forests. Ecography.

Supporting information

Roozbeh Valavi (https://github.com/rvalavi), Jane Elith, José J. Lahoz-Monfort, Gurutzeta Guillera-Arroita (The University of Melbourne, Australia; https://ecosystemforest.unimelb.edu.au/)
2021-09-21

Supporting information for Valavi, R., Elith, J., Lahoz-Monfort, J., Guillera-Arroita, G. (2021) Modelling species presence-only data with random forests. Ecography, 44: 1–12. DOI: 10.1111/ecog.05615

Generating background samples

To account for the known sampling bias in the NCEAS dataset (Phillips et al., 2009), we generated two types of background samples. The first is the target-group background (TGB) introduced by Phillips et al. (2009). In this method, the presence sites of all species in a biological group (including the target species itself) are used as the background samples for each species in that group. The idea is that the full set of sites for the biological group indicates sampling effort across the region, and hence informs the model about the sampling effort underlying the data for any one species. Consistent with the presence samples, the TGB data are reduced to one record per grid cell. For details, see Phillips et al. (2009) and the vignette "Modeling NCEAS data" in the disdat package (Elith et al., 2020). The results for these background samples are presented in the next section.
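As an illustration, here is a minimal sketch of how TGB samples could be assembled for one region from the disdat presence-only data; the group column and the deduplication by grid cell follow the description above, but this is a sketch rather than the exact code used for the paper.

# a minimal sketch of building TGB samples for one biological group
# (assumes disdat::disPo() returns a `group` column; see the disdat vignette)
library(disdat)
library(raster)

po <- disPo("SWI") # presence-only records for the Switzerland region
tgb <- po[po$group == "tree", c("x", "y")] # all presences of the group

# reduce to one record per grid cell, consistent with the presence samples
rs <- raster::raster("background_samples/tgbs_mask/SWI.tif") # region mask
tgb <- tgb[!duplicated(raster::cellFromXY(rs, tgb)), ]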

As the number of TGB samples is not very high (see next section), we implemented a second approach to keep a ratio of background to presence samples similar to that used for the virtual species. In this method, 50,000 background samples are drawn from a bias layer. The bias layer is generated by a kernel density estimation (KDE) of the TGB samples, following the approach implemented in Kujala et al. (2015). For the bandwidth of the KDE, we used 1/5 of the largest extent of each region. The KDE is fitted with the spatialEco R package (Evans, 2020). The R code below shows how we generated the samples. Both the KDE and TGB samples, and the code to fit the models on the NCEAS dataset, are provided in the supplementary materials archived on the Open Science Framework (OSF).

# loading required libraries
library(spatialEco)
library(dismo)
library(disdat)
library(sf)

# read TGB data
tgbsample <- read.csv("background_samples/tgbs/swi_tree.bak.csv")
# convert to spatial object
tgbsp <- sf::st_as_sf(tgbsample, coords = c("x", "y"), crs = disdat::disCRS("SWI"))
plot(tgbsp)

# read a raster mask for the region
rs <- raster::raster("background_samples/tgbs_mask/SWI.tif")

# get the largest extent of the region
yext <- (raster::extent(rs)[4] - raster::extent(rs)[3]) / 5 # y range
xext <- (raster::extent(rs)[2] - raster::extent(rs)[1]) / 5 # x range
max_ext <- max(c(yext, xext))

tgb_kde <- spatialEco::sp.kde(x = sf::as_Spatial(tgbsp),
                              bw = round(max_ext, 2),
                              newdata = rs,
                              standardize = TRUE,
                              scale.factor = 10000)
plot(tgb_kde)

# generate 50k random samples from the KDE raster file
kde_samples <- dismo::randomPoints(tgb_kde, 
                                   n = 50000, 
                                   prob = TRUE)


Results for TGB

The performance of the models fitted with NCEAS presence-TGB samples is shown in Fig. S1. The performance is similar to that of the models fitted on the KDE samples (presented in the main text). In both evaluation metrics, the default classification RF performed relatively poorly in all regions (except SWI), followed by regression RF and weighted RF. The SWI dataset gained the highest overall AUC; this could be because SWI has more occurrence points (Elith et al., 2020). Weighted RF has a higher COR than the default and regression RF in both the SA and AWT regions. The down-sample, equal-sample and both shallow RFs achieved higher AUC and COR for all regions. Overall, the AUC and COR for TGB are higher than those for the KDE samples.

As we discussed in the main paper, the problem for RF is not only the high number of background samples, but also the environmental overlap between the presence and background/absence sites. This is evident in the results of the models fitted on the TGB samples. The ratio of presences to TGB samples is much closer to balanced than the ratio of presences to the 50,000 KDE samples (see the number of TGB samples below). Even though the presence-TGB samples are not extremely imbalanced, we still see improvement from the sampling approaches (equal-sample and down-sample) and the shallow methods.

Number of TGB samples:
 [1]   147   379   530   569   714  1221  1351  2503  3298 11429

Figure 1: Performance of different RF models on different regions of NCEAS data with TGB. The values in cells are AUC and COR for the left and right figures respectively. Colour varies with values, yellow (low) to dark navy (high).

Statistical test on TGB results

The Friedman's Aligned Rank test was highly significant, indicating statistical differences between the models in both metrics. The statistical significance of pairwise model comparisons, based on adjusted p-values, is visualised in Fig. S2. This figure shows that: 1) shallow, shallow-tuned, down-sample and equal-sample are not significantly different; 2) all four of these models are significantly better than the default, weighted and regression RF models; and 3) the default, weighted and regression RF are likewise not significantly different from each other.
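For reference, the test printed below can be reproduced along these lines (a minimal sketch assuming auc_wide is a species-by-model matrix of AUC values; the aligned-rank variant and the post-hoc adjusted p-values were computed with specialised packages):

# a minimal sketch, assuming auc_wide is a species-by-model matrix of AUCs
# (rows = species/blocks, columns = models/groups)
friedman.test(as.matrix(auc_wide))

# post-hoc pairwise comparisons with adjusted p-values can be obtained from
# specialised packages; one hypothetical option:
# PMCMRplus::frdAllPairsNemenyiTest(as.matrix(auc_wide))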


    Friedman's rank sum test

data:  auc_wide
Friedman's chi-squared = 298.52, df = 6, p-value < 2.2e-16

Figure 2: Statistical significance of AUC for different models fitted on presence-TGB data.

Each box in Fig. S3 indicates a model, and connected boxes are those with no statistically significant differences. The green box shows the model with the highest rank in this metric and dataset.


    Friedman's rank sum test

data:  auc_wide
Friedman's chi-squared = 91.989, df = 6, p-value < 2.2e-16

Figure 3: Statistical significance of COR for different models fitted on presence-TGB data.

Notice that only the models with a direct link are not statistically different (in that specific metric and dataset). Although all models are connected in Fig. S3, not all of them are directly connected. Fig. S3 shows that the differences in the ranks of equal-sample, down-sample, shallow and shallow-tuned are statistically insignificant. The rank of shallow is not significantly different from weighted, but the ranks of the other three (equal-sample, down-sample and shallow-tuned) are. The regression RF and default are not statistically different from the weighted classification RF, but they are different from the shallow RFs, equal-sample and down-sample. All of the following figures can be interpreted in the same way.

Statistical comparison of models

Ranking and statistical tests for KDE


    Friedman's rank sum test

data:  auc_wide
Friedman's chi-squared = 431.72, df = 6, p-value < 2.2e-16

Figure 4: Statistical significance of AUC for different models fitted on presence-KDE data.


    Friedman's rank sum test

data:  auc_wide
Friedman's chi-squared = 243.89, df = 6, p-value < 2.2e-16

Figure 5: Statistical significance of COR for different models fitted on presence-KDE data.


Figure 6: The average rank of AUC vs COR for the models fitted on presence-KDE data. The x and y axes are reversed so that the lowest ranks (better models) appear in the top-right corner.

Fig. S6 shows the mean rank of the models on the presence-KDE dataset. The down-sample model has the lowest rank (the best rank; top-right corner) in both AUC and COR.

Ranking and statistical test for simulated species


    Friedman's rank sum test

data:  auc_wide
Friedman's chi-squared = 504.08, df = 6, p-value < 2.2e-16

Figure 7: Statistical significance of AUC for different models fitted to simulated species.


Figure 8: The average rank of AUC vs COR for the models fitted to the simulated species. The x and y axes are reversed so that the lowest ranks (better models) appear in the top-right corner.

Prediction variability: simulated species

For the simulated species, to test whether there is variability in the predictions of each modelling approach, we fitted each model 10 times on the 100 virtual species and took the standard deviation of AUC and COR (Fig. S9). The shallow and down-sample models showed the lowest variability in their AUC and COR. In contrast, equal-sample and shallow-tuned had the highest variability of all. The variability of equal-sample arises because it does not take many background samples, so the model produces a slightly different result each time. Shallow-tuned uses random cross-validation to tune its parameters, so a slightly different set of parameters can be selected each time. Note that the level of variability is relatively small overall.


Figure 9: Standard deviation of AUC and COR for ten runs on each virtual species.
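A minimal sketch of this procedure for a single species and model follows; it borrows the training data and down-sample settings from the "Model implementation" section below, and test_auc is a hypothetical helper that evaluates a fitted model on held-out data.

# minimal sketch: refit one model 10 times and measure prediction variability
# (`training` and `spsize` come from the examples below; `test_auc` is a
#  hypothetical helper returning the AUC of a model on evaluation data)
aucs <- sapply(seq_len(10), function(i) {
  mod <- randomForest::randomForest(occ ~ .,
                                    data = training, # occ must be a factor here
                                    ntree = 1000,
                                    sampsize = spsize)
  test_auc(mod) # hypothetical evaluation step
})
sd(aucs) # the standard deviation plotted in Fig. S9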

Model implementation

We implemented six strategies for modelling presence-background data with RF (plus the default classification RF model). Here, we provide code examples showing how we chose the parameters.

This example is run on a virtual species created from raster predictors provided in the dismo R package, available from CRAN. This is different to the Australian virtual species used in the main manuscript. Note that the background sample in this example is only 10,000 random background points (which here amounts to all the cells available in the raster layers). This is enough to demonstrate how the models are fitted, and can be repeated by readers (for the complete code of our analysis, see the code archived on OSF).

Generating a virtual species

# loading required packages
library(dismo)
library(dplyr)
library(virtualspecies)

# reading raster files
fnames <- list.files(path = paste(system.file(package = "dismo"), "/ex", sep = ""),
                     pattern = "grd", 
                     full.names = TRUE )[-9] # remove last covariate
predictors <- raster::stack(fnames)

# generate a virtual species
virtual_sp <- generateSpFromPCA(raster.stack = predictors,
                                means = c(-3, -1),
                                sds = c(0.6, 1.5), 
                                plot = FALSE)
virtual_sp <- virtual_sp$suitab.raster
plot(virtual_sp)


# sample some points for generating presence samples
set.seed(100)
rnd_points <- randomPoints(virtual_sp, 150, prob = TRUE)

# generate presence samples with binomial trials using
# suitability of virtual species as success rate
set.seed(100)
occurence <- rnd_points %>% 
  as.data.frame() %>%
  mutate(prob = raster::extract(virtual_sp, .),
         occ = rbinom(nrow(.), 1, prob = prob)) %>% 
  dplyr::filter(occ == 1) %>% 
  dplyr::select(x, y)

# plot occurrence points
plot(virtual_sp)     
points(occurence$x, occurence$y)
# generating background points
bg <- randomPoints(predictors, n = 10000) %>% 
  as.data.frame()

# extract raster values and create training data
training <-  bind_rows(occurence, bg) %>% # combine occurrence and background
  raster::extract(predictors, .) %>% # extract values of raster covariates
  as.data.frame() %>% 
  mutate(occ = c(rep(1, nrow(occurence)), rep(0, nrow(bg)))) %>% # create 0 and 1 response
  tidyr::drop_na() # drop the NA values

# print the first few rows and columns
head(training)
  bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 occ
1  267  2351   961   244  331  213  118  264   1
2  255  3416  1491   194  330  190  139  248   1
3  257  2782  1302   161  335  199  136  245   1
4  266  2090   982   156  326  214  112  262   1
5  248  2911  1389   148  327  184  144  238   1
6  263  2402   857   270  319  212  107  260   1

Model fitting with the randomForest package:

Four of the strategies presented in the main text, plus the default classification RF, are implemented with the randomForest package (Liaw & Wiener, 2002). These are the down-sample, equal-sample, weighted and regression RF. The other two solutions, namely shallow and shallow-tuned, are fitted with the ranger package in the results of the main text, because ranger provides better options (namely the Hellinger distance splitting criterion, and direct control over tree depth). Thus we do not fit shallow RF with the randomForest package.

For consistency, we use 1000 trees in all the methods implemented in this package.

The default (classification) RF

RF fits 500 trees by default. Here we used 1000 trees to be consistent with the other methods. Although RF is not very sensitive to increasing the ntree parameter (Probst et al., 2019a), a high number of trees is recommended to achieve more stable predictions and variable importance estimates (Probst et al., 2019b).

# loading the library
library(randomForest)

# convert response to factor for classification
training$occ <- as.factor(training$occ)

# default classification RF
rf_def <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000) # number of trees

# predict to raster layers
pred_def <-  predict(predictors, rf_def, type = "prob", index = 2)

Weighted RF

For the weighted classification RF, we calculated weights that are inversely proportional to the class sizes (the number of presence and background samples) (Albon, 2018). This creates one weight for each class, also known as class weights.

As an aside, weighting can work differently in different software implementations of RF. For instance, weights in the RF implemented in the R package ranger (Wright & Ziegler, 2017) and the cforest model from the party package (Zeileis et al., 2006) operate by changing the proportion of each class in the bootstrap samples, i.e., over/under-sampling of the classes. In the randomForest package, weights (costs) are instead applied to the Gini index (denoted as \(Gini_w\) here) as follows (Breiman et al., 1984; Kuhn & Johnson, 2013):

\[ Gini_w = C(0|1)p_0 (1- p_0 ) + C(1|0) p_1 (1- p_1 )=[C(0|1)+C(1|0)] p_0 p_1, \]

where \(p_0\) and \(p_1\) are the proportion of background and presence classes in the partition respectively, \(C(0|1)\) is the cost of mistakenly predicting a class 1 (presence) sample as class 0 (background), and \(C(1|0)\) is the opposite.

# calculate the weights
cwt <- 1 / prop.table(table(training$occ)) 

# weighted classification RF
rf_wtd <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000,
                       classwt = cwt)

# predict to raster layers
pred_wtd <- predict(predictors, rf_wtd, type = "prob", index = 2)

Equal-sample RF

For the equal-sample approach (Barbet-Massin et al., 2012), the model is fitted on 10 datasets, each with the same number of presence and background samples, and the predictions are averaged. The same procedure can also be done with the ranger package by replacing the model fitting and prediction code.

# equal number of present and background with 10 repeats
prNum <- as.numeric(table(training$occ)["1"]) # number of presences

rf_eql_models <- list()
for(t in seq_len(10)){
  eqsamp <- c(which(training$occ == 1), sample(which(training$occ == 0), prNum))
  rf_eql <- randomForest(occ ~ .,
                         data = training[eqsamp,],
                         ntree = 1000)
  rf_eql_models[[t]] <- rf_eql
}

# predict to raster layers with the 10 fitted models and aggregate the predictions
pred_eql <- lapply(rf_eql_models, function(x) predict(predictors, x, type = "prob", index = 2)) %>% 
  raster::stack() %>% 
  mean()

Down-sampled RF

The down-sampling method uses the same number of presence and background samples in the bootstrap sample of each tree. This is done through the sampsize argument. This argument accepts a named vector containing the number of samples to be taken from each class (presences and background samples are the classes here) when generating each tree. Although each tree takes equal numbers of presence and background samples, the final forest will still have seen many background samples, as each tree takes a different random sample.

# calculate sub-samples
prNum <- as.numeric(table(training$occ)["1"]) # number of presence records
spsize <- c("0" = prNum, "1" = prNum) # sample size for both classes

# RF with down-sampling
rf_dws <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000,
                       sampsize = spsize,
                       replace = TRUE) # make sure samples are with replacement (default)

# predict to raster layers
pred_dws <-  predict(predictors, rf_dws, type = "prob", index = 2)

Regression RF

To fit RF with regression trees (regression-RF), the response variable (the 0s and 1s of presence-background) should be numeric rather than categorical (factor). The response (occ) was converted to a factor earlier for fitting the classification RF, so we convert it back to numeric to fit a regression RF model.

# convert back to numeric for regression model
training$occ <- as.numeric(as.character(training$occ))

# regression RF
rf_reg <- randomForest(occ ~ .,
                       data = training,
                       ntree = 1000)

# predict to raster layers
pred_reg  <- predict(predictors, rf_reg)

Model fitting with the ranger package:

For our study, we only fitted the shallow and shallow-tuned models with the ranger package (Wright & Ziegler, 2017). However, all the other methods can also be fitted with ranger. Here we present the code for modelling the virtual species data with this package. In the un-tuned version of the shallow model (i.e., shallow-RF), we use a constant tree depth of 2 and the Hellinger distance splitting criterion. We do not repeat the equal-sample approach in detail; readers should be able to reproduce it easily by adapting the code from the previous examples (a minimal sketch follows the classification example below).

Regression RF

Here we fit a regression RF model with the ranger package. To speed up the computation, you can also specify the number of CPU cores for parallel processing with the num.threads argument.

# loading the library
library(ranger)

# fit a regression RF with the ranger package
rng_reg <- ranger(formula = occ ~ .,
                  data = training, 
                  num.trees = 1000,
                  num.threads = 6) # number of CPU cores to run

# predict to raster layers
pred_rng_reg <- predict(predictors, rng_reg,
                        fun = function(model, ...) predict(model, ...)$predictions)

Notice that the prediction of a ranger model returns a list of different information, including the predictions. So, for prediction on raster layers, you need to provide a function (as presented above) that extracts the correct element inside the predict function of the raster package.
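To see what this object looks like, you can inspect a prediction directly (a quick illustration):

# the ranger prediction object stores the predicted values in its
# `predictions` element, alongside other information about the forest
str(predict(rng_reg, data = head(training)))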

The default (classification) RF

For fitting an RF with classification trees, you need to change the response column to a factor variable. You also need to set the probability argument to TRUE to get probability predictions (note that with presence-background data the output is interpreted as a relative likelihood rather than a probability).

# convert response to factor for classification
training$occ <- as.factor(training$occ)

# fitting a default classification RF in ranger
rng_def <- ranger(formula = occ ~ .,
                  data = training, 
                  num.trees = 1000,
                  probability = TRUE, # fit a probability forest
                  num.threads = 6) # number of CPU cores to run

# predict to raster layers
pred_rng_def <- predict(predictors, rng_def,
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])
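
As noted earlier, the equal-sample approach can also be fitted with ranger. Here is a minimal sketch mirroring the randomForest loop above (it assumes training$occ is a factor, as converted above, and reuses the packages loaded earlier):

# a minimal sketch of equal-sample RF with ranger (occ must be a factor)
prNum <- as.numeric(table(training$occ)["1"]) # number of presences

rng_eql_models <- list()
for (t in seq_len(10)) {
  eqsamp <- c(which(training$occ == 1), sample(which(training$occ == 0), prNum))
  rng_eql_models[[t]] <- ranger(formula = occ ~ .,
                                data = training[eqsamp, ],
                                num.trees = 1000,
                                probability = TRUE)
}

# average the predictions of the 10 fitted models
pred_rng_eql <- lapply(rng_eql_models,
                       function(x) predict(predictors, x,
                                           fun = function(model, ...)
                                             predict(model, ...)$predictions[, "1"])) %>%
  raster::stack() %>%
  mean()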

Weighted RF

For fitting a weighted RF, we use the same weights as applied in the randomForest package.

# calculate the weights
cwt <- 1 / prop.table(table(training$occ)) 

# fitting a weighted RF in ranger
rng_wtd <- ranger(formula = occ ~ .,
                  data = training, 
                  num.trees = 1000, 
                  probability = TRUE, # fit a probability forest
                  class.weights = cwt,
                  num.threads = 6) # number of CPU cores to run

# predict to raster layers
pred_rng_wtd <- predict(predictors, rng_wtd,
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])

Down-sampled RF

Fitting a down-sampling approach in the ranger package is not as straightforward as in the randomForest package. However, we can set the weights such that the model takes a roughly equal number of bootstrap samples from both classes (presence and background, i.e., 1 and 0) when fitting each tree. This results in balanced data within trees and produces results similar to down-sampling in the randomForest package.

# down-sampling with ranger package
# calculating the case weights (equal weights)
prNum <- as.numeric(table(training$occ)["1"]) # number of presences
bgNum <- as.numeric(table(training$occ)["0"]) # number of backgrounds
casewts <- ifelse(training$occ == 1, 1, prNum / bgNum)

# fitting RF with ranger
rng_dws <- ranger(formula = occ ~ .,
                  data = training, 
                  num.trees = 1000,
                  probability = TRUE, # fit a probability forest
                  sample.fraction = prNum / bgNum,
                  case.weights = casewts,
                  num.threads = 6) # number of CPU cores to run

# predict to raster layers
pred_rng_dws <- predict(predictors, rng_dws,
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])

RF with shallow trees

For fitting a shallow RF, we set max.depth = 2 and splitrule = "hellinger". This forces the model to fit shallow trees.

# fitting shallow RF with ranger
rf_rng_shallow <- ranger(formula = occ ~ .,
                         data = training, 
                         num.trees = 2000,
                         probability = TRUE, # fit a probability forest
                         splitrule = "hellinger",
                         max.depth = 2, # fit shallow trees
                         num.threads = 6) # number of CPU cores to run

# predict to raster layers
pred_rng_shallow <- predict(predictors, rf_rng_shallow, 
                        type = "response",
                        fun = function(model, ...) predict(model, ...)$predictions[,"1"])

RF with tuned shallow trees (shallow-tuned)

Here we provide code for tuning the parameters of the shallow model in the ranger package. This is done by searching over a grid of parameters to find the best combination. There are other packages (e.g. caret and mlr) that do parameter tuning for different machine learning models. At the time of publication, however, their implementations of parameter tuning for ranger had limited options, so we provide a tuning function here.

# function for tuning the parameters of shallow model
tune_ranger <- function(data, 
                        y = "occ", # name of presence-background column
                        k = 5, # number of cross-validation folds
                        max.depth = 2:5, # this can be a vector of numbers
                        splitrule = c("gini", "extratrees", "hellinger"), # only allowed options
                        num.trees = 1000, # this can be a vector of numbers
                        mtry = NULL, # this can be a vector of numbers; NULL = default 
                        threads = 4){ # number of CPU cores
  require(ranger)
  splitrule <- match.arg(splitrule, several.ok = TRUE) # allow a vector of split rules in the grid
  names(data)[which(names(data) == y)] <- "po"
  data$po <- as.factor(data$po)
  if(is.null(mtry)){
    mtry <- floor(sqrt(ncol(data) - 1))
  }
  grid <- expand.grid(max.depth = max.depth, 
                      splitrule = splitrule,
                      num.trees = num.trees,
                      mtry = mtry,
                      stringsAsFactors = FALSE)
  # create balanced CV folds
  folds <- caret::createFolds(y = as.factor(data$po), k = k)
  evalmodel <- data.frame(depth = rep(NA, nrow(grid)), split = NA,
                          ntrees = NA, mtry = NA, aucme = NA)
  iteration <- nrow(grid)
  # pb <- progress::progress_bar$new(format = "Progress [:bar] :percent in :elapsed",
  #                                  total = iteration, clear = FALSE, width = 75)
  for(i in seq_along(grid[,1])){
    modauc <- c()
    for(j in seq_along(folds)){ # j indexes the CV folds (k is the number of folds)
      trainSet <- unlist(folds[-j])
      testSet <- unlist(folds[j])
      mod <- ranger::ranger(formula = po ~ .,
                            data = data[trainSet, ], 
                            probability = TRUE,
                            num.trees = grid$num.trees[i],
                            splitrule = grid$splitrule[i],
                            max.depth = grid$max.depth[i],
                            mtry = grid$mtry[i],
                            num.threads = threads,
                            replace = TRUE)
      pred <- predict(mod, data[testSet, ], type = "response")$predictions[,"1"]
      modauc[j] <- precrec::auc(precrec::evalmod(scores = pred, 
                                                 labels = data$po[testSet]))[1,4]
    }
    evalmodel$depth[i] <- grid$max.depth[i]
    evalmodel$split[i] <- grid$splitrule[i]
    evalmodel$ntrees[i] <- grid$num.trees[i]
    evalmodel$mtry[i] <- grid$mtry[i]
    evalmodel$aucme[i] <- mean(modauc)
    # evalmodel$aucse[i] <- sd(modauc) / sqrt(5)
    # pb$tick()
  }
  bestparam <- which.max(evalmodel$aucme)
  print(evalmodel[bestparam, ])
  finalmod <- ranger::ranger(formula = po ~ .,
                             data = data, 
                             probability = TRUE,
                             num.trees = evalmodel$ntrees[bestparam],
                             splitrule = evalmodel$split[bestparam],
                             max.depth = evalmodel$depth[bestparam],
                             mtry = evalmodel$mtry[bestparam], # use the tuned mtry as well
                             num.threads = threads,
                             replace = TRUE)
  return(finalmod)
}

# fitting ranger-tuned model
rf_shallow_tuned <- tune_ranger(data = training,
                                y = "occ",
                                max.depth = 1:4,
                                splitrule = c("hellinger"),
                                num.trees = 2000,
                                threads = 6)
  depth     split ntrees mtry     aucme
4     4 hellinger   2000    2 0.9260329
# predict to raster layers
pred_shall_tune <- predict(predictors, rf_shallow_tuned, 
                           type = "response",
                           fun = function(model, ...) predict(model, ...)$predictions[,"1"])

Spatial prediction

The spatial predictions of the different RF implementations from the previous examples (Fig. S10) further illustrate the results. Here we only present the models we used in the main text (see Fig. S11 and S12 for all methods in both packages). To help visualisation, the predictions are rescaled from 0 to 1, so if there are a few high predictions and the rest are low, the map will look pale (i.e., overfitting). It is the general spatial pattern of the predictions that is important. The default, weighted and regression RFs only predicted high values at the presence sites and locations close to them (the very small dots in their maps) and did not follow the general pattern of the virtual species' suitability, implying an extreme case of overfitting. In contrast, equal-sample, down-sample and shallow closely follow the spatial pattern of the virtual species (the truth).

library(tmap)
library(scales)

# rescale and stack the predicted maps
maps <- list("default" = pred_def,
             "weighted" = pred_wtd,
             "regression" = pred_reg,
             "downsample" = pred_dws,
             "equal_sample" = pred_eql,
             "shallow_tuned" = pred_shall_tune, 
             "shallow" = pred_rng_shallow, 
             "truth" = virtual_sp) %>% 
  lapply(function(x) calc(x, function(y) rescale(y, c(0,1)))) %>% 
  raster::stack()

# plot the maps
tm_shape(maps) +
  tm_raster(title = "Likelihood", 
            palette = rev(terrain.colors(30)), 
            style = "cont",
            breaks = c(0, 0.25, 0.5, 0.75, 1))

Figure 10: Spatial prediction of different RF implementations on the virtual species (truth). Training data had 102 presences and 9775 random background points. For better comparison, the values in all maps were linearly rescaled between 0 and 1.

Here, you can see the correlation coefficient of each predicted map with the virtual species (truth).

# extracting values for truth
truth <- as.data.frame(maps, na.rm = TRUE) %>% 
  pull(truth)
# calculate correlation with truth
as.data.frame(maps, na.rm = TRUE) %>% 
  dplyr::select(- truth) %>% 
  apply(2, function(x) cor(x, truth)) %>% 
  round(3)
      default      weighted    regression    downsample  equal_sample 
        0.355         0.334         0.409         0.942         0.928 
shallow_tuned       shallow 
        0.935         0.892 

Comparing the plots from the two packages

The spatial predictions of the models fitted with the randomForest package are presented in Fig. S11. To make the differences clear, we rescaled all the values between 0 and 1 (for the predictions of both packages). Notice that we did not fit the shallow models with the randomForest package.


Figure 11: Spatial prediction of different RF implementations with randomForest package.

The spatial predictions of the models fitted with the ranger package are presented in Fig. S12. The truth map is the same as in the figure above.


Figure 12: Spatial prediction of different RF implementations with ranger package.

Tree depth in different models

As discussed in the main text, RF (with default parameters) overfits the overlapping presence-background data by fitting overly deep trees. The sampling methods, i.e., down-sampling and equal-sampling, reduce the chance of class overlap by balancing the response data (presence and background samples) and thus avoid overfitting the training data. This should result in trees that are less deep than those of RF models with default parameters. To test whether the sampling methods fit shallower trees than the other variants (default classification-RF, regression-RF and weighted-RF), we calculated the number of terminal nodes (averaged over all trees of each RF model) for the models fitted with the randomForest package on all NCEAS species (Fig. S13). The number of terminal nodes is an indication of tree depth, with a higher number of terminal nodes indicating deeper trees. We calculated the number of terminal nodes with the ck37r::rf_count_terminal_nodes() function.
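For example, the per-model averages can be computed along these lines (a minimal sketch using models fitted in the earlier examples; it assumes rf_count_terminal_nodes() returns one terminal-node count per tree of a randomForest model):

# a minimal sketch of the terminal-node calculation for two earlier models
# (assumes rf_count_terminal_nodes() returns a count per tree)
library(ck37r)

mean(rf_count_terminal_nodes(rf_def)) # default classification RF
mean(rf_count_terminal_nodes(rf_dws)) # down-sampled RF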


Figure 13: The average number of terminal nodes for 212 species of the NCEAS dataset fitted with KDE background samples. For the sake of presentation, we removed 13 species that had more than 5000 terminal nodes in the regression model.

Fig. S13 shows a clear reduction in the average number of terminal nodes for down-sampling and equal-sampling compared to the other models. To further highlight pairwise differences, we show the difference in the number of terminal nodes between the default model and each of the others (Fig. S14).


Figure 14: The difference in the number of terminal nodes relative to the default (classification-RF) model. Positive numbers indicate a higher number of terminal nodes than the default model, and vice versa.

You can see in Fig. S14 and the following summary that the sampling methods always have a lower number of terminal nodes than the default classification-RF model. The weighted model had an equal or higher number of terminal nodes, and regression-RF always had a higher average number of terminal nodes.

   regression         weighted         downsample     
 Min.   :  82.44   Min.   :  0.528   Min.   :-858.31  
 1st Qu.: 326.39   1st Qu.: 18.733   1st Qu.:-189.28  
 Median : 550.34   Median : 33.760   Median : -84.22  
 Mean   : 766.92   Mean   : 53.939   Mean   :-146.70  
 3rd Qu.: 947.86   3rd Qu.: 71.775   3rd Qu.: -47.58  
 Max.   :2709.49   Max.   :265.466   Max.   : -12.43  
  equalsample     
 Min.   :-924.01  
 1st Qu.:-200.45  
 Median : -88.59  
 Mean   :-155.00  
 3rd Qu.: -49.36  
 Max.   : -12.78  

Here is the summary of the number of terminal nodes in the default RF model (for ease of comparison).

    default       
 Min.   :  15.19  
 1st Qu.:  59.46  
 Median : 105.75  
 Mean   : 194.93  
 3rd Qu.: 247.03  
 Max.   :1210.87  

