Calculate Shapley value-based variable dependence.

Calculate how a species responds to environmental variables using Shapley values.

Usage

shap_dependence(
  model,
  var_occ,
  variables,
  si = 1000,
  shap_nsim = 100,
  visualize = FALSE,
  seed = 10,
  pfun = .pfun_shap
)

Arguments

model: (isolation_forest or other model) The SDM. It can be the item model of the POIsotree object made by function isotree_po, or any other user-fitted model as long as pfun works on it.
var_occ: (data.frame, tibble) The data.frame-style table that includes values of environmental variables at occurrence locations.
variables: (stars) The stars of environmental variables. It should have multiple attributes instead of dims. If you have a raster object instead, you can use st_as_stars to convert it to stars, or use read_stars to read the source data directly as stars. You can also use the item variables of the POIsotree object made by function isotree_po.
si: (integer) The number of samples to generate response curves. If it is too small, the response curves might be biased. The default value is 1000.
shap_nsim: (integer) The number of Monte Carlo repetitions the SHAP method uses to estimate each Shapley value. When the number of variables is large, a smaller shap_nsim could be used. See details in the documentation of function explain in package fastshap. The default is 100.
visualize: (logical) If TRUE, plot the variable dependence plots. The default is FALSE.
seed: (integer) The seed for any random process. The default is 10.
pfun: (function) The prediction function, which must take two arguments, object and newdata. It is only required when model is not an isolation_forest. The default is the wrapper function designed for the iForest model in itsdm.
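
As an illustration of the pfun argument, here is a minimal sketch of a prediction wrapper for a user-fitted model, using only base R. The data are simulated, and the names train, mod_glm, and pfun_glm are hypothetical and not part of itsdm. It assumes that the wrapper should return one numeric prediction per row of newdata.

```r
set.seed(10)

# Simulated training table (assumption): one hypothetical variable
# and a 0/1 response
train <- data.frame(bio1 = runif(200, 0, 30))
train$occ <- rbinom(200, 1, plogis(0.3 * (train$bio1 - 15)))

# A user-fitted model that is not an isolation_forest
mod_glm <- glm(occ ~ bio1, data = train, family = binomial())

# Prediction wrapper with the two arguments, object and newdata;
# it returns one numeric prediction per row of newdata
pfun_glm <- function(object, newdata) {
  as.numeric(predict(object, newdata = newdata, type = "response"))
}

# Check the wrapper on the training variables
preds <- pfun_glm(mod_glm, train["bio1"])
head(preds)
```

A wrapper of this shape would be passed as pfun = pfun_glm together with model = mod_glm.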

Value

(ShapDependence) A list of:

dependences_cont: (list) A list of Shapley values of continuous variables
dependences_cat: (list) A list of Shapley values of categorical variables
feature_values: (data.frame) A table of feature values

Details

The values show how each environmental variable independently affects the modeling prediction. They show how the Shapley value of each variable changes as its value is varied.
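
To make the idea concrete, the following is a rough base R sketch of the sampling-based Shapley estimator described by Strumbelj and Kononenko (2014), applied to a toy model. It is not the implementation used by itsdm or fastshap. The variables temp and prec, the function f, and the helper shapley_mc are all hypothetical.

```r
set.seed(10)

# Simulated background data: two hypothetical environmental variables
n <- 200
X <- cbind(temp = runif(n, 0, 30), prec = runif(n, 0, 100))

# Hypothetical "model": suitability peaks at temp = 20 and ignores prec
f <- function(v) exp(-((v[["temp"]] - 20) / 5)^2)

# Monte Carlo Shapley estimate for feature j of one observation x
shapley_mc <- function(x, j, X, f, nsim = 100) {
  p <- ncol(X)
  contrib <- numeric(nsim)
  for (m in seq_len(nsim)) {
    z <- X[sample(nrow(X), 1), ]        # random background observation
    ord <- sample(p)                    # random feature ordering
    before <- ord[seq_len(which(ord == j) - 1)]  # features preceding j
    x_plus <- z                         # j and its predecessors taken from x
    x_plus[c(before, j)] <- x[c(before, j)]
    x_minus <- z                        # only the predecessors taken from x
    x_minus[before] <- x[before]
    contrib[m] <- f(x_plus) - f(x_minus)
  }
  mean(contrib)
}

# Shapley values of each variable for the first 50 observations
idx <- 1:50
shap_temp <- vapply(idx, function(i) shapley_mc(X[i, ], 1, X, f), numeric(1))
shap_prec <- vapply(idx, function(i) shapley_mc(X[i, ], 2, X, f), numeric(1))

# Dependence table: Shapley value of temp against the value of temp
dependence <- data.frame(temp = X[idx, "temp"], shapley = shap_temp)
head(dependence[order(dependence$temp), ])
```

Plotting shapley against temp gives a dependence curve for the toy model. Because f ignores prec, the Shapley values of prec are zero.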

References

Strumbelj, Erik, and Igor Kononenko. "Explaining prediction models and individual predictions with feature contributions." Knowledge and information systems 41.3 (2014): 647-665. doi:10.1007/s10115-013-0679-x
Sundararajan, Mukund, and Amir Najmi. "The many Shapley values for model explanation." International Conference on Machine Learning. PMLR, 2020.
https://github.com/bgreenwell/fastshap
https://github.com/shap/shap

Examples

# \donttest{
# Using a pseudo presence-only occurrence dataset of
# virtual species provided in this package
library(dplyr)
library(sf)
library(stars)
library(itsdm)

data("occ_virtual_species")
obs_df <- occ_virtual_species %>% filter(usage == "train")
eval_df <- occ_virtual_species %>% filter(usage == "eval")
x_col <- "x"
y_col <- "y"
obs_col <- "observation"

# Format the observations
obs_train_eval <- format_observation(
  obs_df = obs_df, eval_df = eval_df,
  x_col = x_col, y_col = y_col, obs_col = obs_col,
  obs_type = "presence_only")

env_vars <- system.file(
  'extdata/bioclim_tanzania_10min.tif',
  package = 'itsdm') %>% read_stars() %>%
  slice('band', c(1, 5, 12, 16))

# Fit the model in imperfect_presence mode
mod <- isotree_po(
  obs_mode = "imperfect_presence",
  obs = obs_train_eval$obs,
  obs_ind_eval = obs_train_eval$eval,
  variables = env_vars, ntrees = 10,
  sample_size = 0.8, ndim = 2L,
  seed = 123L, nthreads = 1,
  response = FALSE,
  spatial_response = FALSE,
  check_variable = FALSE)

var_dependence <- shap_dependence(
  model = mod$model,
  var_occ = mod$vars_train,
  variables = mod$variables)
plot(var_dependence, target_var = "bio1", related_var = "bio16")
# }

if (FALSE) { # \dontrun{
##### Use Random Forest model as an external model ########
library(randomForest)
# Prepare data
data("occ_virtual_species")
obs_df <- occ_virtual_species %>%
  filter(usage == "train")

env_vars <- system.file(
  'extdata/bioclim_tanzania_10min.tif',
  package = 'itsdm') %>% read_stars() %>%
  slice('band', c(1, 5, 12)) %>%
  split()

model_data <- stars::st_extract(
  env_vars, at = as.matrix(obs_df %>% select(x, y))) %>%
  as.data.frame()
names(model_data) <- names(env_vars)
model_data <- model_data %>%
  mutate(occ = obs_df[['observation']])
model_data$occ <- as.factor(model_data$occ)

mod_rf <- randomForest(
  occ ~ .,
  data = model_data,
  ntree = 200)

pfun <- function(X.model, newdata) {
  # Return the predicted probability of class "1" for each row of newdata
  predict(X.model, newdata, type = "prob")[, "1"]
}

shap_dependences <- shap_dependence(
  model = mod_rf,
  var_occ = model_data %>% select(-occ),
  variables = env_vars,
  visualize = FALSE,
  seed = 10,
  pfun = pfun)
} # }