Build Isolation forest species distribution model and explain the model and outputs.
Source: R/isotree_po.R
isotree_po.Rd
Call Isolation forest and its variations to do species distribution modeling and optionally call a collection of other functions to do model explanation.
Usage
isotree_po(
  obs_mode = "imperfect_presence",
  obs,
  obs_ind_eval = NULL,
  variables,
  categ_vars = NULL,
  contamination = 0.1,
  ntrees = 100L,
  sample_size = 1,
  ndim = 1L,
  seed = 10L,
  ...,
  offset = 0,
  response = TRUE,
  spatial_response = TRUE,
  check_variable = TRUE,
  visualize = FALSE
)
Arguments
- obs_mode (string) The mode of observations for training. It should be one of c("perfect_presence", "imperfect_presence", "presence_absence"). "perfect_presence" means presence-only occurrences without errors, uncertainties, or bias, which should be rare in reality. "imperfect_presence" means presence-only occurrences with errors, uncertainties, or bias, which should be the most common case. "presence_absence" means presence-absence observations regardless of quality. See Details to learn how to set it. The default is "imperfect_presence".
- obs (sf) The sf of observations for training. It is recommended to call function format_observation to format the occurrence (obs) before passing it here. Otherwise, make sure there is a column named "observation" for observation.
- obs_ind_eval (sf or NULL) Optional sf of observations for independent test. It is recommended to call function format_observation to format the occurrence (obs) before passing it here. Otherwise, make sure there is a column named "observation" for observation. If NULL, no independent test set will be used. The default is NULL.
- variables (RasterStack or stars) The stack of environmental variables.
- categ_vars (vector of character or NULL) The names of categorical variables. Must be the same as the names in variables.
- contamination (numeric) The proportion of abnormal cases within a dataset. Because iForest is an outlier detection algorithm, it picks out the (much fewer) abnormal cases from the normal cases. This argument sets how many abnormal cases there should be, where the user is able to control that. See Details for how to set it. The value should be less than 0.5; here it is constrained to (0, 0.3]. The default value is 0.1.
- ntrees (integer) The number of trees for the isolation forest. It must be an integer, which you can get with function as.integer. The default is 100L.
- sample_size (numeric) The sampling rate, which should be in [0, 1]. The default is 1.0.
- ndim (integer) ExtensionLevel for isolation forest. It must be an integer, which you can get with function as.integer. Also, it must be no larger than the number of environmental variables. When it is 1, the model is a traditional isolation forest, otherwise the model is an extended isolation forest. The default is 1.
- seed (integer) The random seed used in the modeling. It should be an integer. The default is 10L.
- ... Other arguments that isolation.forest needs.
- offset (numeric) The offset to adjust fitted suitability. The default is zero. It is highly recommended to leave it as default.
- response (logical) If TRUE, generate response curves. The default is TRUE.
- spatial_response (logical) If TRUE, generate spatial response maps. The default is TRUE, although this step might be slow. NOTE that the SHAP-based map is not generated here because it is slow. If you want it mapped, call function spatial_response to make it.
- check_variable (logical) If TRUE, check the variable importance. The default is TRUE.
- visualize (logical) If TRUE, generate the essential figures related to the model. The default is FALSE.
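The numeric constraints listed above can be checked before calling isotree_po. The following is a minimal sketch in base R; check_isotree_args is a hypothetical helper, not part of itsdm, and it assumes that ndim cannot exceed the number of environmental variables (passed here as n_vars).

```r
# Hypothetical pre-flight check, not part of itsdm.
# It mirrors the documented constraints on the numeric arguments.
check_isotree_args <- function(contamination, ntrees, sample_size,
                               ndim, n_vars) {
  stopifnot(
    # contamination is constrained to (0, 0.3]
    is.numeric(contamination), contamination > 0, contamination <= 0.3,
    # ntrees must be a true integer, e.g. 100L or as.integer(100)
    is.integer(ntrees), ntrees >= 1L,
    # sample_size is a rate in [0, 1]
    is.numeric(sample_size), sample_size >= 0, sample_size <= 1,
    # ndim must be an integer; assumed to be at most the number of variables
    is.integer(ndim), ndim >= 1L, ndim <= n_vars
  )
  invisible(TRUE)
}

# In R, 200 is a double; as.integer() (or the literal 200L) gives an integer.
ntrees <- as.integer(200)
ndim <- as.integer(2)
check_isotree_args(contamination = 0.1, ntrees = ntrees,
                   sample_size = 0.8, ndim = ndim, n_vars = 3L)
```

A failed check stops with an error naming the violated condition, which is usually easier to read than an error raised deep inside the modeling call.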
Value
(POIsotree) A list of
- model (isolation.forest) The fitted isolation forest model
- variables (stars) The formatted image stack of environmental variables
- background_samples (sf) An sf of background points for training dataset evaluation or SHAP dependence plot
- independent_test (sf or NULL) An sf of test occurrence dataset
- background_samples_test (sf or NULL) An sf of background points for test dataset evaluation or SHAP dependence plot
- vars_train (data.frame) A data.frame with values of each environmental variable for training occurrences
- pred_train (data.frame) A data.frame with values of prediction for training occurrences
- eval_train (POEvaluation) A list of presence-only evaluation metrics based on training dataset. See details of POEvaluation in evaluate_po
- var_test (data.frame or NULL) A data.frame with values of each environmental variable for test occurrences
- pred_test (data.frame or NULL) A data.frame with values of prediction for test occurrences
- eval_test (POEvaluation or NULL) A list of presence-only evaluation metrics based on test dataset. See details of POEvaluation in evaluate_po
- prediction (stars) The predicted environmental suitability
- marginal_responses (MarginalResponse or NULL) A list of marginal response values of each environmental variable. See details in marginal_response
- offset (numeric) The offset value set as inputs.
- independent_responses (IndependentResponse or NULL) A list of independent response values of each environmental variable. See details in independent_response
- shap_dependences (ShapDependence or NULL) A list of variable dependence values of each environmental variable. See details in shap_dependence
- spatial_responses (SpatialResponse or NULL) A list of spatial variable dependence values of each environmental variable. See details in spatial_response
- variable_analysis (VariableAnalysis or NULL) A list of variable importance analysis based on multiple metrics. See details in variable_analysis
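Several components of the returned list may be NULL, so it can help to check for them before printing or plotting. The sketch below uses a mock list with made-up contents in place of a real POIsotree object; only the component names are taken from the list above.

```r
# Mock list standing in for a POIsotree object (hypothetical values);
# a real one is returned by isotree_po().
mod <- list(
  eval_train = "training metrics",
  eval_test = NULL,            # e.g. no independent test set was supplied
  marginal_responses = NULL    # e.g. response curves were not requested
)

# Components documented as possibly NULL that we want to check
optional <- c("eval_test", "marginal_responses")

# Use [[ ]] with the full name: it matches exactly, whereas $ on a list
# falls back to partial matching of names.
is_missing <- vapply(optional, function(nm) is.null(mod[[nm]]), logical(1))
missing_parts <- optional[is_missing]
print(missing_parts)
```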
Details
For "perfect_presence", a user-defined number (contamination) of samples will be taken from the background to let iForest function normally.
If obs_mode is "imperfect_presence", no further action is required.
If obs_mode is "presence_absence", a contamination percent of absences will be randomly selected and work together with all presences to train the model.
NOTE: obs_mode only applies to obs. obs_ind_eval will follow its own structure.
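To make the "presence_absence" case concrete, here is a minimal base R sketch. It is not the itsdm implementation: sample_training_rows is a hypothetical helper, and it assumes one possible reading of the text above, namely that absences are drawn so that they make up roughly a contamination share of the final training set. The package may count them differently.

```r
# Hypothetical helper, not part of itsdm.
# Assumption: absences should form about `contamination` of the training set.
sample_training_rows <- function(obs, contamination = 0.1, seed = 10L) {
  stopifnot(contamination > 0, contamination <= 0.3)
  presences <- obs[obs$observation == 1, , drop = FALSE]
  absences <- obs[obs$observation == 0, , drop = FALSE]
  # Solve n_abs / (n_pres + n_abs) = contamination for n_abs,
  # then cap it at the number of absences actually available.
  n_abs <- round(nrow(presences) * contamination / (1 - contamination))
  n_abs <- min(n_abs, nrow(absences))
  set.seed(seed)
  picked <- absences[sample.int(nrow(absences), n_abs), , drop = FALSE]
  # All presences are kept; only the absences are subsampled.
  rbind(presences, picked)
}

# Toy data: 90 presences and 200 absences
obs <- data.frame(
  id = seq_len(290),
  observation = c(rep(1, 90), rep(0, 200))
)
train <- sample_training_rows(obs, contamination = 0.1)
table(train$observation)
```

With 90 presences and contamination = 0.1, the sketch keeps all presences and draws 10 absences, so absences are 10% of the 100 training rows.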
Please read the details of the algorithm isolation.forest at https://github.com/david-cortes/isotree, and the R documentation of function isolation.forest.
References
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008. doi:10.1109/ICDM.2008.17
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 1-39. doi:10.1145/2133360.2133363
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "On detecting clustered anomalies using SCiForest." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2010. doi:10.1007/978-3-642-15883-4_18
Hariri, Sahand, Matias Carrasco Kind, and Robert J. Brunner. "Extended isolation forest." IEEE Transactions on Knowledge and Data Engineering (2019). doi:10.1109/TKDE.2019.2947676
References for related features, such as response curves and variable importance, are listed under their own functions.
Examples
# \donttest{
########### Presence-absence mode #################
library(dplyr)
library(sf)
library(stars)
library(itsdm)
# Load example dataset
data("occ_virtual_species")
obs_df <- occ_virtual_species %>% filter(usage == "train")
eval_df <- occ_virtual_species %>% filter(usage == "eval")
x_col <- "x"
y_col <- "y"
obs_col <- "observation"
obs_type <- "presence_absence"
# Format the observations
obs_train_eval <- format_observation(
  obs_df = obs_df, eval_df = eval_df,
  x_col = x_col, y_col = y_col, obs_col = obs_col,
  obs_type = obs_type)
# Load variables
env_vars <- system.file(
  'extdata/bioclim_tanzania_10min.tif',
  package = 'itsdm') %>% read_stars() %>%
  slice('band', c(1, 5, 12))
# Modeling
mod_virtual_species <- isotree_po(
  obs_mode = "presence_absence",
  obs = obs_train_eval$obs,
  obs_ind_eval = obs_train_eval$eval,
  variables = env_vars, ntrees = 10L,
  sample_size = 0.6, ndim = 1L,
  seed = 123L, nthreads = 1)
# Check results
## Evaluation based on training dataset
print(mod_virtual_species$eval_train)
plot(mod_virtual_species$eval_train)
## Response curves
plot(mod_virtual_species$marginal_responses)
plot(mod_virtual_species$independent_responses,
  target_var = c('bio1', 'bio5'))
plot(mod_virtual_species$shap_dependences)
## Relationships between target var and related var
plot(mod_virtual_species$shap_dependences,
  target_var = c('bio1', 'bio5'),
  related_var = 'bio12', smooth_span = 0)
# Variable importance
mod_virtual_species$variable_analysis
plot(mod_virtual_species$variable_analysis)
########### Presence-only mode ##################
# Load example dataset
data("occ_virtual_species")
obs_df <- occ_virtual_species %>% filter(usage == "train")
eval_df <- occ_virtual_species %>% filter(usage == "eval")
x_col <- "x"
y_col <- "y"
obs_col <- "observation"
# Format the observations
obs_train_eval <- format_observation(
  obs_df = obs_df, eval_df = eval_df,
  x_col = x_col, y_col = y_col, obs_col = obs_col,
  obs_type = "presence_only")
# Modeling with perfect_presence mode
mod_perfect_pres <- isotree_po(
  obs_mode = "perfect_presence",
  obs = obs_train_eval$obs,
  obs_ind_eval = obs_train_eval$eval,
  variables = env_vars, ntrees = 10L,
  sample_size = 0.6, ndim = 1L,
  seed = 123L, nthreads = 1)
# Modeling with imperfect_presence mode
mod_imperfect_pres <- isotree_po(
  obs_mode = "imperfect_presence",
  obs = obs_train_eval$obs,
  obs_ind_eval = obs_train_eval$eval,
  variables = env_vars, ntrees = 10L,
  sample_size = 0.6, ndim = 1L,
  seed = 123L, nthreads = 1)
# }