% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/cv_cluster.R
\name{cv_cluster}
\alias{cv_cluster}
\title{Use environmental or spatial clustering to separate train and test folds}
\usage{
cv_cluster(
  x,
  column = NULL,
  r = NULL,
  k = 5L,
  scale = TRUE,
  raster_cluster = FALSE,
  num_sample = 10000L,
  biomod2 = TRUE,
  report = TRUE,
  ...
)
}
\arguments{
\item{x}{a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification).}

\item{column}{character (optional). The name of the column in which the response variable (e.g. species data as a binary
response, i.e. 0s and 1s) is stored. This is only used to check whether every fold contains all the classes in the final report.}

\item{r}{a terra SpatRaster object of covariates to identify environmental groups. If provided, clustering will be done
in environmental space rather than on the spatial coordinates of the sample points.}

\item{k}{integer value. The number of desired folds for cross-validation. The default is \code{k = 5}.}

\item{scale}{logical; whether to scale the input rasters (recommended) for clustering.}

\item{raster_cluster}{logical; if \code{TRUE}, the clustering is done over the entire raster layer,
otherwise it will be over the extracted raster values of the sample points. See details for more information.}

\item{num_sample}{integer; the number of samples from raster layers to build the clusters (when \code{raster_cluster = TRUE}).}

\item{biomod2}{logical. Creates a matrix of folds that can be directly used in the \pkg{biomod2} package as
a \emph{data.split.table} for cross-validation.}

\item{report}{logical; whether to print the report of the records per fold.}

\item{...}{additional arguments for \code{stats::kmeans} function, e.g. \code{algorithm = "MacQueen"}.}
}
\value{
An object of class S3. A list of objects including:
   \itemize{
    \item{folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices}
    \item{folds_ids - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in x)}
    \item{biomod_table - a matrix with the folds to be used in \pkg{biomod2} package}
    \item{k - number of the folds}
    \item{column - the name of the column if provided}
    \item{type - indicates whether spatial or environmental clustering was done.}
    \item{records - a table with the number of points in each category of training and testing}
    }
}
\description{
This function uses clustering methods to specify sets of similar environmental
conditions based on the input covariates, or clusters of spatial coordinates of the sample data.
Sample data (i.e. species data) corresponding to any of
these groups or clusters are assigned to a fold. Clustering is done
using \code{\link[stats]{kmeans}} for both approaches. The only required argument is \code{x}, which leads to
a clustering of the spatial coordinates of the sample data. If \code{r} is also provided, environmental
clustering is done instead.
}
\details{
As k-means algorithms use Euclidean distance to estimate clusters, the input raster covariates should be quantitative variables.
Since variables with wider ranges of values might dominate the clusters and bias the environmental clustering (Hastie et al., 2009),
all the input rasters are first scaled and centred (\code{scale = TRUE}) within the function.
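
As a minimal illustration of why scaling matters, the sketch below compares k-means on raw and scaled inputs. It uses base R only and is not the internal code of this function; the covariates \code{temp} and \code{rain} are simulated, hypothetical values.

\preformatted{
set.seed(1)
# two hypothetical covariates on very different scales
env <- cbind(temp = rnorm(100, mean = 20, sd = 2),
             rain = rnorm(100, mean = 1000, sd = 300))
# without scaling, 'rain' dominates the Euclidean distances
km_raw <- stats::kmeans(env, centers = 5)
# scale() centres each column and divides it by its standard deviation
km_scaled <- stats::kmeans(scale(env), centers = 5)
# each observation gets a cluster id, which plays the role of a fold id
table(km_scaled$cluster)
}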

If \code{raster_cluster = TRUE}, the clustering is done in the raster space. In this approach the clusters are consistent throughout the region
and across different sample datasets in the same region (which is useful for comparison). However, this may result in one or more clusters
that cover none of the species records (the spatial locations of the response samples),
especially when the species data are not dispersed throughout the region or the number of clusters (k or folds) is high. In this
case, the number of folds will be less than the specified \code{k}. If \code{raster_cluster = FALSE}, the clustering is done on the
raster values extracted at the species points and the number of folds will be the same as \code{k}.

Note that the input raster layer should cover all the species points, otherwise an error will be raised. Records with no raster
value should be removed prior to the analysis, or another raster layer must be provided.
}
\examples{
\donttest{
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# load raster data
path <- system.file("extdata/au/", package = "blockCV")
files <- list.files(path, full.names = TRUE)
covars <- terra::rast(files)

# spatial clustering
set.seed(6)
sc <- cv_cluster(x = pa_data,
                 column = "occ", # optional; name of the column with response
                 k = 5)
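
# a sketch of how the returned folds might be used (object names below
# such as train_idx and test_idx are illustrative, not part of the package)
# each element of folds_list holds training (first) and testing (second) indices
train_idx <- sc$folds_list[[1]][[1]]
test_idx <- sc$folds_list[[1]][[2]]
train_data <- pa_data[train_idx, ]
test_data <- pa_data[test_idx, ]
# number of records assigned to each fold
table(sc$folds_ids)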

# environmental clustering
set.seed(6)
ec <- cv_cluster(r = covars, # if provided will be used for environmental clustering
                 x = pa_data,
                 column = "occ", # optional; name of the column with response
                 k = 5,
                 scale = TRUE)
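
# environmental clustering over the entire raster (raster_cluster = TRUE);
# a sketch only: some clusters may contain no sample points, so the number
# of folds can end up smaller than the requested k (see details)
set.seed(6)
rc <- cv_cluster(r = covars,
                 x = pa_data,
                 column = "occ",
                 k = 5,
                 raster_cluster = TRUE)
# number of folds actually produced
length(rc$folds_list)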

}
}
\references{
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed., Vol. 1).
}
\seealso{
\code{\link{cv_buffer}} and \code{\link{cv_spatial}}
}
