% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/consensus_cluster.R
\name{consensus_cluster}
\alias{consensus_cluster}
\title{Consensus clustering}
\usage{
consensus_cluster(data, nk = 2:4, p.item = 0.8, reps = 1000,
  algorithms = NULL, nmf.method = c("scd", "lee"), nmf.loss = c("mse",
  "mkl"), xdim = NULL, ydim = NULL, rlen = 200, alpha = c(0.05, 0.01),
  minPts = 5, distance = "euclidean", prep.data = c("none", "full",
  "sampled"), scale = TRUE, type = c("conventional", "robust", "tsne",
  "largevis"), min.var = 1, progress = TRUE, seed.nmf = 123456,
  seed.data = 1, file.name = NULL, time.saved = FALSE)
}
\arguments{
\item{data}{data matrix with rows as samples and columns as variables}

\item{nk}{number of clusters (k) requested; can specify a single integer or a
range of integers to compute multiple k}

\item{p.item}{proportion of items to be used in subsampling within an
algorithm}

\item{reps}{number of subsamples}

\item{algorithms}{vector of clustering algorithms for performing consensus
clustering. Can be any number of the following: "nmf", "hc", "diana",
"km", "pam", "ap", "sc", "gmm", "block", "som", "cmeans", "hdbscan". A
custom clustering algorithm can also be used.}

\item{nmf.method}{specify NMF-based algorithms to run. By default both
"scd" and "lee" algorithms are called. See \code{\link[NNLM:nnmf]{NNLM::nnmf()}} for details.}

\item{nmf.loss}{specify loss function to use for NMF. Either "mse" for mean
square error (default) or "mkl" for mean KL-divergence. See \code{\link[NNLM:nnmf]{NNLM::nnmf()}}
for details.}

\item{xdim}{x dimension of the SOM grid}

\item{ydim}{y dimension of the SOM grid}

\item{rlen}{the number of times the complete data set will be presented to
the SOM network.}

\item{alpha}{SOM learning rate, a vector of two numbers indicating the amount
of change. Default is to decline linearly from 0.05 to 0.01 over \code{rlen}
updates. Not used for the batch algorithm.}

\item{minPts}{minimum size of clusters for HDBSCAN. Default is 5.}

\item{distance}{a vector of distance functions. Defaults to "euclidean".
Other options are given in \code{\link[stats:dist]{stats::dist()}}. A custom distance function can
be used.}

\item{prep.data}{Prepare the data on the "full" dataset, the "sampled"
dataset, or "none" (default).}

\item{scale}{logical; should the data be centered and scaled?}

\item{type}{if we use "conventional" measures (default), then the mean and
standard deviation are used for centering and scaling, respectively. If
"robust" measures are specified, the median and median absolute deviation
(MAD) are used. Alternatively, we can apply "tsne" or "largevis" as other
methods of dimension reduction.}

\item{min.var}{minimum variability measure threshold used to filter the
feature space for only highly variable features. Only features with a
minimum variability measure across all samples greater than \code{min.var} will
be used. If \code{type = "conventional"}, the standard deviation is the measure
used, and if \code{type = "robust"}, the MAD is the measure used.}

\item{progress}{logical; should a progress bar be displayed?}

\item{seed.nmf}{random seed to use for NMF-based algorithms}

\item{seed.data}{seed to use to ensure each algorithm operates on the same
set of subsamples}

\item{file.name}{if not \code{NULL}, the returned array will be saved at each
iteration, as well as at the end of the function call, to an \code{rds} file
with \code{file.name} as the file name.}

\item{time.saved}{logical; if \code{TRUE}, the date saved is appended to
\code{file.name}. Only applicable when \code{file.name} is not \code{NULL}.}
}
\value{
An array of dimension \code{nrow(data)} by \code{reps} by \code{length(algorithms)} by
\code{length(nk)}. Each cube of the array represents a different k. Each slice
of a cube is a matrix showing the consensus clustering results for one algorithm.
The matrices have a row for each sample and a column for each subsample.
Each entry represents a class membership.

When "hdbscan" is part of \code{algorithms}, we do not include its clustering
array in the consensus result. Instead, we report two summary statistics as
attributes: the proportion of outliers and the number of clusters.
}
\description{
Runs consensus clustering across subsamples of the data, clustering
algorithms, and numbers of clusters (k).
}
\details{
See examples for how to use custom algorithms and distance functions. The
default clustering algorithms provided are:
\itemize{
\item "nmf": Nonnegative Matrix Factorization (using sequential coordinate-wise
descent or Lee's multiplicative algorithm; see Note for specifications)
\item "hc": Hierarchical Clustering
\item "diana": DIvisive ANAlysis Clustering
\item "km": K-Means Clustering
\item "pam": Partitioning Around Medoids
\item "ap": Affinity Propagation
\item "sc": Spectral Clustering using a radial basis kernel function
\item "gmm": Gaussian Mixture Model fitted with the EM algorithm, using the
Bayesian Information Criterion for model selection
\item "block": Biclustering using a latent block model
\item "som": Self-Organizing Map (SOM) with Hierarchical Clustering
\item "cmeans": Fuzzy C-Means Clustering
\item "hdbscan": Hierarchical Density-based Spatial Clustering of Applications
with Noise (HDBSCAN)
}

The progress bar increments on every unit of \code{reps}.
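
As a minimal sketch of how \code{p.item} and \code{reps} interact, the base R
code below draws \code{reps} subsamples of row indices, each holding a
proportion \code{p.item} of the rows. The variable names and the use of
\code{floor()} are assumptions made for illustration and are not taken from
the package source.
\preformatted{
n <- 100        # number of samples (rows of data), illustrative value
p.item <- 0.8   # proportion of items per subsample
reps <- 5       # number of subsamples
set.seed(1)     # plays a role analogous to seed.data
# Each list element holds the row indices of one subsample
idx <- replicate(reps, sort(sample(n, floor(n * p.item))), simplify = FALSE)
lengths(idx)
}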
}
\note{
The \code{nmf.method} options are "scd" for sequential coordinate-wise
descent and "lee" for Lee's multiplicative algorithm. When "hdbscan" is
chosen as an algorithm to use, its results are excluded from the rest of
the consensus clusters. This is because there is no guarantee that the
cluster assignment will have every sample clustered; more often than not
there will be noise points or outliers. In addition, the number of distinct
clusters may not even be equal to \code{nk}.
}
\examples{
data(hgsc)
dat <- hgsc[1:100, 1:50]

# Custom distance function
manh <- function(x) {
  stats::dist(x, method = "manhattan")
}

# Custom clustering algorithm
agnes <- function(d, k) {
  return(as.integer(stats::cutree(cluster::agnes(d, diss = TRUE), k)))
}

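# Assign the function into the global environment (position 1 on the search
# path) under the name used in `algorithms`
assign("agnes", agnes, 1)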

cc <- consensus_cluster(dat, reps = 6, algorithms = c("pam", "agnes"),
                        distance = c("euclidean", "manh"), progress = FALSE)
str(cc)
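
# A minimal sketch of inspecting the result, assuming the array layout
# described in the Value section (samples x subsamples x algorithms x k)
dim(cc)
# Names along the third dimension label the algorithm results
dimnames(cc)[[3]]
# Class memberships from the first subsample, first algorithm, and first k;
# useNA = "ifany" also counts any missing entries
table(cc[, 1, 1, 1], useNA = "ifany")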
}
\author{
Derek Chiu, Aline Talhouk
}
