% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/saturationFilter.R
\name{saturationFilter}
\alias{classifSF}
\alias{classifSF.default}
\alias{classifSF.formula}
\alias{consensusSF}
\alias{consensusSF.default}
\alias{consensusSF.formula}
\alias{saturationFilter}
\alias{saturationFilter.default}
\alias{saturationFilter.formula}
\title{Saturation Filters}
\usage{
\method{saturationFilter}{formula}(formula, data, ...)

\method{saturationFilter}{default}(x, noiseThreshold = NULL,
  classColumn = ncol(x), ...)

\method{consensusSF}{formula}(formula, data, ...)

\method{consensusSF}{default}(x, nfolds = 10, consensusLevel = nfolds - 1,
  noiseThreshold = NULL, classColumn = ncol(x), ...)

\method{classifSF}{formula}(formula, data, ...)

\method{classifSF}{default}(x, nfolds = 10, noiseThreshold = NULL,
  classColumn = ncol(x), ...)
}
\arguments{
\item{formula}{A formula describing the classification variable and the attributes to be used.}

\item{data, x}{Data frame containing the training dataset to be filtered.}

\item{...}{Optional parameters to be passed to other methods.}

\item{noiseThreshold}{The threshold for removing noisy instances in the saturation filter.
The authors recommend values between 0.25 and 2. If it is set to \code{NULL}, the
threshold is chosen automatically according to the number of training instances.}

\item{classColumn}{Positive integer indicating the column that contains the class labels (a factor).
By default, the last column is used.}

\item{nfolds}{For \code{consensusSF} and \code{classifSF}, the number of folds
into which the dataset is split.}

\item{consensusLevel}{For \code{consensusSF}, the minimum number of 'noisy
votes' an instance must get in order to be removed. By default, all \code{nfolds-1} filters built
over each instance must label it as noise.}
}
\value{
An object of class \code{filter}, which is a list with seven components:
\itemize{
   \item \code{cleanData} is a data frame containing the filtered dataset.
   \item \code{remIdx} is a vector of integers indicating the indexes for
   removed instances (i.e. their row number with respect to the original data frame).
   \item \code{repIdx} is a vector of integers indicating the indexes for
   repaired/relabelled instances (i.e. their row number with respect to the original data frame).
   \item \code{repLab} is a factor containing the new labels for repaired instances.
   \item \code{parameters} is a list containing the argument values.
   \item \code{call} contains the original call to the filter.
   \item \code{extraInf} is a character vector containing additional
   information not covered by the previous items.
}
}
\description{
Data-complexity-based filters for removing label noise from a dataset as a
preprocessing step for classification. For more information, see the 'Details' and
'References' sections.
}
\details{
Based on theoretical studies about data complexity (Gamberger & Lavrac, 1997),
\code{saturationFilter} removes the
instances whose elimination most reduces the CLCH (Complexity of the Least Complex Hypotheses)
of the training dataset. The full method is described in Gamberger et al. (1999), and
the preliminary step of \emph{literals} extraction is detailed in Gamberger et al. (1996).

\code{consensusSF} splits the dataset into \code{nfolds} folds and applies
\code{saturationFilter} to every combination of \code{nfolds-1} folds. Instances
with at least \code{consensusLevel} 'noisy votes' are removed.
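
As a minimal illustration of the voting rule (a hypothetical sketch with
made-up vote counts, not the package's internal code), an instance is removed
when its number of 'noisy votes' reaches \code{consensusLevel}:
\preformatted{
nfolds <- 10
consensusLevel <- nfolds - 1
# Made-up counts: how many of the nfolds-1 filters flagged each of four instances
votes <- c(9, 3, 8, 0)
# Positions of the instances that would be removed under this rule
which(votes >= consensusLevel)
}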

\code{classifSF} combines \code{saturationFilter} with an \code{nfolds}-fold cross-validation
scheme (the latter in the spirit of filters such as \code{\link{EF}} and \code{\link{CVCF}}).
Namely, the dataset is split into \code{nfolds} folds and, for every combination
of \code{nfolds-1} folds, \code{saturationFilter} is applied and a classifier
(a standard C4.5 tree in this implementation) is built. Instances
from the excluded fold are then removed according to the predictions of this classifier.
}
\examples{
# The next example is not run because the saturation procedure is time-consuming.
\dontrun{
data(iris)
out1 <- saturationFilter(Species~., data = iris)
out2 <- consensusSF(Species~., data = iris)
out3 <- classifSF(Species~., data = iris)
print(out1)
print(out2)
print(out3)
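# Illustrative additions (a sketch; the parameter values below are arbitrary choices).
# Components of the returned 'filter' object, as listed in the 'Value' section:
head(out1$cleanData)    # filtered dataset
out1$remIdx             # row numbers of the removed instances
iris[out1$remIdx, ]     # the removed rows, taken from the original data frame
# Default (non-formula) interface, stating the class column explicitly:
out4 <- saturationFilter(iris, classColumn = 5)
# consensusSF with fewer folds and a looser consensus level, assuming that
# extra arguments are forwarded to the default method through '...':
out5 <- consensusSF(Species~., data = iris, nfolds = 5, consensusLevel = 3)
print(out5)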
}
}
\references{
Gamberger D., Lavrac N., Groselj C. (1999, June): Experiments with noise
filtering in a medical domain. In \emph{ICML} (pp. 143-151).

Gamberger D., Lavrac N., Dzeroski S. (1996, January): Noise elimination in
inductive concept learning: A case study in medical diagnosis. In
\emph{Algorithmic Learning Theory} (pp. 199-212). Springer Berlin Heidelberg.

Gamberger D., Lavrac N. (1997): Conditions for Occam's razor applicability
and noise elimination (pp. 108-123). Springer Berlin Heidelberg.
}

