% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ebm.R
\name{ebm}
\alias{ebm}
\title{Explainable Boosting Machine (EBM)}
\usage{
ebm(
  formula,
  data,
  max_bins = 1024L,
  max_interaction_bins = 64L,
  interactions = 0.9,
  exclude = NULL,
  validation_size = 0.15,
  outer_bags = 16L,
  inner_bags = 0L,
  learning_rate = 0.04,
  greedy_ratio = 10,
  cyclic_progress = FALSE,
  smoothing_rounds = 500L,
  interaction_smoothing_rounds = 100L,
  max_rounds = 25000L,
  early_stopping_rounds = 100L,
  early_stopping_tolerance = 1e-05,
  min_samples_leaf = 4L,
  min_hessian = 0,
  reg_alpha = 0,
  reg_lambda = 0,
  max_delta_step = 0,
  gain_scale = 5,
  min_cat_samples = 10L,
  cat_smooth = 10,
  missing = "separate",
  max_leaves = 2L,
  monotone_constraints = NULL,
  objective = c("auto", "log_loss", "rmse", "poisson_deviance",
    "tweedie_deviance:variance_power=1.5", "gamma_deviance", "pseudo_huber:delta=1.0",
    "rmse_log"),
  n_jobs = -1L,
  random_state = 42L,
  ...
)
}
\arguments{
\item{formula}{A \link[stats:formula]{formula} of the form \code{y ~ x1 + x2 + ...}.}

\item{data}{A data frame containing the variables in the model.}

\item{max_bins}{Max number of bins per feature for the main effects stage.
Default is 1024.}

\item{max_interaction_bins}{Max number of bins per feature for interaction
terms. Default is 64.}

\item{interactions}{Interaction terms to be included in the model. Default is
0.9. Current options include:
\itemize{
\item Integer (1 <= interactions): Count of interactions to be automatically
selected.
\item Percentage (interactions < 1.0): Determine the integer count of
interactions by multiplying the number of features by this percentage.
\item List of numeric pairs: The pairs contain the indices of the features within
each additive term. In addition to pairs, the interactions parameter accepts
higher order interactions. It also accepts univariate terms which will cause
the algorithm to boost the main terms at the same time as the interactions.
When boosting mains at the same time as interactions, the \code{exclude} parameter
should be set to \code{"mains"} and currently \code{max_bins} needs to be equal to
\code{max_interaction_bins}.
}}

\item{exclude}{Features or terms to be excluded. Default is \code{NULL}.}

\item{validation_size}{Validation set size. Used for early stopping during
boosting, and is needed to create outer bags. Default is 0.15. Options are:
\itemize{
\item Integer (1 <= \code{validation_size}): Count of samples to put in the validation
sets.
\item Percentage (\code{validation_size} < 1.0): Percentage of the data to put in the
validation sets.
\item 0: Turns off early stopping. Outer bags have no utility. Error bounds will
be eliminated.
}}

\item{outer_bags}{Number of outer bags. Outer bags are used to generate error
bounds and help with smoothing the graphs. Default is 16.}

\item{inner_bags}{Number of inner bags. Default is 0 which turns off inner
bagging.}

\item{learning_rate}{Learning rate for boosting. Default is 0.04.}

\item{greedy_ratio}{The proportion of greedy boosting steps relative to
cyclic boosting steps. A value of 0 disables greedy boosting. Default is 10.}

\item{cyclic_progress}{The proportion of boosting cycles that actively
contribute to improving the model's performance, expressed as a logical or a
numeric between 0 and 1. A value of \code{TRUE} (1.0) means 100\% of the cycles
are expected to make forward progress. If forward progress is not achieved
during a cycle, that cycle is not wasted; instead, it is used to update
internal gain calculations related to how effective each feature is in
predicting the target variable. Setting this parameter to a value less than
1.0 can be useful for preventing overfitting. Default is \code{FALSE}.}

\item{smoothing_rounds}{Number of initial highly regularized rounds to set
the basic shape of the main effect feature graphs. Default is 500.}

\item{interaction_smoothing_rounds}{Number of initial highly regularized
rounds to set the basic shape of the interaction effect feature graphs during
fitting. Default is 100.}

\item{max_rounds}{Total number of boosting rounds with \code{n_terms} boosting
steps per round. Default is 25000.}

\item{early_stopping_rounds}{Number of rounds with no improvement to trigger
early stopping. 0 turns off early stopping and boosting will occur for
exactly \code{max_rounds}. Default is 100.}

\item{early_stopping_tolerance}{Tolerance that dictates the smallest delta
required to be considered an improvement, which prevents the algorithm from
early stopping. \code{early_stopping_tolerance} is expressed as a percentage of
the early stopping metric. Negative values indicate that the individual
models should be overfit before stopping. EBMs are a bagged ensemble of
models, so setting \code{early_stopping_tolerance} to zero (or even negative)
lets each individual model overfit a little, which reduces its bias at the
expense of increasing its variance. Averaging the models in the ensemble then
reduces variance without much change in bias. Because the goal is to find the
optimum bias-variance tradeoff for the ensemble rather than for the
individual models, a small amount of overfitting of the individual models can
improve the accuracy of the ensemble as a whole. Default is 1e-05.}

\item{min_samples_leaf}{Minimum number of samples allowed in the leaves.
Default is 4.}

\item{min_hessian}{Minimum hessian required to consider a potential split
valid. Default is 0.0.}

\item{reg_alpha}{L1 regularization. Default is 0.0.}

\item{reg_lambda}{L2 regularization. Default is 0.0.}

\item{max_delta_step}{Used to limit the max output of tree leaves; <=0.0
means no constraint. Default is 0.0.}

\item{gain_scale}{Scale factor to apply to nominal categoricals. A scale
factor above 1.0 will cause the algorithm to focus more on the nominal
categoricals. Default is 5.0.}

\item{min_cat_samples}{Minimum number of samples in order to treat a category
separately. If lower than this threshold the category is combined with other
categories that have low numbers of samples. Default is 10.}

\item{cat_smooth}{Used for the categorical features. This can reduce the
effect of noise in categorical features, especially for categories with
limited data. Default is 10.0.}

\item{missing}{Method for handling missing values during boosting. Default is
\code{"separate"}. The placement of the missing value bin can influence the
resulting model graphs. For example, placing the bin on the "low" side may
cause missing values to affect lower bins, and vice versa. This parameter
does not affect the final placement of the missing bin in the model (the
missing bin will remain at index 0 in the \code{term_scores_} attribute). Possible
values for missing are:
\itemize{
\item \code{"low"}: Place the missing bin on the left side of the graphs.
\item \code{"high"}: Place the missing bin on the right side of the graphs.
\item \code{"separate"}: Place the missing bin in its own leaf during each boosting
step, effectively making it location-agnostic. This can lead to overfitting,
especially when the proportion of missing values is small.
\item \code{"gain"}: Choose the best leaf for the missing value contribution at each
boosting step, based on gain.
}}

\item{max_leaves}{Maximum number of leaves allowed in each tree.
Default is 2.}

\item{monotone_constraints}{Default is \code{NULL}. This parameter allows you to
specify monotonic constraints for each feature's relationship with the target
variable during model fitting. However, it is generally recommended to apply
monotonic constraints post-fit using the \code{monotonize()} attribute rather than
setting them during the fitting process. This recommendation is based on the
observation that, during fitting, the boosting algorithm may compensate for a
monotone constraint on one feature by utilizing another correlated feature,
potentially obscuring any monotonic violations. If you choose to define
monotone constraints, \code{monotone_constraints} should be a numeric vector with
a length equal to the number of features. Each element in the vector
corresponds to a feature and should take one of the following values:
\itemize{
\item 0: No monotonic constraint is imposed on the corresponding feature's
partial response.
\item +1: The partial response of the corresponding feature should be
monotonically increasing with respect to the target.
\item -1: The partial response of the corresponding feature should be
monotonically decreasing with respect to the target.
}}

\item{objective}{The objective function to optimize. Current options include:
\itemize{
\item \code{"auto"} (try to determine automatically between \code{"log_loss"} and \code{"rmse"}).
\item \code{"log_loss"} (log loss, e.g., for classification).
\item \code{"rmse"} (root mean squared error).
\item \code{"poisson_deviance"} (e.g., for counts or non-negative integers).
\item \code{"tweedie_deviance:variance_power=1.5"} (e.g., for modeling total loss in
insurance applications).
\item \code{"gamma_deviance"} (e.g., for positive continuous response).
\item \code{"pseudo_huber:delta=1.0"} (e.g., for robust regression).
\item \code{"rmse_log"} (\code{"rmse"} with a log link function).
}

Default is \code{"auto"} which assumes \code{"log_loss"} if the response is a factor or
character string and \code{"rmse"} otherwise. It's a good idea to always
explicitly set this argument.}

\item{n_jobs}{Number of jobs to run in parallel. Default is -1. Negative
integers are interpreted following
\href{https://github.com/joblib/joblib}{joblib}'s formula (\code{n_cpus + 1 + n_jobs}),
just like \href{https://scikit-learn.org/stable/}{scikit-learn}. For example,
\code{n_jobs = -2} means using all threads except 1.}

\item{random_state}{Random state. Setting to \code{NULL} generates non-repeatable
sequences. Default is 42 to remain consistent with the corresponding Python
module.}

\item{...}{Additional optional arguments. (Currently ignored.)}
}
\value{
An object of class \code{"EBM"} for which there are \link[=print.EBM]{print},
\link[=predict.EBM]{predict}, \link[=plot.EBM]{plot}, and \link[=merge.EBM]{merge} methods.
}
\description{
This function is an R wrapper for the explainable boosting functions in the
Python \href{https://github.com/interpretml/interpret}{interpret} library. It
trains an Explainable Boosting Machine (EBM) model, which is a tree-based,
cyclic gradient boosting generalized additive model with automatic
interaction detection. EBMs are often as accurate as state-of-the-art
blackbox models while remaining completely interpretable.
}
\details{
In short, EBMs have the general form

\deqn{g\left(E\left[Y|\boldsymbol{x}\right]\right) = \theta_0 + \sum_i f_i\left(x_i\right) + \sum_{ij}f_{ij}\left(x_i, x_j\right) \quad \left(i \ne j\right),}

where,
\itemize{
\item \eqn{g} is a link function that allows the model to handle various response
types (e.g., the logit link for logistic regression or the log link for
modeling counts and rates);
\item \eqn{\theta_0} is a constant intercept (or bias term);
\item \eqn{f_i} is the term contribution (or shape function) for predictor
\eqn{x_i} (i.e., it captures the main effect of \eqn{x_i} on
\eqn{E\left[Y|\boldsymbol{x}\right]});
\item \eqn{f_{ij}} is the term contribution for the pair of predictors \eqn{x_i}
and \eqn{x_j} (i.e., it captures the joint effect, or pairwise interaction
effect of \eqn{x_i} and \eqn{x_j} on \eqn{E\left[Y|\boldsymbol{x}\right]}).
}
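
As an illustration (a worked special case, assuming a binary response coded
0/1 and the logit link), inverting \eqn{g} gives the predicted probability

\deqn{P\left(Y = 1|\boldsymbol{x}\right) = \frac{1}{1 + \exp\left(-\left[\theta_0 + \sum_i f_i\left(x_i\right) + \sum_{ij}f_{ij}\left(x_i, x_j\right)\right]\right)},}

so each term contribution acts as an additive shift on the log-odds scale.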
}
\examples{
\dontrun{
  #
  # Regression example
  #

  # Fit a default EBM regressor
  fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse")

  # Generate some predictions
  head(predict(fit, newdata = mtcars))
  head(predict(fit, newdata = mtcars, se_fit = TRUE))

  # Show global summary and GAM shape functions
  plot(fit)  # term importance scores
  plot(fit, term = "cyl")
  plot(fit, term = "cyl", interactive = TRUE)

  # Explain prediction for first observation
  plot(fit, local = TRUE, X = subset(mtcars, select = -mpg)[1L, ])
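
  #
  # Classification example (illustrative sketch)
  #

  # Assumption: a binary factor response is built from `am` purely for
  # illustration; only arguments documented above are used, and the small
  # sample size makes this a syntax demonstration rather than a real analysis
  mtcars2 <- transform(mtcars, am = factor(am))
  fit2 <- ebm(am ~ ., data = mtcars2, objective = "log_loss",
              interactions = 3L, random_state = 101L)

  # Generate some predictions (using the default prediction type)
  head(predict(fit2, newdata = mtcars2))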
}
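
# Illustrative sketch of how a negative n_jobs could be resolved, using the
# n_cpus + 1 + n_jobs formula described for the n_jobs argument.
# `resolve_n_jobs` is a hypothetical helper written for this example only;
# it is not part of the package.
resolve_n_jobs <- function(n_jobs, n_cpus) {
  if (n_jobs < 0L) n_cpus + 1L + n_jobs else n_jobs
}
resolve_n_jobs(-1L, n_cpus = 8L)  # 8: use all threads
resolve_n_jobs(-2L, n_cpus = 8L)  # 7: all threads except one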

}
