% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/aggregation-trees.R
\name{build_aggtree}
\alias{build_aggtree}
\alias{inference_aggtree}
\title{Aggregation Trees}
\usage{
build_aggtree(
  Y_tr,
  D_tr,
  X_tr,
  Y_hon = NULL,
  D_hon = NULL,
  X_hon = NULL,
  cates_tr = NULL,
  cates_hon = NULL,
  method = "aipw",
  scores = NULL,
  ...
)

inference_aggtree(object, n_groups, boot_ci = FALSE, boot_R = 2000)
}
\arguments{
\item{Y_tr}{Outcome vector for training sample.}

\item{D_tr}{Treatment vector for training sample.}

\item{X_tr}{Covariate matrix (no intercept) for training sample.}

\item{Y_hon}{Outcome vector for honest sample.}

\item{D_hon}{Treatment vector for honest sample.}

\item{X_hon}{Covariate matrix (no intercept) for honest sample.}

\item{cates_tr}{Optional, predicted CATEs for training sample. If not provided by the user, CATEs are estimated internally via a \code{\link[grf]{causal_forest}}.}

\item{cates_hon}{Optional, predicted CATEs for honest sample. If not provided by the user, CATEs are estimated internally via a \code{\link[grf]{causal_forest}}.}

\item{method}{Either \code{"raw"} or \code{"aipw"}, controls how node predictions are computed.}

\item{scores}{Optional, vector of scores to be used in computing node predictions. Useful to save computational time if scores have already been estimated. Ignored if \code{method == "raw"}.}

\item{...}{Further arguments passed to \code{\link[rpart]{rpart.control}}.}

\item{object}{An \code{aggTrees} object.}

\item{n_groups}{Number of desired groups.}

\item{boot_ci}{Logical, whether to compute bootstrap confidence intervals.}

\item{boot_R}{Number of bootstrap replications. Ignored if \code{boot_ci == FALSE}.}
}
\value{
\code{\link{build_aggtree}} returns an \code{aggTrees} object.\cr

\code{\link{inference_aggtree}} returns an \code{aggTrees.inference} object, which in turn contains the \code{aggTrees} object used
in the call.
}
\description{
Nonparametric data-driven approach to discovering heterogeneous subgroups in a selection-on-observables framework.
The approach constructs a sequence of groupings, one for each level of granularity. Groupings are nested and
feature an optimality property. For each grouping, we obtain point estimates and standard errors for the group average
treatment effects (GATEs) using debiased machine learning procedures. Additionally, we assess whether systematic heterogeneity
is found by testing the hypotheses that the differences in the GATEs across all pairs of groups are zero. Finally, we investigate
the driving mechanisms of effect heterogeneity by computing the average characteristics of units in each group.
}
\details{
Aggregation trees are a three-step procedure. First, the conditional average treatment effects (CATEs) are estimated using any
estimator. Second, a tree is grown to approximate the CATEs. Third, the tree is pruned to derive a nested sequence of optimal
groupings, one for each granularity level. For each level of granularity, we can obtain point estimates and conduct inference about
the GATEs.\cr

To implement this methodology, the user can rely on two core functions that handle the various steps.\cr
\subsection{Constructing the Sequence of Groupings}{

\code{\link{build_aggtree}} constructs the sequence of groupings (i.e., the tree) and estimates the GATEs in each node. The
GATEs can be estimated in several ways, controlled by the \code{method} argument. If \code{method == "raw"}, we
compute the difference in mean outcomes between treated and control observations in each node. This is an unbiased estimator
in randomized experiments. If \code{method == "aipw"}, we construct doubly-robust scores and average them in each node. This
estimator is also unbiased in observational studies. Honest regression forests and 5-fold cross-fitting are used to estimate the
propensity score and the conditional mean function of the outcome (unless the user specifies the argument \code{scores}).\cr
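
For reference, doubly-robust scores of this kind are commonly written in the augmented inverse-probability weighting form below. The
notation is introduced here for illustration only (\eqn{\mu_d(x)} for the conditional mean of the outcome under treatment status d and
\eqn{p(x)} for the propensity score), and the exact construction used internally may differ:

\deqn{\Gamma_i = \mu_1(X_i) - \mu_0(X_i) + \frac{D_i (Y_i - \mu_1(X_i))}{p(X_i)} - \frac{(1 - D_i) (Y_i - \mu_0(X_i))}{1 - p(X_i)}}

In practice, the unknown functions are replaced by their cross-fitted estimates.\cr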

The user can provide a vector of the estimated CATEs via the \code{cates_tr} and \code{cates_hon} arguments. If no CATEs are provided,
these are estimated internally via a \code{\link[grf]{causal_forest}} using only the training sample, that is, \code{Y_tr}, \code{D_tr},
and \code{X_tr}.\cr
}

\subsection{GATEs Estimation and Inference}{

\code{\link{inference_aggtree}} takes as input an \code{aggTrees} object constructed by \code{\link{build_aggtree}}. Then, for
the desired granularity level, chosen via the \code{n_groups} argument, it provides point estimates and standard errors for
the GATEs. Additionally, it performs hypothesis tests to assess whether we find systematic heterogeneity and computes
the average characteristics of the units in each group to investigate the driving mechanisms.
\subsection{Point estimates and standard errors for the GATEs}{

GATEs and their standard errors are obtained by fitting an appropriate linear model. If \code{method == "raw"}, we estimate
via OLS the following:

\deqn{Y_i = \sum_{l = 1}^{|T|} L_{i, l} \gamma_l + \sum_{l = 1}^{|T|} L_{i, l} D_i \beta_l + \epsilon_i}

with \eqn{L_{i, l}} a dummy variable equal to one if the i-th unit falls in the l-th group, and |T| the
number of groups. If the treatment is randomly assigned, one can show that the betas identify the GATE of
each group. However, this is not true in observational studies due to selection into treatment. In that case, the user is
expected to use \code{method == "aipw"} when calling \code{\link{build_aggtree}}, and
\code{\link{inference_aggtree}} uses the scores in the following regression:

\deqn{score_i = \sum_{l = 1}^{|T|} L_{i, l} \beta_l + \epsilon_i}

This way, betas again identify the GATEs.\cr
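
To see why, note that both regressions are saturated in the group dummies, so the OLS coefficients reduce to simple group averages.
A short sketch, with \eqn{n_l} the number of units used for estimation in the l-th group and \eqn{n_{l, d}} the number of those
units with \eqn{D_i = d} (notation introduced here for illustration, and assuming each group contains both treated and control units):

\deqn{\hat{\beta}_l^{raw} = \frac{1}{n_{l, 1}} \sum_{i: D_i = 1} L_{i, l} Y_i - \frac{1}{n_{l, 0}} \sum_{i: D_i = 0} L_{i, l} Y_i, \qquad \hat{\beta}_l^{aipw} = \frac{1}{n_l} \sum_{i} L_{i, l} score_i}

That is, each coefficient coincides with the node-level estimate described above for the corresponding \code{method}.\cr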

Regardless of \code{method}, standard errors are estimated via the Eicker-Huber-White estimator.\cr

If \code{boot_ci == TRUE}, the routine also computes asymmetric bias-corrected and accelerated 95\% confidence intervals using \code{boot_R} bootstrap
samples (2000 by default). This is particularly useful when the honest sample is small.
}

\subsection{Hypothesis testing}{

\code{\link{inference_aggtree}} uses the standard errors obtained by fitting the linear models above to test, for each pair of
leaves, the null hypothesis that their GATEs are equal. We adjust p-values to account for multiple hypothesis testing
using Holm's procedure.
}

\subsection{Average Characteristics}{

\code{\link{inference_aggtree}} regresses each covariate on a set of dummies denoting group membership. This way, we get the
average characteristics of units in each leaf, together with a standard error. Leaves are ordered in increasing order of their
predictions (from most negative to most positive). Standard errors are estimated via the Eicker-Huber-White estimator.
}

}

\subsection{Caution on Inference}{

Regardless of the chosen \code{method}, both functions estimate the GATEs, the linear models, and the average characteristics
of units in each group using only observations in the honest sample. If the honest sample is empty (this happens when the
user either does not provide \code{Y_hon}, \code{D_hon}, and \code{X_hon} or sets them to \code{NULL}), the same data used to
construct the tree are used to estimate the above quantities. This is fine for prediction but invalidates inference.
}
}
\examples{
\donttest{## Generate data.
set.seed(1986)

n <- 1000
k <- 3

X <- matrix(rnorm(n * k), ncol = k)
colnames(X) <- paste0("x", seq_len(k))
D <- rbinom(n, size = 1, prob = 0.5)
mu0 <- 0.5 * X[, 1]
mu1 <- 0.5 * X[, 1] + X[, 2]
Y <- mu0 + D * (mu1 - mu0) + rnorm(n)

## Training-honest sample split.
honest_frac <- 0.5
splits <- sample_split(length(Y), training_frac = (1 - honest_frac))
training_idx <- splits$training_idx
honest_idx <- splits$honest_idx

Y_tr <- Y[training_idx]
D_tr <- D[training_idx]
X_tr <- X[training_idx, ]

Y_hon <- Y[honest_idx]
D_hon <- D[honest_idx]
X_hon <- X[honest_idx, ]

## Construct sequence of groupings. CATEs estimated internally.
groupings <- build_aggtree(Y_tr, D_tr, X_tr, # Training sample.
                           Y_hon, D_hon, X_hon) # Honest sample.

## Alternatively, we can estimate the CATEs and pass them.
library(grf)
forest <- causal_forest(X_tr, Y_tr, D_tr) # Use training sample.
cates_tr <- predict(forest, X_tr)$predictions
cates_hon <- predict(forest, X_hon)$predictions

groupings <- build_aggtree(Y_tr, D_tr, X_tr, # Training sample.
                           Y_hon, D_hon, X_hon, # Honest sample.
                           cates_tr, cates_hon) # Predicted CATEs.

## The output is compatible with generic S3 methods.
summary(groupings)
print(groupings)
plot(groupings) # Try also setting 'sequence = TRUE'.

## To predict, do the following.
tree <- subtree(groupings$tree, cv = TRUE) # Select by cross-validation.
head(predict(tree, data.frame(X_hon)))

## Inference with 4 groups.
results <- inference_aggtree(groupings, n_groups = 4)

summary(results$model) # Coefficient of leafk is GATE in k-th leaf.

results$gates_diff_pairs$gates_diff # GATEs differences.
results$gates_diff_pairs$holm_pvalues # leaves 1-2 not statistically different.
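
## Holm's adjustment can be illustrated with base R. This is only a sketch:
## the p-values below are made up (not output of this package), and whether
## the package calls p.adjust() internally is an assumption.
toy_pvalues <- c(0.01, 0.04, 0.03) # Hypothetical raw p-values.
toy_holm <- p.adjust(toy_pvalues, method = "holm")
toy_holm # Adjusted p-values are never smaller than the raw ones.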

## LaTeX.
print(results, table = "diff")
print(results, table = "avg_char")}

}
\references{
\itemize{
\item Di Francesco, R. (2022). Aggregation Trees. CEIS Research Paper, 546. \doi{10.2139/ssrn.4304256}.
}
}
\seealso{
\code{\link{plot.aggTrees}} \code{\link{print.aggTrees.inference}}
}
\author{
Riccardo Di Francesco
}
