% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/eim-class.R
\name{get_agg_proxy}
\alias{get_agg_proxy}
\title{Runs the EM algorithm aggregating adjacent groups, maximizing the variability of macro-group allocation in ballot boxes.}
\usage{
get_agg_proxy(
  object = NULL,
  X = NULL,
  W = NULL,
  json_path = NULL,
  sd_statistic = "maximum",
  sd_threshold = 0.05,
  method = "mult",
  feasible = TRUE,
  nboot = 50,
  allow_mismatch = TRUE,
  seed = NULL,
  ...
)
}
\arguments{
\item{object}{An object of class \code{eim}, which can be created using the \link{eim} function. This parameter should not be used if either (i) \code{X} and \code{W} matrices or (ii) \code{json_path} is supplied. See \strong{Note} in \link{run_em}.}

\item{X}{A \verb{(b x c)} matrix representing candidate votes per ballot box.}

\item{W}{A \verb{(b x g)} matrix representing group votes per ballot box.}

\item{json_path}{A path to a JSON file containing \code{X} and \code{W} fields, stored as nested arrays. It may contain additional fields with other attributes, which will be added to the returned object.}

\item{sd_statistic}{String indicating the statistic applied to the \verb{(g x c)} standard deviation matrix for the stopping condition, i.e., the algorithm stops when the statistic is below the threshold. It can be \code{maximum}, which takes the maximum over the standard deviation matrix, or \code{average}, which takes the average. Defaults to \code{maximum}.}

\item{sd_threshold}{Numeric value used as the threshold for the statistic (\code{sd_statistic}) of the standard deviation of the estimated probabilities. Defaults to 0.05.}

\item{method}{An optional string specifying the method used for estimating the E-step. Valid
options are:
\itemize{
\item \code{mult}: The default method, using a single sum of Multinomial distributions.
\item \code{mvn_cdf}: Uses the Multivariate Normal CDF to approximate the conditional probability.
\item \code{mvn_pdf}: Uses the Multivariate Normal PDF to approximate the conditional probability.
\item \code{mcmc}: Uses MCMC to sample vote outcomes. This is used to estimate the conditional probability of the E-step.
\item \code{exact}: Solves the E-step using the Total Probability Law.
}}

\item{feasible}{Logical indicating whether the returned matrix must strictly satisfy the \code{sd_threshold}.
If \code{TRUE}, no output is returned when the method does not find a group aggregation whose standard deviation statistic is below the threshold. If \code{FALSE} and no such group aggregation is found, it returns the group aggregation obtained from the DP with the lowest standard deviation statistic. See \strong{Details} for more information. Default is \code{TRUE}.}

\item{nboot}{Integer specifying the number of bootstrap samples, i.e., how many times to run the
EM algorithm. Defaults to 50.}

\item{allow_mismatch}{Boolean. If \code{TRUE}, allows a mismatch between the voters and votes in each ballot box; this only works if \code{method} is \code{"mvn_cdf"}, \code{"mvn_pdf"}, \code{"mult"}, or \code{"mcmc"}. If \code{FALSE}, throws an error if there is a mismatch. Defaults to \code{TRUE}.}

\item{seed}{An optional integer indicating the random seed for the randomized algorithms. This argument is only applicable if \code{initial_prob = "random"} or \code{method} is either \code{"mcmc"} or \code{"mvn_cdf"}. Additionally, it sets the random draws of the ballot boxes.}

\item{...}{Additional arguments passed to the \link{run_em} function that will execute the EM algorithm.}
}
\value{
It returns an eim object with the same attributes as the output of \link{run_em}, plus the attributes:
\itemize{
\item \strong{sd}: A \verb{(a x c)} matrix with the standard deviation of the estimated probabilities computed with bootstrapping. Here \code{a} denotes the number of macro-groups of the resulting group aggregation, which is between \code{1} and \code{g}.
\item \strong{nboot}: Number of samples used for the \link{bootstrap} method.
\item \strong{seed}: Random seed used (if specified).
\item \strong{sd_statistic}: The statistic used as input.
\item \strong{sd_threshold}: The threshold used as input.
\item \strong{is_feasible}: Boolean indicating whether the statistic of the standard deviation matrix is below the threshold.
\item \strong{group_agg}: Vector with the resulting group aggregation. See \strong{Examples} for more details.
}

Additionally, it will create the \code{W_agg} attribute with the aggregated groups, along with the attributes corresponding to running \link{run_em} with the aggregated groups.
}
\description{
This function estimates the voting probabilities (computed using \link{run_em}) aggregating adjacent groups so that the estimated probabilities' standard deviation (computed using \link{bootstrap}) is below a given threshold. See \strong{Details} for more information.
}
\details{
Groups need to have an order relation so that adjacent groups can be merged. Groups with consecutive column indices in the matrix \code{W} are considered adjacent. For example, consider the following seven groups defined by voters' age ranges: 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, and 80+. A possible group aggregation consists of three macro-groups with the following age ranges: 20-39, 40-59, and 60+. Since there are multiple group aggregations, even for a fixed number of macro-groups, a Dynamic Program (DP) is used to find, for a specific number of macro-groups, the group aggregation that maximizes the sum of the standard deviations of the macro-group proportions among ballot boxes. If no group aggregation has a standard deviation statistic that meets the threshold condition, \code{NULL} is returned, unless \code{feasible = FALSE}.

To find the best group aggregation, the function runs the DP iteratively, starting with all groups (this case is trivial, since every macro-group matches exactly one original group). If the standard deviation statistic (\code{sd_statistic}) is below the threshold (\code{sd_threshold}), it stops. Otherwise, it runs the DP with one macro-group fewer and checks the statistic again. This continues, removing one macro-group at a time, until either the statistic falls below the threshold or no group aggregation obtained by the DP satisfies the threshold condition. In the first case, the last group aggregation obtained (before stopping) is returned. In the second case, no output is returned unless the user sets \code{feasible = FALSE}, in which case the function returns the group aggregation with the lowest standard deviation statistic among those obtained from the DP.
}
\examples{
# Example 1: Using a simulated instance
simulations <- simulate_election(
    num_ballots = 400,
    num_candidates = 3,
    num_groups = 6,
    group_proportions = c(0.4, 0.1, 0.1, 0.1, 0.2, 0.1),
    lambda = 0.7,
    seed = 42
)

result <- get_agg_proxy(
    X = simulations$X,
    W = simulations$W,
    sd_threshold = 0.015,
    seed = 42
)

result$group_agg # c(2, 6)
# This means that the resulting group aggregation consists of
# two macro-groups: one that has the original groups 1 and 2; and
# a second that has the original groups 3, 4, 5, and 6:
# {[1, 2], [3, 6]}
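
# A minimal sketch (illustrative only, not part of the package API) of how
# to read such a vector, assuming each entry is the index of the last
# original group in its macro-group, as described above. The vector is
# hard-coded so that the sketch runs on its own; in practice one would
# start from result$group_agg. The names agg_ends, agg_starts, and
# agg_ranges are hypothetical.
agg_ends <- c(2, 6)
agg_starts <- c(1, head(agg_ends, -1) + 1) # first original group of each macro-group
agg_ranges <- Map(seq, agg_starts, agg_ends)
agg_ranges # a list holding the index ranges 1:2 and 3:6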

# Example 2: Using the Chilean election results
data(chile_election_2021)

niebla_df <- chile_election_2021[chile_election_2021$ELECTORAL.DISTRICT == "NIEBLA", ]

# Create the X matrix with selected columns
X <- as.matrix(niebla_df[, c("C1", "C2", "C3", "C4", "C5", "C6", "C7")])

# Create the W matrix with selected columns
W <- as.matrix(niebla_df[, c(
    "X18.19", "X20.29",
    "X30.39", "X40.49",
    "X50.59", "X60.69",
    "X70.79", "X80."
)])

solution <- get_agg_proxy(
    X = X, W = W,
    allow_mismatch = TRUE, sd_threshold = 0.03,
    sd_statistic = "average", nboot = 100, seed = 42
)
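
# A small sketch of the stopping rule, using a hypothetical standard
# deviation matrix (sd_toy and its values are made up for illustration):
# with sd_statistic = "average" the statistic is the mean of the matrix,
# while with "maximum" it is its largest entry.
sd_toy <- matrix(c(0.01, 0.02, 0.03, 0.04), nrow = 2)
mean(sd_toy) < 0.03 # TRUE: the average (0.025) is below the threshold
max(sd_toy) < 0.03 # FALSE: the maximum (0.04) is not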

solution$group_agg # c(3, 4, 5, 6, 8)
# This means that the resulting group aggregation consists of
# five macro-groups: one that includes the original groups 1, 2, and 3;
# three singleton groups (4, 5, and 6); and one macro-group that includes groups 7 and 8.
# {[1, 2, 3], [4], [5], [6], [7, 8]}
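
# A hedged sketch of what aggregating adjacent groups could look like in
# base R, assuming (this is an assumption, not confirmed package behavior)
# that a macro-group's voters are the sum of the voters of the groups it
# merges. W_toy is a hypothetical matrix with 2 ballot boxes and 8 groups,
# and the aggregation vector is hard-coded so the sketch runs on its own.
W_toy <- matrix(1:16, nrow = 2, ncol = 8)
toy_ends <- c(3, 4, 5, 6, 8)
toy_starts <- c(1, head(toy_ends, -1) + 1)
W_toy_agg <- sapply(seq_along(toy_ends), function(i) {
    # drop = FALSE keeps singleton macro-groups as one-column matrices
    rowSums(W_toy[, toy_starts[i]:toy_ends[i], drop = FALSE])
})
dim(W_toy_agg) # 2 5: one column per macro-group
# The total number of voters per ballot box is unchanged by the aggregation
all(rowSums(W_toy_agg) == rowSums(W_toy)) # TRUE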

}
\seealso{
The \link{eim} object and \link{run_em} implementation.
}
