This CRAN task view lists packages that provide
methods typically used in official statistics and survey methodology.
Many packages provide functionality for more than one of the topics listed
below. Therefore this list is not a strict categorization and packages can be
listed more than once. Please note that not all topics of interest in
official statistics are covered below, because the corresponding functionality
may not yet be available in R.
Complex Survey Design: General Comments

Package
sampling
includes many different algorithms for
drawing survey samples and calibrating the design weights.

Package
survey
can handle moderately sized data sets and is the
standard package for dealing with already drawn survey samples in R. Once
the given survey design is specified within the function
svydesign(), point and variance estimates can be computed.

Package
simFrame
is designed for performing simulation
studies in official statistics. It provides a framework for comparing
different point and variance estimators under different survey designs as
well as different conditions regarding missing values, representative and
nonrepresentative outliers.
Complex Survey Design: Details

Package
survey
allows specifying a complex survey design
(stratified sampling design, cluster sampling, multistage sampling and pps
sampling with or without replacement) for an already drawn survey sample in
order to compute accurate point and variance estimates.

Various algorithms for drawing a sample are implemented in package
sampling
(Brewer, Midzuno, pps, systematic, Sampford, balanced
(cluster or stratified) sampling via the cube method, etc.).

The
pps
package contains functions to select samples using pps
sampling. Stratified simple random sampling is also possible, as is the
computation of joint inclusion probabilities for Sampford's method of pps sampling.

Package
sampfling
implements the Sampford algorithm to obtain
a sample without replacement and with unequal probabilities.

Package
stratification
allows univariate stratification of survey
populations with a generalisation of the Lavallee-Hidiroglou method.

Package
SamplingStrata
offers an approach for choosing the best
stratification of a sampling frame in a multivariate and multi-domain setting,
where the sample sizes in each stratum are determined so as to satisfy accuracy
constraints on target estimates.
To evaluate the distribution of target variables in different strata, information from the sampling frame,
or data from previous rounds of the same survey, may be used.
Complex Survey Design: Point and Variance Estimation

Package
survey
allows specifying a complex survey design. The
resulting object can be used to estimate (Horvitz-Thompson) totals, means,
ratios and quantiles for domains or the whole survey sample, and to apply
regression models. Variance estimation for means, totals and ratios can be
done either by Taylor linearization or resampling (BRR, jackknife, bootstrap
or user-defined).

Package
EVER
provides variance estimation for complex
designs by delete-a-group jackknife replication for (Horvitz-Thompson)
totals, means, absolute and relative frequency distributions, contingency
tables, ratios, quantiles and regression coefficients even for domains.

Package
laeken
provides functions to estimate certain Laeken
indicators (at-risk-of-poverty rate, quintile share ratio, relative median
risk-of-poverty gap, Gini coefficient) including their variance for domains
and strata based on bootstrap resampling.

Package
simFrame
allows comparing (user-defined) point and
variance estimators in a simulation environment.
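
As a rough illustration of the estimators named in this section, the following base-R sketch computes a Horvitz-Thompson total and a delete-one jackknife variance for a weighted mean. It does not use any of the packages above; the data, the names y and incl_prob, and the assumption of a single-stage, unstratified sample are hypothetical and chosen only for illustration.

```r
# Hypothetical sample: study variable and first-order inclusion probabilities.
y         <- c(12, 7, 30, 18)
incl_prob <- c(0.2, 0.1, 0.5, 0.3)

# Design weights are the inverse inclusion probabilities.
w <- 1 / incl_prob

# Horvitz-Thompson estimator of the population total.
ht_total <- sum(w * y)

# Weighted mean (ratio of two Horvitz-Thompson totals).
wmean <- function(y, w) sum(w * y) / sum(w)
est_mean <- wmean(y, w)

# Delete-one jackknife: recompute the mean with each unit left out in turn.
n <- length(y)
theta_i <- sapply(seq_len(n), function(i) wmean(y[-i], w[-i]))
jk_var <- (n - 1) / n * sum((theta_i - mean(theta_i))^2)

print(c(total = ht_total, mean = est_mean, se = sqrt(jk_var)))
```

For real designs with strata, clusters or finite population corrections, the replication scheme has to respect the design, which is what the packages listed in this section take care of.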
Complex Survey Design: Calibration

Package
survey
allows for post-stratification, generalized
raking/calibration, GREG estimation and trimming of weights.

Package
EVER
provides facilities (function
kottcalibrate()) to calibrate either on a total number of units
in the population, on marginal distributions or joint distributions of
categorical variables, or on totals of quantitative variables.

The
calib()
function in package
sampling
allows calibrating
calibrate for nonresponse (with response homogeneity groups) for stratified
samples.

The
calibWeights()
function in package
laeken
is a
possibly faster (depending on the example) implementation of parts of
calib()
from package
sampling.

Package
reweight
allows for calibration of survey weights for
categorical survey data so that the marginal distributions of certain
variables fit more closely to those from a given population, but does not
allow complex sampling designs.
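
To make the idea of raking concrete, the following base-R sketch adjusts weights iteratively until the weighted totals of two categorical variables match given population margins. It is a minimal illustration and not the implementation used by any of the packages above; the function names adjust_margin() and rake(), the toy data and the target totals are hypothetical.

```r
# One raking step: scale the weights so that the weighted totals of the
# categorical variable g match the named vector of population totals.
adjust_margin <- function(w, g, target) {
  current <- tapply(w, g, sum)
  ratio <- target[as.character(g)] / current[as.character(g)]
  w * as.vector(ratio)
}

# Alternate between the two margins until the first one is matched again.
rake <- function(w, a, b, target_a, target_b, tol = 1e-8, max_iter = 200) {
  for (iter in seq_len(max_iter)) {
    w <- adjust_margin(w, a, target_a)
    w <- adjust_margin(w, b, target_b)
    gap <- max(abs(tapply(w, a, sum)[names(target_a)] - target_a))
    if (gap < tol) break
  }
  w
}

# Hypothetical sample of six units with equal starting weights.
sex    <- c("f", "f", "m", "m", "m", "f")
region <- c("n", "s", "n", "s", "n", "s")
w0     <- rep(10, 6)

w <- rake(w0, sex, region,
          target_a = c(f = 40, m = 20),
          target_b = c(n = 35, s = 25))

print(tapply(w, sex, sum))
print(tapply(w, region, sum))
```

The sketch assumes that both sets of target totals sum to the same population size and that every combination of categories occurs in the sample; otherwise the iteration may not converge.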
Editing and Visual Inspection of Microdata
Editing tools:

Package
editrules
converts readable linear (in)equalities into matrix form.

Package
deducorrect
depends on package
editrules
and applies deductive correction of simple rounding, typing and
sign errors based on balanced edits. Values are changed so that the given balanced edits are fulfilled. To determine which values are changed, the Levenshtein metric is applied.

Package
SeleMix
can be used for selective editing of continuously scaled data.
A mixture model (Gaussian contamination model) based on response(s) y and a dependent set of covariates is fit to the data to
quantify the impact of errors on the estimates.

Package
rrcovNA
provides robust location and scatter estimation and robust
principal component analysis with high breakdown point for
incomplete data. It can therefore be used
to detect representative and nonrepresentative outliers.
Visual tools:

Package
VIM
is designed to visualize missing values
using suitable plot methods. It can be used to analyse the structure of missing values in microdata using univariate, bivariate, multiple and multivariate plots where the
information on missing values
from specified variables is highlighted in selected variables.
It also comes with a graphical user interface.

Package
tabplot
provides the tableplot visualization method, which is used to profile or explore large statistical datasets.
Up to a dozen variables are shown column-wise as bar charts (numeric variables) or stacked bar charts (factors).
Key aspects of the analysis with tableplots are the smoothness of a data distribution,
the selective occurrence of missing values, and the distribution of correlated variables.
Imputation
A distinction is made between iterative model-based methods, k-nearest neighbor methods
and miscellaneous methods. However, the criteria for using a
method often depend on the scale of the data, which in official statistics are
typically a mixture of continuous, semicontinuous, binary, categorical and
count variables. In addition, measurement errors may corrupt non-robust imputation methods.
Note that only a few imputation methods can deal with mixed types of variables and only a few methods account for robustness issues.
EM-based Imputation Methods:

Package
mi
provides iterative EM-based multiple Bayesian
regression imputation of missing values and model checking of the regression
models used. The regression models for each variable can also be
user-defined. The data set may consist of continuous, semicontinuous,
binary, categorical and/or count variables.

Package
mice
provides iterative EM-based multiple regression
imputation. The data set may consist of continuous, binary, categorical
and/or count variables.

Package
mitools
provides tools to perform analyses and combine
results from multiply imputed datasets.

Package
Amelia
provides multiple imputation where first bootstrap
samples with the same dimensions as the original data are drawn, and then
used for EM-based imputation. It is also possible to impute longitudinal
data. The package in addition comes with a graphical user interface.

Package
VIM
provides EM-based multiple imputation (function
irmi()) using robust estimation, which makes it possible to deal
adequately with data including outliers. It can handle data consisting of
continuous, semicontinuous, binary, categorical and/or count variables.

Package
mix
provides iterative EM-based multiple regression
imputation. The data set may consist of continuous, binary or categorical
variables, but methods for semicontinuous variables are missing.

Package
pan
provides multiple imputation for multivariate panel or
clustered data.

Package
norm
provides EM-based multiple imputation for
multivariate normal data.

Package
cat
provides EM-based multiple imputation for multivariate
categorical data.

Package
MImix
provides tools to combine results for
multiply imputed data using mixture approximations.

Package
robCompositions
provides iterative model-based imputation
for compositional data (function
impCoda()).
Nearest Neighbor Imputation Methods:

Package
VIM
provides an implementation of the popular
sequential and random (within a domain) hot-deck algorithm.

VIM
also provides a fast k-nearest neighbor (knn) algorithm which can be used for large data sets.
It uses a modification of the Gower distance for numerical, categorical, ordered, continuous and semicontinuous variables.

Package
yaImpute
performs popular nearest neighbor routines for
imputation of continuous variables where different metrics and methods can be
used for determining the distance between observations.

Function
SeqKNN()
in package
SeqKnn
imputes the
missing values in continuously scaled variables sequentially. First, it
separates the dataset into incomplete and complete observations. The
observations in the incomplete set are imputed in order of their missing rate.
Once the missing values in an observation are imputed, the imputed
observation is moved into the complete set. It can only be applied to continuously scaled variables.

Package
robCompositions
provides knn imputation for
compositional data (function
impKNNa()) using the Aitchison
distance and adjustment of the nearest neighbor.

Package
rrcovNA
provides an algorithm for (robust) sequential imputation (functions
impSeq()
and
impSeqRob())
by minimizing the determinant of the covariance of the augmented data matrix. Its application is limited to continuously scaled data.

Package
impute
on Bioconductor provides knn imputation of continuous
variables.
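
The sequential nearest neighbor idea described in this section can be sketched in base R as follows. This is a simplified illustration and not the code of SeqKNN() or of any other function named above: it assumes that observations with the fewest missing values are imputed first, uses the Euclidean distance on the observed columns and fills in the unweighted mean of the k nearest complete observations. The function name seq_knn_impute() and the toy data are hypothetical.

```r
seq_knn_impute <- function(x, k = 2) {
  x <- as.matrix(x)
  n_miss <- rowSums(is.na(x))

  # Start with the fully observed rows (at least one is assumed to exist).
  complete <- x[n_miss == 0, , drop = FALSE]

  # Process incomplete rows, those with the fewest missing values first.
  todo <- which(n_miss > 0)
  todo <- todo[order(n_miss[todo])]

  for (i in todo) {
    obs <- !is.na(x[i, ])

    # Euclidean distance to every complete row, using observed columns only.
    target <- matrix(x[i, obs], nrow(complete), sum(obs), byrow = TRUE)
    d <- sqrt(rowSums((complete[, obs, drop = FALSE] - target)^2))

    # Impute with the column means of the k nearest complete rows.
    nn <- order(d)[seq_len(min(k, nrow(complete)))]
    x[i, !obs] <- colMeans(complete[nn, !obs, drop = FALSE])

    # The imputed row now counts as complete for later imputations.
    complete <- rbind(complete, x[i, ])
  }
  x
}

# Hypothetical data: three complete and two incomplete observations.
demo <- rbind(c(1, 2, 3),
              c(2, 3, 4),
              c(10, 11, 12),
              c(1.5, NA, 3.5),
              c(NA, NA, 11))
imputed <- seq_knn_impute(demo, k = 2)
print(imputed)
```

Because imputed rows are reused as donors, the order of processing matters; this is the property that distinguishes the sequential variant from plain knn imputation.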
Miscellaneous Imputation Methods:

Package
missMDA
allows imputing incomplete continuous variables
by principal component analysis (PCA) or categorical variables by multiple
correspondence analysis (MCA).

Package
mice
(function
mice.impute.pmm()) and
package
Hmisc
(function
aregImpute()) allow
predictive mean matching imputation.

Package
VIM
allows visualizing the structure of missing values
using suitable plot methods. It also comes with a graphical user interface.
Statistical Disclosure Control
Data from statistical agencies and other institutions are mostly
confidential in their raw form, and data providers have to ensure confidentiality by
modifying the original data so that no statistical units can be
re-identified, while keeping the loss of information to a minimum.

Package
sdcMicro
can be used for the generation of confidential
(micro)data, i.e. for the generation of public- and scientific-use files.
The package also comes with a graphical user interface.

Package
simPopulation
simulates synthetic, confidential, close-to-reality populations for
surveys based on sample data. Such population data can then be used for
extensive simulation studies in official statistics, using
simFrame
for example.

Package
sdcTable
can be used to provide confidential
(hierarchical) tabular data. It includes the HITAS and the HYPERCUBE
technique and uses package
lpSolve
for solving (a large number of)
linear programs.
Seasonal Adjustment
For general time series methodology we refer to the
TimeSeries
task view.

Decomposition of time series can be done with the function
decompose(), or in a more advanced way with the function
stl(), both from the base
stats
package.
Decomposition is also possible with the
StructTS()
function,
which can also be found in the
stats
package.

Many powerful tools can be accessed via package
x12. It provides
a wrapper function and GUI for the
X12 binaries
under
Windows, which have to be installed first.
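
As a small illustration of the base-R functions decompose() and stl() mentioned above, the following sketch decomposes the monthly co2 series shipped with R and derives a seasonally adjusted series from each fit. The choice of data set and of s.window = "periodic" are assumptions made for the example, not recommendations.

```r
# Monthly atmospheric CO2 concentrations, a ts object shipped with R.
series <- co2

# Classical additive decomposition based on moving averages.
dec <- decompose(series)

# Loess-based decomposition; "periodic" forces a constant seasonal pattern.
fit <- stl(series, s.window = "periodic")

# Seasonally adjusted series: subtract the estimated seasonal component.
adj_dec <- series - dec$seasonal
adj_stl <- series - fit$time.series[, "seasonal"]

# The twelve monthly seasonal effects estimated by decompose().
print(round(dec$figure, 2))
```

Both objects have plot methods, so plot(dec) and plot(fit) show the trend, seasonal and remainder components in separate panels.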
Statistical Record Matching

Package
StatMatch
provides functions to perform statistical
matching between two data sources sharing a number of common variables. It
creates a synthetic data set after matching of two data sources via a
likelihood approach or via hot-deck.

Package
RecordLinkage
provides functions for linking and
deduplicating data sets.

Package
MatchIt
allows nearest neighbor matching, exact matching, optimal matching and full matching, among
other matching methods. If two data sets have to be matched, the data must come as one data frame including a factor
variable that indicates the membership of each observation.
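
The data layout described for MatchIt, one data frame with a factor indicating membership, can be prepared in base R as sketched below. The two data sets, their variables and the column name source are hypothetical, and the sketch stops before any matching function is called.

```r
# Two hypothetical data sources sharing the variables age and income.
survey_a <- data.frame(age = c(23, 45, 31), income = c(1800, 3200, 2500))
survey_b <- data.frame(age = c(25, 44),     income = c(1900, 3100))

# Record the origin of each observation before stacking.
survey_a$source <- "A"
survey_b$source <- "B"

# Stack the rows and turn the membership column into a factor.
stacked <- rbind(survey_a, survey_b)
stacked$source <- factor(stacked$source, levels = c("A", "B"))

print(table(stacked$source))
```

Only the variables common to both sources can be stacked this way; variables observed in just one source would have to be dropped or filled with NA first.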
Indices and Indicators

Package
laeken
provides functions to estimate popular
risk-of-poverty and inequality indicators (at-risk-of-poverty rate, quintile
share ratio, relative median risk-of-poverty gap, Gini coefficient).
In addition, standard and robust methods for tail modeling of Pareto
distributions are provided for semiparametric estimation of indicators
from continuous univariate distributions such as income variables.

Package
ineq
computes various inequality measures (Gini, Theil,
entropy, among others), concentration measures (Herfindahl, Rosenbluth), and poverty
measures (Watts, Sen, SST, and Foster). It also computes and draws empirical and theoretical
Lorenz curves as well as Pen's parade. It is not designed to deal with sampling weights directly
(these could only be emulated via
rep(x, weights)).

Function
priceIndex()
from package
micEcon
allows one to
estimate the Paasche, the Fisher and the Laspeyres price indices.
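
The three price indices named above can be written out in a few lines of base R. The sketch below uses the textbook formulas directly rather than priceIndex(); the prices and quantities for the base period (p0, q0) and the current period (p1, q1) are made-up numbers.

```r
# Hypothetical prices and quantities of three goods in two periods.
p0 <- c(2, 5, 10); q0 <- c(10, 4, 1)   # base period
p1 <- c(3, 5, 12); q1 <- c(8, 5, 1)    # current period

# Laspeyres: current prices valued at base-period quantities.
laspeyres <- sum(p1 * q0) / sum(p0 * q0)

# Paasche: current prices valued at current-period quantities.
paasche <- sum(p1 * q1) / sum(p0 * q1)

# Fisher: geometric mean of the Laspeyres and Paasche indices.
fisher <- sqrt(laspeyres * paasche)

print(c(Laspeyres = laspeyres, Paasche = paasche, Fisher = fisher))
```

By construction the Fisher index always lies between the other two.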
Additional Packages and Functionalities

Package
samplingbook
includes sampling procedures from the book
'Stichproben. Methoden und praktische Umsetzung mit R' by Goeran Kauermann
and Helmut Kuechenhoff (2010).

Package
SDaA
is designed to reproduce results from Lohr, S. (1999)
'Sampling: Design and Analysis' (Duxbury) and includes the data sets from this
book.

Package
TeachingSampling
includes functionality for sampling
designs and parameter estimation in finite populations.

Package
memisc
includes tools for the management of survey data,
graphics and simulation.

Package
odfWeave.survey
provides support for
odfWeave
for the
survey
package.

Package
spsurvey
includes facilities for spatial survey design and
analysis for equal and unequal probability (stratified) sampling.

The
FFD
package is designed to calculate optimal sample sizes of a population of animals
living in herds for surveys to substantiate freedom from disease.
The sample size calculations take the herd-level clustering of
diseases as well as imperfect diagnostic tests into account, and samples are selected
based on a two-stage design. Inclusion probabilities are not considered in the estimation.
The package provides a graphical user interface as well.

Package
nlme
provides facilities to fit Gaussian linear and nonlinear mixed-effects models and
lme4
provides facilities to fit linear and generalized linear mixed-effects models, both used in
small area estimation.

The
pxR
package provides a set of functions for reading
and writing PC-Axis files, used by different statistical
organizations around the globe for dissemination of their (multidimensional) tables.