\name{clusterSetup}
\Rdversion{1.1}
\docType{methods}
\alias{clusterSetup}
\alias{clusterSetup-methods}
\alias{clusterSetup,ANY,data.frame,character-method}
\alias{clusterSetup,ANY,data.frame,missing-method}
\alias{clusterSetup,ANY,data.frame,SampleControl-method}
%% aliases to avoid confusion due to capitalization
\alias{clustersetup}
\alias{ClusterSetup}
\alias{Clustersetup}

\title{Set up multiple samples on a snow cluster}
\description{
Generic function for setting up multiple samples on a \code{snow} cluster.
}
\usage{
clusterSetup(cl, x, control, \dots)

\S4method{clusterSetup}{ANY,data.frame,SampleControl}(cl, x, control)
}
\arguments{
  \item{cl}{a \code{snow} cluster.}
  \item{x}{the \code{data.frame} to sample from.}
  \item{control}{a control object inheriting from the virtual class 
    \code{"VirtualSampleControl"} or a character string specifying such a 
    control class (the default being \code{"SampleControl"}).}
  \item{\dots}{if \code{control} is a character string or missing, the slots of 
    the control object may be supplied as additional arguments.  See 
    \code{"\linkS4class{SampleControl}"} for details on the slots.}
}
\details{
  A fundamental design principle of the framework in the case of design-based 
  simulation studies is that the sampling procedure is separated from the 
  simulation procedure.  Two main advantages arise from setting up all samples 
  in advance. 
  
  First, drawing all samples in advance can reduce overall computation time 
  dramatically, since computationally intensive tasks such as stratification 
  need to be performed only once.  This is particularly relevant for large 
  population data.  In close-to-reality simulation studies carried out in 
  research projects in survey statistics, often up to 10000 samples are drawn 
  from a population of millions of individuals with stratified sampling 
  designs.  For such large data sets, stratification takes a considerable 
  amount of time and is a very memory-intensive task.  If the samples are taken 
  on-the-fly, i.e., in every simulation run one sample is drawn, the function 
  to take the stratified sample would typically split the population into the 
  different strata in each of the 10000 simulation runs.  If all samples are 
  drawn in advance, on the other hand, the population data need to be split 
  only once and all 10000 samples can be taken from the respective strata 
  together.
  
  Second, the samples can be stored permanently, which simplifies the 
  reproduction of simulation results and may help to maximize comparability of 
  results obtained by different partners in a research project.  In particular, 
  this is useful for large population data, when complex sampling techniques 
  may be very time-consuming.  In research projects involving several 
  partners, different groups usually investigate different kinds of estimators. 
  If these groups use not only the same population data, but also the same 
  previously set up samples, their results are highly comparable.
  
  The computational performance of setting up multiple samples can be increased 
  by parallel computing.  In \code{simFrame}, parallel computing is implemented 
  using the package \code{snow}.  Note that all objects and packages required 
  for the computations (including \code{simFrame}) need to be made available on 
  every worker process.
   
  To avoid overlapping or correlated random numbers on the worker processes 
  and to ensure reproducibility, random number streams should be used.  In \R, 
  the packages \code{rlecuyer} and \code{rsprng} are available for creating 
  random number streams, which are supported by \code{snow} via the function 
  \code{clusterSetupRNG}.
  
  The control class \code{"SampleControl"} is highly flexible and allows 
  stratified sampling as well as sampling of whole groups rather than 
  individuals with a specified sampling method.  Hence it is often sufficient 
  to implement the desired sampling method for the simple non-stratified case 
  to extend the existing framework.  See \code{"\linkS4class{SampleControl}"} 
  for some restrictions on the argument names of such a function, which should 
  return a vector containing the indices of the sampled observations.
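
  As a minimal sketch of such an extension, the hypothetical function 
  \code{systematic} below implements systematic sampling for the simple 
  non-stratified case.  The function is not part of \code{simFrame}, and the 
  argument names \code{N} (population size) and \code{size} (sample size) are 
  assumed to match the conventions described in 
  \code{"\linkS4class{SampleControl}"}.
  \preformatted{
# hypothetical sampling function (illustration only, not part of simFrame)
# N    ... population size (assumed argument name)
# size ... sample size (assumed argument name)
systematic <- function(N, size) {
    step <- N / size                        # sampling interval
    start <- runif(1, min = 0, max = step)  # random start in first interval
    # return the indices of the sampled observations
    floor(start + step * (seq_len(size) - 1)) + 1
}
  }
  Assuming a running \code{snow} cluster \code{cl}, such a function could then 
  be supplied via the slot \code{fun}, e.g., 
  \code{clusterSetup(cl, x, fun = systematic, size = 20, k = 4)}.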
  
  Nevertheless, for very complex sampling procedures, it is possible to define 
  a control class \code{"MySampleControl"} extending 
  \code{"\linkS4class{VirtualSampleControl}"}, and the corresponding method 
  \code{clusterSetup(cl, x, control)} with signature 
  \code{'ANY, data.frame, MySampleControl'}.  For good computational 
  performance, such a method should set up the multiple samples efficiently.  
  The slot \code{k} of \code{"VirtualSampleControl"} must be used to control 
  the number of samples, and the resulting object must be of class 
  \code{"\linkS4class{SampleSetup}"}.
}
\value{
  An object of class \code{"SampleSetup"}.
}
\section{Methods}{
  \describe{
  \item{\code{cl = "ANY", x = "data.frame", control = "character"}}{set up 
    multiple samples on a \code{snow} cluster using a control class specified 
    by the character string \code{control}.  The slots of the control object 
    may be supplied as additional arguments.}
  \item{\code{cl = "ANY", x = "data.frame", control = "missing"}}{set up 
    multiple samples on a \code{snow} cluster using a control object of class 
    \code{"SampleControl"}.  Its slots may be supplied as additional arguments.}
  \item{\code{cl = "ANY", x = "data.frame", control = "SampleControl"}}{set up 
    multiple samples on a \code{snow} cluster as defined by the control object 
    \code{control}.}
  }
}
\author{Andreas Alfons}
\references{
Alfons, A., Templ, M. and Filzmoser, P. (2010) An Object-Oriented Framework for 
Statistical Simulation: The \R Package \pkg{simFrame}. \emph{Journal of 
Statistical Software}, \bold{37}(3), 1--36. URL 
\url{http://www.jstatsoft.org/v37/i03/}.

L'Ecuyer, P., Simard, R., Chen, E. and Kelton, W. (2002) An Object-Oriented 
Random-Number Package with Many Long Streams and Substreams. \emph{Operations 
Research}, \bold{50}(6), 1073--1075.

Mascagni, M. and Srinivasan, A. (2000) Algorithm 806: \code{SPRNG}: A Scalable 
Library for Pseudorandom Number Generation. \emph{ACM Transactions on 
Mathematical Software}, \bold{26}(3), 436--461.

Rossini, A., Tierney, L. and Li, N. (2007) Simple Parallel Statistical Computing 
in \R. \emph{Journal of Computational and Graphical Statistics}, \bold{16}(2), 
399--420.

Tierney, L., Rossini, A. and Li, N. (2009) \code{snow}: A Parallel Computing 
Framework for the \R System. \emph{International Journal of Parallel 
Programming}, \bold{37}(1), 78--90.
}
\seealso{
  \code{\link[snow:snow-startstop]{makeCluster}}, 
  \code{\link[snow:snow-rand]{clusterSetupRNG}}, 
  \code{\link{setup}}, \code{\link{draw}}, \code{"\linkS4class{SampleControl}"}, 
  \code{"\linkS4class{VirtualSampleControl}"}, 
  \code{"\linkS4class{SampleSetup}"}
}
\examples{
\dontrun{
# these examples require at least a dual-core processor

# load data
data(eusilcP)

# start snow cluster
cl <- makeCluster(2, type = "SOCK")

# load package and data on workers
clusterEvalQ(cl, {
        library(simFrame)
        data(eusilcP)
    })
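
# optional sketch: set up random number streams on the workers so that 
# the samples are reproducible (assumes that package rlecuyer is 
# installed; the seed values are arbitrary and chosen for illustration)
clusterSetupRNG(cl, type = "RNGstream", seed = rep(12345, 6))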

# simple random sampling
srss <- clusterSetup(cl, eusilcP, size = 20, k = 4)
summary(srss)
draw(eusilcP[, c("id", "eqIncome")], srss, i = 1)

# group sampling
gss <- clusterSetup(cl, eusilcP, grouping = "hid", size = 10, k = 4)
summary(gss)
draw(eusilcP[, c("hid", "id", "eqIncome")], gss, i = 2)

# stratified simple random sampling
ssrss <- clusterSetup(cl, eusilcP, design = "region", 
    size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(ssrss)
draw(eusilcP[, c("id", "region", "eqIncome")], ssrss, i = 3)

# stratified group sampling
sgss <- clusterSetup(cl, eusilcP, design = "region", 
    grouping = "hid", size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(sgss)
draw(eusilcP[, c("hid", "id", "region", "eqIncome")], sgss, i = 4)

# stop snow cluster
stopCluster(cl)
}
}
\keyword{distribution}
\keyword{methods}
