% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/cluster_pair_minsim.R
\name{cluster_pair_minsim}
\alias{cluster_pair_minsim}
\title{Generate pairs with a minimal similarity using multiple processes}
\usage{
cluster_pair_minsim(
  cluster,
  x,
  y,
  on,
  minsim = 0,
  comparators = list(default_comparator),
  default_comparator = identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  name = "default"
)
}
\arguments{
\item{cluster}{a cluster object as created by \code{\link[parallel]{makeCluster}}
from \code{parallel} or \code{\link[snow]{makeCluster}} from \code{snow}.}

\item{x}{first \code{data.frame}}

\item{y}{second \code{data.frame}. Ignored when \code{deduplication = TRUE}.}

\item{on}{the variables on which the records of \code{x} and \code{y} are 
compared; the similarity score of a pair is computed from these variables.}

\item{minsim}{minimal similarity score.}

\item{comparators}{named list of functions with which the variables are compared. 
Each function should accept two vectors and return either a vector
or a \code{data.table} with multiple columns.}

\item{default_comparator}{variables for which no comparison function is defined using
\code{comparators} are compared with the function \code{default_comparator}.}

\item{keep_simsum}{add a variable \code{simsum} to the result with the similarity 
score of the pair.}

\item{deduplication}{generate pairs from only \code{x}. Ignore \code{y}. This 
is useful for deduplication of \code{x}.}

\item{name}{the name of the resulting object to create locally on the different
R processes.}
}
\value{
An object of type \code{cluster_pairs}, which is a \code{list} containing the
cluster and the name of the pairs object on the cluster nodes. For the pairs
objects created on the nodes see the documentation of \code{\link{pair}}.
}
\description{
Generates all pairs of records from \code{x} and \code{y} whose similarity 
score is at least \code{minsim}, using multiple R processes.
}
\details{
Generating (all) pairs of the records of two data sets is usually the first 
step when linking the two data sets. However, this often results in too 
many pairs. \code{pair_minsim} only keeps pairs with a 
similarity score greater than or equal to \code{minsim}. The similarity score is
calculated by summing the results of the comparators for all variables 
of \code{on}.
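
As a rough illustration of how such a score could be formed (a sketch in base
R, not the package's implementation), assume two hypothetical records \code{a}
and \code{b} and a comparator that returns \code{TRUE} when two values are
identical:

\preformatted{
# Hypothetical records; not part of the package's example data
a <- list(postcode = "6789 XY", address = "12 Mainroad")
b <- list(postcode = "6789 XY", address = "12 Mainr")
on <- c("postcode", "address")
# One comparison result per variable in 'on' (TRUE counts as 1)
scores <- vapply(on, function(v) identical(a[[v]], b[[v]]), logical(1))
simsum <- sum(scores)
# The pair is kept only when the summed score reaches minsim
minsim <- 1
keep <- simsum >= minsim
}

Here only the postcode agrees, so the summed score is 1 and the pair would be
kept for \code{minsim = 1} but dropped for \code{minsim = 2}.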

\code{x} is split into \code{length(cluster)} parts, which are distributed
over the worker nodes. \code{y} is copied to each of the nodes. On the nodes,
\code{\link{pair_minsim}} is then called. The pairs are stored in the global
object \code{reclin_env} on the nodes in the variable \code{name}. The pairs
can then be further processed using functions such as
\code{\link{compare_pairs}} and \code{\link{tabulate_patterns}}. The function
\code{\link{cluster_collect}} collects the pairs from each of the nodes.
}
\examples{
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
# Either address or postcode has to match to keep a pair
pairs <- cluster_pair_minsim(cl, linkexample1, linkexample2, 
   on = c("postcode", "address"), minsim = 1)
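# Possible next steps, sketched under the assumption of a typical workflow
# (the variable name 'local_pairs' is only illustrative):
# compare the pairs on the nodes, then collect them into the local session.
compare_pairs(pairs, on = c("postcode", "address"))
local_pairs <- cluster_collect(pairs)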
stopCluster(cl)

}
\seealso{
\code{\link{cluster_pair}} and \code{\link{cluster_pair_blocking}} are 
other methods to generate pairs.
}
