% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/match_df.R
\name{match_df}
\alias{match_df}
\title{Check and clean spelling or codes of multiple variables in a data frame}
\usage{
match_df(
  x = data.frame(),
  dictionary = list(),
  from = 1,
  to = 2,
  by = 3,
  order = NULL,
  warn = FALSE
)
}
\arguments{
\item{x}{a data frame}

\item{dictionary}{a data frame or named list of data frames with at least two
columns defining the word list to be used. If this is a data frame, a third
column must be present to split the dictionary by column in \code{x} (see
\code{by}).}

\item{from}{a column name or position defining words or keys to be replaced}

\item{to}{a column name or position defining replacement values}

\item{by}{character or integer. If \code{dictionary} is a data frame,
then this column defines the columns in \code{x} corresponding to each
section of the \code{dictionary} data frame. This defaults to \code{3}, indicating
that the third column is to be used.}

\item{order}{a character string naming the column of the dictionary to be
used for sorting the values in each data frame. If the incoming variables are
factors, this determines how the levels of the resulting factors will be
sorted.}

\item{warn}{if \code{TRUE}, warnings and errors from \code{\link[=match_vec]{match_vec()}} will be
shown as a single warning. Defaults to \code{FALSE}, which shows nothing.}
}
\value{
a data frame with re-defined data based on the dictionary
}
\description{
This function allows you to clean your data according to
pre-defined rules encapsulated in either a data frame or list of data frames.
It has application for addressing mis-spellings and recoding variables (e.g.
from electronic survey data).
}
\details{
By default, this applies the function \code{\link[=match_vec]{match_vec()}} to all
columns of \code{x} whose names are listed in the \code{by} column of the
dictionary. If a global dictionary is used, it is also applied to all
\code{character} and \code{factor} columns.

\subsection{\code{by} column}{

Values in the \code{by} column of \code{dictionary} are keys that are matched
to column names in \code{x} (the data set). These are expected to match exactly,
with the exception of two reserved keywords that start with a full stop:
\itemize{
\item \code{.regex [pattern]}: any column whose name is matched by \verb{[pattern]}. The
\verb{[pattern]} should be an unquoted, valid, PERL-flavored regular expression.
\item \code{.global}: any column (see Section \emph{Global dictionary})
}

}

\subsection{Global dictionary}{

A global dictionary is a set of definitions applied to all valid columns of
\code{x} indiscriminately.
\itemize{
\item \strong{.global keyword in \code{by}}: If you want to apply a set of definitions to
all valid columns in addition to specified columns, then you can include a
\code{.global} group in the \code{by} column of your \code{dictionary} data frame. This is
useful for setting up a dictionary of common spelling errors. \emph{NOTE:
specific variable definitions will override global definitions.} For
example: if you have a column for cardinal directions and a definition for
\code{N = North}, then the global definition \code{N = no} will not override that. See
Example.
\item \strong{\code{by = NULL}}: If you want your dictionary data frame to be
applied to all character/factor columns indiscriminately, then setting
\code{by = NULL} will use that dictionary globally.
}

}
}
\examples{

# Read in dictionary and coded data examples --------------------

dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE)
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Clean spelling based on dictionary -----------------------------

dict # show the dict
head(dat) # show the data

res1 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp")
head(res1)

# Show warnings/errors from each column --------------------------
# Internally, the `match_vec()` function can be quite noisy with warnings for
# various reasons. Thus, by default, the `match_df()` function will keep
# these quiet, but you can have them printed to your console if you use the
# warn = TRUE option:

res1 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp",
  warn = TRUE)
head(res1)


# Order the factor levels ----------------------------------------
# You can ensure the order of the factor levels is correct by specifying
# a dictionary column that defines the order.

dat[] <- lapply(dat, as.factor)
as.list(head(dat))
res2 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp",
  order = "orders")
head(res2)
as.list(head(res2))
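
# Global dictionaries (illustrative sketch) ----------------------
# The `toy` data and `toy_dict` dictionary below are made up for this
# example; they are not shipped with the package. The sketch follows the
# "Global dictionary" section: `.global` entries apply to every valid column,
# while entries for a named column take precedence in that column.

toy <- data.frame(
  direction = c("N", "S", "N"),
  consent   = c("Y", "N", "Y"),
  stringsAsFactors = FALSE
)
toy_dict <- data.frame(
  options = c("Y", "N", "N", "S"),
  values  = c("yes", "no", "North", "South"),
  grp     = c(".global", ".global", "direction", "direction"),
  stringsAsFactors = FALSE
)

# Per the details above, "N" in `direction` should be recoded using the
# column-specific definition, while "N" in `consent` should fall back to the
# global definition.
match_df(toy,
  dictionary = toy_dict,
  from = "options",
  to = "values",
  by = "grp")

# Setting by = NULL is documented to apply a two-column dictionary to all
# character/factor columns, so here the same definitions reach every column.
match_df(toy,
  dictionary = toy_dict[1:2, c("options", "values")],
  from = "options",
  to = "values",
  by = NULL)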
}
\seealso{
\code{\link[=match_vec]{match_vec()}}, which this function wraps.
}
\author{
Zhian N. Kamvar

Patrick Barks
}
