% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dedupe_wide.R
\name{dedupe_wide}
\alias{dedupe_wide}
\title{Dedupe across multiple columns}
\usage{
dedupe_wide(
  x,
  cols_dedupe,
  cols_expand = NULL,
  max_new_cols = NULL,
  enable_drop = TRUE
)
}
\arguments{
\item{x}{A data.frame with no column named '....idx' and no column whose name ends with four dots followed by a number (e.g. 'column....2').}

\item{cols_dedupe}{A character vector (minimum length 2) of column names in \code{x} used to dedupe. Deduplicated data from these columns will be saved into new columns, the number of which is controlled by \code{max_new_cols}.}

\item{cols_expand}{A character vector of column names in \code{x}, or \code{NULL} (meaning: none except those used to dedupe), indicating columns whose data should be kept when it is non-consistent across collapsed rows, i.e. unique data from these columns will be saved into new columns, the number of which is controlled by \code{max_new_cols}.}

\item{max_new_cols}{A numeric vector of length 1 or \code{NULL} (meaning: no limit) indicating how many new columns can be created to store unique data from the columns passed to \code{cols_dedupe} and from each column passed to \code{cols_expand}. Cannot be lower than 1.}

\item{enable_drop}{A logical vector of length 1: should a column be dropped if, after deduplication, it contains only missing data (\code{NA})? Applies only to columns used to dedupe.}
}
\value{
If duplicated data are found, a data.frame with changed column names and optionally additional columns (in some cases fewer columns, depending on the \code{enable_drop} argument). Otherwise, the data.frame without changes (except that row names are removed).
}
\description{
Collapse many rows connected by duplicated data (which can exist in different
rows and columns) into one, based on data in chosen columns, optionally putting
non-consistent data into newly created additional columns.
}
\details{
Columns passed to \code{cols_dedupe} must be atomic.

Row names will always be removed. If you want to preserve row names, put them into a separate column. Note that if this column is not passed to the \code{cols_expand} argument, only one row name will be preserved for each group of duplicated rows (the row name closest to the top of the table).

Although \code{\link[base]{duplicated}} and \code{\link[base]{unique}} treat missing data (\code{NA}) as duplicated data, this function does not (see the second example below).

The types of the columns passed to \code{cols_dedupe} will be coerced to the most general type among them.
}
\note{
Internally, the function is mainly based on \code{\link[=data.table]{data.table}} functions, so parallel computation
can be enabled. To do this, call \code{\link[data.table]{setDTthreads}} before calling the \code{dedupe_wide} function.
}
\examples{
x <- data.frame(tel_1 = c(111, 222, 444, 555),
                tel_2 = c(222, 666, 666, 555),
                name = paste0("name", 1:4))
# rows 1, 2 and 3 are linked by shared phone numbers (222 and 666)

dedupe_wide(x,
           cols_dedupe = c("tel_1", "tel_2"),
           cols_expand = "name")
# first three rows collapsed into one; for name4 only one phone number (555) is kept
# 'name1', 'name2', 'name3' kept in new columns
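
# A minimal sketch of the row-name advice in Details (not part of the original
# examples); the object 'z' and the column name 'row_name' are illustrative
# assumptions
z <- data.frame(tel_1 = c(111, 222, 444, 555),
                tel_2 = c(222, 666, 666, 555),
                name = paste0("name", 1:4),
                row.names = paste0("r", 1:4))
z$row_name <- rownames(z)  # copy row names into an ordinary column
dedupe_wide(z,
           cols_dedupe = c("tel_1", "tel_2"),
           cols_expand = c("name", "row_name"))
# passing 'row_name' to cols_expand asks for every row name of the collapsed
# rows to be kept, not only the topmost one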

y <- data.frame(tel_1 = c(777, 888, NA, NA),
                tel_2 = c(888, 777, NA, NA),
                name = paste0("name", 5:8))
# rows 3 and 4 have only missing data

dedupe_wide(y,
           cols_dedupe = c("tel_1", "tel_2"),
           cols_expand = "name")
# first two rows collapsed into one, nothing changes for the remaining rows
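
# For comparison with the NA behaviour noted in Details, a base R sketch (not
# part of the original examples; 'na_rows' is an illustrative name)
na_rows <- data.frame(tel_1 = c(NA, NA),
                      tel_2 = c(NA, NA))
duplicated(na_rows)
# base R flags the second all-NA row as a duplicate, whereas dedupe_wide()
# does not treat missing data as duplicated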
}
