% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/topic-nse.R
\name{topic-data-mask}
\alias{topic-data-mask}
\title{What is data-masking and why do I need \verb{\{\{}?}
\description{
Data-masking is a distinctive feature of R whereby programming is performed directly on a data set, with columns defined as normal objects.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Unmasked programming
mean(mtcars$cyl + mtcars$am)
#> [1] 6.59375

# Referring to columns is an error - Where is the data?
mean(cyl + am)
#> Error in mean(cyl + am): object 'cyl' not found

# Data-masking
with(mtcars, mean(cyl + am))
#> [1] 6.59375
}\if{html}{\out{</div>}}

While data-masking makes it easy to program interactively with data frames, it makes it harder to create functions. Passing data-masked arguments to functions requires injection with the embracing operator \ifelse{html}{\code{\link[=embrace-operator]{\{\{}}}{\verb{\{\{}} or, in more complex cases, the injection operator \code{\link{!!}}.
}
\section{Why does data-masking require embracing and injection?}{
Injection (also known as quasiquotation) is a metaprogramming feature that allows you to modify parts of a program. This is needed because, under the hood, data-masking works by \link[=topic-defuse]{defusing} R code to prevent its immediate evaluation. Evaluation of the defused code is resumed later on in a context where data frame columns are defined.

Let's see what happens when we pass arguments to a data-masking function like \code{summarise()} in the normal way:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{my_mean <- function(data, var1, var2) \{
  dplyr::summarise(data, mean(var1 + var2))
\}

my_mean(mtcars, cyl, am)
#> Error in `dplyr::summarise()`:
#> i In argument: `mean(var1 + var2)`.
#> Caused by error in `mean()`:
#> ! object 'cyl' not found
}\if{html}{\out{</div>}}

The problem here is that \code{summarise()} defuses the R code it was supplied, i.e. \code{mean(var1 + var2)}, and that code refers to \code{var1} and \code{var2} rather than to the columns passed by the caller. Instead we want it to see \code{mean(cyl + am)}. This is why we need injection: we need to modify that piece of code by injecting the code supplied to the function in place of \code{var1} and \code{var2}.

To inject a function argument in a data-masked context, just embrace it with \verb{\{\{}:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{my_mean <- function(data, var1, var2) \{
  dplyr::summarise(data, mean(\{\{ var1 \}\} + \{\{ var2 \}\}))
\}

my_mean(mtcars, cyl, am)
#> # A tibble: 1 x 1
#>   `mean(cyl + am)`
#>              <dbl>
#> 1             6.59
}\if{html}{\out{</div>}}
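
Embracing abbreviates the two-step defuse-and-inject pattern. The sketch below spells that pattern out for comparison. It is only illustrative: \code{my_mean2()} is a hypothetical name, and the code assumes that rlang and dplyr are installed. Each argument is defused with \code{enquo()} and injected back with \verb{!!}.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Hypothetical helper, written without the embracing operator
my_mean2 <- function(data, var1, var2) \{
  # Defuse the arguments so they are not evaluated right away
  var1 <- rlang::enquo(var1)
  var2 <- rlang::enquo(var2)

  # Inject the defused code into the data-masked expression
  dplyr::summarise(data, mean(!!var1 + !!var2))
\}

# Extract the single summary value
my_mean2(mtcars, cyl, am)[[1]]
#> [1] 6.59375
}\if{html}{\out{</div>}}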

See \ifelse{html}{\link[=topic-data-mask-programming]{Data mask programming patterns}}{\link[=topic-data-mask-programming]{Data mask programming patterns}} to learn more about creating functions around data-masking functions.
}

\section{What does "masking" mean?}{
In normal R programming, objects are defined in the current environment, for instance in the global environment or the environment of a function.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{factor <- 1000

# Can now use `factor` in computations
mean(mtcars$cyl * factor)
#> [1] 6187.5
}\if{html}{\out{</div>}}

This environment also contains all functions currently in scope. In a script this includes the functions attached with \code{library()} calls; in a package, the functions imported from other packages. If evaluation were performed only in the data frame, we'd lose track of these objects and functions necessary to perform computations.

To keep these objects and functions in scope, the data frame is inserted at the bottom of the current chain of environments. It comes first and has precedence over the user environment. In other words, it \emph{masks} the user environment.
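
As a small illustration using only base R, the expression below mixes a data frame column with an object from the user environment. \code{cyl} exists only in \code{mtcars}, while \code{factor} and \code{mean()} are found further up the chain of environments.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# `factor` is defined in the user environment, not in the data frame
factor <- 1000

# `cyl` is found in the mask; `factor` and `mean()` in the environments behind it
with(mtcars, mean(cyl * factor))
#> [1] 6187.5
}\if{html}{\out{</div>}}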

Since masking blends the data and the user environment by giving priority to the former, R can sometimes use a data frame column when you really intended to use a local object.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Defining an env-variable
cyl <- 1000

# Referring to a data-variable
dplyr::summarise(mtcars, mean(cyl))
#> # A tibble: 1 x 1
#>   `mean(cyl)`
#>         <dbl>
#> 1        6.19
}\if{html}{\out{</div>}}

The tidy eval framework provides \link[=.data]{pronouns} to help disambiguate between the mask and user contexts. It is often a good idea to use these pronouns in production code.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{cyl <- 1000

mtcars \%>\%
  dplyr::summarise(
    mean_data = mean(.data$cyl),
    mean_env = mean(.env$cyl)
  )
#> # A tibble: 1 x 2
#>   mean_data mean_env
#>       <dbl>    <dbl>
#> 1      6.19     1000
}\if{html}{\out{</div>}}

Read more about this in \ifelse{html}{\link[=topic-data-mask-ambiguity]{The data mask ambiguity}}{\link[=topic-data-mask-ambiguity]{The data mask ambiguity}}.
}

\section{How does data-masking work?}{
Data-masking relies on three language features:
\itemize{
\item \link[=topic-defuse]{Argument defusal} with \code{\link[=substitute]{substitute()}} (base R) or \code{\link[=enquo]{enquo()}}, \code{\link[=enquos]{enquos()}}, and \ifelse{html}{\code{\link[=embrace-operator]{\{\{}}}{\verb{\{\{}} (rlang). R code is defused so it can be evaluated later on in a special environment enriched with data.
\item First-class environments. Environments are a special type of list-like object in which defused R code can be evaluated. The named elements in an environment define objects. Lists and data frames can be transformed to environments:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{as.environment(mtcars)
#> <environment: 0x7febb17e3468>
}\if{html}{\out{</div>}}
\item Explicit evaluation with \code{\link[=eval]{eval()}} (base R) or \code{\link[=eval_tidy]{eval_tidy()}} (rlang). When R code is defused, evaluation is interrupted. It can be resumed later on with \code{\link[=eval]{eval()}}:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expr(1 + 1)
#> 1 + 1

eval(expr(1 + 1))
#> [1] 2
}\if{html}{\out{</div>}}

By default \code{eval()} and \code{eval_tidy()} evaluate in the current environment.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{code <- expr(mean(cyl + am))
eval(code)
#> Error in mean(cyl + am): object 'am' not found
}\if{html}{\out{</div>}}

You can supply an optional list or data frame that will be converted to an environment.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{eval(code, mtcars)
#> [1] 6.59375
}\if{html}{\out{</div>}}

Evaluation of defused code then occurs in the context of a data mask.
}
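
Putting the three features together, a data-masking function can be sketched in a few lines of base R. This is a minimal illustration rather than the way rlang or dplyr implement it, and \code{my_with()} is a hypothetical name. The argument is defused with \code{substitute()}, then evaluated with \code{eval()}, with the data frame acting as the mask and the caller's environment as the fallback.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Hypothetical helper: a minimal data-masking function
my_with <- function(data, expr) \{
  # Defuse the argument instead of evaluating it
  code <- substitute(expr)

  # Resume evaluation with the data frame as mask. Anything that is
  # not a column is looked up in the caller's environment.
  eval(code, data, parent.frame())
\}

factor <- 1000
my_with(mtcars, mean(cyl * factor))
#> [1] 6187.5
}\if{html}{\out{</div>}}

Passing \code{parent.frame()} as the enclosure is what keeps user objects such as \code{factor} in scope while columns take precedence.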
}

\section{History}{
The tidyverse embraced the data-masking approach in packages like ggplot2 and dplyr and eventually developed its own programming framework in the rlang package. None of this would have been possible without the following landmark developments from S and R authors.
\itemize{
\item The S language introduced data scopes with \code{\link[=attach]{attach()}} (Becker, Chambers and Wilks, The New S Language, 1988).
\item The S language introduced data-masked formulas in modelling functions (Chambers and Hastie, 1993).
\item Peter Dalgaard (R team) wrote the frametools package in 1997. It was later included in R as \code{\link[base:transform]{base::transform()}} and \code{\link[base:subset]{base::subset()}}. This API is an important source of inspiration for the dplyr package. It was also the first appearance of \emph{selections}, a variant of data-masking extended and codified later on in the \href{https://tidyselect.r-lib.org/articles/syntax.html}{tidyselect package}.
\item In 2000 Luke Tierney (R team) \href{https://github.com/wch/r-source/commit/a945ac8e}{changed formulas} to keep track of their original environments. This change, published in R 1.1.0, was a crucial step towards hygienic data masking, i.e. the proper resolution of symbols in their original environments. Quosures were inspired by the environment-tracking mechanism of formulas.
\item Luke Tierney introduced \code{\link[base:with]{base::with()}} in 2001.
\item In 2006 the \href{https://r-datatable.com}{data.table package} included data-masking and selections in the \code{i} and \code{j} arguments of the \code{[} method of a data frame.
\item The \href{https://dplyr.tidyverse.org/}{dplyr package} was published in 2014.
\item The rlang package developed tidy eval in 2017 as the data-masking framework of the tidyverse. It introduced the notions of \link[=topic-quosure]{quosure}, \link[=topic-inject]{implicit injection} with \verb{!!} and \verb{!!!}, and \link[=.data]{data pronouns}.
\item In 2019, injection with \verb{\{\{} was introduced in \href{https://www.tidyverse.org/blog/2019/06/rlang-0-4-0/}{rlang 0.4.0} to simplify the defuse-and-inject pattern. This operator allows R programmers to transport data-masked arguments across functions more intuitively and with minimal boilerplate.
}
}

\section{See also}{
\itemize{
\item \ifelse{html}{\link[=topic-data-mask-programming]{Data mask programming patterns}}{\link[=topic-data-mask-programming]{Data mask programming patterns}}
\item \ifelse{html}{\link[=topic-defuse]{Defusing R expressions}}{\link[=topic-defuse]{Defusing R expressions}}
}
}

\keyword{internal}
