% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/biomarker_qc.R
\name{remove_technical_variation}
\alias{remove_technical_variation}
\title{Remove technical variation from NMR biomarker data in UK Biobank.}
\usage{
remove_technical_variation(
  x,
  remove.outlier.plates = TRUE,
  skip.biomarker.qc.flags = FALSE
)
}
\arguments{
\item{x}{\code{data.frame} containing a dataset extracted by
\href{https://biobank.ctsu.ox.ac.uk/crystal/exinfo.cgi?src=accessing_data_guide}{ukbconv}
with \href{https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220}{UK Biobank fields}
containing the \href{https://research.nightingalehealth.com/biomarkers/}{Nightingale Health NMR metabolomics biomarker} data.}

\item{remove.outlier.plates}{logical; when set to \code{FALSE}, biomarker
concentrations on outlier shipment plates (see Details) are not set to
missing but are only flagged in the \code{biomarker_qc_flags}
\code{data.frame} in the returned \code{list}.}

\item{skip.biomarker.qc.flags}{logical; when set to \code{TRUE}, biomarker QC
flags are not processed or returned.}
}
\value{
A \code{list} containing the following \code{data.frames}: \describe{
  \item{biomarkers}{A \code{data.frame} with columns "eid" and
       "visit_index", containing the project-specific sample identifier and
       the UK Biobank visit index (0 for baseline assessment, 1 for first
       repeat assessment), followed by columns for each biomarker containing their
       absolute concentrations (or ratios thereof) adjusted for technical
       variation. See \code{\link{nmr_info}} for information on each biomarker.}
  \item{biomarker_qc_flags}{A \code{data.frame} with the same format as
        \code{biomarkers} with entries corresponding to the
        \href{https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=221}{quality
        control indicators} for each sample. "High plate outlier" and "Low
        plate outlier" indicate the value was set to missing due to systematic
        abnormalities in the biomarker's concentration on the sample's shipment
        plate compared to all other shipment plates (see Details). For
        composite and derived biomarkers, quality control flags are aggregates
        of any quality control flags for the underlying biomarkers from which
        the composite biomarker or ratio is derived.}
  \item{sample_processing}{A \code{data.frame} containing the
        \href{https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=222}{processing
        information and quality control indicators} for each sample, including
        those derived for removal of unwanted technical variation by this
        function. See \code{\link{sample_qc_info}} for details.}
  \item{log_offset}{A \code{data.frame} containing diagnostic information on
        the offset applied so that biomarkers with concentrations of 0 could
        be log transformed, and any right shift applied to prevent negative
        concentrations after rescaling adjusted residuals back to absolute
        concentrations. Should contain only biomarkers with minimum
        concentrations of 0 (in the "Minimum" column). "Minimum.Non.Zero"
        gives the smallest non-zero concentration for the biomarker.
        "Log.Offset" gives the small offset added to all samples prior to
        log transformation: half the minimum non-zero concentration.
        "Right.Shift" gives the small offset added to prevent negative
        concentrations that arise after rescaling residuals to log
        concentrations: this should be at least one order of magnitude
        smaller than the smallest non-zero value (i.e. the offset added
        should amount to noise in numeric precision for all samples). See
        publication for more details.}
  \item{outlier_plate_detection}{A \code{data.frame} containing diagnostic
        information and details of outlier plate detection. For each of the
        107 non-derived biomarkers, the median concentration on each of the
        1,352 plates was calculated, then plates were flagged as outliers if
        their median value deviated more than expected from the mean of plate
        medians. "Mean.Plate.Medians" gives the mean of the plate medians for
        each biomarker. "Lower.Limit" and "Upper.Limit" give the values below
        and above which plates are flagged as outliers based on their plate
        median. See publication for more details.}
}
}
\description{
Remove technical variation from NMR biomarker data in UK Biobank.
}
\details{
A multi-step procedure is applied to the raw biomarker data to remove the
effects of technical variation:
\enumerate{
  \item{First, biomarker data are filtered to the 107 biomarkers that
  cannot be derived from any combination of other biomarkers.}
  \item{Absolute concentrations are log transformed, with a small offset
  applied to biomarkers with concentrations of 0.}
  \item{Each biomarker is adjusted for the time between sample preparation
  and sample measurement (hours).}
  \item{Each biomarker is adjusted for systematic differences between rows
  (A-H) on the 96-well shipment plates.}
  \item{Each biomarker is adjusted for remaining systematic differences
  between columns (1-12) on the 96-well shipment plates.}
  \item{Each biomarker is adjusted for drift over time within each of the six
  spectrometers. To do so, samples are grouped into 10 bins, within each
  spectrometer, by the date the majority of samples on their respective
  96-well plates were measured.}
  \item{Regression residuals after the sequential adjustments are
  transformed back to absolute concentrations.}
  \item{Samples belonging to shipment plates that are outliers of
  non-biological origin are identified and set to missing.}
  \item{The 61 composite biomarkers and 81 biomarker ratios are recomputed
  from their adjusted parts.}
  \item{An additional 76 biomarker ratios of potential biological
  significance are computed.}
}
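
As an illustration of the log transformation step, the following is a minimal
sketch on hypothetical concentrations (not the package's internal code). It
assumes the offset rule described for the \code{log_offset} output: half the
minimum non-zero concentration, added to all samples of a biomarker whose
minimum concentration is 0.
\preformatted{
# Hypothetical concentrations for one biomarker, including a zero
conc <- c(0, 0.02, 0.5, 1.3)

# Smallest non-zero concentration, and the offset derived from it
min_nonzero <- min(conc[conc > 0])
log_offset <- min_nonzero / 2

# The offset is added to every sample before taking logs
log_conc <- log(conc + log_offset)
}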

At each step, adjustment for technical covariates is performed using
\link[MASS:rlm]{robust linear regression}. Plate row, plate column, and
sample measurement date bin are treated as factors, using the group with the
largest sample size as reference in the regression.
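
A single adjustment step of this kind might be sketched as follows. This is an
illustrative assumption rather than the package's implementation: the simulated
data and the column names \code{row} and \code{log_conc} are hypothetical, and
only one covariate (plate row) is adjusted for.
\preformatted{
library(MASS)
set.seed(1)

# Hypothetical toy data: log concentrations with a small plate row effect
dat <- data.frame(
  row = factor(sample(LETTERS[1:8], 200, replace = TRUE))
)
dat$log_conc <- rnorm(200) + 0.1 * as.integer(dat$row)

# Use the group with the largest sample size as the reference level
ref <- names(which.max(table(dat$row)))
dat$row <- relevel(dat$row, ref = ref)

# Robust linear regression; the residuals are the adjusted values
fit <- rlm(log_conc ~ row, data = dat)
dat$adjusted <- residuals(fit)
}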

Further details can be found in Ritchie S. C. \emph{et al.} Quality control
and removal of technical variation of NMR metabolic biomarker data in
~120,000 UK Biobank participants, \emph{Sci Data} \strong{10}, 64 (2023). doi:
\href{https://www.nature.com/articles/s41597-023-01949-y}{10.1038/s41597-023-01949-y}

This function takes 10-15 minutes to run and requires at least 14 GB of RAM.
}
\examples{
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package
processed <- remove_technical_variation(ukb_data)
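
# A possible follow-up, shown as a sketch: the returned list can be unpacked
# into its component data.frames, named as documented in the Value section.
biomarkers <- processed$biomarkers
biomarker_qc_flags <- processed$biomarker_qc_flags
sample_processing <- processed$sample_processing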

}
