% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Fast_analysis.R
\name{avg_and_regularize}
\alias{avg_and_regularize}
\title{Efficiently average replicates of nucleotide recoding data and regularize}
\usage{
avg_and_regularize(
  Mut_data_est,
  nreps,
  sample_lookup,
  feature_lookup,
  nbin = NULL,
  NSS = FALSE,
  Chase = FALSE,
  BDA_model = FALSE,
  null_cutoff = 0,
  Mutrates = NULL,
  ztest = FALSE
)
}
\arguments{
\item{Mut_data_est}{Dataframe with fraction new estimation information. Required columns are:
\itemize{
\item fnum; numerical ID of feature
\item reps; numerical ID of replicate
\item mut; numerical ID of experimental condition (Exp_ID)
\item logit_fn_rep; logit(fn) estimate
\item kd_rep_est; kdeg estimate
\item log_kd_rep_est; log(kdeg) estimate
\item logit_fn_se; logit(fn) estimate uncertainty
\item log_kd_se; log(kdeg) estimate uncertainty
}}

\item{nreps}{Vector of number of replicates in each experimental condition}

\item{sample_lookup}{Dictionary mapping sample names to various experimental details}

\item{feature_lookup}{Dictionary mapping feature IDs to original feature names}

\item{nbin}{Number of bins for mean-variance relationship estimation. If NULL, the larger of 10 and (number of logit(fn) estimates)/100 is used.}

\item{NSS}{Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption.}

\item{Chase}{Logical; set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl, where tl is the label time and fn is the fraction of
reads that are s4U labeled (more properly referred to as the fraction old in the context of a pulse-chase experiment).}

\item{BDA_model}{Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal
model is used.}

\item{null_cutoff}{bakR will test the null hypothesis of |effect size| < |null_cutoff|}

\item{Mutrates}{List containing new and old mutation rate estimates}

\item{ztest}{Logical; if TRUE, then a z-test is used for p-value calculation rather than the more conservative moderated t-test.}
}
\value{
List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:
\itemize{
\item Fn_Estimates; dataframe with estimates for the fraction new and fraction new uncertainty for each feature in each replicate.
The columns of this dataframe are:
\itemize{
\item Feature_ID; Numerical ID of feature
\item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
\item Replicate; Numerical ID for replicate
\item logit_fn; logit(fraction new) estimate, unregularized
\item logit_fn_se; logit(fraction new) uncertainty, unregularized and obtained from Fisher Information
\item nreads; Number of reads mapping to the feature in the sample for which the estimates were obtained
\item log_kdeg; log of degradation rate constant (kdeg) estimate, unregularized
\item kdeg; degradation rate constant (kdeg) estimate
\item log_kd_se; log(kdeg) uncertainty, unregularized and obtained from Fisher Information
\item sample; Sample name
\item XF; Original feature name
}
\item Regularized_ests; dataframe with average fraction new and kdeg estimates, averaged across the replicates and regularized
using priors informed by the entire dataset. The columns of this dataframe are:
\itemize{
\item Feature_ID; Numerical ID of feature
\item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
\item avg_log_kdeg; Weighted average of log(kdeg) from each replicate, weighted by sample and feature-specific read depth
\item sd_log_kdeg; Standard deviation of the log(kdeg) estimates
\item nreads; Total number of reads mapping to the feature in that condition
\item sdp; Prior standard deviation for fraction new estimate regularization
\item theta_o; Prior mean for fraction new estimate regularization
\item sd_post; Posterior uncertainty
\item log_kdeg_post; Posterior mean for log(kdeg) estimate
\item kdeg; exp(log_kdeg_post)
\item kdeg_sd; kdeg uncertainty
\item XF; Original feature name
}
\item Effects_df; dataframe with estimates of the effect size (change in log(kdeg), or change in logit(fn) if \code{NSS} is TRUE) comparing each experimental condition to the
reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test (or a z-test if \code{ztest} is TRUE). The columns of this
dataframe are:
\itemize{
\item Feature_ID; Numerical ID of feature
\item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
\item L2FC(kdeg); Log2 fold change (L2FC) kdeg estimate, or change in logit(fn) if NSS is TRUE
\item effect; LFC(kdeg)
\item se; Uncertainty in L2FC_kdeg
\item pval; P-value obtained using effect, se, and a moderated t-test (or a z-test if ztest is TRUE)
\item padj; pval adjusted for multiple testing using Benjamini-Hochberg procedure
\item XF; Original feature name
}
\item Mut_rates; list of two elements. The 1st element is a dataframe of s4U-induced mutation rate estimates, where the mut column
represents the experimental ID and the rep column represents the replicate ID. The 2nd element is the single background mutation
rate estimate used.
\item Hyper_Parameters; vector of two elements, named a and b. These are the hyperparameters estimated from the uncertainties for each
feature, and represent the two parameters of a Scaled Inverse Chi-Square distribution. Importantly, a is the number of additional
degrees of freedom provided by the sharing of uncertainty information across the dataset, to be used in the moderated t-test.
\item Mean_Variance_lms; linear model objects obtained from the uncertainty vs. read count regression model. One model is run for each Exp_ID
}
}
\description{
\code{avg_and_regularize} pools and regularizes replicate estimates of kinetic parameters. There are two key steps in this
downstream analysis. First, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend,
and uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature is a combination of the
Fisher Information asymptotic uncertainty as well as the amount of variability seen between estimates. Regularization of uncertainty
estimates is performed using the analytic results of a Normal distribution likelihood with known mean and unknown variance and conjugate
priors. The prior parameters are estimated from the regression and amount of variability about the regression line. The strength of
regularization can be tuned by adjusting the \code{prior_weight} parameter, with larger numbers yielding stronger shrinkage towards
the regression line. The second step is to regularize the average kdeg estimates. This is done using the analytic results of a
Normal distribution likelihood model with unknown mean and known variance and conjugate priors. The prior parameters are estimated from the
population-wide kdeg distribution (using its mean and standard deviation as the mean and standard deviation of the Normal prior).
In the first step, the known mean is assumed to be the average kdeg, averaged across replicates and weighted by the number of reads
mapping to the feature in each replicate. In the second step, the known variance is assumed to be that obtained following regularization
of the uncertainty estimates.
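
As a sketch of the two conjugate updates described above, the textbook results can be written in illustrative notation that is not taken from the bakR source (\eqn{\nu_0}{nu_0}, \eqn{s_0^2}{s_0^2}, \eqn{v}{v}, \eqn{\sigma}{sigma}, \eqn{\sigma_p}{sigma_p}, \eqn{\theta_0}{theta_0} and \eqn{\bar{x}}{xbar} are assumed symbols). For a Normal likelihood with known mean and a scaled inverse chi-squared prior with \eqn{\nu_0}{nu_0} degrees of freedom and scale \eqn{s_0^2}{s_0^2} (assumed here to be the variance predicted by the regression line), the posterior scale after observing \eqn{n}{n} replicates with mean squared deviation \eqn{v}{v} about the known mean is
\deqn{s_{post}^2 = \frac{\nu_0 s_0^2 + n v}{\nu_0 + n}}{s_post^2 = (nu_0 * s_0^2 + n * v) / (nu_0 + n)}
so a larger \eqn{\nu_0}{nu_0} pulls the estimate more strongly towards the trend. This form corresponds to the scaled inverse chi-squared model (\code{BDA_model = TRUE}); the log-normal model is not covered by this sketch. For a Normal likelihood with known variance \eqn{\sigma^2}{sigma^2} and a Normal prior with mean \eqn{\theta_0}{theta_0} and standard deviation \eqn{\sigma_p}{sigma_p}, the posterior mean and variance given a replicate average \eqn{\bar{x}}{xbar} are
\deqn{\mu_{post} = \frac{n \bar{x} / \sigma^2 + \theta_0 / \sigma_p^2}{n / \sigma^2 + 1 / \sigma_p^2}, \qquad \sigma_{post}^2 = \frac{1}{n / \sigma^2 + 1 / \sigma_p^2}}{mu_post = (n * xbar / sigma^2 + theta_0 / sigma_p^2) / (n / sigma^2 + 1 / sigma_p^2),  sigma_post^2 = 1 / (n / sigma^2 + 1 / sigma_p^2)}
The exact expressions used internally may differ in detail from these standard forms.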
}
\details{
Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental
sample(s), and the log(kdeg)s are assumed to be independent so that the variance of the effect size is the sum of the
log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test with number
of degrees of freedom determined from the uncertainty regression hyperparameters and are adjusted for multiple testing using the
Benjamini-Hochberg procedure to control false discovery rates (FDRs).
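
The effect size calculation above can be sketched as follows. The notation is illustrative only: \eqn{\mu}{mu} and \eqn{\sigma}{sigma} denote the mean and uncertainty of log(kdeg), the subscripts exp and ref are assumed labels for the experimental and reference conditions, and the sign convention (experimental minus reference) is an assumption.
\deqn{effect = \mu_{exp} - \mu_{ref}, \qquad se = \sqrt{\sigma_{exp}^2 + \sigma_{ref}^2}}{effect = mu_exp - mu_ref,  se = sqrt(sigma_exp^2 + sigma_ref^2)}
If log(kdeg) is a natural logarithm, the log2 fold change follows by a change of base:
\deqn{L2FC = \frac{effect}{\ln 2}}{L2FC = effect / ln(2)}
As a worked example with made-up numbers, a reference log(kdeg) of -2.0 and an experimental log(kdeg) of -1.5, each with an uncertainty of 0.1, give an effect of 0.5, an se of about 0.141, and an L2FC of about 0.721. With \code{null_cutoff = 0}, the ratio effect/se (about 3.54 here) is the kind of statistic that would be compared against the reference distribution of the test.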

In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of a biological system being analyzed.
In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg).
This analysis strategy is implemented when \code{NSS} is set to TRUE. Comparing logit(fraction new) is only valid
if a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq
data from WT cells and a 2 hour label time was used in KO cells, this comparison is no longer valid as differences in
logit(fraction new) could stem from differences in kinetics or label times.
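
The label time dependence can be made concrete with a short derivation. It assumes the standard steady-state pulse-label relationship between the fraction new (fn), kdeg, and the label time (tl); this relationship is an assumption of the sketch and is not stated elsewhere in this documentation.
\deqn{fn = 1 - e^{-kdeg \cdot tl}, \qquad kdeg = -\frac{\ln(1 - fn)}{tl}}{fn = 1 - exp(-kdeg * tl),  kdeg = -ln(1 - fn) / tl}
\deqn{logit(fn) = \ln\left(\frac{fn}{1 - fn}\right) = \ln\left(e^{kdeg \cdot tl} - 1\right)}{logit(fn) = ln(fn / (1 - fn)) = ln(exp(kdeg * tl) - 1)}
Under this assumption, logit(fn) depends on the product of kdeg and tl. For instance, a transcript with kdeg = 0.2 per hour has fn of about 0.181 (logit of about -1.51) after a 1 hour label and fn of about 0.330 (logit of about -0.71) after a 2 hour label, even though its kinetics are unchanged.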
}
