If you have not already done so, I recommend reading our manuscript Measuring and Mitigating PCR Bias in Microbiome Data. This vignette covers only the computational aspect of our approach. Equally important is the design of the PCR calibration curve, which is detailed in the manuscript.

An Example

Here I will show an example of how such data (including calibration samples) can be modeled. This is the mock community data we analyzed in the manuscript.

library(fido)
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(5903)
# First load the data
data(pcrbias_mock)

Let's first take a brief look at the data. There are two objects: Y (a count table that I have already preprocessed just as in our manuscript) and metadata, which contains the covariates we need (including the number of PCR cycles each sample has undergone).

Y[1:5,1:5]
#>               cycle13.1 cycle13.2 cycle13.3 cycle14.1 cycle14.2
#> B.longum             27        28        22        37        44
#> B.subtilis          320       299       272       513       650
#> C.aerofaciens        35        32        39        43        84
#> C.hathewayi          61        52        59        93       117
#> C.innocuum          121        91       112       197       208

head(metadata)
#>   sample_name  sample_num cycle_num machine
#> 1   cycle13.1 Calibration        13       3
#> 2   cycle13.2 Calibration        13       3
#> 3   cycle13.3 Calibration        13       3
#> 4   cycle14.1 Calibration        14       4
#> 5   cycle14.2 Calibration        14       4
#> 6   cycle14.3 Calibration        14       4

The only non-obvious variable here is probably machine, which is a categorical variable denoting which of 4 different PCR machines was used to amplify a given sample. When writing the paper, we thought this might be a source of bias, so we included it as a term in our model (we will do the same here just to demonstrate how).

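As fido doesn’t yet have a formula interface (I will write that eventually), you need to build the design matrix yourself using the formula interface provided by base R’s model.matrix function. Note the transpose: the covariates end up in the rows of X and the samples in its columns.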

X <- t(model.matrix(~ cycle_num + sample_num + machine  -1, data = metadata))
X[,1:5]
#>                        1  2  3  4  5
#> cycle_num             13 13 13 14 14
#> sample_numCalibration  1  1  1  1  1
#> sample_numMock1        0  0  0  0  0
#> sample_numMock10       0  0  0  0  0
#> sample_numMock2        0  0  0  0  0
#> sample_numMock3        0  0  0  0  0
#> sample_numMock4        0  0  0  0  0
#> sample_numMock5        0  0  0  0  0
#> sample_numMock6        0  0  0  0  0
#> sample_numMock7        0  0  0  0  0
#> sample_numMock8        0  0  0  0  0
#> sample_numMock9        0  0  0  0  0
#> machine2               0  0  0  0  0
#> machine3               1  1  1  0  0
#> machine4               0  0  0  1  1

You can see that in doing this we have created a design matrix that encodes the PCR machine using a series of 3 dummy variables. We also have a series of indicator variables denoting which samples are biologically unique (the sample_num terms). The -1 in the formula removes the global intercept, so R gives each biological sample its own intercept (i.e., it uses a one-hot encoding for sample_num rather than the dummy encoding used for the PCR machines).
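To make the effect of the -1 concrete, here is a minimal sketch using only base R. The toy data frame below is made up for illustration (it is not the pcrbias_mock metadata); only the column names mirror the ones used above.

```r
# Toy metadata (hypothetical values, not the pcrbias_mock data)
toy <- data.frame(
  cycle_num  = c(13, 14, 35, 35),
  sample_num = factor(c("Calibration", "Calibration", "Mock1", "Mock2")),
  machine    = factor(c("1", "2", "3", "1"))
)

# With -1: no global intercept, so the first factor in the formula (sample_num)
# gets one indicator per level; machine still drops its reference level
X_onehot <- t(model.matrix(~ cycle_num + sample_num + machine - 1, data = toy))
rownames(X_onehot)

# Without -1: R adds an "(Intercept)" row and drops the reference level of
# sample_num ("Calibration") as well
X_dummy <- t(model.matrix(~ cycle_num + sample_num + machine, data = toy))
rownames(X_dummy)
```

Comparing the two sets of row names shows that sample_numCalibration only appears when the intercept is removed, which is what lets each biological sample have its own intercept.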

Next we are going to specify our model priors and fit the model. A detailed description of the general thought process I like to follow when creating priors in fido is provided in the vignette Tips for Specifying Priors. Here I am just going to use a simple prior where I only change Gamma from its default value. If you are wondering, in the manuscript I chose the multiplier 10 based on maximum marginal likelihood. At the end of this vignette I will show an example of how this can be done.

fit <- pibble(Y = Y, X=X, Gamma = 10*diag(nrow(X)))
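For readers unsure what the Gamma argument looks like, here is a small base-R sketch. It relies on Gamma being a square prior covariance matrix with one row and column per covariate (per row of X). The stand-in matrix X_toy and the alternative Gamma_alt are purely illustrative assumptions, not choices made in the manuscript.

```r
# Stand-in design matrix with the same layout as X above (covariates in rows);
# the values are hypothetical
X_toy <- rbind(
  cycle_num             = c(13, 14, 35),
  sample_numCalibration = c(1, 1, 0),
  sample_numMock1       = c(0, 0, 1)
)

Q <- nrow(X_toy)            # number of covariates
Gamma <- 10 * diag(Q)       # same prior variance (10) for every covariate
dimnames(Gamma) <- list(rownames(X_toy), rownames(X_toy))

# Hypothetical variation: a tighter prior on the per-cycle effect only
Gamma_alt <- Gamma
Gamma_alt["cycle_num", "cycle_num"] <- 1
Gamma_alt
```

A diagonal Gamma treats the covariates as independent a priori, and the multiplier sets how far the regression coefficients are allowed to move away from the prior mean.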

Next we are going to transform the results into CLR coordinates and interpret them in that space.

fit <- to_clr(fit)
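As a reminder of what CLR coordinates are, here is a hand-rolled sketch of the centered log-ratio transform in base R. The helper clr() and the composition p are illustrative assumptions; this is not how to_clr is implemented, it only shows the arithmetic for a single composition.

```r
# Centered log-ratio: log of each part minus the mean log over all parts
clr <- function(p) log(p) - mean(log(p))

p <- c(0.5, 0.3, 0.2)   # hypothetical composition of three taxa
z <- clr(p)
z

sum(z)                          # zero, up to floating point error
all.equal(clr(10 * p), clr(p))  # rescaling the composition changes nothing
```

Each CLR coordinate therefore describes one taxon relative to the geometric mean of all taxa, which is how the clr_ coordinates in the plots below should be read.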

That’s about it. Now it's just a matter of interpreting the model results. Let's say you want to investigate the estimated unbiased composition; then you just have to look at the inferred random intercepts for the corresponding sample_num variable. We can plot the results simply enough:

# pull out indices for random intercepts corresponding to `sample_num`
focus.covariate <- rownames(X)[which(grepl("sample_num", rownames(X)))]

# So that the plot fits nicely in Rmarkdown, we plot just a few of the taxa
focus.coord <- paste0("clr_", c("S.gallolyticus", "R.intestinalis", "L.ruminis")) 

# Also to make the plot fit nicely, I just flip the orientation of the plot 
plot(fit, par="Lambda", focus.cov=focus.covariate, focus.coord=focus.coord) +
  theme(strip.text.y=element_text(angle=0, hjust=1)) +
  facet_grid(.data$covariate~.)
#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
#> replace the existing scale.

The compositional bias introduced at each cycle can also be visualized.

# Plot the per-cycle bias (the cycle_num coefficient) for each taxon
plot(fit, par="Lambda", focus.cov="cycle_num")
#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
#> replace the existing scale.

The fido package has a bunch of tools for working with such fitted models depending on what you ultimately want to do. See the main pibble vignette for a fuller description of what you can do with such fitted models.

One plot I find particularly useful shows the calibration data together with the fitted bias model. This can be done as follows:


# First transform the data into CLR coordinates (requires a pseudo-count to deal
# with zeros). Then convert to tidy format for ggplot
tidy_calibration <- clr_array(Y+0.5, 1) %>% # transform to CLR
  as.data.frame() %>% 
  select(starts_with("cycle")) %>%  # select only samples from the calibration
  t() %>% 
  as.data.frame()
tidy_calibration$sample_name <- rownames(tidy_calibration)
tidy_calibration <- tidy_calibration %>% 
  gather(coord, val, -sample_name) %>% 
  mutate(coord = as.numeric(substr(coord, 2, 4))) %>% 
  left_join(metadata, by="sample_name") %>% 
  mutate(coord = names_coords(fit)[coord])
  

# Now the important part - let's grab the pibble result of interest
X.tmp <- matrix(0, nrow(X), 2) # fake covariate data used to predict the regression line
rownames(X.tmp) <- rownames(X)
X.tmp["cycle_num",2] <- 35
X.tmp["sample_numCalibration",] <- 1
X.tmp # simple, just going to predict the composition for each of these two samples for the plot
#>                       [,1] [,2]
#> cycle_num                0   35
#> sample_numCalibration    1    1
#> sample_numMock1          0    0
#> sample_numMock10         0    0
#> sample_numMock2          0    0
#> sample_numMock3          0    0
#> sample_numMock4          0    0
#> sample_numMock5          0    0
#> sample_numMock6          0    0
#> sample_numMock7          0    0
#> sample_numMock8          0    0
#> sample_numMock9          0    0
#> machine2                 0    0
#> machine3                 0    0
#> machine4                 0    0

# Now predict the fitted regression line for cycle_num using X.tmp
predicted <- predict(fit, newdata=X.tmp, summary=TRUE) %>% 
  mutate(cycle_num = c(0, 35)[sample])

# now plot 
predicted %>% 
  ggplot(aes(x=cycle_num)) +
  geom_ribbon(aes(ymin=p2.5, ymax=p97.5), fill="darkgrey") +
  geom_line(aes(y=mean)) +
  geom_point(data=tidy_calibration, aes(y=val)) +
  facet_grid(coord~.) +
  theme_bw() +
  theme(strip.text.y=element_text(angle=0)) +
  ylab("CLR Coordinates")
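The reason two fake samples (cycle 0 and cycle 35) are enough is that the fitted line is just the regression coefficients multiplied by the covariates. The sketch below shows that arithmetic with a made-up coefficient matrix Lambda_toy. All numbers and taxon names are hypothetical, and I am assuming the predicted quantity is the linear predictor (Lambda times X); the real predict call additionally propagates posterior uncertainty, which is where the ribbon in the plot comes from.

```r
# Hypothetical posterior-mean coefficients: 3 CLR coordinates x 2 covariates
Lambda_toy <- matrix(
  c(0.90, -0.20, -0.70,   # intercepts for the calibration sample
    0.01,  0.00, -0.01),  # change per PCR cycle
  nrow = 3,
  dimnames = list(paste0("clr_", c("A", "B", "C")),
                  c("sample_numCalibration", "cycle_num"))
)

# Two fake samples: the calibration sample at cycle 0 and at cycle 35
X_new <- rbind(sample_numCalibration = c(1, 1),
               cycle_num             = c(0, 35))

# Linear predictor: one column per fake sample
pred <- Lambda_toy %*% X_new
pred
```

The first column is the intercept alone (the composition before any PCR cycles) and the second is the intercept plus 35 times the per-cycle slope, so joining the two points traces the fitted regression line.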

There are two things I look for in these plots. First, the data should look linear in this space. If the data do not look linear, there are a few possible explanations: (a) something went wrong in your calibration experiment, (b) something is wrong with your code for plotting the calibration data, or (c) our theory and prior experiments are wrong and PCR bias is not well approximated as log-ratio linear. Second, you should check that your model is doing a good job fitting the data. Just remember that the data here have a few other sources of variation that the model accounts for but the plot does not show. For example, there is batch variation (think about the PCR machine variable we included above). There are also zeros: here we just add a pseudo-count and transform the data, whereas internally fido actually models the zeros, which should be more appropriate than the pseudo-count.
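To build intuition for what "log-ratio linear" means, here is a small base-R simulation. It assumes that each taxon is amplified with its own constant relative efficiency at every cycle; the composition a and the efficiencies b are made-up numbers. Under that assumption the CLR coordinates are exactly linear in the cycle number, with slopes equal to the CLR of the efficiencies.

```r
# Centered log-ratio helper (illustrative, base R only)
clr <- function(p) log(p) - mean(log(p))

a <- c(0.5, 0.3, 0.2)      # hypothetical true (cycle 0) composition
b <- c(1.00, 0.95, 0.90)   # hypothetical relative per-cycle efficiencies
cycles <- 0:35

# Composition after x cycles is proportional to a * b^x
comp <- sapply(cycles, function(x) {
  w <- a * b^x
  w / sum(w)
})

# CLR-transform each cycle's composition (rows = taxa, columns = cycles)
clr_path <- apply(comp, 2, clr)

# Slope of each taxon's CLR coordinate against cycle number
slopes <- apply(clr_path, 1, function(z) coef(lm(z ~ cycles))[2])
all.equal(unname(slopes), clr(b))
```

Real calibration data will scatter around such lines because of sampling noise and batch effects, which is why the fitted ribbon, rather than an exact line through the points, is the thing to compare against.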
