% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/do_cv.R
\name{do_cv}
\alias{do_cv}
\alias{do_cv_step1}
\alias{infer_step1}
\alias{infer_fixedfeatures}
\title{HTRX Model selection on short haplotypes}
\usage{
do_cv(
  data_nosnp,
  featuredata,
  train_proportion = 0.5,
  sim_times = 20,
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  method = "simple",
  criteria = "BIC",
  gain = TRUE,
  runparallel = FALSE,
  mc.cores = 6,
  tenfoldseed = 123,
  returnall = FALSE,
  verbose = FALSE
)

do_cv_step1(
  data_nosnp,
  featuredata,
  train_proportion = 0.5,
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  method = "simple",
  criteria = "BIC",
  splitseed = 123,
  runparallel = FALSE,
  mc.cores = 6,
  verbose = FALSE
)

infer_step1(
  data_nosnp,
  featuredata,
  train,
  criteria = "BIC",
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  runparallel = FALSE,
  mc.cores = 6,
  verbose = FALSE
)

infer_fixedfeatures(
  data_nosnp,
  featuredata,
  train = (1:nrow(data_nosnp))[-test],
  test,
  features,
  coefficients = NULL,
  gain = TRUE,
  usebinary = 1,
  R2only = FALSE,
  verbose = FALSE
)
}
\arguments{
\item{data_nosnp}{a data frame with outcome (the outcome must be the first column)
 and fixed covariates (for example, sex, age and the first 18 PCs)
and without SNPs or haplotypes.}

\item{featuredata}{a data frame of the feature data, e.g. haplotype data created by HTRX or SNPs.
These features exclude all the data in data_nosnp, and will be selected using 2-step cross-validation.}

\item{train_proportion}{a positive number between 0 and 1 giving
the proportion of the training dataset when splitting data into 2 folds.
By default, train_proportion=0.5.}

\item{sim_times}{an integer giving the number of simulations in step 1 (see details).
By default, sim_times=20.}

\item{featurecap}{a positive integer which manually sets the maximum number of independent features.
By default, featurecap=dim(featuredata)[2], i.e. the number of columns of featuredata.}

\item{usebinary}{a non-negative number representing different models.
Use linear model if usebinary=0,
use logistic regression model via fastglm if usebinary=1 (by default),
and use logistic regression model via glm if usebinary>1.}

\item{method}{the method used for data splitting, either "simple" (default) or "stratified".}

\item{criteria}{the information criteria for model selection, either "BIC" (default) or "AIC".}

\item{gain}{logical. If gain=TRUE (default), report the additional variance explained by the features on top of the fixed covariates;
otherwise, report the total variance explained by all the variables.}

\item{runparallel}{logical. Use parallel programming based on "mclapply" function or not.
Note that for Windows users, "mclapply" doesn't work, so please set runparallel=FALSE (default).}

\item{mc.cores}{an integer giving the number of cores used for parallel programming.
By default, mc.cores=6.
This only works when runparallel=TRUE.}

\item{tenfoldseed}{a positive integer specifying the seed used to
split data for 10-fold cross validation. By default, tenfoldseed=123.}

\item{returnall}{logical. If returnall=TRUE, return all the candidate models and
the variance explained in each of the 10 test sets by each candidate model.
If returnall=FALSE (default), only return the best candidate model
and the variance explained in each of the 10 test sets by this model.}

\item{verbose}{logical. If verbose=TRUE, print out the inference steps. By default, verbose=FALSE.}

\item{splitseed}{a positive integer giving the seed of data split.}

\item{train}{a vector of the indexes of the training data.}

\item{test}{a vector of the indexes of the test data.}

\item{features}{a character vector giving the names of the fixed features.}

\item{coefficients}{a vector giving the coefficients of the fixed features.
If the fixed features don't have fixed coefficients, set coefficients=NULL (default).}

\item{R2only}{logical. If R2only=TRUE, function infer_fixedfeatures only
returns the variance explained in the test data.
By default, R2only=FALSE.}
}
\value{
\code{\link{do_cv}} returns a list containing the best model selected, and the out-of-sample variance explained in each test set.
If returnall=TRUE, this function also returns all the candidate models,
and the out-of-sample variance explained in each test set by each candidate model.

\code{\link{do_cv_step1}} and \code{\link{infer_step1}} return a list of three candidate models selected by a single simulation.

\code{\link{infer_fixedfeatures}} returns a list of the variance explained in the test set if R2only=TRUE,
otherwise, it returns a list of the variance explained in the test set, the model including all the variables,
and the null model, i.e. the model with outcome and fixed covariates only.
}
\description{
Two-step cross-validation used to select the best HTRX model.
It can be applied to select haplotypes based on HTR, or select single nucleotide polymorphisms (SNPs).
}
\details{
Function \code{\link{do_cv}} is the main function used for selecting haplotypes from HTRX or SNPs.
It is a two-step procedure used to alleviate overfitting.

Step 1: select candidate models. This is to address the model search problem,
and is chosen to obtain a set of models more diverse than
traditional bootstrap resampling.

(1) Randomly sample a subset (50\%) of data.
Specifically, when the outcome is binary,
stratified sampling is used to ensure the subset has approximately
the same proportion of cases and controls as the whole data;

(2) Start from a model with fixed covariates (e.g. 18 PCs, sex and age),
and perform forward regression on the subset,
i.e. iteratively choose a feature (in addition to the fixed covariates)
to add whose inclusion enables the model to explain the largest variance,
and select s models with the lowest Bayesian Information Criterion (BIC)
to enter the candidate model pool;

(3) Repeat (1)-(2) B times (B is set by sim_times), and take all the distinct models
in the candidate model pool as the candidate models.
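
As an illustration only, a single simulation of Step 1 (1)-(2) can be sketched in base R on simulated data.
This is not the code used by the package: the toy data, the variable names, the use of the deviance
to rank features, and the choice s = 3 are all assumptions made for this sketch.

\preformatted{
set.seed(1)
n <- 400
# hypothetical toy data: binary outcome, one fixed covariate, five candidate features
covar <- data.frame(sex = rbinom(n, 1, 0.5))
feats <- as.data.frame(matrix(rbinom(n * 5, 1, 0.3), n, 5))
names(feats) <- paste0("f", 1:5)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * feats$f2 - 1.0 * feats$f4))
dat <- cbind(outcome = y, covar, feats)

# (1) stratified subsample: draw half of the controls and half of the cases
idx <- unlist(lapply(split(seq_len(n), dat$outcome),
                     function(i) sample(i, floor(length(i) / 2))),
              use.names = FALSE)
sub <- dat[idx, ]

# (2) forward selection on the subsample, starting from the fixed covariate
selected <- character(0)
remaining <- names(feats)
path <- list()
while (length(remaining) > 0) {
  # deviance of each model obtained by adding one more feature (assumed ranking rule)
  dev <- sapply(remaining, function(f) {
    deviance(glm(reformulate(c("sex", selected, f), "outcome"),
                 data = sub, family = binomial))
  })
  best <- names(which.min(dev))
  selected <- c(selected, best)
  remaining <- setdiff(remaining, best)
  fit <- glm(reformulate(c("sex", selected), "outcome"),
             data = sub, family = binomial)
  path[[length(path) + 1]] <- list(features = selected, bic = BIC(fit))
}

# keep the s = 3 models on the forward path with the lowest BIC
bics <- sapply(path, function(m) m$bic)
pool <- path[order(bics)[1:3]]
}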

Step 2: select the best model using 10-fold cross-validation.

(1) Randomly split the whole data into 10 groups with approximately equal sizes,
using stratified sampling when the outcome is binary;

(2) In each of the 10 folds, use a different group as the test dataset,
and take the remaining groups as the training dataset.
Then, fit all the candidate models on the training dataset,
and use these fitted models to compute the additional variance explained by features
(out-of-sample R2) in the test dataset.
Finally, select the candidate model with the biggest
average out-of-sample R2 as the best model.
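
As a rough sketch of Step 2, the base R code below compares two hypothetical candidate models by 10-fold cross-validation.
It is not the package's implementation: it assumes a continuous outcome fitted with a linear model
(so the split is simple rather than stratified), simulated data, and one common definition of out-of-sample R2
(one minus the ratio of the residual to the total sum of squares in the test set).

\preformatted{
set.seed(2)
n <- 500
# hypothetical toy data: one fixed covariate (age) and two candidate features
dat <- data.frame(age = rnorm(n), f1 = rbinom(n, 1, 0.4), f2 = rbinom(n, 1, 0.4))
dat$outcome <- 0.5 * dat$age + 1.0 * dat$f1 + rnorm(n)

# two hypothetical candidate models, given as feature sets added to the fixed covariate
candidates <- list(m1 = "f1", m2 = c("f1", "f2"))

# out-of-sample R2 of a fitted model evaluated on held-out data
oos_r2 <- function(fit, test) {
  pred <- predict(fit, newdata = test)
  1 - sum((test$outcome - pred)^2) / sum((test$outcome - mean(test$outcome))^2)
}

# (1) randomly split the data into 10 groups of approximately equal size
fold <- sample(rep(1:10, length.out = n))

# (2) average additional out-of-sample R2 of each candidate over the 10 folds
gain <- sapply(candidates, function(feat) {
  mean(sapply(1:10, function(k) {
    train <- dat[fold != k, ]
    test <- dat[fold == k, ]
    null_fit <- lm(outcome ~ age, data = train)
    full_fit <- lm(reformulate(c("age", feat), "outcome"), data = train)
    oos_r2(full_fit, test) - oos_r2(null_fit, test)
  }))
})

# the candidate with the largest average gain is taken as the best model
best <- names(which.max(gain))
}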

Function \code{\link{do_cv_step1}} implements Step 1 (1)-(2) described above.
Function \code{\link{infer_step1}} implements Step 1 (2) described above.
Function \code{\link{infer_fixedfeatures}} is used to fit all the candidate models on the training dataset,
and compute the additional variance explained by features (out-of-sample R2) in the test dataset,
as described in Step 2 (2) above.
}
\examples{
## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is an example dataset
## which contains the outcome (binary), sex, age and 18 PCs

## visualise the covariates data
## we will use only the first two covariates: sex and age in the example
head(HTRX::example_data_nosnp)

## visualise the genotype data for the first genome
head(HTRX::example_hap1)

## we perform HTRX on the first 4 SNPs
## we first generate all the haplotype data, as defined by HTRX
HTRX_matrix=make_htrx(HTRX::example_hap1[1:300,1:4],
                      HTRX::example_hap2[1:300,1:4])

## If the data is haploid, please set
## HTRX_matrix=make_htrx(HTRX::example_hap1[1:300,1:4],
##                       HTRX::example_hap1[1:300,1:4])

## then perform HTRX using 2-step cross-validation in a single small example
## to compute additional variance explained by haplotypes
## If you want to compute total variance explained, please set gain=FALSE
htrx_results <- do_cv(HTRX::example_data_nosnp[1:300,1:2],
                      HTRX_matrix,train_proportion=0.5,
                      sim_times=1,featurecap=4,usebinary=1,
                      method="simple",criteria="BIC",
                      gain=TRUE,runparallel=FALSE,verbose=TRUE)
## If we want to compute the total variance explained
## we can set gain=FALSE in the above example

## Below is an example with a large sample size and simulations
\donttest{
HTRX_matrix=make_htrx(HTRX::example_hap1[,1:4],
                      HTRX::example_hap2[,1:4])
## next compute the maximum number of independent features
featurecap=htrx_max(nsnp=4,cap=3)
## then perform HTRX using 2-step cross-validation
htrx_results <- do_cv(HTRX::example_data_nosnp[,1:3],
                      HTRX_matrix,train_proportion=0.5,
                      sim_times=10,featurecap=featurecap,usebinary=1,
                      method="stratified",criteria="BIC",
                      gain=TRUE,runparallel=FALSE,verbose=TRUE)
}
}
\references{
Barrie W, Yang Y, Attfield K E, et al. Genetic risk for Multiple Sclerosis originated in Pastoralist Steppe populations. bioRxiv (2022).

Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 7, 1-26 (1979).

Kass, R. E. & Wasserman, L. A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion. J. Am. Stat. Assoc. 90, 928-934 (1995).
}
