\name{CCA}
\alias{CCA}
\title{Perform sparse canonical correlation analysis using the penalized
matrix decomposition.}
\description{
Given matrices X and Z, which represent two sets of features on the same
set of samples, find sparse u and v such that u'X'Zv is large.  For X and Z, the
samples are on the rows and the features are on the columns. X and Z
must have the same number of rows, but may (and usually will) have different
numbers of columns. The columns of X and/or Z can be unordered or
ordered. If unordered, then a lasso penalty will be used to obtain the
corresponding canonical vector. If ordered, then a fused lasso penalty
will be used; this will result in smoothness.
} 
\usage{
CCA(x, z, typex=c("standard", "ordered"),typez=c("standard","ordered"), penaltyx=NULL, penaltyz=NULL, K=1,
niter=15, v=NULL, trace=TRUE, standardize=TRUE, xnames=NULL, znames=NULL, chromx=NULL,
chromz=NULL, upos=FALSE, uneg=FALSE, vpos=FALSE, vneg=FALSE, outcome=NULL, y=NULL, cens=NULL)
}
\arguments{
  \item{x}{Data matrix; samples are rows and columns are
    features. Cannot contain missing values.}
  \item{z}{Data matrix; samples are rows and columns are
      features.  Cannot
      contain missing values.}
  \item{typex}{Are the columns of x unordered (typex="standard") or
      ordered (typex="ordered")? If "standard", then a lasso penalty is
      applied to u, to enforce sparsity. If "ordered" (generally used
      for CGH data), then a fused
      lasso penalty is applied, to enforce both sparsity and
      smoothness.}
  \item{typez}{Are the columns of z unordered (typez="standard") or
      ordered (typez="ordered")? If "standard", then a lasso penalty is
      applied to v, to enforce sparsity. If "ordered" (generally used
      for CGH data), then a fused
      lasso penalty is applied, to enforce both sparsity and
      smoothness.}
    \item{penaltyx}{The penalty to be applied to the matrix x, i.e. the
      penalty that results in the canonical vector u. If typex is
      "standard" then the L1 bound on u is penaltyx*sqrt(ncol(x)). In
      this case penaltyx must be between 0 and 1 (a larger L1 bound
      corresponds to less penalization). If typex is "ordered" then penaltyx is
      the fused lasso penalty lambda, which must be non-negative (a larger
      lambda corresponds to more penalization).}
    \item{penaltyz}{The penalty to be applied to the matrix z, i.e. the
      penalty that results in the canonical vector v. If typez is
      "standard" then the L1 bound on v is penaltyz*sqrt(ncol(z)). In
      this case penaltyz must be between 0 and 1 (a larger L1 bound
      corresponds to less penalization). If typez is "ordered" then penaltyz is
      the fused lasso penalty lambda, which must be non-negative (a larger
      lambda corresponds to more penalization).}
    \item{K}{The number of u's and v's desired; that is, the number of
      canonical vectors to be obtained.}
  \item{niter}{How many iterations should be performed? Default is 15.}
  \item{v}{The first K columns of the v matrix of the SVD of X'Z. If
    NULL, then the SVD of X'Z will be computed inside the CCA function. However, if
    you plan to run this function multiple times, then save the v.init
    component of the output and pass it as this argument, so that the SVD
    does not need to be re-computed (that computation can be time-consuming
    if X and Z both have high dimension).}
  \item{trace}{Print out progress?}
  \item{standardize}{Should the columns of x and z be centered (to have mean zero)
    and scaled (to have standard deviation 1)? Default is TRUE.}
  \item{xnames}{An optional vector of column names for x.}
  \item{znames}{An optional vector of column names for z.}
  \item{chromx}{Used only if typex is "ordered"; allows user to specify a
    vector of length ncol(x) giving the chromosomal location of each CGH
    spot. This is so that smoothness will be enforced within each
    chromosome, but not between chromosomes.}
  \item{chromz}{Used only if typez is "ordered"; allows user to specify a
    vector of length ncol(z) giving the chromosomal location of each CGH
    spot. This is so that smoothness will be enforced within each
    chromosome, but not between chromosomes.}
  \item{upos}{If TRUE, then require elements of u to be positive. FALSE
    by default. Can only be used if typex is "standard".}
  \item{uneg}{If TRUE, then require elements of u to be negative. FALSE
    by default. Can only be used if typex is "standard".}
  \item{vpos}{If TRUE, then require elements of v to be positive. FALSE
    by default. Can only be used if typez is "standard".}
  \item{vneg}{If TRUE, then require elements of v to be negative. FALSE
    by default. Can only be used if typez is "standard".}
  \item{outcome}{If you would like to incorporate a phenotype into CCA
    analysis - that is, you wish to find features that are correlated
    across the two data sets and also correlated
    with a phenotype - then use one of "survival", "multiclass", or
    "quantitative" to indicate outcome type. Default is NULL.}
  \item{y}{If outcome is not NULL, then this is a vector of phenotypes -
    one for each row of x and z. If outcome is "survival" then these are
    survival times; must be non-negative. If outcome is "multiclass"
    then these are class labels (1,2,3,...). Default NULL.}
  \item{cens}{If outcome is "survival" then these are censoring statuses
    for each observation. 1 is complete, 0 is censored. Default NULL.}
}
\details{
This function is useful for performing an integrative analysis of two
sets of measurements taken on the same set of samples: for instance, gene
expression and CGH measurements on the same set of patients. It takes in
two data sets, called x and z, each of which has (the same set of)
samples on the rows. If z is a matrix of CGH data with \emph{ordered} CGH
spots on the columns, then use typez="ordered". If z consists of
unordered columns, then use typez="standard". Similarly for typex.
  
  This function performs the penalized matrix decomposition on the data
  matrix \eqn{X'Z}. Therefore, the results should be the same as running
  the PMD function on \code{t(x) \%*\% z}. However, when ncol(x)>>nrow(x) and
  ncol(z)>>nrow(z), using the CCA function is much faster because it
  avoids computation of \eqn{X'Z}.

  The CCA criterion is as follows: find unit vectors \eqn{u} and \eqn{v} such
  that \eqn{u'X'Zv} is maximized subject to constraints on \eqn{u} and \eqn{v}. If
  typex="standard" and typez="standard" then the constraints on \eqn{u} and \eqn{v} are lasso
  (\eqn{L_1}{L1}). If typex="ordered" then the constraint on \eqn{u} is a fused lasso
  penalty (promoting sparsity and smoothness). Similarly if typez="ordered".
  
  When typex is "standard": the L1 bound of u is penaltyx*sqrt(ncol(x)).

  When typex is "ordered": penaltyx controls the amount of sparsity and
  smoothness in u, via the fused lasso penalty:
  \eqn{\lambda \sum_j |u_j| + \lambda \sum_j |u_j - u_{j-1}|}{lambda sum_j |u_j| + lambda sum_j |u_j - u_(j-1)|}.
  If NULL, then it will be chosen adaptively from the data.

}
\value{
  \item{u}{u is output. If you asked for multiple factors then each
    column of u is a factor. u has dimension ncol(x) x K if you asked for K factors.}
  \item{v}{v is output. If you asked for multiple factors then each
    column of v is a factor. v has dimension ncol(z) x K if you asked for K
    factors.}
  \item{d}{A vector of length K, which can alternatively be computed as
    the diagonal of the matrix \eqn{u'X'Zv}.}
  \item{v.init}{The first K factors of the v matrix of the SVD of
    x'z. This is saved in case this function will be re-run later.}
}
\references{Witten, DM, Tibshirani, R and Hastie, T (2008) A penalized
  matrix decomposition, with applications to
  sparse principal components and canonical correlation
  analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>}
\author{Daniela M. Witten and Robert Tibshirani}
\seealso{\code{\link{PMD}}, \code{\link{CCA.permute}}}
\examples{
# First, do CCA with typex="standard" and typez="standard"
# A simple simulated example
u <- matrix(c(rep(1,25),rep(0,75)),ncol=1)
v1 <- matrix(c(rep(1,50),rep(0,450)),ncol=1)
v2 <- matrix(c(rep(0,50),rep(1,50),rep(0,900)),ncol=1)
x <- u\%*\%t(v1) + matrix(rnorm(100*500),ncol=500)
z <- u\%*\%t(v2) + matrix(rnorm(100*1000),ncol=1000)
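# Illustration only (a sketch, not the internal CCA code): for any fixed
# pair of vectors, the bilinear criterion maximized by CCA can be evaluated
# with two matrix-vector products, so the ncol(x) by ncol(z) cross-product
# matrix never has to be formed. The names u0, v0, crit.direct and crit.fast
# are hypothetical and used only for this check.
u0 <- rnorm(ncol(x))
u0 <- u0/sqrt(sum(u0^2)) # arbitrary unit vector of length ncol(x)
v0 <- rnorm(ncol(z))
v0 <- v0/sqrt(sum(v0^2)) # arbitrary unit vector of length ncol(z)
crit.direct <- drop(t(u0) \%*\% (t(x) \%*\% z) \%*\% v0) # forms the cross-product explicitly
crit.fast <- sum((x \%*\% u0) * (z \%*\% v0)) # matrix-vector products only
all.equal(crit.direct, crit.fast) # the two should agree up to rounding error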
# Can run CCA with default settings, and can get e.g. 3 components
out <- CCA(x,z,typex="standard",typez="standard",K=3)
print(out,verbose=TRUE) # To get less output, just print(out)
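# A quick look at the fitted object (a sketch that uses only the components
# listed in the Value section): each column of out$u and out$v is one
# canonical vector, and the nonzero entries mark the selected features.
dim(out$u) # one row per column of x, one column per component
dim(out$v) # one row per column of z, one column per component
colSums(out$u != 0) # number of features of x selected in each component
colSums(out$v != 0) # number of features of z selected in each component
out$d # one value per component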
# Or can use CCA.permute to choose optimal parameter values
perm.out <- CCA.permute(x,z,typex="standard",typez="standard",nperms=7)
print(perm.out)
plot(perm.out)
out <- CCA(x,z,typex="standard",typez="standard",K=1,penaltyx=perm.out$bestpenaltyx,penaltyz=perm.out$bestpenaltyz, v=perm.out$v.init)
print(out)
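# Sketch: compare the L1 norm of the fitted u with the documented bound for
# typex="standard", then re-use the saved SVD start (out$v.init) when trying
# other penalty values. The penalty value 0.2 and the name out2 are arbitrary
# choices made only for this illustration.
sum(abs(out$u)) # expected to be at most the bound below, up to numerical tolerance
perm.out$bestpenaltyx*sqrt(ncol(x))
out2 <- CCA(x,z,typex="standard",typez="standard",K=1,penaltyx=0.2,penaltyz=0.2,
v=out$v.init,trace=FALSE) # passing v avoids recomputing the SVD
print(out2)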


##### The remaining examples are commented out, but uncomment to run: ######

# Not run, to save time:
## Now try CCA with a constraint that elements of u must be negative and
## elements of v must be positive:
#perm.out <- CCA.permute(x,z,typex="standard",typez="standard",nperms=7,
#penaltyxs=seq(.1,.7,len=10), penaltyzs=seq(.1,.7,len=10), uneg=TRUE, vpos=TRUE)
#print(perm.out)
#plot(perm.out)
#out <- CCA(x,z,typex="standard",typez="standard",K=1,penaltyx=perm.out$bestpenaltyx,penaltyz=perm.out$bestpenaltyz,
#v=perm.out$v.init, uneg=TRUE, vpos=TRUE)
#print(out)
#
#
## Suppose we also have a quantitative outcome, y, and we want to find
## features in x and z that are correlated with each other and with the
## outcome:
#y <- rnorm(nrow(x))
#perm.out <- CCA.permute(x,z,typex="standard",typez="standard",outcome="quantitative",y=y, nperms=6)
#print(perm.out)
#out<-CCA(x,z,typex="standard",typez="standard",outcome="quantitative",y=y,penaltyx=perm.out$bestpenaltyx,penaltyz=perm.out$bestpenaltyz)
#print(out)
#
## now, do CCA with type="ordered"
## Example involving the breast cancer data: gene expression + CGH
#set.seed(22)
#data(breastdata)
#attach(breastdata)
#dna <- t(dna)
#rna <- t(rna)
#perm.out <- CCA.permute(x=rna,z=dna[,chrom==1],typex="standard", typez="ordered",nperms=5,penaltyxs=seq(.02,.7,len=10))
## We run CCA using all gene exp. data, but CGH data on chrom 1 only.
#print(perm.out)
#plot(perm.out)
#out <- CCA(x=rna,z=dna[,chrom==1], typex="standard", typez="ordered",penaltyx=perm.out$bestpenaltyx,
#v=perm.out$v.init, penaltyz=perm.out$bestpenaltyz, xnames=substr(genedesc,1,20),
#znames=paste("Pos", sep="", nuc[chrom==1])) # Save time by inputting  lambda and v
#print(out) # could do print(out,verbose=TRUE)
#print(genechr[out$u!=0]) # Cool! The genes associated w/ gain or loss
## on chrom 1 are located on chrom 1!!
#par(mfrow=c(1,1))
#PlotCGH(out$v, nuc=nuc[chrom==1], chrom=chrom[chrom==1],
#main="Regions of gain/loss on Chrom 1 assoc'd with gene expression")
#detach(breastdata)

}

