%% $Id: panpca.Rd 160 2014-07-08 09:04:30Z larssn $

\name{panpca}
\alias{panpca}
\title{
  Principal component analysis of a pan-matrix
}
\description{
  Computes a principal component decomposition of a pan-matrix, with optional scaling and weighting.
}
\usage{
panpca(pan.matrix,scale=0.0,weights=rep(1,dim(pan.matrix)[2]))
}
\arguments{
  \item{pan.matrix}{A \code{Panmat} object, see \code{\link{panMatrix}} for details.}
  \item{scale}{An optional scale to control how copy numbers should affect the PCA, see below.}
  \item{weights}{Vector of optional weights of gene clusters.}
}
\details{
  A principal component analysis (PCA) can be computed for any matrix, including a pan-matrix. The principal components will in this case be linear combinations of the gene clusters. One major idea behind PCA is to truncate the space, i.e. instead of considering the genomes as points in a high-dimensional space spanned by all gene clusters, we look for a few \sQuote{smart} combinations of the gene clusters, and visualize the genomes in a low-dimensional space spanned by these directions.
  
  The \samp{scale} argument controls how copy number differences influence the PCA. Usually we assume that going from 0 to 1 copy of a gene is the big change in a genome, and that going from 1 to 2 (or more) copies matters less. Prior to computing the PCA, the \samp{pan.matrix} is transformed according to the following affine mapping: If the original value in \samp{pan.matrix} is \samp{x}, and \samp{x} is not 0, then the transformed value is \samp{1 + (x-1)*scale}. Note that with \samp{scale=0.0} (default) this results in 1 regardless of how large \samp{x} was. In this case the PCA only distinguishes between presence and absence of gene clusters. If \samp{scale=1.0} the value \samp{x} is left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between 1 copy and 0 copies. For any \samp{scale} between 0.0 and 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA should be affected, and to what degree, by differences in copy numbers beyond 1.
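
  The mapping described above can be sketched in a few lines of base R. This is only an illustration on a made-up vector of copy numbers; the names \samp{x}, \samp{s} and \samp{y} are hypothetical, and the code is not taken from the implementation of \code{panpca}.

  \preformatted{
# Hypothetical copy numbers for one gene cluster across four genomes
x <- c(0, 1, 2, 5)
s <- 0.5                       # the scale, here halfway between 0.0 and 1.0

# Transform the non-zero entries only; zeros are left as they are
y <- x
y[x > 0] <- 1 + (x[x > 0] - 1) * s
y                              # 0.0 1.0 1.5 3.0
  }

  With \samp{s <- 0} every non-zero entry becomes 1, and with \samp{s <- 1} the vector is returned unchanged.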
    
  The PCA can also up- or downweight some clusters compared to others. The vector \samp{weights} must contain one value for each column in \samp{pan.matrix}. The default is to use flat weights, i.e. all clusters count equally. See \code{\link{geneWeights}} for alternative weighting strategies.
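
  As a rough sketch of what a custom weight vector looks like, the base R code below builds one for a toy matrix. The matrix \samp{pm} and the vector \samp{w} are made up for illustration. The last step assumes that the weights act as column-wise multipliers; this page does not confirm that this is how \code{panpca} applies them.

  \preformatted{
# Toy matrix: 3 genomes (rows) and 4 gene clusters (columns)
pm <- matrix(c(1, 0, 1, 1,
               1, 1, 0, 1,
               1, 1, 1, 0), nrow = 3, byrow = TRUE)

w <- rep(1, ncol(pm))          # flat weights, one value per column
w[1] <- 0.1                    # downweight the first cluster
stopifnot(length(w) == ncol(pm))

# Assumption: weighting multiplies each column by its weight
pm.w <- sweep(pm, 2, w, "*")
  }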
    
  The functions \code{\link{plotScores}} and \code{\link{plotLoadings}} can be used to visualize the results of \code{\link{panpca}}.
}
\value{
  A \code{Panpca} object is returned from this function. This is a small (S3) extension of a \code{list} with elements \samp{Evar}, \samp{Scores}, \samp{Loadings}, \samp{Scale} and \samp{Weights}. 
  
  \samp{Evar} is a vector with one number for each principal component. It contains the relative explained variance for each component, and it always sums to 1.0. This value indicates the importance of each component, and it is always in descending order, the first component being the most important. The \samp{Evar} is typically the first result you look at after a PCA has been computed, as it indicates how many components (directions) you need to capture the bulk of the total variation in the data.
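
  One common way to read \samp{Evar} is through its cumulative sum. The sketch below uses a made-up vector \samp{evar} and an arbitrary 90 percent threshold; with a real result you would use the \samp{Evar} element of the returned object instead.

  \preformatted{
# Hypothetical relative explained variances, in descending order
evar <- c(0.55, 0.25, 0.12, 0.05, 0.03)

cum <- cumsum(evar)            # cumulative share of the total variation
n.comp <- which(cum >= 0.9)[1] # first component reaching the threshold
n.comp                         # 3
  }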
  
  \samp{Scores} is a matrix with one column for each principal component and one row for each genome. The columns are ordered corresponding to the elements in \samp{Evar}. The scores are the coordinates of each genome in the principal component space. See \code{\link{plotScores}} for how to visualize genomes in the score-space.
  
  \samp{Loadings} is a matrix with one column for each principal component and one row for each gene cluster. The columns are ordered corresponding to the elements in \samp{Evar}. The loadings are the contributions from each original gene cluster to the principal component directions. NOTE: Only gene clusters having a non-zero variance are used in a PCA. Gene clusters with the same value for every genome have no impact and are discarded from the \samp{Loadings}. See \code{\link{plotLoadings}} for how to visualize gene clusters in the loading space.
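
  To see which gene clusters would be affected by the note above, the columns with zero variance can be identified in base R. This is a sketch on a made-up matrix \samp{pm}; how \code{panpca} performs the filtering internally is not shown here.

  \preformatted{
# Toy matrix: 3 genomes (rows) and 3 gene clusters (columns)
pm <- matrix(c(1, 1, 0,
               1, 0, 1,
               1, 1, 1), nrow = 3, byrow = TRUE)

# The first cluster is present once in every genome, so its variance is 0
keep <- apply(pm, 2, var) > 0
keep                           # FALSE TRUE TRUE
pm.var <- pm[, keep, drop = FALSE]
  }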
  
  \samp{Scale} and \samp{Weights} are copies of the corresponding input arguments.
  
  The generic functions \code{\link{plot.Panpca}}, \code{\link{summary.Panpca}} and \code{\link{str.Panpca}} are available for \code{Panpca} objects.
}

\author{
  Lars Snipen and Kristian Hovde Liland.
}

\seealso{
  \code{\link{plotScores}}, \code{\link{plotLoadings}}, \code{\link{panTree}}, \code{\link{distManhattan}}, \code{\link{geneWeights}}.
}
\examples{
# Loading two Panmat objects in the micropan package
data(list=c("Mpneumoniae.blast.panmat","Mpneumoniae.domain.panmat"),package="micropan")

# Panpca based on a BLAST clustering Panmat object
ppca.blast <- panpca(Mpneumoniae.blast.panmat)
plot(ppca.blast) # The generic plot function
plotScores(ppca.blast) # A score-plot

# Panpca based on domain sequence clustering Panmat object
w <- geneWeights(Mpneumoniae.domain.panmat,type="shell")
ppca.domains <- panpca(Mpneumoniae.domain.panmat,scale=0.5,weights=w)
summary(ppca.domains)
plotLoadings(ppca.domains)
}
