%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{rphast}
% \VignetteDepends{ape}
% \VignetteKeywords{Phylogenetics, conservation, Hidden Markov Models}
% \VignettePackage{rphast}

\documentclass[10pt]{article}
\usepackage{Sweave}
\usepackage{amsmath,epsfig,fullpage,url,hyperref}


\begin{document}
\title{Detection of Accelerated Regions}
\author{M. J. Hubisz, K. S. Pollard, and A. Siepel}
\SweaveOpts{echo=TRUE,fig=FALSE,eval=TRUE,include=FALSE,engine=R,keep.source=TRUE}
\maketitle

Note: The source code for this example is not included in the rphast 
package (because the run time is too long for routine CRAN checks), but is 
available at the RPHAST webpage:\break
\url{http://compgen.bscb.cornell.edu/rphast/vignette2.Rnw}.

This example will go through the basic steps of detecting accelerated
evolution in a subtree.  In this case, we will look for accelerated regions
in the mouse-rat subtree, focusing on alignments from human
chromosome 22.  

The ideal way to perform this analysis would be to first identify conserved
elements using the same procedure as in example 1, but in a subset of the
alignment which excludes mouse and rat.  Then one could examine the full
alignments for signs of acceleration in mouse-rat in the identified
elements.  However, to keep the example simple we will make use of
precomputed conserved elements from the UCSC genome browser.

The multiple alignments and conserved elements for this example can be
obtained using the UCSC Table Browser for the hg18 assembly.  We also make
use of a neutral phylogenetic model from UCSC.  For convenience, we have
included the conserved elements and neutral model with the RPHAST package.
We have also provided a direct link to the alignment file (see below).

We begin by initializing RPHAST and loading the neutral model and conserved
elements.  We have to change the feature names for the conserved elements
to ensure that they match the name of the reference genome in the alignment
(``hg18''). 
<<rarAnalysis1,fig=FALSE>>=
require("rphast")

exampleArchive <- system.file("extdata", "examples.zip", 
                              package="rphast")
unzip(exampleArchive, c("placentalMammals.mod", "chr22-elements.bed"))

neutralMod <- read.tm("placentalMammals.mod")

elements <- read.feat("chr22-elements.bed")
unique(elements$seqname)
elements$seqname <- "hg18" 
@ 

Next, let's download the posted version of the alignment, which contains a
subset of the full 44 species (to minimize download and compute time).  
The full MAF file can be downloaded through the UCSC Table Browser (under
the hg18 assembly, Comparative Genomics Group,  
Conservation track, multiz44way table).  

Since this is still quite a large alignment, we will use the
pointer.only=TRUE option to read.msa.  This option causes the alignment
object to be stored on the C side, and handled only ``by reference'' (i.e.,
using a pointer) in R.  It can be much more efficient than passing the
entire object between R and C on every function call.  However, some care
must be taken when working with pointer objects in RPHAST, because, unlike
ordinary R objects, their values may be changed as a result of a function
call.  When RPHAST functions change the value of arguments that
are passed as pointers, this behavior is always documented in the help
files.

<<rarAnalysis2,fig=FALSE>>=
if (!file.exists("chr22.maf")) {
  if (!file.exists("chr22.maf.zip"))
    download.file("http://compgen.bscb.cornell.edu/rphast/chr22.maf.zip",
                  "chr22.maf.zip", method="auto")
  unzip("chr22.maf.zip", "chr22.maf")
}
align <- strip.gaps.msa(read.msa("chr22.maf", pointer.only=TRUE))
unlink("chr22.maf")
@ 

In this type of analysis, it is usually necessary to apply a strict set of
filters to the alignments to avoid false positive ``accelerated regions''
that result from misalignments and other data quality problems.  Typically,
we would use synteny, duplication, and repeat filters, as well as filters
to remove regions with too much missing data.  Here we will 
implement a missing data filter.

The following code defines a set of features containing all positions that
do not have missing data in mouse or rat. It then creates another set of
features containing positions with data in a minimum of four species.  We
use the coverage.feat function to take the intersection of these two sets
of features and the conserved elements.  This produces a set of
``informative elements'' that pass the missing data filters.

<<rarAnalysis3,fig=FALSE>>=

hasMouseRat <- informative.regions.msa(align, 2, c("mm9", "rn4"))
hasAtLeastFour <- informative.regions.msa(align, 4)
informativeElements <- coverage.feat(elements, hasMouseRat, 
                                     hasAtLeastFour, get.feats=TRUE)
coverage.feat(informativeElements)/coverage.feat(elements)

@ 

So 79\% of the conserved elements have passed our informative elements filter.

We will also regularize the lengths of the conserved elements, to
simplify our likelihood ratio tests.  (Otherwise, we will have to consider
the length dependency of the likelihood ratios when evaluating significance.)
We will simply split the filtered conserved elements
into fragments of size 50, and discard the leftover fragments:

<<rarAnalysis4,fig=FALSE>>=
splitLength <- 50
splitElements <- split.feat(informativeElements, f=splitLength, 
                            drop=TRUE)

@ 

Now we will run phyloP on these split elements.  PhyloP will calculate a
likelihood ratio for every feature, along with parameter estimates and
approximate p-values (based on an assumed chi-square null distribution):

<<rarAnalyis5,fig=FALSE>>=
obsPhyloP <- phyloP(neutralMod, msa=align, mode="ACC", 
                    features=splitElements, subtree="mm9-rn4")

@ 

For these short elements with a fairly small species set, we would like to
compute empirical p-values, instead of relying on the assumption of a
chi-square null distribution (which holds only asymptotically).  We will
use non-parametric simulations to calculate these empirical p-values.  At
the end of the vignette we will also show an alternative, parametric method
for computing empirical p-values.

We begin by extracting the sites from the original alignment that
correspond to the identified elements, using the ``extract.feature.msa''
function.  Because this function will change a pointer object (as
documented in the help file), we work with a copy of it.  We then generate
5000 synthetic alignments by sampling (with replacement) from this set of
sites.  In real scenarios we would perform many more simulations (at least 
100,000), but we choose 5000 here to make the vingette run quickly.
It is also much more efficient to generate one long
alignment, and interpret it as a concatenation of shorter alignments.
We then run phyloP on this alignment to obtain our null distribution of log
likelihood ratios.  We take care to apply phyloP in exactly the same way as
for the real data.
<<rarAnalysis6,fig=FALSE>>=

elementAlign <- extract.feature.msa(copy.msa(align), informativeElements, 
                                    pointer.only=TRUE)

nrep <- 5000
simMsa <- sample.msa(elementAlign, nrep*splitLength, replace=TRUE)


# produce features allowing long alignment to be interpreted as 
# concatenation of shorter alignments
startIdx <- seq(from=1, by=splitLength, length.out=nrep)
features <- feat(seqname=names.msa(simMsa)[1], src="sim", feat=".",
                 start=startIdx, 
                 end=startIdx+splitLength-1)

nonParaPhyloP <- phyloP(neutralMod, msa=simMsa, mode="ACC",
                         features=features, subtree="mm9-rn4")
@

Let us compare the distributions likelihood ratios for the real and simulated (null) data
sets using a Q-Q plot.  We will also create density plots for both
distributions, on the same scale.
<<qqPlotAndDensity,fig=TRUE,prefix=FALSE,width=6,height=8>>=

layout(matrix(c(1,2), nrow=2), heights=c(0.7, 0.3))
par(mar=c(4.5, 4, 4, 2), mgp=c(2.5,1,0), cex.axis=1.5, cex.lab=1.5)
qqplot(nonParaPhyloP$lnlratio,obsPhyloP$lnlratio,
xlim=c(0,15),ylim=c(0,15), xlab="Simulated likelihood ratio",
      ylab="Observed likelihood ratio")
abline(0, 1, lty=2)
par(mar=c(4,4,1,2))
plot(density(obsPhyloP$lnlratio,adjust=3), lty=1,xlim=c(0,15),
     xlab="Likelihood Ratio",
     ylab="Density",main="", col="red")
lines(density(nonParaPhyloP$lnlratio,adjust=3), lty=1,
col="black",xlim=c(0,15))

@ 

The plots can be seen in Figure \ref{fig:qqplot}.  
Notice that the two distributions appear to be quite similar from the density
plots.  However, an excess of high likelihood ratios for the
observed elements is clearly evident in the Q-Q plot.

Let us now calculate empirical p-values for the real data, and  
adjust them for multiple comparisons using p.adjust.  We will use the
Benjamini-Hochberg method to compute approximate false discovery rates
(FDRs) for each element.  
<<rarAnalysis8,fig=FALSE>>=

empirical.pval <- function(x, dist) {
  sum(x <= dist)/length(dist)
}

nonParaPval <- sapply(obsPhyloP$lnlratio, empirical.pval, 
                      nonParaPhyloP$lnlratio)
nonParaFDR <- p.adjust(nonParaPval, method="BH")
@ 

We will extract those elements having FDR $<$ 0.05 and treat them as
candidate ``RAR''s (Rodent Accelerated Regions).  We will write these
elements to a file, which could be used, say, to display them as a custom
track in the UCSC Genome Browser, or to perform an analysis using Galaxy.
Notice that we have to change the sequence names to allow them to be
recognized by the UCSC Genome Browser.
<<rarAnalysis9,fig=FALSE>>=
nonParaSigFeats <- splitElements[nonParaFDR < 0.05,]
nrow(nonParaSigFeats)
nonParaSigFeats$feature <- "RAR" 
nonParaSigFeats$score <- obsPhyloP$lnlratio[nonParaFDR < 0.05]
nonParaSigFeats$seqname <- "hg18.chr22"
write.feat(nonParaSigFeats, "RAR.gff")

@ 
As a sanity check, let us estimate branch lengths for the RARs and compare
them with estimates for the larger set of conserved elements. We will use
the ``ape'' package to plot the estimated trees side-by-side (Figure 
\ref{fig:ape-plot}).   We can see
that the RARs do show a pronounced evolutionary speedup in the rodents.

<<apePlot,fig=TRUE,prefix=FALSE>>=
consEleModel <- phyloFit(elementAlign, init.mod=neutralMod, no.opt=c("backgd", "ratematrix"))
rarAlign <- extract.feature.msa(copy.msa(align), nonParaSigFeats)
rarModel <- phyloFit(rarAlign, init.mod=neutralMod, no.opt=c("backgd", "ratematrix"))

require("ape")
par(mfrow=c(1,2), mar=c(5,2,4,2))
maxX <- depth.tree(rarModel$tree, "mm9")
plot.phylo(read.tree(text=consEleModel$tree), x.lim=c(0, 0.5), 
           main="Conserved Elements")
plot.phylo(read.tree(text=rarModel$tree), x.lim=c(0, 0.5), 
           main="RAR")

@ 

We made use of phyloP in this example because it happens to be optimized
for tests of 
clade-specific acceleration or conservation.  However, we might want to perform a similar likelihood ratio
test based on other models available in PHAST, but not already supported by
predefined functions for likelihood ratio tests.  Below we show how the
same test could be implemented by making two calls to phyloFit,
corresponding to the null and alternative models.  The R function below could
be used in place of the call to phyloP.

<<rarAnalysis11,eval=FALSE,fig=FALSE>>=

acc.test <- function(msa, mod, subtreeNode) {
  nullMod <- phyloFit(msa, init.mod=mod, scale.only=TRUE, 
                      no.opt=c("backgd", "ratematrix"),
                      quiet=TRUE)
  altMod <- phyloFit(msa, init.mod=mod, scale.only=TRUE, scale.subtree=subtreeNode,
                     no.opt=c("backgd", "ratematrix"),
                     quiet=TRUE)
  tNullSub <- branchlength.tree(subtree(nullMod$tree, subtreeNode, super.tree=FALSE))
  tAltSub <- branchlength.tree(subtree(altMod$tree, subtreeNode, super.tree=FALSE))
  if (tNullSub >= tAltSub) return(0)
  altMod$likelihood - nullMod$likelihood
}

# example function call
llr <- acc.test(sub.msa(align, start.col=splitElements[1,"start"],
                        end.col=splitElements[1,"end"],
                        refseq="hg18"),
                neutralMod, "mm9-rn4")
@ 

An alternative approach for characterizing the null distribution of log
likelihood ratios would be to generate data from a parametric model.  Below
we demonstrate this approach.  For our null model, we use a scaled version 
of the neutral model, with a scale factor estimated from the conserved
elements.  In estimating this scale factor, we exclude the rodents, because
these are the species we wish to test for acceleration.

<<rarAnalysis12,fig=FALSE>>=
seqs <- names(align)
keepSeqs <- seqs[seqs != "mm9" & seqs != "rn4"]
prunedMod <- neutralMod
prunedMod$tree <- prune.tree(neutralMod$tree, keepSeqs, all.but=TRUE)
subAlign <- sub.msa(elementAlign, seqs=keepSeqs)
superTreeMod <- phyloFit(subAlign, init.mod=neutralMod, 
                         scale.only=TRUE, no.opt=c("backgd", "ratematrix"))
superTreeScale <- branchlength.tree(superTreeMod$tree) /
  branchlength.tree(prunedMod$tree)
superTreeScale

scaledNeutralModel <- neutralMod
scaledNeutralModel$tree <- rescale.tree(neutralMod$tree, 
                                        scale=superTreeScale)

# also remove species from model which are not in alignment
scaledNeutralModel$tree <- prune.tree(scaledNeutralModel$tree, 
                                      seqs=seqs, all.but=TRUE)
@ 
Based on this model, we can now generate synthetic alignments using the
simulate.msa function.  We can then run phyloP on these alignments in the
same way we did for the alignments simulated by the nonparametric approach.
<<rarAnalysis13,fig=FALSE>>=
# now we can do parametric simulations using our scaled
# nuetral model.  Again we will just simulate one big
# alignment, and split it up later, for efficiency.
simMsa <- simulate.msa(scaledNeutralModel, nrep*splitLength)

paraPhyloP <- phyloP(neutralMod, msa=simMsa, mode="ACC",
                     features=features, subtree="mm9-rn4")
@ 
Let us see how different the RARs look, if we compute p-values based on
this null distribution instead.
<<rarAnalysis14,fig=FALSE>>=
paraPval <- sapply(obsPhyloP$lnlratio, empirical.pval, 
                   paraPhyloP$lnlratio)
paraFDR <- p.adjust(paraPval, method="BH")
paraSigIdx <- which(paraFDR < 0.05)
length(paraSigIdx)
@ 
The results of this comparison will be somewhat different each time the
code is executed, because the empirical p-values depend on random sampling, but
when we ran it we found that the nonparametric method identified more
elements.  In general, this method may be somewhat less conservative,
perhaps because the
parametric model is overly simple, by failing to allow for rate
variation among sites, indels, missing data, and other features of the real
alignments.  But the number of samples here is quite small (only 5000), so
stochastic effects may cause one run to be quite different from another.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% figures
\begin{figure}
  \begin{center}
    \includegraphics[width=6in,height=8in,angle=0]{qqPlotAndDensity}
  \end{center}
  \caption{On top, Q-Q plot for observed and simulated likelihood ratios.  On bottom, distributions of observed (red) and simulated (black) values.}
  \label{fig:qqplot}
\end{figure}

\begin{figure}
  \begin{center}
    \includegraphics[width=6in,height=4.5in,angle=0]{apePlot}
  \end{center}
  \caption{Trees estimated from conserved elements (left) and predicted rodent accelerated regions (right), plotted by the ape package.}
  \label{fig:ape-plot}
\end{figure}
\end{document}