bioMCMC: Testing for common ancestry

Our commentary on Douglas Theobald's test for Universal Common Ancestry (UCA) just went online. The original idea was to write a user-friendly review of his analysis, described in "A formal test of the theory of universal common ancestry", but after a long e-mail exchange between Douglas and us -- actually between him and David, I didn't say much -- we decided to expand the article to include some remaining points of skepticism and to spell out the basic problem with his approach.

His work

His test for UCA compares the hypothesis of one single ancestral lineage diverging into all living forms against scenarios where more than one ancestral population is needed to explain the current diversity of life. That is, he tries to quantify the possibility that two or more ancient life forms are still represented today in the major domains, in comparison with the most natural possibility that only one ancestral life form prevailed (with other life forms eventually going extinct, for instance).

These scenarios can be nicely represented by phylogenies: under the UCA hypothesis all existing species can be connected by a single phylogeny (a connected network or tree), while under the hypothesis of Independent Origins (IO) the species can be partitioned into disconnected groups (with no branch joining them). Then, once we have the best phylogenies under each hypothesis, we can use the arsenal of model selection methods available to choose between the hypotheses.
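To make the model selection step concrete, here is a minimal sketch using the Akaike Information Criterion (AIC), which is one of several possible criteria. It assumes that under IO the disconnected groups are statistically independent, so their log-likelihoods and parameter counts simply add up. All numbers below are hypothetical placeholders, not values from either paper, and the function names are mine.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: lower values indicate a preferred model."""
    return 2.0 * n_params - 2.0 * log_likelihood


def combine_independent(groups):
    """Combine (log-likelihood, parameter count) pairs of disconnected groups.

    Assumption: the groups are independent, so the joint likelihood is the
    product of the per-group likelihoods (log-likelihoods add), and the
    joint model has the sum of the per-group parameters.
    """
    total_lnl = sum(lnl for lnl, _ in groups)
    total_k = sum(k for _, k in groups)
    return total_lnl, total_k


# Hypothetical values: one tree joining all taxa (UCA) ...
uca = (-10500.0, 25)
# ... versus two disconnected trees (IO), each fitted separately.
io = combine_independent([(-5200.0, 14), (-5600.0, 13)])

aic_uca = aic(*uca)
aic_io = aic(*io)
preferred = "UCA" if aic_uca < aic_io else "IO"
print(f"AIC(UCA) = {aic_uca:.1f}, AIC(IO) = {aic_io:.1f}, preferred: {preferred}")
```

The comparison itself is trivial; all the hard work (and all the assumptions) sits in how the log-likelihoods were obtained, which is exactly where the alignment comes in.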

Using a curated set of genes highly conserved among all three domains, his test indicated that UCA is a much better explanation than IO, and that each gene is better represented by its own phylogeny than by forcing all genes to follow the same tree -- that is, horizontal gene transfer (HGT) cannot be neglected.

Our comments

Imagine that I developed a method capable of predicting a candidate's academic success based on a questionnaire. But it works only for PhDs in their mid-twenties who have at least a few high-impact single-author publications. And notice: I'm not assuming at all that such a young genius has a secure place in a university. I guess this half-baked analogy summarizes our contention with Douglas' paper.

What caught our attention is the fact that phylogenetic inference methods assume homology at each site, so it is not surprising that they favor UCA for sufficiently good alignments -- your mileage may vary on the definition of "good". These methods delegate to the alignment the responsibility of handling homology. And he used a data set with particularly convincing evidence of common ancestry. To turn our argument into a picture, we simulated sequences under both UCA and IO scenarios (that is, using one or two phylogenies) and looked at what the resulting alignments would look like. As expected, they were very different.
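As a toy version of this kind of experiment (not the scripts from our study), the sketch below evolves sequences from one random ancestor (UCA) or from two unrelated random ancestors (IO) and compares the average pairwise identity. It makes strong simplifying assumptions: a star-shaped tree per origin, a uniform substitution model over the 20 amino acids, and no insertions or deletions, so no alignment step is needed. The function names and parameter values are hypothetical.

```python
import random
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(length, rng):
    """Draw an ancestral sequence uniformly at random."""
    return [rng.choice(AMINO_ACIDS) for _ in range(length)]


def evolve(seq, p_change, rng):
    """Copy seq, redrawing each site uniformly with probability p_change."""
    return [rng.choice(AMINO_ACIDS) if rng.random() < p_change else aa
            for aa in seq]


def simulate(n_origins, n_taxa, length, p_change, rng):
    """Simulate n_taxa sequences descending from n_origins unrelated ancestors.

    Each ancestor is the centre of a star tree; taxa are assigned to the
    ancestors in round-robin fashion.
    """
    ancestors = [random_sequence(length, rng) for _ in range(n_origins)]
    return [evolve(ancestors[i % n_origins], p_change, rng)
            for i in range(n_taxa)]


def average_pairwise_identity(seqs):
    """Mean fraction of identical sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a == b for a, b in zip(x, y)) / len(x) for x, y in pairs)
    return total / len(pairs)


rng = random.Random(42)
uca_seqs = simulate(n_origins=1, n_taxa=8, length=500, p_change=0.3, rng=rng)
io_seqs = simulate(n_origins=2, n_taxa=8, length=500, p_change=0.3, rng=rng)

uca_identity = average_pairwise_identity(uca_seqs)
io_identity = average_pairwise_identity(io_seqs)
print(f"UCA: {uca_identity:.3f}  IO: {io_identity:.3f}")
```

Even in this crude setting the IO sequences are visibly less similar on average, because pairs that descend from different ancestors agree only by chance.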

Average sequence identity for alignments simulated under UCA and IO
(from doi:10.4081/eb.2012.e5)

The motivation for our skepticism about the UCA test is how it would perform in a blind experiment like Assemblathon or CASP: given a group of sequences of unknown homology status, can the model selection devised by Douglas Theobald tell us whether they share a single ancestry? Our impression is that there would be several decisions to make before doing the actual test -- like optimizing the alignment, possibly removing poorly aligned regions, or refusing to do the test if the alignment is bad -- that might undermine its applicability. So we cannot yet recommend the test for arbitrary data sets.

Frequencies of average identity per column, for alignments simulated under UCA and IO, with real data set values in gray
(from doi:10.4081/eb.2012.e5)
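For readers who want to compute a per-column identity on their own alignments, here is a minimal sketch. It assumes one particular definition, which may differ in detail from the one behind the figure: the identity of a column is the fraction of residue pairs in that column that are identical, ignoring gap characters. The toy alignment is made up for illustration.

```python
from collections import Counter


def column_identity(column):
    """Fraction of identical residue pairs in one column, skipping gaps ('-')."""
    residues = [c for c in column if c != "-"]
    n = len(residues)
    if n < 2:
        return 0.0  # no pair to compare in this column
    matching = sum(k * (k - 1) // 2 for k in Counter(residues).values())
    return matching / (n * (n - 1) // 2)


def per_column_identity(alignment):
    """Apply column_identity to every column of equal-length aligned rows."""
    return [column_identity(col) for col in zip(*alignment)]


# Hypothetical toy alignment: four sequences, six columns.
alignment = ["MKVL-A",
             "MKIL-A",
             "MRVLGA",
             "MKVLGS"]
identities = per_column_identity(alignment)
print(identities)
```

A histogram of these values over all columns gives a picture comparable in spirit to the one above.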

In our article we also wonder about the effect of HGT under the hypothesis of multiple ancestry: what if we find one, and only one, gene that strongly supports independent origins? Even if all the others fit nicely into a single phylogeny, wouldn't it be evidence of an otherwise extinct lineage?

The publication

We all have dreadful stories about feedback from peer reviewers, but this manuscript was not such a case. All reviewers seemed to know the original work very well, and could make precise comments on what we were missing or mistaken about. The editor, David Liberles, also joined the discussion and gave us some good advice. So we thank them for being fortiter in re, suaviter in modo.

We are preparing another manuscript that is more centered on the examples given in D. Theobald's paper. Actually we started to write this "counter-examples" manuscript before our present work, but we had to reorganize it because: 1) we realized that it would be harder to understand without the current work; 2) D. Theobald recently published a reply that is relevant to our discussion, since it contains a response to a former version of our "counter-examples" manuscript. The present paper then became necessary to minimize future misunderstandings.

The scripts necessary to reconstruct the simulations and graphics used in our study are available at our home page: http://darwin.uvigo.es/common_origin/. Please let me know if you have any trouble running the scripts, or if you want some more information.

References

de Oliveira Martins, L., Posada, D. (2012). Proving universal common ancestry with similar sequences. Trends in Evolutionary Biology, 14 (1). DOI: 10.4081/eb.2012.e5
Theobald, D. (2010). A formal test of the theory of universal common ancestry. Nature, 465 (7295), 219-222. DOI: 10.1038/nature09014

(with thanks to Jonathan Eisen for noticing the article).

Tuesday, May 15, 2012