Accelerating maximum-likelihood searches using PAUP*

David A. Morrison

Department of Parasitology (SWEPAR)
National Veterinary Institute and
Swedish University of Agricultural Sciences
Uppsala, Sweden

8 January 2007 -- revised 3 January 2008


Background

I spent more than a year investigating ways of speeding up maximum-likelihood (ML) analysis of nucleotide sequence data for phylogenetic purposes. The results of this work are described in a manuscript published in Systematic Biology (DOI). However, many of the practical details were left out of that publication, so I summarize some of them here, in the belief that they might be of interest to others. The 20 data sets that I used have 14--158 sequences and 411--120,762 aligned nucleotides (ZIP file).

PAUP*

Part of this work involved looking at the computer program PAUP* version 4.0b10 (Swofford 2002), and here I summarize my conclusions relating to that program. ML analyses can be accelerated by up to several orders of magnitude when using this program, by careful use of some of the available options. Furthermore, by adopting a particular search strategy the chances of finding the ML tree are greatly increased. It then takes only days, rather than weeks or months, to analyse each of the 20 data sets thoroughly.

Use of PAUP* Options

(i) Use SPR rather than TBR

The default search strategy in PAUP* is to perform TBR (tree bisection-reconnection) searches. Having performed > 150 ML analyses, I have encountered only three in which a TBR search from a specified starting tree led to a different final tree than an SPR (subtree pruning-regrafting) search from the same starting tree.

In two of these cases a very poor tree was used as the starting tree, and the subsequent TBR search found a tree with a better ML score than did the SPR search. There was also one instance where the SPR search found a better tree than did the TBR search. This counter-intuitive result occurs because the TBR search examines the trees in a different order to the SPR search, so that the two searches can potentially move to different sets of trees.

In all other cases, the TBR and SPR searches produced the same result. That is, for all of my data sets the extra swapping done by the TBR search was unnecessary. Thus, a TBR search may be unproductive for ML analyses, with an SPR search being sufficient. For my data sets, the TBR searches usually took 4-5 times as long as the SPR searches (Allen and Steel 2001 provide the appropriate formulae for the minimum number of trees needing to be searched), making the SPR strategy considerably more efficient for arriving at the same result. Use of only SPR branch swapping thus seems to be a viable option for considerably speeding up ML analysis of nucleotide data.
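
For scale, Allen and Steel (2001) show that an unrooted binary tree with n taxa has 2(n-3)(2n-7) distinct SPR rearrangements (and only 2(n-3) NNI rearrangements), whereas the TBR neighbourhood is larger still and its size depends on the shape of the tree. In PAUP* the swapping algorithm is chosen with the Swap option of the HSearch command. The following minimal fragment is illustrative only (it assumes that a data set and a starting tree are already in memory); it requests SPR-only swapping under the likelihood criterion:

    begin paup;
      set criterion=likelihood;
      hsearch start=current swap=spr;   [ swap=spr instead of the default swap=tbr;
                                          start=current continues from the tree(s) in memory ]
    end;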

(ii) Adjust ApproxLim

The next most effective option to change is the ApproxLim option. PAUP* performs an initial rough calculation of the ML value for each tree topology, and then proceeds with the full optimization only if the resulting score is close to the current known optimum ML value (see Rogers and Swofford 1998). The ApproxLim parameter determines how the concept of "close" is defined. The default value is ApproxLim=5, which indicates that approximate scores within 5% of the optimum will be evaluated fully.

Reducing this value can have a dramatic effect on the efficiency of tree searches, especially for large data sets. That is, there can be tens of thousands of trees within 5% of the optimum ML score if the score is large (for example, 5% of a score of 62,000 is 3,100 log-likelihood units), as it will be when there are lots of taxa or characters, and reducing the ApproxLim value reduces the number of trees that need to be evaluated fully. For example, for one of my data sets (94 sequences, 10,922 aligned positions), an SPR search with ApproxLim=2 assessed 38,817 topologies in 13.5 days on a 1.8 GHz iMac G5, whereas a TBR search with ApproxLim=1 assessed 150,650 topologies in only 3.75 days, and both arrived at exactly the same final tree. That is roughly a 14-fold increase in the rate at which topologies were evaluated (about 40,000 per day versus about 2,900 per day).

Adjustment of the ApproxLim parameter must be done carefully, however. When decreasing the ApproxLim value, the following three boundaries are successively reached:
(i) a value above which the analysis follows the same search path as before, but below which a different search path leads to the same final tree (i.e. some trees that actually have a better ML score than the current optimum are not evaluated fully, so the path of accepted trees changes even though the final tree does not);
(ii) a value below which the search path leads to a different tree (it may be a tree with a better ML value, but it will usually be worse);
(iii) a value below which the analysis "sticks" on or near the starting tree (i.e. few or no trees are accepted as having a better ML score than the starting tree).

The "knack" of adjusting the ApproxLim value is thus to find boundary (i), as this will result in the fastest search without compromising the quality of the analysis solution. There is no simple way to do this, but it is worthwhile to spend time on it (several hours or even days, depending on the speed of your computer). The simplest approach is trial-and-error. That is, start a series of analyses each with a different but reduced ApproxLim value, while recording the successive ML values of the search in the log file (the PAUP* default is to record every minute). It usually does not take more than a few minutes of analysis to realize when boundary (iii) has been reached, but it can take several thousand trees to distinguish between boundaries (i) and (ii). It is important to note that, when using the ratchet discussed below, the adjustment must take into account the reweighted searches as well as the unweighted searches -- investigating just one type of search can lead to over-reducing the value. My advice is thus to be conservative about reducing the value, since you will be evaluating only the early part of the tree search and the deviation in search path may occur much later (and thus be unobserved).

I have found no data sets for which boundary (i) is > 2%, so this seems like a good default value; nor have I found any for which it is < 0.5%. However, for a number of large data sets a difference of as little as 0.01 in the value made a big difference to the speed of the analysis (e.g. ApproxLim=0.79 might give a faster search than ApproxLim=0.80). Unfortunately, I have found no guidelines for predicting what the best ApproxLim value will be for any particular data set.

It is also important to note the existence of the AdjustAppLim option. PAUP* uses an (as yet) undocumented method to estimate a "suitable" value for ApproxLim, which may in fact be > 5%. If you want to override this behaviour, as I usually do, then you need to set AdjustAppLim=no.
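
To make the trial-and-error procedure more concrete, the fragment below sketches one way of setting up a series of test searches. It is illustrative only: the log file name and the candidate ApproxLim values are arbitrary, and the placement of the ApproxLim and AdjustAppLim settings (shown here as LSet options) should be checked against the command reference for your copy of PAUP* 4.0b10.

    begin paup;
      log start file=approxlim_trials.log replace;
      set criterion=likelihood;
      lset approxlim=2 adjustapplim=no;   [ candidate value of 2%; AdjustAppLim=no stops PAUP*
                                            from substituting its own estimate of a "suitable" value ]
      hsearch swap=spr start=current;     [ watch the successive ML values in the log -- abandon the
                                            run quickly if the search "sticks" near the starting tree ]
      lset approxlim=1;
      hsearch swap=spr start=current;
      lset approxlim=0.5;
      hsearch swap=spr start=current;
      log stop;
    end;

In practice I run each candidate value as a separate analysis, so that they can all be started from the same tree and abandoned independently, rather than in sequence as shown here.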

Search Strategy

(i) Ratchet

A single search is usually insufficient for finding the optimal tree or set of optimal trees. The default search strategy in PAUP* is to perform a series of searches from random-addition-sequence starting trees. However, this has been shown to be inefficient for parsimony analyses, with a ratchet-based strategy being much more effective (Nixon 1999). In the latter type of analysis, the current optimal tree is used as the starting point for a search with re-weighted characters, and the tree from that analysis is then used as the next starting tree. Thus, near-optimal trees are used as the starting trees for each search, which increases the chances of finding the set of optimal trees.

This strategy has apparently never been applied to ML analyses in its original form. Neither of the strategies labelled a "likelihood ratchet" by Clement et al. (1999) and Vos (2003) actually matches the strategy described by Nixon (1999); indeed, Vos (2003) abandoned the ratchet part of the strategy completely. I have found that a ratchet-based strategy is much more effective at finding the ML tree than using multiple independent starting trees.

I have spent some time trying to find a combination of search options that allows the ratchet approach to be used effectively for ML analyses with PAUP*. My current strategy (a sketch of one ratchet iteration is given after this list) is:
(i) use successive approximations (neighbor-joining tree, then NNI search, then SPR search) to estimate the initial tree and the values of the nucleotide-substitution model, which are then fixed for the remainder of the analysis (see Sullivan et al. 2005);
(ii) use only SPR branch swapping (see above);
(iii) re-weight 25% of the characters (varying this seems to have little effect);
(iv) use ten iterations of re-weighting (more iterations can be used, but ten is enough to find out whether there are many "tree islands");
(v) set the ApproxLim option to no more than 2% (see above);
(vi) set the MulTrees=no option for the re-weighted part of the searches (so that time is not wasted on unproductive searching).
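
The fragment below sketches how these pieces fit together for one iteration of the scheme, roughly in the style of the command file that PaupRat generates from the template mentioned in the next paragraph. It is a sketch only: the output file name is hypothetical, the re-weighting step is shown as a comment because PaupRat writes out the actual list of characters to be up-weighted, the fixed model values must be filled in from the successive-approximations step, and the option spellings should be checked against the PAUP* 4.0b10 command reference.

    begin paup;
      set criterion=likelihood;

      [ step (i): successive approximations for the starting tree and the model values ]
      nj;                                              [ neighbor-joining starting tree ]
      lset nst=6 rmatrix=estimate basefreq=estimate rates=gamma shape=estimate pinvar=estimate;
      lscores 1;                                       [ estimate the model parameters on the NJ tree ]
      [ re-issue LSet with the estimated values fixed, then search; repeat the estimate/fix/search
        cycle after the NNI search and again after the SPR search ]
      hsearch swap=nni start=current;
      hsearch swap=spr start=current;

      [ steps (ii)-(vi): one ratchet iteration -- repeat ten times ]
      [ PaupRat inserts a Weights command here, up-weighting a random 25% of the characters ]
      hsearch swap=spr start=current multrees=no;      [ perturbed search; MulTrees=no ]
      [ PaupRat then resets all character weights to 1 ]
      hsearch swap=spr start=current multrees=yes;     [ unweighted search from the perturbed tree ]
      savetrees file=ratchet_trees.nex;                [ keep the trees from each iteration ]
    end;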

I have made available for download (input4.txt) a fully commented template that can be used as the input file for the PaupRat computer program (Sikes and Lewis 2001). This program uses the template to create a file with the appropriate commands for PAUP*, which is then used to run the ML ratchet analysis. The extensive comments in the template file are intended to explain clearly how the ratchet works in relation to the PAUP* commands.

My experimentation shows that this ratchet strategy finds the best-known tree for all of my data sets (i.e. no other computer program finds a better tree). The time taken for the analysis depends on how much effort is put into adjusting the ApproxLim value. The longest ratchet analysis took 12 days on a 3 GHz Pentium 4 PC (plus the time needed to reduce the ApproxLim value to 0.8%). (Indeed, the longest analysis I have performed took 25 days at ApproxLim=1% but only 7 days at ApproxLim=0.7%, while still following the same search path; this was for 158 sequences with 2,644 aligned nucleotides.)

There are clearly a number of ways to increase the speed of the ratchet analysis even further, only some of which are currently implemented in PAUP*. For example, since it is not strictly necessary to find the optimal tree during the re-weighted part of each iteration, this search could be shortened by using the currently available PAUP* options to set a limit on: the search time (TimeLimit); the number of branch swaps to occur (RearrLimit); or the number of branches across which swaps can occur (ReconLimit).
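
For example (an illustrative fragment only, with arbitrary limits), the re-weighted search in each iteration could be capped with something like:

    hsearch swap=spr start=current multrees=no timelimit=3600 rearrlimit=100000 reconlimit=8;
        [ stop the perturbed search after the time limit or the given number of rearrangements,
          and restrict reconnections to nearby branches -- see the PAUP* command reference for
          the exact units and behaviour of these three options ]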

More importantly, the current ratchet search is inefficient in that it frequently repeats exactly the same tree search for the unweighted characters. For example, if a search based on re-weighted characters has not moved to a new set of trees or returns to a previous set, then all of the subsequent unweighted search in that iteration will simply be an unnecessary repeat of the previous search. This issue could be dealt with by intelligent book-keeping, such as is already used in PAUP* for multiple stepwise-addition searches based on random input order, where a search iteration is terminated once it reaches a set of trees that was encountered in a previous iteration. This could potentially save a large amount of time, as the search is then restricted to those parts of tree space that have not been searched before. Unfortunately, this option is currently not implemented in PAUP* in a way that can be applied to the ratchet.

(ii) Tree islands

It is important to emphasize the concept of "tree islands" in relation to ML analyses. An island is a set of trees that have similar topologies and near-optimal ML scores. This idea of islands has only recently been discussed for likelihood analyses (e.g. Salter 2001; Johnson et al. 2003; Vos 2003), although it has a long history for parsimony analyses (Maddison 1991; Page 1993; Charleston 1995). When there are multiple islands it will be inadequate to locate and report only a single tree, as has been done in the past, because we also need to consider whether the reported trees are on the same island (i.e. how different they are). This is particularly important if sequences from only one gene are being used, because my results show that there are likely to be many islands for such information-poor data sets. The ratchet strategy is specifically designed to locate multiple tree islands, and to detect multiple peaks on a single island.

It is worth emphasizing the possible existence (and importance) of trees that are only slightly suboptimal with regard to ML score, particularly if these are on different islands. Any thorough analysis of a data set must include all of these trees; simply reporting a single optimal tree (i.e. the highest peak from one island) would be very misleading as an ML estimate of the true phylogeny. The most extreme case that I have encountered has at least six SPR peaks that differ by < 0.815 log-likelihood units (the best tree has a score of -62,069.395); there are > 1,350 trees with scores within 1 log-likelihood unit of that of the ML tree. Reporting a single ML tree for this data set would be ludicrous.

References

Allen, B.L., and M. Steel. 2001. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 5: 1-15. (DOI)

Charleston, M.A. 1995. Toward a characterization of landscapes of combinatorial optimization problems, with special attention to the phylogeny problem. Journal of Computational Biology 2: 439-450.

Clement, M., Q. Snell, G. Judd, and M. Whiting. 1999. High performance phylogenetic inference. Pages 335-336 in: Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing. IEEE Computer Society, Washington DC. (DOI)

Johnson, R.N., P.-M. Agapow, and R.H. Crozier. 2003. A tree island approach to inferring phylogeny in the ant subfamily Formicinae, with especial reference to the evolution of weaving. Molecular Phylogenetics and Evolution 29: 317-330. (DOI)

Maddison, D.R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Systematic Zoology 40: 315-328.

Nixon, K.C. 1999. The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics 15: 407-414. (DOI)

Page, R.D.M. 1993. On islands of trees and the efficacy of different methods of branch swapping in finding most-parsimonious trees. Systematic Biology 42: 200-210.

Rogers, J.S., and D.L. Swofford. 1998. A fast method for approximating maximum likelihoods of phylogenetic trees from nucleotide sequences. Systematic Biology 47: 77-89. (DOI)

Salter, L.A. 2001. Complexity of the likelihood surface for a large DNA dataset. Systematic Biology 50: 970-978. (DOI)

Sikes, D.S., and P.O. Lewis. 2001. PAUPRat: PAUP* implementation of the parsimony ratchet. Beta software, version 1. Distributed by the authors. Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, U.S.A. June 2001. (Program)

Sullivan, J., Z. Abdo, P. Joyce, and D.L. Swofford. 2005. Evaluating the performance of a successive-approximations approach to parameter optimization in maximum-likelihood phylogeny estimation. Molecular Biology and Evolution 22: 1386-1392. (DOI)

Swofford, D.L. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer Associates, Sunderland MA. (Program)

Vos, R.A. 2003. Accelerated likelihood surface exploration: the likelihood ratchet. Systematic Biology 52: 368-373. (DOI)

