
Nick Goldman


goldman@ebi.ac.uk

Journal articles

2011
Carolin Kosiol, Nick Goldman (2011)  Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models.   J Mol Biol Jun  
Abstract: Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to influence of the genetic code at short timescales and dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models.
Clemens Lakner, Mark T Holder, Nick Goldman, Gavin J P Naylor (2011)  What's in a likelihood? Simple models of protein evolution and the contribution of structurally viable reconstructions to the likelihood.   Syst Biol 60: 2. 161-174 Mar  
Abstract: Most phylogenetic models of protein evolution assume that sites are independent and identically distributed. Interactions between sites are ignored, and the likelihood can be conveniently calculated as the product of the individual site likelihoods. The calculation considers all possible transition paths (also called substitution histories or mappings) that are consistent with the observed states at the terminals, and the probability density of any particular reconstruction depends on the substitution model. The likelihood is the integral of the probability density of each substitution history taken over all possible histories that are consistent with the observed data. We investigated the extent to which transition paths that are incompatible with a protein's three-dimensional structure contribute to the likelihood. Several empirical amino acid models were tested for sequence pairs of different degrees of divergence. When simulating substitutional histories starting from a real sequence, the structural integrity of the simulated sequences quickly disintegrated. This result indicates that simple models are clearly unable to capture the constraints on sequence evolution. However, when we sampled transition paths between real sequences from the posterior probability distribution according to these same models, we found that the sampled histories were largely consistent with the tertiary structure. This suggests that simple empirical substitution models may be adequate for interpolating changes between observed sequences during phylogenetic inference despite the fact that the models cannot predict the effects of structural constraints from first principles. This study is significant because it provides a quantitative assessment of the biological realism of substitution models from the perspective of protein structure, and it provides insight on the prospects for improving models of protein sequence evolution.
Stefan Washietl, Sven Findeiss, Stephan A Müller, Stefan Kalkhof, Martin von Bergen, Ivo L Hofacker, Peter F Stadler, Nick Goldman (2011)  RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data.   RNA 17: 4. 578-594 Apr
Abstract: With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task. Here we present RNAcode, a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. Our algorithm combines information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied "out of the box," without any training, to data from all domains of life. We describe the RNAcode method and apply it in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in Escherichia coli and to analyze the coding potential of RNAs previously annotated as "noncoding." RNAcode is open source software and available for all major platforms at http://wash.github.com/rnacode.
Botond Sipos, Tim Massingham, Gregory E Jordan, Nick Goldman (2011)  PhyloSim - Monte Carlo simulation of sequence evolution in the R statistical computing environment.   BMC Bioinformatics 12: 04  
Abstract: The Monte Carlo simulation of sequence evolution is routinely used to assess the performance of phylogenetic inference methods and sequence alignment algorithms. Progress in the field of molecular evolution fuels the need for more realistic and hence more complex simulations, adapted to particular situations, yet current software makes unreasonable assumptions such as homogeneous substitution dynamics or a uniform distribution of indels across the simulated sequences. This calls for an extensible simulation framework written in a high-level functional language, offering new functionality and making it easy to incorporate further complexity.
2010
Ari Löytynoja, Nick Goldman (2010)  webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser.   BMC Bioinformatics 11: 11  
Abstract: Phylogeny-aware progressive alignment has been found to perform well in phylogenetic alignment benchmarks and to produce superior alignments for the inference of selection on codon sequences. Its implementation in the PRANK alignment program package also allows modelling of complex evolutionary processes and inference of posterior probabilities for sequence sites evolving under each distinct scenario, either simultaneously with the alignment of sequences or as a post-processing step for an existing alignment. This has led to software with many advanced features, and users may find it difficult to generate optimal alignments, visualise the full information in their alignment results, or post-process these results, e.g. by objectively selecting subsets of alignment sites.
2009
Benny Chor, David Horn, Nick Goldman, Yaron Levy, Tim Massingham (2009)  Genomic DNA k-mer spectra: models and modalities.   Genome Biol 10: 10. 10  
Abstract: The empirical frequencies of DNA k-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of k-mers with non-trivial lengths is now possible.
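As a rough illustration of what a k-mer spectrum is, the sketch below counts overlapping k-mers in a toy sequence and then tabulates how many distinct k-mers occur once, twice, and so on. The function names `kmer_counts` and `kmer_spectrum` and the toy sequence are assumptions made for this example; they are not taken from the paper, and the paper's statistical models are not reproduced here.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_spectrum(seq: str, k: int) -> dict:
    """Map each occurrence count n to the number of distinct k-mers seen n times."""
    return dict(Counter(kmer_counts(seq, k).values()))


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # hypothetical toy sequence, not real genomic data
    print(kmer_counts(toy, 2))
    print(kmer_spectrum(toy, 2))
```

For a real genome the same idea applies, but counting would normally be streamed over the sequence rather than held in a single string.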
2008
Ari Löytynoja, Nick Goldman (2008)  Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis.   Science 320: 5883. 1632-1635 Jun  
Abstract: Genetic sequence alignment is the basis of many evolutionary and comparative studies, and errors in alignments lead to errors in the interpretation of evolutionary information in genomes. Traditional multiple sequence alignment methods disregard the phylogenetic implications of gap patterns that they create and infer systematically biased alignments with excess deletions and substitutions, too few insertions, and implausible insertion-deletion-event histories. We present a method that prevents these systematic errors by recognizing insertions and deletions as distinct evolutionary events. We show theoretically and practically that this improves the quality of sequence alignments and downstream analyses over a wide range of realistic alignment problems. These results suggest that insertions and sequence turnover are more common than is currently thought and challenge the conventional picture of sequence evolution and mechanisms of functional and structural changes.
Michael L Tress, Jan-Jaap Wesselink, Adam Frankish, Gonzalo López, Nick Goldman, Ari Löytynoja, Tim Massingham, Fabio Pardi, Simon Whelan, Jennifer Harrow, Alfonso Valencia (2008)  Determination and validation of principal gene products.   Bioinformatics 24: 1. 11-17 Jan
Abstract: MOTIVATION: Alternative splicing has the potential to generate a wide range of protein isoforms. For many computational applications and for experimental research, it is important to be able to concentrate on the isoform that retains the core biological function. For many genes this is far from clear. RESULTS: We have combined five methods into a pipeline that allows us to detect the principal variant for a gene. Most of the methods were based on conservation between species, at the level of both gene and protein. The five methods used were the conservation of exonic structure, the detection of non-neutral evolution, the conservation of functional residues, the existence of a known protein structure and the abundance of vertebrate orthologues. The pipeline was able to determine a principal isoform for 83% of a set of well-annotated genes with multiple variants.
Ari Löytynoja, Nick Goldman (2008)  A model of evolution and structure for multiple sequence alignment.   Philos Trans R Soc Lond B Biol Sci 363: 1512. 3913-3919 Dec  
Abstract: We have developed a phylogeny-aware progressive alignment method that recognizes insertions and deletions as distinct evolutionary events and thus avoids systematic errors created by traditional alignment methods. We now extend this method to simultaneously model regional heterogeneity and evolution. This novel method can be flexibly adapted to alignment of nucleotide or amino acid sequences evolving under processes that vary over genomic regions and, being fully probabilistic, provides an estimate of regional heterogeneity of the evolutionary process along the alignment and a measure of local reliability of the solution. Furthermore, the evolutionary modelling of substitution process permits adjusting the sensitivity and specificity of the alignment and, if high specificity is aimed at, leaving sequences unaligned when their divergence is beyond a meaningful detection of homology.
Stefan Washietl, Rainer Machné, Nick Goldman (2008)  Evolutionary footprints of nucleosome positions in yeast.   Trends Genet 24: 12. 583-587 Dec
Abstract: Using genome-wide maps of nucleosome positions in yeast, we have analyzed the influence of chromatin structure on the molecular evolution of genomic DNA. We have observed, on average, 10-15% lower substitution rates in linker regions than in nucleosomal DNA. This widespread local rate heterogeneity represents an evolutionary footprint of nucleosome positions and reveals that nucleosome organization is a genomic feature conserved over evolutionary timescales.
2007
Carolin Kosiol, Ian Holmes, Nick Goldman (2007)  An empirical codon model for protein sequence evolution.   Mol Biol Evol 24: 7. 1464-1479 Jul  
Abstract: In the past, 2 kinds of Markov models have been considered to describe protein sequence evolution. Codon-level models have been mechanistic with a small number of parameters designed to take into account features, such as transition-transversion bias, codon frequency bias, and synonymous-nonsynonymous amino acid substitution bias. Amino acid models have been empirical, attempting to summarize the replacement patterns observed in large quantities of data and not explicitly considering the distinct factors that shape protein evolution. We have estimated the first empirical codon model (ECM). Previous codon models assume that protein evolution proceeds only by successive single nucleotide substitutions, but our results indicate that model accuracy is significantly improved by incorporating instantaneous doublet and triplet changes. We also find that the affiliations between codons, the amino acid each encodes and the physicochemical properties of the amino acids are main factors driving the process of codon evolution. Neither multiple nucleotide changes nor the strong influence of the genetic code nor amino acids' physicochemical properties form a part of standard mechanistic models and their views of how codon evolution proceeds. We have implemented the ECM for likelihood-based phylogenetic analysis, and an assessment of its ability to describe protein evolution shows that it consistently outperforms comparable mechanistic codon models. We point out the biological interpretation of our ECM and possible consequences for studies of selection.
Fabio Pardi, Nick Goldman (2007)  Resource-aware taxon selection for maximizing phylogenetic diversity.   Syst Biol 56: 3. 431-444 Jun  
Abstract: Phylogenetic diversity (PD) is a useful metric for selecting taxa in a range of biological applications, for example, bioconservation and genomics, where the selection is usually constrained by the limited availability of resources. We formalize taxon selection as a conceptually simple optimization problem, aiming to maximize PD subject to resource constraints. This allows us to take into account the different amounts of resources required by the different taxa. Although this is a computationally difficult problem, we present a dynamic programming algorithm that solves it in pseudo-polynomial time. Our algorithm can also solve many instances of the Noah's Ark Problem, a more realistic formulation of taxon selection for biodiversity conservation that allows for taxon-specific extinction risks. These instances extend the set of problems for which solutions are available beyond previously known greedy-tractable cases. Finally, we discuss the relevance of our results to real-life scenarios.
Lee Bofkin, Nick Goldman (2007)  Variation in evolutionary processes at different codon positions.   Mol Biol Evol 24: 2. 513-521 Feb  
Abstract: Evolutionary studies commonly model single nucleotide substitutions and assume that they occur as independent draws from a unique probability distribution across the sequence studied. This assumption is violated for protein-coding sequences, and we consider modeling approaches where codon positions (CPs) are treated as separate categories of sites because within each category the assumption is more reasonable. Such "codon-position" models have been shown to explain the evolution of codon data better than homogeneous models in previous studies. This paper examines the ways in which codon-position models outperform homogeneous models and characterizes the differences in estimates of model parameters across CPs. Using the PANDIT database of multiple species DNA sequence alignments, we quantify the differences in the evolutionary processes at the 3 CPs in a systematic and comprehensive manner, characterizing previously undescribed features of protein evolution. We relate our findings to the functional constraints imposed by the genetic code, protein function, and the types of mutation that cause synonymous and nonsynonymous codon changes. The results increase our understanding of selective constraints and could be incorporated into phylogenetic analyses or gene-finding techniques in the future. The methods used are extended to an overlapping reading frame data set, and we discover that overlapping reading frames do not necessarily cause more stringent evolutionary constraints.
T Massingham, N Goldman (2007)  Statistics of the log-det estimator.   Mol Biol Evol 24: 10. 2277-2285 Oct  
Abstract: The log-det estimator is a measure of divergence (evolutionary distance) between sequences of biological characters, DNA or amino acids, for example, and has been shown to be robust to biases in composition that can cause problems for other estimators. We provide a statistical framework to construct high-accuracy confidence intervals for log-det estimates and compare the efficiency of the estimator to that of maximum likelihood using time-reversible Markov models. The log-det estimator is found to have good statistical properties under such general models.
Elliott H Margulies, Gregory M Cooper, George Asimenos, Daryl J Thomas, Colin N Dewey, Adam Siepel, Ewan Birney, Damian Keefe, Ariel S Schwartz, Minmei Hou, James Taylor, Sergey Nikolaev, Juan I Montoya-Burgos, Ari Löytynoja, Simon Whelan, Fabio Pardi, Tim Massingham, James B Brown, Peter Bickel, Ian Holmes, James C Mullikin, Abel Ureta-Vidal, Benedict Paten, Eric A Stone, Kate R Rosenbloom, W James Kent, Gerard G Bouffard, Xiaobin Guan, Nancy F Hansen, Jacquelyn R Idol, Valerie V B Maduro, Baishali Maskeri, Jennifer C McDowell, Morgan Park, Pamela J Thomas, Alice C Young, Robert W Blakesley, Donna M Muzny, Erica Sodergren, David A Wheeler, Kim C Worley, Huaiyang Jiang, George M Weinstock, Richard A Gibbs, Tina Graves, Robert Fulton, Elaine R Mardis, Richard K Wilson, Michele Clamp, James Cuff, Sante Gnerre, David B Jaffe, Jean L Chang, Kerstin Lindblad-Toh, Eric S Lander, Angie Hinrichs, Heather Trumbower, Hiram Clawson, Ann Zweig, Robert M Kuhn, Galt Barber, Rachel Harte, Donna Karolchik, Matthew A Field, Richard A Moore, Carrie A Matthewson, Jacqueline E Schein, Marco A Marra, Stylianos E Antonarakis, Serafim Batzoglou, Nick Goldman, Ross Hardison, David Haussler, Webb Miller, Lior Pachter, Eric D Green, Arend Sidow (2007)  Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome.   Genome Res 17: 6. 760-774 Jun  
Abstract: A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
Koen Geuten, Tim Massingham, Paul Darius, Erik Smets, Nick Goldman (2007)  Experimental design criteria in phylogenetics: where to add taxa.   Syst Biol 56: 4. 609-622 Aug  
Abstract: Accurate phylogenetic inference is a topic of intensive research and debate and has been studied in response to many different factors: for example, differences in the method of reconstruction, the shape of the underlying tree, the substitution model, and varying quantities and types of data. Investigating whether the conditions used might lead to inaccurate inference has been attempted through elaborate data exploration but less attention has been given to creating a unified methodology to enable experimental designs in phylogenetic analysis to be improved and so avoid suboptimal conditions. Experimental design has been part of the field of statistics since the seminal work of Fisher in the early 20th century and a large body of literature exists on how to design optimum experiments. Here we investigate the use of the Fisher information matrix to decide between candidate positions for adding a taxon to a fixed topology, and introduce a parameter transformation that permits comparison of these different designs. This extension to Goldman (1998. Proc. R. Soc. Lond. B. 265: 1779-1786) thus allows investigation of "where to add taxa" in a phylogeny. We compare three different measures of the total information for selecting the position to add a taxon to a tree. Our methods are illustrated by investigating the behavior of the three criteria when adding a branch to model trees, and by applying the different criteria to two biological examples: a simplified taxon-sampling problem in the balsaminoid Ericales and the phylogeny of seed plants.
Ewan Birney, John A Stamatoyannopoulos, Anindya Dutta, Roderic Guigó, Thomas R Gingeras, Elliott H Margulies, Zhiping Weng, Michael Snyder, Emmanouil T Dermitzakis, Robert E Thurman, Michael S Kuehn, Christopher M Taylor, Shane Neph, Christoph M Koch, Saurabh Asthana, Ankit Malhotra, Ivan Adzhubei, Jason A Greenbaum, Robert M Andrews, Paul Flicek, Patrick J Boyle, Hua Cao, Nigel P Carter, Gayle K Clelland, Sean Davis, Nathan Day, Pawandeep Dhami, Shane C Dillon, Michael O Dorschner, Heike Fiegler, Paul G Giresi, Jeff Goldy, Michael Hawrylycz, Andrew Haydock, Richard Humbert, Keith D James, Brett E Johnson, Ericka M Johnson, Tristan T Frum, Elizabeth R Rosenzweig, Neerja Karnani, Kirsten Lee, Gregory C Lefebvre, Patrick A Navas, Fidencio Neri, Stephen C J Parker, Peter J Sabo, Richard Sandstrom, Anthony Shafer, David Vetrie, Molly Weaver, Sarah Wilcox, Man Yu, Francis S Collins, Job Dekker, Jason D Lieb, Thomas D Tullius, Gregory E Crawford, Shamil Sunyaev, William S Noble, Ian Dunham, France Denoeud, Alexandre Reymond, Philipp Kapranov, Joel Rozowsky, Deyou Zheng, Robert Castelo, Adam Frankish, Jennifer Harrow, Srinka Ghosh, Albin Sandelin, Ivo L Hofacker, Robert Baertsch, Damian Keefe, Sujit Dike, Jill Cheng, Heather A Hirsch, Edward A Sekinger, Julien Lagarde, Josep F Abril, Atif Shahab, Christoph Flamm, Claudia Fried, Jörg Hackermüller, Jana Hertel, Manja Lindemeyer, Kristin Missal, Andrea Tanzer, Stefan Washietl, Jan Korbel, Olof Emanuelsson, Jakob S Pedersen, Nancy Holroyd, Ruth Taylor, David Swarbreck, Nicholas Matthews, Mark C Dickson, Daryl J Thomas, Matthew T Weirauch, James Gilbert, Jorg Drenkow, Ian Bell, XiaoDong Zhao, K G Srinivasan, Wing-Kin Sung, Hong Sain Ooi, Kuo Ping Chiu, Sylvain Foissac, Tyler Alioto, Michael Brent, Lior Pachter, Michael L Tress, Alfonso Valencia, Siew Woh Choo, Chiou Yu Choo, Catherine Ucla, Caroline Manzano, Carine Wyss, Evelyn Cheung, Taane G Clark, James B Brown, Madhavan Ganesh, Sandeep Patel, Hari Tammana, Jacqueline Chrast, Charlotte N Henrichsen, Chikatoshi Kai, Jun Kawai, Ugrappa Nagalakshmi, Jiaqian Wu, Zheng Lian, Jin Lian, Peter Newburger, Xueqing Zhang, Peter Bickel, John S Mattick, Piero Carninci, Yoshihide Hayashizaki, Sherman Weissman, Tim Hubbard, Richard M Myers, Jane Rogers, Peter F Stadler, Todd M Lowe, Chia-Lin Wei, Yijun Ruan, Kevin Struhl, Mark Gerstein, Stylianos E Antonarakis, Yutao Fu, Eric D Green, Ulaş Karaöz, Adam Siepel, James Taylor, Laura A Liefer, Kris A Wetterstrand, Peter J Good, Elise A Feingold, Mark S Guyer, Gregory M Cooper, George Asimenos, Colin N Dewey, Minmei Hou, Sergey Nikolaev, Juan I Montoya-Burgos, Ari Löytynoja, Simon Whelan, Fabio Pardi, Tim Massingham, Haiyan Huang, Nancy R Zhang, Ian Holmes, James C Mullikin, Abel Ureta-Vidal, Benedict Paten, Michael Seringhaus, Deanna Church, Kate Rosenbloom, W James Kent, Eric A Stone, Serafim Batzoglou, Nick Goldman, Ross C Hardison, David Haussler, Webb Miller, Arend Sidow, Nathan D Trinklein, Zhengdong D Zhang, Leah Barrera, Rhona Stuart, David C King, Adam Ameur, Stefan Enroth, Mark C Bieda, Jonghwan Kim, Akshay A Bhinge, Nan Jiang, Jun Liu, Fei Yao, Vinsensius B Vega, Charlie W H Lee, Patrick Ng, Annie Yang, Zarmik Moqtaderi, Zhou Zhu, Xiaoqin Xu, Sharon Squazzo, Matthew J Oberley, David Inman, Michael A Singer, Todd A Richmond, Kyle J Munn, Alvaro Rada-Iglesias, Ola Wallerman, Jan Komorowski, Joanna C Fowler, Phillippe Couttet, Alexander W Bruce, Oliver M Dovey, Peter D Ellis, Cordelia F Langford, David A Nix, Ghia Euskirchen, Stephen Hartman, Alexander E Urban, Peter Kraus, Sara Van Calcar, Nate Heintzman, Tae Hoon Kim, Kun Wang, Chunxu Qu, Gary Hon, Rosa Luna, Christopher K Glass, M Geoff Rosenfeld, Shelley Force Aldred, Sara J Cooper, Anason Halees, Jane M Lin, Hennady P Shulha, Xiaoling Zhang, Mousheng Xu, Jaafar N S Haidar, Yong Yu, Vishwanath R Iyer, Roland D Green, Claes Wadelius, Peggy J Farnham, Bing Ren, Rachel A Harte, Angie S Hinrichs, Heather Trumbower, Hiram Clawson, Jennifer Hillman-Jackson, Ann S Zweig, Kayla Smith, Archana Thakkapallayil, Galt Barber, Robert M Kuhn, Donna Karolchik, Lluis Armengol, Christine P Bird, Paul I W de Bakker, Andrew D Kern, Nuria Lopez-Bigas, Joel D Martin, Barbara E Stranger, Abigail Woodroffe, Eugene Davydov, Antigone Dimas, Eduardo Eyras, Ingileif B Hallgrímsdóttir, Julian Huppert, Michael C Zody, Gonçalo R Abecasis, Xavier Estivill, Gerard G Bouffard, Xiaobin Guan, Nancy F Hansen, Jacquelyn R Idol, Valerie V B Maduro, Baishali Maskeri, Jennifer C McDowell, Morgan Park, Pamela J Thomas, Alice C Young, Robert W Blakesley, Donna M Muzny, Erica Sodergren, David A Wheeler, Kim C Worley, Huaiyang Jiang, George M Weinstock, Richard A Gibbs, Tina Graves, Robert Fulton, Elaine R Mardis, Richard K Wilson, Michele Clamp, James Cuff, Sante Gnerre, David B Jaffe, Jean L Chang, Kerstin Lindblad-Toh, Eric S Lander, Maxim Koriabine, Mikhail Nefedov, Kazutoyo Osoegawa, Yuko Yoshinaga, Baoli Zhu, Pieter J de Jong (2007)  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.   Nature 447: 7146. 799-816 Jun
Abstract: We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
2006
Simon Whelan, Paul I W de Bakker, Emmanuel Quevillon, Nicolas Rodriguez, Nick Goldman (2006)  PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees.   Nucleic Acids Res 34: Database issue. D327-D331 Jan  
Abstract: PANDIT is a database of homologous sequence alignments accompanied by estimates of their corresponding phylogenetic trees. It provides a valuable resource to those studying phylogenetic methodology and the evolution of coding-DNA and protein sequences. Currently in version 17.0, PANDIT comprises 7738 families of homologous protein domains; for each family, DNA and corresponding amino acid sequence multiple alignments are available together with high quality phylogenetic tree estimates. Recent improvements include expanded methods for phylogenetic tree inference, assessment of alignment quality and a redesigned web interface, available at the URL http://www.ebi.ac.uk/goldman-srv/pandit.
Peter S Klosterman, Andrew V Uzilov, Yuri R Bendaña, Robert K Bradley, Sharon Chao, Carolin Kosiol, Nick Goldman, Ian Holmes (2006)  XRate: a fast prototyping, training and annotation tool for phylo-grammars.   BMC Bioinformatics 7: 10  
Abstract: Recent years have seen the emergence of genome annotation methods based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. Previously, phylo-grammars have required considerable effort to implement, limiting their adoption by computational biologists.
2005
Carolin Kosiol, Nick Goldman (2005)  Different versions of the Dayhoff rate matrix.   Mol Biol Evol 22: 2. 193-199 Feb  
Abstract: Many phylogenetic inference methods are based on Markov models of sequence evolution. These are usually expressed in terms of a matrix (Q) of instantaneous rates of change but some models of amino acid replacement, most notably the PAM model of Dayhoff and colleagues, were originally published only in terms of time-dependent probability matrices (P(t)). Previously published methods for deriving Q have used eigen-decomposition of an approximation to P(t). We show that the commonly used value of t is too large to ensure convergence of the estimates of elements of Q. We describe two simpler alternative methods for deriving Q from information such as that published by Dayhoff and colleagues. Neither of these methods requires approximation or eigen-decomposition. We identify the methods used to derive various different versions of the Dayhoff model in current software, perform a comparison of existing and new implementations, and, to facilitate agreement among scientists using supposedly identical models, recommend that one of the new methods be used as a standard.
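The relationship between the two matrices in this abstract is P(t) = exp(Qt). The sketch below is only a toy illustration of why the choice of t matters when working backwards from P(t) to Q. It uses a hypothetical 2-state rate matrix (not the Dayhoff data), a truncated Taylor series for the matrix exponential, and a naive first-order estimate Q ≈ (P(t) − I)/t. That estimate is not one of the methods described or recommended in the paper; it is used here only because its error visibly grows with t.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=60):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds (Qt)^m / m!, starting at m = 0
    for m in range(1, terms):
        term = [[x / m for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def naive_rate_estimate(p, t):
    """First-order estimate Q ~ (P(t) - I) / t; only accurate for small t."""
    n = len(p)
    return [[(p[i][j] - (1.0 if i == j else 0.0)) / t for j in range(n)]
            for i in range(n)]


def max_abs_error(a, b):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical 2-state rate matrix; each row sums to zero.
Q = [[-0.3, 0.3],
     [0.1, -0.1]]

if __name__ == "__main__":
    for t in (0.01, 0.1, 1.0):
        p = mat_exp(Q, t)
        err = max_abs_error(naive_rate_estimate(p, t), Q)
        print(f"t = {t}: max error in recovered Q = {err:.5f}")
```

In this toy case the recovered rates drift further from Q as t increases, which is the general kind of sensitivity to t that the abstract refers to.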
Tim Massingham, Nick Goldman (2005)  Detecting amino acid sites under positive selection and purifying selection.   Genetics 169: 3. 1753-1762 Mar  
Abstract: An excess of nonsynonymous over synonymous substitution at individual amino acid sites is an important indicator that positive selection has affected the evolution of a protein between the extant sequences under study and their most recent common ancestor. Several methods exist to detect the presence, and sometimes location, of positively selected sites in alignments of protein-coding sequences. This article describes the "sitewise likelihood-ratio" (SLR) method for detecting nonneutral evolution, a statistical test that can identify sites that are unusually conserved as well as those that are unusually variable. We show that the SLR method can be more powerful than currently published methods for detecting the location of positive selection, especially in difficult cases where the strength of selection is low. The increase in power is achieved while relaxing assumptions about how the strength of selection varies over sites and without elevated rates of false-positive results that have been reported with some other methods. We also show that the SLR method performs well even under circumstances where the results from some previous methods can be misleading.
Notes:
Fabio Pardi, Nick Goldman (2005)  Species choice for comparative genomics: being greedy works.   PLoS Genet 1: 6. Dec  
Abstract: Several projects investigating genetic function and evolution through sequencing and comparison of multiple genomes are now underway. These projects consume many resources, and appropriate planning should be devoted to choosing which species to sequence, potentially involving cooperation among different sequencing centres. A widely discussed criterion for species choice is the maximisation of evolutionary divergence. Our mathematical formalization of this problem surprisingly shows that the best long-term cooperative strategy coincides with the seemingly short-term "greedy" strategy of always choosing the next best single species. Other criteria influencing species choice, such as medical relevance or sequencing costs, can also be accommodated in our approach, suggesting our results' broad relevance in scientific policy decisions.
Notes:
Ari Löytynoja, Nick Goldman (2005)  An algorithm for progressive multiple alignment of sequences with insertions.   Proc Natl Acad Sci U S A 102: 30. 10557-10562 Jul  
Abstract: Dynamic programming algorithms guarantee to find the optimal alignment between two sequences. For more than a few sequences, exact algorithms become computationally impractical, and progressive algorithms iterating pairwise alignments are widely used. These heuristic methods have a serious drawback because pairwise algorithms do not differentiate insertions from deletions and end up penalizing single insertion events multiple times. Such an unrealistically high penalty for insertions typically results in overmatching of sequences and an underestimation of the number of insertion events. We describe a modification of the traditional alignment algorithm that can distinguish insertion from deletion and avoid repeated penalization of insertions and illustrate this method with a pair hidden Markov model that uses an evolutionary scoring function. In comparison with a traditional progressive alignment method, our algorithm infers a greater number of insertion events and creates gaps that are phylogenetically consistent but spatially less concentrated. Our results suggest that some insertion/deletion "hot spots" may actually be artifacts of traditional alignment algorithms.
Notes:
2004
Carolin Kosiol, Nick Goldman, Nigel H Buttimore (2004)  A new criterion and method for amino acid classification.   J Theor Biol 228: 1. 97-106 May  
Abstract: It is accepted that many evolutionary changes of amino acid sequence in proteins are conservative: the replacement of one amino acid by another residue has a far greater chance of being accepted if the two residues have similar properties. It is difficult, however, to identify relevant physicochemical properties that capture this similarity. In this paper we introduce a criterion that determines similarity from an evolutionary point of view. Our criterion is based on the description of protein evolution by a Markov process and the corresponding matrix of instantaneous replacement rates. It is inspired by the conductance, a quantity that reflects the strength of mixing in a Markov process. Furthermore we introduce a method to divide the 20 amino acid residues into subsets that achieve good scores with our criterion. The criterion has the time-invariance property that different time distances of the same amino acid replacement rate matrix lead to the same grouping; but different rate matrices lead to different groupings. Therefore it can be used as an automated method to compare matrices derived from consideration of different types of proteins, or from parts of proteins sharing different structural or functional features. We present the groupings resulting from two standard matrices used in sequence alignment and phylogenetic tree estimation.
Notes:
Simon Whelan, Nick Goldman (2004)  Estimating the frequency of events that cause multiple-nucleotide changes.   Genetics 167: 4. 2027-2043 Aug  
Abstract: Existing mathematical models of DNA sequence evolution assume that all substitutions derive from point mutations. There is, however, increasing evidence that larger-scale events, involving two or more consecutive sites, may also be important. We describe a model, denoted SDT, that allows for single-nucleotide, doublet, and triplet mutations. Applied to protein-coding DNA, the SDT model allows doublet and triplet mutations to overlap codon boundaries but still permits data to be analyzed using the simplifying assumption of independence of sites. We have implemented the SDT model for maximum-likelihood phylogenetic inference and have applied it to an alignment of mammalian globin sequences and to 258 other protein-coding sequence alignments from the Pandit database. We find the SDT model's inclusion of doublet and triplet mutations to be overwhelmingly successful in giving statistically significant improvements in fit of model to data, indicating that larger-scale mutation events do occur. Distributions of inferred parameter values over all alignments analyzed suggest that these events are far more prevalent than previously thought. Detailed consideration of our results and the absence of any known mechanism causing three adjacent nucleotides to be substituted simultaneously, however, leads us to suggest that the actual evolutionary events occurring may include still-larger-scale events, such as gene conversion, inversion, or recombination, or a series of rapid compensatory changes.
Notes:
Wendy S W Wong, Ziheng Yang, Nick Goldman, Rasmus Nielsen (2004)  Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites.   Genetics 168: 2. 1041-1051 Oct  
Abstract: The parsimony method of Suzuki and Gojobori (1999) and the maximum likelihood method developed from the work of Nielsen and Yang (1998) are two widely used methods for detecting positive selection in homologous protein coding sequences. Both methods consider an excess of nonsynonymous (replacement) substitutions as evidence for positive selection. Previously published simulation studies comparing the performance of the two methods show contradictory results. Here we conduct a more thorough simulation study to cover and extend the parameter space used in previous studies. We also reanalyzed an HLA data set that was previously proposed to cause problems when analyzed using the maximum likelihood method. Our new simulations and a reanalysis of the HLA data demonstrate that the maximum likelihood method has good power and accuracy in detecting positive selection over a wide range of parameter values. Previous studies reporting poor performance of the method appear to be due to numerical problems in the optimization algorithms and did not reflect the true performance of the method. The parsimony method has a very low rate of false positives but very little power for detecting positive selection or identifying positively selected sites.
Notes:
2003
Ross C Hardison, Krishna M Roskin, Shan Yang, Mark Diekhans, W James Kent, Ryan Weber, Laura Elnitski, Jia Li, Michael O'Connor, Diana Kolbe, Scott Schwartz, Terrence S Furey, Simon Whelan, Nick Goldman, Arian Smit, Webb Miller, Francesca Chiaromonte, David Haussler (2003)  Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution.   Genome Res 13: 1. 13-26 Jan  
Abstract: Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human-mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of "neutral" sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for, by GC content in human and the difference between GC content in human and mouse.
Notes:
Simon Whelan, Paul I W de Bakker, Nick Goldman (2003)  Pandit: a database of protein and associated nucleotide domains with inferred trees.   Bioinformatics 19: 12. 1556-1563 Aug  
Abstract: MOTIVATION: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. RESULTS: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.
Notes:
Douglas M Robinson, David T Jones, Hirohisa Kishino, Nick Goldman, Jeffrey L Thorne (2003)  Protein evolution with dependence among codons due to tertiary structure.   Mol Biol Evol 20: 10. 1692-1704 Oct  
Abstract: Markovian models of protein evolution that relax the assumption of independent change among codons are considered. With this comparatively realistic framework, an evolutionary rate at a site can depend both on the state of the site and on the states of surrounding sites. By allowing a relatively general dependence structure among sites, models of evolution can reflect attributes of tertiary structure. To quantify the impact of protein structure on protein evolution, we analyze protein-coding DNA sequence pairs with an evolutionary model that incorporates effects of solvent accessibility and pairwise interactions among amino acid residues. By explicitly considering the relationship between nonsynonymous substitution rates and protein structure, this approach can lead to refined detection and characterization of positive selection. Analyses of simulated sequence pairs indicate that parameters in this evolutionary model can be well estimated. Analyses of lysozyme c and annexin V sequence pairs yield the biologically reasonable result that amino acid replacement rates are higher when the replacements lead to energetically favorable proteins than when they destabilize the proteins. Although the focus here is evolutionary dependence among codons that is associated with protein structure, the statistical approach is quite general and could be applied to diverse cases of evolutionary dependence where surrogates for sequence fitness can be measured or modeled.
Notes:
2002
Nick Goldman, Simon Whelan (2002)  A novel use of equilibrium frequencies in models of sequence evolution.   Mol Biol Evol 19: 11. 1821-1831 Nov  
Abstract: Current mathematical models of amino acid sequence evolution are often applied in variants that match their expected amino acid frequencies to those observed in a data set under analysis. This has been achieved by setting the instantaneous rate of replacement of a residue i by another residue j proportional to the observed frequency of the resulting residue j. We describe a more general method that maintains the match between expected and observed frequencies but permits replacement rates to be proportional to the frequencies of both the replaced and resulting residues, raised to powers other than 1. Analysis of a database of amino acid alignments shows that the description of the evolutionary process in a majority (approximately 70% of 182 alignments) is significantly improved by use of the new method, and a variety of analyses indicate that parameter estimation with the new method is well-behaved. Improved evolutionary models increase our understanding of the process of molecular evolution and are often expected to lead to improved phylogenetic inferences, and so it seems justified to consider our new variants of existing standard models when performing evolutionary analyses of amino acid sequences. Similar methods can be used with nucleotide substitution models, but we have not found these to give corresponding significant improvements to our ability to describe the processes of nucleotide sequence evolution.
Notes:
Pietro Liò, Nick Goldman (2002)  Modeling mitochondrial protein evolution using structural information.   J Mol Evol 54: 4. 519-529 Apr  
Abstract: We present two new models of protein sequence evolution based on structural properties of mitochondrial proteins. We compare these models with others currently used in phylogenetic analyses, investigating their performance over both short and long evolutionary distances. We find that our models that incorporate secondary structure information from mitochondrial proteins are statistically comparable with existing models when studying 13 mitochondrial protein data sets from eutherian mammals. However, our models give a significantly improved description of the evolutionary process when used with 12 mitochondrial proteins from a broader range of organisms including fungi, plants, protists, and bacteria. Our models may thus be of use in estimating mitochondrial protein phylogenies and for the study of processes of mitochondrial protein evolution, in particular for distantly related organisms.
Notes:
Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E Antonarakis, John Attwood, Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler, Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T Chinwalla, Deanna M Church, Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook, Richard R Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts, Mark Daly, Robert David, Joy Davies, Kimberly D Delehaunty, Justin Deri, Emmanouil T Dermitzakis, Colin Dewey, Nicholas J Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak, Diane M Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, Pallavi Eswara, Eduardo Eyras, Adam Felsenfeld, Ginger A Fewell, Paul Flicek, Karen Foley, Wayne N Frankel, Lucinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, Richard A Gibbs, Gustavo Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves, Eric D Green, Simon Gregory, Roderic Guigó, Mark Guyer, Ross C Hardison, David Haussler, Yoshihide Hayashizaki, LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer, Fan Hsu, Axin Hua, Tim Hubbard, Adrienne Hunt, Ian Jackson, David B Jaffe, L Steven Johnson, Matthew Jones, Thomas A Jones, Ann Joy, Michael Kamal, Elinor K Karlsson, Donna Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W James Kent, Andrew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David Kulp, Tom Landers, J P Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Christine Lloyd, Susan Lucas, Bin Ma, Donna R Maglott, Elaine R Mardis, Lucy Matthews, Evan Mauceli, John H Mayer, Megan McCarthy, W Richard McCombie, Stuart McLaren, Kirsten McLay, John D McPherson, Jim Meldrim, Beverley Meredith, Jill P Mesirov, Webb Miller, Tracie L Miner, Emmanuel Mongin, Kate T Montgomery, Michael Morgan, Richard Mott, James C Mullikin, Donna M Muzny, William E Nash, Joanne O Nelson, Michael N Nhan, Robert Nicol, Zemin Ning, Chad Nusbaum, Michael J O'Connor, Yasushi Okazaki, Karen Oliver, Emma Overton-Larty, Lior Pachter, Genís Parra, Kymberlie H Pepin, Jane Peterson, Pavel Pevzner, Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter, Michael Quail, Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin, Alistair G Rust, Ralph Santos, Victor Sapojnikov, Brian Schultz, Jörg Schultz, Matthias S Schwartz, Scott Schwartz, Carol Scott, Steven Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan, Ratna Shownkeen, Sarah Sims, Jonathan B Singer, Guy Slater, Arian Smit, Douglas R Smith, Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama, Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Catherine Ucla, Abel Ureta-Vidal, Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade, Melanie Wall, Ryan J Weber, Robert B Weiss, Michael C Wendl, Anthony P West, Kris Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie Williams, Richard K Wilson, Eitan Winter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-Pyng Yang, Evgeny M Zdobnov, Michael C Zody, Eric S Lander (2002)  Initial sequencing and comparative analysis of the mouse genome.   Nature 420: 6915. 520-562 Dec  
Abstract: The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
Notes:
2001
S Whelan, N Goldman (2001)  A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach.   Mol Biol Evol 18: 5. 691-699 May  
Abstract: Phylogenetic inference from amino acid sequence data uses mainly empirical models of amino acid replacement and is therefore dependent on those models. Two of the more widely used models, the Dayhoff and JTT models, are estimated using similar methods that can utilize large numbers of sequences from many unrelated protein families but are somewhat unsatisfactory because they rely on assumptions that may lead to systematic error and discard a large amount of the information within the sequences. The alternative method of maximum-likelihood estimation may utilize the information in the sequence data more efficiently and suffers from no systematic error, but it has previously been applicable to relatively few sequences related by a single phylogenetic tree. Here, we combine the best attributes of these two methods using an approximate maximum-likelihood method. We implemented this approach to estimate a new model of amino acid replacement from a database of globular protein sequences comprising 3,905 amino acid sequences split into 182 protein families. While the new model has an overall structure similar to those of other commonly used models, there are significant differences. The new model outperforms the Dayhoff and JTT models with respect to maximum-likelihood values for a large majority of the protein families in our database. This suggests that it provides a better overall fit to the evolutionary process in globular proteins and may lead to more accurate phylogenetic tree estimates. Potentially, this matrix, and the methods used to generate it, may also be useful in other areas of research, such as biological sequence database searching, sequence alignment, and protein structure prediction, for which an accurate description of amino acid replacement is required.
Notes:
S Whelan, P Liò, N Goldman (2001)  Molecular phylogenetics: state-of-the-art methods for looking into the past.   Trends Genet 17: 5. 262-272 May  
Abstract: As the amount of molecular sequence data in the public domain grows, so does the range of biological topics that it influences through evolutionary considerations. In recent years, a number of developments have enabled molecular phylogenetic methodology to keep pace. Likelihood-based inferential techniques, although controversial in the past, lie at the heart of these new methods and are producing the promised advances in the understanding of sequence evolution. They allow both a wide variety of phylogenetic inferences from sequence data and robust statistical assessment of all results. It cannot remain acceptable to use outdated data analysis techniques when superior alternatives exist. Here, we discuss the most important and exciting methods currently available to the molecular phylogeneticist.
Notes:
2000
E Hagelberg, N Goldman, P Liò, S Whelan, W Schiefenhövel, J B Clegg, D K Bowden (2000)  Evidence for mitochondrial DNA recombination in a human population of island Melanesia: correction.   Proceedings of the Royal Society of London B 267: 1595-1596  
Abstract: We recently presented evidence of mitochondrial DNA recombination in humans based on the observation of a rare mutation in several unrelated human lineages in Nguna, a small island in Vanuatu, island Melanesia. Since then, the mutation has been shown to be an artefact caused by misalignment of the DNA sequences. Our previous conclusion, that the presence of a rare mutation on different haplotypic backgrounds was a consequence of genetic recombination, is no longer tenable for these data.
Notes:
Z Yang, R Nielsen, N Goldman, A M Pedersen (2000)  Codon-substitution models for heterogeneous selection pressure at amino acid sites.   Genetics 155: 1. 431-449 May  
Abstract: Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acid-altering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (omega = d(N)/d(S)) is an important indicator of selective pressure at the protein level, with omega = 1 meaning neutral mutations, omega < 1 purifying selection, and omega > 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying omega ratios. We develop models that account for heterogeneous omega ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of omega among sites. In all data sets analyzed, the selective pressure indicated by the omega ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average omega ratio across sites is <1, but in which some sites are clearly under diversifying selection with omega > 1. Genes undergoing positive selection include the beta-globin gene from vertebrates, mitochondrial protein-coding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV-1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for omega and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of omega among sites from real data sets.
Notes:
T Massingham, N Goldman (2000)  EDIBLE: experimental design and information calculations in phylogenetics.   Bioinformatics 16: 3. 294-295 Mar  
Abstract: Although evolutionary inference from molecular sequences is a statistical problem, little attention has been paid to questions of experimental design. A computer program, EDIBLE, has been developed to perform likelihood calculations based on Markov process models of nucleotide substitution allied with phylogenetic trees, and from these to compute Fisher information measures under different experimental designs. These calculations can be used to answer questions of optimal experimental design in molecular phylogenetics. AVAILABILITY: Source code (ANSI C), executables and documentation for EDIBLE are available from http://ng-dec1.gen.cam.ac.uk/info/index.html and 'downstream' Web pages. CONTACT: N.Goldman@gen.cam.ac.uk
Notes:
N Goldman, J P Anderson, A G Rodrigo (2000)  Likelihood-based tests of topologies in phylogenetics.   Syst Biol 49: 4. 652-670 Dec  
Abstract: Likelihood-based statistical tests of competing evolutionary hypotheses (tree topologies) have been available for approximately a decade. By far the most commonly used is the Kishino-Hasegawa test. However, the assumptions that have to be made to ensure the validity of the Kishino-Hasegawa test place important restrictions on its applicability. In particular, it is only valid when the topologies being compared are specified a priori. Unfortunately, this means that the Kishino-Hasegawa test may be severely biased in many cases in which it is now commonly used: for example, in any case in which one of the competing topologies has been selected for testing because it is the maximum likelihood topology for the data set at hand. We review the theory of the Kishino-Hasegawa test and contend that for the majority of popular applications this test should not be used. Previously published results from invalid applications of the Kishino-Hasegawa test should be treated extremely cautiously, and future applications should use appropriate alternative tests instead. We review such alternative tests, both nonparametric and parametric, and give two examples which illustrate the importance of our contentions.
Notes:
1999
S Whelan, N Goldman (1999)  Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics.   Mol Biol Evol 16: 1292-1299  
Abstract: Asymptotic statistical theory suggests that when two nested models are compared by a likelihood ratio test, a χ² distribution, with number of degrees of freedom equal to the difference in numbers of free parameters of the two models, can be used for significance testing. This asymptotic result has been assumed to apply in phylogenetics with the support of only a few studies. In this paper, 12 comparisons among a selection of commonly used models of nucleotide substitution were examined to see whether this assumption is reasonable. The true distributions of likelihood ratio statistics were estimated by computer simulation and compared with the appropriate χ² distributions. It was found that χ² distributions are adequate for significance testing in the comparison of models differing by parameters describing transition/transversion bias and/or unequal base frequencies when these parameters have been estimated by maximum likelihood. The χ² distribution was, however, found to be significantly different from the true distributions in the comparison of models differing by parameters describing rate variation across sites (estimated by maximum likelihood) or unequal base frequencies (estimated as the observed base frequencies in an alignment). These last findings may have important consequences for real- model comparisons and for the construction of increasingly complex and realistic models of nucleotide sequence evolution.
Notes:
E Hagelberg, N Goldman, P Liò, S Whelan, W Schiefenhövel, J B Clegg, D K Bowden (1999)  Evidence for mitochondrial DNA recombination in a human population of island Melanesia.   Proc Biol Sci 266: 1418. 485-492 Mar  
Abstract: Mitochondrial DNA (mtDNA) analysis has proved useful in studies of recent human evolution and the genetic affinities of human groups of different geographical regions. As part of an extensive survey of mtDNA diversity in present-day Pacific populations, we obtained sequence information of the hypervariable mtDNA control region of 452 individuals from various localities in the western Pacific. The mtDNA types fell into three major groups which reflect the settlement history of the area. Interestingly, we detected an extremely rare point mutation at high frequency in the small island of Nguna in the Melanesian archipelago of Vanuatu. Phylogenetic analysis of the mtDNA data indicated that the mutation was present in individuals of separate mtDNA lineages. We propose that the multiple occurrence of a rare mutation event in one isolated locality is highly improbable, and that recombination between different mtDNA types is a more likely explanation for our observation. If correct, this conclusion has important implications for the use of mtDNA in phylogenetic and evolutionary studies.
Notes:
D D Pollock, W R Taylor, N Goldman (1999)  Coevolving protein residues: maximum likelihood identification and relationship to structure.   J Mol Biol 287: 1. 187-198 Mar  
Abstract: The identification of protein sites undergoing correlated evolution (coevolution) is of great interest due to the possibility that these pairs will tend to be adjacent in the three-dimensional structure. Identification of such pairs should provide useful information for understanding the evolutionary process, predicting the effects of site-directed substitution, and potentially for predicting protein structure. Here, we develop and apply a maximum likelihood method with the aim of improving detection of coevolution. Unlike previous methods which have had limited success, this method allows for correlations induced by phylogenetic relationships and for variation in rate of evolution along branches, and does not rely on accurate reconstruction of ancestral nodes. In order to reduce the complexity of coevolutionary relationships and identify the primary component of pairwise coevolution between two sites, we reduce the data to a two-state system at each site, regardless of the actual number of residues observed at that site. Simulations show that this strategy is good at identifying simple correlations and at recognizing cases in which the data are insufficient to distinguish between coevolution and spurious correlations. The new method was tested by using size and charge characteristics to group the residues at each site, and then evaluating coevolution in myoglobin sequences. Grouping based on physicochemical characteristics allows categorization of coevolving sites into positive and negative coevolution, depending on the correlation between equilibrium state frequencies. We detected a striking excess of negative coevolution (corresponding to charge) at sites brought into proximity by the periodicity of the alpha-helix, and there was also a tendency for sites with significant likelihood ratios to be close in the three-dimensional structure. Sites on the surface of the protein appear to coevolve both when they are close in the structure, and when they are distant, implying a role for folding and/or avoidance of quaternary structure in the coevolution process.
Notes:
P Liò, N Goldman (1999)  Using protein structural information in evolutionary inference: transmembrane proteins.   Mol Biol Evol 16: 12. 1696-1710 Dec  
Abstract: We present a model of amino acid sequence evolution based on a hidden Markov model that extends to transmembrane proteins previous methods that incorporate protein structural information into phylogenetics. Our model aims to give a better understanding of processes of molecular evolution and to extract structural information from multiple alignments of transmembrane sequences and use such information to improve phylogenetic analyses. This should be of value in phylogenetic studies of transmembrane proteins: for example, mitochondrial proteins have acquired a special importance in phylogenetics and are mostly transmembrane proteins. The improvement in fit to example data sets of our new model relative to less complex models of amino acid sequence evolution is statistically tested. To further illustrate the potential utility of our method, phylogeny estimation is performed on primate CCR5 receptor sequences, sequences of L and M subunits of the light reaction center in purple bacteria, guinea pig sequences with respect to lagomorph and rodent sequences of calcitonin receptor and K-substance receptor, and cetacean sequences of cytochrome b.
Notes:
1998
N Goldman (1998)  Effects of sequence alignment procedures on estimates of phylogeny.   BioEssays 20: 287-290  
Abstract: Previous debate about statistical variation in inferred phylogenies has focused on procedures for the estimation of evolutionary relationships from aligned sequences. Morrison and Ellis[1] have recently drawn attention to additional variation attributable to the alignment procedure used and have suggested that this may be highly significant. This raises doubts about our ability to infer reliable phylogenies. Although concerns may not be as serious as their analyses at first imply, Morrison and Ellis[1] have performed a useful service in reminding us that accurate sequence alignment is a crucial part of molecular phylogenetics.
Notes:
P Liò, N Goldman (1998)  Models of molecular evolution and phylogeny.   Genome Res 8: 12. 1233-1244 Dec  
Abstract: Phylogenetic reconstruction is a fast-growing field that is enriched by different statistical approaches and by findings and applications in a broad range of biological areas. Fundamental to these are the mathematical models used to describe the patterns of DNA base substitution and amino acid replacement. These may become some of the basic models for comparative genome research. We discuss these models, including the analysis of observed DNA base and amino acid mutation patterns, the concept of site heterogeneity, and the incorporation of structural biology data, all of which have become particularly important in recent years. We also describe the use of such models in phylogenetic reconstruction and statistical methods for the comparison of different models.
Notes:
N Goldman, J L Thorne, D T Jones (1998)  Assessing the impact of secondary structure and solvent accessibility on protein evolution.   Genetics 149: 1. 445-458 May  
Abstract: Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolution, both above and below the species level, will not be well understood until the physical constraints that affect protein evolution are identified and characterized.
Notes:
P Liò, N Goldman, J L Thorne, D T Jones (1998)  PASSML: combining evolutionary inference and protein secondary structure prediction.   Bioinformatics 14: 8. 726-733  
Abstract: MOTIVATION: Evolutionary models of amino acid sequences can be adapted to incorporate structure information; protein structure biologists can use phylogenetic relationships among species to improve prediction accuracy. RESULTS: A computer program called PASSML ('Phylogeny and Secondary Structure using Maximum Likelihood') has been developed to implement an evolutionary model that combines protein secondary structure and amino acid replacement. The model is related to that of Dayhoff and co-workers, but we distinguish eight categories of structural environment: alpha helix, beta sheet, turn and coil, each further classified according to solvent accessibility, i.e. buried or exposed. The model of sequence evolution for each of the eight categories is a Markov process with discrete states in continuous time, and the organization of structure along protein sequences is described by a hidden Markov model. This paper describes the PASSML software and illustrates how it allows both the reconstruction of phylogenies and prediction of secondary structure from aligned amino acid sequences. AVAILABILITY: PASSML 'ANSI C' source code and the example data sets described here are available at http://ng-dec1.gen.cam.ac.uk/hmm/Passml.html and 'downstream' Web pages. CONTACT: P.Lio@gen.cam.ac.uk
Notes:
N Goldman (1998)  Phylogenetic information and experimental design in molecular systematics.   Proc Biol Sci 265: 1407. 1779-1786 Sep  
Abstract: Despite the widespread perception that evolutionary inference from molecular sequences is a statistical problem, there has been very little attention paid to questions of experimental design. Previous consideration of this topic has led to little more than an empirical folklore regarding the choice of suitable genes for analysis, and to dispute over the best choice of taxa for inclusion in data sets. I introduce what I believe are new methods that permit the quantification of phylogenetic information in a sequence alignment. The methods use likelihood calculations based on Markov-process models of nucleotide substitution allied with phylogenetic trees, and allow a general approach to optimal experimental design. Two examples are given, illustrating realistic problems in experimental design in molecular phylogenetics and suggesting more general conclusions about the choice of genomic regions, sequence lengths and taxa for evolutionary studies.
Notes:
1997
K Strimmer, N Goldman, A von Haeseler (1997)  Bayesian probabilities and quartet puzzling.   Mol Biol Evol 14: 210-211  
Abstract: Quartet puzzling (QP), a heuristic tree search procedure for maximum-likelihood trees, has recently been introduced (Strimmer and von Haeseler 1996). This method uses maximum-likelihood criteria for quartets of taxa which are then combined to form trees based on larger numbers of taxa. Thus, QP can be practically applied to data sets comprising a much greater number of taxa than can other search algorithms such as stepwise addition and subsequent branch swapping as implemented, e.g., in DNAML (Felsenstein 1993). However, its ability to reconstruct the true tree is less than that of DNAML (Strimmer and von Haeseler 1996). Here, we show that the assignment of penalties in the puzzling step of the QP algorithm is a special case of a more general Bayesian weighting scheme for quartet topologies. Application of this general framework leads to an improvement in the efficiency of QP at recovering the true tree as well as to better theoretical understanding of the method itself. On average, the accuracy of QP increases by 10% over all cases studied, without compromising speed or requiring more computer memory.
Notes:
1996
N Goldman, J L Thorne, D T Jones (1996)  Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses.   J Mol Biol 263: 2. 196-208 Oct  
Abstract: Previously proposed methods for protein secondary structure prediction from multiple sequence alignments do not efficiently extract the evolutionary information that these alignments contain. The predictions of these methods are less accurate than they could be, because of their failure to consider explicitly the phylogenetic tree that relates aligned protein sequences. As an alternative, we present a hidden Markov model approach to secondary structure prediction that more fully uses the evolutionary information contained in protein sequence alignments. A representative example is presented, and three experiments are performed that illustrate how the appropriate representation of evolutionary relatedness can improve inferences. We explain why similar improvement can be expected in other secondary structure prediction methods and indeed any comparative sequence analysis method.
Notes:
J L Thorne, N Goldman, D T Jones (1996)  Combining protein evolution and secondary structure.   Mol Biol Evol 13: 5. 666-673 May  
Abstract: An evolutionary model that combines protein secondary structure and amino acid replacement is introduced. It allows likelihood analysis of aligned protein sequences and does not require the underlying secondary (or tertiary) structures of these sequences to be known. One component of the model describes the organization of secondary structure along a protein sequence and another specifies the evolutionary process for each category of secondary structure. A database of proteins with known secondary structures is used to estimate model parameters representing these two components. Phylogeny, the third component of the model, can be estimated from the data set of interest. As an example, we employ our model to analyze a set of sucrose synthase sequences. For the evolution of sucrose synthase, a parametric bootstrap approach indicates that our model is statistically preferable to one that ignores secondary structure.
Notes:
1995
Z Yang, N Goldman, A Friday (1995)  Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem.   Syst Biol 44: 384-399  
Abstract: The parameter space of the phylogenetic tree estimation problem consists of three components, T, t, and θ. The tree topology T is a discrete entity that is not a proper statistical parameter but that can nevertheless be estimated using the maximum likelihood criterion. Its role is to specify the branch length parameters and the form of the likelihood function(s). Branch lengths t are conditional on T and are meaningful only for specific values of T. Parameters θ in the model of nucleotide substitution are common to all the tree topologies and represent such values as the transition/transversion rate ratio. T and t thus represent the tree, and θ represents the model. With typical DNA sequence data, differences in T have only a small effect on the likelihood, but changing θ will influence the likelihood greatly. Estimates of θ are also found to be insensitive to T, making it possible to obtain reliable estimates of θ and to perform tests concerning the model (θ) even if knowledge of the evolutionary relationship (T) is not available. In contrast, tests concerning t, such as testing the existence of a molecular clock, appear to be more difficult to perform when the true topology is unknown. In this paper, we explore the peculiarity of the parameter space of the tree estimation problem and suggest methods for overcoming some difficulties involved with tests concerning the model. We also address difficulties concerning hypothesis testing on T, i.e., evaluation of the reliability of the estimated tree topology. We note that estimation of and particularly tests concerning T depend critically on the assumed model.
Notes:
J Breuer, N W Douglas, N Goldman, R S Daniels (1995)  Human immunodeficiency virus type 2 (HIV-2) env gene analysis: prediction of glycoprotein epitopes important for heterotypic neutralization and evidence for three genotype clusters within the HIV-2a subtype.   J Gen Virol 76 ( Pt 2): 333-345 Feb  
Abstract: The env gene sequences of ten tissue-culture-adapted human immunodeficiency virus type 2 (HIV-2) isolates from West African patients were determined. Alignment and comparison of the gene sequences and putative translation products with database sequences revealed 11-29% diversity at the nucleotide level and 15-31% variation at the protein level. From analysis of glycoproteins of HIV-2 strains sensitive and resistant to neutralization by HIV-1 antisera, five regions were identified as putative targets for cross-neutralizing antibody. The HIV-2 equivalent of the HIV-1 V3 loop was not included in this number. However, three of the HIV-2 peptides aligned with regions identified as targets for broad neutralization of HIV-1 strains. These were the V2 and CD4-binding domains of gp120 and the Kennedy domain in gp41. Phylogenetic analysis of the env gene sequences, together with HIV-2 env gene sequences published in the Los Alamos database, supports the identification of two distinct HIV-2 subtypes, HIV-2a and HIV-2b. The new sequences are located within the HIV-2a subtype and allow prediction of at least three genotypes, designated I-III. Some correlation of genotype with geographical origin of isolates was noted. Genotype I viruses originate from Guinea Bissau and group II viruses mainly originate from The Gambia. One isolate from Guinea Bissau, HIV-2CAM4, appears phylogenetically older than other viruses in the HIV-2a subtype. The possible implications of this in the light of epidemiological findings in Guinea Bissau are discussed.
Notes:
1994
M J Gardner, N Goldman, P Barnett, P W Moore, K Rangachari, M Strath, A Whyte, D H Williamson, R J Wilson (1994)  Phylogenetic analysis of the rpoB gene from the plastid-like DNA of Plasmodium falciparum.   Mol Biochem Parasitol 66: 2. 221-231 Aug  
Abstract: Malaria and other Apicomplexan parasites harbour two extrachromosomal DNAs. One is mitochondrial and the other is a 35-kb circle with some plastid-like features but whose provenance and function are unknown. In addition to genes for rRNAs, tRNAs and ribosomal proteins, the 35-kb circular DNA of Plasmodium falciparum carries an rpoBC operon which encodes subunits of a eubacteria-like RNA polymerase. The phylogenetic analysis of the complete rpoB sequence presented here supports our inference that the 35-kb circle is the remnant of a plastid genome.
Notes:
N Goldman (1994)  Variance to mean ratio, R(t), for Poisson processes on phylogenetic trees.   Mol Phylogenet Evol 3: 3. 230-239 Sep  
Abstract: The ratio of expected variance to mean, R(t), of numbers of DNA base substitutions for contemporary sequences related by a "star" phylogeny is widely seen as a measure of the adherence of the sequences' evolution to a Poisson process with a molecular clock, as predicted by the "neutral theory" of molecular evolution under certain conditions. A number of estimators of R(t) have been proposed, all predicted to have mean 1 and distributions based on the χ². Various genes have previously been analyzed and found to have values of R(t) far in excess of 1, calling into question important aspects of the neutral theory. In this paper, I use Monte Carlo simulation to show that the previously suggested means and distributions of estimators of R(t) are highly inaccurate. The analysis is applied to star phylogenies and to general phylogenetic trees, and well-known gene sequences are reanalyzed. For star phylogenies the results show that Kimura's estimators ("The Neutral Theory of Molecular Evolution," Cambridge Univ. Press, Cambridge, 1983) are unsatisfactory for statistical testing of R(t), but confirm the accuracy of Bulmer's correction factor (Genetics 123: 615-619, 1989). For all three nonstar phylogenies studied, attained values of all three estimators of R(t), although larger than 1, are within their true confidence limits under simple Poisson process models. This shows that lineage effects can be responsible for high estimates of R(t), restoring some limited confidence in the molecular clock and showing that the distinction between lineage and molecular clock effects is vital. (ABSTRACT TRUNCATED AT 250 WORDS)
Notes:
Z Yang, N Goldman (1994)  Evaluation and extension of Markov process models for the evolution of DNA (in Chinese, with Abstract in English).   Acta Genetica Sinica 21: 17-23  
Abstract: Markov process models of nucleotide substitution are evaluated. A model proposed by Lanave et al (1984), alleged to need no a priori assumption about the substitution pattern, is found to have the assumption of reversibility. Calculations based on the 2-p, 4-p, and 6-p substitution schemes show that site variation of substitution speed leads to serious under-estimation of sequence divergence by various methods. Spatial pattern variation also leads to under-estimation, but the discrepancy is slight. A nonhomogeneous Markov process model is used to study the temporal variations of rates and it is shown that the estimated number of substitutions reflects a rate averaged over time. The implications of those results for evolutionary phylogenetics are discussed.
Notes:
Z Yang, N Goldman, A Friday (1994)  Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation.   Mol Biol Evol 11: 2. 316-324 Mar  
Abstract: Using real sequence data, we evaluate the adequacy of assumptions made in evolutionary models of nucleotide substitution and the effects that these assumptions have on estimation of evolutionary trees. Two aspects of the assumptions are evaluated. The first concerns the pattern of nucleotide substitution, including equilibrium base frequencies and the transition/transversion-rate ratio. The second concerns the variation of substitution rates over sites. The maximum-likelihood estimate of tree topology appears quite robust to both these aspects of the assumptions of the models, but evaluation of the reliability of the estimated tree by using simpler, less realistic models can be misleading. Branch lengths are underestimated when simpler models of substitution are used, but the underestimation caused by ignoring rate variation over nucleotide sites is much more serious. The goodness of fit of a model is reduced by ignoring spatial rate variation, but unrealistic assumptions about the pattern of nucleotide substitution can lead to an extraordinary reduction in the likelihood. It seems that evolutionary biologists can obtain accurate estimates of certain evolutionary parameters even with an incorrect phylogeny, while systematists cannot get the right tree with confidence even when a realistic, and more complex, model of evolution is assumed.
Notes:
N Goldman, Z Yang (1994)  A codon-based model of nucleotide substitution for protein-coding DNA sequences.   Mol Biol Evol 11: 5. 725-736 Sep  
Abstract: A codon-based model for the evolution of protein-coding DNA sequences is presented for use in phylogenetic estimation. A Markov process is used to describe substitutions between codons. Transition/transversion rate bias and codon usage bias are allowed in the model, and selective restraints at the protein level are accommodated using physicochemical distances between the amino acids coded for by the codons. Analyses of two data sets suggest that the new codon-based model can provide a better fit to data than can nucleotide-based models and can produce more reliable estimates of certain biologically important measures such as the transition/transversion rate ratio and the synonymous/nonsynonymous substitution rate ratio.
Notes:
1993
N Goldman (1993)  Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences.   Nucleic Acids Res 21: 10. 2487-2491 May  
Abstract: The chaos game representation (CGR) is a scatter plot derived from a DNA sequence, with each point of the plot corresponding to one base of the sequence. If the DNA sequence were a random collection of bases, the CGR would be a uniformly filled square; conversely, any patterns visible in the CGR represent some pattern (information) in the DNA sequence. In this paper, patterns previously observed in a variety of DNA sequences are explained solely in terms of nucleotide, dinucleotide and trinucleotide frequencies.
Notes:
N Goldman (1993)  Simple diagnostic statistical tests of models for DNA substitution.   J Mol Evol 37: 6. 650-661 Dec  
Abstract: The accuracy of models for DNA substitution used in phylogenetic analyses is becoming more important with the increasing availability and analysis of molecular sequence data. It is natural to look for ways of improving these models, and to do this in a planned manner it is useful to be able to identify features of sequences that may not be described adequately. In this paper, I describe three statistics which may give useful diagnostic information on departures from models' predictions. The statistical distributions of these statistics are discussed and simple significance tests are derived. These tests are based on the (estimated) phylogeny of the sequences and so have the advantage of using the information contained in this tree. Examples are given of the application of the new tests to Markov chain models describing the evolution of primate pseudogene sequences and small-subunit RNA sequences.
Notes:
N Goldman (1993)  Statistical tests of models of DNA substitution.   J Mol Evol 36: 2. 182-198 Feb  
Abstract: Penny et al. have written that "The most fundamental criterion for a scientific method is that the data must, in principle, be able to reject the model. Hardly any [phylogenetic] tree-reconstruction methods meet this simple requirement." The ability to reject models is of such great importance because the results of all phylogenetic analyses depend on their underlying models; to have confidence in the inferences, it is necessary to have confidence in the models. In this paper, a test statistic suggested by Cox is employed to test the adequacy of some statistical models of DNA sequence evolution used in the phylogenetic inference method introduced by Felsenstein. Monte Carlo simulations are used to assess significance levels. The resulting statistical tests provide an objective and very general assessment of all the components of a DNA substitution model; more specific versions of the test are devised to test individual components of a model. In all cases, the new analyses have the additional advantage that values of phylogenetic parameters do not have to be assumed in order to perform the tests.
Notes:
1992
1990
N Goldman (1990)  Maximum likelihood inference of phylogenetic trees, with special reference to a Poisson process model of DNA substitution and to parsimony analyses.   Syst Zool 39: 345-361  
Abstract: Maximum likelihood inference is discussed, and some of its advantages and disadvantages are noted. The application of maximum likelihood inference to phylogenetics is examined, and a simple Poisson process model of DNA substitution is used as one example. Further examples follow from the clarification of implicit models underlying traditional “parsimony” and “compatibility” analyses. From the elucidation of these models and analyses, it is seen that Poisson process analysis gives a statistically consistent estimate of phylogeny, and that parsimony methods do indeed have a maximum likelihood foundation but give potentially incorrect estimates of phylogeny. The maximum likelihood formulation provides a common framework within which these analyses are discussed and compared.
Notes:
N Goldman, T B B Paddock, K M Shaw (1990)  Quantitative analysis of shape variation in populations of Surirella fastuosa.   Diatom Research 5: 25-42  
Abstract: The valve outline shapes of two populations of Surirella fastuosa, from Belize and the Philippines, were studied by a modification of the Legendre polynomial method of Stoermer & Ladewski (1982). Principal components and linear discriminant analyses were employed, the latter yielding a rule which correctly classifies over 70% of an 'unknown' sample. However, a comparison of these two populations and 97 taxa that have been described within the S. fastuosa group does not support the maintenance of a large number of species distinguished by slight morphological differences.
Notes:
1989
N Goldman, P J D Lambshead (1989)  Optimization of the Ewens/Caswell neutral model program for community diversity analysis.   Marine Ecology Progress Series 50: 255-261  
Abstract: The neutral model program of Ewens (1972) was introduced into ecology by Caswell (1976) and has become a useful tool for benthic ecology. Lambshead & Platt (1988) noted that Ewens's program presented serious computational problems which could not be effectively overcome by subsampling or by substitution of a simpler equitability method. In this paper a modified program is presented which greatly reduces these computational difficulties, making it suitable for personal computers.
Notes:
1988
N Goldman (1988)  Methods for discrete coding of morphological characters for numerical analysis.   Cladistics 4: 59-71  
Abstract: "Generalized gap-coding", suggested by Archie (1985), is a technique for coding continuous characters for numerical analysis. It consists of two stages: finding discriminant subsets of the taxa in which no pair are identifiably different (to an arbitrary level, and in a sense defined by Archie, 1985, and below), and coding this information into multistate or binary variables. If quantitative data are to be used in cladistic analysis, recently questioned by Pimentel and Riggins (1987) and Cranston and Humphries (1988), then Archie's (1985) method is generally applicable. However, there are problems and so in this paper the theory and method of the first stage are restated more clearly, and modified versions of the second stage are suggested. A new method for recoding multistate ordered variables into a number of variables of fewer states and two techniques for decreasing the size of a data matrix, whilst retaining all the information significant to many numerical analyses, are proposed. Additionally, the terminology used to differentiate clearly between observable features and their representation in a data matrix is given in the Appendix.
Notes:
1983

Book chapters

2007
2003
2001
1997
1996
1995
1994
N Goldman, Z Yang (1994)  Models of DNA substitution and the discrimination of evolutionary parameters.   In: Proceedings of the XVIIth International Biometrics Conference, Hamilton, Ontario, Canada 407-421 International Biometric Society  
Abstract: Models of DNA nucleotide substitution are important for the estimation of phylogenetic trees and for the understanding of the evolution of DNA sequences. Statistical tests of the accuracy of commonly used models often indicate that simple models are inadequate. We have considered problems of assessing the adequacy of models and distinguishing good models from bad ones, and questions of the levels of confidence we can have in inferences derived using the models. We conclude that it is relatively easy to assess models and to estimate their parameters. The primary aim of most researchers, however, is to reconstruct phylogenetic trees and we have much less confidence in our ability to do this. Unusual results arise, suggesting that less accurate models can give better discrimination between candidate trees.
Notes:

PhD theses

1992