Dr. Tobias Müller
Senior Lecturer (Akademischer Oberrat)
Department of Bioinformatics
Biocenter (Theodor-Boveri-Institute)
Bayerische Julius-Maximilians University of Würzburg
Am Hubland, D-97074 Würzburg, Germany
Short Curriculum
2011 - today Senior Lecturer, Department of Bioinformatics, University of Würzburg, Germany
2003 - 2011 Lecturer, Department of Bioinformatics, University of Würzburg, Germany
2001 - 2003 Postdoc, Max Planck Institute for Molecular Genetics, Computational Biology, Berlin, Germany
1999 - 2000 Pre doctoral Fellow, Max Planck Institute for Molecular Genetics, Computational Biology, Berlin, Germany
1996 - 1999 Pre doctoral Fellow, German Cancer Research Center (DKFZ), Division of Theoretical Bioinformatics, Heidelberg, Germany
1991 - 1995 Teaching Assistant at the Mathematical Institute of the University of Bonn
Abstract: The internal transcribed spacer 2 (ITS2) has been used as a phylogenetic marker for more than two decades. As ITS2 research mainly focused on the very variable ITS2 sequence, it confined this marker to low-level phylogenetics only. However, the combination of the ITS2 sequence and its highly conserved secondary structure improves the phylogenetic resolution(1) and allows phylogenetic inference at multiple taxonomic ranks, including species delimitation(2-8). The ITS2 Database(9) presents an exhaustive dataset of internal transcribed spacer 2 sequences from NCBI GenBank(11) accurately reannotated(10). Following an annotation by profile Hidden Markov Models (HMMs), the secondary structure of each sequence is predicted. First, it is tested whether a minimum energy based fold(12) (direct fold) results in a correct, four helix conformation. If this is not the case, the structure is predicted by homology modeling(13). In homology modeling, an already known secondary structure is transferred to another ITS2 sequence, whose secondary structure was not able to fold correctly in a direct fold. The ITS2 Database is not only a database for storage and retrieval of ITS2 sequence-structures. It also provides several tools to process your own ITS2 sequences, including annotation, structural prediction, motif detection and BLAST(14) search on the combined sequence-structure information. Moreover, it integrates trimmed versions of 4SALE(15,16) and ProfDistS(17) for multiple sequence-structure alignment calculation and Neighbor Joining(18) tree reconstruction. Together they form a coherent analysis pipeline from an initial set of sequences to a phylogeny based on sequence and secondary structure. In a nutshell, this workbench simplifies first phylogenetic analyses to only a few mouse-clicks, while additionally providing tools and data for comprehensive large-scale analyses.
Abstract: The first step of any molecular phylogenetic analysis is the selection of the species and sequences to be included, the taxon sampling. Different pitfalls exist already at this stage. Sequences can contain errors, annotations in databases can be inaccurate and even the taxonomic classification of a species can be wrong. Usually, these artefacts become evident only after calculation of the phylogenetic tree. Consequently, the taxon sampling has to be corrected iteratively. This can become tedious and time consuming, as in most cases the taxon sampling is decoupled from the further steps of the phylogenetic analysis. Here, we present the ITS2 Workbench (http://its2.bioapps.biozentrum.uni-wuerzburg.de/), which eliminates this problem by a tight integration of taxon sampling, secondary structure prediction, multiple alignment and phylogenetic tree calculation. The ITS2 Workbench has access to more than 280,000 ITS2 sequences and their structures provided by the ITS2 Database, enabling sequence-structure based alignment and tree reconstruction. This allows the interactive improvement of the taxon sampling throughout the whole phylogenetic tree reconstruction process. Thus, the ITS2 Workbench enables a fast, interactive and iterative taxon sampling leading to more accurate ITS2 based phylogenies.
Abstract: Neisseria meningitidis is a naturally transformable, facultative pathogen colonizing the human nasopharynx. Here, we analyze on a genome-wide level the impact of recombination on gene-complement diversity and virulence evolution in N. meningitidis. We combined comparative genome hybridization using microarrays (mCGH) and multilocus sequence typing (MLST) of 29 meningococcal isolates with computational comparison of a subset of seven meningococcal genome sequences.
Abstract: Background: Gene function analysis of the obligate intracellular bacterium Chlamydia pneumoniae is hampered by the facts that this organism is inaccessible to genetic manipulations and not cultivable outside the host. The genomes of several strains have been sequenced; however, very little information is available on the gene structure and transcriptome of C. pneumoniae.
Results: Using a differential RNA-sequencing approach with specific enrichment of primary transcripts, we defined the transcriptome of purified elementary bodies and reticulate bodies of C. pneumoniae strain CWL-029. 565 transcriptional start sites of annotated genes and novel transcripts were mapped. Analysis of adjacent genes for co-transcription revealed 246 polycistronic transcripts. In total, a distinct transcription start site or an affiliation to an operon could be assigned to 862 out of 1074 annotated protein coding genes. Semi-quantitative analysis of mapped cDNA reads revealed significant differences for 288 genes in the RNA levels of genes isolated from elementary bodies and reticulate bodies. We have identified and in part confirmed 75 novel putative non-coding RNAs. The detailed map of transcription start sites at single nucleotide resolution allowed for the first time a comprehensive and saturating analysis of promoter consensus sequences in Chlamydia.
Conclusions: The precise transcriptional landscape as a complement to the genome sequence will provide new insights into the organization, control and function of genes. Novel non-coding RNAs and identified common promoter motifs will help to understand gene regulation of this important human pathogen.
Abstract: Species diversity is by far highest in arthropods, and in the tropics it is exceptionally high in the tree canopy. Canopy diversity has so far been neglected in research, however, so that the real impact of anthropogenic forest destruction on species diversity, as well as its functional importance, can hardly be assessed. We collected canopy arthropods by insecticidal knockdown fogging in SE-Asian lowland rain forests between 1992 and 2001, and here we use spiders to investigate the consequences of slash-and-burn cultivation. We measured species diversity in a primary forest and in six forest fragments of different age and degree of isolation. Our statistical analysis suggests that neither year of sampling nor tree species significantly affected spider communities. By contrast, spider communities were clearly determined by forest isolation followed by forest age, both resulting in spider communities specific to different forest types. With respect to guild composition, spider communities in the isolated forests were most clearly affected, with the result that orb-web weavers had increased at the expense of sheet-web weavers, agile hunters, and cursorial hunters. Species richness was positively correlated with forest fragment age only under conditions where colonization was possible. In those gradient forests which adjoined primary forests, communities approximated those in the primary forest within 40 years. In contrast, a distance of 10 km effectively prevented re-immigration, resulting in low-diversity communities that showed hardly any development in 50 years. Our data suggest that most primary forest spiders are habitat specialists with restricted dispersal ability, indicating that the erosion of biodiversity can only be stopped by a high degree of habitat connectivity.
Abstract: In several studies, secondary structures of ribosomal genes have been used to improve the quality of phylogenetic reconstructions. An extensive evaluation of the benefits of secondary structure, however, is lacking.
Abstract: Hidden Markov models (HMMs) play a major role in applications to unravel biomolecular functionality. Though HMMs are technically mature and widely applied in computational biology, there is potential for methodical optimisation concerning their modelling of biological data sources with varying sequence lengths. Single building blocks of these models, the states, are associated with a certain holding time, being the link to the length distribution of represented sequence motifs. An adaptation of regular HMM topologies to bell-shaped sequence lengths is achieved by a serial chain-linking of hidden states, while residing in the class of conventional hidden Markov models. The factor of the repetition of states (r) and the parameter for state-specific duration of stay (p) are determined by fitting the distribution of sequence lengths with the method of moments (MM) and maximum likelihood (ML). Performance evaluations of differently adjusted HMM topologies underline the impact of an optimisation for HMMs based on sequence lengths. Secondary structure prediction on internal transcribed spacer 2 sequences demonstrates exemplarily the general impact of topological optimisations. In summary, we propose a general methodology to improve the modelling behaviour of HMMs by topological optimisation with ML and a fast and easily implementable moment estimator.
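The moment-based fit of r and p can be illustrated with a minimal sketch. It assumes a parameterisation that the abstract does not spell out: each of the r chained states has the same self-transition probability p, so the total holding time is a sum of r geometric variables with mean r/(1-p) and variance rp/(1-p)^2. The function name and the example lengths are hypothetical.

```python
from statistics import fmean, pvariance


def fit_chain_by_moments(lengths):
    """Method-of-moments fit for a chain of r identical states.

    Assumption (not taken from the abstract): every state has self-loop
    probability p, so the total length is a sum of r geometric holding
    times with mean r / (1 - p) and variance r * p / (1 - p) ** 2.
    """
    m = fmean(lengths)
    v = pvariance(lengths)
    if v <= 0:
        raise ValueError("lengths must vary to estimate p")
    p = v / (m + v)            # follows from v / m = p / (1 - p)
    r_real = m * (1 - p)       # follows from m = r / (1 - p)
    r = max(1, round(r_real))  # a topology needs a whole number of states
    return r, p, r_real


if __name__ == "__main__":
    # Hypothetical motif lengths, for illustration only.
    r, p, r_real = fit_chain_by_moments([8, 10, 12])
    print(r, round(p, 3), round(r_real, 2))
```

Rounding r to an integer is a design choice of this sketch; a maximum likelihood fit, as mentioned in the abstract, would refine both parameters jointly.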
Abstract: MOTIVATION: Increasing quantity and quality of data in transcriptomics and interactomics create the need for integrative approaches to network analysis. Here, we present a comprehensive R-package for the analysis of biological networks including an exact and a heuristic approach to identify functional modules. RESULTS: The BioNet package provides an extensive framework for integrated network analysis in R. This includes the statistics for the integration of transcriptomic and functional data with biological networks, the scoring of nodes as well as methods for network search and visualization. AVAILABILITY: The BioNet package and a tutorial are available from http://bionet.bioapps.biozentrum.uni-wuerzburg.de.
Abstract: The internal transcribed spacer 2 (ITS2) is a widely used phylogenetic marker. In the past, it has mainly been used for species level classifications. Nowadays, a wider applicability becomes apparent. Here, the conserved structure of the RNA molecule plays a vital role. We have developed the ITS2 Database (http://its2.bioapps.biozentrum.uni-wuerzburg.de) which holds information about sequence, structure and taxonomic classification of all ITS2 in GenBank. In the new version, we use Hidden Markov models (HMMs) for the identification and delineation of the ITS2 resulting in a major redesign of the annotation pipeline. This allowed the identification of more than 160,000 correct full length and more than 50,000 partial structures. In the web interface, these can now be searched with a modified BLAST considering both sequence and structure, enabling rapid taxon sampling. Novel sequences can be annotated using the HMM based approach and modelled according to multiple template structures. Sequences can be searched for known and newly identified motifs. Together, the database and the web server build an exhaustive resource for ITS2 based phylogenetic analyses.
Abstract: The Enterobacteriaceae comprise a large number of clinically relevant species with several individual subspecies. Overlapping virulence-associated gene pools and the high overall genome plasticity often interfere with correct enterobacterial strain typing and risk assessment. Array technology offers a fast, reproducible and standardisable means for bacterial typing and thus provides many advantages for bacterial diagnostics, risk assessment and surveillance. The development of highly discriminative broad-range microbial diagnostic microarrays remains a challenge because of the marked genome plasticity of many bacterial pathogens.
Abstract: DNA microarray technology has already revolutionized basic research in infectious diseases, and whole-genome sequencing efforts have allowed for the fabrication of tailor-made spotted microarrays for an increasing number of bacterial pathogens. However, the application of microarrays in diagnostic microbiology is currently hampered by the high costs associated with microarray experiments and the specialized equipment needed. Here, we show that a thorough bioinformatic postprocessing of the microarray design to reduce the amount of unspecific noise also allows the reliable use of spotted gene expression microarrays for gene content analyses. We further demonstrate that the use of only single-color labeling to halve the costs for dye-labeled nucleotides results in only a moderate decrease in overall specificity and sensitivity. Therefore, gene expression microarrays using only single-color labeling can also reliably be used for gene content analyses, thus reducing the costs for potential routine applications such as genome-based pathogen detection or strain typing.
Abstract: The pennate planktonic diatom Pseudo-nitzschia delicatissima is very common in temperate marine waters and often responsible for blooms. Due to its surrounding rigid silicate frustule, the diatom undergoes successive size reduction as its vegetative reproduction cycle proceeds. The life cycle of diatoms has long attracted scientific interest, and some years ago extensive samples of Pseudo-nitzschia were taken from coastal waters. Mating and cell size reduction experiments were carried out and served as the data basis for a probabilistic model of cell size reduction. We applied a homogeneous non-stationary continuous-time Markov chain to model the development of individual diatoms from an initial size of about 80 µm until cell death, which occurred when the size reached its minimum of about 18 µm. In contrast to conventional curve fitting models, we are able to calculate confidence intervals for estimates of the population ages and to integrate the process of auxospore formation into the model. We thus propose a unique way to describe the stationary size distribution in a diatom population in terms of cell division and auxospore formation probabilities of its individuals.
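The idea of a continuous-time Markov chain over cell sizes can be sketched in a strongly simplified form. The sketch below is not the published model: it ignores auxospore formation, and the 2 µm size step, the rate function and all function names are hypothetical. It only shows how a chain that shrinks from about 80 µm to about 18 µm yields an expected time to the minimum size and simulated trajectories.

```python
import random


def size_classes(start_um=80, end_um=18, step_um=2):
    """Discrete size classes from the initial to the minimal cell size.

    The 2 µm step is a hypothetical discretisation, not a measured value.
    """
    return list(range(start_um, end_um - 1, -step_um))


def expected_time_to_minimum(classes, rate):
    """Expected time to reach the last class in a pure shrinking chain.

    Each class is left after an exponential waiting time with rate
    rate(size), so the expectation is the sum of the mean waiting times.
    """
    return sum(1.0 / rate(size) for size in classes[:-1])


def simulate_time_to_minimum(classes, rate, rng):
    """Draw one trajectory and return its total duration."""
    return sum(rng.expovariate(rate(size)) for size in classes[:-1])


if __name__ == "__main__":
    classes = size_classes()
    constant_rate = lambda size: 0.5  # hypothetical: 0.5 transitions per time unit
    print(expected_time_to_minimum(classes, constant_rate))
    print(simulate_time_to_minimum(classes, constant_rate, random.Random(1)))
```

Repeating the simulation many times would give an empirical distribution of cell ages, which is the kind of quantity for which confidence intervals can be derived.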
Abstract: The internal transcribed spacer 2 (ITS2) of the nuclear ribosomal repeat unit is one of the most commonly applied phylogenetic markers. It is a fast evolving locus, which makes it appropriate for studies at low taxonomic levels, whereas its secondary structure is well conserved, and tree reconstructions are possible at higher taxonomic levels. However, annotation of start and end positions of the ITS2 differs markedly between studies. This is a severe shortcoming, as prediction of a correct secondary structure by standard ab initio folding programs requires accurate identification of the marker in question. Furthermore, the correct structure is essential for multiple sequence alignments based on individual structural features. The present study describes a new tool for the delimitation and identification of the ITS2. It is based on hidden Markov models (HMMs) and verifies annotations by comparison to a conserved structural motif in the 5.8S/28S rRNA regions. Our method was able to identify and delimit the ITS2 in more than 30000 entries lacking start and end annotations in GenBank. Furthermore, 45000 ITS2 sequences with a questionable annotation were re-annotated. Approximately 30000 entries from the ITS2-DB, that uses a homology-based method for structure prediction, were re-annotated. We show that the method is able to correctly annotate an ITS2 as small as 58 nt from Giardia lamblia and an ITS2 as large as 1160 nt from humans. Thus, our method should be a valuable guide during the first and crucial step in any ITS2-based phylogenetic analysis: the delineation of the correct sequence. Sequences can be submitted to the following website for HMM-based ITS2 delineation: http://its2.bioapps.biozentrum.uni-wuerzburg.de.
Abstract: Multiple sequence alignments (MSAs) are one of the most important sources of information in sequence analysis. Many methods have been proposed to detect, extract and visualize their most significant properties. To the same extent that site-specific methods like sequence logos successfully visualize site conservations and sequence-based methods like clustering approaches detect relationships between sequences, both types of methods fail at revealing informational elements of MSAs at the level of sequence-site interactions, i.e. finding clusters of sequences and sites responsible for their clustering, which together account for a high fraction of the overall information of the MSA. To fill this gap, we present here a method that combines the Fisher score-based embedding of sequences from a profile hidden Markov model (pHMM) with correspondence analysis. This method is capable of detecting and visualizing group-specific or conflicting signals in an MSA and allows for a detailed explorative investigation of alignments of any size tractable by pHMMs. Applications of our methods are exemplified on an alignment of the Neisseria surface antigen LP2086, where it is used to detect sites of recombinatory horizontal gene transfer and on the vitamin K epoxide reductase family to distinguish between evolutionary and functional signals.
Abstract: BACKGROUND: Tardigrades represent an animal phylum with extraordinary resistance to environmental stress. RESULTS: To gain insights into their stress-specific adaptation potential, major clusters of related and similar proteins are identified, as well as specific functional clusters delineated comparing all tardigrades and individual species (Milnesium tardigradum, Hypsibius dujardini, Echiniscus testudo, Tulinus stephaniae, Richtersius coronifer), and functional elements in tardigrade mRNAs are analysed. We find that 39.3% of the total sequences clustered in 58 clusters of more than 20 proteins. Among these are ten tardigrade-specific as well as a number of stress-specific protein clusters. Tardigrade-specific functional adaptations include strong protein, DNA- and redox protection, maintenance and protein recycling. Specific regulatory elements, such as lox P DICE elements, regulate tardigrade mRNA stability, whereas 14 other RNA elements of higher eukaryotes are not found. Further features of tardigrade-specific adaptation are rapidly identified by sequence and/or pattern search on the web-tool tardigrade analyzer http://waterbear.bioapps.biozentrum.uni-wuerzburg.de. The work-bench offers nucleotide pattern analysis for promoter and regulatory element detection (tardigrade specific; nrdb) as well as rapid COG search for function assignments including species-specific repositories of all analysed data. CONCLUSION: Different protein clusters and regulatory elements implicated in tardigrade stress adaptations are analysed including unpublished tardigrade sequences.
Abstract: DNA microarrays are a popular technique for the detection of microorganisms. Several approaches using specific oligomers targeting one or a few marker genes for each species have been proposed. Data analysis is usually limited to call a species present when its oligomer exceeds a certain intensity threshold. While this strategy works reasonably well for distantly related species, it does not work well for very closely related species: Cross-hybridization of nontarget DNA prevents a simple identification based on signal intensity. The majority of species of the same genus has a sequence similarity of over 90%. For biodiversity studies down to the species level, it is therefore important to increase the detection power of closely related species. We propose a simple, cost-effective and robust approach for biodiversity studies using DNA microarray technology and demonstrate it on scenedesmacean green algae. The internal transcribed spacer 2 (ITS2) rDNA sequence was chosen as marker because it is suitable to distinguish all eukaryotic species even though parts of it are virtually identical in closely related species. We show that by modelling hybridization behaviour with a matrix algebra approach, we are able to identify closely related species that cannot be distinguished with a threshold on signal intensity. Thus this proof-of-concept study shows that by adding a simple and robust data analysis step to the evaluation of DNA microarrays, species detection can be significantly improved for closely related species with a high sequence similarity.
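Why a linear model can separate species that a plain intensity threshold cannot is easiest to see in a small example. The sketch below assumes a simple mixing model, signal = H x, where the hypothetical matrix H holds the hybridization strength of each species on each probe and x holds the unknown abundances. The abstract does not state the exact matrix method used, so this Gaussian elimination is only one possible instance, and all names and numbers are illustrative.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs stay untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("hybridization matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def call_species(hybridization, signals, threshold=0.5):
    """Estimate abundances first, then threshold the estimates."""
    abundances = solve_linear_system(hybridization, signals)
    return [a > threshold for a in abundances], abundances


if __name__ == "__main__":
    # Hypothetical values: two close relatives with 90% cross-hybridization.
    hybridization = [[1.0, 0.9], [0.9, 1.0]]
    signals = [1.0, 0.9]  # produced by species 1 alone
    print([s > 0.5 for s in signals])  # naive threshold on raw signals
    print(call_species(hybridization, signals)[0])
```

In this toy case the raw signals of both probes exceed the threshold, while the estimated abundances attribute the signal to the first species only.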
Abstract: Within Chlorophyceae the ITS2 secondary structure shows an unbranched helix I, except for the 'Hydrodictyon' and the 'Scenedesmus' clade having a ramified first helix. The latter two are classified within the Sphaeropleales, characterised by directly opposed basal bodies in their flagellar apparatuses (DO-group). Previous studies could not resolve the taxonomic position of the 'Sphaeroplea' clade within the Chlorophyceae without ambiguity and two pivotal questions remain open: (1) Is the DO-group monophyletic and (2) is a branched helix I an apomorphic feature of the DO-group? In the present study we analysed the secondary structure of three newly obtained ITS2 sequences classified within the 'Sphaeroplea' clade and resolved sphaeroplealean relationships by applying different phylogenetic approaches based on a combined sequence-structure alignment.
Abstract: MOTIVATION: The Profile Neighbor Joining (PNJ) algorithm as implemented in the software ProfDist is computationally efficient in reconstructing very large trees. Besides the huge amount of sequence data, the structure is important in RNA alignment analysis and phylogenetic reconstruction. RESULTS: For this, ProfDistS provides a phylogenetic workflow that uses individual RNA secondary structures in reconstructing phylogenies based on sequence-structure alignments, using PNJ with manual or iterative and automatic profile definition. Moreover, ProfDistS can also deal with protein sequences.
Abstract: Mobile phone technology makes use of radio frequency (RF) electromagnetic fields transmitted through a dense network of base stations in Europe. Possible harmful effects of RF fields on humans and animals are discussed, but their effect on plants has received little attention. In search of physiological processes of plant cells sensitive to RF fields, cell suspension cultures of Arabidopsis thaliana were exposed for 24 h to an RF field protocol representing typical microwave exposure in an urban environment. mRNA of exposed cultures and controls was used to hybridize Affymetrix-ATH1 whole genome microarrays. Differential expression analysis revealed significant changes in transcription of 10 genes, but they did not exceed a fold change of 2.5. Apart from three of them being dark-inducible, their functions do not point to any known responses of plants to environmental stimuli. The changes in transcription of these genes were compared with published microarray datasets and revealed a weak similarity of the microwave to light treatment experiments. Considering the large changes described in published experiments, it is questionable whether the small alterations caused by a 24 h continuous microwave exposure would have any impact on the growth and reproduction of whole plants.
Abstract: The function of a noncoding RNA sequence is mainly determined by its secondary structure and therefore a family of noncoding RNA sequences is much more conserved on the structural level than on the sequence level. Understanding the function of noncoding RNA sequence families requires two things: a hand-crafted or hand-improved alignment and detailed analyses of the secondary structures. There are several tools available that help performing these tasks, but all of them are specialized and focus on only one aspect, editing the alignment or plotting the secondary structure. The problem is both these tasks need to be performed simultaneously.
Abstract: MOTIVATION: With the exponential growth of expression and protein-protein interaction (PPI) data, the frontier of research in systems biology shifts more and more to the integrated analysis of these large datasets. Of particular interest is the identification of functional modules in PPI networks, sharing common cellular function beyond the scope of classical pathways, by means of detecting differentially expressed regions in PPI networks. This requires on the one hand an adequate scoring of the nodes in the network to be identified and on the other hand the availability of an effective algorithm to find the maximally scoring network regions. Various heuristic approaches have been proposed in the literature. RESULTS: Here we present the first exact solution for this problem, which is based on integer-linear programming and its connection to the well-known prize-collecting Steiner tree problem from Operations Research. Despite the NP-hardness of the underlying combinatorial problem, our method typically computes provably optimal subnetworks in large PPI networks in a few minutes. An essential ingredient of our approach is a scoring function defined on network nodes. We propose a new additive score with two desirable properties: (i) it is scalable by a statistically interpretable parameter and (ii) it allows a smooth integration of data from various sources. We apply our method to a well-established lymphoma microarray dataset in combination with associated survival data and the large interaction network of HPRD to identify functional modules by computing optimal-scoring subnetworks. In particular, we find a functional interaction module associated with proliferation over-expressed in the aggressive ABC subtype as well as modules derived from non-malignant by-stander cells. AVAILABILITY: Our software is available freely for non-commercial purposes at http://www.planet-lisa.net.
Abstract: Neisseria meningitidis is a leading cause of infectious childhood mortality worldwide. Most research efforts have hitherto focused on disease isolates belonging to only a few hypervirulent clonal lineages. However, up to 10% of the healthy human population is temporarily colonized by genetically diverse strains mostly with little or no pathogenic potential. Currently, little is known about the biology of carriage strains and their evolutionary relationship with disease isolates. The expression of a polysaccharide capsule is the only trait that has been convincingly linked to the pathogenic potential of N. meningitidis. To gain insight into the evolution of virulence traits in this species, whole-genome sequences of three meningococcal carriage isolates were obtained. Gene content comparisons with the available genome sequences from three disease isolates indicate that there is no core pathogenome in N. meningitidis. A comparison of the chromosome structure suggests that a filamentous prophage has mediated large chromosomal rearrangements and the translocation of some candidate virulence genes. Interspecific comparison of the available Neisseria genome sequences and dot blot hybridizations further indicate that the insertion sequence IS1655 is restricted only to N. meningitidis; its low sequence diversity is an indicator of an evolutionarily recent population bottleneck. A genome-based phylogenetic reconstruction provides evidence that N. meningitidis has emerged as an unencapsulated human commensal from a common ancestor with Neisseria gonorrhoeae and Neisseria lactamica and consecutively acquired the genes responsible for capsule synthesis via horizontal gene transfer.
Abstract: Over the past years, microarray databases have increased rapidly in size. While they offer a wealth of data, it remains challenging to integrate data arising from different studies. Here we propose an unsupervised approach of a large-scale meta-analysis on Arabidopsis thaliana whole genome expression datasets to gain additional insights into the function and regulation of genes. Applying kernel principal component analysis and hierarchical clustering, we found three major groups of experimental contrasts sharing a common biological trait. Genes associated with two of these clusters are known to play an important role in indole-3-acetic acid (IAA) mediated plant growth and development or pathogen defense. Novel functions could be assigned to genes including a cluster of serine/threonine kinases that carry two uncharacterized domains (DUF26) in their receptor part implicated in host defence. With the approach shown here, hidden interrelations between genes regulated under different conditions can be unraveled.
Abstract: An increasing number of phylogenetic analyses are based on the internal transcribed spacer 2 (ITS2). They mainly use the fast evolving sequence for low-level analyses. When considering the highly conserved structure, the same marker could also be used for higher level phylogenies. Furthermore, structural features of the ITS2 allow distinguishing different species from each other. Despite its importance, the correct structure is only rarely found by standard RNA folding algorithms. To overcome this hindrance for a wider application of the ITS2, we have developed a homology modelling approach to predict the structure of RNA and present the results of modelling the ITS2 in the ITS2 Database. Here, we describe the database and the underlying algorithms which allowed us to predict the structure for 86 784 sequences, which is more than 55% of all GenBank entries concerning the ITS2. These are not equally distributed over all genera. There is a substantial amount of genera where the structure of nearly all sequences is predicted whereas for others no structure at all was found despite high sequence coverage. These genera might have evolved an ITS2 structure diverging from the standard one. The current version of the ITS2 Database can be accessed via http://its2.bioapps.biozentrum.uni-wuerzburg.de.
Abstract: BACKGROUND: Mantle cell lymphoma (MCL) is an incurable B cell lymphoma and accounts for 6% of all non-Hodgkin's lymphomas. On the genetic level, MCL is characterized by the hallmark translocation t(11;14) that is present in most cases with few exceptions. Both gene expression and comparative genomic hybridization (CGH) data vary considerably between patients with implications for their prognosis. METHODS: We compare patients above and below the median survival. Exploratory principal component analysis of gene expression data showed that the second principal component correlates well with patient survival. Explorative analysis of CGH data shows the same correlation. RESULTS: On chromosomes 7 and 9, specific genes and bands are delineated which improve prognosis prediction independent of the previously described proliferation signature. We identify a compact survival predictor of seven genes for MCL patients. After extensive re-annotation using GEPAT, we established protein networks correlating with prognosis. Well known genes (CDC2, CCND1) and further proliferation markers (WEE1, CDC25, aurora kinases, BUB1, PCNA, E2F1) form a tight interaction network, but non-proliferative genes (SOCS1, TUBA1B, CEBPB) are also shown to be associated with prognosis. Furthermore, we show that aggressive MCL implicates a gene network shift to higher expressed genes in late cell cycle states and refine the set of non-proliferative genes implicated with bad prognosis in MCL. CONCLUSION: The results from explorative data analysis of gene expression and CGH data are complementary to each other. Including further tests such as the Wilcoxon rank test, we point both to proliferative and non-proliferative gene networks implicated in inferior prognosis of MCL and identify suitable markers both in gene expression and CGH data.
Abstract: It was shown that compensatory base changes (CBCs) in internal transcribed spacer 2 (ITS2) sequence-structure alignments can be used for distinguishing species. Using the ITS2 Database in combination with 4SALE (a tool for synchronous RNA sequence and secondary structure alignment and editing), in this study we present an in-depth CBC analysis for placozoan ITS2 sequences and their respective secondary structures. This analysis indicates at least two distinct species in Trichoplax (Placozoa), supporting a recently suggested hypothesis that Placozoa is "no longer a phylum of one".
Abstract: Aiming to find key genes and events, we analyze a large data set on diffuse large B-cell lymphoma (DLBCL) gene expression (248 patients, 12196 spots). Applying the loess normalization method to these raw data yields improved survival predictions, in particular for the clinically important group of patients with medium survival time. Furthermore, we identify a simplified prognosis predictor, which stratifies different risk groups about as well as complex signatures. We identify specific genes that distinguish activated B cell-like (ABC) from germinal center B cell-like (GCB) DLBCL. These include early (e.g. CDKN3) and late (e.g. CDKN2C) cell cycle genes. Independently of previous classification by marker genes, we confirm a clear binary class distinction between the ABC and GCB subgroups. An earlier suggested third entity is not supported. A key regulatory network, distinguishing marked over-expression in ABC from that in GCB, is built by: ASB13, BCL2, BCL6, BCL7A, CCND2, COL3A1, CTGF, FN1, FOXP1, IGHM, IRF4, LMO2, LRMP, MAPK10, MME, MYBL1, NEIL1 and SH3BP5. It predicts and supports the aggressive behaviour of the ABC subgroup. These results help to understand target interactions and to improve subgroup diagnosis, risk prognosis and therapy in the ABC and GCB DLBCL subgroups.
Abstract: Given two organisms, how can one distinguish whether they belong to the same species or not? This might be straightforward for two divergent organisms, but can be extremely difficult and laborious for closely related ones. A molecular marker giving a clear distinction would therefore be of immense benefit. The internal transcribed spacer 2 (ITS2) has been widely used for low-level phylogenetic analyses. Case studies revealed that a compensatory base change (CBC) in the helix II or helix III ITS2 secondary structure between two organisms correlated with sexual incompatibility. We analyzed more than 1300 closely related species to test whether this correlation is generally applicable. In 93%, where a CBC was found between organisms classified within the same genus, they belong to different species. Thus, a CBC in an ITS2 sequence-structure alignment is a sufficient condition to distinguish even closely related species.
Abstract: We reconstructed a robust phylogenetic tree of the Metazoa, consisting of almost 1,500 taxa, by profile neighbor joining (PNJ), an automated computational method that inherits the efficiency of the neighbor joining algorithm. This tree supports the one proposed in the latest review on metazoan phylogeny. Our main goal is not to discuss aspects of the phylogeny itself, but rather to point out that PNJ can be a valuable tool when the basal branching pattern of a large phylogenetic tree must be estimated, whereas traditional methods would be computationally impractical.
Abstract: The internal transcribed spacer 2 (ITS2) is a phylogenetic marker which has been of broad use in generic and infrageneric level classifications, as its sequence evolves comparably fast. Only recently, it became clear, that the ITS2 might be useful even for higher level systematic analyses. As the secondary structure is highly conserved within all eukaryotes it serves as a valuable template for the construction of highly reliable sequence-structure alignments, which build a fundament for subsequent analyses. Thus, any phylogenetic study using ITS2 has to consider both sequence and structure. We have integrated a homology based RNA structure prediction algorithm into a web server, which allows the detection and secondary structure prediction for ITS2 in any given sequence. Furthermore, the resource contains more than 25,000 pre-calculated secondary structures for the currently known ITS2 sequences. These can be taxonomically searched and browsed. Thus, our resource could become a starting point for ITS2-based phylogenetic analyses and is therefore complementary to databases of other phylogenetic markers, which focus on higher level analyses. The current version of the ITS2 database can be accessed via http://its2.bioapps.biozentrum.uni-wuerzburg.de.
Abstract: MOTIVATION: Due to the growing number of completely sequenced genomes, functional annotation of proteins becomes a more and more important issue. Here, we describe a method for the prediction of sites within protein domains, which are part of protein-ligand interactions. As recently demonstrated, these sites are not trivial to detect because of a varying degree of conservation of their location and type within a domain family. RESULTS: The developed method for the prediction of protein-ligand interaction sites is based on a newly defined interaction profile hidden Markov model (ipHMM) topology that takes structural and sequence data into account. It is based on a homology search via a posterior decoding algorithm that yields probabilities for interacting sequence positions and inherits the efficiency and the power of the profile hidden Markov model (pHMM) methodology. The algorithm enhances the quality of interaction site predictions and is a suitable tool for large scale studies, which was already demonstrated for pHMMs. AVAILABILITY: The MATLAB-files are available on request from the first author.
Abstract: Transformation of plant cells with T-DNA of virulent agrobacteria is one of the most extreme triggers of developmental changes in higher plants. For rapid growth and development of resulting tumors, specific changes in the gene expression profile and metabolic adaptations are required. Increased transport and metabolic fluxes are critical preconditions for growth and tumor development. A functional genomics approach, using the Affymetrix whole genome microarray (approximately 22,800 genes), was applied to measure changes in gene expression. The solute pattern of Arabidopsis thaliana tumors and uninfected plant tissues was compared with the respective gene expression profile. Increased levels of anions, sugars, and amino acids were correlated with changes in the gene expression of specific enzymes and solute transporters. The expression profile of genes pivotal for energy metabolism, such as those involved in photosynthesis, mitochondrial electron transport, and fermentation, suggested that tumors produce C and N compounds heterotrophically and gain energy mainly anaerobically. Thus, understanding of gene-to-metabolite networks in plant tumors promotes the identification of mechanisms that control tumor development.
Abstract: BACKGROUND: In sequence analysis the multiple alignment forms the foundation of all subsequent analyses. Errors in an alignment could strongly influence all succeeding analyses and therefore could lead to wrong predictions. Hand-crafted and hand-improved alignments are necessary and have become good common practice. For RNA sequences, often the primary sequence as well as a secondary structure consensus is well known, e.g., the cloverleaf structure of the tRNA. Recently, some alignment editors have been proposed that are able to include and model both kinds of information. However, with the advent of a large amount of reliable RNA sequences together with their solved secondary structures (available from e.g. the ITS2 Database), we are faced with the problem of handling sequences and their associated secondary structures synchronously. RESULTS: 4SALE fills this gap. The application allows a fast sequence and synchronous secondary structure alignment for large data sets and, for the first time, synchronous manual editing of aligned sequences and their secondary structures. This study describes an algorithm for the synchronous alignment of sequences and their associated secondary structures as well as the main features of 4SALE used for further analyses and editing. 4SALE builds an optimal and unique starting point for every RNA sequence and structure analysis. CONCLUSION: 4SALE, which provides a user-friendly and intuitive interface, is a comprehensive toolbox for RNA analysis based on sequence and secondary structure information. The program connects sequence and structure databases like the ITS2 Database to phylogeny programs such as the CBCAnalyzer. 4SALE is written in JAVA and therefore platform independent. The software is freely available and distributed from the website at http://4sale.bioapps.biozentrum.uni-wuerzburg.de.
Abstract: The CBCAnalyzer (CBC=compensatory base change) is a custom written software toolbox consisting of three parts, CTTransform, CBCDetect, and CBCTree. CTTransform reads several ct-file formats, and generates a so called "bracket-dot-bracket" format that typically is used as input for other tools such as RNAforester, RNAmovie or MARNA. The latter one creates a multiple alignment based on primary sequences and secondary structures that now can be used as input for CBCDetect. CBCDetect counts CBCs in all against all of the aligned sequences. This is important in detecting species that are discriminated by their sexual incompatibility. The count (distance) matrix obtained by CBCDetect is used as input for CBCTree that reconstructs a phylogram by using the algorithm of BIONJ. In this note we describe the features of the toolbox as well as application examples. The toolbox provides a graphical user interface. It is written in C++ and freely available at: http://cbcanalyzer.bioapps.biozentrum.uni-wuerzburg.de.
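The all-against-all CBC count described above can be illustrated with a minimal sketch. It is not the CBCDetect implementation: the function names, the restriction to canonical and G-U wobble pairs, and the rule that a base pair must sit at the same aligned positions in both structures are assumptions made for illustration. Input is a pair of aligned RNA sequences, each with its structure in bracket-dot-bracket notation.

```python
# Hypothetical sketch of CBC counting; not the CBCAnalyzer source code.
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_positions(structure):
    """Map each opening position to its closing position in a dot-bracket string."""
    stack, pairs = [], {}
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs[stack.pop()] = idx
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def count_cbcs(seq_a, struct_a, seq_b, struct_b):
    """Count pairs present in both structures where both bases differ but pairing is kept."""
    pairs_a = paired_positions(struct_a)
    pairs_b = paired_positions(struct_b)
    count = 0
    for i, j in pairs_a.items():
        if pairs_b.get(i) != j:
            continue  # the pair is not shared between the two aligned structures
        pair_a = (seq_a[i], seq_a[j])
        pair_b = (seq_b[i], seq_b[j])
        if pair_a not in VALID_PAIRS or pair_b not in VALID_PAIRS:
            continue  # gaps or non-pairing bases are ignored
        if pair_a[0] != pair_b[0] and pair_a[1] != pair_b[1]:
            count += 1  # both sides changed: compensatory base change
    return count


def cbc_matrix(entries):
    """Build the symmetric CBC count matrix for a list of (sequence, structure) tuples."""
    n = len(entries)
    matrix = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            matrix[a][b] = matrix[b][a] = count_cbcs(*entries[a], *entries[b])
    return matrix
```

A change on only one side of a pair (for example G-C to G-U, a hemi-CBC) is deliberately not counted here. The resulting matrix plays the role of the count (distance) matrix that a tree-building step would consume.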
Abstract: Structural genomics meets phylogenetics and vice versa: Knowing rRNA secondary structures is a prerequisite for constructing rRNA alignments for inferring phylogenies, and inferring phylogenies is a precondition to understand the evolution of such rRNA secondary structures. Here, both scientific worlds go together. The rRNA internal transcribed spacer 2 (ITS2) region is a widely used phylogenetic marker. Because of its high variability at the sequence level, correct alignments have to take into account structural information. In this study, we examine the extent of the conservation in structure. We present (1) the homology modeled secondary structure of more than 20,000 ITS2 covering about 14,000 species; (2) a computational approach for homology modeling of rRNA structures, which additionally can be applied to other RNA families; and (3) a database providing about 25,000 ITS2 sequences with their associated secondary structures, a refined ITS2 specific general time reversible (GTR) substitution model, and a scoring matrix, available at http://its2.bioapps.biozentrum.uni-wuerzburg.de.
Abstract: The ongoing characterization of novel species creates the need for a molecular marker which can be used for species- and, simultaneously, for mega-systematics. Recently, the use of the internal transcribed spacer 2 (ITS2) sequence was suggested, as it shows a high divergence in sequence with an assumed conservation in structure. This hypothesis was mainly based on small-scale analyses, comparing a limited number of sequences. Here, we report a large-scale analysis of more than 54,000 currently known ITS2 sequences with the goal to evaluate the hypothesis of a conserved structural core and to assess its use for automated large-scale phylogenetics. Structure prediction revealed that the previously described core structure can be found for more than 5000 sequences in a wide variety of taxa within the eukaryotes, indicating that the core secondary structure is indeed conserved. This conserved structure allowed an automated alignment of extremely divergent sequences as exemplified for the ITS2 sequences of a ctenophorean eumetazoon and a volvocalean green alga. All classified sequences, together with their structures can be accessed at http://www.biozentrum.uni-wuerzburg.de/bioinformatik/projects/ITS2.html. Furthermore, we found that, although sample sequences are known for most major taxa, there exists a profound divergence in coverage, which might become a hindrance for general usage. In summary, our analysis strengthens the potential of ITS2 as a general phylogenetic marker and provides a data source for further ITS2-based analyses.
Abstract: Signaling pathways based on the reversible phosphorylation of proteins control most aspects of cellular life in higher organisms. Extracellular stimuli can induce growth, differentiation, survival and the stress response through a number of highly conserved signaling pathways. We discuss how the intensity and duration of signals may have dramatic consequences on the way cells respond to stimuli. Picking the central Ras-Raf-MEK-ERK signal cascade, we developed a mathematical model of how stimuli induce different signal patterns and thereby different cellular responses, depending on cell type and the ratio between B-Raf and C-Raf. Based on biochemical data for activation and dephosphorylation, as well as the differential equations of our model, we suggest a different signaling pattern and response result for B-Raf (strong activation, sustained signal) and C-Raf (steep activation, transient signal). We further support the significance of such differential modulatory signaling by showing different Raf isoform expression in various cell lines and experimental testing of the predicted kinase activities in B-Raf, C-Raf and mutated versions.
Abstract: SUMMARY: ProfDist is a user-friendly software package using the profile-neighbor-joining method (PNJ) in inferring phylogenies based on profile distances on DNA or RNA sequences. It is a tool for reconstructing and visualizing large phylogenetic trees providing new and standard features with a special focus on time efficiency, robustness and accuracy. AVAILABILITY: A Windows version of ProfDist comes with a graphical user interface and is freely available at http://profdist.bioapps.biozentrum.uni-wuerzburg.de
Abstract: In the presented work we search for transcription factor binding sites (BS) by including additional information about typical BS patterns. The new proposed score combines the ordinary profile score based on TRANSFAC-matrices together with a score based on pairs of BS. The latter score positively weights pairs of BS that tend to occur together in many regulatory DNA-sequences, in contrast to a random background model. The empirical BS pair frequencies result from our evaluation of a large dataset of orthologous genes.
Abstract: The phylogenetic position of the Mollicutes has been re-examined by using phosphoglycerate kinase (Pgk) amino acid sequences. Hitherto unpublished sequences from Mycoplasma mycoides subsp. mycoides, Mycoplasma hyopneumoniae and Spiroplasma citri were included in the analysis. Phylogenetic trees based on Pgk data indicated a monophyletic origin for the Mollicutes within the Firmicutes, whereas Bacilli (Firmicutes) and Clostridia (Firmicutes) appeared to be paraphyletic. With two exceptions, i.e. Thermotoga (Thermotogae) and Fusobacterium (Fusobacteria), which clustered within the Firmicutes, comparative analyses show that at a low taxonomic level, the resolved phylogenetic relationships that were inferred from both the Pgk protein and 16S rRNA gene sequence data are congruent.
Abstract: BACKGROUND: In phylogenetic analysis we face the problem that several subclade topologies are known or easily inferred and well supported by bootstrap analysis, but basal branching patterns cannot be unambiguously estimated by the usual methods (maximum parsimony (MP), neighbor-joining (NJ), or maximum likelihood (ML)), nor are they well supported. We represent each subclade by a sequence profile and estimate evolutionary distances between profiles to obtain a matrix of distances between subclades. RESULTS: Our estimator of profile distances generalizes the maximum likelihood estimator of sequence distances. The basal branching pattern can be estimated by any distance-based method, such as neighbor-joining. Our method (profile neighbor-joining, PNJ) then inherits the accuracy and robustness of profiles and the time efficiency of neighbor-joining. CONCLUSIONS: Phylogenetic analysis of Chlorophyceae with traditional methods (MP, NJ, ML and MrBayes) reveals seven well supported subclades, but the methods disagree on the basal branching pattern. The tree reconstructed by our method is better supported and can be confirmed by known morphological characters. Moreover the accuracy is significantly improved as shown by parametric bootstrap.
Abstract: SUMMARY: The Helmholtz Network for Bioinformatics (HNB) is a joint venture of eleven German bioinformatics research groups that offers convenient access to numerous bioinformatics resources through a single web portal. The 'Guided Solution Finder' which is available through the HNB portal helps users to locate the appropriate resources to answer their queries by employing a detailed, tree-like questionnaire. Furthermore, automated complex tool cascades ('tasks'), involving resources located on different servers, have been implemented, allowing users to perform comprehensive data analyses without the requirement of further manual intervention for data transfer and re-formatting. Currently, automated cascades for the analysis of regulatory DNA segments as well as for the prediction of protein functional properties are provided. AVAILABILITY: The HNB portal is available at http://www.hnbioinfo.de
Abstract: Transcription factor binding site (TFBS) detection plays an important role in computational biology, with applications in gene finding and gene regulation. The sites are often modeled by gapless profiles, also known as position-weight matrices. Past research has focused on the significance of profile scores (the ability to avoid false positives), but this alone is not enough: The profile must also possess the power to detect the true positive signals. Several completed genomes are now available, and the search for TFBSs is moving to a large scale; so discriminating signal from noise becomes even more challenging. Since TFBS profiles are usually estimated from only a few experimentally confirmed instances, careful regularization is an important issue. We present a novel method that is well suited for this situation. We further develop measures that help in judging profile quality, based on both sensitivity and selectivity of a profile. It is shown that these quality measures can be efficiently computed, and we propose statistically well-founded methods to choose score thresholds. Our findings are applied to the TRANSFAC database of transcription factor binding sites. The results are disturbing: If we insist on a significance level of 5% in sequences of length 500, only 19% of the profiles detect a true signal instance with 95% success probability under varying background sequence compositions.
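To make the notion of a gapless profile concrete, the following minimal sketch builds a log-odds position-weight matrix from a handful of binding site instances and scans a sequence with a score threshold. The regularization shown is a plain background-weighted pseudocount, which is a common textbook choice and not the regularization method proposed in the paper. The example sites, the uniform background, and the threshold value are assumptions chosen for illustration only.

```python
import math

ALPHABET = "ACGT"


def build_profile(sites, background=None, pseudocount=1.0):
    """Return a list of per-position log-odds dictionaries (natural log)."""
    if background is None:
        background = {base: 0.25 for base in ALPHABET}  # assumed uniform background
    profile = []
    for pos in range(len(sites[0])):
        # Start from pseudocounts distributed according to the background.
        counts = {base: pseudocount * background[base] for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1.0
        total = len(sites) + pseudocount
        profile.append(
            {base: math.log(counts[base] / total / background[base]) for base in ALPHABET}
        )
    return profile


def scan(profile, sequence, threshold):
    """Return (start, score) for every window whose profile score reaches the threshold."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(profile, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Illustrative data: four made-up instances of a six-base site.
example_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
example_profile = build_profile(example_sites)
example_hits = scan(example_profile, "GGGTATAATGGG", threshold=5.0)
```

With so few instances, the pseudocount strongly affects the scores of unobserved bases, which is why the choice of regularization and of the threshold matters for both sensitivity and selectivity.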
Abstract: Several publications have focused on fitting a specific distribution to overall microarray data. Due to a number of biological features the distribution of overall spot intensities can take various shapes. It appears to be impossible to find a specific distribution fitting all experiments even if they are carried out perfectly. Therefore, a probabilistic representation that models a mixture of various effects would be suitable. We use a Gaussian mixture model to represent signal intensity profiles. The advantage of this approach is the derivation of a probabilistic criterion for expressed and non-expressed genes. Furthermore our approach does not involve any prior decision on the number of model parameters. We properly fit microarray data of various shapes by a mixture of Gaussians using the EM algorithm and determine the complexity of the mixture model by the Bayesian Information Criterion (BIC). Finally, we apply our method to simulated data and to biological data.
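The combination of EM fitting and BIC-based model selection can be sketched for one-dimensional data as follows. This is a simplified illustration, not the authors' code: the quantile-based initialization, the fixed number of iterations, the variance floor, and all function names are assumptions. The BIC is taken as -2 log L + p ln n with p = 3k - 1 free parameters for k components.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def fit_gmm(data, k, n_iter=200, var_floor=1e-3):
    """Fit a k-component 1D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(data)
    ordered = sorted(data)
    # Deterministic start (an assumption): means at evenly spaced quantiles.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [max(overall_var, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances


def bic_scores(data, max_k):
    """Return {k: BIC} for k = 1..max_k; the smallest value marks the preferred model."""
    scores = {}
    for k in range(1, max_k + 1):
        params = fit_gmm(data, k)
        n_params = 3 * k - 1
        scores[k] = -2.0 * log_likelihood(data, *params) + n_params * math.log(len(data))
    return scores
```

Given fitted parameters, the responsibilities from the E-step provide the kind of probabilistic criterion mentioned in the abstract: each intensity receives a posterior probability of belonging to, say, a low-intensity or a high-intensity component.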
Abstract: Evolution of proteins is generally modeled as a Markov process acting on each site of the sequence. Replacement frequencies need to be estimated based on sequence alignments. Here we compare three approaches: First, the original method by Dayhoff, Schwartz, and Orcutt (1978) Atlas Protein Seq. Struc. 5:345-352, secondly, the resolvent method (RV) by Müller and Vingron (2000) J. Comput. Biol. 7(6):761-776, and finally a maximum likelihood approach (ML) developed in this paper. We evaluate the methods using a highly divergent and inhomogeneous set of sequence alignments as an input to the estimation procedure. ML is the method of choice for small sets of input data. Although the RV method is computationally much less demanding it performs only slightly worse than ML. Therefore, it is perfectly appropriate for large-scale applications.
Abstract: Given a transmembrane protein, we wish to find related ones by a database search. Due to the strongly hydrophobic amino acid composition of transmembrane domains, suboptimal results are obtained when general-purpose scoring matrices such as BLOSUM are used. Recently, a transmembrane-specific score matrix called PHAT was shown to perform much better than BLOSUM. In this article, we derive a transmembrane score matrix family, called SLIM, which has several distinguishing features. In contrast to currently used matrices, SLIM is non-symmetric. The asymmetry arises because different background compositions are assumed for the transmembrane query and the unknown database sequences. We describe the mathematical model behind SLIM in detail and show that SLIM outperforms PHAT both on simulated data and in a realistic setting. Since non-symmetric score matrices are a new concept in database search methods, we discuss some important theoretical and practical issues.
Abstract: The estimation of amino acid replacement frequencies during molecular evolution is crucial for many applications in sequence analysis. Score matrices for database search programs or phylogenetic analysis rely on such models of protein evolution. Pioneering work was done by Dayhoff et al. (1978) who formulated a Markov model of evolution and derived the famous PAM score matrices. Her estimation procedure for amino acid exchange frequencies is restricted to pairs of proteins that have a constant and small degree of divergence. Here we present an improved estimator, called the resolvent method, that is not subject to these limitations. This extension of Dayhoff's approach enables us to estimate an amino acid substitution model from alignments of varying degree of divergence. Extensive simulations show the capability of the new estimator to recover accurately the exchange frequencies among amino acids. Based on the SYSTERS database of aligned protein families (Krause and Vingron, 1998) we recompute a series of score matrices.
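The Markov model behind the PAM matrices can be illustrated with a toy example: a one-step transition matrix is raised to the n-th power to describe larger evolutionary distances, and log-odds scores are derived from the result. The sketch below uses a made-up three-letter alphabet with a symmetric matrix (so the stationary distribution is uniform) instead of the 20 amino acids. It shows Dayhoff-style extrapolation only, not the resolvent estimator. All numbers and function names are assumptions.

```python
import math


def mat_mul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Multiply the transition matrix with itself n times (n >= 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def log_odds(transition, stationary):
    """Score s_ij = 10 * log10(P_ij / pi_j), the usual log-odds form of a score matrix."""
    return [[10.0 * math.log10(transition[i][j] / stationary[j])
             for j in range(len(stationary))]
            for i in range(len(stationary))]


# Toy one-step matrix on three letters; rows sum to 1 (illustrative values).
P1 = [
    [0.980, 0.015, 0.005],
    [0.015, 0.980, 0.005],
    [0.005, 0.005, 0.990],
]
PI = [1.0 / 3.0] * 3  # stationary distribution of the symmetric toy matrix

P250 = mat_pow(P1, 250)
scores_1 = log_odds(P1, PI)
scores_250 = log_odds(P250, PI)
```

The diagonal scores shrink as the distance grows, which reflects why exchange frequencies observed in alignments depend on the degree of divergence and why an estimator has to account for it.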
Abstract: The goal of phylogenetics is to reconstruct ancestral relationships between different taxa, e.g., different species in the tree of life, by means of certain characters, such as genomic sequences. We consider the prominent problem of reconstructing the basal phylogenetic tree topology when several subclades have already been identified or are well known by other means, such as morphological characteristics. Whereas most available tools attempt to estimate a fully resolved tree from scratch, the profile neighbor-joining (PNJ) method focuses directly on the mentioned problem and has proven a robust and efficient method for large-scale data sets, especially when used in an iterative way. We describe an implementation of this idea, the ProfDist software package, which is freely available, and apply the method to estimate the phylogeny of the eukaryotes. Overall, the PNJ approach provides a novel effective way to mine large sequence datasets for relevant phylogenetic information.
Abstract: In a simple model, protein evolution can be interpreted as an accumulation of amino acid mutations over time. Amino acids with similar chemical or physical properties are exchanged for one another more frequently than dissimilar ones. In the Dayhoff model of protein evolution, the similarity of two amino acids is therefore measured by their exchange frequency. These frequencies, however, depend on the evolutionary distance between the homologous sequences under consideration. This time dependence is modeled by a Markov process acting at each position of the protein. From a mathematical point of view, the development of adequate estimators for the Markov chain parameters is a main concern of this thesis. In the established estimators, the evolutionary distance is usually not modeled at all or only insufficiently. In this thesis, two new estimators are developed that are based on the Dayhoff model of protein evolution but take the time parameter into account. This leads to two strongly improved estimators, which were validated in extensive simulations. One is the maximum likelihood estimator, which is the estimator of choice for a relatively small data basis. The other is the resolvent method, which can handle large amounts of data very efficiently. The estimated parameters of the Markov chain are fundamental for the computation of phylogenetic trees, for sequence database searches, and for the computation of sequence alignments. With the improved modeling and estimation of these parameters, we expect improved results from the sequence analysis programs that rely on them.