Bioinformatics for High-Throughput Technologies at the Chair of Algorithm Engineering (Ls11), Faculty of Computer Science, TU Dortmund, D-44221 Dortmund, Germany
Abstract: BACKGROUND: Transcriptional regulation of gene activity is essential for any living organism. To exert this control, transcription factors recognize specific binding sites within the DNA and thereby regulate the expression of particular target genes. The genome-scale reconstruction of the emerging regulatory networks is important for biotechnology and human medicine, but cost-intensive, time-consuming, and infeasible to perform for every species separately. Using bioinformatics methods, one can partially transfer networks from well-studied model organisms to closely related species. However, the prediction quality is limited by the low level of evolutionary conservation of transcription factor binding sites, even within organisms of the same genus. RESULTS: Here we present an integrated bioinformatics workflow that ensures the reliability of transferred gene regulatory networks. Our approach combines three methods that can be applied on a large scale: re-assessment of annotated binding sites, subsequent binding site prediction, and homology detection. A gene regulatory interaction is considered conserved if (1) the transcription factor, (2) the adjusted binding site, and (3) the target gene are conserved. The power of the approach is demonstrated by transferring gene regulations from the model organism Corynebacterium glutamicum to the human pathogens C. diphtheriae and C. jeikeium, and to the biotechnologically relevant C. efficiens. For these three organisms we identified reliable transcriptional regulations for 40% of the common transcription factors, compared with the 5% for which knowledge was previously available. CONCLUSION: Our results suggest that trustworthy genome-scale transfer of gene regulatory networks between organisms is feasible in general, but still limited by the level of evolutionary conservation.
Abstract: DNA microarrays are a popular technique for the detection of microorganisms. Several approaches using specific oligomers targeting one or a few marker genes per species have been proposed. Data analysis is usually limited to calling a species present when its oligomer signal exceeds a certain intensity threshold. While this strategy works reasonably well for distantly related species, it does not work well for very closely related species: cross-hybridization of non-target DNA prevents a simple identification based on signal intensity. The majority of species within a genus share a sequence similarity of over 90%. For biodiversity studies down to the species level, it is therefore important to increase the detection power for closely related species. We propose a simple, cost-effective, and robust approach for biodiversity studies using DNA microarray technology and demonstrate it on scenedesmacean green algae. The internal transcribed spacer 2 (ITS2) rDNA sequence was chosen as marker because it is suitable for distinguishing all eukaryotic species, even though parts of it are virtually identical in closely related species. We show that by modelling hybridization behaviour with a matrix algebra approach, we are able to identify closely related species that cannot be distinguished by a threshold on signal intensity. This proof-of-concept study thus shows that adding a simple and robust data analysis step to the evaluation of DNA microarrays can significantly improve species detection for closely related species with high sequence similarity.
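The "matrix algebra approach" mentioned above can be sketched as linear unmixing: observed probe intensities are modelled as y = H·c, where H[i][j] is the hybridization affinity of probe i to species j and c holds the unknown species abundances. The following minimal sketch (all numbers are invented illustration values, not data from the study) solves the least-squares problem via the normal equations:

```python
# Linear unmixing of cross-hybridizing probe signals (toy example).

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def unmix(H, y):
    """Least-squares species abundances: solve the normal equations H^T H c = H^T y."""
    n = len(H[0])
    HtH = [[sum(H[i][a] * H[i][b] for i in range(len(H))) for b in range(n)] for a in range(n)]
    Hty = [sum(H[i][a] * y[i] for i in range(len(H))) for a in range(n)]
    return solve(HtH, Hty)

# Two probes cross-hybridize strongly (two species with ~90% similarity):
H = [[1.0, 0.9],   # probe for species A also picks up species B
     [0.9, 1.0]]   # and vice versa
y = [1.9, 1.9]     # both probes bright -> a threshold alone cannot decide
print(unmix(H, y)) # unmixing attributes the signal to both species (~[1.0, 1.0])
```

With thresholding alone, both bright probes would call both species with equal (in)confidence; the unmixed abundances separate the contributions even under heavy cross-hybridization.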
Abstract: To handle changing environmental surroundings and to manage unfavorable conditions, microbial organisms have evolved complex transcriptional regulatory networks. To analyze these gene regulatory networks comprehensively, several online databases and analysis platforms have been implemented and established. In this article, we address the typical cycle of scientific knowledge exploration and integration in the area of prokaryotic transcriptional gene regulation. We briefly review five popular, publicly available systems that support (i) the integration of existing knowledge, (ii) visualization capabilities, and (iii) computational analyses to predict promising wet-lab targets. We illustrate the benefits of such integrated data analysis platforms by means of four application cases performed with the corynebacterial reference database CoryneRegNet.
Abstract: MOTIVATION: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but it has still not been solved satisfactorily in an exact and efficient manner. This is partly due to the large number of possibilities for defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee finding the best motif. RESULTS: We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called the probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first based on the total number of motif occurrences, the second based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing that our method outperforms other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs for Mycobacterium tuberculosis. AVAILABILITY AND IMPLEMENTATION: The method has been implemented in Java.
It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/.
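The compound Poisson idea above can be sketched compactly: occurrences of a self-overlapping motif arrive in "clumps" whose starts are approximately Poisson, and each clump contributes a random number of overlapping occurrences. The sketch below assumes a geometric clump-size distribution with overlap parameter a (the paper instead computes the exact clump-size distribution with a probabilistic arithmetic automaton) and uses the Panjer recursion for the compound Poisson probability mass function:

```python
import math

def compound_poisson_pmf(lam, a, n_max):
    """P(N = n) for N = sum of clump sizes, clump starts ~ Poisson(lam),
    clump size ~ Geometric: q_k = (1-a) * a**(k-1). Panjer recursion."""
    q = [0.0] + [(1 - a) * a ** (k - 1) for k in range(1, n_max + 1)]
    p = [math.exp(-lam)]
    for n in range(1, n_max + 1):
        p.append(lam / n * sum(k * q[k] * p[n - k] for k in range(1, n + 1)))
    return p

def pvalue(observed, lam, a):
    """P(N >= observed) under the compound Poisson null model."""
    pmf = compound_poisson_pmf(lam, a, observed - 1)
    return 1.0 - sum(pmf)

# With a = 0 (no self-overlap) this reduces to a plain Poisson tail:
print(pvalue(5, lam=1.0, a=0.0))
```

The monotonicity exploited by the discovery algorithm is visible here: increasing the expected clump count lam can only increase the tail probability, which allows pruning the IUPAC pattern search space.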
Abstract: Methylation of CpG islands (CGIs) plays an important role in gene silencing. For genome-wide methylation analysis of CGIs in female white blood cells and in sperm, we used four restriction enzymes and a size selection step to prepare DNA libraries enriched with CGIs. The DNA libraries were treated with sodium bisulfite and subjected to a modified 454/Roche Genome Sequencer protocol. We obtained 163 034 and 129 620 reads from blood and sperm, respectively, with an average read length of 133 bp. Bioinformatic analysis revealed that 12 358 (7.6%) blood library reads and 10 216 (7.9%) sperm library reads map to 6167 and 5796 different CGIs, respectively. In blood and sperm DNA, we identified 824 (13.7%) and 482 (8.5%) fully methylated autosomal CGIs, respectively. Differential methylation, which is characterized by the presence of methylated and unmethylated reads of the same CGI, was observed in 53 and 52 autosomal CGIs in blood and sperm DNA, respectively. Remarkably, methylation of X-chromosomal CGIs in female blood cells was most often incomplete (25-75%). Such incomplete methylation was mainly found on the X-chromosome, suggesting that it is linked to X-chromosome inactivation.
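The per-read methylation calls described above rest on a simple principle: bisulfite converts unmethylated cytosines to uracil (sequenced as T), while methylated CpG cytosines stay C. A minimal sketch, assuming a read already aligned to its reference segment on the forward strand (the example sequences are invented):

```python
def cpg_methylation(reference, read):
    """Return (methylated, unmethylated) CpG counts for an aligned bisulfite read."""
    meth = unmeth = 0
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":      # CpG site in the reference
            if read[i] == "C":              # cytosine retained -> methylated
                meth += 1
            elif read[i] == "T":            # converted -> unmethylated
                unmeth += 1
    return meth, unmeth

ref  = "ACGTTCGGACGA"
read = "ATGTTCGGATGA"   # first and third CpG converted, second retained
print(cpg_methylation(ref, read))  # -> (1, 2)
```

Aggregating such counts per CGI over all mapped reads yields the fully methylated, unmethylated, and differentially methylated categories reported in the abstract.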
Abstract: MOTIVATION: Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires comparison to known PFMs to avoid redundancies. In general, two PFMs are similar if their hits occur at overlapping positions under a null model. Still, most existing methods compute similarity according to probabilistic distances between the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the numbers of PFM hits, incorporating both strands. Furthermore, we introduce a second measure based on the same idea and use it to cluster the Jaspar PFMs. RESULTS: We show that the asymptotic covariance can be efficiently computed by a two-dimensional convolution of the score distributions. The asymptotic covariance approach shows strong correlation with simulated data and outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class, and a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons. AVAILABILITY: A website to compute the similarity and to perform clustering, the source code, and Supplementary Material are available at http://mosta.molgen.mpg.de.
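The covariance computation above builds on score distributions of a PFM under a background model. A standard building block, sketched here under an i.i.d. uniform background (the PFM below is invented), is the exact distribution of the integer-scaled log-odds score, obtained by convolving the per-column score distributions:

```python
from collections import Counter
import math

def score_distribution(pfm, background):
    """Exact distribution of the (rounded) log-odds score under an i.i.d. background."""
    dist = Counter({0: 1.0})  # score 0 with probability 1 before any column
    for column in pfm:        # column: dict base -> frequency
        new = Counter()
        for base, bg_p in background.items():
            s = round(10 * math.log2(column[base] / bg_p))  # scaled integer score
            for score, p in dist.items():
                new[score + s] += p * bg_p                  # convolution step
        dist = new
    return dist

background = {b: 0.25 for b in "ACGT"}
pfm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
dist = score_distribution(pfm, background)
# Probabilities sum to one; P(score >= t) is the hit probability for threshold t.
print(sum(dist.values()))
```

The paper's two-dimensional convolution generalizes this idea to the joint score distribution of two PFMs at overlapping positions, which yields the hit-count covariance.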
Abstract: The microarray layout problem is a generalization of the border length minimization problem, and asks to distribute oligonucleotide probes on a microarray and to determine their embeddings in the deposition sequence in such a way that the overall quality of the resulting synthesized probes is maximized. Because of its inherent computational complexity, it is traditionally attacked in several phases: partitioning, placement, and re-embedding. We present the first algorithm, Greedy+, that combines placement and embedding and that results in improved layouts in terms of border length and conflict index (a more realistic measure of probe quality), both on arrays of random probes and on existing Affymetrix GeneChip® arrays. We also present a detailed study on the layouts of the latest GeneChip arrays, and show how Greedy+ can further improve layout quality by as much as 12% in terms of border length and 35% in terms of conflict index.
Abstract: Transcription factors play a key role in gene regulation by interacting with specific binding sites or motifs. Detecting the enrichment of binding motifs is therefore important for genome annotation, and efficient computation of the statistical significance (p-value) of motif enrichment is crucial. We propose an efficient approximation to compute this significance. By incorporating both strands of the DNA molecule and explicitly modeling dependencies between overlapping hits, we achieve accurate results for any DNA motif based on its position frequency matrix (PFM) representation. The accuracy of the p-value approximation is demonstrated by comparison with the simulated count distribution. Furthermore, we compare the approach with a binomial approximation, a (compound) Poisson approximation, and a normal approximation. In general, our approach outperforms these approximations or is equally accurate but significantly faster. An implementation of our approach is available at http://mosta.molgen.mpg.de.
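The binomial baseline that the abstract compares against can be sketched directly: treat each of n scanned positions (both strands) as an independent Bernoulli trial with the motif's hit probability p, and take the upper tail as the enrichment p-value. The numbers below are illustrative; the abstract's own method additionally models dependencies between overlapping hits, which the binomial model ignores.

```python
from math import comb

def binomial_pvalue(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): upper-tail enrichment p-value."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 8 hits in 1000 scanned positions, hit probability 0.002 per position:
print(binomial_pvalue(8, 1000, 0.002))
```

For a self-overlapping motif, hits cluster in clumps, so this independence assumption underestimates the variance of the count; that is the gap the dependency-aware approximation closes.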
Abstract: We reconstructed a robust phylogenetic tree of the Metazoa, consisting of almost 1,500 taxa, by profile neighbor joining (PNJ), an automated computational method that inherits the efficiency of the neighbor-joining algorithm. This tree supports the one proposed in the latest review on metazoan phylogeny. Our main goal is not to discuss aspects of the phylogeny itself, but rather to point out that PNJ can be a valuable tool when the basal branching pattern of a large phylogenetic tree must be estimated, whereas traditional methods would be computationally impractical.
Abstract: IMS2 is an Integrated Medical Software system for the analysis of Ion Mobility Spectrometry (IMS) data. It assists medical staff with the following IMS data processing steps: acquisition, visualization, classification, and annotation. IMS2 provides data analysis and interpretation features on the one hand, and on the other helps to improve classification by increasing the number of pre-classified datasets. It is designed to facilitate early detection of lung cancer, one of the most common cancer types, with one million deaths worldwide each year. After reviewing the IMS technology, we first describe the software architecture of IMS2 and then the integrated classification module, including the necessary pre-processing steps and different classification methods. The Lung Hospital Hemer (Germany) provided IMS data from 35 patients suffering from lung cancer and 72 samples from healthy persons. IMS2 correctly classifies 99% of the samples, as evaluated by 10-fold cross-validation.
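The evaluation protocol used above can be sketched generically: 10-fold cross-validation of a classifier on labelled samples. The classifier here is a simple nearest-centroid stand-in (IMS2's module offers several classification methods), and the data are synthetic two-dimensional points, not IMS spectra:

```python
import random

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_centroid_accuracy(data, labels, folds=10, seed=0):
    """k-fold cross-validated accuracy of a nearest-centroid classifier."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = set(idx[f::folds])           # every folds-th shuffled sample
        cents = {}
        for lab in set(labels):             # centroids from training folds only
            rows = [data[i] for i in idx if i not in test and labels[i] == lab]
            cents[lab] = centroid(rows)
        for i in test:                      # classify each held-out sample
            pred = min(cents, key=lambda lab: sum((a - b) ** 2
                                                  for a, b in zip(data[i], cents[lab])))
            correct += (pred == labels[i])
    return correct / len(data)

# Two well-separated synthetic classes:
data = [(i * 0.01, 0.0) for i in range(50)] + [(5.0 + i * 0.01, 5.0) for i in range(50)]
labels = ["healthy"] * 50 + ["cancer"] * 50
print(nearest_centroid_accuracy(data, labels))  # -> 1.0 on separable data
```

Because every sample is held out exactly once, the reported accuracy estimates performance on unseen patients rather than on the training data.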
Abstract: BACKGROUND: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy, at increased speed. RESULTS: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (spectral clustering, TribeMCL, GeneRAGE, hierarchical clustering, and affinity propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192,187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. CONCLUSIONS: FORCE is a practical alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets, are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
Abstract: BACKGROUND: The introduction of high-throughput genome sequencing and post-genome analysis technologies, e.g. DNA microarray approaches, has created the potential to unravel and scrutinize complex gene-regulatory networks on a large scale. The discovery of transcriptional regulatory interactions has become a major topic in modern functional genomics. RESULTS: To facilitate the analysis of gene-regulatory networks, we have developed CoryneCenter, a web-based resource for the systematic integration and analysis of genome, transcriptome, and gene regulatory information for prokaryotes, especially corynebacteria. For this purpose, we extended and combined the following systems into a common platform: (1) GenDB, an open source genome annotation system, (2) EMMA, a MAGE-compliant application for high-throughput transcriptome data storage and analysis, and (3) CoryneRegNet, an ontology-based data warehouse designed to facilitate the reconstruction and analysis of gene regulatory interactions. We demonstrate the potential of CoryneCenter by means of an application example. Using microarray hybridization data, we compare the gene expression of Corynebacterium glutamicum under acetate and glucose feeding conditions: known regulatory networks are confirmed, and CoryneCenter moreover points out additional regulatory interactions. CONCLUSION: CoryneCenter provides more than the sum of its parts. Its novel analysis and visualization features significantly simplify the process of obtaining new biological insights into complex regulatory systems. Although the platform currently focuses on corynebacteria, the integrated tools are by no means restricted to these species, and the presented approach offers a general strategy for the analysis and verification of gene regulatory networks. CoryneCenter provides freely accessible projects with the underlying genome annotation, gene expression, and gene regulation data.
The system is publicly available at http://www.CoryneCenter.de.
Abstract: BACKGROUND: The application of DNA microarray technology in post-genomic analysis of bacterial genome sequences has allowed the generation of huge amounts of data related to regulatory networks. These data, along with literature-derived knowledge on the regulation of gene expression, have opened the way for genome-wide reconstruction of transcriptional regulatory networks. These large-scale reconstructions can be converted into in silico models of bacterial cells that allow a systematic analysis of network behavior in response to changing environmental conditions. DESCRIPTION: CoryneRegNet was designed to facilitate the genome-wide reconstruction of transcriptional regulatory networks of corynebacteria relevant in biotechnology and human medicine. During the import and integration of data derived from experimental studies or from the literature, CoryneRegNet generates links to genome annotations, to identified transcription factors, and to the corresponding cis-regulatory elements. CoryneRegNet is based on a multi-layered, hierarchical, and modular concept of transcriptional regulation and was implemented using the relational database management system MySQL and an ontology-based data structure. Reconstructed regulatory networks can be visualized using the yFiles Java graph library. As an application example of CoryneRegNet, we have reconstructed the global transcriptional regulation of a cellular module involved in the SOS and stress response of corynebacteria. CONCLUSIONS: CoryneRegNet is an ontology-based data warehouse that allows pertinent data management of regulatory interactions along with the genome-scale reconstruction of transcriptional regulatory networks. These models can further be combined with metabolic networks to build integrated models of cellular function, including both metabolism and its transcriptional regulation.
Abstract: BACKGROUND: In phylogenetic analysis we face the problem that several subclade topologies are known or easily inferred and well supported by bootstrap analysis, but basal branching patterns cannot be unambiguously estimated by the usual methods (maximum parsimony (MP), neighbor-joining (NJ), or maximum likelihood (ML)), nor are they well supported. We represent each subclade by a sequence profile and estimate evolutionary distances between profiles to obtain a matrix of distances between subclades. RESULTS: Our estimator of profile distances generalizes the maximum likelihood estimator of sequence distances. The basal branching pattern can then be estimated by any distance-based method, such as neighbor-joining. Our method, profile neighbor-joining (PNJ), thus inherits the accuracy and robustness of profiles and the time efficiency of neighbor-joining. CONCLUSIONS: Phylogenetic analysis of the Chlorophyceae with traditional methods (MP, NJ, ML, and MrBayes) reveals seven well-supported subclades, but the methods disagree on the basal branching pattern. The tree reconstructed by our method is better supported and is confirmed by known morphological characters. Moreover, the accuracy is significantly improved, as shown by parametric bootstrap.
Abstract: Given a transmembrane protein, we wish to find related ones by a database search. Due to the strongly hydrophobic amino acid composition of transmembrane domains, suboptimal results are obtained when general-purpose scoring matrices such as BLOSUM are used. Recently, a transmembrane-specific score matrix called PHAT was shown to perform much better than BLOSUM. In this article, we derive a transmembrane score matrix family, called SLIM, which has several distinguishing features. In contrast to currently used matrices, SLIM is non-symmetric. The asymmetry arises because different background compositions are assumed for the transmembrane query and the unknown database sequences. We describe the mathematical model behind SLIM in detail and show that SLIM outperforms PHAT both on simulated data and in a realistic setting. Since non-symmetric score matrices are a new concept in database search methods, we discuss some important theoretical and practical issues.
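The asymmetry behind SLIM can be illustrated with a log-odds score that uses a *different* background composition for the query (transmembrane, hydrophobic-rich) and for the database sequence (average composition). The joint probabilities and compositions below are invented toy values over a two-letter alphabet {h, p} (hydrophobic/polar), not SLIM's actual parameters:

```python
import math

def log_odds(joint, q_query, q_db):
    """s(a, b) = log2( P(a, b) / (q_query(a) * q_db(b)) ). Non-symmetric
    because the two marginal backgrounds differ."""
    return {(a, b): math.log2(joint[a, b] / (q_query[a] * q_db[b]))
            for (a, b) in joint}

joint   = {("h", "h"): 0.5, ("h", "p"): 0.2, ("p", "h"): 0.2, ("p", "p"): 0.1}
q_query = {"h": 0.7, "p": 0.3}   # transmembrane query: mostly hydrophobic
q_db    = {"h": 0.5, "p": 0.5}   # database: average composition

S = log_odds(joint, q_query, q_db)
# Scoring (query='h', db='p') differs from (query='p', db='h'):
print(S[("h", "p")], S[("p", "h")])
```

With a symmetric matrix such as BLOSUM the two printed values would be equal; here the hydrophobic-heavy query background deflates the score of query-side hydrophobic residues, which is exactly the compositional correction the abstract describes.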
Abstract: The goal of phylogenetics is to reconstruct ancestral relationships between different taxa, e.g., different species in the tree of life, by means of certain characters, such as genomic sequences. We consider the prominent problem of reconstructing the basal phylogenetic tree topology when several subclades have already been identified or are well known by other means, such as morphological characteristics. Whereas most available tools attempt to estimate a fully resolved tree from scratch, the profile neighbor-joining (PNJ) method focuses directly on this problem and has proven to be a robust and efficient method for large-scale datasets, especially when used iteratively. We describe an implementation of this idea, the freely available ProfDist software package, and apply the method to estimate the phylogeny of the eukaryotes. Overall, the PNJ approach provides a novel, effective way to mine large sequence datasets for relevant phylogenetic information.
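The profile idea behind PNJ can be sketched as follows: a subclade with a known topology is collapsed into a profile of per-column character frequencies, and distances are then computed between profiles instead of single sequences. The distance below is a naive expected-mismatch fraction, not the maximum-likelihood estimator the method actually uses, and the aligned sequences are invented:

```python
def profile(alignment, alphabet="ACGT"):
    """Per-column relative character frequencies of an aligned subclade."""
    cols = list(zip(*alignment))
    return [{c: col.count(c) / len(col) for c in alphabet} for col in cols]

def profile_distance(p1, p2):
    """Expected fraction of mismatching sites between two profiles."""
    mismatch = 0.0
    for c1, c2 in zip(p1, p2):
        mismatch += 1.0 - sum(c1[a] * c2[a] for a in c1)
    return mismatch / len(p1)

clade_a = profile(["ACGT", "ACGT", "ACGA"])   # three sequences, known subclade
clade_b = profile(["TCGT", "TCGA"])           # two sequences, second subclade
print(profile_distance(clade_a, clade_b))
```

Feeding the resulting subclade-by-subclade distance matrix to neighbor-joining then resolves only the basal branching pattern, which is what makes the approach efficient on large datasets.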
Abstract: Clustering objects according to given similarity or distance values is a ubiquitous problem in computational biology with diverse applications, e.g., in defining families of orthologous genes, or in the analysis of microarray experiments. While a plethora of methods exists, many of them produce clusterings that can be further improved. "Cleaning up" initial clusterings can be formalized as projecting a graph onto the space of transitive graphs; this is also known as the cluster editing or cluster partitioning problem in the literature. In contrast to previous work on cluster editing, we allow arbitrary weights on the similarity graph. To solve the so-defined weighted transitive graph projection problem, we present (1) the first exact fixed-parameter algorithm, (2) a polynomial-time greedy algorithm that returns the optimal result on a well-defined subset of "close-to-transitive" graphs and works heuristically on other graphs, and (3) a fast heuristic that uses ideas similar to those of the Fruchterman-Reingold graph layout algorithm. We compare the quality and running times of these algorithms on both artificial graphs and protein similarity graphs derived from the 66 organisms of the COG dataset.
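Weighted cluster editing can be sketched as a cost on partitions: pairs with positive similarity should share a cluster, pairs with negative similarity should not, and every violated pair costs its absolute weight. The greedy below simply merges clusters while that lowers the cost; it is an illustration of the objective, not one of the three algorithms from the abstract. Weights are toy values:

```python
def editing_cost(sim, clusters):
    """Total weight of similarity-graph edits implied by a partition."""
    where = {v: i for i, cl in enumerate(clusters) for v in cl}
    cost = 0.0
    for (u, v), s in sim.items():
        together = where[u] == where[v]
        if together and s < 0:
            cost += -s          # a 'dissimilar' edge inside a cluster
        elif not together and s > 0:
            cost += s           # a 'similar' edge cut between clusters
    return cost

def greedy_merge(sim, nodes):
    """Start from singletons; merge the best cluster pair while cost drops."""
    clusters = [{v} for v in nodes]
    while True:
        best_cost, best_pair = editing_cost(sim, clusters), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                trial = [c for k, c in enumerate(clusters) if k not in (i, j)]
                trial.append(clusters[i] | clusters[j])
                c = editing_cost(sim, trial)
                if c < best_cost:
                    best_cost, best_pair = c, (i, j)
        if best_pair is None:
            return clusters
        i, j = best_pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

sim = {("a", "b"): 2.0, ("a", "c"): 1.5, ("b", "c"): 1.0,
       ("c", "d"): -1.0, ("a", "d"): -2.0, ("b", "d"): -2.0}
print(sorted(sorted(c) for c in greedy_merge(sim, "abcd")))  # -> [['a','b','c'], ['d']]
```

Note how "d", which is dissimilar to everything, remains a singleton automatically; no cluster-count parameter is needed, which is a key property of the cluster editing formulation.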
Abstract: The microarray layout problem is a generalization of the border length minimization problem and asks to distribute oligonucleotide probes on a microarray and to determine their embeddings in the deposition sequence in such a way that the overall quality of the resulting synthesized probes is maximized. Because of its inherent computational complexity, it is traditionally attacked in several phases: partitioning, placement, and re-embedding. We present the first algorithm, Greedy+, that combines placement and embedding and results in improved layouts in terms of border length and conflict index (a more realistic measure of probe quality), both on arrays of random probes and on existing Affymetrix GeneChip® arrays. We also present a large-scale study on how the layouts of GeneChip arrays have improved over time, and show how Greedy+ can further improve layout quality by as much as 8% in terms of border length and 34% in terms of conflict index.
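The border-length objective minimized above can be sketched concretely: each probe is embedded in the deposition sequence as a binary mask (1 = a nucleotide is synthesized at that step), and the border length between two adjacent spots is the number of steps at which exactly one of the two masks is open, i.e. the Hamming distance between the embeddings. The grid and embeddings below are toy values:

```python
def border_length(grid):
    """Total border length of a 2D grid of probe embeddings (bit strings)."""
    total = 0
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):      # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    # steps where exactly one mask is open contribute a border
                    total += sum(a != b for a, b in zip(grid[r][c], grid[rr][cc]))
    return total

# 2x2 array, deposition sequence of length 4:
grid = [["1010", "1010"],
        ["0101", "1010"]]
print(border_length(grid))  # -> 8
```

Placing probes with similar embeddings next to each other (and re-embedding probes to match their neighbours) reduces this sum, which is exactly what combining placement and embedding in one phase exploits.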