Ph.D., Bioinformatics, Universidad de la Habana (UH) and Universidad Nacional Autonoma de Mexico (UNAM), 2006. Dissertation: Rebuilding and Comparing the Transcriptional Regulatory Network of Gamma-Proteobacteria using a Bioinformatics Approach. Supervisors: Prof. Georgina Espinosa, Biology Faculty (UH), and Dr. Julio Collado-Vides, Genomics Center (UNAM).
B.Sc., Biochemistry, Universidad de la Habana (UH), 1998. Dissertation: Evidences of an insulin-like signaling circuit in the lobster Panulirus argus. Best Results Award of the Biology Faculty (5,56 points /5)
Short Postdoctoral Fellowship, National Laboratory for Scientific Computing, Brazil. (Three months), 2006 Comparative Genomics
Research Fellowship, National Laboratory for Scientific Computing, Brazil. (Three months), 2005 Comparative Genomics
Research Fellowship, Genomics Center, UNAM, Mexico. (One month), 2004 Comparative Genomics
Research Fellowship, National Laboratory for Scientific Computing, Brazil. (Three months), 2003 Comparative Genomics
Research Fellowship, EPFL (Federal Polytechnic School of Lausanne), Switzerland (Three months), 2000 Urban Environment and Social Disparities in Latin America
Abstract: The insulin superfamily is composed of a diverse group of proteins that share a common structural design whose most notable feature is a set of disulfide bonds. There is now sufficient experimental and bioinformatics evidence that the superfamily is represented in a number of well-investigated invertebrates, where its members have been found to intervene mainly in complex processes such as mitosis, cell growth, caste differentiation, and fertility. In this article we automated a methodology first proposed elsewhere, which combines sequence similarity with an assessment of superfamily membership based on the conservation of structurally key residues, to identify putative insulin-like peptides (ILPs) in completely sequenced genomes, and applied it as a pipeline to a group of 46 organisms, both vertebrates and invertebrates. As a result, we were able to identify 1,653 putative members of the insulin superfamily, ranging from 17 putative members in C. savignyi to 58 in X. tropicalis. Moreover, we found that structural distinctions (such as peptide length) between functionally diverse members of the superfamily found in vertebrates, that is, insulins, IGFs, and relaxins, are not equally represented in invertebrate genomes, suggesting that such divergence has occurred only recently in the evolutionary history of vertebrates.
Abstract: Several large ongoing initiatives that profit from next-generation sequencing technologies have driven, and in coming years will continue to drive, the emergence of long catalogs of missense single-nucleotide variants (SNVs) in the human genome. As a consequence, researchers have developed various methods and their related computational tools to classify these missense SNVs as probably deleterious or probably neutral polymorphisms. The outputs produced by each of these computational tools are of different natures and thus difficult to compare and integrate. Taking advantage of the possible complementarity between different tools might allow more accurate classifications. Here we propose an effective approach to integrating the output of some of these tools into a unified classification; this approach is based on a weighted average of the normalized scores of the individual methods (WAS). (In this paper, the approach is illustrated for the integration of five tools.) We show that this WAS outperforms each individual method in the task of classifying missense SNVs as deleterious or neutral. Furthermore, we demonstrate that this WAS can be used not only for classification purposes (deleterious versus neutral mutation) but also as an indicator of the impact of the mutation on the functionality of the mutant protein. In other words, it may be used as a deleteriousness score of missense SNVs. Therefore, we recommend the use of this WAS as a consensus deleteriousness score of missense mutations (Condel).
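The idea of a weighted average of normalized scores can be illustrated with a minimal Python sketch. It is not the published implementation: the tool names, the min-max normalization, and the weights below are all assumptions chosen for illustration, since the abstract does not specify how scores are normalized or how weights are derived.

```python
from typing import Dict


def min_max_normalize(score: float, lo: float, hi: float) -> float:
    """Rescale a raw tool score to [0, 1], clamping values outside [lo, hi].

    Min-max scaling is an assumption; the abstract only says scores are normalized.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))


def weighted_average_score(normalized: Dict[str, float],
                           weights: Dict[str, float]) -> float:
    """Weighted average of normalized scores over tools present in both maps."""
    tools = [t for t in normalized if t in weights]
    total_weight = sum(weights[t] for t in tools)
    if total_weight <= 0:
        raise ValueError("no usable tools or non-positive total weight")
    return sum(weights[t] * normalized[t] for t in tools) / total_weight


# Hypothetical tools, scores, and weights (not taken from the paper).
scores = {"tool_a": 0.9, "tool_b": 0.5, "tool_c": 0.1}
weights = {"tool_a": 2.0, "tool_b": 1.0, "tool_c": 1.0}
was = weighted_average_score(scores, weights)
```

Dividing by the sum of the weights actually used keeps the result in [0, 1] even when a tool produces no prediction for a given variant.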
Abstract: Background: The fine tuning of two features of the bacterial regulatory machinery has been known to contribute to the diversity of gene expression within the same regulon: the sequence of Transcription Factor (TF) binding sites, and their location with respect to promoters. While variations of binding sequences modulate the strength of the interaction between the TF and its binding sites, the distance between binding sites and promoters alters the interaction between the TF and the RNA polymerase (RNAP).
Results: In this paper we estimated the dissociation constants (Kd) of several E. coli TFs in their interaction with variants of their binding sequences from the scores resulting from aligning them to Positional Weight Matrices. A correlation coefficient of 0.78 was obtained when pooling together sites for different TFs. The theoretically estimated Kd values were then used, together with the dissociation constants of the RNAP-promoter interaction, to analyze activated and repressed promoters. The strength of repressor sites (i.e., the strength of the interaction between TFs and their binding sites) is slightly higher than that of activated sites. We explored how different factors, such as the variation of binding sequences, the occurrence of more than one binding site, or different RNAP concentrations, may influence the promoters' response to variations in TF concentration.
We found that the occurrence of several regulatory sites bound by the same TF close to a promoter, if they are bound by the TF in an independent manner, changes the effect of TF concentrations on promoter occupancy with respect to individual sites. We also found that the occupancy of a promoter will never be more than half if the RNAP concentration-to-Kp ratio is 1 and the promoter is subject to repression, or less than half if the promoter is subject to activation. If the ratio falls to 0.1, the upper limit of occupancy probability for repressed promoters drops below 10%; a descent of the limits also occurs for activated promoters.
Conclusion: The number of regulatory sites may thus act as a versatility-producing device, in addition to serving as a source of robustness of the transcription machinery. Furthermore, our results show that the effects of TF concentration fluctuations on promoter occupancy are constrained by RNAP concentrations.
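The occupancy limits reported in the Results can be reproduced with a minimal Python sketch of a textbook equilibrium binding model. This is an assumed model, not necessarily the one used in the paper: r stands for the RNAP concentration-to-Kp ratio, t for the TF concentration-to-Kd ratio, repression is modeled as mutually exclusive binding of TF and RNAP, and activation as cooperative binding with a hypothetical cooperativity factor omega >= 1.

```python
def occupancy_repressed(r: float, t: float) -> float:
    """Probability that RNAP occupies the promoter when the TF excludes it.

    States: empty (weight 1), RNAP bound (r), TF bound (t).
    """
    return r / (1.0 + r + t)


def occupancy_activated(r: float, t: float, omega: float = 10.0) -> float:
    """Probability that RNAP occupies the promoter when the TF recruits it.

    States: empty (1), RNAP bound (r), TF bound (t), both bound (omega * r * t).
    omega is an illustrative cooperativity factor, not a value from the paper.
    """
    both = omega * r * t
    return (r + both) / (1.0 + r + t + both)


# Sweep TF levels (in units of Kd) for the two RNAP-to-Kp ratios in the text.
tf_levels = [0.0, 0.1, 1.0, 10.0, 100.0]
repressed_at_1 = [occupancy_repressed(1.0, t) for t in tf_levels]
repressed_at_01 = [occupancy_repressed(0.1, t) for t in tf_levels]
activated_at_1 = [occupancy_activated(1.0, t) for t in tf_levels]
```

Under these assumptions the TF-free occupancy is r / (1 + r), which is 0.5 for r = 1 and about 0.09 for r = 0.1; repression can only lower this value and activation can only raise it, which matches the bounds stated above.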
Abstract: Background
The specific recognition of genomic cis-regulatory elements by transcription factors (TFs) plays an essential role in the regulation of coordinated gene expression. Studying the mechanisms determining binding specificity in protein-DNA interactions is thus an important goal. Most current approaches for modeling TF specific recognition rely on the knowledge of large sets of cognate target sites and consider only the information contained in their primary sequence.
Results
Here we describe a structure-based methodology for predicting sequence motifs starting from the coordinates of a TF-DNA complex. Our algorithm combines information regarding the direct and indirect readout of DNA into an atomistic statistical model, which is used to estimate the interaction potential. We first measure the ability of our method to correctly estimate the binding specificities of eight prokaryotic and eukaryotic TFs that belong to different structural superfamilies. Secondly, the method is applied to two homology models, finding that sampling of interface side-chain rotamers remarkably improves the results. Thirdly, the algorithm is compared with a reference structural method based on contact counts, obtaining comparable predictions for the experimental complexes and more accurate sequence motifs for the homology models.
Conclusions
Our results demonstrate that atomic-detail structural information can be feasibly used to predict TF binding sites. The computational method presented here is universal and might be applied to other systems involving protein-DNA recognition.
Abstract: Our understanding of the evolution of the insulin signaling pathway is still incomplete. One intriguing unanswered question is how the glucostatic role of insulin emerged in mammals. To find out whether this is due to the development of new sets of signal transduction elements in these organisms, or to the establishment of new interactions between pre-existing proteins, we rebuilt putative orthologous insulin signaling pathways (ISPs) in 17 eukaryotic organisms. Then, we computed the conservation of orthologous ISPs at different levels, from sequence similarity of orthologous proteins to co-evolution of interacting domains. We found that the emergence of the glucostatic role in mammals can be explained neither by the development of new sets of signaling elements nor by the establishment of new interactions between pre-existing proteins. The comparison of orthologous IRS molecules indicates that only in mammals have they acquired their complete functionality as efficient recruiters of effector sub-pathways.
Abstract: BACKGROUND: In the past years, several studies have begun to unravel the structure, dynamical properties, and evolution of transcriptional regulatory networks. However, even those comparative studies that focus on a group of closely related organisms are limited by the rather scarce knowledge of regulatory interactions outside a few model organisms, such as E. coli among the prokaryotes. RESULTS: In this paper we used the information annotated in Tractor_DB (a database of regulatory networks in gamma-proteobacteria) to calculate a normalized Site Orthology Score (SOS) that quantifies the conservation of a regulatory link across thirty genomes of this subclass. Then we used this SOS to assess how regulatory connections have evolved in this group, and how the variation of basic regulatory connections is reflected in the structure of the chromosome. We found that individual regulatory interactions shift between different organisms, a process that may be described as rewiring the network. At this evolutionary scale (the gamma-proteobacteria subclass) this rewiring process may be an important source of variation of regulatory incoming interactions for individual networks. We also noticed that the regulatory links that form feed-forward motifs are conserved in a more correlated manner than triads of random regulatory interactions or pairs of co-regulated genes. Furthermore, the rewiring process that takes place at the most basic level of the regulatory network may be linked to rearrangements of genetic material within bacterial chromosomes, which change the structure of Transcription Units and therefore the regulatory connections between Transcription Factors and structural genes. CONCLUSION: The rearrangements that occur in bacterial chromosomes (mostly inversion or horizontal gene transfer events) are important sources of variation of gene regulation at this evolutionary scale.
Abstract: Version 2.0 of Tractor_DB is now accessible at its three international mirrors: www.bioinfo.cu/Tractor_DB, www.tractor.lncc.br and http://www.ccg.unam.mx/tractorDB. This database contains a collection of computationally predicted Transcription Factors' binding sites in gamma-proteobacterial genomes. These data should aid researchers in the design of microarray experiments and the interpretation of their results. They should also facilitate studies of Comparative Genomics of the regulatory networks of this group of organisms. In this paper we describe the main improvements incorporated into the database in the past year and a half, which include adding information on the regulatory networks of 13 new gamma-proteobacteria (bringing the total to 30) and developing a new computational strategy to complement the putative sites identified by the original weight matrix-based approach. We have also added dynamically generated navigation tabs to the navigation interfaces. Moreover, we developed a new interface that allows users to directly retrieve information on the conservation of regulatory interactions in the 30 genomes included in the database by navigating a map that represents a core of the known Escherichia coli regulatory network.
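A weight matrix-based search of the kind mentioned above can be sketched in a few lines of Python. This is a generic log-odds position weight matrix scan, not the Tractor_DB pipeline: the aligned sites, the uniform background, the pseudocount, and the score threshold are all hypothetical values chosen for illustration, and sequences are assumed to contain only uppercase A, C, G, T.

```python
import math
from typing import Dict, List, Tuple

BASES = "ACGT"


def build_log_odds(sites: List[str], pseudocount: float = 1.0,
                   background: float = 0.25) -> List[Dict[str, float]]:
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence: str, matrix: List[Dict[str, float]],
         threshold: float) -> List[Tuple[int, str, float]]:
    """Slide the matrix along the sequence and keep windows scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy aligned sites and target sequence (hypothetical, for illustration only).
toy_sites = ["TGTGA", "TGTGA", "TGAGA", "TGTGC"]
pwm = build_log_odds(toy_sites)
hits = scan("AATGTGACC", pwm, threshold=3.0)
```

In practice the threshold controls the trade-off between missed sites and false positives, which is one reason a complementary strategy may be useful alongside matrix-based predictions.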
Abstract: Experimental data on the Escherichia coli transcriptional regulatory system have been used in the past years to predict new regulatory elements (promoters, transcription factors (TFs), TFs' binding sites and operons) within its genome. As more genomes of gamma-proteobacteria are being sequenced, the prediction of these elements in a growing number of organisms has become more feasible, as a step towards the study of how different bacteria respond to environmental changes at the level of transcriptional regulation. In this work, we present TRACTOR_DB (TRAnscription FaCTORs' predicted binding sites in prokaryotic genomes), a relational database that contains computational predictions of new members of 74 regulons in 17 gamma-proteobacterial genomes. For these predictions we used a comparative genomics approach for which several proof-of-principle articles on large regulons have been published. TRACTOR_DB may currently be accessed at http://www.bioinfo.cu/Tractor_DB, http://www.tractor.lncc.br/ or at http://www.cifn.unam.mx/Computational_Genomics/tractorDB. Contact email: tractor@cifn.unam.mx.
Abstract: Prokaryotic genome annotation has focused on gene location and function. The lack of regulatory information has limited our knowledge of cellular transcriptional regulatory networks. However, as more phylogenetically close genomes are sequenced and annotated, the implementation of phylogenetic footprinting strategies for the recognition of regulators and their regulons becomes more important. In this paper we describe a comparative genomics approach to the prediction of new gamma-proteobacterial regulon members. We take advantage of the phylogenetic proximity of Escherichia coli and 16 other organisms of this subdivision and of the intensive search of sequence space provided by a pattern-matching strategy. Using this approach we complement predictions of regulatory sites made using statistical models currently stored in Tractor_DB, and increase the number of transcriptional regulators with predicted binding sites to 86. All these computational predictions may be accessed at Tractor_DB (www.bioinfo.cu/Tractor_DB, www.tractor.lncc.br, www.ccg.unam.mx/Computational_Genomics/tractorDB/). We also take a first step in this paper towards the assessment of the conservation of the architecture of the regulatory network in the gamma-proteobacteria by evaluating the conservation of the overall connectivity of the network.
Abstract: Experimental data on Escherichia coli transcriptional regulation have enabled the construction of statistical models to predict new regulatory elements within its genome. Far less is known about the transcriptional regulatory elements in other gamma-proteobacteria with sequenced genomes, so it is of great interest to conduct comparative genomic studies aimed at extracting biologically relevant information about transcriptional regulation in these less studied organisms using the knowledge from E. coli. In this work, we use the information stored in the TRACTOR_DB database to conduct a comparative study on the mechanisms of transcriptional regulation in eight gamma-proteobacteria and 38 regulons. We assess the conservation of transcription factor binding specificity across all eight genomes and show a correlation between the conservation of a regulatory site and the structure of the transcription unit it regulates. We also find a marked conservation of site-promoter distances across the eight organisms and a correspondence of the statistical significance of co-occurrence of pairs of transcription factor binding sites in the regulatory regions, which is probably related to a conserved architecture of higher-order regulatory complexes in the organisms studied. The results obtained in this study using the information on transcriptional regulation in E. coli enable us to conclude that not only are transcription factor binding sites conserved across related species, but so are several of the transcriptional regulatory mechanisms previously identified in E. coli.
Abstract: An extraordinary advance of the biomedical sciences will take place in the coming years as a result of the Human Genome Project. New technologies based on molecular genetics and informatics are key factors in this development, since they provide powerful tools for obtaining and analyzing genetic information. These technologies have made possible the development of genomics by enabling the study of the interactions among genes and their influence on the development of diseases. All this has an impact on clinical diagnosis, the investigation of new drugs, epidemiology, and medical informatics. In recent years, data mining has grown as a support for the philosophies of information and knowledge management, as well as for discovering the meaning of the data stored in large databanks. It makes it possible to explore and analyze the available databases to support decision making; it also facilitates the extraction of information from texts and the creation of smart systems capable of understanding them, which is commonly known as text mining. The basic components of data mining and its application to an emerging and far-reaching scientific activity, bioinformatics, are concisely described.