Abstract: RNA silencing is a conserved mechanism in which small RNAs trigger various forms of sequence-specific gene silencing by guiding Argonaute complexes to target RNAs by means of base pairing. RNA silencing is thought to have evolved as a form of nucleic-acid-based immunity to inactivate viruses and transposable elements. Although the activity of transposable elements in animals has been thought largely to be restricted to the germ line, recent studies have shown that they may also actively transpose in somatic cells, creating somatic mosaicism in animals. In the Drosophila germ line, Piwi-interacting RNAs arise from repetitive intergenic elements including retrotransposons by a Dicer-independent pathway and function through the Piwi subfamily of Argonautes to ensure silencing of retrotransposons. Here we show that, in cultured Drosophila S2 cells, Argonaute 2 (AGO2), an AGO subfamily member of Argonautes, associates with endogenous small RNAs of 20-22 nucleotides in length, which we have collectively named endogenous short interfering RNAs (esiRNAs). esiRNAs can be divided into two groups: one that mainly corresponds to a subset of retrotransposons, and the other that arises from stem-loop structures. esiRNAs are produced in a Dicer-2-dependent manner from distinctive genomic loci, are modified at their 3' ends and can direct AGO2 to cleave target RNAs. Mutations in Dicer-2 caused an increase in retrotransposon transcripts. Together, our findings indicate that different types of small RNAs and Argonautes are used to repress retrotransposons in germline and somatic cells in Drosophila.
Abstract: BACKGROUND: The adaptive immune system (AIS) of jawed vertebrates is a sophisticated system mediated by numerous genes in specialized cells. Phylogenetic analysis indicates that emergence of the AIS followed the occurrence of two rounds of whole-genome duplication (2R-WGD) in early vertebrates, but little direct evidence linking these two events is available. RESULTS: We examined the relationship between 2R-WGD and the gain of AIS-related functions by numerous genes. To analyze the evolution of the many genes related to signal transduction in the AIS (defined as AIS genes), we identified groups of genes (defined as AIS subfamilies) that included at least one human AIS gene, its paralogs (if any), and its Drosophila ortholog(s). Genomic mapping revealed that numerous pairs of AIS genes and their paralogs were part of paralogons - series of paralogous regions that derive from a common ancestor - throughout the human genome, indicating that the genes were retained as duplicates after 2R-WGD. Outgroup comparison analysis revealed that subfamilies in which human and fly genes shared a nervous system-related function were significantly enriched among AIS subfamilies, as compared with the overall incidence of shared nervous system-related functions among all subfamilies in bilaterians. This finding statistically supports the hypothesis that AIS-related signaling genes were ancestrally involved in the nervous system of urbilaterians. CONCLUSION: The current results suggest that 2R-WGD played a major role in the duplication of many signaling genes, ancestrally used in nervous system development and function, that were later co-opted for new functions during evolution of the AIS.
Abstract: Small RNAs triggering RNA silencing are loaded onto Argonautes and then sequence-specifically guide them to target transcripts. Epitope-tagged human Argonautes (hAgo1, hAgo2, hAgo3, and hAgo4) are associated with siRNAs and miRNAs, but only epitope-tagged hAgo2 has been shown to have Slicer activity. In contrast, how endogenous hAgos behave with respect to small RNA association and target RNA destruction has remained unclear. Here, we produced monoclonal antibodies for individual hAgos. High-throughput pyrosequencing revealed that immunopurified endogenous hAgo2 and hAgo3 associated mostly with miRNAs. Endogenous hAgo3 did not show Slicer function but localized in P-bodies, suggesting that endogenously expressed hAgo3 is, like hAgo2, involved in the miRNA pathway but antagonizes the RNAi activity of hAgo2. Sequence variations of miRNAs were found at both 5' and 3' ends, suggesting that multiple mature miRNAs containing different "seed" sequences can arise from one miRNA precursor. The hAgo antibodies we raised are valuable tools for ascertaining the functional behavior of endogenous Argonautes and miRNAs in RNA silencing.
Abstract: The adenohypophysis of vertebrates receives peptide hormones from the hypothalamus and secretes hormones that regulate diverse physiologic processes in peripheral organs. The adenohypophysis-mediated endocrine system is widely conserved across vertebrates but not invertebrates. Phylogenetic analysis indicates that the emergence of this system coincided with two rounds of whole-genome duplication (2R-WGD) in early vertebrates, but direct evidence linking these events has been unavailable. We detected all human paralogons (series of paralogous regions) formed in early vertebrates as traces of 2R-WGD, and examined the relationship between 2R-WGD and the evolution of genes essential to the adenohypophysis-mediated endocrine system. Regarding genes encoding transcription factors (TFs) involved in the terminal differentiation into hormone-secreting cells in adenohypophyseal development, we showed that most pairs of these genes and their paralogs were part of paralogons. In addition, our analysis also indicated that most of the paralog pairs in families of adenohypophyseal hormones and their receptors were part of paralogons. These results suggest that 2R-WGD played an important role in generating genes encoding adenohypophyseal TFs, hormones, and their receptors for increasing the diversification of hormone repertoire in the adenohypophysis-mediated endocrine system of vertebrates.
Abstract: We present web servers for analysis of non-coding RNA sequences on the basis of their secondary structures. Software tools for structural multiple sequence alignments, structural pairwise sequence alignments and structural motif findings are available from the integrated web server and the individual stand-alone web servers. The servers are located at http://software.ncrna.org, along with the information for the evaluation and downloading. This website is freely available to all users and there is no login requirement.
Abstract: We developed a pair of databases that support two important tasks: annotation of anonymous RNA transcripts and discovery of novel non-coding RNAs. The database combo is called the Functional RNA Database and consists of two databases: a rewrite of the original version of the Functional RNA Database (fRNAdb) and the latest version of the UCSC GenomeBrowser for Functional RNA. The former is a sequence database equipped with a powerful search function and hosts a large collection of known/predicted non-coding RNA sequences acquired from existing databases as well as novel/predicted sequences reported by researchers of the Functional RNA Project. The latter is a UCSC Genome Browser mirror with large additional custom tracks specifically associated with non-coding elements. It also includes several functional enhancements such as a presentation of a common secondary structure prediction at any given genomic window ≤500 bp. Our GenomeBrowser supports user authentication and user-specific tracks. The current version of the fRNAdb is a complete rewrite of the former version, hosting a larger number of sequences and with a much friendlier interface. The current version of UCSC GenomeBrowser for Functional RNA features a larger number of tracks and richer features than the former version. The databases are available at http://www.ncrna.org/.
Abstract: BACKGROUND: Aligning multiple RNA sequences is essential for analyzing non-coding RNAs. Although many alignment methods for non-coding RNAs, including Sankoff's algorithm for strict structural alignments, have been proposed, they are either inaccurate or computationally too expensive. Faster methods with reasonable accuracies are required for genome-scale analyses. RESULTS: We propose a fast algorithm for multiple structural alignments of RNA sequences that is an extension of our pairwise structural alignment method (implemented in SCARNA). The accuracies of the implemented software, MXSCARNA, are at least as favorable as those of state-of-art algorithms that are computationally much more expensive in time and memory. CONCLUSION: The proposed method for structural alignment of multiple RNA sequences is fast enough for large-scale analyses with accuracies at least comparable to those of existing algorithms. The source code of MXSCARNA and its web server are available at http://mxscarna.ncrna.org.
Abstract: MOTIVATION: Base pairing probability matrices have been frequently used for the analyses of structural RNA sequences. Recently, there has been a growing need for computing these probabilities for long DNA sequences by constraining the maximal span of base pairs to a limited value. However, none of the existing programs can exactly compute the base pairing probabilities associated with the energy model of secondary structures under such a constraint. RESULTS: We present an algorithm that exactly computes the base pairing probabilities associated with the energy model under the constraint on the maximal span W of base pairs. The complexity of our algorithm is given by O(NW^2) in time and O(N+W^2) in memory, where N is the sequence length. We show that our algorithm has a higher sensitivity to the true base pairs as compared to that of RNAplfold. We also present an algorithm that predicts a mutually consistent set of local secondary structures by maximizing the expected accuracy function. The comparison of the local secondary structure predictions with those of RNALfold indicates that our algorithm is more accurate. Our algorithms are implemented in the software named 'Rfold.' AVAILABILITY: The C++ source code of the Rfold software and the test dataset used in this study are available at http://www.ncrna.org/software/Rfold/.
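The effect of a maximal-span constraint on the running time can be illustrated with a much simpler problem than the one the abstract addresses. The sketch below is not the Rfold algorithm and computes no probabilities: it maximizes the number of nested base pairs (a Nussinov-style recursion, swapped in here for illustration) while only allowing pairs (i, j) with j - i <= W. The function names, the `min_loop` parameter and the pairing rule are assumptions made for this example. Because only windows of width at most W are filled in, the time is O(NW^2); this simple version uses O(NW) memory rather than the O(N+W^2) reported in the abstract.

```python
# Toy illustration of a maximal-span constraint (not the Rfold algorithm).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Watson-Crick and G-U wobble pairs (an assumed pairing rule)."""
    return (a, b) in PAIRS


def max_pairs_with_span(seq, max_span, min_loop=3):
    """Maximum number of nested pairs (i, j) with min_loop < j - i <= max_span."""
    n = len(seq)
    w = max(0, min(max_span, n - 1))

    # inner[i][d]: best pair count inside seq[i..i+d], only for d <= w.
    inner = [[0] * (w + 1) for _ in range(n)]

    def inside(lo, hi):
        # Best count for the closed interval [lo, hi]; empty intervals score 0.
        return inner[lo][hi - lo] if hi > lo else 0

    for d in range(1, w + 1):                 # O(W) window widths
        for i in range(n - d):                # O(N) window starts
            j = i + d
            best = inner[i][d - 1]            # position j left unpaired
            for k in range(i, j - min_loop):  # O(W) partners k for position j
                if can_pair(seq[k], seq[j]):
                    best = max(best, inside(i, k - 1) + 1 + inside(k + 1, j - 1))
            inner[i][d] = best

    # prefix[m]: best pair count in seq[0..m-1]; each step looks back <= w.
    prefix = [0] * (n + 1)
    for m in range(1, n + 1):
        p = m - 1
        best = prefix[m - 1]                  # position p left unpaired
        for k in range(max(0, p - w), p - min_loop):
            if can_pair(seq[k], seq[p]):
                best = max(best, prefix[k] + 1 + inside(k + 1, p - 1))
        prefix[m] = best
    return prefix[n]
```

For the hairpin-like sequence `GGGAAAUCCC`, calling `max_pairs_with_span` with decreasing values of `max_span` removes the outermost pairs first, which is the behavior a span constraint is expected to have.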
Abstract: MOTIVATION: Recent studies have shown that the methods for predicting secondary structures of RNAs on the basis of posterior decoding of the base-pairing probabilities have an advantage with respect to prediction accuracy over the conventionally utilized minimum free energy methods. However, there is room for improvement in the objective functions presented in previous studies, which are maximized in the posterior decoding with respect to the accuracy measures for secondary structures. RESULTS: We propose novel estimators which improve the accuracy of secondary structure prediction of RNAs. The proposed estimators maximize an objective function which is the weighted sum of the expected number of the true positives and that of the true negatives of the base-pairs. The proposed estimators are also improved versions of the ones used in previous works, namely CONTRAfold (Do et al., 2006) for secondary structure prediction from a single RNA sequence and McCaskill-MEA (Kiryu et al., 2007) for common secondary structure prediction from multiple alignments of RNA sequences. We clarify the relations between the proposed estimators and the estimators presented in previous works, and theoretically show that the previous estimators include additional unnecessary terms in the evaluation measures with respect to the accuracy. Furthermore, computational experiments confirm the theoretical analysis by indicating improvement in the empirical accuracy. The proposed estimators represent extensions of the centroid estimators proposed in Ding et al. (2005) and Carvalho and Lawrence (2008), and are applicable to a wide variety of problems in bioinformatics. AVAILABILITY: Supporting information and the CentroidFold software are available online at: http://www.ncrna.org/software/centroidfold/. CONTACT: hamada-michiaki@aist.go.jp.
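The objective described above, a weighted sum of the expected true positives and expected true negatives of base pairs, can be sketched as follows. Writing gamma for the weight on true positives and p_ij for the base-pairing probability, gamma * sum(y_ij * p_ij) + sum((1 - y_ij) * (1 - p_ij)) equals sum(y_ij * ((gamma + 1) * p_ij - 1)) plus a constant, so each predicted pair contributes (gamma + 1) * p_ij - 1. The code below maximizes that sum over nested structures with a Nussinov-style recursion. It is a minimal sketch, not the CentroidFold implementation: the function name, the dictionary input format and the toy probabilities in the usage note are assumptions.

```python
def gamma_centroid_structure(bpp, n, gamma=1.0):
    """Return nested pairs maximizing sum((gamma + 1) * p_ij - 1).

    bpp maps 0-based pairs (i, j) with i < j to probabilities (assumed given,
    e.g. from a partition-function calculation); n is the sequence length.
    """
    def gain(i, j):
        return (gamma + 1.0) * bpp.get((i, j), 0.0) - 1.0

    def inside(table, lo, hi):
        # Score of the closed interval [lo, hi]; empty intervals score 0.
        return table[lo][hi] if hi > lo else 0.0

    # best[i][j]: maximal total gain on positions i..j (inclusive).
    best = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]                 # j unpaired
            for k in range(i, j):                  # j paired with k
                g = gain(k, j)
                if g > 0.0:                        # only helpful pairs are used
                    cand = inside(best, i, k - 1) + g + inside(best, k + 1, j - 1)
                    value = max(value, cand)
            best[i][j] = value

    # Traceback: recompute the same expressions to find which choice was made.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j):
            g = gain(k, j)
            if g <= 0.0:
                continue
            cand = inside(best, i, k - 1) + g + inside(best, k + 1, j - 1)
            if best[i][j] == cand:
                pairs.append((k, j))
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return sorted(pairs)
```

With toy probabilities such as `{(0, 9): 0.9, (1, 8): 0.8, (2, 7): 0.4, (3, 6): 0.1}`, a pair is predicted only when p_ij exceeds 1 / (gamma + 1), so raising gamma admits lower-probability pairs and trades specificity for sensitivity.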
Abstract: BACKGROUND: Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity. RESULTS: We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering. CONCLUSION: Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.
Abstract: We have examined the expression profile of selected non-coding RNAs (ncRNAs) in 11 human tissues. Among 5489 full-length cDNA clones annotated as non-protein-coding transcripts in the H-Invitational database, we chose 150 clones for further analysis based on their gene structure and EST information. Expression profiling using quantitative RT-PCR and Northern blot hybridization revealed that the majority of the selected ncRNAs exhibited tissue specificity: 67% are predominantly expressed in a restricted subset of tissues. The absolute quantification of representative ncRNAs revealed that the majority of ncRNAs are expressed as low abundance transcripts. A comparative genomic analysis revealed that only 27% of the selected ncRNAs have mouse counterparts. Since the expression patterns of the human ncRNAs that have mouse counterparts remain similar to those of the mouse ncRNAs, the expression patterns of the selected ncRNAs may be conserved between human and mouse.
Abstract: MOTIVATION: Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model. However, the extremely high computational cost of the Sankoff algorithm precludes its application to most RNA sequences. RESULTS: We propose an efficient algorithm for the multiple alignment of structural RNA sequences. Our algorithm is a variant of the Sankoff algorithm, and it uses an efficient scoring system that reduces the time and space requirements considerably without compromising on the alignment quality. First, our algorithm computes the match probability matrix that measures the alignability of each position pair between sequences as well as the base pairing probability matrix for each sequence. These probabilities are then combined to score the alignment using the Sankoff algorithm. By itself, our algorithm does not predict the consensus secondary structure of the alignment but uses external programs for the prediction. We demonstrate that both the alignment quality and the accuracy of the consensus secondary structure prediction from our alignment are the highest among the other programs examined. We also demonstrate that our algorithm can align relatively long RNA sequences such as the eukaryotic-type signal recognition particle RNA that is approximately 300 nt in length; multiple alignment of such sequences has not been possible by using other Sankoff-based algorithms. The algorithm is implemented in the software named 'Murlet'. AVAILABILITY: The C++ source code of the Murlet software and the test dataset used in this study are available at http://www.ncrna.org/papers/Murlet/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Abstract: Several computational methods based on stochastic context-free grammars have been developed for modeling and analyzing functional RNA sequences. These grammatical methods have succeeded in modeling typical secondary structures of RNA, and are used for structural alignment of RNA sequences. However, such stochastic models cannot sufficiently discriminate member sequences of an RNA family from nonmembers and hence detect noncoding RNA regions from genome sequences. A novel kernel function, stem kernel, for the discrimination and detection of functional RNA sequences using support vector machines (SVMs) is proposed. The stem kernel is a natural extension of the string kernel, specifically the all-subsequences kernel, and is tailored to measure the similarity of two RNA sequences from the viewpoint of secondary structures. The stem kernel examines all possible common base pairs and stem structures of arbitrary lengths, including pseudoknots between two RNA sequences, and calculates the inner product of common stem structure counts. An efficient algorithm is developed to calculate the stem kernels based on dynamic programming. The stem kernels are then applied to discriminate members of an RNA family from nonmembers using SVMs. The study indicates that the discrimination ability of the stem kernel is strong compared with conventional methods. Furthermore, the potential application of the stem kernel is demonstrated by the detection of remotely homologous RNA families in terms of secondary structures. This is because the string kernel is proven to work for the remote homology detection of protein sequences. These experimental results have convinced us to apply the stem kernel in order to find novel RNA families from genome sequences.
Abstract: The identification of novel miRNAs has significant biological and clinical importance. However, none of the known miRNA features alone is sufficient for accurately detecting novel miRNAs. The aim of this paper is to integrate these features in a straightforward manner for detecting miRNAs with better accuracy. Since most miRNA regions are highly conserved among vertebrates for the ability to form stable hairpin structures, we implemented a hidden Markov model that outputs multidimensional feature vectors composed of both evolutionary features and secondary structural ones. The proposed method, called miRRim, outperformed existing ones in terms of detection/prediction performance: The total number of predictions was smaller than with existing methods when the number of miRNAs detected was adjusted to be the same. Moreover, there were several candidates predicted only by our method that are clustered with the known miRNAs, suggesting that our method is able to detect novel miRNAs. Genomic coordinates of predicted miRNA can be obtained from http://mirrim.ncrna.org/.
Abstract: In the human HOXA locus a number of ncRNAs are transcribed from the intergenic regions in the opposite direction to HOXA mRNAs. We observed that the genomic organization of genes for the ncRNAs and HOXA proteins is highly conserved between human and mouse. We examined the expression profiles of these ncRNAs and HOXA mRNAs in various human tissues. The expression patterns of ncRNAs in human tissues coincide with those of the adjacent HOXA mRNAs that are collinearly expressed along the anteroposterior axis. This coordinated expression was observed even in transformed tumors and cancer cell lines, suggesting that the expression of ncRNAs is prerequisite for the regulated expression of HOXA genes. HIT18844 ncRNA transcribed from the most upstream position of the HOXA cluster possesses an ultra-conserved short stretch which potentially forms an evolutionarily conserved secondary structure. Our data suggest a critical role for ncRNAs in the regulation of HOXA gene expression.
Abstract: The genome sequence of Aspergillus oryzae, a fungus used in the production of the traditional Japanese fermentation foods sake (rice wine), shoyu (soy sauce), and miso (soybean paste), has revealed prominent features in its gene composition as compared to those of Saccharomyces cerevisiae and Neurospora crassa. The A. oryzae genome is extremely enriched with genes involved in biomass degradation, primary and secondary metabolism, transcriptional regulation, and cell signaling. Even compared to the related species A. nidulans and A. fumigatus, an abundance of metabolic genes is apparent, with acquisition of more than 6 Mb of sequence in the A. oryzae lineage, interspersed throughout the A. oryzae genome. Besides the various already established merits of A. oryzae for industrial uses, the genome sequence and the abundance of metabolic genes should significantly accelerate the biotechnological use of A. oryzae in industry.
Abstract: MOTIVATION: Recent transcriptomic studies have revealed the existence of a considerable number of non-protein-coding RNA transcripts in higher eukaryotic cells. To investigate the functional roles of these transcripts, it is of great interest to find conserved secondary structures from multiple alignments on a genomic scale. Since multiple alignments are often created using alignment programs that neglect the special conservation patterns of RNA secondary structures for computational efficiency, alignment failures can cause potential risks of overlooking conserved stem structures. RESULTS: We investigated the dependence of the accuracy of secondary structure prediction on the quality of alignments. We compared three algorithms that maximize the expected accuracy of secondary structures as well as other frequently used algorithms. We found that one of our algorithms, called McCaskill-MEA, was more robust against alignment failures than others. The McCaskill-MEA method first computes the base pairing probability matrices for all the sequences in the alignment and then obtains the base pairing probability matrix of the alignment by averaging over these matrices. The consensus secondary structure is predicted from this matrix such that the expected accuracy of the prediction is maximized. We show that the McCaskill-MEA method performs better than other methods, particularly when the alignment quality is low and when the alignment consists of many sequences. Our model has a parameter that controls the sensitivity and specificity of predictions. We discussed the uses of that parameter for multi-step screening procedures to search for conserved secondary structures and for assigning confidence values to the predicted base pairs. AVAILABILITY: The C++ source code that implements the McCaskill-MEA algorithm and the test dataset used in this paper are available at http://www.ncrna.org/papers/McCaskillMEA/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
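The averaging step of the McCaskill-MEA method described above can be sketched in a few lines. This is a minimal illustration, not the published implementation: the function names and the dictionary format are hypothetical, the per-sequence base pairing probabilities are assumed to have been computed elsewhere, and treating a gapped sequence as contributing zero probability to a column pair is an assumption of this sketch.

```python
def column_map(aligned):
    """Map each ungapped position of a sequence to its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def average_bpp(alignment, per_seq_bpp):
    """Average per-sequence base pairing probabilities over alignment columns.

    alignment:   list of gapped sequences of equal length ('-' marks a gap).
    per_seq_bpp: one dict per sequence, mapping ungapped 0-based pairs (i, j)
                 to probabilities (assumed to be precomputed).
    Returns a dict keyed by pairs of alignment columns.
    """
    total = {}
    for aligned, bpp in zip(alignment, per_seq_bpp):
        cols = column_map(aligned)
        for (i, j), p in bpp.items():
            key = (cols[i], cols[j])
            total[key] = total.get(key, 0.0) + p
    # Divide by the number of sequences; sequences without a pair at a given
    # column pair (including gapped ones) contribute zero in this sketch.
    count = len(alignment)
    return {key: s / count for key, s in total.items()}
```

For example, with the alignment `["G-AAC", "GAAAC"]`, a pair (0, 3) in the first sequence and a pair (0, 4) in the second both map to the column pair (0, 4), so their probabilities are averaged into a single entry of the alignment-level matrix.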
Abstract: There is an abundance of transcripts that code for no particular protein and that remain functionally uncharacterized. Some of these transcripts may have novel functions while others might be junk transcripts. Unfortunately, the experimental validation of such transcripts to find functional non-coding RNA candidates is very costly. Therefore, our primary interest is to computationally mine candidate functional transcripts from a pool of uncharacterized transcripts. We introduce fRNAdb: a novel database service that hosts a large collection of non-coding transcripts including annotated/non-annotated sequences from the H-inv database, NONCODE and RNAdb. A set of computational analyses have been performed on the included sequences. These analyses include RNA secondary structure motif discovery, EST support evaluation, cis-regulatory element search, protein homology search, etc. fRNAdb provides an efficient interface to help users filter out particular transcripts under their own criteria to sort out functional RNA candidates. fRNAdb is available at http://www.ncrna.org/
Abstract: MOTIVATION: The functions of non-coding RNAs are strongly related to their secondary structures, but it is known that a secondary structure prediction of a single sequence is not reliable. Therefore, we have to collect similar RNA sequences with a common secondary structure for the analyses of a new non-coding RNA without knowing the exact secondary structure itself. Accordingly, sequence comparison in searching for similar RNAs should consider not only their sequence similarities but also their potential secondary structures. Sankoff's algorithm predicts the common secondary structures of the sequences, but it is computationally too expensive to apply to large-scale analyses. Because we often want to compare a large number of cDNA sequences or to search for similar RNAs in whole genome sequences, much faster algorithms are required. RESULTS: We propose a new method of comparing RNA sequences based on the structural alignments of the fixed-length fragments of the stem candidates. The implemented software, SCARNA (Stem Candidate Aligner for RNAs), is fast enough to apply to long sequences in large-scale analyses. The accuracy of the alignments is better than or comparable to that of the much slower existing algorithms. AVAILABILITY: The web server of SCARNA with graphical structural alignment viewer is available at http://www.scarna.org/.
Abstract: BACKGROUND: Prediction of human cell response to anti-cancer drugs (compounds) from microarray data is a challenging problem, due to the noise properties of microarrays as well as the high variance of living cell responses to drugs. Hence there is a strong need for more practical and robust methods than standard methods for real-value prediction. RESULTS: We devised an extended version of the off-subspace noise-reduction (de-noising) method to incorporate heterogeneous network data such as sequence similarity or protein-protein interactions into a single framework. Using that method, we first de-noise the gene expression data for training and test data and also the drug-response data for training data. Then we predict the unknown responses of each drug from the de-noised input data. To ascertain whether de-noising improves prediction, we carry out 12-fold cross-validation to assess the prediction performance. We use Pearson's correlation coefficient between the true and predicted response values as the measure of prediction performance. De-noising improves the prediction performance for 65% of drugs. Furthermore, we found that this noise reduction method is robust and effective even when a large amount of artificial noise is added to the input data. CONCLUSION: We found that our extended off-subspace noise-reduction method combining heterogeneous biological data is successful and quite useful to improve prediction of human cell cancer drug responses from microarray data.
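The evaluation protocol mentioned above, 12-fold cross-validation scored by Pearson's correlation coefficient between true and predicted responses, can be sketched as follows. This is only an illustration of the scoring scheme: the de-noising method and the predictor themselves are not shown, and the function names and the round-robin fold assignment are assumptions of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def k_fold_indices(n_samples, k=12):
    """Yield (train, test) index lists; sample i is assigned to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test
```

In use, a model would be fitted on each `train` split, its predictions collected on the matching `test` split, and `pearson` applied to the true and predicted response values for each drug.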
Abstract: MOTIVATION: In detection of non-coding RNAs, it is often necessary to identify the secondary structure motifs from a set of putative RNA sequences. Most of the existing algorithms aim to provide the best motif or a few good motifs, but biologists often need to inspect all the possible motifs thoroughly. RESULTS: Our method RNAmine employs a graph theoretic representation of RNA sequences and detects all the possible motifs exhaustively using a graph mining algorithm. The motif detection problem boils down to finding frequently appearing patterns in a set of directed and labeled graphs. In the tasks of common secondary structure prediction and local motif detection from long sequences, our method compared favorably both in accuracy and in efficiency with the state-of-the-art methods such as CMFinder. AVAILABILITY: The software is available upon request.
Abstract: Aspergillus fumigatus is exceptional among microorganisms in being both a primary and opportunistic pathogen as well as a major allergen. Its conidia production is prolific, and so human respiratory tract exposure is almost constant. A. fumigatus is isolated from human habitats and vegetable compost heaps. In immunocompromised individuals, the incidence of invasive infection can be as high as 50% and the mortality rate is often about 50% (ref. 2). The interaction of A. fumigatus and other airborne fungi with the immune system is increasingly linked to severe asthma and sinusitis. Although the burden of invasive disease caused by A. fumigatus is substantial, the basic biology of the organism is mostly obscure. Here we show the complete 29.4-megabase genome sequence of the clinical isolate Af293, which consists of eight chromosomes containing 9,926 predicted genes. Microarray analysis revealed temperature-dependent expression of distinct sets of genes, as well as 700 A. fumigatus genes not present or significantly diverged in the closely related sexual species Neosartorya fischeri, many of which may have roles in the pathogenicity phenotype. The Af293 genome sequence provides an unparalleled resource for the future understanding of this remarkable fungus.
Abstract: The genome of Aspergillus oryzae, a fungus important for the production of traditional fermented foods and beverages in Japan, has been sequenced. The ability to secrete large amounts of proteins and the development of a transformation system have facilitated the use of A. oryzae in modern biotechnology. Although both A. oryzae and Aspergillus flavus belong to the section Flavi of the subgenus Circumdati of Aspergillus, A. oryzae, unlike A. flavus, does not produce aflatoxin, and its long history of use in the food industry has proved its safety. Here we show that the 37-megabase (Mb) genome of A. oryzae contains 12,074 genes and is expanded by 7-9 Mb in comparison with the genomes of Aspergillus nidulans and Aspergillus fumigatus. Comparison of the three aspergilli species revealed the presence of syntenic blocks and A. oryzae-specific blocks (lacking synteny with A. nidulans and A. fumigatus) in a mosaic manner throughout the genome of A. oryzae. The blocks of A. oryzae-specific sequence are enriched for genes involved in metabolism, particularly those for the synthesis of secondary metabolites. Specific expansion of genes for secretory hydrolytic enzymes, amino acid metabolism and amino acid/sugar uptake transporters supports the idea that A. oryzae is an ideal microorganism for fermentation.
Abstract: We examined the statistical performance of clustering single particle molecular images by bottom-up clustering, a hierarchical algorithm, using simulated protein images with a low signal-to-noise ratio. Using covariance for the measure of similarity together with the iterative alignment, our method was found to be fairly robust against noise. Clustering tests of four known protein structures were performed at three levels of noise and with three levels of smoothing. A significant effect of smoothing was confirmed in our results for images with noise suggesting an effective degree of smoothing depending on the noise and structural features of the target molecule. The consistency of clustering results was evaluated by the average solid angle of projection, and the precision of our clustering results was checked by the average image correlation between the obtained cluster image and the true projection. Once image features are extracted appropriately, the average solid angle also represents the degree of clustering precision.
Abstract: The aspergilli comprise a diverse group of filamentous fungi spanning over 200 million years of evolution. Here we report the genome sequence of the model organism Aspergillus nidulans, and a comparative study with Aspergillus fumigatus, a serious human pathogen, and Aspergillus oryzae, used in the production of sake, miso and soy sauce. Our analysis of genome structure provided a quantitative evaluation of forces driving long-term eukaryotic genome evolution. It also led to an experimentally validated model of mating-type locus evolution, suggesting the potential for sexual reproduction in A. fumigatus and A. oryzae. Our analysis of sequence conservation revealed over 5,000 non-coding regions actively conserved across all three species. Within these regions, we identified potential functional elements including a previously uncharacterized TPP riboswitch and motifs suggesting regulation in filamentous fungi by Puf family genes. We further obtained comparative and experimental evidence indicating widespread translational regulation by upstream open reading frames. These results enhance our understanding of these widely studied fungi as well as provide new insight into eukaryotic genome evolution and gene regulation.
Abstract: MOTIVATION: Inferring networks of proteins from biological data is a central issue of computational biology. Most network inference methods, including Bayesian networks, take unsupervised approaches in which the network is totally unknown in the beginning, and all the edges have to be predicted. A more realistic supervised framework, proposed recently, assumes that a substantial part of the network is known. We propose a new kernel-based method for supervised graph inference based on multiple types of biological datasets such as gene expression, phylogenetic profiles and amino acid sequences. Notably, our method assigns a weight to each type of dataset and thereby selects informative ones. Data selection is useful for reducing data collection costs. For example, when a similar network inference problem must be solved for other organisms, the dataset excluded by our algorithm need not be collected. RESULTS: First, we formulate supervised network inference as a kernel matrix completion problem, where the inference of edges boils down to estimation of missing entries of a kernel matrix. Then, an expectation-maximization algorithm is proposed to simultaneously infer the missing entries of the kernel matrix and the weights of multiple datasets. By introducing the weights, we can integrate multiple datasets selectively and thereby exclude irrelevant and noisy datasets. Our approach is favorably tested in two biological networks: a metabolic network and a protein interaction network. AVAILABILITY: Software is available on request.
Abstract: MOTIVATION: Genomic and proteomic approaches have accumulated a huge amount of data which provide clues to protein function. However, interpreting single omic data for predicting uncharacterized protein functions has been a challenging task, because the data contain a lot of false positives. To overcome this problem, methods for integrating data from various omic approaches are needed for more accurate function prediction. RESULT: In this paper, we have developed a method which extracts functionally similar proteins with high confidence by integrating protein-protein interaction data and domain information. We used this method to analyze publicly available data from Saccharomyces cerevisiae. We identified 1042 functional associations, involving 765 proteins of which 98 (12.8%) had no previously ascribed function. Our method extracts functionally similar protein pairs more accurately than conventional methods, and predicting function for previously uncharacterized proteins can be achieved. Our method can of course be applied to protein-protein interaction data for any species.
Abstract: MOTIVATION: The relations between the promoter sequences and their strengths were extensively studied in the 1980s. Although these studies uncovered strong sequence-strength correlations, the cost of their elaborate experimental methods has been too high for them to be applied to a large number of promoters. In contrast, a recent increase in the microarray data allows us to compare thousands of gene expressions with their DNA sequences. RESULTS: We studied the relations between the promoter sequences and their strengths using the Escherichia coli microarray data. We modeled those relations using a simple weight matrix, which was optimized with a novel support vector regression method. It was observed that several non-consensus bases in the '-35' and '-10' regions of promoter sequences act positively on the promoter strength and that certain consensus bases have a minor effect on the strength. We analyzed outliers for which the observed gene expressions deviate from the promoter strength predictions, and identified several genes with enhanced expressions due to multiple promoters and genes under strong regulation by transcription factors. Our method is applicable to other procaryotes for which both the promoter sequences and the microarray data are available.
Abstract: Single-particle analysis is one of the methods for structural studies of proteins and macromolecules; it requires advanced image analysis of electron micrographs. Reconstructing a three-dimensional (3D) structure from microscope images is not easy because of the low resolution of the images and the lack of information about their orientation in the 3D structure. To improve the resolution, different projections are aligned, classified, and averaged. Inferring the orientations of these images is so difficult that the task of reconstructing 3D structures depends upon the experience of researchers. Recently, however, a method to reconstruct 3D structures automatically was devised. In this paper, we propose a new method for determining Euler angles of projections by applying genetic algorithms. We empirically show that the proposed approach improves on the previous one in terms of computational time and acquired precision.
Abstract: The data processing language in a graphical software tool that manages sequence annotation data from genome databases should provide flexible functions for the tasks in molecular biology research. Among currently available languages we adopted the Lua programming language. It fulfills our requirements to perform computational tasks for sequence map layouts, i.e. the handling of data containers, symbolic reference to data, and a simple programming syntax. Upon importing a foreign file, the original data are first decomposed in the Lua language while maintaining the original data schema. The converted data are parsed by the Lua interpreter and the contents are stored in our data warehouse. Then, portions of annotations are selected and arranged into our catalog format to be depicted on the sequence map. Our sequence visualization program was successfully implemented, embedding the Lua language for processing of annotation data and layout script. The program is available at http://staff.aist.go.jp/yutaka.ueno/guppy/.
Abstract: Single particle analysis is one of the methods for structural studies of proteins and macromolecules, developed within image analysis for electron microscopy. Reconstructing a 3D structure from microscope images is not easy because of the low resolution of the images and the lack of information about their orientation in the 3D structure. To improve the resolution, different projections are aligned, classified and averaged. Inferring the orientations of these images is so difficult that the task of reconstructing 3D structures depends upon the experience of researchers. Recently, however, a method to reconstruct 3D structures automatically was devised. In this paper, we propose a new method for determining Euler angles of projections by applying Genetic Algorithms (GAs). We empirically show that the proposed approach improves on the previous one in terms of computational time and acquired precision.
Abstract: A molecular structure viewer program, MOSBY, has been developed for studies that use atomic coordinates to understand the structures of protein molecules. The program is designed to be portable, with a comprehensive user interface provided by our high-throughput graphics library. In addition, it cooperates with extension modules customized for individual research topics and analyses. For example, an electron density module loads and displays electron density maps derived from X-ray crystallographic analysis, superimposed on an atomic model. A molecular dynamics module reads a trajectory file containing the results of molecular dynamics calculations and animates the structure. These plug-in modules are devised to function without modification to the MOSBY program. For variations of analysis and calculations with atomic coordinates, the portability and extensibility illustrated by MOSBY play an important role in scientific computational tools with active software development.
Abstract: MOTIVATION: A new method for finding subtle patterns in sequences is introduced. It approximates the multiple correlations among residues with pair-wise correlations, with a learning cost of O(m^2 n), where n is the number of training sequences, each of length m. The method is suited to modeling splicing sites in human DNA, which are reported to have higher-order dependencies. RESULTS: In computational experiments, the prediction accuracy of our model was shown to surpass that of previously reported Markov models for the prediction of acceptor sites in human. AVAILABILITY: The C++ source code is available on request from the authors.
Abstract: MOTIVATION: Kernel methods such as support vector machines require a kernel function between objects to be defined a priori. Several works have been done to derive kernels from probability distributions, e.g., the Fisher kernel. However, a general methodology to design a kernel is not fully developed. RESULTS: We propose a reasonable way of designing a kernel when objects are generated from latent variable models (e.g., HMM). First of all, a joint kernel is designed for complete data which include both visible and hidden variables. Then a marginalized kernel for visible data is obtained by taking the expectation with respect to hidden variables. We will show that the Fisher kernel is a special case of marginalized kernels, which gives another viewpoint to the Fisher kernel theory. Although our approach can be applied to any object, we particularly derive several marginalized kernels useful for biological sequences (e.g., DNA and proteins). The effectiveness of marginalized kernels is illustrated in the task of classifying bacterial gyrase subunit B (gyrB) amino acid sequences.
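The marginalization step described in the abstract above can be illustrated with a minimal sketch. It assumes a discrete hidden variable with a small number of states, posteriors p(h|x) that are already available (for example from an HMM), and a joint kernel given as a plain Python callable. The function name `marginalized_kernel` and the toy joint kernel are hypothetical and are not taken from the paper.

```python
def marginalized_kernel(post_x, post_y, joint_kernel):
    """Expectation of a joint kernel over hidden states.

    post_x[h] is p(h | x) and post_y[g] is p(g | x') for discrete hidden
    states h, g. joint_kernel(h, g) is the kernel value for the complete
    data (x, h) and (x', g); the visible parts are assumed to be folded
    into the callable.
    """
    total = 0.0
    for h, p_h in enumerate(post_x):
        for g, p_g in enumerate(post_y):
            total += p_h * p_g * joint_kernel(h, g)
    return total


# Toy joint kernel (an assumption): 1 when the hidden states agree, else 0.
def same_state(h, g):
    return 1.0 if h == g else 0.0


k = marginalized_kernel([0.9, 0.1], [0.8, 0.2], same_state)
print(k)
```

With this toy joint kernel the result is simply the probability that the two objects share a hidden state. For sequences, the hidden variable would be a whole state path, so a practical implementation would need dynamic programming rather than explicit enumeration.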
Abstract: We present novel kernels that measure similarity of two RNA sequences, taking account of their secondary structures. Two types of kernels are presented. One is for RNA sequences with known secondary structures, the other for those without known secondary structures. The latter employs stochastic context-free grammar (SCFG) for estimating the secondary structure. We call the latter the marginalized count kernel (MCK). We show computational experiments for MCK using 74 sets of human tRNA sequence data: (i) kernel principal component analysis (PCA) for visualizing tRNA similarities, (ii) supervised classification with support vector machines (SVMs). Both types of experiment show promising results for MCKs.
Abstract: Voltage-sensitive membrane channels, the sodium channel, the potassium channel and the calcium channel operate together to amplify, transmit and generate electric pulses in higher forms of life. Sodium and calcium channels are involved in cell excitation, neuronal transmission, muscle contraction and many functions that relate directly to human diseases. Sodium channels--glycosylated proteins with a relative molecular mass of about 300,000 (ref. 5)--are responsible for signal transduction and amplification, and are chief targets of anaesthetic drugs and neurotoxins. Here we present the three-dimensional structure of the voltage-sensitive sodium channel from the eel Electrophorus electricus. The 19 Å structure was determined by helium-cooled cryo-electron microscopy and single-particle image analysis of the solubilized sodium channel. The channel has a bell-shaped outer surface of 135 Å in height and 100 Å in side length at the square-shaped bottom, and a spherical top with a diameter of 65 Å. Several inner cavities are connected to four small holes and eight orifices close to the extracellular and cytoplasmic membrane surfaces. Homologous voltage-sensitive calcium and tetrameric potassium channels, which regulate secretory processes and the membrane potential, may possess a related structure.
Abstract: Single particle analysis is a straightforward method for studying the structures of macromolecules that cannot be crystallized. It builds three-dimensional structures of particles by estimating the projection angles of their randomly oriented electron-microscopic images. The existing methods divide the images into clusters, build class averages for the clusters, and estimate the projection angle of each cluster. However, the clustering and the averaged images are highly sensitive to the choice of reference images and mask patterns for each cluster. Thus, the analyses are neither robust nor automatic, and their results depend heavily on the intuition and experience of researchers who set references. We have been developing a software system for single-particle analysis with new clustering and averaging algorithms for building the three-dimensional structures of target molecules. In this paper, we focus on the algorithms for the robust image-processing of the electron microscopic images in our system.
Abstract: In this paper, we evaluated the complexity and accuracy of the dicodon model for gene finding using a Hidden Markov Model with Self-Identification Learning. We used five different models as competitors, each with a smaller parameter space than the dicodon model. Our evaluation shows that the dicodon model outperforms the competitors in terms of both sensitivity and specificity. This result indicates that the dicodon model cannot be represented by a combination of the amino-acid pair, the codon usage, and the G+C content.
Abstract: We have constructed a general framework for integrating application programs with control through a local Web browser. This method is based on a simple inter-process message function from an external process to application programs. Commands to a target program are prepared in a script file, which is parsed by a message dispatcher program. When the dispatcher is used as a helper application to a Web browser, these messages are sent from the browser by clicking a hyperlink in a Web document. Our framework also supports pluggable extension modules for application programs by means of dynamic linking. A prototype system is implemented on our molecular structure-viewer program, MOSBY. It successfully featured a function to load, from a Web page, an extension module required for the docking study of molecular fragments. Our simple framework facilitates the concise configuration of Web software without detailed knowledge of network computation and security issues. It is also applicable to a wide range of network computations that process private data using a Web browser.
Abstract: A precursor is a compound which is transformed to a class of functional molecules within short steps. It is an important process in the production of natural drugs to decide whether a given compound is a precursor or not. We present two strategies to select precursor compounds in the secondary metabolism of terpenoids: one is to find the packing of basic molecules in the given cyclic structure, and the other is to find the synthetic map of the given set of compounds. Both strategies play important roles in reproducing tracer experiments on a computer.
Abstract: A gene finding system, GeneDecoder, based on a parsing technique using a stochastic grammar and dictionary of genetic words is introduced. The structure of human genes is expressed by a stochastic grammar and a dictionary, whose components are the genetic words consisting of genetic phonemes, built as hidden Markov models (HMMs). The HMMs represent the nucleotide bases, the codons, and the amino acids. The genetic words in the dictionary are described by the sequence of these HMMs and represent exons, introns, intergenic regions, tRNA regions and signals in DNA sequences. The statistics between these regions are expressed by the grammar, which is a stochastic network of the genetic words. Using the same kind of technique as speech recognition by HMMs with a word dictionary and a grammar, the stochastic network of genetic words enables the motif dictionary to be used during the parsing of the DNA sequences. At the same time, stochastic features of donor/acceptor sites, information of the di-codon statistics, and other important features are integrated into stochastic scores during the parsing. As a result, while the system parses DNA sequences and finds the exon/intron structures, the protein motifs are automatically annotated in the regions. It helps to identify the functions of the genes and reduces the cost of homology search for each hypothetical coding region. This method is different from simply using the information of homology search. This method uses the information of the motif patterns during the parsing process, whereas searching for the motif patterns before or after finding the coding regions cannot directly affect the parsing process itself. Experimental results have shown that this method reasonably finds and annotates the motifs in the exons in human DNA sequences.
Abstract: We have equipped our graphics library with efficient functions so that a molecular structure viewer program can provide both portability and high-throughput rendering without hardware acceleration. The library renders graphics primitives into off-screen image memory with novel functions such as a point list for the vertices of three-dimensional graphical primitives, scan conversions of sphere and cylinder primitives, and z-buffered bit-block transfer. A molecular structure viewer program was implemented with the graphics library, giving reasonable rendering performance on conventional UNIX workstations with the X-Window system. The use of dynamic linking also lends a flexible extension facility to this software system. An advanced polygon renderer can be provided as a plug-in style extension module, as can other functional modules that are specific to the application program. The design of our graphics library is not only effective in molecular graphics but is also applicable to general three-dimensional graphics software systems.
Abstract: MOTIVATION: Automatic extraction of motifs that occur frequently on a set of unaligned DNA sequences is useful for predicting the binding sites of unknown transcription factors. Several programs for this purpose have been released. However, in our opinion, they are not practical enough to be applied to a large number of upstream sequences. RESULTS: We propose a new program called YEBIS (Yet another Environment for the analysis of BIopolymer Sequences) which is capable of extracting a set of motifs, without any a priori knowledge, from a number of functionally related DNA sequences. Using the hidden Markov model, these motifs are represented in a more general form than other conventional methods, such as the weight matrix method. When applied to several sets of benchmark data, it was found that YEBIS had comparable capability to the existing methods, but was much faster. Moreover, it could extract all known motifs from the LTR sequences (long terminal repeat sequences) in a single run. Finally, it could be successfully applied to approximately 400 human promoter sequences and some of the extracted motifs turned out to be known cis-elements. Therefore, YEBIS could be a practical tool for exploring the upstream sequences of genomic ORFs, some of which are regulated in a similar fashion. AVAILABILITY: YEBIS will be distributed to academic users free of charge. All requests should be sent to the address below. CONTACT: E-MAIL: yada@tokyo.jst.go.jp
Abstract: In this paper, we propose a new approach for gene recognition, which uses no training data for the recognizer. In this approach, we start from a simple model that uses only the knowledge of start codons and stop codons; then the recognition of the DNA sequences by the recognizer and the training of the recognizer's parameters on the result of the recognition are repeated. We applied this parse-and-train approach to the complete genome sequence of a cyanobacterium, and achieved almost the same recognition rate as when the whole sequence is used as training data. These results open the possibility of using an automatic gene annotation system in the early stages of sequencing projects.
Abstract: A new software configuration method using plug-in style components was established to support the incremental development of software tools used in protein structural studies. Our memory database provides the interface for data and functions between plug-in modules and their host program. A molecular structure browser program was developed together with several plug-in modules on our programming library, which maintains graphics portability and user interfaces. This plug-in software architecture is generally useful for large-scale software development and for prototyping parts of the system.
Abstract: We have developed a method to extract the signal patterns in DNA sequences. In this method, the Genetic Algorithm (GA) and the Baum-Welch algorithm are used to obtain the best Hidden Markov Model (HMM) representations of the signal patterns in DNA sequences. The GA is used to search for the best network shapes and the initial parameters of the HMMs. The Baum-Welch algorithm is used to optimize the HMM parameters for the given network shapes. The Akaike Information Criterion (AIC), which gives a criterion for the balance of adaptation and complexity of a model, is applied in the HMM evaluation. We have applied the method to the extraction of the signal patterns in human promoters and 5' ends of yeast introns. As a result, we obtained HMM representations of characteristic features in these sequences. To validate the efficiency of the method, we have performed promoter recognition using the obtained HMMs. Two entries including nine promoters were selected from GenBank 76.0, and the HMM was observed to predict eight promoters correctly. These results imply that the method is efficient for designing preferable HMM networks, and provides reliable models for the recognition of the signal patterns.
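As a small illustration of how AIC trades fit against model size in the abstract above, the sketch below scores two candidate models using the standard definition AIC = -2 log L + 2k, where L is the maximized likelihood and k is the number of free parameters. The candidate names, log-likelihoods, and parameter counts are made-up values for illustration only; how the paper counts HMM parameters is not stated in the abstract.

```python
def aic(log_likelihood, num_params):
    """Akaike Information Criterion: lower values indicate a better balance
    between goodness of fit and model complexity."""
    return -2.0 * log_likelihood + 2.0 * num_params


# Hypothetical candidates: (maximized log-likelihood, number of free parameters).
candidates = {
    "small_network": (-120.0, 10),
    "large_network": (-115.0, 30),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Here the larger network fits slightly better but pays a higher complexity penalty, so the smaller network has the lower AIC. In a GA-based search such as the one described, a score of this kind could serve as the fitness of each candidate network shape.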
Abstract: The applicability of the Multi-Scale Structure Description (MSSD) scheme to inverse-folding problems was investigated. An MSSD represents a 3D protein structure with multiple symbolic sequences, where fine structures are represented by the sequences at low levels, middle-scale structural motifs at middle levels, and global topology at high levels. Each symbol in a symbolic sequence denotes a type of local structure at the scale of that level. The structure fragments are classified at each scale level according to their shape and the environment around them: how the structure is exposed to the solvent or buried in the molecule. I modeled the propensity of an amino-acid sequence for the structure fragment type (i.e., the primary constraint) at each scale level. The local propensity is, therefore, modeled at small-scale (low) levels, while the global propensity is modeled at large-scale (high) levels. Thus, by superposing all the primary constraints, a 3D protein structure yields an amino-acid sequence profile. By evaluating the fit of an amino-acid sequence to the profile derived from a known 3D protein structure, we can identify which 3D structure the given amino-acid sequence would fold into. I checked whether a sequence identifies its own structure for over two hundred protein sequences. In many cases, an amino-acid sequence identified its own 3D protein structure.
Abstract: There are many shared attributes between existing iterative aligners and Hidden Markov Model (HMM). A learning algorithm of HMM called Viterbi is the same as the iteration of DP-matching of iterative aligners. HMM aligners can use the result of an iterative aligner initially, incorporate the similarity score of amino acids, and apply the detailed gap cost systems to improve the matching accuracy. On the other hand, the iterative aligner can inherit the modeling capability of HMM, and provide the better representation of the proteins than motifs. In this paper, we present an overview of several iterative aligners which include the parallel iterative aligner of ICOT and the HMM aligner of Haussler's group. We compare the merits and shortcomings of these aligners. This comparison enables us to formulate a better, more advanced aligner through proper integration of the iterative technique and HMM technique.
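To make the dynamic-programming analogy in the abstract above concrete, here is a minimal sketch of the Viterbi recursion, which finds the most probable hidden-state path for an observation sequence. The two states, the symbols, and all probabilities are toy assumptions, not values from any of the aligners discussed. The sketch shows only the decoding step and works with raw probabilities; a real aligner would use log scores to avoid underflow.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability) for obs."""
    # best[t][s]: probability of the best path that ends in state s at step t.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Choose the predecessor that maximizes the path probability.
            prev, score = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model (all numbers are assumptions for illustration).
states = ("A", "B")
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}

path, prob = viterbi(["x", "x", "y", "y"], states, start_p, trans_p, emit_p)
print(path, prob)
```

The inner maximization over predecessors plays the same role as the cell update in DP matching, which is the correspondence the abstract points to.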
Abstract: We propose a novel description scheme of protein backbone conformation that can model the important factors of protein structure formation, such as global interaction and geometric constraints. This description scheme represents a protein conformation with several symbolic sequences at multiple levels of abstraction. Each symbol in a sequence denotes the class of abstracted topology of a subconformation with the size specific to the level. Low-level sequences of this description represent fine structures at high resolution, and high-level sequences represent the abstracted topologies at large scale. The classification of protein backbone subconformations of various sizes is the most important basis for this description scheme. This has never been tried so far because of the complexity of dealing with the number of degrees of freedom in subconformations. However, the proposed technique solved this problem by abstracting the topology of middle- and large-scale subconformations. This linear expansion technique extracts a fixed number of parameters as the expansion coefficients from the coordinate representation of subconformations. In this case, a simple reverse transformation from the expansion coefficients reconstructs the three-dimensional topology of a subconformation. The analysis of the relation between the primary structure of a region and the subconformation of that region at each level in this description helps to model both local and global interactions of protein structure formation. Further, the statistical analysis of overlapping patterns of two subconformations models the geometric constraints that are important for a structure prediction system in generating a conformation that is geometrically sound.
Abstract: The purpose of this paper is to introduce a new method for analyzing the amino acid sequences of proteins using the hidden Markov model (HMM), which is a type of stochastic model. Secondary structures such as helix, sheet and turn are learned by HMMs, and these HMMs are applied to new sequences whose structures are unknown. The output probabilities from the HMMs are used to predict the secondary structures of the sequences. The authors tested this prediction system on approximately 100 sequences from a public database (Brookhaven PDB). Although the implementation is 'without grammar' (no rule for the appearance patterns of secondary structure) the result was reasonable.