Kengo Sato - Publications List

Journal articles

2010

DOI
PMID

Michiaki Hamada, Kengo Sato, Kiyoshi Asai (2010) Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res Sep

Abstract: Considerable attention has been focused on predicting the secondary structure for aligned RNA sequences since it is useful not only for improving the limiting accuracy of conventional secondary structure prediction but also for finding non-coding RNAs in genomic sequences. Although there exist many algorithms of predicting secondary structure for aligned RNA sequences, further improvement of the accuracy is still awaited. In this article, toward improving the accuracy, a theoretical classification of state-of-the-art algorithms of predicting secondary structure for aligned RNA sequences is presented. The classification is based on the viewpoint of maximum expected accuracy (MEA), which has been successfully applied in various problems in bioinformatics. The classification reveals several disadvantages of the current algorithms but we propose an improvement of a previously introduced algorithm (CentroidAlifold). Finally, computational experiments strongly support the theoretical classification and indicate that the improved CentroidAlifold substantially outperforms other algorithms.

Notes:

DOI
PMID

Yuki Kato, Kengo Sato, Michiaki Hamada, Yoshihide Watanabe, Kiyoshi Asai, Tatsuya Akutsu (2010) RactIP: fast and accurate prediction of RNA-RNA interaction using integer programming. Bioinformatics 26: 18. i460-i466 Sep

Abstract: MOTIVATION: Considerable attention has been focused on predicting RNA-RNA interaction since it is a key to identifying possible targets of non-coding small RNAs that regulate gene expression post-transcriptionally. A number of computational studies have so far been devoted to predicting joint secondary structures or binding sites under a specific class of interactions. In general, there is a trade-off between range of interaction type and efficiency of a prediction algorithm, and thus efficient computational methods for predicting comprehensive type of interaction are still awaited. RESULTS: We present RactIP, a fast and accurate prediction method for RNA-RNA interaction of general type using integer programming. RactIP can integrate approximate information on an ensemble of equilibrium joint structures into the objective function of integer programming using posterior internal and external base-paring probabilities. Experimental results on real interaction data show that prediction accuracy of RactIP is at least comparable to that of several state-of-the-art methods for RNA-RNA interaction prediction. Moreover, we demonstrate that RactIP can run incomparably faster than competitive methods for predicting joint secondary structures. AVAILABILITY: RactIP is implemented in C++, and the source code is available at http://www.ncrna.org/software/ractip/ CONTACT: ykato@kuicr.kyoto-u.ac.jp; satoken@k.u-tokyo.ac.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Notes:

PMID

Yohei Okada, Kengo Sato, Yasubumi Sakakibara (2010) Improvement of structure conservation index with centroid estimators. Pac Symp Biocomput 88-97

Abstract: RNAz, a support vector machine (SVM) approach for identifying functional non-coding RNAs (ncRNAs), has been proven to be one of the most accurate tools for this goal. Among the measurements used in RNAz, the Structure Conservation Index (SCI) which evaluates the evolutionary conservation of RNA secondary structures in terms of folding energies, has been reported to have an extremely high discrimination capability. However, for practical use of RNAz on the genome-wide search, a relatively high false discovery rate has unfortunately been estimated. It is conceivable that multiple alignments produced by a standard aligner that does not consider any secondary structures are not suitable for identifying ncRNAs in some cases and incur high false discovery rate. In this study, we propose C-SCI, an improved measurement based on the SCI applying gamma-centroid estimators to incorporate the robustness against low quality multiple alignments. Our experiments show that the C-SCI achieves higher accuracy than the original SCI for not only human-curated structural alignments but also low quality alignments produced by CLUSTAL W. Furthermore, the accuracy of the C-SCI on CLUSTAL W alignments is comparable with that of the original SCI on structural alignments generated with RAF for which 4.7-fold expensive computational time is required on average.

Notes:

2009

PMID

Kengo Sato, Yutaka Saito, Yasubumi Sakakibara (2009) Gradient-based optimization of hyperparameters for base-pairing profile local alignment kernels. Genome Inform 23: 1. 128-138 Oct

Abstract: We have recently proposed novel kernel functions, called base-pairing profile local alignment (BPLA) kernels for discrimination and detection of functional RNA sequences using SVMs. We employ STRAL's scoring function which takes into account sequence similarities as well as upstream and downstream base-pairing probabilities, which enables us to model secondary structures of RNA sequences. In this paper, we develop a method for optimizing hyperparameters of BPLA kernels with respect to discrimination accuracy using a gradient-based optimization technique. Our experiments show that the proposed method can find a nearly optimal set of parameters much faster than the grid search on all parameter combinations.

Notes:

DOI
PMID

Kensuke Morita, Yutaka Saito, Kengo Sato, Kotaro Oka, Kohji Hotta, Yasubumi Sakakibara (2009) Genome-wide searching with base-pairing kernel functions for noncoding RNAs: computational and expression analysis of snoRNA families in Caenorhabditis elegans. Nucleic Acids Res 37: 3. 999-1009 Feb

Abstract: Despite the accumulating research on noncoding RNAs (ncRNAs), it is likely that we are seeing only the tip of the iceberg regarding our understanding of the functions and the regulatory roles served by ncRNAs in cellular metabolism, pathogenesis and host-pathogen interactions. Therefore, more powerful computational and experimental tools for analyzing ncRNAs need to be developed. To this end, we propose novel kernel functions, called base-pairing profile local alignment (BPLA) kernels, for analyzing functional ncRNA sequences using support vector machines (SVMs). We extend the local alignment kernels for amino acid sequences in order to handle RNA sequences by using STRAL's; scoring function, which takes into account sequence similarities as well as upstream and downstream base-pairing probabilities, thus enabling us to model secondary structures of RNA sequences. As a test of the performance of BPLA kernels, we applied our kernels to the problem of discriminating members of an RNA family from nonmembers using SVMs. The results indicated that the discrimination ability of our kernels is stronger than that of other existing methods. Furthermore, we demonstrated the applicability of our kernels to the problem of genome-wide search of snoRNA families in the Caenorhabditis elegans genome, and confirmed that the expression is valid in 14 out of 48 of our predicted candidates by using qRT-PCR. Finally, highly expressed six candidates were identified as the original target regions by DNA sequencing.

Notes:

DOI
PMID

Michiaki Hamada, Kengo Sato, Hisanori Kiryu, Toutai Mituyama, Kiyoshi Asai (2009) Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics 25: 12. i330-i338 Jun

Abstract: MOTIVATION: Secondary structure prediction of RNA sequences is an important problem. There have been progresses in this area, but the accuracy of prediction from an RNA sequence is still limited. In many cases, however, homologous RNA sequences are available with the target RNA sequence whose secondary structure is to be predicted. RESULTS: In this article, we propose a new method for secondary structure predictions of individual RNA sequences by taking the information of their homologous sequences into account without assuming the common secondary structure of the entire sequences. The proposed method is based on posterior decoding techniques, which consider all the suboptimal secondary structures of the target and homologous sequences and all the suboptimal alignments between the target sequence and each of the homologous sequences. In our computational experiments, the proposed method provides better predictions than those performed only on the basis of the formation of individual RNA sequences and those performed by using methods for predicting the common secondary structure of the homologous sequences. Remarkably, we found that the common secondary predictions sometimes give worse predictions for the secondary structure of a target sequence than the predictions from the individual target sequence, while the proposed method always gives good predictions for the secondary structure of target sequences in all tested cases. AVAILABILITY: Supporting information and software are available online at: http://www.ncrna.org/software/centroidfold/ismb2009/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Notes:

DOI
PMID

Kengo Sato, Michiaki Hamada, Kiyoshi Asai, Toutai Mituyama (2009) CENTROIDFOLD: a web server for RNA secondary structure prediction. Nucleic Acids Res 37: Web Server issue. W277-W280 Jul

Abstract: The CENTROIDFOLD web server (http://www.ncrna.org/centroidfold/) is a web application for RNA secondary structure prediction powered by one of the most accurate prediction engine. The server accepts two kinds of sequence data: a single RNA sequence and a multiple alignment of RNA sequences. It responses with a prediction result shown as a popular base-pair notation and a graph representation. PDF version of the graph representation is also available. For a multiple alignment sequence, the server predicts a common secondary structure. Usage of the server is quite simple. You can paste a single RNA sequence (FASTA or plain sequence text) or a multiple alignment (CLUSTAL-W format) into the textarea then click on the 'execute CentroidFold' button. The server quickly responses with a prediction result. The major advantage of this server is that it employs our original CentroidFold software as its prediction engine which scores the best accuracy in our benchmark results. Our web server is freely available with no login requirement.

Notes:

DOI
PMID

Michiaki Hamada, Hisanori Kiryu, Kengo Sato, Toutai Mituyama, Kiyoshi Asai (2009) Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25: 4. 465-473 Feb

Abstract: MOTIVATION: Recent studies have shown that the methods for predicting secondary structures of RNAs on the basis of posterior decoding of the base-pairing probabilities has an advantage with respect to prediction accuracy over the conventionally utilized minimum free energy methods. However, there is room for improvement in the objective functions presented in previous studies, which are maximized in the posterior decoding with respect to the accuracy measures for secondary structures. RESULTS: We propose novel estimators which improve the accuracy of secondary structure prediction of RNAs. The proposed estimators maximize an objective function which is the weighted sum of the expected number of the true positives and that of the true negatives of the base pairs. The proposed estimators are also improved versions of the ones used in previous works, namely CONTRAfold for secondary structure prediction from a single RNA sequence and McCaskill-MEA for common secondary structure prediction from multiple alignments of RNA sequences. We clarify the relations between the proposed estimators and the estimators presented in previous works, and theoretically show that the previous estimators include additional unnecessary terms in the evaluation measures with respect to the accuracy. Furthermore, computational experiments confirm the theoretical analysis by indicating improvement in the empirical accuracy. The proposed estimators represent extensions of the centroid estimators proposed in Ding et al. and Carvalho and Lawrence, and are applicable to a wide variety of problems in bioinformatics. AVAILABILITY: Supporting information and the CentroidFold software are available online at: http://www.ncrna.org/software/centroidfold/.

Notes:

DOI
PMID

Michiaki Hamada, Kengo Sato, Hisanori Kiryu, Toutai Mituyama, Kiyoshi Asai (2009) CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics 25: 24. 3236-3243 Dec

Abstract: MOTIVATION: The importance of accurate and fast predictions of multiple alignments for RNA sequences has increased due to recent findings about functional non-coding RNAs. Recent studies suggest that maximizing the expected accuracy of predictions will be useful for many problems in bioinformatics. RESULTS: We designed a novel estimator for multiple alignments of structured RNAs, based on maximizing the expected accuracy of predictions. First, we define the maximum expected accuracy (MEA) estimator for pairwise alignment of RNA sequences. This maximizes the expected sum-of-pairs score (SPS) of a predicted alignment under a probability distribution of alignments given by marginalizing the Sankoff model. Then, by approximating the MEA estimator, we obtain an estimator whose time complexity is O(L(3)+c(2)dL(2)) where L is the length of input sequences and both c and d are constants independent of L. The proposed estimator can handle uncertainty of secondary structures and alignments that are obstacles in Bioinformatics because it considers all the secondary structures and all the pairwise alignments as input sequences. Moreover, we integrate the probabilistic consistency transformation (PCT) on alignments into the proposed estimator. Computational experiments using six benchmark datasets indicate that the proposed method achieved a favorable SPS and was the fastest of many state-of-the-art tools for multiple alignments of structured RNAs. AVAILABILITY: The software called CentroidAlign, which is an implementation of the algorithm in this article, is freely available on our website: http://www.ncrna.org/software/centroidalign/. CONTACT: hamada-michiaki@aist.go.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Notes:

2008

DOI
PMID

Kengo Sato, Kensuke Morita, Yasubumi Sakakibara (2008) PSSMTS: position specific scoring matrices on tree structures. J Math Biol 56: 1-2. 201-214 Jan

Abstract: Identifying non-coding RNA regions on the genome using computational methods is currently receiving a lot of attention. In general, it is essentially more difficult than the problem of detecting protein-coding genes because non-coding RNA regions have only weak statistical signals. On the other hand, most functional RNA families have conserved sequences and secondary structures which are characteristic of their molecular function in a cell. These are known as sequence motifs and consensus structures, respectively. In this paper, we propose an improved method which extends a pairwise structural alignment method for RNA sequences to handle position specific scoring matrices and hence to incorporate motifs into structural alignment of RNA sequences. To model sequence motifs, we employ position specific scoring matrices (PSSMs). Experimental results show that PSSMs enable us to find individual RNA families efficiently, especially if we have biological knowledge such as sequence motifs.

Notes:

DOI
PMID

Kengo Sato, Toutai Mituyama, Kiyoshi Asai, Yasubumi Sakakibara (2008) Directed acyclic graph kernels for structural RNA analysis. BMC Bioinformatics 9: 07

Abstract: BACKGROUND: Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity. RESULTS: We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering. CONCLUSION: Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.

Notes:

DOI
PMID

Kiyoshi Asai, Hisanori Kiryu, Michiaki Hamada, Yasuo Tabei, Kengo Sato, Hiroshi Matsui, Yasubumi Sakakibara, Goro Terai, Toutai Mituyama (2008) Software.ncrna.org: web servers for analyses of RNA sequences. Nucleic Acids Res 36: Web Server issue. W75-W78 Jul

Abstract: We present web servers for analysis of non-coding RNA sequences on the basis of their secondary structures. Software tools for structural multiple sequence alignments, structural pairwise sequence alignments and structural motif findings are available from the integrated web server and the individual stand-alone web servers. The servers are located at http://software.ncrna.org, along with the information for the evaluation and downloading. This website is freely available to all users and there is no login requirement.

Notes:

2007

PMID

Yasubumi Sakakibara, Kris Popendorf, Nana Ogawa, Kiyoshi Asai, Kengo Sato (2007) Stem kernels for RNA sequence analyses. J Bioinform Comput Biol 5: 5. 1103-1122 Oct

Abstract: Several computational methods based on stochastic context-free grammars have been developed for modeling and analyzing functional RNA sequences. These grammatical methods have succeeded in modeling typical secondary structures of RNA, and are used for structural alignment of RNA sequences. However, such stochastic models cannot sufficiently discriminate member sequences of an RNA family from nonmembers and hence detect noncoding RNA regions from genome sequences. A novel kernel function, stem kernel, for the discrimination and detection of functional RNA sequences using support vector machines (SVMs) is proposed. The stem kernel is a natural extension of the string kernel, specifically the all-subsequences kernel, and is tailored to measure the similarity of two RNA sequences from the viewpoint of secondary structures. The stem kernel examines all possible common base pairs and stem structures of arbitrary lengths, including pseudoknots between two RNA sequences, and calculates the inner product of common stem structure counts. An efficient algorithm is developed to calculate the stem kernels based on dynamic programming. The stem kernels are then applied to discriminate members of an RNA family from nonmembers using SVMs. The study indicates that the discrimination ability of the stem kernel is strong compared with conventional methods. Furthermore, the potential application of the stem kernel is demonstrated by the detection of remotely homologous RNA families in terms of secondary structures. This is because the string kernel is proven to work for the remote homology detection of protein sequences. These experimental results have convinced us to apply the stem kernel in order to find novel RNA families from genome sequences.

Notes:

2005

DOI
PMID

Kengo Sato, Yasubumi Sakakibara (2005) RNA secondary structural alignment with conditional random fields. Bioinformatics 21 Suppl 2: ii237-ii242 Sep

Abstract: MOTIVATION: The computational identification of non-coding RNA regions on the genome is currently receiving much attention. However, it is essentially harder than gene-finding problems for protein-coding regions because non-coding RNA sequences do not have strong statistical signals. Since comparative sequence analysis is effective for non-coding RNA detection, efficient computational methods are expected for structural alignment of RNA sequences. Several methods have been proposed to accomplish the structural alignment tasks for RNA sequences, and we found that one of the most important points is to estimate an accurate score matrix for calculating structural alignments. RESULTS: We propose a novel approach for RNA structural alignment based on conditional random fields (CRFs). Our approach has some specific features compared with previous methods in the sense that the parameters for structural alignment are estimated such that the model can most probably discriminate between correct alignments and incorrect alignments, and has the generalization ability so that a satisfiable score matrix can be obtained even with a small number of sample data without overfitting. Experimental results clearly show that the parameter estimation with CRFs can outperform all the other existing methods for structural alignments of RNA sequences. Furthermore, structural alignment search based on CRFs is more accurate for predicting non-coding RNA regions than the other scoring methods. These experimental results strongly support our discriminative method employing CRFs to estimate the score matrix parameters. AVAILABILITY: The program which is implemented in C++ is available at http://phmmts.dna.bio.keio.ac.jp/ under the GNU public license.

Notes:

DOI
PMID

Hiroshi Matsui, Kengo Sato, Yasubumi Sakakibara (2005) Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics 21: 11. 2611-2617 Jun

Abstract: MOTIVATION: Since the whole genome sequences of many species have been determined, computational prediction of RNA secondary structures and computational identification of those non-coding RNA regions by comparative genomics become important. Therefore, more advanced alignment methods are required. Recently, an approach of structural alignment for RNA sequences has been introduced to solve these problems. Pair hidden Markov models on tree structures (PHMMTSs) proposed by Sakakibara are efficient automata-theoretic models for structural alignment of RNA secondary structures, although PHMMTSs are incapable of handling pseudoknots. On the other hand, tree adjoining grammars (TAGs), a subclass of context-sensitive grammars, are suitable for modeling pseudoknots. Our goal is to extend PHMMTSs by incorporating TAGs to be able to handle pseudoknots. RESULTS: We propose pair stochastic TAGs (PSTAGs) for aligning and predicting RNA secondary structures including a simple type of pseudoknot which can represent most known pseudoknot structures. First, we extend PHMMTSs defined on alignment of 'trees' to PSTAGs defined on alignment of 'TAG trees' which represent derivation processes of TAGs and are functionally equivalent to derived trees of TAGs. Then, we develop an efficient dynamic programming algorithm of PSTAGs for obtaining an optimal structural alignment including pseudoknots. We implement the PSTAG algorithm and demonstrate the properties of the algorithm by using it to align and predict several small pseudoknot structures. We believe that our implemented program based on PSTAGs is the first grammar-based and practically executable software for comparative analyses of RNA pseudoknot structures, and, further, non-coding RNAs.

Notes:

2004

PMID

Hiroshi Matsui, Kengo Sato, Yasubumi Sakakibara (2004) Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Proc IEEE Comput Syst Bioinform Conf 290-299

Abstract: MOTIVATION: Since the whole genome sequences for many species are currently available, computational predictions of RNA secondary structures and computational identifications of those non-coding RNA regions by comparative genomics become important, and require more advanced alignment methods. Recently, an approach of structural alignments for RNA sequences has been introduced to solve these problems. By structural alignments, we mean a pairwise alignment to align an unfolded RNA sequence into a folded RNA sequence of known secondary structure. Pair HMMs on tree structures (PHMMTSs) proposed by Sakakibara are efficient automata-theoretic models for structural alignments of RNA secondary structures, but are incapable of handling pseudoknots. On the other hand, tree adjoining grammars (TAGs) is a subclass of context-sensitive grammar, which is suitable for modeling pseudoknots. Our goal is to extend PHMMTSs by incorporating TAGs to be able to handle pseudoknots. RESULTS: We propose the pair stochastic tree adjoining grammars (PSTAGs) for modeling RNA secondary structures including pseudoknots and show the strong experimental evidences that modeling pseudoknot structures significantly improves the prediction accuracies of RNA secondary structures. First, we extend the notion of PHMMTSs defined on alignments of 'trees' to PSTAGs defined on alignments of "TAG (derivation) trees", which represent a top-down parsing process of TAGs and are functionally equivalent to derived trees of TAGs. Second, we modify PSTAGs so that it takes as input a pair of a linear sequence and a TAG tree representing a pseudoknot structure of RNA to produce a structural alignment. Then, we develop a polynomial-time algorithm for obtaining an optimal structural alignment by PSTAGs, based on dynamic programming parser. We have done several computational experiments for predicting pseudoknots by PSTAGs, and our computational experiments suggests that prediction of RNA pseudoknot structures by our method are more efficient and biologically plausible than by other conventional methods. The binary code for PSTAG method is freely available from our website at http://www.dna.bio.keio.ac.jp/pstag/.

Notes: