Abstract: <sec><title>Background</title><p>As a marker of <italic>Helicobacter pylori</italic>, cytotoxin-associated gene A (cagA) has been revealed to be the major virulence factor causing gastroduodenal diseases. However, the molecular mechanisms that underlie the development of different gastroduodenal diseases caused by cagA-positive <italic>H. pylori</italic> infection remain unknown. Current studies are limited to evaluating the correlation between diseases and the number of Glu-Pro-Ile-Tyr-Ala (EPIYA) motifs in the CagA strain. To further understand the relationship between the CagA sequence and its virulence in gastric cancer, we proposed a systematic entropy-based approach to identify the cancer-related residues in the intervening regions of CagA and employed a supervised machine learning method to classify cancer and non-cancer cases.</p></sec><sec><title>Methodology</title><p>An entropy-based calculation was used to detect key residues of CagA intervening sequences as gastric cancer biomarkers. For each residue, both combinatorial entropy and background entropy were calculated, and the entropy difference was used as the criterion for feature residue selection. The feature values were then fed into a Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel, and two parameters were tuned by grid search to obtain the optimal F value. Two other popular sequence classification methods, BLAST and HMMER, were also applied to the same data for comparison.</p></sec><sec><title>Conclusion</title><p>Our method achieved 76% and 71% classification accuracy for the Western and East Asian subtypes, respectively, performing significantly better than BLAST and HMMER. This research indicates that small amino acid variations at those important residues might lead to the virulence variance of CagA strains, resulting in different gastroduodenal diseases.
This study provides not only a useful tool to predict the correlation between a novel CagA strain and diseases, but also a general new framework for detecting biological sequence biomarkers in population studies.</p></sec>
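The entropy-difference criterion described in the Methodology can be illustrated with a minimal sketch. The abstract does not give the formulas for combinatorial and background entropy, so this example substitutes plain Shannon entropy as an assumption: the entropy of each alignment column over all sequences, minus the size-weighted mean of the within-group entropies. The function names and the assumption of pre-aligned, equal-length sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of a list of symbols."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_difference(cancer, non_cancer):
    """Per-column entropy difference for two groups of aligned sequences.

    Stand-in for the paper's criterion (an assumption): entropy of the pooled
    column minus the size-weighted mean entropy within each group. Larger
    values mean the column separates the two groups better.
    """
    pooled = cancer + non_cancer
    length = len(pooled[0])
    if any(len(seq) != length for seq in pooled):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for col in range(length):
        background = shannon_entropy([s[col] for s in pooled])
        within = sum(
            len(group) / len(pooled) * shannon_entropy([s[col] for s in group])
            for group in (cancer, non_cancer)
        )
        scores.append(background - within)
    return scores


def select_features(scores, top_k):
    """Indices of the top_k columns with the largest entropy difference."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])
```

In a pipeline like the one described, the selected columns would then be encoded numerically and passed to a classifier; that step is omitted here because the abstract does not specify the encoding.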
Abstract: We present a Cytoscape plugin called Mosaic to support interactive network annotation, partitioning, layout and coloring based on Gene Ontology or other relevant annotations.
Abstract: With the rapid development of next-generation sequencing technologies, bacterial identification becomes a very
important and essential step in processing genomic data, especially for metagenomic data. Many computational methods
have been developed and some of them are widely used to address the problems in bacterial identification. In this article
we review the algorithms of these methods, discuss their drawbacks, and propose future computational methods that use
genomic data to characterize bacteria. In addition, we tackle two specific computational problems in bacterial identification,
namely, the detection of host-specific bacteria and the detection of disease-associated bacteria, by offering potential solutions
as a starting point for those who are interested in the area.
Abstract: BACKGROUND: A phylogenetic tree, showing ancestral relations among organisms, is commonly represented as a rooted tree with sets of bifurcating branches (dichotomies) for simplicity, although polytomies (multifurcating branches) may reflect more accurate evolutionary relationships. To represent the true evolutionary relationships, it is important to systematically identify the polytomies from a bifurcating tree and generate a taxonomy-compatible multifurcating tree. For this purpose we propose a novel approach, "PolyPhy", which classifies a set of bifurcating branches of a phylogenetic tree into a set of branches with dichotomies and polytomies by considering genome distances among genomes and tree topological properties. RESULTS: PolyPhy employs a machine learning technique, a Bayesian logistic regression (BLR) classifier, to identify possible bifurcating subtrees as polytomies from the trees produced by ComPhy. Besides considering genome-scale distances between all pairs of species, PolyPhy also takes into account properties of tree topology that differ between dichotomy and polytomy, such as long-branch retraction and short-branch contraction, and quantifies these properties into comparable rates among different sub-branches. We extract three tree topological features, 'LR' (Leaf rate), 'IntraR' (Intra-subset branch rate) and 'InterR' (Inter-subset branch rate), all of which are calculated from bifurcating tree branch sets for classification. We achieved an F-measure (a balanced measure of precision and recall) of 81%, with an area under the ROC curve (AUC) of about 0.9. CONCLUSIONS: PolyPhy is a fast and robust method to identify polytomies from phylogenetic trees based on genome-wide inference of evolutionary relationships among genomes. The software package and test data can be downloaded from http://digbio.missouri.edu/ComPhy/phyloTreeBiNonBi-1.0.zip.
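The F-measure reported above balances precision and recall. A minimal sketch of the calculation follows, assuming the usual balanced F1 form (the harmonic mean of the two); the abstract does not state which variant was used, and the function name is hypothetical.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

For a polytomy classifier, `true_pos` would count subtrees correctly labeled as polytomies, `false_pos` the dichotomies wrongly labeled as polytomies, and `false_neg` the polytomies that were missed.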
Abstract: Effector secretion is a common strategy of pathogens in mediating host-pathogen interactions. Eight EPIYA-motif-containing effectors have recently been discovered in six pathogens. Once these effectors enter host cells through type III/IV secretion systems (T3SS/T4SS), the tyrosine in the EPIYA motif is phosphorylated, which triggers binding of the effectors to other proteins to manipulate host-cell functions. The objectives of this study are to evaluate the distribution pattern of the EPIYA motif across a broad range of biological species, to predict potential effectors with the EPIYA motif, and to suggest roles and biological functions of potential effectors in host-pathogen interactions.
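Surveying the distribution of the EPIYA motif starts with locating it in protein sequences. The sketch below is only an illustration of that first step, assuming exact matches to the five-residue string EPIYA in one-letter amino acid code; the abstract does not say whether motif variants were allowed, and the function names are hypothetical.

```python
import re

# Exact five-residue motif Glu-Pro-Ile-Tyr-Ala in one-letter code.
EPIYA = re.compile(r"EPIYA")


def find_epiya(sequence):
    """Return the 0-based start position of every EPIYA motif in a protein sequence."""
    return [match.start() for match in EPIYA.finditer(sequence.upper())]


def phospho_tyrosine_sites(sequence):
    """Return the 0-based position of the tyrosine (Y) inside each EPIYA motif."""
    # Y is the fourth residue of the motif, so it sits at offset 3 from the start.
    return [start + 3 for start in find_epiya(sequence)]
```

For the toy sequence `"MAEPIYAKKEPIYATT"`, `find_epiya` returns `[2, 9]` and `phospho_tyrosine_sites` returns `[5, 12]`.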
Abstract: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.
Abstract: Characterizing gene function is one of the major challenging tasks in the postgenomic era. To address this challenge, we developed GeneFAS (gene function annotation system), a computer system with a graphical user interface for cellular function prediction by integrating information from protein-protein interactions, protein complexes, microarray gene expression profiles, and annotations of known proteins. GeneFAS provides biologists with a workspace for their organism of interest, in which they can integrate different types of experimental data and annotation information, and it facilitates biological discovery and hypothesis generation using all the information. It also provides testing and training capabilities for users to utilize and integrate their data more efficiently. GeneFAS is freely available for download at http://digbio.missouri.edu/genefas.
Abstract: Characterising gene function is one of the major challenging tasks in the post-genomic era. Various approaches have been developed to integrate multiple sources of high-throughput data to predict gene function. Most of those approaches are used only for research purposes and have not been implemented as publicly available tools. Even among the implemented applications, almost all are still web-based 'prediction servers' that have to be managed by specialists. This paper introduces a systematic method for integrating various sources of high-throughput data to predict gene function, analyses the prediction results, and evaluates the method's performance based on the competition for mouse gene function prediction (MouseFunc). A stand-alone Java-based software package 'GeneFAS' is freely available at http://digbio.missouri.edu/genefas.
Abstract: Protein identification through high-throughput mass spectrum data is an important domain in proteomics. Peptide mass fingerprinting (PMF) is one of the major methods for protein identification using mass spectrometry. We developed a software package called "ProteinDecision" for PMF protein identification, together with a user-friendly graphical interface. "ProteinDecision" handles peak selection from mass spectra, database format conversion, and display of the top-ranked identification results with detailed information for each rank. We used a novel scoring function that considers the distribution of mass-to-charge matches and peak intensity in a database, based on the MOWSE table. In a comparison of computational results on experimental PMF data, the new scoring function performed better than existing ones. A standalone version of "ProteinDecision" is freely available upon request.
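The core idea of peptide mass fingerprinting can be sketched as follows: digest a candidate protein in silico, compute the theoretical peptide masses, and count how many observed peaks they explain. This is not the ProteinDecision scoring function and does not use the MOWSE table; it is a simple match count under stated assumptions (trypsin cleaving after K or R except before P, no missed cleavages, singly protonated peptides, a hypothetical mass tolerance of 0.2 Da). All function names are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum of a free peptide
PROTON = 1.00728   # mass of the proton in a singly charged [M+H]+ ion


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mz(peptide):
    """Monoisotopic m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER + PROTON


def count_matches(protein, peaks, tolerance=0.2):
    """Number of observed peaks within `tolerance` Da of a theoretical peptide."""
    theoretical = [peptide_mz(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(peak - mz) <= tolerance for mz in theoretical) for peak in peaks
    )
```

Ranking every database protein by `count_matches` against one peak list gives a crude identification; probability-based schemes such as the one the abstract describes refine this by weighting each match.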