G. N. Ramachandran Knowledge Center for Genome Informatics, Institute of Genomics and Integrative Biology (Council of Scientific and Industrial Research), Delhi, India
Abstract: The genome of the H37Rv strain of Mycobacterium tuberculosis was sequenced in 1998, followed by whole genome sequencing of a clinical isolate, CDC1551, in 2002. Since then, the genomic sequences of a number of other strains have become available, making it one of the better studied pathogenic bacterial species at the genomic level. However, annotation of its genome remains challenging because of its high GC content and dissimilarity to other model prokaryotes. To this end, we carried out an in-depth proteogenomic analysis of the M. tuberculosis H37Rv strain using Fourier transform mass spectrometry with high resolution at both MS and MS/MS levels. In all, we identified 3,176 proteins from Mycobacterium tuberculosis, representing ~80% of its total predicted gene count. In addition to the protein database search, we carried out a genome database search, which led to the identification of ~250 novel peptides. Based on these novel genome search specific peptides (GSSPs), we discovered 41 novel protein coding genes in the H37Rv genome. Using peptide evidence and alternative gene prediction tools, we also corrected 79 gene models. Finally, mass spectrometric data from N-terminus-derived peptides confirmed 745 existing annotations for translational start sites while correcting those for 33 proteins. We report, for the first time, the creation of a high confidence set of protein coding regions in the Mycobacterium tuberculosis genome obtained by tandem mass spectrometry with high resolution at both precursor and fragment detection steps. This proteogenomic approach should be generally applicable to other organisms whose genomes have already been sequenced, yielding a more accurate catalog of protein-coding genes.
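The genome database search described in the abstract above can be illustrated with a minimal Python sketch. It assumes a simplified setting that the abstract does not specify: peptide sequences have already been identified, and a peptide is treated as genome search specific if it occurs in one of the six reading-frame translations of the genome but in none of the annotated proteins. The function names (`six_frame_translations`, `genome_search_specific`) and the toy sequences are hypothetical; an actual proteogenomic pipeline would search spectra against a translated genome database rather than match strings.

```python
from itertools import product

# Standard codon table built in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(genome):
    """Map frame labels ('+1'..'+3', '-1'..'-3') to translated sequences."""
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def genome_search_specific(peptides, genome, known_proteins):
    """Return peptides absent from known proteins but present in some frame.

    Simplifying assumption: exact substring matching stands in for a real
    spectrum-level search against a six-frame translated database.
    """
    frames = six_frame_translations(genome)
    hits = {}
    for pep in peptides:
        if any(pep in prot for prot in known_proteins):
            continue  # already explained by the annotated proteome
        matched = [name for name, aa in frames.items() if pep in aa]
        if matched:
            hits[pep] = matched
    return hits


# Toy example (invented sequences, not M. tuberculosis data).
toy_genome = "ATGGCCAAGTTTGGCTAA"
toy_hits = genome_search_specific(["AKFG", "MAK"], toy_genome, ["MAKXX"])
```

In this toy case, `MAK` is discarded because an annotated protein already contains it, while `AKFG` is reported against the forward frame in which it occurs.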
Abstract: Plasma is the most easily accessible source for biomarker discovery in clinical proteomics. However, identifying potential biomarkers from plasma is a challenge given the large dynamic range of proteins. Potential biomarkers in plasma are generally present at very low abundance, and hence their identification necessitates the depletion of highly abundant proteins. Sample pre-fractionation by immuno-depletion with a multi-affinity removal system (MARS) has been a popular method to deplete multiple high abundance proteins. However, depletion of these abundant proteins can result in concomitant removal of low abundance proteins. Although there are some reports suggesting the removal of non-targeted proteins, the predominant view is that the number of such proteins is small. In this study, we identified proteins that are removed along with the targeted high abundance proteins. Three plasma samples were depleted using each of the three MARS (Hu-6, Hu-14 and Proteoprep 20) cartridges. The affinity bound fractions were subjected to gelC-MS using an LTQ-Orbitrap instrument. Using four database search algorithms, including MassWiz (developed in house), we selected the peptides identified at <1% FDR. Peptides identified by at least two algorithms were selected for protein identification. After this rigorous bioinformatics analysis, we identified 101 proteins with high confidence. Thus, we believe that for biomarker discovery and proper quantitation of proteins, it might be better to study both bound and depleted fractions from any MARS depleted plasma sample.
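The consensus step mentioned above, in which a peptide is kept only if at least two search algorithms identify it, can be sketched as follows. This is a hedged illustration and not the authors' pipeline: the function name `consensus_peptides`, the placeholder engine labels, and the toy peptide sequences are all assumptions, and each input list is assumed to have already been filtered to <1% FDR.

```python
from collections import defaultdict


def consensus_peptides(results, min_engines=2):
    """Keep peptides reported by at least `min_engines` search engines.

    `results` maps an engine name to the peptides that engine reported
    after its own FDR filtering (assumed to have been done upstream).
    Returns a dict mapping each retained peptide to its supporting engines.
    """
    support = defaultdict(set)
    for engine, peptides in results.items():
        for pep in set(peptides):  # ignore duplicates within one engine
            support[pep].add(engine)
    return {
        pep: sorted(engines)
        for pep, engines in support.items()
        if len(engines) >= min_engines
    }


# Toy input: engine labels other than MassWiz are placeholders, and the
# peptide strings are invented for illustration.
toy_results = {
    "MassWiz": ["LVNELTEFAK", "AEFVEVTK", "YLYEIAR"],
    "engine_B": ["LVNELTEFAK", "YLYEIAR"],
    "engine_C": ["LVNELTEFAK", "QTALVELLK"],
    "engine_D": ["QTALVELLK", "QTALVELLK"],
}
toy_consensus = consensus_peptides(toy_results)
```

Here `AEFVEVTK` is dropped because only one engine reports it, while the other three peptides are retained with two or three supporting engines each.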
Abstract: Mass spectrometry has made rapid advances in the recent past and has become the preferred method for proteomics. Although many open source algorithms for peptide identification exist, such as X!Tandem and OMSSA, the field has largely been the domain of proprietary software. There is a need for better, freely available, and configurable algorithms that can help in identifying the correct peptides while keeping false positives to a minimum. We have developed MassWiz, a novel empirical scoring function that gives appropriate weights to major ions, continuity of b-y ions, intensities, and the supporting neutral losses based on the instrument type. We tested the accuracy of MassWiz on 486,882 spectra from a standard mixture of 18 proteins, generated on 6 different instruments and downloaded from the Seattle Proteome Center public repository. We compared the MassWiz algorithm with Mascot, Sequest, OMSSA, and X!Tandem at 1% FDR. MassWiz outperformed all of them in the largest data set (AGILENT XCT) and was second only to Mascot in the other data sets. MassWiz showed good performance in the analysis of high confidence peptides, i.e., those identified by at least three algorithms. We also analyzed a yeast data set containing 106,133 spectra downloaded from the NCBI Peptidome repository and obtained similar results. The results demonstrate that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments. MassWiz is open-source, versatile, and easily configurable.
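The comparison at 1% FDR can be made concrete with a small sketch of score thresholding. The abstract does not state how FDR was estimated, so this example assumes the common target-decoy approach, in which the FDR above a score cutoff is approximated as the number of decoy matches divided by the number of target matches. The function name `score_threshold_at_fdr` and the toy scores are hypothetical, and scores are assumed to be distinct, so ties are not handled specially.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    `psms` is a list of (score, is_decoy) pairs, one per peptide-spectrum
    match, where higher scores are better. FDR at a cutoff is estimated as
    decoys / targets among matches scoring at or above that cutoff
    (target-decoy assumption). Returns None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cutoff = score  # accepting everything down to here is OK
    return best_cutoff


def accepted_targets(psms, cutoff):
    """Count target matches at or above the cutoff (0 if cutoff is None)."""
    if cutoff is None:
        return 0
    return sum(1 for score, is_decoy in psms if score >= cutoff and not is_decoy)


# Toy scores, invented for illustration only.
toy_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
toy_cutoff = score_threshold_at_fdr(toy_psms, max_fdr=0.01)
toy_count = accepted_targets(toy_psms, toy_cutoff)
```

Under this scheme, comparing search engines amounts to counting how many target matches each one retains at its own cutoff, which is one way to read "without compromising on the number of assignments".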