An Automated Proteogenomic Method Utilizes Mass Spectrometry to Reveal Novel Genes in Zea mays.
|Title||An Automated Proteogenomic Method Utilizes Mass Spectrometry to Reveal Novel Genes in Zea mays.|
|Publication Type||Journal Article|
|Year of Publication||2013|
|Authors||Castellana NE, Shen Z, He Y, Walley JW, Cassidy C J, Briggs SP, Bafna V|
|Journal||Mol Cell Proteomics|
|Date Published||2013 Oct 18|
The pace of genome sequencing is accelerating, revealing the genetic background of a growing number of organisms. However, assigning function to each nucleotide in a completed genome remains the rate-limiting step. New technologies in transcriptomics and proteomics have spurred the emergence of proteogenomics, a field at the confluence of genomics, transcriptomics, and proteomics. First-generation proteogenomic toolkits employ peptide mass spectrometry to identify novel protein-coding regions. These efforts rely heavily on existing algorithms designed for standard proteome analysis and fail to address the challenges specific to genome annotation. In this work we extend first-generation proteogenomic tools to achieve greater accuracy and enable the analysis of large, complex genomes. We apply our pipeline to Zea mays, whose genome is comparable in size to the human genome. Our pipeline begins with the comparison of mass spectra to a putative translation of the genome. Our translation includes the six-frame translation as well as a splice graph for capturing protein splice variants. We distribute the identification of mass spectra across 45 compute nodes for increased efficiency and employ a database-independent scoring method to improve sensitivity. We select novel peptides, those that match a region of the genome not previously known to be protein coding, for grouping into events. Each of our eight event types describes a refinement needed in the genome annotation. We present a novel Bayesian framework for evaluating the accuracy of each event. Our calculated event probability, or eventProb, considers the number of supporting peptides and spectra, and the quality of each supporting peptide-spectrum match. More than 80% of the maize genome consists of repetitive elements. To address this, our eventProb handles uniquely located peptides and shared-location peptides separately. Our pipeline predicts 165 novel protein-coding genes and proposes updated models for 741 additional genes.
|Alternate Journal||Mol. Cell Proteomics|
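The abstract mentions a six-frame translation of the genome as one part of the search database. As a rough illustration of that general technique only, the following Python sketch translates a DNA string in all six reading frames using the standard genetic code. It is not the authors' implementation: the function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels (`+1` to `-3`) are hypothetical, and the splice graph component is not covered.

```python
# Minimal sketch of six-frame translation (standard genetic code).
# All names here are illustrative assumptions, not taken from the paper.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order line up with the amino acid string above.
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> dict:
    """Translate three offsets on the forward strand and three on the reverse."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In a proteogenomic search, peptides would then be matched against stretches between stop codons (`*`) in each frame; how the authors segment and index these stretches is not described in the abstract.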