Correcting base-assignment errors in repeat regions of shotgun assembly.

Title: Correcting base-assignment errors in repeat regions of shotgun assembly.
Publication Type: Journal Article
Year of Publication: 2007
Authors: Zhi D, Keich U, Pevzner P, Heber S, Tang H
Journal: IEEE/ACM Trans Comput Biol Bioinform
Volume: 4
Issue: 1
Pagination: 54-64
Date Published: 2007 Jan-Mar
ISSN: 1545-5963
Keywords: Algorithms; Campylobacter jejuni; Cluster Analysis; Computational Biology; Genome, Bacterial; Lactococcus lactis; Models, Statistical; Repetitive Sequences, Nucleic Acid; Sequence Alignment; Sequence Analysis, DNA; Software; Staphylococcus epidermidis; Wolbachia
Abstract

Accurate base-assignment in repeat regions of a whole genome shotgun assembly is an unsolved problem. Since reads in repeat regions cannot be easily attributed to a unique location in the genome, current assemblers may place these reads arbitrarily. As a result, the base-assignment error rate in repeats is likely to be much higher than that in the rest of the genome. We developed an iterative algorithm, EULER-AIR, that is able to correct base-assignment errors in finished genome sequences in public databases. The Wolbachia genome is among the best finished genomes. Using this genome project as an example, we demonstrated that EULER-AIR can 1) discover and correct base-assignment errors, 2) provide accurate read assignments, 3) utilize finishing reads for accurate base-assignment, and 4) provide guidance for designing finishing experiments. In the genome of Wolbachia, EULER-AIR found 16 positions with ambiguous base-assignment and two positions with erroneous bases. Besides Wolbachia, many other genome sequencing projects have significantly fewer finishing reads and, hence, are likely to contain more base-assignment errors in repeats. We demonstrate that EULER-AIR is a software tool that can be used to find and correct base-assignment errors in a genome assembly project.

DOI: 10.1109/TCBB.2007.1005
PubMed URL: http://www.ncbi.nlm.nih.gov/pubmed/17277413?dopt=Abstract
Alternate Journal: IEEE/ACM Trans Comput Biol Bioinform
PubMed ID: 17277413