
A compression-based approach for coding sequences identification. I. Application to prokaryotic genomes




Journal of Computational Biology 13(8): 1477-1488



Most gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which require the optimization of a large number of parameters. This paper presents a novel method for classifying coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of the method are its non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful for further analyses of the genomic sequences themselves. The proposed approach has been applied to several complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, revealing that the approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform only an overall comparison with other gene-finder software, since at this stage we are not interested in building another gene-finder system, but only in exploring the feasibility of the suggested approach.
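The abstract does not define the compression index, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a generic compressor (zlib) as a stand-in and defines a hypothetical `compression_index` as compressed size divided by raw size. Lower values indicate a more structured (more compressible) sequence, and higher values a more random-like one.

```python
import random
import zlib


def compression_index(seq: str) -> float:
    """Ratio of zlib-compressed size to raw size for a DNA string.

    Assumption: zlib stands in for whatever compressor and index
    definition the paper actually uses.
    """
    raw = seq.upper().encode("ascii")
    if not raw:
        raise ValueError("sequence must not be empty")
    return len(zlib.compress(raw, 9)) / len(raw)


# A highly repetitive sequence versus a pseudo-random one of equal length.
rng = random.Random(0)  # fixed seed so the example is reproducible
repetitive = "ACGT" * 250
random_like = "".join(rng.choice("ACGT") for _ in range(1000))

structured_index = compression_index(repetitive)
random_index = compression_index(random_like)

# The repetitive sequence compresses far better than the random-like one.
print(f"repetitive:  {structured_index:.3f}")
print(f"random-like: {random_index:.3f}")
```

A classifier in this spirit would compare the index of a candidate region against a reference value. How that reference is chosen, and how the word dictionary enters the computation, is not stated in the abstract and is left open here.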


Accession: 011681458


PMID: 17061923

DOI: 10.1089/cmb.2006.13.1477

