NCBI Prokaryotic Genome Annotation Standards

Certain metrics can be used to assess the quality of prokaryotic genome annotation. NCBI has worked with other major archive databases and major sequencing centers to develop standards for prokaryotic genome annotation.

This collaboration has resulted in a set of annotation standards approved and accepted by major annotation pipelines.

Minimum standards for complete genomes

Structural RNA: 5S, 16S, 23S – at least one copy of each with appropriate length
tRNA – at least one copy for each amino acid
Protein-coding gene count divided by genome length (in kilobases) close to 1, i.e. roughly one protein-coding gene per kb
No gene completely contained in another gene on the same or opposite strand
No partial features
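
The checklist above lends itself to an automated test. The following is a minimal sketch, not NCBI's implementation: the `Feature` record, the `check_minimum_standards` function, and the tolerance of 0.25 around a density of 1 are all hypothetical, and the density check assumes genome length is measured in kilobases. It does not verify that rRNA lengths are "appropriate", since no thresholds are given here.

```python
from dataclasses import dataclass

# The 20 standard amino acids as three-letter codes (used here as tRNA products).
AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


@dataclass
class Feature:
    """Hypothetical, simplified annotation feature."""
    kind: str           # "gene", "CDS", "rRNA" or "tRNA"
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    product: str = ""   # e.g. "16S" for an rRNA, "Ala" for a tRNA
    partial: bool = False


def check_minimum_standards(features, genome_length_bp, density_tolerance=0.25):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Structural RNA: at least one copy each of 5S, 16S and 23S.
    rrnas = {f.product for f in features if f.kind == "rRNA"}
    for needed in ("5S", "16S", "23S"):
        if needed not in rrnas:
            problems.append(f"missing {needed} rRNA")

    # tRNA: at least one copy for each amino acid.
    trnas = {f.product for f in features if f.kind == "tRNA"}
    missing = sorted(AMINO_ACIDS - trnas)
    if missing:
        problems.append("missing tRNA for: " + ", ".join(missing))

    # Protein-coding gene count per kilobase should be close to 1.
    cds_count = sum(1 for f in features if f.kind == "CDS")
    density = cds_count / (genome_length_bp / 1000)
    if abs(density - 1.0) > density_tolerance:
        problems.append(f"CDS per kb is {density:.2f}, not close to 1")

    # No gene completely contained in another gene (strand is deliberately ignored).
    genes = [f for f in features if f.kind == "gene"]
    for inner in genes:
        for outer in genes:
            if inner is not outer and outer.start <= inner.start and inner.end <= outer.end:
                problems.append(
                    f"gene {inner.start}..{inner.end} contained in {outer.start}..{outer.end}"
                )

    # No partial features.
    for f in features:
        if f.partial:
            problems.append(f"partial {f.kind} at {f.start}..{f.end}")

    return problems
```

Any problem the function reports would still need to be reviewed by a person, because documented exceptions are allowed.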

Every exception has to be explained and documented. See more details in the meeting report.

Structural and functional annotation should follow INSDC feature table definitions. For each feature there is a set of mandatory and optional qualifiers that provide detailed information in a structured format. The flatfile format is reviewed every year by the member databases, and proposed changes are discussed before acceptance. A complete description of the format and content of the flatfile is given in the GenBank release notes.

Locus-tag registry

Locus-tags are systematic identifiers used to enumerate annotated genes, including genes with no known function. Prefixes consisting of alphanumeric characters that meet the standards can be registered along with a genome project submission. Assigning a unique locus-tag prefix to each genome ensures that every gene feature across all genome records can be correctly identified.
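
As an illustration of how a registered prefix turns into per-gene identifiers, here is a minimal sketch. The prefix rule used (3 to 12 alphanumeric characters, starting with a letter), the `PREFIX_NNNNN` layout, and the numbering step of 5 are assumptions made for this example; the function names are hypothetical, and the registry's own requirements take precedence.

```python
import re

# Assumed prefix rule: 3-12 alphanumeric characters, the first one a letter.
PREFIX_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")


def is_valid_prefix(prefix):
    """Check a candidate locus-tag prefix against the assumed rule."""
    return PREFIX_RE.fullmatch(prefix) is not None


def assign_locus_tags(prefix, gene_count, step=5, width=5):
    """Build one locus-tag per gene, in genome order.

    Numbering in steps leaves gaps so that genes found later can be
    given tags between existing ones without renumbering.
    """
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid locus-tag prefix: {prefix!r}")
    return [f"{prefix}_{(i + 1) * step:0{width}d}" for i in range(gene_count)]


print(assign_locus_tags("ABC", 3))  # ['ABC_00005', 'ABC_00010', 'ABC_00015']
```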

Gene

A gene is defined as a region of biological interest for which a name has been assigned. A gene feature is always a single interval, and its location should cover the intervals of all relevant features, such as promoters and operator binding sites. Gene names must follow the standard bacterial nomenclature rule of three lowercase letters. Different loci are distinguished by a suffix of uppercase letters. A unique locus-tag is required for every gene feature.
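
The two rules above (name pattern and single covering interval) can be sketched as follows. This is a simplified illustration with hypothetical function names; the regular expression encodes only the rule as stated here (three lowercase letters plus an optional uppercase suffix) and will reject legitimate names that follow other conventions.

```python
import re

# Three lowercase letters, optionally followed by an uppercase-letter suffix.
GENE_NAME_RE = re.compile(r"[a-z]{3}[A-Z]*")


def is_valid_gene_name(name):
    """True if the name matches the three-letter bacterial naming rule."""
    return GENE_NAME_RE.fullmatch(name) is not None


def gene_interval(related_intervals):
    """Return the single (start, end) interval covering all related features.

    `related_intervals` is a list of (start, end) pairs, for example the
    coding region, its promoter and an operator binding site.
    """
    if not related_intervals:
        raise ValueError("at least one interval is required")
    starts = [start for start, _ in related_intervals]
    ends = [end for _, end in related_intervals]
    return min(starts), max(ends)


print(is_valid_gene_name("dnaA"))                        # True
print(is_valid_gene_name("DnaA"))                        # False
print(gene_interval([(150, 1550), (80, 120), (95, 110)]))  # (80, 1550)
```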

Coding region

A coding region should have a valid start and stop; in draft genomes, partial coding regions are allowed at the end of a contig. The conceptual translation should match the protein sequence; translational exceptions can be used to indicate exceptions to the normal genetic code, such as insertion of selenocysteine, suppression of terminator codons by a suppressor tRNA, or completion of a stop codon by poly-adenylation of an mRNA.
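
A minimal sketch of these checks follows. It assumes the standard codon assignments, treats only ATG, GTG and TTG as valid starts (a simplification), and models a translational exception as a mapping from zero-based codon index to the residue to use. The function names are hypothetical, and stop codons completed by poly-adenylation are not handled.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed subset of bacterial start codons


def translate(cds, transl_except=None):
    """Conceptually translate a CDS; '*' marks a stop, 'X' an unknown codon."""
    transl_except = transl_except or {}
    cds = cds.upper()
    residues = []
    for index in range(len(cds) // 3):
        codon = cds[3 * index:3 * index + 3]
        if index in transl_except:
            residues.append(transl_except[index])   # e.g. "U" for selenocysteine
        elif index == 0 and codon in START_CODONS:
            residues.append("M")                    # any start codon initiates with Met
        else:
            residues.append(CODON_TABLE.get(codon, "X"))
    return "".join(residues)


def check_cds(cds, expected_protein, transl_except=None):
    """Return a list of problems found in a complete (non-partial) coding region."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if cds[:3].upper() not in START_CODONS:
        problems.append("no valid start codon")
    protein = translate(cds, transl_except)
    if not protein.endswith("*"):
        problems.append("no stop codon at the end")
    body = protein[:-1] if protein.endswith("*") else protein
    if "*" in body:
        problems.append("internal stop codon")
    if body != expected_protein:
        problems.append("conceptual translation differs from protein sequence")
    return problems


print(check_cds("ATGAAATGA", "MK"))                 # []
print(check_cds("ATGTGAAAATAA", "MUK", {1: "U"}))   # []
```

In the second call the in-frame TGA at codon index 1 is read as selenocysteine because of the exception; without it, the same sequence would be reported as having an internal stop codon.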

Protein naming guidelines

Naming of proteins follows the International Protein Nomenclature Guidelines, agreed upon by the European Bioinformatics Institute (EMBL-EBI), the National Center for Biotechnology Information (NCBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB).

Capturing Annotation Methods and Information Sources

The results of genome annotation processes are deposited along with sequence records in the archival databases. The combination of methods and information sources used to create a particular genome annotation is usually detailed in a publication. With increasing numbers of genomes being deposited without an associated scientific publication, it is essential to have a process that captures the methods, the versions of the software and databases used to create a set of annotated features, and the date the annotation was produced.
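
One way to picture such a record is a small structured object holding the items listed above. The sketch below is illustrative only: the class, its field names, the `Key :: value` line layout, and the tool and database names in the usage example are hypothetical placeholders, not a prescribed submission format.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AnnotationProvenance:
    """Hypothetical record of how and when an annotation was produced."""
    provider: str
    annotation_date: date
    methods: list = field(default_factory=list)     # e.g. "ab initio gene prediction"
    software: dict = field(default_factory=dict)    # tool name -> version
    databases: dict = field(default_factory=dict)   # database name -> release

    def to_lines(self):
        """Render the record as 'Key :: value' lines for inclusion in a comment."""
        lines = [
            f"Annotation Provider :: {self.provider}",
            f"Annotation Date :: {self.annotation_date.isoformat()}",
        ]
        if self.methods:
            lines.append("Annotation Method :: " + "; ".join(self.methods))
        for name, version in sorted(self.software.items()):
            lines.append(f"Software :: {name} {version}")
        for name, release in sorted(self.databases.items()):
            lines.append(f"Database :: {name} {release}")
        return lines


record = AnnotationProvenance(
    provider="Example Lab",
    annotation_date=date(2020, 1, 15),
    methods=["ab initio gene prediction", "protein homology"],
    software={"ExampleGeneFinder": "1.2"},      # placeholder tool name
    databases={"ExampleProteinDB": "2020_01"},  # placeholder database name
)
print("\n".join(record.to_lines()))
```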

Annotation assessment tools

NCBI is committed to producing annotation assessment tools that help submitters find problems with genome annotations. These tools are used during submission to GenBank and in the Prokaryotic Genome Annotation Pipeline, and are also available separately. They include: 1) the Discrepancy Report, which runs internal consistency checks without using external databases and is available in Sequin, as part of the tbl2asn tool, or as a stand-alone command-line tool; and 2) the subcheck/frameshift tool, which incorporates sequence searches against external databases to find potentially frameshifted genes and other annotation issues, and is available via the web or as a command-line tool.

RefSeq annotation

The Prokaryotic Genome Annotation Pipeline is also used to annotate the vast majority of RefSeq assemblies that meet quality standards.
