NCBI Prokaryotic Genome Annotation Standards
Certain metrics can be used to assess the quality of prokaryotic genome annotation. NCBI has worked with other major archive databases and major sequencing centers to develop standards for prokaryotic genome annotation.
This fruitful collaboration has resulted in a set of annotation standards approved and accepted by major annotation pipelines.
Minimum standards for complete genomes
- Structural RNA: 5S, 16S, 23S – at least one copy of each with appropriate length
- tRNA – at least one copy for each amino acid
- Protein-coding gene count divided by genome length in kilobases close to 1 (roughly one gene per kb)
- No gene completely contained in another gene on the same or opposite strand
- No partial features
Every exception has to be explained and documented. See the meeting report for more details.
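The checklist above can be sketched as a small validation routine. This is only an illustration under stated assumptions: the `Feature` record, the `check_minimum_standards` function, and the `density_tolerance` threshold are hypothetical names and values, gene density is computed per kilobase, and the "appropriate length" test for rRNAs is left out because no length ranges are given here. Circular genomes with features spanning the origin are not handled.

```python
from dataclasses import dataclass

# The 20 standard amino acids, written as three-letter codes.
AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


@dataclass
class Feature:
    """Hypothetical, simplified feature record (not an INSDC structure)."""
    kind: str            # "gene", "CDS", "rRNA" or "tRNA"
    start: int           # 1-based, inclusive
    end: int             # 1-based, inclusive
    product: str = ""    # e.g. "16S" for an rRNA, "Ala" for a tRNA
    partial: bool = False


def check_minimum_standards(features, genome_length, density_tolerance=0.3):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Structural RNA: at least one copy each of 5S, 16S and 23S.
    rrna_found = {f.product for f in features if f.kind == "rRNA"}
    for needed in ("5S", "16S", "23S"):
        if needed not in rrna_found:
            problems.append(f"missing {needed} rRNA")

    # tRNA: at least one copy for each amino acid.
    trna_found = {f.product for f in features if f.kind == "tRNA"}
    missing = sorted(AMINO_ACIDS - trna_found)
    if missing:
        problems.append("missing tRNA for: " + ", ".join(missing))

    # Protein-coding gene count per kb should be close to 1.
    # The tolerance is an arbitrary illustrative value.
    cds_count = sum(1 for f in features if f.kind == "CDS")
    density = cds_count / (genome_length / 1000)
    if abs(density - 1.0) > density_tolerance:
        problems.append(f"CDS density {density:.2f} per kb is far from 1")

    # No gene completely contained in another gene (strand is ignored, since
    # the rule applies to both strands). After sorting by start (and longest
    # first on ties), a gene is contained if it ends at or before the largest
    # end seen so far.
    genes = sorted((f for f in features if f.kind == "gene"),
                   key=lambda f: (f.start, -f.end))
    max_end = 0
    for g in genes:
        if g.end <= max_end:
            problems.append(f"gene {g.start}..{g.end} is contained in another gene")
        max_end = max(max_end, g.end)

    # No partial features in a complete genome.
    if any(f.partial for f in features):
        problems.append("partial feature present")

    return problems
```

Each reported problem corresponds to an exception that would need to be explained and documented.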
Structural and functional annotation should follow INSDC feature table definitions. For each feature there is a set of mandatory and optional qualifiers that provide detailed information in a structured format for each particular feature. The flatfile format is reviewed every year by the member databases and proposed changes are discussed before acceptance. A complete description of the format and content of the flatfile is documented in the GenBank release notes.
Locus-tag registry
Locus-tags are systematic identifiers used to enumerate annotated genes, including genes that have no known function. Prefixes consisting of alphanumeric characters that meet the standards can be registered along with a genome project submission. Assigning a unique locus-tag prefix to each genome ensures that each gene feature in the dataset of all genome records can be correctly identified.
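As a rough illustration of how prefix-based locus-tags enumerate genes, the sketch below generates tags from a prefix and checks a tag list for duplicates. The prefix rule encoded in `PREFIX_RE` (3 to 12 alphanumeric characters, starting with a letter) and the `PREFIX_number` layout with a step of 5 are assumptions that are not stated in this document; check them against the current registry rules. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed prefix rule (not stated in this document): 3-12 alphanumeric
# characters, the first of which is a letter.
PREFIX_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")


def make_locus_tags(prefix, n_genes, step=5, width=5):
    """Generate n_genes tags such as ABC_00005, ABC_00010, ...

    Numbering in steps leaves room to insert genes found later; the step
    and zero-padding width are illustrative choices.
    """
    if not PREFIX_RE.fullmatch(prefix):
        raise ValueError(f"invalid locus-tag prefix: {prefix!r}")
    return [f"{prefix}_{i * step:0{width}d}" for i in range(1, n_genes + 1)]


def find_duplicate_tags(tags):
    """Return the sorted list of tags that occur more than once."""
    return sorted(tag for tag, count in Counter(tags).items() if count > 1)
```

For example, `make_locus_tags("ABC", 3)` yields `ABC_00005`, `ABC_00010` and `ABC_00015`, and `find_duplicate_tags` returns an empty list when every gene feature has its own tag.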
Gene
A gene is defined as a region of biological interest for which a name has been assigned. Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and operator binding sites. Gene names must follow the standard bacterial nomenclature rule of three lowercase letters, with different loci distinguished by a suffix of uppercase letters (for example, recA). A unique locus-tag is required for every gene feature.
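The naming rule and the single-interval rule can be sketched as follows. The regular expression encodes only the rule as stated here (three lowercase letters followed by an optional uppercase suffix); the function names and the `(start, end)` tuple representation are hypothetical.

```python
import re

# Three lowercase letters, then an optional suffix of uppercase letters.
GENE_NAME_RE = re.compile(r"[a-z]{3}[A-Z]*")


def is_valid_gene_name(name):
    """True if the name follows the three-letter-plus-suffix convention."""
    return GENE_NAME_RE.fullmatch(name) is not None


def gene_interval(feature_intervals):
    """Single interval covering all related features.

    feature_intervals is a non-empty list of (start, end) pairs, for
    example a promoter, an operator binding site and the coding region.
    """
    starts, ends = zip(*feature_intervals)
    return min(starts), max(ends)
```

With these definitions, `is_valid_gene_name("recA")` is true while `is_valid_gene_name("RecA")` is false, and a gene whose promoter lies at 90..110 and whose coding region lies at 200..1100 gets the interval 90..1100.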
Coding region
A coding region should have a valid start and stop codon; in draft genomes, partial coding regions are allowed at the end of a contig. The conceptual translation should match the protein sequence; translational exceptions can be used to indicate exceptions to the normal genetic code, such as insertion of selenocysteine, suppression of terminator codons by a suppressor tRNA, or completion of a stop codon by polyadenylation of an mRNA.
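A minimal sketch of such a check is shown below: it verifies the start and stop codons, produces the conceptual translation, and lets a translational exception override a single codon. Several things are assumptions made for illustration: the set of accepted start codons (ATG, GTG, TTG), the `translate_cds` name, and the simplified `transl_except` argument, which maps a 1-based codon position to a one-letter residue rather than using the INSDC qualifier syntax.

```python
from itertools import product

# Codon table in the usual TCAG ordering. The amino-acid assignments of the
# standard code are used; the start codons below are an assumed subset.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
START_CODONS = {"ATG", "GTG", "TTG"}


def translate_cds(dna, transl_except=None):
    """Conceptually translate a complete coding region.

    transl_except maps a 1-based codon position to the residue to use
    instead of the table value, e.g. {2: "U"} for a selenocysteine.
    Raises ValueError if the region lacks a valid start or stop codon,
    or contains an unexplained internal stop.
    """
    transl_except = transl_except or {}
    dna = dna.upper()
    if len(dna) < 6 or len(dna) % 3 != 0:
        raise ValueError("length must be a multiple of 3 with start and stop")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if codons[0] not in START_CODONS:
        raise ValueError("no valid start codon")
    if CODON_TABLE.get(codons[-1]) != "*":
        raise ValueError("no stop codon")

    protein = ["M"]  # the initiator codon is translated as methionine
    for pos, codon in enumerate(codons[1:-1], start=2):
        if codon not in CODON_TABLE:
            raise ValueError(f"ambiguous codon {codon} at position {pos}")
        aa = transl_except.get(pos, CODON_TABLE[codon])
        if aa == "*":
            raise ValueError(f"internal stop at codon {pos}")
        protein.append(aa)
    return "".join(protein)
```

Comparing the returned string with the deposited protein sequence gives the "conceptual translation should match" test; a TGA codon that encodes selenocysteine passes only when it is declared in `transl_except`.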
Protein naming guidelines
Naming of proteins follows the International Protein Nomenclature Guidelines, agreed upon by the European Bioinformatics Institute (EMBL-EBI), the National Center for Biotechnology Information (NCBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB).
Capturing Annotation Methods and Information Sources
The results of genome annotation processes are deposited along with sequence records in the archival databases. The combination of methods and information sources used to create a particular genome annotation is usually detailed in a publication. With increasing numbers of genomes being deposited without an associated scientific publication, it is essential to have a process that captures the methods, the versions of the software and databases used to create a set of annotated features, and the date the annotation was produced.
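One way to picture the information that needs to be captured is as a small structured record stored alongside the annotation. The sketch below is purely illustrative: the `AnnotationProvenance` class, its field names, and the JSON serialization are hypothetical and do not represent any archive's actual format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class AnnotationProvenance:
    """Hypothetical record of how and when an annotation was produced."""
    annotation_date: date
    methods: list      # free-text descriptions of the methods used
    software: dict     # software name -> version
    databases: dict    # database name -> release or version

    def to_json(self):
        """Serialize the record, writing the date in ISO 8601 form."""
        record = asdict(self)
        record["annotation_date"] = self.annotation_date.isoformat()
        return json.dumps(record, sort_keys=True)
```

A record such as `AnnotationProvenance(date(2020, 1, 31), ["ab initio gene prediction"], {"example-annotator": "1.0"}, {"example-protein-db": "2020_01"})`, with placeholder names and versions, would then travel with the annotated features even when no publication exists.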
Annotation assessment tools
NCBI has committed to producing annotation assessment tools that help submitters find problems with genome annotations. These tools are used during the GenBank submission process and in the Prokaryotic Genome Annotation Pipeline, and they are also available separately:
- The Discrepancy Report, which runs internal consistency checks without using external databases. It is available in Sequin, as part of the tbl2asn tool, or as a stand-alone command-line tool.
- The subcheck/frameshift tool, which incorporates sequence searches in external databases during annotation assessment in order to find potentially frameshifted genes and other annotation issues. It is available via the web or as a command-line tool.
RefSeq annotation
The Prokaryotic Genome Annotation Pipeline is also used to annotate the vast majority of RefSeq assemblies that meet standards of quality.