PGAP is now available as a stand-alone software package. You can annotate your genomes on your own machine, local cluster or the Cloud! Get started by watching a short video!
NCBI Prokaryotic Genome Annotation Pipeline
The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).
Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.
NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Annotation Pipeline was developed in 2001, and the pipeline is regularly upgraded to improve structural and functional annotation quality (Li W, O'Neill KR et al. 2021; Haft DH et al. 2018; Tatusova T et al. 2016). Structural and functional annotation uses Protein Family Models, a hierarchical collection of evidence composed of Hidden Markov Model-based and BLAST-based protein families (HMMs and BlastRules) and Conserved Domain Database architectures (CDDs). HMMs, BlastRules and CDDs are used to assign names, gene symbols, publications and EC numbers to the prokaryotic RefSeq proteins that meet the criteria for inclusion in a family. HMMs and BlastRules also contribute to structural annotation.
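As a rough illustration of how a hierarchical collection of evidence can drive functional annotation, the sketch below picks one family hit per protein according to a fixed ranking of evidence types. The ranking in `ASSUMED_PRECEDENCE`, the `FamilyHit` record and the `pick_annotation` function are all hypothetical and exist only for this example; they are not PGAP's actual data model or precedence rules.

```python
from dataclasses import dataclass
from typing import List, Optional

# ASSUMPTION: illustrative ranking only, not PGAP's real precedence.
# A lower number means the evidence type is preferred.
ASSUMED_PRECEDENCE = {"BlastRule": 0, "HMM": 1, "CDD": 2}


@dataclass
class FamilyHit:
    """A protein's match to one family model (hypothetical record)."""
    evidence_type: str               # "BlastRule", "HMM" or "CDD"
    product_name: str
    gene_symbol: Optional[str] = None
    ec_number: Optional[str] = None


def pick_annotation(hits: List[FamilyHit]) -> Optional[FamilyHit]:
    """Return the hit with the most preferred evidence type, or None."""
    # Ignore evidence types that the assumed ranking does not cover.
    known = [h for h in hits if h.evidence_type in ASSUMED_PRECEDENCE]
    if not known:
        return None
    # min() keeps the first hit when two hits share the same rank.
    return min(known, key=lambda h: ASSUMED_PRECEDENCE[h.evidence_type])


# Placeholder hits, not real annotation data.
hits = [
    FamilyHit("CDD", "placeholder product from a CDD architecture"),
    FamilyHit("HMM", "placeholder product from an HMM", gene_symbol="abcA"),
]
best = pick_annotation(hits)
```

Under the assumed ranking, `best` is the HMM hit, so its name and gene symbol would be the ones propagated to the protein.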
GenBank
The NCBI prokaryotic annotation pipeline is available as a stand-alone software package that you can run yourself to produce annotated genomes ready for submission to GenBank. It is also a service for GenBank submitters that can be requested at submission. The pipeline is capable of annotating both complete genomes and draft WGS genomes consisting of multiple contigs.
Both WGS and non-WGS genomes, including gapless complete bacterial chromosomes, can be submitted via the Submission Portal. You will be asked to choose whether the genome being submitted is considered WGS or not. The differences for GenBank purposes are:
non-WGS:
- Each chromosome is in a single sequence and there are no extra sequences.
- Each sequence in the genome must be assigned to a chromosome, plasmid or organelle.
- Plasmids and organelles can still be in multiple pieces.
WGS:
- One or more chromosomes are in multiple pieces and/or some sequences are not assembled into chromosomes.
In both cases:
- There can still be gaps within the sequences; you will supply that information in the submission.
- Plasmids and organelles can still be in multiple pieces.
- Internal sequences must be arranged in the correct order and orientation.
- Sequences concatenated in unknown order are not allowed.
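The WGS versus non-WGS criteria above can be sketched as a small decision function. This is only a minimal illustration of the listed rules: the `Sequence` record, its field names and `classify_submission` are hypothetical and are not part of the Submission Portal or any NCBI tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ASSIGNED_TYPES = ("chromosome", "plasmid", "organelle")


@dataclass
class Sequence:
    """One submitted sequence (hypothetical record for this sketch)."""
    name: str
    # "chromosome", "plasmid", "organelle", or None if unassigned
    replicon_type: Optional[str]
    # Label of the replicon the sequence belongs to, e.g. "chr1"
    replicon_name: Optional[str] = None


def classify_submission(sequences: List[Sequence]) -> str:
    """Return "WGS" or "non-WGS" according to the criteria listed above."""
    if not sequences:
        raise ValueError("a submission needs at least one sequence")
    pieces_per_chromosome: Dict[Optional[str], int] = {}
    for seq in sequences:
        # A sequence not assigned to a chromosome, plasmid or organelle
        # means the genome is WGS.
        if seq.replicon_type not in ASSIGNED_TYPES:
            return "WGS"
        if seq.replicon_type == "chromosome":
            count = pieces_per_chromosome.get(seq.replicon_name, 0)
            pieces_per_chromosome[seq.replicon_name] = count + 1
    # A chromosome in more than one piece means the genome is WGS.
    # Plasmids and organelles may be in multiple pieces either way.
    if any(count > 1 for count in pieces_per_chromosome.values()):
        return "WGS"
    return "non-WGS"


complete = [
    Sequence("seq1", "chromosome", "chr1"),
    Sequence("seq2", "plasmid", "pA"),
    Sequence("seq3", "plasmid", "pA"),  # plasmid in two pieces is allowed
]
draft = [
    Sequence("contig1", "chromosome", "chr1"),
    Sequence("contig2", "chromosome", "chr1"),
]
```

Here `classify_submission(complete)` returns "non-WGS" and `classify_submission(draft)` returns "WGS". The sketch does not check gaps, ordering or orientation, which apply to both kinds of submission.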
Submission is through the Genome Submission Portal. See the genome submission instructions page for details.
RefSeq
All RefSeq bacterial and archaeal genomes, with the exception of RefSeq Prokaryotic Reference Genomes, are annotated using NCBI's prokaryotic genome annotation pipeline. Additional information on this policy is available here:
- RefSeq Prokaryotic Genomes
- Assemblies excluded from RefSeq
- RefSeq Prokaryotic Genomes Re-annotation project
For information about RefSeq eukaryotic genomes, see Eukaryotic Genome Annotation.
Questions about RefSeq prokaryotic genomes: [email protected]
References
Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028. doi: 10.1093/nar/gkaa1105. PMID: 33270901
Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O'Neill K, Li W, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu F, Marchler GH, Song JS, Thanki N, Yamashita RA, Zheng C, Thibaud-Nissen F, Geer LY, Marchler-Bauer A, Pruitt KD. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860. doi: 10.1093/nar/gkx1068. PMID: 29112715
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016 Aug 19;44(14):6614-24. doi: 10.1093/nar/gkw569. PMID: 27342282