NCBI Prokaryotic Genome Annotation Pipeline Release Notes
References for Third Party Software
- tRNAscan-SE PMID:34417604
- hmmer hmmer
- CRISPRCasFinder PMID:29790974
- AntiFam PMID:22434837
- Rfam PMID:29927072
- GeneMarkS2 PMID:29773659
- infernal PMID:24008419
- Miniprot PMID:36648328
*Version 6.9 November 18, 2024*
- Software updates
- CRISPR Identification: CRISPRCasFinder has replaced PILER-CR-based CRISPR identification.
- Miniprot: Minor runtime and annotation improvements due to miniprot parameter tuning.
Third Party Software versions used:
- tRNAscan-SE 2.0.12
- hmmer v.3.4
- CRISPRCasFinder 4.3.2
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal 1.1.5
- Miniprot 0.13
*Version 6.8 August 12, 2024*
- In order to improve pipeline scalability and maintainability, NCBI PGAP now uses Miniprot for protein-to-genome alignments; PMID:36648328.
- NCBI has worked hard to minimize the adverse effects of the switch in algorithms and does not expect any disruption in the quality of our annotation calls. After extensive testing on a broad range of taxa, we conclude that PGAP 6.8 perfectly reproduces 98.6% of the protein models produced by PGAP 6.7, with the vast majority of the remaining differences confined to small changes in start site selection. On average, we expect such changes to affect approximately 40 models per assembly.
- Update to CDD 3.21 in Protein Family Models
Third Party Software versions used:
- tRNAscan-SE 2.0.12
- hmmer v.3.4
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal 1.1.5
*Version 6.7 March 2024*
- Third party software updates
- hmmer v.3.4
- infernal 1.1.5
- Pfam release 36 is being used for help in structural and functional annotation
- Incorporating GeneOntology 2024-01-17 changes to update GO terms
Third Party Software versions used:
- tRNAscan-SE 2.0.12
- hmmer v.3.4
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal 1.1.5
*Version 6.6 August 2023*
- No gene or other feature annotated on spans identified as foreign contaminant by FCS-GX https://github.com/ncbi/fcs/wiki/FCS-GX
- Lowered pseudogene false positive rate by improving protein alignment handling during structural annotation
- Designed new hidden Markov models (HMMs) for validated small proteins, for improving structural annotation
- Adopted Pfam release 35, for help in structural and functional annotation
- Added CheckM completeness cut-offs to validate annotation. An annotated assembly will only be added to the RefSeq collection if it meets the following criteria:
- For species with more than 1000 assemblies in RefSeq, the completeness is higher than the species average completeness minus 3 times the standard deviation
- For species with 10-1000 assemblies in RefSeq, the completeness is higher than the smaller of 90% or the species average completeness minus 3 times the standard deviation
- No CheckM cutoff is applied if there are fewer than 10 assemblies in the species
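The criteria above can be read as a simple decision rule. The following is a minimal sketch, not PGAP code: the function and parameter names are hypothetical, and it assumes that completeness, the species average, and the standard deviation are all expressed in percent.

```python
from typing import Optional


def checkm_cutoff(n_assemblies: int, mean_completeness: float,
                  sd_completeness: float) -> Optional[float]:
    """Return the completeness (%) an assembly must exceed, or None if no cutoff applies.

    All names here are illustrative assumptions, not part of PGAP.
    """
    if n_assemblies < 10:
        # Fewer than 10 assemblies in the species: no CheckM cutoff.
        return None
    species_floor = mean_completeness - 3 * sd_completeness
    if n_assemblies > 1000:
        # More than 1000 assemblies: species average minus 3 standard deviations.
        return species_floor
    # 10-1000 assemblies: the smaller of 90% and the species-based value.
    return min(90.0, species_floor)


def passes_checkm(completeness: float, n_assemblies: int,
                  mean_completeness: float, sd_completeness: float) -> bool:
    """True if the assembly meets the completeness criterion for its species."""
    cutoff = checkm_cutoff(n_assemblies, mean_completeness, sd_completeness)
    return cutoff is None or completeness > cutoff
```

For example, under these assumptions a species with 500 RefSeq assemblies, an average completeness of 98%, and a standard deviation of 1% would have a cutoff of 90%, because 90 is smaller than 98 - 3 = 95.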
Third Party Software versions used:
- tRNAscan-SE 2.0.12
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 6.5 March 2023*
- Adding CDD attributes to genomes and proteins
Third Party Software versions used:
- tRNAscan-SE 2.0.12
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 6.4 December 2022*
- More stringent filtering of alignments of trusted proteins, resulting in improvements in the structural annotation of long proteins
- Upgrade to tRNAscan-SE 2.0.12
- Changes in data used for functional annotation:
- Incorporation of GeneOntology 2022-11-03 changes
- Switch to CDD 3.20 architectures
Third Party Software versions used:
- tRNAscan-SE 2.0.12
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 6.3 September 2022*
- More stringent filtering of low quality alignments, resulting in better annotation of long proteins
Third Party Software versions used:
- tRNAScan-SE v.2.0.9
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 6.2 July 2022*
- Update to the structural annotation algorithm: increased trust in HMM alignments resulting in better choice of start sites
- Lowering of the length threshold for accepted ab initio hypothetical models from 45 to 40 aa
- Update tRNAScan-SE from v.2.0.7 to 2.0.9
Third Party Software versions used:
- tRNAScan-SE v.2.0.9
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 6.1 March 2022*
Maintenance update only
Third Party Software versions used:
- tRNAScan-SE v.2.0.7
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 6.0 February 2022*
New Features:
- Addition of GO terms to annotation
- New Rfam models added
Third Party Software versions used:
- tRNAScan-SE v.2.0.7
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 5.3 September 2021*
New Features:
- Updates to the structural annotation algorithm to allow future extensibility
Third Party Software versions used:
- tRNAScan-SE v.2.0.7
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 5.2 May 2021*
New Features:
- Using gene orthology to map gene symbols for a limited set of species
- Map gene symbols from the Escherichia coli, Mycobacterium tuberculosis, Acinetobacter pittii, Bacillus subtilis and Campylobacter jejuni reference genomes to genomes in the same species for genes where PGAP does not provide gene symbols
- Parameters: gene coverage of both the reference genome gene and the target gene is >0.9, similarity is >0.8 and PGAP product name is not hypothetical protein
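The parameters above amount to a filter on candidate ortholog pairs. The sketch below illustrates that filter only; the `OrthologPair` record and its field names are hypothetical, and how PGAP computes coverage and similarity is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrthologPair:
    """Hypothetical record for one reference gene / target gene pair."""
    ref_symbol: str
    ref_coverage: float       # fraction of the reference gene covered
    target_coverage: float    # fraction of the target gene covered
    similarity: float         # similarity between the two genes, 0-1
    target_product: str       # PGAP product name of the target gene
    target_symbol: Optional[str] = None  # symbol already provided by PGAP, if any


def symbol_to_transfer(pair: OrthologPair) -> Optional[str]:
    """Return the reference gene symbol to map onto the target, or None."""
    if pair.target_symbol:
        # Mapping applies only where PGAP does not provide a gene symbol.
        return None
    if pair.target_product.strip().lower() == "hypothetical protein":
        return None
    if (pair.ref_coverage > 0.9 and pair.target_coverage > 0.9
            and pair.similarity > 0.8):
        return pair.ref_symbol
    return None
```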
Third Party Software versions used:
- tRNAScan-SE v.2.0.7
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 5.1 February 2021*
New Features:
- Upgrades to third party software. tRNAScan-SE v.2.0.4 to v.2.0.7, Rfam v.12.0 to v.14.4 and GeneMarkS2-v.1.10_1.17 to v.1.14_1.25
Third Party Software versions used:
- tRNAScan-SE v.2.0.7
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.14.4
- GeneMarkS2-v.1.14_1.25
- infernal v.1.1.1
*Version 5.0 December 2020*
New Features:
- Improved performance by restricting Blast searches of non-plasmid candidate ORFs and final models to taxonomic-order-specific protein cluster representatives. Models on plasmid sequences continue to be searched against the unrestricted database of protein cluster representatives. As with previous versions, all non-plasmid and plasmid candidate ORFs and final models are searched against the entire collection of BlastRules. No loss in sensitivity or specificity of annotation output was observed with this change.
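The search scoping described above can be summarized in a few lines. This is an illustrative sketch under stated assumptions: the function name, the dictionary keys, and the string labels are hypothetical and do not correspond to actual PGAP database names.

```python
def blast_search_targets(on_plasmid: bool, taxonomic_order: str) -> dict:
    """Describe which protein sets a candidate ORF or final model is searched against.

    Labels are placeholders chosen for this example, not PGAP identifiers.
    """
    if on_plasmid:
        # Plasmid models: unrestricted database of protein cluster representatives.
        cluster_scope = "all"
    else:
        # Non-plasmid models: representatives specific to the taxonomic order.
        cluster_scope = taxonomic_order
    # Both plasmid and non-plasmid models are searched against all BlastRules.
    return {"cluster_representatives": cluster_scope, "blast_rules": "all"}
```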
Third Party Software versions used:
- tRNAScan-SE v.2.0.4
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
*Version 4.13 September 2020*
New Features:
- Identification of 16S and 23S rRNA by infernal cmsearch against Rfam SSU and LSU models for bacteria and archaea. This replaces a BLAST-based search against a manually curated NCBI database of ribosomal RNAs.
- Improved annotation of circular sequences with cross-origin CDSs.
- Fixed runtime performance regression.
- Program changes to correct minor issues and improve performance.
Third Party Software versions used:
- tRNAScan-SE v.2.0.4
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.12 July 2020*
New Features:
- Upgrade to sc24.
- Removal of PMIDs that have been retracted from evidence used for annotation. WP accessions will be updated to reflect the changes in evidence.
- Program changes to correct minor issues and improve performance.
Third Party Software versions used:
- tRNAScan-SE v.2.0.4
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.11 Jan 2020*
New Features:
- Removal of soon-to-be retired Reference Genomes from PGAP structural analysis.
- Use reference proteins in structural annotation at level of genus. In structural annotation, higher weight is given to the alignments of proteins on the reference genome(s) available at the genus level, if any, than to other protein alignments. This is a change compared to prior PGAP software where alignments of proteins on the reference genome(s) in the same clade as the annotated organism were given higher weight.
Third Party Software versions used:
- tRNAScan-SE v.2.0.4
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.10 Oct 2019*
New Features:
- Improvement in accuracy of gene calls for small proteins due to increased weight for GeneMark annotation
Third Party Software versions used:
- tRNAScan-SE v.2.0.4
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.9 July 2019*
New Features:
- Updated tRNAscan
- Added 17 Rfam models
- Increased allowed overlaps with CDSs for riboswitches and misc_binding features
Third Party Software versions used:
- tRNAScan-SE v.2.0.4
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.8 March 2019*
New Features:
- Evidence attributes and structured comments are added to RefSeq protein records
Third Party Software versions used:
- tRNAScan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.7 November 2018*
New Features:
- GeneMarkS2+ now being used for ab initio gene prediction
- The naming set for intergenic proteins has been improved
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-2-1.10
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.6 July 2018*
New Features:
- Addition of SPARCLE architecture for protein naming
- Addition of family, subfamily and domain level curated HMM for protein naming
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-4.25
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.5 March 2018*
New Features:
- Use gathering threshold for Pfam Hmms
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-4.25
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.4 December 2017*
New Features:
- Turn on BLAST Rules in PGAP
- Fix split seq-loc in pseudo from programmed frameshift
- Reproduce all short reference proteins
- Use all equivalog HMMs for naming in PGAP
- Fix CDSs spanning gaps
- Use AMR HMMs for naming
- Use Blast Rules for structural annotation
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- GeneMarkS-4.25
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.3 October 2017*
New Features:
- Implementation of Blast Rules for functional annotation
- Implementation of programmatic frameshift for transposases
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.2 May 2017*
New Features:
- PGAP 4.2 is a point release with minimal change from PGAP 4.1
- Groundwork to support BLAST Rules annotation has been implemented. BLAST Rules is a system that defines precise criteria (coverage, % identity) for deciding that a query protein matches a target protein, according to parameters found in the output of a BLAST search. BLAST Rules may be created with extremely restrictive match criteria, and used to distinguish among very closely related proteins.
- Performance improvements have been made for protein matching in evidence searches
- Markup for programmatic frameshifts for transposases has been instituted
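To make the BLAST Rules idea concrete, the following sketch shows how a rule with coverage and percent identity thresholds might be checked against a BLAST hit. The classes, field names, and the choice to apply the coverage threshold to both query and target are assumptions made for illustration; the actual rule format is not described in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRule:
    """Hypothetical rule: thresholds a hit must meet to count as a match."""
    name: str
    min_identity: float   # percent identity, 0-100
    min_coverage: float   # percent coverage, 0-100 (applied to query and target here)


@dataclass(frozen=True)
class BlastHit:
    """Hypothetical summary of one BLAST alignment."""
    identity: float
    query_coverage: float
    target_coverage: float


def hit_satisfies_rule(hit: BlastHit, rule: BlastRule) -> bool:
    """True if the hit meets every threshold defined by the rule."""
    return (hit.identity >= rule.min_identity
            and hit.query_coverage >= rule.min_coverage
            and hit.target_coverage >= rule.min_coverage)
```

A rule with very high thresholds (for example 99% identity and 95% coverage in this sketch) accepts only near-identical proteins, which is how restrictive criteria can separate very closely related proteins.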
Bugs Fixed:
- Fixed issues arising from the RefSeq reannotation execution
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.1 February 2017*
New Features:
- Improvements to the ORF+HMM approach:
- Tuned ability to create protein based on HMM evidence alone, picking appropriate frame
- Tuned cutoffs for extensions of ORFs given evidence
- Increased our dependence on protein evidence from reference genomes. After extensive review of the annotation on existing reference genomes, a small number were removed from our evidence set based on lower-quality annotation products.
- Added prediction of selenoproteins based on homology to trusted selenoprotein families (with thanks to Dr. Yan Zhang, PMID:26800233)
- Extensive cleanup of specific protein families in our evidence set, including large revisions to how we handle transposases. In our review, a large number of fragmentary proteins produced by PGAP resulted from poorly defined transposase evidence; our data cleanup focused on preserving high-quality full-length evidence to support better quality predictions.
Bugs Fixed:
- Significant cleanup of our naming set proteins. We anticipate continued improvements in pgap-4.1 over the course of the next several months
- Eliminated many short partial transposase fragments
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 4.0 November 2016*
New Features:
- Introduce new gene finding algorithm dependent on identifying ORFs with support from any HMM, including domain-specific Pfam HMMs
- Elimination of dependence on core protein evidence
- Introduce new riboswitch markup with extended properties based on Rfam predictions
- Expanded and updated antimicrobial resistance identification HMMs
- Improved cutoffs for acceptance of PRK evidence HMMs
Bugs Fixed:
- Significant cleanup of our naming set proteins.
- Eliminated many short partial transposase fragments
- Eliminated several conflicting erroneous frame translations
- Adjusted gene selection algorithm to allow preservation of duplicate evidence on genomes. Previously, the best placement algorithm would have favored a single best placement for any protein; as of PGAP-4.0, multiple placements are permitted, improving annotation for known redundant and repetitive proteins.
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 3.3 May 2016*
New Features:
- Software upgraded to latest NCBI C++ toolkit production code; this does not impact annotation results
Bugs Fixed:
- Fixed a small number of process flow bugs; this does not impact annotation results
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
*Version 3.2 April 2016*
This software update was released for RefSeq production use in late April, and released for GenBank production use in early May.
New Features:
- Naming changes: we are now using trusted and curated equivalog HMMs for naming in preference to protein clusters
- Addition of a curated protein sequence list to the initial protein search
Bugs Fixed:
- Bugs fixed in ANI reporting and taxonomic checks
- Fixed bugs in handling of Seq-ids for GenBank submissions, particularly affecting updates to existing records
- Updates to filtering of partial alignments, improving calls to partial features
- Removed inappropriate use of transl_except for the start codon; reporting partial feature instead.
- Partial features near gaps or contig ends are extended to the boundary if possible.
Third Party Software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- infernal v.1.1.1
- TIGRfam 15.0 (for naming)
Version 3.1 January 2016
New Features:
- Expanded list of Rfam models used to 33 total models:
- 5S_rRNA
- 6S
- 6S-Flavo
- Archaea_SRP
- Bacteria_large_SRP
- Bacteria_small_SRP
- Cobalamin
- FMN
- Glycine
- Hammerhead_II
- MOCO_RNA_motif
- PreQ1
- Purine
- RNaseP_arch
- RNaseP_bact_a
- RNaseP_bact_b
- RprA
- RtT
- SAH_riboswitch
- SAM
- SAM-IV
- SAM_V
- SAM_alpha
- TPP
- alpha_tmRNA
- beta_tmRNA
- c-di-GMP-I
- c-di-GMP-II
- cyano_tmRNA
- preQ1-II
- sX9
- snoPyro_CD
- tmRNA
- Added taxonomic checks to annotation outputs:
- Added check of assembly to type strain assemblies (k-mer check and ANI check)
- Added check of identified proteins to taxa of equivalent WP proteins
- Added check of identified 16S rRNA to reference 16S rRNA data set
- Added check of identified universal markers to markers assigned to prokaryotic clades
Third party software used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.12.0
- infernal v.1.1.1
Version 3.0 July 2015
Changes:
- Fixed numerous bugs in handling of protein alignments, resulting in better prediction of coding genes based on evidence
- Removed many partial proteins, replacing the proteins with pseudo-CDSs. This change primarily affects partial proteins produced in the middle of contigs.
- Significant clean-up of functional protein evidence, based on review of functional elements within our existing protein clusters
Third party software versions used:
- tRNAscan-SE v.1.21
- hmmer v.3.1b2
- CRISPR v.1.02
- AntiFam v.3.0
- Rfam v.11.0
Version 2.10
Version 2.9 November 2014
Several new features added including: ORF finder used as last resort for long unannotated regions; cross-origin CDS annotated as two partial CDS features
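As an illustration of representing a cross-origin CDS as two partial features, the sketch below splits a plus-strand interval that wraps around the origin of a circular sequence. It is a simplified assumption-based example: the function name is hypothetical, coordinates are taken to be 1-based and inclusive, and minus-strand features and partiality flags are not handled.

```python
from typing import List, Tuple


def split_cross_origin(start: int, end: int, seq_length: int) -> List[Tuple[int, int]]:
    """Split a plus-strand interval on a circular sequence at the origin.

    Coordinates are 1-based and inclusive. An interval with start > end is
    treated as wrapping past the end of the sequence back to position 1.
    """
    if start <= end:
        # Does not cross the origin: keep as a single interval.
        return [(start, end)]
    # Crosses the origin: one piece up to the sequence end, one from position 1.
    return [(start, seq_length), (1, end)]
```

Under these assumptions, a CDS from position 4900 to 120 on a 5000 bp circular sequence would be returned as the two intervals (4900, 5000) and (1, 120).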
Version 2.8 October 2014
Several new features added including:
- New naming snapshot, correcting most of the previous issues with names
- Many improvements to start site detection, producing more consistent models
- Partials are now padded to gap boundaries within 1-2 nucleotides
- Proteins with internal partial segments are now converted to pseudogenes (previously these were dropped entirely)
Software bug fixed
Version 2.7 August 2014
Several new features were added including:
New protein naming snapshot, incorporating many new clusters and refinements; Significant improvements to protein start site selection. The new algorithm places much more weight on evidence by count of representatives, and produces a better consensus view of the start site; Inclusion of all clusters from well-annotated reference genomes; Modification to feature acceptance criteria to permit core and reference cluster-based proteins to be accepted whenever possible. The net effect significantly improves annotation of very short proteins such as leader peptides.
Version 2.6 June 2014
Several improvements were added including: evidence selection algorithm evaluates all protein evidence to select the start site that maximizes correspondence across the evidence; in case of an absent start or long unaligned tail, the new algorithm favors a partial model instead of a complete model that disagrees with existing evidence; GeneMarkS+ is now used only for pure ab initio predictions
Version 2.5 May 2014
Improved protein name selection: always use cluster names instead of seed protein name
Several software bugs were fixed
Version 2.4 February 2014
New set of Protein Clusters installed in production
Several bugs fixed including: preserve existing locus-tag prefix unconditionally on reannotation; assign locus-tag prefix by BioSample ID if known; annotate scaffolds not contigs when provided a WGS genome
Version 2.3 November 2013
Annotate contigs not scaffolds for GenBank assemblies.
Several new features were added including: new evidence selection algorithm designed to provide greater fidelity with higher quality proteins
Version 2.2 October 2013
New features were added including: added support for ncRNA features; introduced two-pass annotation, supporting better frameshift detection.
Fixed software stability issues
Version 2.1 July September 2013
Permit annotation of plasmid only submissions. Fix placement of DBLink descriptors. Trim features at gap boundaries.
Version 2.0 May 2013
Version 2.0 uses protein homology and GeneMarkS+ prediction program.
Features annotated: Gene; CDS; rRNA; tRNA; repeats in CRISPR region
This version does not include: small non-coding RNA (ncRNA)