Ribosomal RNA Sequence Processing at NCBI

Ribosomal RNA sequences are checked for a number of issues before they are accepted for GenBank. You will be notified during submission processing if your sequences have any of the issues listed below. If you have questions, please write to: [email protected] and include your submission number.

Error List

Trimmed Vector
Removed Vector
Trimmed Ends and Ambiguous Sequences
Removed Short Sequences
Removed Long Sequences
Sequences with Low or No Similarity to 16S rRNA
Misassembled Sequences
Chimeric Sequences
Unusual Sequences
Taxonomy Mismatch

Trimmed Vector

Sequences with terminal vector (or adaptor, linker, etc.) contamination are trimmed to remove the contaminating sequence. Sequences are checked for vector by a BLAST search against our vector and UniVec databases. Although a match to vector can arise for a variety of reasons, contamination is a possible cause. To perform a BLAST search against the vector database, go to VecScreen.

Removed Vector

Sequences with internal vector matches, or sequences that match vector across their entire length, are removed. Sequences are checked for vector by a BLAST search against our vector and UniVec databases. To perform a BLAST search against the vector database, go to VecScreen.

Trimmed Ends and Ambiguous Sequences

Terminal NNNs and sequences with a high percentage of ambiguities near the ends of the sequences are trimmed. Sequences with more than 50% ambiguities are removed. Please be sure to trim or remove low quality sequence before submitting sequences to GenBank.
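
As an illustration, a submitter could pre-screen sequences along these lines before submitting. This is only a minimal sketch, not the code NCBI runs: the function names are hypothetical, any character other than A, C, G, T or U is assumed to count as an ambiguity, and the 50% check is assumed to apply after the terminal Ns are trimmed. Trimming of high-ambiguity regions near the ends is not shown because no cutoff for it is stated above.

```python
def trim_terminal_n(seq: str) -> str:
    """Strip runs of N from both ends of the sequence."""
    return seq.upper().strip("N")


def ambiguity_fraction(seq: str) -> float:
    """Fraction of positions that are not an unambiguous A, C, G, T or U."""
    if not seq:
        return 0.0
    ambiguous = sum(1 for base in seq.upper() if base not in "ACGTU")
    return ambiguous / len(seq)


def screen(seq: str, max_ambiguity: float = 0.5):
    """Return the trimmed sequence, or None if it should be removed.

    Assumption: the ambiguity fraction is measured on the trimmed sequence.
    """
    trimmed = trim_terminal_n(seq)
    if not trimmed or ambiguity_fraction(trimmed) > max_ambiguity:
        return None
    return trimmed
```

For example, screen("NNNACGTNN") returns "ACGT", while screen("ANNNNC") returns None because four of its six positions are ambiguous.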

Removed Short Sequences

Short sequences are automatically removed from your submission. Unassembled sequences from next-generation sequencing platforms should be submitted to the NCBI Sequence Read Archive (SRA).

Removed Long Sequences

Sequences longer than the expected rRNA length are automatically removed. Make sure you have selected the appropriate submission type for your sequences so the sequences are appropriately screened. If the Submission Type form does not list the appropriate sequence type, please use a different submission tool and annotate the appropriate features when you submit.

Sequences with Low or No Similarity to 16S rRNA

Submitters will be contacted regarding sequences with BLAST query coverage less than 90%, sequences with >5% BLAST alignment gaps, or sequences with less than 80% identity to other prokaryotic 16S rRNA sequences. Ribosensor, a program in the Ribovore software package, will report sequences in large-scale 16S rRNA submissions that have a low score when compared to the bacterial HMM profile. Prokaryotic 16S ribosomal RNAs are generally highly conserved, so we expect to see similarity to other rRNA sequences over the entire length. A lack of similarity over the entire length of the sequence may be due to one of the following:

contaminant sequence
low quality sequencing
chimera formation
misassembly of the sequence reads
vector contamination
PCR artifact

These issues may be resolved by simply trimming or removing the sequences listed in the report.
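
The three BLAST thresholds given above can be expressed as a simple check. This is a sketch under stated assumptions rather than NCBI's implementation: the BlastSummary class and its field names are hypothetical, and the values are assumed to be percentages already computed from your own BLAST results. The Ribosensor HMM score is not covered.

```python
from dataclasses import dataclass


@dataclass
class BlastSummary:
    """Hypothetical per-sequence summary of a BLAST result (all in percent)."""
    query_coverage: float  # percent of the query covered by the alignment
    gap_percent: float     # percent of the alignment that is gaps
    identity: float        # percent identity to the matched 16S rRNA sequence


def similarity_flags(hit: BlastSummary) -> list:
    """Return the reasons a sequence would be reported, if any."""
    flags = []
    if hit.query_coverage < 90.0:
        flags.append("query coverage below 90%")
    if hit.gap_percent > 5.0:
        flags.append("alignment gaps above 5%")
    if hit.identity < 80.0:
        flags.append("identity below 80%")
    return flags
```

A sequence with 99% coverage, 0.5% gaps and 97.2% identity produces an empty list; a sequence with 85% coverage, 6% gaps and 75% identity produces all three flags.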

Only prokaryotic 16S ribosomal RNA sequences should be submitted using the 16S rRNA Submission Tool. If you are submitting other types of sequences, you need to use a different submission tool for submitting to GenBank and annotate the appropriate features when you submit.

Misassembled Sequences

Submitters will be contacted regarding sequences identified as misassembled by BLAST or Ribosensor. Misassembly is often caused by incorrectly ordering the sequence fragments, mixing plus and minus strand fragments, or incorrectly joining non-overlapping sequence reads.

Chimeric Sequences

Submitters will be contacted regarding sequences identified as chimeric. Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together. This often occurs during PCR with mixed templates (e.g., uncultured environmental samples). Incomplete extension during PCR allows a partially extended strand to bind, in a subsequent cycle, to the template of a different but similar sequence. This partially extended strand then acts as a primer and is extended to form a chimeric sequence. Once created, the chimeric sequence is further amplified in subsequent cycles. The end result is a PCR artifact that does not represent a sequence that exists in nature.

Studies have estimated that as many as 30% of the sequences from mixed template environmental samples may be chimeric. While chimera formation is most common in mixed template amplifications, in practice it is also seen at lower frequency in supposedly pure cultures.

A number of tools are available to detect chimeric sequences. NCBI uses Uchime in reference database mode to scan for chimeras. NCBI has optimized the Uchime parameters to find chimeras that are >3% diverged from the closest parent and therefore tend to produce spurious OTUs (Operational Taxonomic Units) and degrade diversity estimates and taxonomic predictions.

Accurate representations of biological diversity are not possible with data containing chimeras and other artifacts. The entire community must work together to prevent these artifact sequences from polluting the public databases.

Unusual Sequences

This error is reported when a sequence lacks at least one type of nucleotide (A, T, G, or C). It is highly unusual for a sequence not to contain all four types of nucleotides. If your sequence is missing one of these four nucleotides, it is likely an artifact or a low quality sequence and should be removed from the submission.
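
This check is easy to reproduce before submitting. The following is a minimal sketch, assuming that U is treated as equivalent to T and that case is ignored; the function name is hypothetical.

```python
def missing_nucleotides(seq: str) -> list:
    """Return the nucleotides (A, C, G, T) that never occur in the sequence."""
    present = set(seq.upper().replace("U", "T"))
    return sorted(set("ACGT") - present)
```

An empty list means all four nucleotides are present. For example, missing_nucleotides("AAGGAAGG") returns ["C", "T"], so that sequence would be reported as unusual.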

Taxonomy Mismatch

Submitters will be contacted regarding possible source organism identification errors. If you receive this error, the source organism(s) for one or more of your submissions may be misidentified based on comparison with 16S ribosomal RNA from prokaryotic type strains. Check and correct the organism name for each sequence in your submission with this error. There are several possible reasons for this error:

[1] You have misidentified the source organism. To perform a BLAST search against the type strain database similar to the search we performed, go to:

https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch

Under 'Choose Search Set', first select the rRNA/ITS databases radio button, then change the Database from 'Nucleotide collection (nr/nt)' to '16S ribosomal RNA sequences (Bacteria and Archaea)'.

The following short video shows how to use the type strain database to help you identify the source organism of a 16S rRNA sequence: https://www.youtube.com/watch?v=1q8MlSheJPc&feature=youtu.be

[2] You have misspelled the organism name.
Please check the correct spelling in our taxonomy database:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

or check the spelling in the BLAST results using the link in [1].

[3] The organism is from a recently published genus (not yet added to the 16S rRNA type strain database). Please write to [email protected] with your SUB# and a link to the gen. nov. publication.

[4] The sequence is from a misplaced genus (e.g., a [Clostridium] or [Pseudomonas] species).
Square brackets ([ ]) indicate that the name awaits appropriate action by the research community to be transferred to another genus.

Use the information above to fix this error and provide the correct organism names for your submission. If you still have questions, write to [email protected] with your SUB# in the email subject line and a brief explanation.
