Targeted Locus Study (TLS) Submission Guide
Prerequisites
- Submission of sequence reads to SRA is highly recommended.
- BioProject and BioSample IDs.
- The NCBI Submission Portal includes steps for BioProject and BioSample registration during the submission of large-scale prokaryotic 16S ribosomal RNA (rRNA) TLS projects. Previously assigned BioProject and BioSample IDs can also be provided within the submission wizard.
- Study and sample metadata must be submitted to BioProject and BioSample prior to submission of other types of TLS projects. If sequence reads were already submitted, provide the assigned BioProject and BioSample IDs during submission.
Creating the TLS submission file
[1] Sequences
- Each individual FASTA file should include sequences derived from a single large-scale study. The total number of sequences across all FASTA files that are part of a single project should be greater than 2,500.
- A unique ID should be included for each sequence within the fasta file. These IDs will be included within the Definition line on the GenBank flatfile for each individual sequence. This sequence ID can represent the OTU, phylotype, or other unique sequence identifier.
- Remove vector contamination, chimeras, low-quality sequence, and questionable data from your sequences before submitting.
- More information about the format for the fasta file and sequence requirements can be found here.
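The ID and sequence-count requirements above can be checked locally before submission. The following is a minimal sketch, not an NCBI tool: the function names are hypothetical, the ID is assumed to be the first whitespace-delimited token of each header line, and uniqueness is checked across all files of a project, which is stricter than the per-file wording above.

```python
from collections import Counter
import io

MIN_TOTAL_SEQUENCES = 2500  # the guide asks for more than 2,500 sequences per project


def read_fasta_ids(handle):
    """Return the ID (first token after '>') of every record in one FASTA file."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def check_project(fasta_handles):
    """Return a list of problems found across all FASTA files of one project."""
    all_ids = []
    for handle in fasta_handles:
        all_ids.extend(read_fasta_ids(handle))

    problems = []
    # Assumption: IDs are compared across the whole project, not per file.
    duplicates = sorted(i for i, n in Counter(all_ids).items() if n > 1 and i)
    if duplicates:
        problems.append("duplicate sequence IDs: " + ", ".join(duplicates))
    if "" in all_ids:
        problems.append("at least one record has an empty ID")
    if len(all_ids) <= MIN_TOTAL_SEQUENCES:
        problems.append(
            f"only {len(all_ids)} sequences; more than {MIN_TOTAL_SEQUENCES} expected"
        )
    return problems


if __name__ == "__main__":
    # Tiny in-memory example; real use would pass open file handles instead.
    example = io.StringIO(">otu1 sample A\nACGT\n>otu2\nGGCC\n>otu1\nTTAA\n")
    for problem in check_project([example]):
        print(problem)
```

This check covers only IDs and counts. Vector, chimera, and quality screening still need dedicated tools.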
[2] Project Information
Include a description of the large-scale/TLS study within a BioProject. The BioProject ID will be included on the TLS master flatfile and provides a single link to all the data types that are part of the project. The same BioProject ID should be included with all submissions that are part of the same study.
[3] Source Information
Source metadata should be included within BioSample using the appropriate package. Details regarding the requirements for each BioSample package can be found here. The GenBank submission wizard for prokaryotic 16S rRNA sequences within the NCBI Submission Portal allows the creation of BioSamples using these package types:
- Pathogen affecting public health
- Metagenome or environmental sample
- Genome, metagenome or marker sequences (MIxS compliant): MIMS and MIMARKS (survey)
All submissions should include rich contextual information about where the specimen was obtained, including but not limited to: isolation source or host, collection date, geographic location name, and latitude/longitude. Uncultured samples require a metagenome organism name (e.g., marine metagenome) that will be applied to the entire TLS submission. If more descriptive organism names are needed, please send a request to [email protected] prior to submitting your files.
If the sequences in the submission were obtained from multiple BioSamples, a tab-delimited mapping file is required that lists the BioSample that should be included for each sequence in your submission. This mapping file should include the BioSample Accessions that were assigned if the samples were registered prior to sequence submission. If BioSamples are created within the 16S ribosomal RNA Submission Wizard, the sample names should be used. Currently, only a single BioSample ID can be included per sequence. If multiple BioSample IDs should be included per sequence, please contact [email protected].
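As an illustration of the mapping file described above, the sketch below writes and checks a tab-delimited file with one row per sequence. The two-column layout (sequence ID, then BioSample accession or sample name), the function names, and the accession values are assumptions for illustration only. Confirm the exact layout expected by the submission wizard before use.

```python
import csv
import io


def write_mapping(assignments, handle):
    """Write one 'sequence_id<TAB>biosample' row per sequence (assumed layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for seq_id, biosample in assignments:
        writer.writerow([seq_id, biosample])


def check_mapping(handle, fasta_ids):
    """Return problems: malformed rows, repeated, unmapped, or unknown sequence IDs."""
    seen = {}
    problems = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2 or not row[0] or not row[1]:
            problems.append(f"malformed row: {row!r}")
            continue
        seq_id, biosample = row
        # Only a single BioSample can be given per sequence.
        if seq_id in seen:
            problems.append(f"{seq_id} is listed more than once")
        seen[seq_id] = biosample

    expected = set(fasta_ids)
    mapped = set(seen)
    for seq_id in sorted(expected - mapped):
        problems.append(f"{seq_id} has no BioSample")
    for seq_id in sorted(mapped - expected):
        problems.append(f"{seq_id} is not in the FASTA file")
    return problems


if __name__ == "__main__":
    buffer = io.StringIO()
    # Placeholder accessions, not real records.
    write_mapping([("otu1", "SAMN00000001"), ("otu2", "SAMN00000002")], buffer)
    buffer.seek(0)
    print(check_mapping(buffer, ["otu1", "otu2", "otu3"]))
```

If the BioSamples are created within the wizard, the second column would hold the sample names rather than accessions.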
[4] Features
- The ribosomal RNA submission wizard will incorporate the appropriate feature annotation.
- Other sequence types, such as a single locus or conserved element, should include the appropriate feature type (e.g., gene, misc_feature). More information regarding the options for submitting data to GenBank can be found here.
Submitting TLS Files
- The NCBI Submission Portal GenBank wizard should be used to submit ribosomal RNA sequences.
- Other targeted locus study submissions should be submitted to GenBank using table2asn to create submission files that can be emailed to [email protected].
TLS 16S rRNA Sequence Analysis
Ribosomal RNA sequences are checked for a number of issues before they are accepted for GenBank. A summary of these checks can be found here. These include chimera analysis, vector screens, and sequence length.
Large-scale submissions of 16S rRNA from uncultured prokaryotes that are processed as TLS projects are verified using an additional analysis program, Ribosensor (version 0.27). In this analysis, each sequence is BLASTed against an rRNA reference dataset and also compared to sets of profile HMMs built from representative alignments of SSU rRNA sequences. Each profile HMM is built from a multiple alignment of 50-100 representative sequences from the family. The source of several of the alignments, including the bacterial model, is the Rfam database (rfam.xfam.org). Each sequence is aligned to each profile, and a score is computed based on how well the sequence matches the profile. Each sequence is classified by the model that gives it the highest score. The BLAST and profile HMM results are combined, and sequences with unexpected features are reported. Examples of unexpected features include low scores, low coverage, and duplicated regions suggestive of misassembly.
You will be notified during submission processing if your sequences have any of these issues. If you have questions, please write to: [email protected] and include your submission number.
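The classification rule described above (each sequence is assigned to the model that gives it the highest score) can be illustrated with a short sketch. This is not Ribosensor code: the function names, model names, score values, and the low-score threshold are all hypothetical, and the real program combines these results with BLAST output and further checks.

```python
def classify(scores_by_model):
    """Return (model, score) for the model with the highest score."""
    if not scores_by_model:
        raise ValueError("no model scores supplied")
    best_model = max(scores_by_model, key=scores_by_model.get)
    return best_model, scores_by_model[best_model]


def flag_low_scores(scores_by_sequence, min_score):
    """Return IDs of sequences whose best score falls below a hypothetical threshold."""
    flagged = []
    for seq_id, scores in scores_by_sequence.items():
        _, best_score = classify(scores)
        if best_score < min_score:
            flagged.append(seq_id)
    return flagged


if __name__ == "__main__":
    # Made-up scores for two sequences against two made-up models.
    scores = {
        "otu1": {"bacteria": 812.4, "archaea": 305.0},
        "otu2": {"bacteria": 41.2, "archaea": 38.9},
    }
    print(classify(scores["otu1"]))
    print(flag_low_scores(scores, min_score=100.0))
```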
Updating TLS Submissions
- If you are updating a publication, send the TLS accession prefix and complete publication information in the text portion of an email to [email protected].
- If you are updating any other information, do not create a new submission. Please contact [email protected] for directions and include the following information with your request:
- Description of your update
- TLS Accession prefix
We will send instructions on how to proceed with the requested update.