Adding a Structured Comment to GenBank Submissions
Introduction
GenBank records consist primarily of nucleotide sequence data, source organism information, and sequence features. The organism and feature description are based on a controlled list of organism modifiers (such as isolate, strain, clone, and specimen voucher) and features (such as CDS, rRNA, and gene).
However, many sequence submitters also have additional organism metadata that does not fit easily into the controlled list but that is significant for the complete description of a sequence's source and allows for comparisons of sequences isolated from similar locations.
To collect and display such additional metadata in sequence records, GenBank has developed a Structured Comment. The comment consists of tag-value pairs that are contained within START and END tags that function as delimiters for easy parsing. These comments can be incorporated from a tab-delimited table into submission files using table2asn (the replacement for the older tbl2asn). An example of a GenBank record that includes a structured comment is GU949562.
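As an illustration of why the START and END tags make parsing easy, the following minimal sketch extracts tag-value pairs from comment text. It assumes the flat-file layout `##<Prefix>-START##`, then `tag :: value` lines, then `##<Prefix>-END##`. The sample text and the function name `parse_structured_comments` are hypothetical and are not taken from a real record or from any NCBI tool.

```python
import re

# Assumed layout (not specified in this guide): "##<Prefix>-START##",
# "tag :: value" lines, then "##<Prefix>-END##". Illustrative sample only.
SAMPLE_COMMENT = """\
##Genome-Assembly-Data-START##
Assembly Method       :: Newbler v. 2.3
Genome Coverage       :: 121x
Sequencing Technology :: Illumina GAIIx
##Genome-Assembly-Data-END##
"""

# The END tag must carry the same prefix as the START tag.
BLOCK_RE = re.compile(
    r"##(?P<prefix>[^#\n]+?)-START##(?P<body>.*?)##(?P=prefix)-END##",
    re.DOTALL,
)


def parse_structured_comments(text):
    """Return {prefix: {tag: value}} for every START/END block in text."""
    comments = {}
    for match in BLOCK_RE.finditer(text):
        pairs = {}
        for line in match.group("body").splitlines():
            if "::" not in line:
                continue  # skip blank or unrelated lines
            tag, value = line.split("::", 1)
            pairs[tag.strip()] = value.strip()
        comments[match.group("prefix")] = pairs
    return comments
```

Calling `parse_structured_comments(SAMPLE_COMMENT)` yields one dictionary per delimited block, keyed by the prefix found in the START tag.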
This guide explains how to include structured comments with your sequence submission. However, note that several GenBank submission tools prompt submitters to provide the metadata required to create certain structured comments for particular types of data, as explained below.
If you do not understand any of the instructions presented here or you have questions, please contact GenBank User Services at [email protected] prior to creating your submission.
Including Structured Comments Within GenBank Submissions
In order to include unique metadata within the structured comment, you need to create a tab-delimited table in one of two ways depending on how the data should be applied to the sequences in your submission. Any scientific unit of measurement (e.g., deg C or km) should be included with the value.
[1] Adding the same structured comment to all sequences in your submission
This requires a single tab-delimited table that includes the tag-value pairs that are to be applied to all of the sequences in your submission, for example:
oxygen_content | 32 ppm |
habitat | Black Lake |
temperature | 27 deg C |
sample size | 150 mL |
depth | 10 m |
Once the metadata table is created and saved as plain text, the structured comment can be included using table2asn.
- table2asn: The tab-delimited table needs to be saved as a .cmt file and placed in the same directory as your fasta file. If the .cmt file has the same basename as your fasta file (for example, fasta1.fsa and fasta1.cmt), it will be recognized automatically and the structured comment will be included for all the sequences in your fasta file. Alternatively, you can use any file name for the structured comment file and pass it with the argument -w on the table2asn command line.
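One way to produce such a file is sketched below. The helper `write_single_comment` is a hypothetical name, and the sketch assumes that each line of the .cmt file holds one tag, a tab, and its value, with no header row.

```python
from pathlib import Path


def write_single_comment(fasta_path, pairs):
    """Write tag<TAB>value lines to a .cmt file that shares the FASTA basename.

    Assumption: one tag-value pair per line, tab-separated, no header row.
    """
    cmt_path = Path(fasta_path).with_suffix(".cmt")
    lines = [f"{tag}\t{value}" for tag, value in pairs.items()]
    cmt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return cmt_path
```

For example, `write_single_comment("fasta1.fsa", {"habitat": "Black Lake", "depth": "10 m"})` would create fasta1.cmt next to fasta1.fsa, so the shared basename lets table2asn pick it up as described above.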
[2] Adding a unique structured comment to each sequence in your submission
The format for this type of table is a tab-delimited, multi-column table in which the first column must be the Sequence Identifier used in the .fsa files. The first row of each remaining column is the metadata tag that appears on the left side of the structured comment, for example:
SeqID | investigation_type | project_name | collection_date | depth |
---|---|---|---|---|
A | metagenome | aquatic study | 2007-03-04 | 10 m |
B | metagenome | aquatic study | | 5 m |
C | eukaryote | Analysis of fish | 2008-08-09 | 25 m |
Each sequence in this submission will include a structured comment with unique tag-value pairs. Once the metadata table is created and saved as plain text, the tag-value pairs can be included using table2asn.
See the HIV example below for instructions on the .cmt file format to include a specific prefix for the structured comment.
- table2asn: The tab-delimited table needs to be saved as a .cmt file and included in the same directory as your fasta (and optional .tbl) file. If the .cmt file name has the same basename as your fasta file (for example, fasta1.fsa and fasta1.cmt), the .cmt file will be automatically included so that each sequence in column 1 has the tag-value pairs of that row of the file.
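The multi-column layout can be generated with Python's standard `csv` module, as in the sketch below. The function name `write_per_sequence_comments` is hypothetical, and leaving a cell empty when a sequence has no value for a tag is an assumption rather than documented behavior.

```python
import csv


def write_per_sequence_comments(cmt_path, tags, rows):
    """Write a tab-delimited table with SeqID first and one column per tag.

    rows maps each SeqID to a {tag: value} dict; a missing value is written
    as an empty cell (an assumption about how gaps should be represented).
    """
    with open(cmt_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["SeqID", *tags])
        for seq_id, values in rows.items():
            writer.writerow([seq_id, *(values.get(tag, "") for tag in tags)])
```

The SeqID values must match the sequence identifiers in the fasta file, since that is how each row is matched to its sequence.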
Specialized Structured Comments
[1] MIGS/MIMS/MIMARKS
Minimum information checklists have been developed by the Genomic Standards Consortium (GSC) as a means of reporting core descriptive information about the environment from which an organism(s) was collected. Core descriptors include information about the origins of the nucleic acid sequence (genome), its environment (e.g., latitude and longitude, date and time of sampling, habitat) and sequence processing (sequencing and assembly methods).
Different lists have been developed to describe genomic, metagenomic, and marker sequence metadata:
- MIGS - Minimum Information About a Genome Sequence
- MIMS - Minimum Information About a Metagenome Sequence
- MIMARKS - Minimum Information About a Marker Sequence
- MIMAG - Minimum Information About a Metagenome-Assembled Genome
- MISAG - Minimum Information About a Single Amplified Genome
- MIUVIG - Minimum Information About an Uncultivated Virus Genome
The tag-value pairs that are included for each submission type can be validated for compliance with the GSC recommended list. The recommended lists of core descriptors that should be included for each of these sequence types can be found here.
Validation tools will report whether structured comments include all of the GSC recommended core descriptors. Submissions that include all of the compliant tags will have a Keyword included within the GenBank flatfile, for example:
KEYWORD GSC:MIMARKS:5.0
Structured comments that are not compliant based on the GSC guidelines can still be included within GenBank submissions - they just will not include the keyword.
In order for this validation to occur, you will need to include within the first column in your table a tag that defines the prefix and suffix for the start and end tags within the structured comment, for example:
StructuredCommentPrefix | [one of the following - MIGS:3.0-Data / MIMS:3.0-Data / MIMARKS:3.0-Data] |
investigation_type | [value determined by organism type as defined within GSC spreadsheet] |
project_name | Analysis of soil bacteria |
collection_date | 2008-08-09 |
lat_lon | 35.64N 56E |
geo_loc_name | France |
biome | grassland |
feature | field |
material | soil |
env_package | [env_package types are listed within the GSC spreadsheet] - can include the term "missing" |
num_replicons | 14 |
ref_biomaterial | PMID |
biotic_relationship | free living |
trophic_level | autotroph |
rel_to_oxygen | aerobe |
isol_growth_condt | PMID |
seq_meth | pyrosequencing |
assembly | Velvet; error rate 1/45 |
finishing_strategy | complete; 4X coverage; 2500 contigs |
An example of a sequence that includes a structured comment that meets GSC compliance is CP051461.
[2] Genome Submissions
Prokaryotic and eukaryotic genome submissions require assembly information in a Genome-Assembly-Data structured comment. This structured comment includes the following required fields:
- Assembly Method (with version or date the program was run): e.g., Newbler v. 2.3 OR Celera Assembly v. May 2010
- Genome Coverage : e.g., 121x
- Sequencing Technology : e.g., ABI 3730; Illumina GAIIx; Nanopore
Assembly Name may be added for eukaryotic assemblies, but is optional.
- Assembly Name : a short name suitable for display e.g., LoxAfr_3.0 for a Loxodonta africana assembly, version 3.0
Note that Assembly Method requires 'v. ' between the algorithm name and its version (or the month and year it was run). If more than one sequencing technology was used, they are separated with a semi-colon, e.g. "PacBio; Illumina GAIIx".
You will be prompted for this information when you submit your prokaryotic or eukaryotic genome via the Genome Submission Portal, which is the easiest way to provide the information.
If you are creating a .sqn file with table2asn, you can create a Genome-Assembly-Data file and include it as described above, if you wish. However, this is not necessary because you will be prompted for the information when you submit the genome in the Submission Portal.
The prefix and suffix for the start and end tags are:
- StructuredCommentPrefix Genome-Assembly-Data
- StructuredCommentSuffix Genome-Assembly-Data
An example of a genome with the required structured comment is AMVS01000000.
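The formatting rules for this comment ('v. ' in Assembly Method, semicolon-separated technologies, matching prefix and suffix) can be checked before the file is written. The sketch below is a hypothetical helper, not part of table2asn. The row order and the use of the bare `Genome-Assembly-Data` value for the prefix and suffix follow the listing above and are assumptions about what the tool accepts.

```python
import re

# 'v. ' must separate the program name from its version or run date.
METHOD_RE = re.compile(r"^\S.* v\. \S.*$")


def genome_assembly_lines(method, coverage, technologies, assembly_name=None):
    """Return tab-delimited lines for a Genome-Assembly-Data comment file."""
    if not METHOD_RE.match(method):
        raise ValueError(
            f"Assembly Method needs 'v. ' before the version or date: {method!r}"
        )
    pairs = [
        ("StructuredCommentPrefix", "Genome-Assembly-Data"),
        ("Assembly Method", method),
    ]
    if assembly_name:  # optional field
        pairs.append(("Assembly Name", assembly_name))
    pairs += [
        ("Genome Coverage", coverage),
        # Multiple technologies are joined with a semicolon and a space.
        ("Sequencing Technology", "; ".join(technologies)),
        ("StructuredCommentSuffix", "Genome-Assembly-Data"),
    ]
    return [f"{tag}\t{value}" for tag, value in pairs]
```

A call such as `genome_assembly_lines("Newbler v. 2.3", "121x", ["PacBio", "Illumina GAIIx"])` returns the lines ready to be joined with newlines and saved as a .cmt file, while a method string like "Newbler 2.3" is rejected.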
[3] Transcriptome Shotgun Assembly Submissions
An Assembly-Data structured comment is required for Transcriptome Shotgun Assembly (TSA) sequences. Users will be prompted for this information when using the TSA Submission Wizard. If submitting using table2asn, this file can be made using the Structured Comment template (non-genomes) page or as described above. However, this is not necessary because you will be prompted for the information when you submit the transcriptome assembly in the Submission Portal.
The TSA structured comment includes the following required values:
- Assembly Method (with version or date the program was run): e.g., Velvet v.1.1.05, Oases v.0.1.22, Trinity r2012-01-25
- Sequencing Technology : e.g., ABI 3730; 454 GS-FLX Titanium; Illumina GAIIx
Coverage and Assembly Name may be added but these are optional.
- Assembly Name : a short name suitable for display e.g., LoxAfr_3.0 for a Loxodonta africana assembly, version 3.0
- Coverage : e.g., 12x
The prefix and suffix for the start and end tags to include within this structured comment are:
- StructuredCommentPrefix Assembly-Data
- StructuredCommentSuffix Assembly-Data
An example of a TSA submission with the required structured comment is JU497302.
[4] GenBank Assembly-Data
Submissions to GenBank can include an Assembly-Data structured comment that is displayed within the GenBank flatfile and provides users with information regarding the sequencing and assembly details.
This structured comment includes the following values:
- Assembly Method (with version or date the program was run): e.g., Newbler v. 2.3 OR Celera Assembly v. May 2010
- Coverage : e.g., 12x
- Sequencing Technology : e.g., ABI 3730; 454 GS-FLX Titanium; Illumina GAIIx (required)
The prefix and suffix for the start and end tags to include within this structured comment are:
- StructuredCommentPrefix Assembly-Data
- StructuredCommentSuffix Assembly-Data
An example of a GenBank record with an Assembly-Data structured comment is JQ307843.
[5] HIV
A specialized structured comment can be included with HIV sequence submissions to describe additional metadata that cannot be easily included within the source descriptor. This includes specific tags that provide more information regarding the source of the virus.
For HIV-specific structured comments, you need to include two additional columns in your table that define the prefix and suffix for the start and end tags on either side of the structured comment:
- StructuredCommentPrefix HIVDataBaseData
- StructuredCommentSuffix HIVDataBaseData
Example Table
SeqID | sequence name | Patient cohort | Sample tissue | viral load | StructuredCommentPrefix | StructuredCommentSuffix |
---|---|---|---|---|---|---|
SeqA | mysample_1 | CHAVI001 | plasma | 3565728 | HIV-DataBaseData | HIV-DataBaseData |
SeqB | mysample_2 | CHAVI002 | plasma | 3565730 | HIV-DataBaseData | HIV-DataBaseData |
SeqC | mysample_3 | CHAVI003 | plasma | 3565755 | HIV-DataBaseData | HIV-DataBaseData |
An example record that includes a properly formatted HIV structured comment is EU579019.
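Adding the two delimiter columns to an existing per-sequence table can be sketched as follows. Both function names are hypothetical, and the delimiter value is passed in by the caller rather than hard-coded. The usage example after the block copies the value shown in the example table.

```python
def add_comment_delimiters(header, rows, name):
    """Append StructuredCommentPrefix/Suffix columns holding `name` to each row."""
    new_header = [*header, "StructuredCommentPrefix", "StructuredCommentSuffix"]
    new_rows = [[*row, name, name] for row in rows]
    return new_header, new_rows


def to_tab_delimited(header, rows):
    """Render the header and rows as tab-delimited text, one row per line."""
    table = [header, *rows]
    return "\n".join("\t".join(str(cell) for cell in row) for row in table) + "\n"
```

For instance, `add_comment_delimiters(["SeqID", "Sample tissue"], [["SeqA", "plasma"]], "HIV-DataBaseData")` returns a header with two extra columns and a row ending in the delimiter value twice. The result of `to_tab_delimited` can then be saved as the .cmt file.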
Retrieval in Entrez
Sequences with structured comments can be retrieved in Entrez by specifying the tag-value pair in double quotes, e.g. "investigation_type bacteria_archaea". This search in Entrez retrieves GenBank records with this tag-value pair in the structured comment. You can also search for each tag as a property in Entrez (e.g., depth[prop]) in order to retrieve all records that have this indexed within the structured comment.
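The two search forms can be assembled programmatically, as in the sketch below. The helper names are hypothetical. Sending the term to the E-utilities ESearch endpoint with the `nuccore` database is an assumption about how one might run the search outside the Entrez web interface, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint for running the search through E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tag_value_term(tag, value):
    """Quote a tag-value pair so that it is searched as a phrase."""
    return f'"{tag} {value}"'


def tag_property_term(tag):
    """Search for any record that has this tag indexed as a property."""
    return f"{tag}[prop]"


def esearch_url(term, db="nuccore"):
    """Build an ESearch URL for the term (no request is made here)."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term})}"
```

The term strings produced by `tag_value_term` and `tag_property_term` can also be pasted directly into the Entrez search box.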