How to Submit to dbSNP
II. Introduction and Submission Overview
Version 4.2; May 1, 2017
An Introduction to dbSNP Submissions
Although SNP is an abbreviation for “single nucleotide polymorphism,” dbSNP is a general public archive of all short sequence variation, not just single nucleotide substitutions that occur frequently enough in a population to be termed polymorphic, but also rare variants, including those with clinical assertions in ClinVar. dbSNP includes a broad collection of simple genetic variations such as single-base nucleotide substitutions, small-scale multi-base deletions or insertions, and microsatellite repeats. Data submitted to dbSNP can be from any organism (exceptions noted in the limitations section), from any part of a genome, and can include genotype and allele frequency data if those data are available. dbSNP accepts submissions for all classes of simple sequence variation, and provides access to variations of germline or somatic origin that are clinically significant through our integrated relationship with ClinVar. Since dbSNP and ClinVar work closely together, all variants with clinical assertions should be submitted to ClinVar, which will process and accession the variants and then forward the submission to dbSNP for mapping and accessioning.
dbSNP maps variations to their corresponding reference genome assemblies whenever available and provides links to details of the medical impact of clinically asserted alleles. Large-scale insertion, deletion, inversion, and translocation data longer than 50 bp should be submitted to dbVar, the NCBI database of genomic structural variation. dbSNP accepts data submissions from individual researchers as well as from large studies (e.g., 1000 Genomes). Because dbSNP submissions are not limited by allele frequency, the database includes polymorphisms as well as rare, medically important alleles.
Variation Types and Sizes Accepted by dbSNP
- dbSNP accepts simple variations <= 50 bp; structural variations > 50 bp should be submitted to dbVar
- dbSNP accepts variations from any part of the genome
- dbSNP accepts single-base nucleotide substitutions
- dbSNP accepts small-scale multi-base deletions or insertions
- dbSNP accepts microsatellite repeats
- dbSNP accepts variation genotype data
- dbSNP accepts variation allele frequency data
Variation Types and Sizes dbSNP does NOT Accept
- dbSNP does not accept data from non-human organisms. You can submit such data to the European Variation Archive (http://www.ebi.ac.uk/eva/).
- dbSNP does not accept structural variations >50bp; they should be submitted to dbVar
- dbSNP does not accept synthetic mutations
- dbSNP does not accept variations ascertained from cross-species alignments and analysis
- dbSNP does not accept personal human data for use in research due to current NIH policy unless the participant is enrolled in a study with institutional oversight. However, we are aware of research organizations that do accept such data. For more information, please see, for example, the Genomes Unzipped project (http://www.genomesunzipped.org/project) and Personal Genome Project (https://pgp.med.harvard.edu/participate/).
- dbSNP does not accept human variations with an asserted relationship to disease or other phenotypes. Submit these data to ClinVar.
- dbSNP does not accept bacterial variations. Bacterial variant sequences can be submitted to SRA or as alignments to GenBank PopSet. Please contact the Pathogen Group if you have pathogen data (Campylobacter, Escherichia coli and Shigella, Listeria, and Salmonella).
Submitting Variations Related to a Disease or other Phenotype
- Submit Clinically Related Data to ClinVar. Your submissions to ClinVar will be processed, assigned ClinVar accessions (SCV), and accessioned with novel variant locations in dbSNP or dbVar as appropriate.
- Submit Sensitive Clinical Data to dbGaP. If your submission contains identifying information or other clinically sensitive material, or if the individuals in whom you observed the variants did not sign a consent form allowing the display of their genetic information on a free public website, submit your study to NCBI's Database of Genotypes and Phenotypes (dbGaP). Sensitive information will be stored behind controlled access at dbGaP, while aggregate data, stripped of personally identifying information, will be forwarded to dbSNP.
- Variants identified in individuals with a phenotype that has not yet been interpreted for functional or clinical significance should be submitted to dbSNP or dbVar as appropriate. The dbSNP or dbVar staff can broker submissions of phenotype information about the sample to the BioSample database.
- Submit clinical assertion updates for existing variations to ClinVar.
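The routing rules above can be summarized as a small decision function. This is only an illustrative sketch: the function name, the parameter names, and the order in which the checks are applied are assumptions made for the example, not part of any NCBI tool. When in doubt, contact the relevant database staff.

```python
def choose_archive(has_clinical_assertion: bool,
                   is_sensitive_or_unconsented: bool,
                   max_allele_length_bp: int) -> str:
    """Suggest a destination archive from the rules described above (hypothetical helper)."""
    # Identifying or unconsented data belong behind controlled access.
    if is_sensitive_or_unconsented:
        return "dbGaP"
    # Variants with an asserted relationship to disease or phenotype go to ClinVar.
    if has_clinical_assertion:
        return "ClinVar"
    # Remaining variants are routed by size: over 50 bp is structural variation.
    if max_allele_length_bp > 50:
        return "dbVar"
    return "dbSNP"
```

For example, a 3 bp deletion with no clinical assertion and full public consent would be routed to dbSNP by this sketch.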
dbSNP Submission Overview
- Familiarize yourself with the dbSNP submission introductory material and the 10 Major Data Elements of a dbSNP Submission in this document.
- Complete the dbSNP Pre-Submission Process.
- Format the dbSNP Metadata file that will accompany your data submission.
- Format your submission in VCF or a tabular format, and send it to dbSNP.
- If you need help with your submission, contact dbSNP at [email protected]
- If you are not ready for your data to be publicly available, see dbSNP's HUP (Hold Until Published) policies in the previous section.
- See our documentation for updating or withdrawing your submitted data at a later point in time.
- Learn about dbSNP policies regarding accessioning, turn-around time, and how dbSNP will communicate your Processing Status to you.
The 10 Major Data Elements of a dbSNP Submission
1. Sequence Context (Required)
An essential component of a submission to dbSNP is an unambiguous location for the variation being submitted. dbSNP now minimally requires that you submit variant location as an asserted position on RefSeq or INSDC sequences.
a. Asserted Position
Asserted positions are statements based on experimental evidence that a variant is located at a particular position on a sequence that has been accessioned in an INSDC database or RefSeq, and on a sequence that is part of an assembly housed in the NCBI Assembly Resource. If you have a de novo assembly, the sequences can be submitted to GenBank and the assigned assembly accession can be used for dbSNP submission.
IMPORTANT: Variant positions reported on a sequence that is part of an assembly housed in the NCBI Assembly Resource will receive both a submitted SNP (ss) number and a Reference SNP (rs or RefSNP) number. Variations that are assigned a refSNP number are distributed as part of dbSNP, which allows the reported variation to appear on maps or graphic representations of the assembly and to be integrated with other NCBI resources such as Gene, ClinVar, dbGaP, or PubMed.
dbSNP will accept data on a RefSeq or INSDC sequence for an asserted position that is not associated with an assembly housed in the NCBI Assembly Resource when:
* there is not yet an assembly to which the sequence aligns, or
* the submitted sequence aligns to a gap in an existing assembly
In such cases, your submitted variant will be assigned only an ss number that you can access by using the dbSNP “ID search” tool or through an FTP download. Because the submitted variant in these cases only has an ss number, it will NOT appear on maps or graphic representations of the assembly, and will NOT be integrated with NCBI's other resources. The ss will, however, be reported on the 'Submitted SNP' web report. If, however, at some future date, a new assembly is created or an old assembly is updated such that the reported variant sequence aligns to an assembly in the NCBI Assembly Resource, the reported variant will be assigned an rs number at that time, which will allow it to be distributed as part of dbSNP, appear on maps or graphic representations of the assembly, and be integrated with other NCBI resources.
b. Flanking Sequence
Please note that dbSNP no longer accepts variants submitted with flanking sequence. dbSNP requires that submitters report variant positions as asserted positions on a sequence that is part of an assembly housed in the NCBI Assembly Resource using VCF format.
*If you do not know whether your sequence is part of an assembly housed in the NCBI Assembly Resource, contact dbSNP at [email protected]*
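As a rough illustration of what an asserted position looks like in VCF, the sketch below formats one tab-delimited data line using the eight fixed VCF columns. The helper name, the position, and the local identifier are hypothetical, and the sketch does not show the header or metadata lines that a dbSNP submission also needs; consult the dbSNP VCF submission guidelines for those.

```python
# The eight fixed columns defined by the VCF format.
VCF_COLUMNS = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"

def vcf_record(chrom: str, pos: int, ref: str, alts: list, variant_id: str = ".") -> str:
    """Format one VCF data line for an asserted position (hypothetical helper)."""
    if pos < 1:
        raise ValueError("VCF positions are 1-based")
    # QUAL, FILTER, and INFO are left as '.' (missing) in this sketch.
    fields = [chrom, str(pos), variant_id, ref, ",".join(alts), ".", ".", "."]
    return "\t".join(fields)

# Hypothetical A>G substitution asserted on a RefSeq chromosome accession.
line = vcf_record("NC_000001.11", 12345, "A", ["G"], "local_snp_1")
print(VCF_COLUMNS)
print(line)
```

Using a sequence accession in the CHROM column, rather than a bare chromosome name, keeps the position unambiguous across assembly versions.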
2. Alleles (Required)
Alleles define each variation class. dbSNP defines single nucleotide variants in its submission scheme as G, A, T, or C, and does not permit ambiguous IUPAC codes, such as N, in the allele definition of a variation. Note: dbSNP has an allele length limitation of <=50bp. Submit alleles >50 nucleotides in length to the Database of Genomic Structural Variation (dbVar).
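A pre-submission check of the allele rules above might look like the following sketch. The function name and return labels are hypothetical, and accepting lowercase input and rejecting empty strings are assumptions of this example rather than stated dbSNP behavior.

```python
ALLOWED_BASES = set("GATC")   # no ambiguous IUPAC codes such as N
MAX_ALLELE_LENGTH = 50        # limit stated above; longer alleles go to dbVar

def check_allele(allele: str) -> str:
    """Classify an allele string as 'ok', 'too_long', or 'invalid' (hypothetical helper)."""
    seq = allele.upper()
    # Reject empty strings and any character outside G, A, T, C.
    if not seq or any(base not in ALLOWED_BASES for base in seq):
        return "invalid"
    if len(seq) > MAX_ALLELE_LENGTH:
        return "too_long"
    return "ok"
```

Running such a check on every REF and ALT allele before formatting the submission catches the two most mechanical rejection causes early.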
3. Method (Required)
Each submitter defines the methods in their submission as either the techniques used to assay variation or the techniques used to estimate allele frequencies. dbSNP groups methods by method class to facilitate queries using general experimental technique as a query field. The submitter provides all other details of the techniques in a free-text description of the method. Submitters can also use the METHOD_EXCEPTION field to describe changes to a general protocol for particular sets of data (batch-specific details). Submitters generally define methods only once in a submission.
4. Asserted Allele Origin (Required)
A submitter can provide a statement (assertion) with supporting experimental evidence that a variant has a particular allelic origin. Assertions for a single refSNP are summarized and given an attribute value of germline or unknown. Variants of somatic origin should be submitted to ClinVar. Additional attributes (e.g., paternal) will be added in the future.
5. Population (Required)
Each submitter defines population samples either as the group used to initially identify variations or as the group used to identify population-specific measures of allele frequencies. These populations may be one and the same in some experimental designs. Although dbSNP has assigned populations to a population class based on the geographic origin of the sample, we will phase out this practice in the near future since most population descriptions are now submitted to BioSample. We encourage you to register your samples with BioSample to obtain an assigned accession that you can use in your dbSNP submission.
6. Sample Size (Optional)
There are two sample-size fields in dbSNP. One field, SNPASSAY SAMPLE SIZE, reports the number of chromosomes in the sample used to initially ascertain or discover the variation. The other sample size field, SNPPOPUSE SAMPLE SIZE, reports the number of chromosomes used as the denominator in computing estimates of allele frequencies. These two measures need not be the same.
7. Population-specific Allele Frequencies (Optional)
Alleles typically exist at different frequencies in different populations; a very common allele in one population may be quite rare in another population. Also, allelic variants can emerge as private polymorphisms when particular populations have been reproductively isolated from neighboring groups, as is the case with isolated or remote populations. Frequency data are submitted to dbSNP as allele counts or binned frequency intervals, depending on the precision of the experimental method used to make the measurement. dbSNP contains records of allele frequencies for specific population samples that are defined by each submitter and used in validating submitted variations.
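To make the relationship between allele counts and frequencies concrete, here is a minimal sketch that converts per-allele chromosome counts into frequencies. The function name and the example counts are hypothetical. The denominator is the total number of chromosomes with a called allele, which corresponds to the SNPPOPUSE SAMPLE SIZE only if every sampled chromosome was successfully typed.

```python
def allele_frequencies(allele_counts: dict) -> dict:
    """Convert allele counts (in chromosomes) to frequencies (hypothetical helper)."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no chromosomes were counted")
    return {allele: count / total for allele, count in allele_counts.items()}

# Hypothetical population sample: 200 chromosomes typed at one site.
freqs = allele_frequencies({"A": 150, "G": 50})
print(freqs)
```

Computing the same site separately for each population sample shows how an allele that is common in one sample can be rare in another.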
8. Population-specific Genotype Frequencies (Optional)
Similar to alleles, genotypes have frequencies in populations that can be submitted to dbSNP, and are used in validating submitted variations.
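Genotype frequencies can be tallied in the same way from individual genotype calls. The sketch below assumes unphased diploid genotypes written as strings such as 'A/G'; that string format and the function name are assumptions of this example, not a dbSNP submission format.

```python
from collections import Counter

def genotype_frequencies(genotypes: list) -> dict:
    """Compute genotype frequencies from unphased genotype strings (hypothetical helper)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    # Sort alleles so that 'G/A' and 'A/G' count as the same unphased genotype.
    normalized = ["/".join(sorted(g.split("/"))) for g in genotypes]
    counts = Counter(normalized)
    n = len(normalized)
    return {genotype: count / n for genotype, count in counts.items()}

# Hypothetical sample of four individuals.
gfreqs = genotype_frequencies(["A/A", "A/G", "G/A", "G/G"])
print(gfreqs)
```

Note that the denominator here is the number of individuals, whereas allele frequencies use the number of chromosomes.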
9. Individual Genotypes (Optional)
dbSNP accepts individual genotypes from samples provided by donors who have consented to having their DNA sequence housed in a public database (e.g., HapMap or the 1000 Genomes project). Genotypes reported in dbSNP contain links to population and method descriptions. General genotype data provide the foundation for individual haplotype definitions and are useful for selecting positive and negative control reagents in new experiments.
10. Validation Information (Optional)
dbSNP accepts individual assay records (ss numbers) without validation evidence. When possible, however, dbSNP tries to distinguish high-quality validated data from unconfirmed (usually computational) variation reports. Assays validated directly by the submitter through the VALIDATION section show the type of evidence used to confirm the variation. Additionally, dbSNP will flag an assay variation as validated if:
- There are multiple independent submissions to the refSNP cluster with at least one non-computational method,
OR
- The variation was genotyped by the HapMap project, sequenced by the 1000 Genomes project, or other large sequencing projects.
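The two automatic flagging conditions above can be expressed as a simple predicate. This is a sketch of the stated rule only: the function and parameter names are hypothetical, it does not model validation evidence supplied by the submitter in the VALIDATION section, and it is not dbSNP's actual implementation.

```python
def is_flagged_validated(submission_is_computational: list,
                         genotyped_by_large_project: bool) -> bool:
    """Apply the two automatic validation conditions described above (hypothetical helper).

    submission_is_computational holds one boolean per independent submission
    in the refSNP cluster: True if that submission used a computational method.
    """
    multiple_submissions = len(submission_is_computational) >= 2
    has_non_computational = any(not flag for flag in submission_is_computational)
    return (multiple_submissions and has_non_computational) or genotyped_by_large_project
```

Under this reading, two purely computational submissions are not enough on their own, but a single submission is sufficient if the variation was also typed by a large project such as HapMap or 1000 Genomes.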
For general information regarding dbSNP, or detailed information about refSNP clustering, annotation and mapping, see the dbSNP Handbook.
Contact dbSNP
If you do not find the answer to your submission questions in the How to Submit to dbSNP document series, contact dbSNP submissions at [email protected], and we will do our best to answer your submission question or help you solve a difficult submission problem.
- Send submissions and submission questions to: [email protected]
- Send submission updates to: [email protected]
- Send general inquiries, etc. to: [email protected]
Titles in the How to Submit to dbSNP Series: