Submitting Multiple Haplotype Assemblies
Background
Advanced sequencing and assembly technologies now allow the sequencing of a diploid individual and separation of its genome into two assemblies, often called haplotypes. In some cases there is enough information to assemble the maternal and paternal haplotypes from the sequenced genome of the child (i.e., the F1 individual). GenBank previously used the term 'pseudohaplotype' for all of these but now uses 'haplotype'. The NCBI assembly types are defined in the Assembly Model.
GenBank accepts these haplotypes as separate assemblies that are linked to each other in the Assembly resource. They can be retrieved by searching, e.g., with "alternate pseudohaplotype"[Filter], or by using the facets to filter search results.
Special details for haplotype assemblies
Haplotype assembly submissions are created like regular genome assemblies, with a few exceptions. The special characteristics of the haplotype assemblies of a sequenced diploid (or polyploid) genome are:
- They share the same BioSample, which corresponds to the individual that was sequenced, not to a parental taxid
- They have separate BioProjects
- They have an umbrella BioProject that links those two data-level BioProjects. This will be created by GenBank staff if one does not already exist.
- The relationship of the haplotypes must be asserted by the submitter. The options are:
  - Principal haplotype / Alternate haplotype, if one is much better than the other
  - Haplotype 1 / Haplotype 2, if they are of similar quality
    - When more than 2 haplotypes are present, use Haplotype 3 / Haplotype 4 for the additional assemblies
  - Maternal haplotype / Paternal haplotype, when that information is known
- Can be submitted via the "Pseudohaplotypes of a diploid/polyploid assembly" option in the genome submission portal, where the submitter is prompted to provide the required information for the two assemblies AND to create the BioProjects and the BioSample if they do not already exist.
  - NOTE: The submission wizard was updated in April so all the options are now available.
Recommendations for these assemblies:
- The BioSample should include an isolate name or number to distinguish the sequenced individual from others of that species. (This is a general recommendation for all eukaryotic genome assemblies.)
- The Assembly Name should include information about Principal/Alternate or Maternal/Paternal or other identifiers to distinguish these two assemblies of an individual from each other. For example, bSteHir1.pri & bSteHir1.alt or mCalJac1.pat & mCalJac1.mat or rPleGil1.0.hap1 & rPleGil1.0.hap2
- If the mitochondrial genome was assembled and is present, it should be included in the Principal or Maternal assembly.
An example of the two pseudohaplotypes of a diploid is the Sterna hirundo genome in Umbrella BioProject PRJNA560234 with BioSample SAMN12369541, Principal Assembly (PRJNA558062; WNMW00000000; GCA_009819605.1) and Alternate Assembly (PRJNA558063; WNMX00000000; GCA_009819645.1)
How to submit
[1] In the simple case, where there is no annotation and the same linkage_evidence is acceptable for all the gaps, fasta files can be uploaded. However, if annotation or different kinds of gaps are included, then you will need to use the command line program table2asn to create an ASN (.sqn) file, as explained in the Genome Submission Guide.
[2] If any of the sequences belong to chromosomes or organelles and the "Batch" or "Pseudohaplotypes of a diploid/polyploid assembly" submission option is used, then that assignment information must be included in the definition lines of the fasta sequences, as described at 'IMPORTANT: Additional requirements for batch submissions'. Specifically:
- Unlocalized organelle sequence: use [location=xxx], eg:
  - [location=mitochondrion]
  - [location=chloroplast]
- Complete circular organelle sequence: also add the completeness and topology, eg:
  - [location=mitochondrion] [completeness=complete] [topology=circular]
- Unlocalized sequence that belongs to a chromosome, eg chromosome 2:
  - [chromosome=2]
- Sequence that represents the chromosome itself, even if gaps are present: also add the location, eg:
  - [location=chromosome] [chromosome=2]
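As an illustration of the definition-line modifiers listed above, here is a minimal Python sketch that assembles a fasta definition line from the assignment information. The function name `fasta_defline` and the sequence IDs (`scaffold_12`, `chr2`, `mito`) are hypothetical and only for illustration; the bracketed modifiers themselves are the ones shown in the list. It is a sketch of one way to generate these lines, not a required tool.

```python
def fasta_defline(seq_id, *, location=None, chromosome=None, complete_circular=False):
    """Build a fasta definition line with bracketed assignment modifiers.

    seq_id is a hypothetical sequence identifier chosen by the submitter.
    """
    modifiers = []
    if location is not None:
        modifiers.append(f"[location={location}]")
    if complete_circular:
        # For a complete circular organelle, add completeness and topology.
        modifiers.append("[completeness=complete]")
        modifiers.append("[topology=circular]")
    if chromosome is not None:
        modifiers.append(f"[chromosome={chromosome}]")
    return " ".join([f">{seq_id}", *modifiers])


# Unlocalized sequence that belongs to chromosome 2
print(fasta_defline("scaffold_12", chromosome=2))
# Sequence that represents chromosome 2 itself
print(fasta_defline("chr2", location="chromosome", chromosome=2))
# Complete circular mitochondrial genome
print(fasta_defline("mito", location="mitochondrion", complete_circular=True))
```

Keyword-only arguments are used so that a call reads like the modifiers it produces, which makes it harder to swap a location for a chromosome by accident.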
[3] If the files are very big, you may want to upload them before you begin your genome submission, as described at https://www.ncbi.nlm.nih.gov/genbank/preloadfiles/. FYI, this is the same process that exists for preloading files for SRA submissions or any genome submission.
[4] To submit the haplotypes of an individual, start a new submission in the Genomes Submission Portal.
A. If no AGP file is included, then choose the "Pseudohaplotypes of a diploid/polyploid assembly" option. During the submission process you will be prompted to provide the required assembly information.
If there are multiple assembly methods, then you must use the embedded table option on the GENOME INFO tab to provide this information for each haplotype in the submission.
Here is the information that is collected:
- BioSample accession, or sample_name if you create the BioSample during this submission
- Type of haplotype
- BioProject accessions, or BioProject descriptions if you create them during this submission
- Umbrella BioProject accession, if it has already been created
- Filename: exact name of the file that will be uploaded (all files in this submission must be the same format, either fasta or ASN)
- Assembly date (Optional): approximate date the assembly was created, format is YYYY-MM-DD; YYYY-MM; or YYYY
- Assembly name (Optional but strongly recommended for these)
  - Include information about Principal/Alternate or Maternal/Paternal or other identifiers to distinguish these two assemblies of an individual from each other.
  - For example,
    - bSteHir1.pri & bSteHir1.alt
    - mCalJac1.pat & mCalJac1.mat
    - rPleGil1.0.hap1 & rPleGil1.0.hap2
- Assembly method and version: name(s) and version(s) of the assembly algorithm(s)
- Genome coverage
- Sequencing technology or technologies
- Reference genome if it is not a de novo assembly
- Update: accession of the genome being updated, when appropriate
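Before filling in the form, it can be handy to check a couple of these fields locally. The following Python sketch checks that an assembly date uses one of the three accepted formats (YYYY-MM-DD, YYYY-MM, or YYYY) and that the assembly names given for the haplotypes are distinct from each other. The helper names `is_valid_assembly_date` and `names_are_distinct` are hypothetical, and these checks are an assumption about what is useful to verify beforehand, not a description of the portal's own validation.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD (the formats listed above).
_DATE_RE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_valid_assembly_date(text):
    """Return True if text is YYYY, YYYY-MM, or YYYY-MM-DD and a real date."""
    match = _DATE_RE.fullmatch(text)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month/day default to 1 so partial dates can be checked too.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def names_are_distinct(assembly_names):
    """Return True if no two haplotype assemblies share an assembly name."""
    return len(set(assembly_names)) == len(assembly_names)


for candidate in ["2019-08-15", "2019-08", "2019", "2019-13", "15/08/2019"]:
    print(candidate, is_valid_assembly_date(candidate))

print(names_are_distinct(["bSteHir1.pri", "bSteHir1.alt"]))
```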
B. If an AGP file is included, then you will need to submit each assembly individually with the "Single" option. In this case, include a statement in the comment box to tell us:
- that this is one haplotype of a diploid genome
- whether this is the Principal/Alternate or Haplotype 1/Haplotype 2 or Maternal/Paternal haplotype
- what the umbrella BioProject is, if one has already been created
NOTE: It is frequently preferable to submit the chromosomes and the unplaced and unlocalized scaffolds as gapped sequences, rather than submitting contigs plus an AGP file that builds scaffolds and chromosomes from those contigs, so keep that option in mind.
Other situations
Artificially adding a sex chromosome to the other haplotype assembly
Sometimes the assembly methodology creates separate sequences for the two haplotypes of a genome but the submitter wants both sex chromosomes in a single assembly. If the assembly includes sequences from multiple haplotypes, it should not be submitted using the diploid option in the submission portal. Instead, create a regular genome submission using either the Single or Batch option. If the two separate haplotype assemblies will also be submitted, use a single BioSample for all of the assemblies but a different BioProject for each assembly (one for the combined assembly, another for the first haplotype and a third for the other haplotype). When submitting such an assembly, include a note in the comment box to inform the GenBank staff which assemblies are being submitted. Note that the SRA reads should use the same BioSample and could use the same BioProject as one of those assemblies, if desired.
Unresolved diploid
Sometimes the assembly methodology creates separate sequences for the two haplotypes of a genome but the submitter is not able to separate them into two haplotypes. This type of genome assembly is an Unresolved diploid assembly, and is submitted with the Single or Batch submission option, whichever is more appropriate. When submitting such an assembly, include a note in the comment box to inform the GenBank staff that it is an Unresolved diploid.
Ancestral genome duplication
When an ancestral merge or duplication event has caused a species to have multiple copies of its chromosomes, eg Triticum aestivum, which has A, B and D versions of 7 chromosomes, then the two haplotypes of that genome would each have the full complement of chromosomes, eg 21 in the case of T. aestivum (plus any unplaced and unlocalized sequences).
If the sequences of the ancestral genomes were resolved and submitted as separate assemblies (eg separate assemblies for the A, B, and D chromosomes of T. aestivum, plus unplaced/unlocalized sequences), then those assemblies would be Partial genome representations rather than haplotypes of the genome because each includes only a subset of the organism's chromosomes. Therefore, those assemblies should be submitted in the normal Single or Batch genome submission and with NO as the answer to the "Full Representation" question in the submission form.