Glossary
Glossary of terms used in Datasets
Alternate locus
A sequence that represents a different version of a particular locus present in the haploid assembly (also known as the primary assembly). It provides an alternative representation of the genetic information at that location in the genome.
Alternate locus group
A set of alternate loci grouped together for annotation purposes. The loci may be grouped because they come from the same haplotype or strain, or simply for annotation convenience.
Assembly
An assembly, or assembled genome, is the set of chromosomes, unlocalized and unplaced (sometimes called “random”) sequences, and alternate sequences used to represent an organism’s genome. The NCBI Assembly Data Model defines an assembly as comprising one or more assembly units.
Assembly anomaly
See atypical genomes (below).
Assembly level
The highest assembly level for any object in the assembly. The values are as follows:
Complete genome
All chromosomes are gapless and contain runs of nine or fewer ambiguous bases (Ns), there are no unplaced or unlocalized scaffolds, and all the expected chromosomes are present (i.e., the assembly is not noted as having partial genome representation). Plasmids and organelles may or may not be included in the assembly, but if they are present, their sequences are gapless.
Chromosome
There is sequence for one or more chromosomes. This may be a completely sequenced chromosome without gaps or a chromosome containing spans with gaps between them. There may also be unplaced or unlocalized sequences.
Scaffold
One or more contigs are connected across gaps of 10 or more bases to create scaffolds. All sequences are unplaced or unlocalized.
Contig
No sequence contains gaps. All sequences are unplaced or unlocalized. Includes ultra-contig assemblies based on long-read sequencing with no gaps.
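As a rough illustration of how the four levels relate, the sketch below assigns a level to a list of sequences. It is not NCBI's implementation: the `Seq` record, the role names, and the `full_representation` flag are hypothetical, and a gap is assumed to be a run of 10 or more Ns, which is one reading of the definitions above. Plasmids and organelles are ignored.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: a "gap" is a run of 10 or more Ns; runs of 9 or fewer are tolerated.
GAP = re.compile(r"N{10,}", re.IGNORECASE)


@dataclass
class Seq:
    role: str   # hypothetical labels: "chromosome", "unlocalized", "unplaced"
    bases: str


def has_gap(bases: str) -> bool:
    """Return True if the sequence contains a run of 10 or more Ns."""
    return GAP.search(bases) is not None


def assembly_level(seqs: List[Seq], full_representation: bool = True) -> str:
    """Return the highest assembly level reached by any sequence."""
    if not seqs:
        raise ValueError("an assembly needs at least one sequence")
    chromosomes = [s for s in seqs if s.role == "chromosome"]
    others = [s for s in seqs if s.role != "chromosome"]
    if chromosomes:
        gapless = not any(has_gap(s.bases) for s in chromosomes)
        if gapless and not others and full_representation:
            return "Complete genome"
        return "Chromosome"
    # Everything is unplaced or unlocalized from here on.
    if any(has_gap(s.bases) for s in seqs):
        return "Scaffold"
    return "Contig"
```

For example, a single chromosome containing a run of nine Ns would still count as "Complete genome" under these assumptions, while a run of ten Ns would lower it to "Chromosome".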
Assembly name
The submitter’s name for the assembly when one is provided; otherwise, a default name provided by NCBI.
Assembly type
The possible assembly types are as follows:
Diploid assembly
A genome assembly for which a chromosome assembly is available for both sets of an individual’s chromosomes. A diploid genome assembly is expected to represent the genome of an individual. Therefore, alternate loci are not expected to be defined for this assembly, though it is possible that unlocalized or unplaced sequences may be part of the assembly.
Haploid assembly (default assembly type)
The collection of chromosome assemblies, unlocalized and unplaced sequences representing an organism’s genome. Any locus may be represented zero or one time, and entire chromosomes are only represented zero or one time.
Haploid-with-alt-loci
The collection of chromosome assemblies, unlocalized and unplaced sequences, and alternate loci representing an organism’s genome. Any locus may be represented zero, one, or greater than one time, but entire chromosomes are only represented zero or one time.
Linked pseudohaplotype assemblies
A genome assembly from a diploid in which many of the haplotypic sequences have been resolved and phased, and the two haplotypes have been separated. A pair of pseudohaplotype assemblies derived from the same diploid individual can be linked with a cross-reference.
Unresolved diploid
A genome assembly from a diploid in which many haplotypic sequences have been resolved, but the two haplotypes have not been separated. Consequently, the assembly will be much larger than the expected haploid genome size, and two copies of many genes will be present.
Atypical genomes
Atypical genomes are genomes with one or more problems identified by NCBI relating to quality, unusual size, or other flaws in the genome assembly. For a full list of problems and definitions that result in this designation, see Genome Notes.
Average Nucleotide Identity (ANI)
Average Nucleotide Identity (ANI) is a measure of genomic similarity at the nucleotide level between two genomes. NCBI uses ANI to evaluate the taxonomic identity of genome assemblies submitted to GenBank, primarily for prokaryotes and some eukaryotes such as fungi (see NCBI Datasets Average Nucleotide Identity documentation).
Chromosome context
A term used to describe alternate loci and patches that have been aligned to the chromosome sequences defined in the Primary Assembly. While these sequences cannot strictly be expressed as chromosome coordinates, they can be related to the chromosome sequence via their alignment to the chromosome.
Full Genome
The data used to generate the assembly was obtained from the whole genome, e.g., Whole Genome Shotgun (WGS) assemblies. The assembly may still contain gaps.
GenBank assembly accession
The accession and version for the GenBank assembly (“accession.version”).
Genome patches
Sequence updates released outside the major assembly cycle. These are instantiated as independent scaffolds aligned to the primary assembly to provide chromosome context. There are two types of patches:
Fix patches
These patches correct assembly errors, and the scaffolds are withdrawn in the next major assembly update. Their accessions will be made secondary to the chromosome and their sequences incorporated into the primary assembly.
Novel patches
These represent new alternate loci. These sequences will be moved to the appropriate assembly unit at the next major assembly update, and the accession will remain stable.
Linked assembly
The “accession.version” and designation (principal or alternate pseudohaplotype) of a paired genome assembly derived from the same diploid individual (see “Assembly type” definitions above).
Modifier
Infraspecific or subspecies name or description.
PAR
Pseudo-autosomal region. A region found on the X and Y chromosomes of mammals that allows recombination between the sex chromosomes.
Partial Genome
The data used to generate the assembly came from only part of the genome. The reasons genome representation is set to partial include:
The assembly description indicates that the assembly was targeted to a single chromosome or a subset of the genome.
The chromosome set in the assembly is less than the expected chromosome complement for the organism, ignoring any plasmids, organelle chromosomes, and the small sex chromosome (Y for mammals, W for birds).
The genome coverage in a WGS assembly is less than one.
The ungapped sequence length of the assembly is less than half the average for other assemblies from the same species.
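The four reasons listed above can be read as independent checks. The following is a minimal sketch under that reading; the function and parameter names are hypothetical, the chromosome counts are assumed to already exclude plasmids, organelle chromosomes, and the small sex chromosome, and NCBI may apply further criteria not listed here.

```python
from typing import List, Optional


def partial_genome_reasons(
    targeted_subset: bool,
    chromosomes_present: int,
    chromosomes_expected: int,
    wgs_coverage: Optional[float],
    ungapped_length: int,
    species_mean_ungapped_length: float,
) -> List[str]:
    """Collect every listed reason that applies; an empty list means none do."""
    reasons = []
    if targeted_subset:
        reasons.append("targeted to a single chromosome or a subset of the genome")
    if chromosomes_present < chromosomes_expected:
        reasons.append("fewer chromosomes than the expected complement")
    # Coverage only applies to WGS assemblies; pass None otherwise.
    if wgs_coverage is not None and wgs_coverage < 1:
        reasons.append("WGS genome coverage below one")
    if ungapped_length < 0.5 * species_mean_ungapped_length:
        reasons.append("ungapped length below half the species average")
    return reasons
```

An assembly for which the list is empty would not be flagged as partial by any of these four checks.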
Placed sequence
Sequence that has been ordered and oriented on the chromosome. The locations of these sequences can usually be expressed in chromosome coordinates.
RefSeq assembly accession
The accession and version for the RefSeq version of the assembly (“accession.version”).
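The “accession.version” form used for both GenBank and RefSeq assembly accessions can be split into its parts as sketched below. The glossary does not spell out the format; the sketch assumes the usual pattern of a `GCA_` (GenBank) or `GCF_` (RefSeq) prefix, nine digits, a dot, and an integer version. The function and class names are hypothetical.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    source: str      # "GenBank" or "RefSeq"
    accession: str   # e.g. "GCF_000001405"
    version: int     # e.g. 40


# Assumed pattern: GCA_/GCF_ prefix, nine digits, dot, integer version.
_PATTERN = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an "accession.version" string into source, accession, and version."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession.version: {text!r}")
    prefix, digits, version = match.groups()
    source = "GenBank" if prefix == "GCA" else "RefSeq"
    return AssemblyAccession(source, f"{prefix}_{digits}", int(version))
```

Keeping the version as an integer makes it straightforward to compare two versions of the same accession.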
Reference genome
A genome assembly that NCBI has identified as the “best” for the species. Criteria for choosing and maintaining reference genomes are described in Selecting Reference Genomes.
Regions
Locations on the primary assembly (typically on the chromosome sequences) for which alternate representations or genome patches exist.
Relation to type material
Shown if the sequences in the genome assembly were derived from type material, synonym type material, or other type material (for more information, see What is type material? and Federhen 2015):
Assembly designated as clade exemplar
The genome assembly was designated as an additional representative for species with high intra-species genome diversity.
Assembly designated as neotype
The sequences in the genome assembly were derived from neotype material.
Assembly designated as reftype
The sequences in the genome assembly were derived from reftype material.
Assembly from pathotype material
The sequences in the genome assembly were derived from pathovar type material.
Assembly from synonym type material
The sequences in the genome assembly were derived from synonym type material.
Assembly from type material
The sequences in the genome assembly were derived from type material.
ICTV additional isolate
The International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as an additional isolate for the virus species.
ICTV species exemplar
The International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as the exemplar for the virus species.
Release type
Indicates whether this version of the genome assembly is a major, minor, or patch release:
Major
Changes from the previous assembly version result in a significant change to the coordinate system. The first version of an assembly is always a major release. Most subsequent genome assembly updates are also major releases.
Minor
Changes from the previous assembly version are limited to the following, none of which results in a significant change to the coordinate system of the primary assembly unit:
- Adding, removing, or changing a non-nuclear assembly unit.
- Dropping unplaced or unlocalized scaffolds.
- Adding up to 50 unplaced or unlocalized scaffolds shorter than the current scaffold-N50 value.
- Replacing a component with a gap of the same length.
Patch
The only change from the previous assembly version is the addition or modification of a patch assembly unit (relevant for assemblies maintained by the Genome Reference Consortium).
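The scaffold criterion in the Minor release definition above (up to 50 added unplaced or unlocalized scaffolds, each shorter than the current scaffold-N50) can be checked as sketched below. The glossary does not define N50; the sketch uses the standard definition (the length at which scaffolds of that length or longer hold at least half of the total bases) and assumes "current" refers to the assembly version before the scaffolds are added. Function names are hypothetical.

```python
from typing import Sequence


def scaffold_n50(lengths: Sequence[int]) -> int:
    """Standard N50: walk scaffolds from longest to shortest until half the bases are covered."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise ValueError("scaffold lengths must be non-negative")


def added_scaffolds_fit_minor_release(
    current_lengths: Sequence[int],
    added_lengths: Sequence[int],
    max_added: int = 50,
) -> bool:
    """Check only the added-scaffold criterion, not the other Minor release conditions."""
    n50 = scaffold_n50(current_lengths)
    return len(added_lengths) <= max_added and all(
        length < n50 for length in added_lengths
    )
```

With current scaffold lengths of 100, 80, 60, 40, and 20, the N50 is 80, so adding scaffolds of length 50 and 70 would pass this check, while adding one of length 90 would not.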
Status
The current status for the GenBank and/or RefSeq assembly accession.version is shown. The possible values are “latest,” “replaced,” or “suppressed.”
Unlocalized sequence
A sequence found in an assembly associated with a specific chromosome that cannot be ordered or oriented on that chromosome. The location of these sequences cannot be expressed in chromosome coordinates.
Unplaced sequence
A sequence found in an assembly not associated with any chromosome. These sequences cannot be expressed in chromosome coordinates.
Generated November 25, 2024