Glossary

Glossary of terms used in Datasets

Alternate locus

A sequence that provides an alternate representation of a locus found in the haploid assembly (also known as the primary assembly). It offers another version of the genetic information at that location in the genome.

Alternate locus group

A set of alternate loci grouped together for annotation purposes. This may be because they are from the same haplotype or strain or for annotation convenience.

Assembly

An assembly, or assembled genome, is the set of chromosomes, unlocalized and unplaced (sometimes called “random”) sequences, and alternate sequences used to represent an organism’s genome. The NCBI Assembly Data Model defines an assembly as comprising one or more assembly units.

Assembly anomaly

See atypical genomes (below).

Assembly level

The highest assembly level for any object in the assembly. The values are as follows:

  • Complete genome

    All chromosomes are gapless, and any runs of ambiguous bases (Ns) they contain are nine bases or fewer. There are no unplaced or unlocalized scaffolds, and all the expected chromosomes are present (i.e., the assembly is not noted as having partial genome representation). Plasmids and organelles may or may not be included in the assembly, but if they are present, their sequences are gapless.

  • Chromosome

    There is sequence for one or more chromosomes. This may be a completely sequenced chromosome without gaps or a chromosome containing spans with gaps between them. There may also be unplaced or unlocalized sequences.

  • Scaffold

    One or more contigs are connected across gaps of 10 or more bases to create scaffolds. All sequences are unplaced or unlocalized.

  • Contig

    No sequence contains gaps. All sequences are unplaced or unlocalized. This level includes ultra-contig assemblies based on long-read sequencing with no gaps.
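The four levels above can be read as a decision order. The following is a minimal Python sketch of that reading, not NCBI's implementation. The `AssemblySummary` record and all of its field names are hypothetical; they stand in for facts that would have to be computed from the sequences themselves. The requirement that any plasmids and organelles be gapless is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class AssemblySummary:
    """Hypothetical summary of an assembly; not an NCBI data structure."""
    has_chromosomes: bool                   # sequence exists for one or more chromosomes
    chromosomes_gapless: bool               # no chromosome contains a gap
    longest_n_run: int                      # longest run of ambiguous bases (Ns)
    has_unplaced_or_unlocalized: bool       # any unplaced or unlocalized sequences
    all_expected_chromosomes_present: bool  # not noted as partial genome representation
    has_gapped_scaffolds: bool              # contigs joined across gaps of 10+ bases


def assembly_level(s: AssemblySummary) -> str:
    """Return the assembly level suggested by the glossary definitions."""
    if s.has_chromosomes:
        complete = (
            s.chromosomes_gapless
            and s.longest_n_run <= 9
            and not s.has_unplaced_or_unlocalized
            and s.all_expected_chromosomes_present
        )
        return "Complete genome" if complete else "Chromosome"
    # Without chromosomes, every sequence is unplaced or unlocalized.
    return "Scaffold" if s.has_gapped_scaffolds else "Contig"
```

For example, an assembly with chromosomes that still carries unplaced scaffolds would be reported at the Chromosome level under this sketch, even if every chromosome is gapless.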

Assembly name

The submitter’s name for the assembly when one is provided; otherwise, NCBI provides a default name.

Assembly type

The possible assembly types are as follows:

  • Diploid assembly

    A genome assembly for which a chromosome assembly is available for both sets of an individual’s chromosomes. A diploid genome assembly is expected to represent the genome of an individual. Therefore, alternate loci are not expected to be defined for this assembly, though it is possible that unlocalized or unplaced sequences may be part of the assembly.

  • Haploid assembly (default assembly type)

    The collection of chromosome assemblies, unlocalized and unplaced sequences representing an organism’s genome. Any locus may be represented zero or one time, and entire chromosomes are only represented zero or one time.

  • Haploid-with-alt-loci

    The collection of chromosome assemblies, unlocalized and unplaced sequences, and alternate loci representing an organism’s genome. Any locus may be represented zero, one, or more than one time, but entire chromosomes are only represented zero or one time.

  • Linked pseudohaplotype assemblies

    A genome assembly from a diploid in which many of the haplotypic sequences have been resolved and phased, and the two haplotypes have been separated. A pair of pseudohaplotype assemblies derived from the same diploid individual can be linked with a cross-reference.

  • Unresolved diploid

    A genome assembly from a diploid in which many haplotypic sequences have been resolved, but the two haplotypes have not been separated. Consequently, the assembly will be much larger than the expected haploid genome size, and two copies of many genes will be present.

Atypical genomes

Atypical genomes are genomes with one or more problems identified by NCBI relating to quality, unusual size, or other flaws in the genome assembly. For a full list of problems and definitions that result in this designation, see Genome Notes.

Average Nucleotide Identity (ANI)

Average Nucleotide Identity (ANI) is a measure of genomic similarity at the nucleotide level between two genomes. NCBI uses ANI to evaluate the taxonomic identity of genome assemblies submitted to GenBank, primarily for prokaryotes and some eukaryotes such as fungi (see NCBI Datasets Average Nucleotide Identity documentation).

Chromosome context

A term used to describe alternate loci and patches that have been aligned to the chromosome sequences defined in the Primary Assembly. While these sequences cannot strictly be expressed as chromosome coordinates, they can be related to the chromosome sequence via their alignment to the chromosome.

Full Genome

The data used to generate the assembly was obtained from the whole genome, e.g., Whole Genome Shotgun (WGS) assemblies. The assembly may still contain gaps.

GenBank assembly accession

The accession and version for the GenBank assembly (“accession.version”).

Genome patches

Sequence updates released outside the major assembly cycle. These are instantiated as independent scaffolds aligned to the primary assembly to provide chromosome context. There are two types of patches:

  • Fix patches

    These patches correct assembly errors, and the scaffolds are withdrawn in the next major assembly update. Their accessions will be made secondary to the chromosome and their sequences incorporated into the primary assembly.

  • Novel patches

    These represent new alternate loci. These sequences will be moved to the appropriate assembly unit at the next major assembly update, and the accession will remain stable.

Linked assembly

The “accession.version” and designation (principal or alternate pseudohaplotype) of a paired genome assembly derived from the same diploid individual (see “Assembly type” definitions above).

Modifier

Infraspecific or subspecies name or description.

PAR

Pseudo-autosomal region. A region found on the X and Y chromosomes of mammals that allows recombination between the sex chromosomes.

Partial Genome

The data used to generate the assembly came from only part of the genome. The reasons genome representation is set to partial include:

  • The assembly description indicates that the assembly was targeted to a single chromosome or a subset of the genome.

  • The chromosome set in the assembly is less than the expected chromosome complement for the organism, ignoring any plasmids, organelle chromosomes, and the small sex chromosome (Y for mammals, W for birds).

  • The genome coverage in a WGS assembly is less than one.

  • The ungapped sequence length of the assembly is less than half the average for other assemblies from the same species.
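The four criteria above can be expressed as independent checks. The sketch below is illustrative only: the function name and every parameter are hypothetical, and the chromosome counts are assumed to already exclude plasmids, organelle chromosomes, and the small sex chromosome, as the glossary describes.

```python
from typing import List, Optional


def partial_genome_reasons(
    targeted_subset: bool,
    chromosome_count: int,
    expected_chromosome_count: int,
    wgs_coverage: Optional[float],
    ungapped_length: int,
    species_mean_ungapped_length: Optional[float],
) -> List[str]:
    """Return the glossary criteria that would mark an assembly as partial.

    An empty list means none of the four criteria applies.
    Pass None for values that are unknown; those checks are skipped.
    """
    reasons = []
    if targeted_subset:
        reasons.append("targeted to a single chromosome or a subset of the genome")
    if chromosome_count < expected_chromosome_count:
        reasons.append("fewer chromosomes than the expected complement")
    if wgs_coverage is not None and wgs_coverage < 1:
        reasons.append("WGS genome coverage is less than one")
    if (
        species_mean_ungapped_length is not None
        and ungapped_length < 0.5 * species_mean_ungapped_length
    ):
        reasons.append("ungapped length is less than half the species average")
    return reasons
```

Returning the list of matched reasons, rather than a single boolean, keeps it visible which criterion triggered the partial designation.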

Placed sequence

Sequence that has been ordered and oriented on the chromosome. The locations of these sequences can usually be expressed in chromosome coordinates.

RefSeq assembly accession

The accession and version for the RefSeq version of the assembly (“accession.version”).
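Both the GenBank and RefSeq entries use the “accession.version” form. As a rough illustration, the sketch below splits such a string into its parts. The `GCA_` (GenBank) and `GCF_` (RefSeq) prefixes and the nine-digit body are background knowledge about NCBI assembly accessions rather than something this glossary states, and the function and class names are hypothetical.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    source: str     # "GenBank" or "RefSeq"
    accession: str  # e.g. "GCF_000001405"
    version: int    # e.g. 40


# Assumed pattern: GCA_ or GCF_, nine digits, a dot, then the version number.
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an 'accession.version' string into source, accession, and version."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession.version: {text!r}")
    prefix, digits, version = match.groups()
    source = "GenBank" if prefix == "GCA" else "RefSeq"
    return AssemblyAccession(source, f"{prefix}_{digits}", int(version))
```

Calling `parse_assembly_accession("GCF_000001405.40")` would yield the source `"RefSeq"`, the accession `"GCF_000001405"`, and the version `40`.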

Reference genome

A genome assembly that NCBI has identified as the “best” for the species. Criteria for choosing and maintaining reference genomes are described in Selecting Reference Genomes.

Regions

Locations on the primary assembly (typically on the chromosome sequences) for which alternate representations or genome patches exist.

Relation to type material

Shown if the sequences in the genome assembly were derived from type material, synonym type material, or other type material (for more information, see What is type material? and Federhen 2015):

  • Assembly designated as clade exemplar

    The genome assembly was designated as an additional representative for species with high intra-species genome diversity.

  • Assembly designated as neotype

    The sequences in the genome assembly were derived from neotype material.

  • Assembly designated as reftype

    The sequences in the genome assembly were derived from reftype material.

  • Assembly from pathotype material

    The sequences in the genome assembly were derived from pathovar type material.

  • Assembly from synonym type material

    The sequences in the genome assembly were derived from synonym type material.

  • Assembly from type material

    The sequences in the genome assembly were derived from type material.

  • ICTV additional isolate

    The International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as an additional isolate for the virus species.

  • ICTV species exemplar

    The International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as the exemplar for the virus species.

Release type

Indicates whether this version of the genome assembly is a major, minor, or patch release:

  • Major

    Changes from the previous assembly version result in a significant change to the coordinate system. The first version of an assembly is always a major release. Most subsequent genome assembly updates are also major releases.

  • Minor

    Changes from the previous assembly version are limited to the following changes, none of which result in a significant change to the coordinate system of the primary assembly unit:

    • Adding, removing, or changing a non-nuclear assembly unit.
    • Dropping unplaced or unlocalized scaffolds.
    • Adding up to 50 unplaced or unlocalized scaffolds shorter than the current scaffold-N50 value.
    • Replacing a component with a gap of the same length.

  • Patch

    The only change from the previous assembly version is the addition or modification of a patch assembly unit (relevant for assemblies maintained by the Genome Reference Consortium).
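One of the minor-release conditions above depends on the scaffold-N50 value. As an illustration, the sketch below computes N50 using its common definition (the length L such that scaffolds of length L or longer contain at least half of the total assembly length) and then checks the "up to 50 scaffolds shorter than the current scaffold-N50" condition. The function names are hypothetical, and this is not NCBI's release-classification code.

```python
from typing import Sequence


def scaffold_n50(lengths: Sequence[int]) -> int:
    """Return N50: the length at which the running total, taken from the
    longest scaffold downward, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("at least one scaffold length is required")


def additions_fit_minor_release(
    current_lengths: Sequence[int],
    added_lengths: Sequence[int],
    max_added: int = 50,
) -> bool:
    """Check the glossary condition for added unplaced or unlocalized
    scaffolds: at most 50 of them, each shorter than the current N50."""
    n50 = scaffold_n50(current_lengths)
    return len(added_lengths) <= max_added and all(
        length < n50 for length in added_lengths
    )
```

With scaffold lengths of 100, 80, 60, 40, and 20, the total is 300 and the running sum reaches half of it at the second scaffold, so N50 is 80; adding scaffolds of length 50 and 70 would satisfy the condition, while adding one of length 90 would not.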

Status

The current status of the GenBank and/or RefSeq assembly accession.version. The possible values are “latest,” “replaced,” or “suppressed.”

Unlocalized sequence

A sequence in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome. The locations of these sequences cannot be expressed in chromosome coordinates.

Unplaced sequence

A sequence in an assembly that is not associated with any chromosome. The locations of these sequences cannot be expressed in chromosome coordinates.
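The placed, unlocalized, and unplaced definitions in this glossary differ on two facts: whether a sequence is associated with a chromosome, and whether it has been ordered and oriented on it. A minimal sketch of that distinction follows; the function name and parameters are hypothetical.

```python
from typing import Optional


def placement_category(chromosome: Optional[str], ordered_and_oriented: bool) -> str:
    """Classify a sequence using the glossary definitions.

    chromosome: name of the associated chromosome, or None if there is none.
    ordered_and_oriented: True if the sequence has a known order and
    orientation on that chromosome.
    """
    if chromosome is None:
        return "unplaced"      # not associated with any chromosome
    if ordered_and_oriented:
        return "placed"        # location can usually be given in chromosome coordinates
    return "unlocalized"       # chromosome known, position on it unknown
```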

Generated November 25, 2024