Genome assembly report
Genome record accession, organism, assembly statistics, and annotation info
The downloaded genome package contains a genome assembly data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the genome assembly data report file is a hierarchical JSON
object that represents a single genome assembly record. The schema of the record is defined in the tables
below, where each row describes either a single field in the report or a sub-structure (a collection of fields).
The outermost structure of the report is AssemblyDataReport.
Table fields that include a Table Field Mnemonic can be used with the
dataformat command-line tool's --fields option.
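As a minimal sketch of reading the report, the Python below parses JSON Lines input one record at a time. The helper name `iter_reports` and the abbreviated `demo_lines` are illustrative assumptions; the first demo record reuses values from the sample report, and the second is made up.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_reports(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one AssemblyDataReport dict per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, such as a trailing newline
            yield json.loads(line)


# Abbreviated stand-ins for rows of data_report.jsonl.
# The second record is hypothetical and exists only for illustration.
demo_lines = [
    '{"organismName": "Homo sapiens", "taxId": 9606, '
    '"assemblyInfo": {"assemblyAccession": "GCF_000001405.40"}}',
    '{"organismName": "Example organism", '
    '"assemblyInfo": {"assemblyAccession": "GCA_000000000.1"}}',
]

for report in iter_reports(demo_lines):
    print(report["assemblyInfo"]["assemblyAccession"], report["organismName"])
```

With a real package, an open file handle for `ncbi_dataset/data/data_report.jsonl` can be passed in place of `demo_lines`, since a text file handle also iterates line by line.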
Sample report
{
  "annotationInfo": {
    "busco": {
      "buscoLineage": "primates_odb10",
      "buscoVer": "5.7.1",
      "complete": 0.9887518,
      "duplicated": 0.009433962,
      "fragmented": 0.0045718434,
      "missing": 0.0066763423,
      "singleCopy": 0.97931784,
      "totalCount": "13780"
    },
    "name": "GCF_000001405.40-RS_2024_08",
    "releaseDate": "2024-08-23",
    "reportUrl": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2024_08.html",
    "source": "NCBI RefSeq",
    "stats": {
      "geneCounts": {
        "nonCoding": 22163,
        "other": 411,
        "proteinCoding": 20078,
        "pseudogene": 17063,
        "total": 59715
      }
    }
  },
  "assemblyInfo": {
    "assemblyAccession": "GCF_000001405.40",
    "assemblyLevel": "Chromosome",
    "assemblyName": "GRCh38.p14",
    "assemblyStatus": "current",
    "assemblyType": "haploid-with-alt-loci",
    "bioprojectLineage": [
      {
        "bioprojects": [
          {
            "accession": "PRJNA31257",
            "title": "The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"
          }
        ]
      }
    ],
    "blastUrl": "https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_SPEC=GDH_GCF_000001405.40",
    "currentAssemblyAccession": "GCF_000001405.40",
    "description": "Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)",
    "genbankAssmAccession": "GCA_000001405.29",
    "pairedAssembly": {
      "accession": "GCA_000001405.29",
      "status": "current"
    },
    "pairedAssemblyAccession": "GCA_000001405.29",
    "refseqAssmAccession": "GCF_000001405.40",
    "refseqCategory": "reference genome",
    "submissionDate": "2022-02-03",
    "submitter": "Genome Reference Consortium",
    "ucscAssmName": "hg38"
  },
  "assemblyStats": {
    "contigL50": 18,
    "contigN50": 57879411,
    "gapsBetweenScaffoldsCount": 349,
    "gcCount": "1374283647",
    "numberOfComponentSequences": 35611,
    "numberOfContigs": 996,
    "numberOfScaffolds": 470,
    "scaffoldL50": 16,
    "scaffoldN50": 67794873,
    "totalNumberOfChromosomes": 24,
    "totalSequenceLength": "3099441038",
    "totalUngappedLength": "2948318359"
  },
  "commonName": "human",
  "organelleInfo": [
    {
      "assemblyName": "GRCh38.p14",
      "description": "Mitochondrion",
      "submitter": "Genome Reference Consortium"
    }
  ],
  "organismName": "Homo sapiens",
  "taxId": 9606
}
AssemblyDataReport Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
commonName | common-name | Common name | string | Vernacular name associated with a particular taxon | Human zebrafish pacific white shrimp |
organismName | organism-name | Organism name | string | Scientific name of the species or subspecies | Homo sapiens Arabidopsis thaliana Canis lupus familiaris |
breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford boxer |
cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680 Pmale09 |
sex | sex | Sex | string | Male or female | female |
strain | strain | Strain | string | A genetic variant, subtype or culture within a species | |
taxId | tax-id | Taxonomic ID | uint32 | The NCBI Taxonomy identifier for the organism from which the genome assembly was derived. | |
assemblyInfo | assminfo- | Assembly | AssemblyInfo | Metadata for the genome assembly submission | |
assemblyStats | assmstats- | Assembly Stats | AssemblyStats | Global statistics for the genome assembly | |
organelleInfo repeated | organelle- | Organelle | OrganelleInfo | Metadata for all associated organelle genomes | |
annotationInfo | annotinfo- | Annotation Info | AnnotationInfo | Metadata and statistics for the genome assembly annotation, when available | |
wgsInfo | wgs- | WGS | WGSInfo | Metadata pertaining to the Whole Genome Shotgun (WGS) record for the genome assemblies that are complete genomes. Those that are clone-based do not have WGS-master records. | |
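Many of the top-level fields above (for example breed, cultivar, or strain) do not appear in the sample report, so code that reads a record should not assume they are present. The sketch below builds a flat row keyed by the Table Column Name values from this table; the helper name `top_level_row` and the choice of an empty string for missing values are assumptions made for illustration.

```python
from typing import Dict

# JSON field name -> Table Column Name, copied from the table above
# (scalar fields only; sub-structures such as assemblyInfo are skipped).
TOP_LEVEL_COLUMNS = {
    "commonName": "Common name",
    "organismName": "Organism name",
    "breed": "Breed",
    "cultivar": "Cultivar",
    "ecotype": "Ecotype",
    "isolate": "Isolate",
    "sex": "Sex",
    "strain": "Strain",
    "taxId": "Taxonomic ID",
}


def top_level_row(report: Dict) -> Dict:
    """Return a flat row; fields missing from the record become ''."""
    return {
        column: report.get(field, "")
        for field, column in TOP_LEVEL_COLUMNS.items()
    }


# Top-level scalar values taken from the sample report.
sample = {"commonName": "human", "organismName": "Homo sapiens", "taxId": 9606}
row = top_level_row(sample)
print(row["Organism name"], row["Taxonomic ID"], repr(row["Breed"]))
```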
AnnotationInfo Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
name | name | Name | string | ||
source | source | Source | string | ||
releaseDate | release-date | Release Date | string | ||
reportUrl | report-url | Report URL | string | ||
stats | featcount- | Count | FeatureCounts | ||
busco | busco- | BUSCO | BuscoStat |
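Mnemonics that end in a hyphen (such as annotinfo- and busco-) belong to sub-structures. The sketch below assumes that the full name of a nested field is formed by concatenating those prefixes with the leaf mnemonic; this reading is inferred from the trailing-hyphen convention in the tables, so the resulting names should be checked against the dataformat tool's own field list. The helper `compose_mnemonic` is hypothetical.

```python
def compose_mnemonic(*parts: str) -> str:
    """Join sub-structure prefixes (each ending in '-') with a leaf mnemonic."""
    if not parts:
        raise ValueError("at least one mnemonic part is required")
    *prefixes, leaf = parts
    for prefix in prefixes:
        if not prefix.endswith("-"):
            raise ValueError(f"sub-structure mnemonic {prefix!r} should end with '-'")
    if leaf.endswith("-"):
        raise ValueError(f"leaf mnemonic {leaf!r} should not end with '-'")
    return "".join(parts)


# AssemblyDataReport.annotationInfo -> AnnotationInfo.busco -> BuscoStat.complete
print(compose_mnemonic("annotinfo-", "busco-", "complete"))
# AssemblyDataReport.assemblyStats -> AssemblyStats.contigN50
print(compose_mnemonic("assmstats-", "contig-n50"))
```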
AssemblyInfo Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
assemblyAccession | accession | Accession | string | The GenColl assembly accession | GCF_000001405.40 |
currentAssemblyAccession | current-accession | Current Accession | string | The latest GenColl assembly accession for this revision chain | GCF_000001405.40 |
assemblyStatus | status | Status | AssemblyStatus | The GenColl assembly status | current |
pairedAssemblyAccession | paired_accession | Paired Accession | string | The GenBank or RefSeq assembly accession paired with this assembly | GCA_000001405.28 |
pairedAssembly | paired-assm | Paired Assembly | PairedAssembly | Metadata from the GenBank or RefSeq assembly paired with this one | |
assemblyLevel | level | Level | string | The level at which a genome has been assembled | chromosome scaffold contig |
assemblyName | name | Name | string | The assembly submitter’s name for the genome assembly, when provided. Otherwise, a default name in the form ASM#####v# is assigned | GRCh38.p14 ASM985889v3 |
assemblyType | type | Type | string | Chromosome content of the submitted genome assembly | haploid-with-alt-loci haploid |
bioprojectLineage repeated | bioproject- | BioProject | BioProjectLineage | The lineage of BioProject accessions. The specific BioProject which produced the sequences in the genome assembly is listed first, followed in order by its antecedents. | |
submissionDate | submission-date | Submission Date | string | Date the assembly was submitted to NCBI | |
description | description | Description | string | Long description for this genome | |
genbankAssmAccession | genbank-assm-accession | GenBank Accession | string | The GenBank assembly accession is the unique identifier for the set of sequences in this particular version of the genome assembly. | GCA_000001405.28 |
submitter | submitter | Submitter | string | The submitting consortium or organization. Full submitter information is available in the BioProject | |
refseqCategory | refseq-category | Refseq Category | string | The RefSeq Category is either reference or representative genome and indicates the RefSeq project classification | reference genome representative genome |
refseqAssmAccession | refseq-assm-accession | RefSeq Accession | string | The RefSeq assembly accession is the unique identifier for the set of sequences in this particular version of the genome assembly. | GCF_000001405.40 |
ucscAssmName | ucsc-assm-name | UCSC Assembly Name | string | Genome name ascribed to this assembly by the UC Santa Cruz genome browser | hg38 |
linkedAssemblies repeated | linked-assm | Linked Assembly | LinkedAssembly | Genome assemblies derived from the same diploid individual | |
atypical | atypical | Atypical | AtypicalInfo | Information on atypical genomes - genomes that have assembly issues or are otherwise atypical | |
sequencingTech | sequencing-tech | Sequencing Tech | string | Sequencing technology used to sequence this genome | |
biosampleAccession | biosample-accession | BioSample Accession | string | NCBI BioSample Accession for the BioSample from which the sequences in the genome assembly were obtained. | SAMN03145444 |
biosample | biosample- | BioSample | BioSampleDescriptor | NCBI BioSample from which the sequences in the genome assembly were obtained. | |
blastUrl | blast-url | Blast URL | string | URL to blast page for this assembly | |
comments | coming soon | coming soon | string | Freeform comments |
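The accession fields above follow a visible pattern in the examples: GenBank accessions start with GCA_, RefSeq accessions start with GCF_, and a version number follows the dot. A small sketch that splits an accession into those parts is shown below; the regular expression and the names `AssemblyAccession` and `parse_accession` are assumptions based on the example values, not a published grammar.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    source: str   # "GenBank" or "RefSeq"
    number: str   # digits between the underscore and the dot
    version: int  # version after the dot


# Pattern inferred from examples such as GCF_000001405.40 (an assumption).
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d+)\.(\d+)$")
_SOURCES = {"GCA": "GenBank", "GCF": "RefSeq"}


def parse_accession(accession: str) -> AssemblyAccession:
    """Split an assembly accession into source, number, and version."""
    match = _ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"unrecognized assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(_SOURCES[prefix], number, int(version))


# Values from the sample report: the assembly and its paired assembly.
refseq = parse_accession("GCF_000001405.40")
genbank = parse_accession("GCA_000001405.29")
print(refseq, genbank)
print(refseq.number == genbank.number)  # same numeric part in the sample pair
```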
AssemblyStats Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
totalNumberOfChromosomes | total-number-of-chromosomes | Total Number of Chromosomes | uint32 | Count of nuclear chromosomes, organelles and plasmids in a submitted genome assembly | |
totalSequenceLength | total-sequence-len | Total Sequence Length | uint64 | Total sequence length of the nuclear genome including unplaced and unlocalized sequences | |
totalUngappedLength | total-ungapped-len | Total Ungapped Length | uint64 | Total length of all top-level sequences ignoring gaps. Any stretch of 10 or more Ns in a sequence is treated like a gap | |
numberOfContigs | number-of-contigs | Number of Contigs | uint32 | Total number of sequence contigs in the assembly. Any stretch of 10 or more Ns in a sequence is treated as a gap between two contigs in a scaffold when counting contigs and calculating contig N50 & L50 values | |
contigN50 | contig-n50 | Contig N50 | uint32 | Length such that sequence contigs of this length or longer include half the bases of the assembly | |
contigL50 | contig-l50 | Contig L50 | uint32 | Number of sequence contigs that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly | |
numberOfScaffolds | number-of-scaffolds | Number of Scaffolds | uint32 | Number of scaffolds including placed, unlocalized, unplaced, alternate loci and patch scaffolds | |
scaffoldN50 | scaffold-n50 | Scaffold N50 | uint32 | Length such that scaffolds of this length or longer include half the bases of the assembly | |
scaffoldL50 | scaffold-l50 | Scaffold L50 | uint32 | Number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly | |
gapsBetweenScaffoldsCount | gaps-between-scaffolds-count | Gaps Between Scaffolds Count | uint32 | Number of unspanned gaps between scaffolds | |
numberOfComponentSequences | number-of-component-sequences | Number of Component Sequences | uint32 | Total number of component WGS or clone sequences in the assembly | |
gcCount | gc-count | GC Count | uint64 | The number of GC base-pairs in the assembly |
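The N50 and L50 descriptions above can be illustrated with a short calculation over a list of sequence lengths. This is a sketch of the standard definition (sort longest first, accumulate until half the total bases are covered), not NCBI's own implementation; the function name `n50_l50` and the toy lengths are made up for the example.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # `length` is the N50; `count` sequences were needed to reach it.
            return length, count
    raise ValueError("lengths must contain at least one sequence")


# Toy lengths (not real data): total is 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # 70 2, because 80 + 70 reaches 150
```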
AtypicalInfo Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
isAtypical | is-atypical | Is Atypical | bool | If true there are assembly issues or the assembly is in some way non-standard | |
warnings repeated | warnings | Warnings | string | The reasons that the assembly is considered atypical |
BioProject Structure
A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. The record can be retrieved from NCBI BioProject.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | BioProject accession | PRJEB35387 |
title | title | Title | string | Title of the BioProject provided by the submitter | Sciurus carolinensis (grey squirrel) genome assembly, mSciCar1 |
parentAccessions repeated | parent-accessions | Parent Accessions | string | BioProject accession containing multiple children BioProjects | ["PRJNA489243","PRJEB33226","PRJEB40665"] |
BioProjectLineage Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
bioprojects repeated | lineage- | Lineage | BioProject | A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium |
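Because bioprojectLineage is a repeated field whose entries each hold a repeated bioprojects list, reading the accessions takes two levels of iteration. The sketch below shows one way to do that on the assemblyInfo portion of the sample report; the helper name `lineage_accessions` is an assumption.

```python
from typing import Dict, List


def lineage_accessions(assembly_info: Dict) -> List[List[str]]:
    """Return BioProject accessions grouped by lineage, in report order."""
    return [
        [project["accession"] for project in lineage.get("bioprojects", [])]
        for lineage in assembly_info.get("bioprojectLineage", [])
    ]


# The relevant part of assemblyInfo from the sample report.
sample_assembly_info = {
    "bioprojectLineage": [
        {
            "bioprojects": [
                {
                    "accession": "PRJNA31257",
                    "title": "The Human Genome Project, currently maintained "
                             "by the Genome Reference Consortium (GRC)",
                }
            ]
        }
    ]
}

print(lineage_accessions(sample_assembly_info))
```

Using `.get(..., [])` keeps the helper usable on records that carry no BioProject lineage at all.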
BioSampleAttribute Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
name | name | Name | string | ||
value | value | Value | string |
BioSampleContact Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
lab | lab | Lab | string |
BioSampleDescription Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
title | title | Title | string | ||
organism | organism- | Organism | Organism | ||
comment | comment | Comment | string |
BioSampleDescriptor Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | | SAMN20055006 |
lastUpdated | last-updated | Last updated | string | ||
publicationDate | publication-date | Publication date | string | ||
submissionDate | submission-date | Submission date | string | ||
sampleIds repeated | ids- | Sample Identifiers | BioSampleId | ||
description | description- | Description | BioSampleDescription | ||
owner | owner- | Owner | BioSampleOwner | ||
models repeated | models | Models | string | ||
bioprojects repeated | bioproject- | BioProject | BioProject | ||
package | package | Package | string | | MIGS.ba.air.4.0 |
attributes repeated | attribute- | Attribute | BioSampleAttribute | ||
status | status- | Status | BioSampleStatus |
BioSampleId Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
db | db | Database | string | | Wellcome Sanger Institute |
label | label | Label | string | | Sample name |
value | value | Value | string | | COG-UK/ALDP-17A6A8C |
BioSampleOwner Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
name | name | Name | string | ||
contacts repeated | contact- | Contact | BioSampleContact |
BioSampleStatus Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
status | status | Status | string | | live |
when | when | When | string |
BuscoStat Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
buscoLineage | lineage | Lineage | string | BUSCO Lineage | |
buscoVer | ver | Version | string | BUSCO Version | |
complete | complete | Complete | float | BUSCO score: Complete | |
singleCopy | singlecopy | Single Copy | float | BUSCO score: Single Copy | |
duplicated | duplicated | Duplicated | float | BUSCO score: Duplicated | |
fragmented | fragmented | Fragmented | float | BUSCO score: Fragmented | |
missing | missing | Missing | float | BUSCO score: Missing | |
totalCount | totalcount | Total Count | uint64 | BUSCO score: Total Count |
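In the sample report the BUSCO scores are fractions between 0 and 1 (complete, fragmented, and missing sum to 1), and totalCount is a quoted string. As an illustration, the sketch below renders those values in the compact one-line notation that BUSCO prints in its short summaries. Treating the scores as fractions of totalCount is an assumption drawn from the sample values, and `busco_summary` is a hypothetical helper.

```python
from typing import Dict


def busco_summary(busco: Dict) -> str:
    """Format BuscoStat fractions as a BUSCO-style one-line summary."""

    def pct(field: str) -> str:
        # Scores are assumed to be fractions, so scale to percentages.
        return f"{100 * busco[field]:.1f}%"

    return (
        f"C:{pct('complete')}[S:{pct('singleCopy')},D:{pct('duplicated')}],"
        f"F:{pct('fragmented')},M:{pct('missing')},n:{int(busco['totalCount'])}"
    )


# Values copied from annotationInfo.busco in the sample report.
sample_busco = {
    "buscoLineage": "primates_odb10",
    "buscoVer": "5.7.1",
    "complete": 0.9887518,
    "duplicated": 0.009433962,
    "fragmented": 0.0045718434,
    "missing": 0.0066763423,
    "singleCopy": 0.97931784,
    "totalCount": "13780",
}

print(busco_summary(sample_busco))
```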
FeatureCounts Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneCounts | gene- | Gene | GeneCounts | Counts of gene types |
GeneCounts Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
total | total | Total | uint32 | Total number of annotated genes | |
proteinCoding | protein-coding | Protein-coding | uint32 | Count of annotated genes that encode a protein | |
nonCoding | non-coding | Non-coding | uint32 | Count of transcribed non-coding genes (e.g., lncRNAs, miRNAs, rRNAs); excludes transcribed pseudogenes | |
pseudogene | pseudogene | Pseudogene | uint32 | Count of transcribed and non-transcribed pseudogenes | |
other | other | Other | uint32 | Count of genic region GeneIDs and non-genic regulatory GeneIDs |
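In the sample report, the four category counts add up exactly to total (20078 + 22163 + 17063 + 411 = 59715). The sketch below checks that relationship for a record. Whether it holds for every annotation is not stated in this document, so the check is best read as a sanity test rather than a guarantee; `gene_count_gap` is a hypothetical helper.

```python
from typing import Dict

GENE_CATEGORIES = ("proteinCoding", "nonCoding", "pseudogene", "other")


def gene_count_gap(gene_counts: Dict) -> int:
    """Return total minus the sum of the category counts (0 if they agree)."""
    categorized = sum(gene_counts.get(field, 0) for field in GENE_CATEGORIES)
    return gene_counts.get("total", 0) - categorized


# Values copied from annotationInfo.stats.geneCounts in the sample report.
sample_gene_counts = {
    "nonCoding": 22163,
    "other": 411,
    "proteinCoding": 20078,
    "pseudogene": 17063,
    "total": 59715,
}

print(gene_count_gap(sample_gene_counts))  # 0 for the sample report
```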
LinkedAssembly Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
linkedAssembly | accession | Accession | string | The linked assembly accession | GCA_000212995.1 |
assemblyType | type | Type | LinkedAssembly.LinkedAssemblyType | The linked assembly type |
OrganelleInfo Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
assemblyName | assembly-name | Assembly Name | string | Name of associated nuclear assembly | |
infraspecificName | infraspecific-name | Infraspecific Name | string | The strain, breed, cultivar or ecotype of the organism from which the sequences in the assembly were derived | |
bioproject repeated | bioproject-accessions | BioProject Accessions | string | The associated BioProject accession, when available | |
description | description | Description | string | Long description of the organelle genome | |
totalSeqLength | total-seq-length | Total Seq Length | uint64 | Sequence length of the organelle genome | |
submitter | submitter | Submitter | string | Name of submitter |
PairedAssembly Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | The GenColl assembly accession of the GenBank or RefSeq assembly paired with this one | GCF_000001405.40 |
status | status | Status | AssemblyStatus | GenColl Assembly status from paired record | current |
annotationName | name | Name | string | Annotation name from paired record |
WGSInfo Structure
Whole Genome Shotgun (WGS) projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
wgsProjectAccession | project-accession | project accession | string | | AAEX03 CABHLF01 |
masterWgsUrl | url | URL | string | | https://www.ncbi.nlm.nih.gov/nuccore/AAEX00000000.3 |
wgsContigsUrl | contigs-url | contigs URL | string | | https://www.ncbi.nlm.nih.gov/Traces/wgs/AAEX03 |
AssemblyStatus Enumeration
Name | Number | Description |
---|---|---|
ASSEMBLY_STATUS_UNKNOWN | 0 | |
current | 1 | |
previous | 2 | |
suppressed | 3 | |
retired | 4 | This is deprecated - should no longer be seen in the data. |
LinkedAssembly.LinkedAssemblyType Enumeration
Name | Number | Description |
---|---|---|
LINKED_ASSEMBLY_TYPE_UNKNOWN | 0 | |
alternate_pseudohaplotype_of_diploid | 1 | |
principal_pseudohaplotype_of_diploid | 2 | |
maternal_haplotype_of_diploid | 3 | |
paternal_haplotype_of_diploid | 4 | |
haplotype_1 | 6 | |
haplotype_2 | 7 | |
haplotype_3 | 8 | |
haplotype_4 | 9 | |
haploid | 10 | Catch all for any value that is not explicitly listed above |
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
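One practical consequence of these types shows up in the sample report: fields typed uint64 (totalSequenceLength, totalUngappedLength, gcCount, totalCount) appear as quoted strings, while uint32 fields are plain JSON numbers. This matches the protocol buffers JSON mapping, which writes 64-bit integers as strings. The sketch below converts the uint64 fields of assemblyStats before doing arithmetic; the helper name `stats_as_ints` and the choice of totalUngappedLength as the denominator for a GC fraction are assumptions made for illustration.

```python
from typing import Dict

# AssemblyStats fields typed uint64 in the table, serialized as JSON strings.
UINT64_STATS = ("totalSequenceLength", "totalUngappedLength", "gcCount")


def stats_as_ints(stats: Dict) -> Dict:
    """Return a copy of assemblyStats with uint64 string values cast to int."""
    converted = dict(stats)
    for field in UINT64_STATS:
        if field in converted:
            converted[field] = int(converted[field])
    return converted


# A subset of assemblyStats copied from the sample report.
sample_stats = {
    "contigN50": 57879411,
    "gcCount": "1374283647",
    "totalSequenceLength": "3099441038",
    "totalUngappedLength": "2948318359",
}

stats = stats_as_ints(sample_stats)
# Assumption: GC fraction is taken over ungapped bases only.
gc_fraction = stats["gcCount"] / stats["totalUngappedLength"]
print(f"{gc_fraction:.4f}")
```

Skipping the conversion would make the division fail with a TypeError, since both operands would still be strings.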