Genome assembly report

Genome record accession, organism, assembly statistics, and annotation info

The downloaded genome package contains a genome assembly data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the genome assembly data report file is a hierarchical JSON object that represents a single genome assembly record. The schema of the genome assembly record is defined in the tables below where each row describes a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is AssemblyDataReport.
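Because each line is a complete JSON object, the report can be read one record at a time with a standard JSON parser. The following is a minimal sketch; the helper names `iter_reports` and `load_reports` are hypothetical and not part of any NCBI tool.

```python
import json
from typing import Iterable, Iterator, List


def iter_reports(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed AssemblyDataReport dict per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def load_reports(path: str = "ncbi_dataset/data/data_report.jsonl") -> List[dict]:
    """Read every record from a data report file into a list."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_reports(handle))
```

For large packages, iterating with `iter_reports` over the open file avoids holding every record in memory at once.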

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how to use this tool to transform assembly data reports from JSON Lines to tabular formats.

Sample report

{
  "annotationInfo": {
    "busco": {
      "buscoLineage": "primates_odb10",
      "buscoVer": "5.7.1",
      "complete": 0.9887518,
      "duplicated": 0.009433962,
      "fragmented": 0.0045718434,
      "missing": 0.0066763423,
      "singleCopy": 0.97931784,
      "totalCount": "13780"
    },
    "name": "GCF_000001405.40-RS_2024_08",
    "releaseDate": "2024-08-23",
    "reportUrl": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2024_08.html",
    "source": "NCBI RefSeq",
    "stats": {
      "geneCounts": {
        "nonCoding": 22163,
        "other": 411,
        "proteinCoding": 20078,
        "pseudogene": 17063,
        "total": 59715
      }
    }
  },
  "assemblyInfo": {
    "assemblyAccession": "GCF_000001405.40",
    "assemblyLevel": "Chromosome",
    "assemblyName": "GRCh38.p14",
    "assemblyStatus": "current",
    "assemblyType": "haploid-with-alt-loci",
    "bioprojectLineage": [
      {
        "bioprojects": [
          {
            "accession": "PRJNA31257",
            "title": "The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"
          }
        ]
      }
    ],
    "blastUrl": "https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_SPEC=GDH_GCF_000001405.40",
    "currentAssemblyAccession": "GCF_000001405.40",
    "description": "Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)",
    "genbankAssmAccession": "GCA_000001405.29",
    "pairedAssembly": {
      "accession": "GCA_000001405.29",
      "status": "current"
    },
    "pairedAssemblyAccession": "GCA_000001405.29",
    "refseqAssmAccession": "GCF_000001405.40",
    "refseqCategory": "reference genome",
    "submissionDate": "2022-02-03",
    "submitter": "Genome Reference Consortium",
    "ucscAssmName": "hg38"
  },
  "assemblyStats": {
    "contigL50": 18,
    "contigN50": 57879411,
    "gapsBetweenScaffoldsCount": 349,
    "gcCount": "1374283647",
    "numberOfComponentSequences": 35611,
    "numberOfContigs": 996,
    "numberOfScaffolds": 470,
    "scaffoldL50": 16,
    "scaffoldN50": 67794873,
    "totalNumberOfChromosomes": 24,
    "totalSequenceLength": "3099441038",
    "totalUngappedLength": "2948318359"
  },
  "commonName": "human",
  "organelleInfo": [
    {
      "assemblyName": "GRCh38.p14",
      "description": "Mitochondrion",
      "submitter": "Genome Reference Consortium"
    }
  ],
  "organismName": "Homo sapiens",
  "taxId": 9606
}
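In the sample above, the 64-bit fields (`gcCount`, `totalSequenceLength`, `totalUngappedLength`, `totalCount`) appear as quoted strings, while 32-bit fields are plain JSON numbers. A consumer therefore needs to convert those strings before doing arithmetic. The sketch below derives a GC fraction from the sample's fields; dividing by the ungapped length is an assumption of this example, and `gc_fraction` is a hypothetical helper.

```python
from typing import Optional


def gc_fraction(report: dict) -> Optional[float]:
    """Return gcCount / totalUngappedLength, or None if either field is missing."""
    stats = report.get("assemblyStats", {})
    gc = stats.get("gcCount")
    ungapped = stats.get("totalUngappedLength")
    if gc is None or ungapped is None:
        return None
    # Both values arrive as strings in the sample report, so convert first
    gc, ungapped = int(gc), int(ungapped)
    if ungapped == 0:
        return None
    return gc / ungapped
```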

AssemblyDataReport Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`commonName`	`common-name`	Common name	`string`	Vernacular name associated with a particular taxon	`Human` `zebrafish` `pacific white shrimp`
`organismName`	`organism-name`	Organism name	`string`	Scientific name of the species or subspecies	`Homo sapiens` `Arabidopsis thaliana` `Canis lupus familiaris`
`breed`	`breed`	Breed	`string`	A homogenous group of animals within a domesticated species	`Hereford` `boxer`
`cultivar`	`cultivar`	Cultivar	`string`	A variety of plant within a species produced and maintained by cultivation	`B73`
`ecotype`	`ecotype`	Ecotype	`string`	A population or subspecies occupying a distinct habitat	`Alpine`
`isolate`	`isolate`	Isolate	`string`	The individual isolate from which the sequences in the genome assembly were derived	`L1 Dominette 01449 registration number 42190680` `Pmale09`
`sex`	`sex`	Sex	`string`	Male or female	`female`
`strain`	`strain`	Strain	`string`	A genetic variant, subtype or culture within a species
`taxId`	`tax-id`	Taxonomic ID	`uint32`	The NCBI Taxonomy identifier for the organism from which the genome assembly was derived.
`assemblyInfo`	`assminfo-`	Assembly	`AssemblyInfo`	Metadata for the genome assembly submission
`assemblyStats`	`assmstats-`	Assembly Stats	`AssemblyStats`	Global statistics for the genome assembly
`organelleInfo repeated`	`organelle-`	Organelle	`OrganelleInfo`	Metadata for all associated organelle genomes
`annotationInfo`	`annotinfo-`	Annotation Info	`AnnotationInfo`	Metadata and statistics for the genome assembly annotation, when available
`wgsInfo`	`wgs-`	WGS	`WGSInfo`	Metadata pertaining to the Whole Genome Shotgun (WGS) record for the genome assemblies that are complete genomes. Those that are clone-based do not have WGS-master records.

AnnotationInfo Structure

Field	Table Field Mnemonic	Table Column Name	Type
`name`	`name`	Name	`string`
`source`	`source`	Source	`string`
`releaseDate`	`release-date`	Release Date	`string`
`reportUrl`	`report-url`	Report URL	`string`
`stats`	`featcount-`	Count	`FeatureCounts`
`busco`	`busco-`	BUSCO	`BuscoStat`

AssemblyInfo Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`assemblyAccession`	`accession`	Accession	`string`	The GenColl assembly accession	`GCF_000001405.40`
`currentAssemblyAccession`	`current-accession`	Current Accession	`string`	The latest GenColl assembly accession for this revision chain	`GCF_000001405.40`
`assemblyStatus`	`status`	Status	`AssemblyStatus`	The GenColl assembly status	`current`
`pairedAssemblyAccession`	`paired_accession`	Paired Accession	`string`	The GenBank or RefSeq assembly accession paired with this assembly	`GCA_000001405.28`
`pairedAssembly`	`paired-assm`	Paired Assembly	`PairedAssembly`	Metadata from the GenBank or RefSeq assembly paired with this one
`assemblyLevel`	`level`	Level	`string`	The level at which a genome has been assembled	`chromosome` `scaffold` `contig`
`assemblyName`	`name`	Name	`string`	The assembly submitter’s name for the genome assembly, when provided. Otherwise, a default name in the form ASM#####v# is assigned	`GRCh38.p14` `ASM985889v3`
`assemblyType`	`type`	Type	`string`	Chromosome content of the submitted genome assembly	`haploid-with-alt-loci` `haploid`
`bioprojectLineage repeated`	`bioproject-`	BioProject	`BioProjectLineage`	The lineage of BioProject accessions. The specific BioProject which produced the sequences in the genome assembly is listed first, followed in order by its antecedents.
`submissionDate`	`submission-date`	Submission Date	`string`	Date the assembly was submitted to NCBI
`description`	`description`	Description	`string`	Long description for this genome
`genbankAssmAccession`	`genbank-assm-accession`	GenBank Accession	`string`	The GenBank assembly accession is the unique identifier for the set of sequences in this particular version of the genome assembly.	`GCA_000001405.28`
`submitter`	`submitter`	Submitter	`string`	The submitting consortium or organization. Full submitter information is available in the BioProject
`refseqCategory`	`refseq-category`	Refseq Category	`string`	The RefSeq Category is either reference or representative genome and indicates the RefSeq project classification	`reference genome` `representative genome`
`refseqAssmAccession`	`refseq-assm-accession`	RefSeq Accession	`string`	The RefSeq assembly accession is the unique identifier for the set of sequences in this particular version of the genome assembly.	`GCF_000001405.40`
`ucscAssmName`	`ucsc-assm-name`	UCSC Assembly Name	`string`	Genome name ascribed to this assembly by the UC Santa Cruz genome browser	`hg38`
`linkedAssemblies repeated`	`linked-assm`	Linked Assembly	`LinkedAssembly`	Genome assemblies derived from the same diploid individual
`atypical`	`atypical`	Atypical	`AtypicalInfo`	Information on atypical genomes - genomes that have assembly issues or are otherwise atypical
`sequencingTech`	`sequencing-tech`	Sequencing Tech	`string`	Sequencing technology used to sequence this genome
`biosampleAccession`	`biosample-accession`	BioSample Accession	`string`	NCBI BioSample Accession for the BioSample from which the sequences in the genome assembly were obtained.	`SAMN03145444`
`biosample`	`biosample-`	BioSample	`BioSampleDescriptor`	NCBI BioSample from which the sequences in the genome assembly were obtained.
`blastUrl`	`blast-url`	Blast URL	`string`	URL to blast page for this assembly
`comments`	coming soon	coming soon	`string`	Freeform comments

AssemblyStats Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description
`totalNumberOfChromosomes`	`total-number-of-chromosomes`	Total Number of Chromosomes	`uint32`	Count of nuclear chromosomes, organelles and plasmids in a submitted genome assembly
`totalSequenceLength`	`total-sequence-len`	Total Sequence Length	`uint64`	Total sequence length of the nuclear genome including unplaced and unlocalized sequences
`totalUngappedLength`	`total-ungapped-len`	Total Ungapped Length	`uint64`	Total length of all top-level sequences ignoring gaps. Any stretch of 10 or more Ns in a sequence is treated like a gap
`numberOfContigs`	`number-of-contigs`	Number of Contigs	`uint32`	Total number of sequence contigs in the assembly. Any stretch of 10 or more Ns in a sequence is treated as a gap between two contigs in a scaffold when counting contigs and calculating contig N50 & L50 values
`contigN50`	`contig-n50`	Contig N50	`uint32`	Length such that sequence contigs of this length or longer include half the bases of the assembly
`contigL50`	`contig-l50`	Contig L50	`uint32`	Number of sequence contigs that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly
`numberOfScaffolds`	`number-of-scaffolds`	Number of Scaffolds	`uint32`	Number of scaffolds including placed, unlocalized, unplaced, alternate loci and patch scaffolds
`scaffoldN50`	`scaffold-n50`	Scaffold N50	`uint32`	Length such that scaffolds of this length or longer include half the bases of the assembly
`scaffoldL50`	`scaffold-l50`	Scaffold L50	`uint32`	Number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly
`gapsBetweenScaffoldsCount`	`gaps-between-scaffolds-count`	Gaps Between Scaffolds Count	`uint32`	Number of unspanned gaps between scaffolds
`numberOfComponentSequences`	`number-of-component-sequences`	Number of Component Sequences	`uint32`	Total number of component WGS or clone sequences in the assembly
`gcCount`	`gc-count`	GC Count	`uint64`	The number of GC base-pairs in the assembly
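The N50 and L50 definitions above can be made concrete with a small worked example. This is a sketch of the standard calculation over a list of sequence lengths, not the code NCBI uses to populate the report; `n50_l50` is a hypothetical helper.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first length at which the running total reaches half the bases
        # is the N50; the number of sequences used so far is the L50.
        if running >= half:
            return length, count
    raise ValueError("lengths must not be empty")
```

For lengths 80, 70, 50, 40, 30 and 20 (290 bases in total), the two longest sequences already cover 150 bases, which is at least half, so the N50 is 70 and the L50 is 2.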

AtypicalInfo Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`isAtypical`	`is-atypical`	Is Atypical	`bool`	If true there are assembly issues or the assembly is in some way non-standard
`warnings repeated`	`warnings`	Warnings	`string`	The reasons that the assembly is considered atypical

BioProject Structure

A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. The record can be retrieved from NCBI BioProject.

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`accession`	`accession`	Accession	`string`	BioProject accession	`PRJEB35387`
`title`	`title`	Title	`string`	Title of the BioProject provided by the submitter	`Sciurus carolinensis (grey squirrel) genome assembly, mSciCar1`
`parentAccessions repeated`	`parent-accessions`	Parent Accessions	`string`	BioProject accession containing multiple children BioProjects	`["PRJNA489243","PRJEB33226","PRJEB40665"]`

BioProjectLineage Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`bioprojects repeated`	`lineage-`	Lineage	`BioProject`	A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium
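Since `bioprojectLineage` is a repeated field whose entries each hold a repeated `bioprojects` list, reaching the accessions takes two levels of iteration. A minimal sketch, with the hypothetical helper `lineage_accessions`:

```python
def lineage_accessions(report):
    """Return one list of BioProject accessions per lineage, in report order."""
    lineages = report.get("assemblyInfo", {}).get("bioprojectLineage", [])
    return [
        [project.get("accession") for project in lineage.get("bioprojects", [])]
        for lineage in lineages
    ]
```

Applied to the sample report, this returns a single lineage containing `PRJNA31257`.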

BioSampleAttribute Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`name`	`name`	Name	`string`
`value`	`value`	Value	`string`

BioSampleContact Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`lab`	`lab`	Lab	`string`

BioSampleDescription Structure

Field	Table Field Mnemonic	Table Column Name	Type
`title`	`title`	Title	`string`
`organism`	`organism-`	Organism	`Organism`
`comment`	`comment`	Comment	`string`

BioSampleDescriptor Structure

Field	Table Field Mnemonic	Table Column Name	Type	Examples
`accession`	`accession`	Accession	`string`	`SAMN20055006`
`lastUpdated`	`last-updated`	Last updated	`string`
`publicationDate`	`publication-date`	Publication date	`string`
`submissionDate`	`submission-date`	Submission date	`string`
`sampleIds repeated`	`ids-`	Sample Identifiers	`BioSampleId`
`description`	`description-`	Description	`BioSampleDescription`
`owner`	`owner-`	Owner	`BioSampleOwner`
`models repeated`	`models`	Models	`string`
`bioprojects repeated`	`bioproject-`	BioProject	`BioProject`
`package`	`package`	Package	`string`	`MIGS.ba.air.4.0`
`attributes repeated`	`attribute-`	Attribute	`BioSampleAttribute`
`status`	`status-`	Status	`BioSampleStatus`

BioSampleId Structure

Field	Table Field Mnemonic	Table Column Name	Type	Examples
`db`	`db`	Database	`string`	`Wellcome Sanger Institute`
`label`	`label`	Label	`string`	`Sample name`
`value`	`value`	Value	`string`	`COG-UK/ALDP-17A6A8C`

BioSampleOwner Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`name`	`name`	Name	`string`
`contacts repeated`	`contact-`	Contact	`BioSampleContact`

BioSampleStatus Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`status`	`status`	Status	`string`		`live`
`when`	`when`	When	`string`

BuscoStat Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description
`buscoLineage`	`lineage`	Lineage	`string`	BUSCO Lineage
`buscoVer`	`ver`	Version	`string`	BUSCO Version
`complete`	`complete`	Complete	`float`	BUSCO score: Complete
`singleCopy`	`singlecopy`	Single Copy	`float`	BUSCO score: Single Copy
`duplicated`	`duplicated`	Duplicated	`float`	BUSCO score: Duplicated
`fragmented`	`fragmented`	Fragmented	`float`	BUSCO score: Fragmented
`missing`	`missing`	Missing	`float`	BUSCO score: Missing
`totalCount`	`totalcount`	Total Count	`uint64`	BUSCO score: Total Count
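In the sample report the BUSCO scores are fractions of `totalCount` rather than percentages, with `complete` equal to `singleCopy` plus `duplicated`, and `complete`, `fragmented` and `missing` summing to 1. The sketch below checks those relationships within a tolerance; the helper name and the tolerance value are assumptions, and the check may not suit records that omit some scores.

```python
import math


def busco_is_consistent(busco, tolerance=1e-4):
    """Check that BUSCO fractions add up as they do in the sample report."""
    complete_ok = math.isclose(
        busco["complete"],
        busco["singleCopy"] + busco["duplicated"],
        abs_tol=tolerance,
    )
    total_ok = math.isclose(
        busco["complete"] + busco["fragmented"] + busco["missing"],
        1.0,
        abs_tol=tolerance,
    )
    return complete_ok and total_ok
```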

FeatureCounts Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`geneCounts`	`gene-`	Gene	`GeneCounts`	Counts of gene types

GeneCounts Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description
`total`	`total`	Total	`uint32`	Total number of annotated genes
`proteinCoding`	`protein-coding`	Protein-coding	`uint32`	Count of annotated genes that encode a protein
`nonCoding`	`non-coding`	Non-coding	`uint32`	Count of transcribed non-coding genes (e.g. lncRNAs, miRNAs, rRNAs); excludes transcribed pseudogenes
`pseudogene`	`pseudogene`	Pseudogene	`uint32`	Count of transcribed and non-transcribed pseudogenes
`other`	`other`	Other	`uint32`	Count of genic region GeneIDs and non-genic regulatory GeneIDs
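In the sample report, `total` (59715) equals the sum of the four category counts (20078 + 22163 + 17063 + 411). Whether that holds for every annotation is an assumption; the hypothetical helper below reports the difference so it can be checked per record.

```python
def gene_count_gap(gene_counts):
    """Return total minus the sum of the category counts (0 when they agree)."""
    parts = ("proteinCoding", "nonCoding", "pseudogene", "other")
    # Missing categories are treated as zero
    return gene_counts.get("total", 0) - sum(gene_counts.get(part, 0) for part in parts)
```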

LinkedAssembly Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`linkedAssembly`	`accession`	Accession	`string`	The linked assembly accession	`GCA_000212995.1`
`assemblyType`	`type`	Type	`LinkedAssembly.LinkedAssemblyType`	The linked assembly type

OrganelleInfo Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description
`assemblyName`	`assembly-name`	Assembly Name	`string`	Name of associated nuclear assembly
`infraspecificName`	`infraspecific-name`	Infraspecific Name	`string`	The strain, breed, cultivar or ecotype of the organism from which the sequences in the assembly were derived
`bioproject repeated`	`bioproject-accessions`	BioProject Accessions	`string`	The associated BioProject accession, when available
`description`	`description`	Description	`string`	Long description of the organelle genome
`totalSeqLength`	`total-seq-length`	Total Seq Length	`uint64`	Sequence length of the organelle genome
`submitter`	`submitter`	Submitter	`string`	Name of submitter

PairedAssembly Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`accession`	`accession`	Accession	`string`	The GenColl assembly accession of the GenBank or RefSeq assembly paired with this one	`GCF_000001405.40`
`status`	`status`	Status	`AssemblyStatus`	GenColl Assembly status from paired record	`current`
`annotationName`	`name`	Name	`string`	Annotation name from paired record

WGSInfo Structure

Whole Genome Shotgun (WGS) projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy.

Field	Table Field Mnemonic	Table Column Name	Type	Examples
`wgsProjectAccession`	`project-accession`	project accession	`string`	`AAEX03` `CABHLF01`
`masterWgsUrl`	`url`	URL	`string`	`https://www.ncbi.nlm.nih.gov/nuccore/AAEX00000000.3`
`wgsContigsUrl`	`contigs-url`	contigs URL	`string`	`https://www.ncbi.nlm.nih.gov/Traces/wgs/AAEX03`

AssemblyStatus Enumeration

Name	Number	Description
`ASSEMBLY_STATUS_UNKNOWN`	`0`
`current`	`1`
`previous`	`2`
`suppressed`	`3`
`retired`	`4`	This is deprecated - should no longer be seen in the data.
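In the JSON report the status appears by name (for example `"assemblyStatus": "current"` in the sample), not by number. One way to work with it in code is to mirror the table above as an enumeration and look records up by name. This is a sketch; the class and the `status_of` helper are defined here for illustration and do not come from an NCBI library.

```python
from enum import IntEnum


class AssemblyStatus(IntEnum):
    """Mirror of the AssemblyStatus names and numbers listed above."""
    ASSEMBLY_STATUS_UNKNOWN = 0
    current = 1
    previous = 2
    suppressed = 3
    retired = 4  # deprecated according to the table above


def status_of(report):
    """Map a report's assemblyStatus string to the enum, defaulting to unknown."""
    name = report.get("assemblyInfo", {}).get("assemblyStatus")
    return AssemblyStatus.__members__.get(name, AssemblyStatus.ASSEMBLY_STATUS_UNKNOWN)
```

A list of parsed records could then be narrowed with `[r for r in reports if status_of(r) is AssemblyStatus.current]`.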

LinkedAssembly.LinkedAssemblyType Enumeration

Name	Number	Description
`LINKED_ASSEMBLY_TYPE_UNKNOWN`	`0`
`alternate_pseudohaplotype_of_diploid`	`1`	SEQUI-5245
`principal_pseudohaplotype_of_diploid`	`2`
`maternal_haplotype_of_diploid`	`3`
`paternal_haplotype_of_diploid`	`4`
`haplotype_1`	`6`
`haplotype_2`	`7`
`haplotype_3`	`8`
`haplotype_4`	`9`
`haploid`	`10`	Catch all for any value that is not explicitly listed above

Scalar Value Types

Protocol buffers type	Notes	C++	Python	Java	Go
`double`		`double`	`float`	`double`	`float64`
`float`		`float`	`float`	`float`	`float32`
`int32`	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.	`int32`	`int`	`int`	`int32`
`int64`	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead.	`int64`	`int/long`	`long`	`int64`
`uint32`	Uses variable-length encoding.	`uint32`	`int/long`	`int`	`uint32`
`uint64`	Uses variable-length encoding.	`uint64`	`int/long`	`long`	`uint64`
`sint32`	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.	`int32`	`int`	`int`	`int32`
`sint64`	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s.	`int64`	`int/long`	`long`	`int64`
`fixed32`	Always four bytes. More efficient than uint32 if values are often greater than 2^28.	`uint32`	`int`	`int`	`uint32`
`fixed64`	Always eight bytes. More efficient than uint64 if values are often greater than 2^56.	`uint64`	`int/long`	`long`	`uint64`
`sfixed32`	Always four bytes.	`int32`	`int`	`int`	`int32`
`sfixed64`	Always eight bytes.	`int64`	`int/long`	`long`	`int64`
`bool`		`bool`	`bool`	`boolean`	`bool`
`string`	A string must always contain UTF-8 encoded or 7-bit ASCII text.	`string`	`str/unicode`	`String`	`string`
`bytes`	May contain any arbitrary sequence of bytes.	`string`	`str`	`ByteString`	`[]byte`

Generated November 25, 2024