Genome assembly report

Genome record accession, organism, assembly statistics, and annotation info

The downloaded genome package contains a genome assembly data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the genome assembly data report file is a hierarchical JSON object that represents a single genome assembly record. The schema of the genome assembly record is defined in the tables below where each row describes a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is AssemblyDataReport.

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform assembly data reports from JSON Lines to tabular formats.
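As a minimal sketch of reading the report outside of dataformat, the snippet below parses a JSON Lines stream one record at a time using only the Python standard library. The helper name read_assembly_reports is hypothetical, and the two short records are illustrative stand-ins for real lines of data_report.jsonl.

```python
import io
import json


def read_assembly_reports(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Stand-in for open("ncbi_dataset/data/data_report.jsonl");
# these records are illustrative, not real report lines.
sample = io.StringIO(
    '{"organismName": "Homo sapiens", "taxId": 9606}\n'
    '\n'
    '{"organismName": "Canis lupus familiaris"}\n'
)

reports = list(read_assembly_reports(sample))
print([r["organismName"] for r in reports])
```

With a downloaded package, the same generator can be fed an open file handle instead of the in-memory stream.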

Sample report

{
  "annotationInfo": {
    "busco": {
      "buscoLineage": "primates_odb10",
      "buscoVer": "5.7.1",
      "complete": 0.9887518,
      "duplicated": 0.009433962,
      "fragmented": 0.0045718434,
      "missing": 0.0066763423,
      "singleCopy": 0.97931784,
      "totalCount": "13780"
    },
    "name": "GCF_000001405.40-RS_2024_08",
    "releaseDate": "2024-08-23",
    "reportUrl": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2024_08.html",
    "source": "NCBI RefSeq",
    "stats": {
      "geneCounts": {
        "nonCoding": 22163,
        "other": 411,
        "proteinCoding": 20078,
        "pseudogene": 17063,
        "total": 59715
      }
    }
  },
  "assemblyInfo": {
    "assemblyAccession": "GCF_000001405.40",
    "assemblyLevel": "Chromosome",
    "assemblyName": "GRCh38.p14",
    "assemblyStatus": "current",
    "assemblyType": "haploid-with-alt-loci",
    "bioprojectLineage": [
      {
        "bioprojects": [
          {
            "accession": "PRJNA31257",
            "title": "The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"
          }
        ]
      }
    ],
    "blastUrl": "https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_SPEC=GDH_GCF_000001405.40",
    "currentAssemblyAccession": "GCF_000001405.40",
    "description": "Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)",
    "genbankAssmAccession": "GCA_000001405.29",
    "pairedAssembly": {
      "accession": "GCA_000001405.29",
      "status": "current"
    },
    "pairedAssemblyAccession": "GCA_000001405.29",
    "refseqAssmAccession": "GCF_000001405.40",
    "refseqCategory": "reference genome",
    "submissionDate": "2022-02-03",
    "submitter": "Genome Reference Consortium",
    "ucscAssmName": "hg38"
  },
  "assemblyStats": {
    "contigL50": 18,
    "contigN50": 57879411,
    "gapsBetweenScaffoldsCount": 349,
    "gcCount": "1374283647",
    "numberOfComponentSequences": 35611,
    "numberOfContigs": 996,
    "numberOfScaffolds": 470,
    "scaffoldL50": 16,
    "scaffoldN50": 67794873,
    "totalNumberOfChromosomes": 24,
    "totalSequenceLength": "3099441038",
    "totalUngappedLength": "2948318359"
  },
  "commonName": "human",
  "organelleInfo": [
    {
      "assemblyName": "GRCh38.p14",
      "description": "Mitochondrion",
      "submitter": "Genome Reference Consortium"
    }
  ],
  "organismName": "Homo sapiens",
  "taxId": 9606
}
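In the sample report, fields typed uint64 in the tables below (for example totalSequenceLength and gcCount) appear as quoted strings, while uint32 fields are plain JSON numbers. The sketch below, which assumes that pattern holds for other records, converts such values with int() and uses dict.get() for fields that a record may omit. The record is a trimmed copy of the sample report.

```python
import json

# Trimmed copy of the sample report above.
record = json.loads("""
{
  "assemblyInfo": {"assemblyAccession": "GCF_000001405.40", "assemblyName": "GRCh38.p14"},
  "assemblyStats": {
    "contigN50": 57879411,
    "totalSequenceLength": "3099441038",
    "totalUngappedLength": "2948318359"
  },
  "organismName": "Homo sapiens"
}
""")

stats = record["assemblyStats"]

# uint64 values arrive as strings in the sample, so convert before arithmetic.
total_length = int(stats["totalSequenceLength"])
ungapped_length = int(stats["totalUngappedLength"])
length_difference = total_length - ungapped_length

# Optional fields such as breed are absent from this record.
breed = record.get("breed")

print(record["assemblyInfo"]["assemblyAccession"], total_length, length_difference, breed)
```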

AssemblyDataReport Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
commonName | common-name | Common name | string | Vernacular name associated with a particular taxon | Human; zebrafish; pacific white shrimp
organismName | organism-name | Organism name | string | Scientific name of the species or subspecies | Homo sapiens; Arabidopsis thaliana; Canis lupus familiaris
breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford; boxer
cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73
ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine
isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680; Pmale09
sex | sex | Sex | string | Male or female | female
strain | strain | Strain | string | A genetic variant, subtype or culture within a species |
taxId | tax-id | Taxonomic ID | uint32 | The NCBI Taxonomy identifier for the organism from which the genome assembly was derived. |
assemblyInfo | assminfo- | Assembly | AssemblyInfo | Metadata for the genome assembly submission |
assemblyStats | assmstats- | Assembly Stats | AssemblyStats | Global statistics for the genome assembly |
organelleInfo repeated | organelle- | Organelle | OrganelleInfo | Metadata for all associated organelle genomes |
annotationInfo | annotinfo- | Annotation Info | AnnotationInfo | Metadata and statistics for the genome assembly annotation, when available |
wgsInfo | wgs- | WGS | WGSInfo | Metadata pertaining to the Whole Genome Shotgun (WGS) record for the genome assemblies that are complete genomes. Those that are clone-based do not have WGS-master records. |

AnnotationInfo Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
name | name | Name | string | |
source | source | Source | string | |
releaseDate | release-date | Release Date | string | |
reportUrl | report-url | Report URL | string | |
stats | featcount- | Count | FeatureCounts | |
busco | busco- | BUSCO | BuscoStat | |

AssemblyInfo Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
assemblyAccession | accession | Accession | string | The GenColl assembly accession | GCF_000001405.40
currentAssemblyAccession | current-accession | Current Accession | string | The latest GenColl assembly accession for this revision chain | GCF_000001405.40
assemblyStatus | status | Status | AssemblyStatus | The GenColl assembly status | current
pairedAssemblyAccession | paired_accession | Paired Accession | string | The GenBank or RefSeq assembly accession paired with this assembly | GCA_000001405.28
pairedAssembly | paired-assm | Paired Assembly | PairedAssembly | Metadata from the GenBank or RefSeq assembly paired with this one |
assemblyLevel | level | Level | string | The level at which a genome has been assembled | chromosome; scaffold; contig
assemblyName | name | Name | string | The assembly submitter’s name for the genome assembly, when provided. Otherwise, a default name in the form ASM#####v# is assigned | GRCh38.p14; ASM985889v3
assemblyType | type | Type | string | Chromosome content of the submitted genome assembly | haploid-with-alt-loci; haploid
bioprojectLineage repeated | bioproject- | BioProject | BioProjectLineage | The lineage of BioProject accessions. The specific BioProject which produced the sequences in the genome assembly is listed first, followed in order by its antecedents. |
submissionDate | submission-date | Submission Date | string | Date the assembly was submitted to NCBI |
description | description | Description | string | Long description for this genome |
genbankAssmAccession | genbank-assm-accession | GenBank Accession | string | Accession for the GenBank assembly is the unique identifier for the set of sequences in this particular version of the genome assembly. | GCA_000001405.28
submitter | submitter | Submitter | string | The submitting consortium or organization. Full submitter information is available in the BioProject |
refseqCategory | refseq-category | Refseq Category | string | The RefSeq Category is either reference or representative genome and indicates the RefSeq project classification | reference genome; representative genome
refseqAssmAccession | refseq-assm-accession | RefSeq Accession | string | RefSeq assembly accession is the unique identifier for the set of sequences in this particular version of the genome assembly. | GCF_000001405.40
ucscAssmName | ucsc-assm-name | UCSC Assembly Name | string | Genome name ascribed to this assembly by the UC Santa Cruz genome browser | hg38
linkedAssemblies repeated | linked-assm | Linked Assembly | LinkedAssembly | Genome assemblies derived from the same diploid individual |
atypical | atypical | Atypical | AtypicalInfo | Information on atypical genomes: genomes that have assembly issues or are otherwise atypical |
sequencingTech | sequencing-tech | Sequencing Tech | string | Sequencing technology used to sequence this genome |
biosampleAccession | biosample-accession | BioSample Accession | string | NCBI BioSample Accession for the BioSample from which the sequences in the genome assembly were obtained. | SAMN03145444
biosample | biosample- | BioSample | BioSampleDescriptor | NCBI BioSample from which the sequences in the genome assembly were obtained. |
blastUrl | blast-url | Blast URL | string | URL to blast page for this assembly |
comments | coming soon | coming soon | string | Freeform comments |

AssemblyStats Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
totalNumberOfChromosomes | total-number-of-chromosomes | Total Number of Chromosomes | uint32 | Count of nuclear chromosomes, organelles and plasmids in a submitted genome assembly |
totalSequenceLength | total-sequence-len | Total Sequence Length | uint64 | Total sequence length of the nuclear genome including unplaced and unlocalized sequences |
totalUngappedLength | total-ungapped-len | Total Ungapped Length | uint64 | Total length of all top-level sequences ignoring gaps. Any stretch of 10 or more Ns in a sequence is treated like a gap |
numberOfContigs | number-of-contigs | Number of Contigs | uint32 | Total number of sequence contigs in the assembly. Any stretch of 10 or more Ns in a sequence is treated as a gap between two contigs in a scaffold when counting contigs and calculating contig N50 & L50 values |
contigN50 | contig-n50 | Contig N50 | uint32 | Length such that sequence contigs of this length or longer include half the bases of the assembly |
contigL50 | contig-l50 | Contig L50 | uint32 | Number of sequence contigs that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly |
numberOfScaffolds | number-of-scaffolds | Number of Scaffolds | uint32 | Number of scaffolds including placed, unlocalized, unplaced, alternate loci and patch scaffolds |
scaffoldN50 | scaffold-n50 | Scaffold N50 | uint32 | Length such that scaffolds of this length or longer include half the bases of the assembly |
scaffoldL50 | scaffold-l50 | Scaffold L50 | uint32 | Number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly |
gapsBetweenScaffoldsCount | gaps-between-scaffolds-count | Gaps Between Scaffolds Count | uint32 | Number of unspanned gaps between scaffolds |
numberOfComponentSequences | number-of-component-sequences | Number of Component Sequences | uint32 | Total number of component WGS or clone sequences in the assembly |
gcCount | gc-count | GC Count | uint64 | The number of GC base-pairs in the assembly |
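The N50 and L50 definitions above can be illustrated with a short calculation. This is a sketch of the usual procedure (sort lengths in descending order and accumulate until half the total is reached), not the code NCBI uses to produce the report. The function name n50_l50 and the toy lengths are hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the N50; `count` sequences of at least that
            # length together cover half the bases, which is the L50.
            return length, count
    raise ValueError("lengths must not be empty")


# Toy lengths: total 300, so half is 150; 80 + 70 reaches it.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```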

AtypicalInfo Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
isAtypical | is-atypical | Is Atypical | bool | If true there are assembly issues or the assembly is in some way non-standard |
warnings repeated | warnings | Warnings | string | The reasons that the assembly is considered atypical |

BioProject Structure

A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. The record can be retrieved from NCBI BioProject.

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
accession | accession | Accession | string | BioProject accession | PRJEB35387
title | title | Title | string | Title of the BioProject provided by the submitter | Sciurus carolinensis (grey squirrel) genome assembly, mSciCar1
parentAccessions repeated | parent-accessions | Parent Accessions | string | BioProject accession containing multiple children BioProjects | ["PRJNA489243","PRJEB33226","PRJEB40665"]

BioProjectLineage Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
bioprojects repeated | lineage- | Lineage | BioProject | A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium |

BioSampleAttribute Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
name | name | Name | string | |
value | value | Value | string | |

BioSampleContact Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
lab | lab | Lab | string | |

BioSampleDescription Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
title | title | Title | string | |
organism | organism- | Organism | Organism | |
comment | comment | Comment | string | |

BioSampleDescriptor Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
accession | accession | Accession | string | | SAMN20055006
lastUpdated | last-updated | Last updated | string | |
publicationDate | publication-date | Publication date | string | |
submissionDate | submission-date | Submission date | string | |
sampleIds repeated | ids- | Sample Identifiers | BioSampleId | |
description | description- | Description | BioSampleDescription | |
owner | owner- | Owner | BioSampleOwner | |
models repeated | models | Models | string | |
bioprojects repeated | bioproject- | BioProject | BioProject | |
package | package | Package | string | | MIGS.ba.air.4.0
attributes repeated | attribute- | Attribute | BioSampleAttribute | |
status | status- | Status | BioSampleStatus | |

BioSampleId Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
db | db | Database | string | | Wellcome Sanger Institute
label | label | Label | string | | Sample name
value | value | Value | string | | COG-UK/ALDP-17A6A8C

BioSampleOwner Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
name | name | Name | string | |
contacts repeated | contact- | Contact | BioSampleContact | |

BioSampleStatus Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
status | status | Status | string | | live
when | when | When | string | |

BuscoStat Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
buscoLineage | lineage | Lineage | string | BUSCO Lineage |
buscoVer | ver | Version | string | BUSCO Version |
complete | complete | Complete | float | BUSCO score: Complete |
singleCopy | singlecopy | Single Copy | float | BUSCO score: Single Copy |
duplicated | duplicated | Duplicated | float | BUSCO score: Duplicated |
fragmented | fragmented | Fragmented | float | BUSCO score: Fragmented |
missing | missing | Missing | float | BUSCO score: Missing |
totalCount | totalcount | Total Count | uint64 | BUSCO score: Total Count |
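In the sample report the BUSCO scores are fractions between 0 and 1 rather than percentages, and totalCount is a quoted string. The sketch below checks the usual BUSCO relationships on those sample values: complete is the sum of single-copy and duplicated, and complete, fragmented and missing together account for all markers. The helper name busco_is_consistent and the tolerance are assumptions chosen for illustration.

```python
import math

# BUSCO block copied from the sample report above.
busco = {
    "complete": 0.9887518,
    "duplicated": 0.009433962,
    "fragmented": 0.0045718434,
    "missing": 0.0066763423,
    "singleCopy": 0.97931784,
    "totalCount": "13780",
}


def busco_is_consistent(b, tol=1e-4):
    """Check that the fractional BUSCO scores add up within a tolerance."""
    complete_ok = math.isclose(
        b["singleCopy"] + b["duplicated"], b["complete"], abs_tol=tol
    )
    total_ok = math.isclose(
        b["complete"] + b["fragmented"] + b["missing"], 1.0, abs_tol=tol
    )
    return complete_ok and total_ok


# Convert the fraction back to an approximate marker count.
complete_count = round(busco["complete"] * int(busco["totalCount"]))

print(busco_is_consistent(busco), complete_count)
```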

FeatureCounts Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
geneCounts | gene- | Gene | GeneCounts | Counts of gene types |

GeneCounts Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
total | total | Total | uint32 | Total number of annotated genes |
proteinCoding | protein-coding | Protein-coding | uint32 | Count of annotated genes that encode a protein |
nonCoding | non-coding | Non-coding | uint32 | Count of transcribed non-coding genes (e.g. lncRNAs, miRNAs, rRNAs, etc.); excludes transcribed pseudogenes |
pseudogene | pseudogene | Pseudogene | uint32 | Count of transcribed and non-transcribed pseudogenes |
other | other | Other | uint32 | Count of genic region GeneIDs and non-genic regulatory GeneIDs |

LinkedAssembly Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
linkedAssembly | accession | Accession | string | The linked assembly accession | GCA_000212995.1
assemblyType | type | Type | LinkedAssembly.LinkedAssemblyType | The linked assembly type |

OrganelleInfo Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
assemblyName | assembly-name | Assembly Name | string | Name of associated nuclear assembly |
infraspecificName | infraspecific-name | Infraspecific Name | string | The strain, breed, cultivar or ecotype of the organism from which the sequences in the assembly were derived |
bioproject repeated | bioproject-accessions | BioProject Accessions | string | The associated BioProject accession, when available |
description | description | Description | string | Long description of the organelle genome |
totalSeqLength | total-seq-length | Total Seq Length | uint64 | Sequence length of the organelle genome |
submitter | submitter | Submitter | string | Name of submitter |

PairedAssembly Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
accession | accession | Accession | string | The GenColl assembly accession of the GenBank or RefSeq assembly paired with this one | GCF_000001405.40
status | status | Status | AssemblyStatus | GenColl Assembly status from paired record | current
annotationName | name | Name | string | Annotation name from paired record |

WGSInfo Structure

Whole Genome Shotgun (WGS) projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy.

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
wgsProjectAccession | project-accession | project accession | string | | AAEX03; CABHLF01
masterWgsUrl | url | URL | string | | https://www.ncbi.nlm.nih.gov/nuccore/AAEX00000000.3
wgsContigsUrl | contigs-url | contigs URL | string | | https://www.ncbi.nlm.nih.gov/Traces/wgs/AAEX03

AssemblyStatus Enumeration

Name | Number | Description
ASSEMBLY_STATUS_UNKNOWN | 0 |
current | 1 |
previous | 2 |
suppressed | 3 |
retired | 4 | This is deprecated - should no longer be seen in the data.
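In the sample report the status appears by name ("assemblyStatus": "current") rather than by number. As a rough sketch under that assumption, the enumeration can be mirrored with a Python IntEnum and used to keep only current assemblies. The class below is a local illustration rather than an NCBI-provided type, and is_current is a hypothetical helper.

```python
from enum import IntEnum


class AssemblyStatus(IntEnum):
    """Local mirror of the AssemblyStatus enumeration table."""
    ASSEMBLY_STATUS_UNKNOWN = 0
    current = 1
    previous = 2
    suppressed = 3
    retired = 4  # deprecated according to the table


def is_current(record):
    """True when a parsed report record has assemblyStatus 'current'."""
    name = record.get("assemblyInfo", {}).get(
        "assemblyStatus", "ASSEMBLY_STATUS_UNKNOWN"
    )
    # Lookup by member name; an unlisted name raises KeyError.
    return AssemblyStatus[name] is AssemblyStatus.current


records = [
    {"assemblyInfo": {"assemblyAccession": "GCF_000001405.40", "assemblyStatus": "current"}},
    {"assemblyInfo": {"assemblyStatus": "suppressed"}},  # illustrative record
]
print([is_current(r) for r in records])
```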

LinkedAssembly.LinkedAssemblyType Enumeration

Name | Number | Description
LINKED_ASSEMBLY_TYPE_UNKNOWN | 0 |
alternate_pseudohaplotype_of_diploid | 1 | SEQUI-5245
principal_pseudohaplotype_of_diploid | 2 |
maternal_haplotype_of_diploid | 3 |
paternal_haplotype_of_diploid | 4 |
haplotype_1 | 6 |
haplotype_2 | 7 |
haplotype_3 | 8 |
haplotype_4 | 9 |
haploid | 10 | Catch all for any value that is not explicitly listed above

Scalar Value Types

Protocol buffers type | Notes | C++ | Python | Java | Go
double | | double | float | double | float64
float | | float | float | float | float32
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64
sfixed32 | Always four bytes. | int32 | int | int | int32
sfixed64 | Always eight bytes. | int64 | int/long | long | int64
bool | | bool | boolean | boolean | bool
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte
Generated November 25, 2024