Genome sequence report

Genome assembly sequence accessions, chromosome, and length

The downloaded genome package contains a genome sequence data report in JSON Lines format in the file:

ncbi_dataset/data/<assembly>/sequence_report.jsonl

Each line of the genome assembly sequence data report file is a JSON object that represents a single genome assembly sequence record. The schema of this record is defined in the SequenceInfo table below, where each row describes a single field in the report.
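Because the report is JSON Lines, it can be read one record at a time rather than parsed as a single JSON document. The following is a minimal sketch using only the Python standard library; the function name read_sequence_report is hypothetical and not part of any NCBI tool.

```python
import json
from pathlib import Path


def read_sequence_report(path):
    """Yield one dict per non-empty line of a JSON Lines file.

    Hypothetical helper for illustration; it assumes each non-empty
    line holds exactly one JSON object, as described above.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Called with the path to sequence_report.jsonl inside the downloaded package, the generator yields one dictionary per sequence record, so a large report does not need to be held in memory all at once.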

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform assembly sequence data reports from JSON Lines to tabular formats.

Sample report

{
  "assemblyAccession": "GCF_000001405.40",
  "assemblyUnit": "Primary Assembly",
  "assignedMoleculeLocationType": "Chromosome",
  "chrName": "1",
  "gcCount": "103674491",
  "genbankAccession": "CM000663.2",
  "length": 248956422,
  "refseqAccession": "NC_000001.11",
  "role": "assembled-molecule",
  "ucscStyleName": "chr1"
}
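In the sample record, gcCount is a quoted string while length is a bare number, so gcCount needs converting before any arithmetic. The sketch below computes the GC fraction relative to the full sequence length from a trimmed copy of the sample; the function name gc_fraction is hypothetical.

```python
import json

# Three fields copied from the sample report above.
SAMPLE = '{"chrName": "1", "gcCount": "103674491", "length": 248956422}'


def gc_fraction(record):
    """Return gcCount divided by length for one sequence record."""
    gc = int(record["gcCount"])  # quoted in the sample, so convert first
    return gc / int(record["length"])


record = json.loads(SAMPLE)
print(f"chr{record['chrName']}: {gc_fraction(record):.4f}")  # prints "chr1: 0.4164"
```

Wrapping both values in int() keeps the helper working whether a given field arrives as a number or as a decimal string.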

SequenceInfo Structure

Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples
--- | --- | --- | --- | --- | ---
assemblyAccession | accession | Assembly Accession | string | The GenColl assembly accession | GCF_000001405.40
chrName | chr-name | Chromosome name | string | The name of the associated chromosome. The name “Un” indicates that the chromosome is unknown. | 21, MT, Un
ucscStyleName | ucsc-style-name | UCSC style name | string | Name ascribed to this sequence by the UC Santa Cruz genome browser | chr21, chrM, Un
sortOrder | ordering | Ordering | uint32 | A sort order value assigned to the sequence | 1, 25
assignedMoleculeLocationType | mol-type | Molecule type | string | The type of molecule represented by the sequence | Chromosome, Mitochondrion
refseqAccession | refseq-seq-acc | RefSeq seq accession | string | The RefSeq accession of the sequence | NC_000021.9
assemblyUnit | assm-unit-name | Assembly-unit name | string | The name of the assembly unit | Primary Assembly
length | seq-length | Seq length | uint32 | The length of the sequence in nucleotides | 46709983
genbankAccession | genbank-seq-acc | GenBank seq accession | string | The GenBank accession of the sequence | CM000683.2
gcCount | gc-count | GC Count | uint64 | The number of GC base-pairs in the chromosome |
role | role | Role | string | |
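The SequenceInfo fields can be mirrored in a small typed record when working with the report in code. This is a sketch only: the class, its snake_case attribute names, and the from_json helper are hypothetical. It treats every field as optional because the sample report omits sortOrder, and it converts the three unsigned integer fields with int() because gcCount appears as a quoted string in the sample.

```python
from dataclasses import dataclass
from typing import Optional


def _to_int(value):
    """Convert a number or decimal string to int; pass None through."""
    return None if value is None else int(value)


@dataclass
class SequenceInfo:
    # Attribute names are illustrative; JSON keys follow the table above.
    assembly_accession: Optional[str] = None
    chr_name: Optional[str] = None
    ucsc_style_name: Optional[str] = None
    sort_order: Optional[int] = None
    assigned_molecule_location_type: Optional[str] = None
    refseq_accession: Optional[str] = None
    assembly_unit: Optional[str] = None
    length: Optional[int] = None
    genbank_accession: Optional[str] = None
    gc_count: Optional[int] = None
    role: Optional[str] = None

    @classmethod
    def from_json(cls, obj):
        """Build a record from one parsed line of the report."""
        return cls(
            assembly_accession=obj.get("assemblyAccession"),
            chr_name=obj.get("chrName"),
            ucsc_style_name=obj.get("ucscStyleName"),
            sort_order=_to_int(obj.get("sortOrder")),
            assigned_molecule_location_type=obj.get("assignedMoleculeLocationType"),
            refseq_accession=obj.get("refseqAccession"),
            assembly_unit=obj.get("assemblyUnit"),
            length=_to_int(obj.get("length")),
            genbank_accession=obj.get("genbankAccession"),
            gc_count=_to_int(obj.get("gcCount")),
            role=obj.get("role"),
        )
```

Python integers are arbitrary precision, so uint32 and uint64 values both fit in int without loss.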

Scalar Value Types

Protocol buffers type | Notes | C++ | Python | Java | Go
--- | --- | --- | --- | --- | ---
double | | double | float | double | float64
float | | float | float | float | float32
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64
sfixed32 | Always four bytes. | int32 | int | int | int32
sfixed64 | Always eight bytes. | int64 | int/long | long | int64
bool | | bool | boolean | boolean | bool
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte
Generated November 25, 2024