Genome sequence report
Genome assembly sequence accessions, chromosome, and length
The downloaded genome package contains a genome sequence data report in JSON Lines format in the file:
ncbi_dataset/data/<assembly>/sequence_report.jsonl
Each line of the genome assembly sequence data report file is a JSON object that represents a single genome assembly sequence record. The schema of the genome assembly sequence record is defined in the table below, where each row in SequenceInfo describes a single field in the report.
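Because the report is JSON Lines, it can be read one record at a time instead of being parsed as a single JSON document. The following is a minimal sketch, assuming the package has already been unzipped. The function name `read_sequence_report` is hypothetical and is not part of any NCBI tool.

```python
import json
from pathlib import Path


def read_sequence_report(path):
    """Yield one dict per non-empty line of a JSON Lines sequence report."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, such as a trailing newline
                yield json.loads(line)
```

Calling `list(read_sequence_report(...))` with the path to a `sequence_report.jsonl` file would give one dictionary per sequence record. Iterating lazily avoids holding a large report in memory.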
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
Sample report
{
"assemblyAccession": "GCF_000001405.40",
"assemblyUnit": "Primary Assembly",
"assignedMoleculeLocationType": "Chromosome",
"chrName": "1",
"gcCount": "103674491",
"genbankAccession": "CM000663.2",
"length": 248956422,
"refseqAccession": "NC_000001.11",
"role": "assembled-molecule",
"ucscStyleName": "chr1"
}
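In the sample record, gcCount is a quoted string while length is a bare number. This matches the protocol buffers JSON mapping, which writes 64-bit integers as strings. The sketch below converts both before doing arithmetic. The helper name `gc_fraction` and the trimmed record are illustrative assumptions, not part of the report format.

```python
import json

# Trimmed copy of the sample record; only the fields used here are kept.
sample_line = '{"chrName": "1", "gcCount": "103674491", "length": 248956422}'


def gc_fraction(record):
    """Return gcCount / length, or None when a field is absent or length is 0."""
    if "gcCount" not in record or "length" not in record:
        return None
    # gcCount arrives as a JSON string; length arrives as a JSON number.
    gc_count = int(record["gcCount"])
    length = int(record["length"])
    if length == 0:
        return None
    return gc_count / length


record = json.loads(sample_line)
fraction = gc_fraction(record)
```

Returning None for missing fields is a design choice for this sketch, since not every record is guaranteed to carry every field.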
SequenceInfo Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
assemblyAccession | accession | Assembly Accession | string | The GenColl assembly accession | GCF_000001405.40 |
chrName | chr-name | Chromosome name | string | The name of the associated chromosome. The name “Un” indicates that the chromosome is unknown. | 21 MT Un |
ucscStyleName | ucsc-style-name | UCSC style name | string | Name ascribed to this sequence by the UC Santa Cruz genome browser | chr21 chrM Un |
sortOrder | ordering | Ordering | uint32 | A sort order value assigned to the sequence | 1 25 |
assignedMoleculeLocationType | mol-type | Molecule type | string | The type of molecule represented by the sequence | Chromosome Mitochondrion |
refseqAccession | refseq-seq-acc | RefSeq seq accession | string | The RefSeq accession of the sequence | NC_000021.9 |
assemblyUnit | assm-unit-name | Assembly-unit name | string | The name of the assembly unit | Primary Assembly |
length | seq-length | Seq length | uint32 | The length of the sequence in nucleotides | 46709983 |
genbankAccession | genbank-seq-acc | GenBank seq accession | string | The GenBank accession of the sequence | CM000683.2 |
gcCount | gc-count | GC Count | uint64 | The number of GC base-pairs in the chromosome | |
role | role | Role | string | | assembled-molecule |
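To illustrate how the Table Field Mnemonic column relates to the JSON field names, the sketch below selects fields by mnemonic and renders them as tab-separated text. It only illustrates the mapping in the table and does not show how dataformat is implemented. The function name `records_to_tsv` and the use of mnemonics as header labels are assumptions.

```python
import csv
import io

# Table Field Mnemonic -> JSON field name, copied from the SequenceInfo table.
MNEMONIC_TO_FIELD = {
    "accession": "assemblyAccession",
    "chr-name": "chrName",
    "ucsc-style-name": "ucscStyleName",
    "ordering": "sortOrder",
    "mol-type": "assignedMoleculeLocationType",
    "refseq-seq-acc": "refseqAccession",
    "assm-unit-name": "assemblyUnit",
    "seq-length": "length",
    "genbank-seq-acc": "genbankAccession",
    "gc-count": "gcCount",
    "role": "role",
}


def records_to_tsv(records, mnemonics):
    """Render the chosen fields as TSV text, one row per record."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(mnemonics)  # header labels are the mnemonics (an assumption)
    for record in records:
        # Fields missing from a record are written as empty cells.
        writer.writerow([record.get(MNEMONIC_TO_FIELD[m], "") for m in mnemonics])
    return out.getvalue()
```

For example, `records_to_tsv(records, ["accession", "chr-name", "seq-length"])` would produce a three-column table from a list of parsed records.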
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
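The notes on fixed32 and fixed64 follow from how variable-length encoding works: each encoded byte carries 7 bits of the value, so a value of 2^28 or more needs five bytes as a varint, against four as a fixed32. The sketch below counts varint bytes for non-negative integers. The function name `varint_size` is hypothetical, and the sketch does not cover signed or ZigZag-encoded types.

```python
def varint_size(value):
    """Number of bytes a non-negative integer occupies as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative values")
    # Each byte carries 7 payload bits; zero still needs one byte.
    return max(1, (value.bit_length() + 6) // 7)
```

Under this count, the chromosome 1 length from the sample report (248956422) is below 2^28 and therefore fits in four varint bytes.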