Gene report
Gene record identifiers, genomic locations, transcripts, and products
The downloaded gene package contains a gene data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the gene data report file is a hierarchical JSON object that represents a single gene record.
The schema of the gene record is defined in the tables below, where each row describes either a single field in the report or a sub-structure (a collection of fields).
The outermost structure of the report is GeneDescriptor.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
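Because every line of the report is a standalone JSON object, the file can be read one record at a time. The following is a minimal sketch using only the Python standard library. The helper name `iter_gene_records` is hypothetical, and the path assumes the package was unzipped into the current directory.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_gene_records(path: str) -> Iterator[dict]:
    """Yield one GeneDescriptor dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location: the package unzipped into the current directory.
    report = Path("ncbi_dataset/data/data_report.jsonl")
    if report.exists():
        for record in iter_gene_records(str(report)):
            print(record.get("geneId"), record.get("symbol"))
```

Reading line by line keeps memory use proportional to a single record, which helps when a package contains many genes.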
Sample report
{
"annotations": [
{
"assembliesInScope": [
{
"accession": "GCF_000001405.40",
"name": "GRCh38.p14"
}
],
"releaseDate": "2024-08-23",
"releaseName": "GCF_000001405.40-RS_2024_08"
},
{
"assembliesInScope": [
{
"accession": "GCF_009914755.1",
"name": "T2T-CHM13v2.0"
}
],
"releaseDate": "2024-08-23",
"releaseName": "GCF_009914755.1-RS_2024_08"
}
],
"chromosomes": [
"19"
],
"commonName": "human",
"description": "alpha-1-B glycoprotein",
"ensemblGeneIds": [
"ENSG00000121410"
],
"geneGroups": [
{
"id": "1",
"method": "NCBI Ortholog"
}
],
"geneId": "1",
"genomicRanges": [
{
"accessionVersion": "NC_000019.10",
"range": [
{
"begin": "58345183",
"end": "58353492",
"orientation": "minus"
}
]
},
{
"accessionVersion": "NC_060943.1",
"range": [
{
"begin": "61441599",
"end": "61449907",
"orientation": "minus"
}
]
}
],
"nomenclatureAuthority": {
"authority": "HGNC",
"identifier": "HGNC:5"
},
"omimIds": [
"138670"
],
"orientation": "minus",
"swissProtAccessions": [
"P04217"
],
"symbol": "A1BG",
"synonyms": [
"A1B",
"ABG",
"GAB",
"HYST2477"
],
"taxId": "9606",
"taxname": "Homo sapiens",
"transcripts": [
{
"accessionVersion": "NM_130786.4",
"cds": {
"accessionVersion": "NM_130786.4",
"range": [
{
"begin": "56",
"end": "1543"
}
]
},
"ensemblTranscript": "ENST00000263100.8",
"exons": {
"accessionVersion": "NC_000019.10",
"range": [
{
"begin": "58353404",
"end": "58353492",
"order": 1
},
{
"begin": "58353292",
"end": "58353327",
"order": 2
},
{
"begin": "58352928",
"end": "58353197",
"order": 3
},
{
"begin": "58352283",
"end": "58352555",
"order": 4
},
{
"begin": "58351391",
"end": "58351687",
"order": 5
},
{
"begin": "58350370",
"end": "58350651",
"order": 6
},
{
"begin": "58347353",
"end": "58347640",
"order": 7
},
{
"begin": "58345183",
"end": "58347029",
"order": 8
}
]
},
"genomicLocations": [
{
"exons": [
{
"begin": "58353404",
"end": "58353492",
"order": 1
},
{
"begin": "58353292",
"end": "58353327",
"order": 2
},
{
"begin": "58352928",
"end": "58353197",
"order": 3
},
{
"begin": "58352283",
"end": "58352555",
"order": 4
},
{
"begin": "58351391",
"end": "58351687",
"order": 5
},
{
"begin": "58350370",
"end": "58350651",
"order": 6
},
{
"begin": "58347353",
"end": "58347640",
"order": 7
},
{
"begin": "58345183",
"end": "58347029",
"order": 8
}
],
"genomicAccessionVersion": "NC_000019.10",
"genomicRange": {
"begin": "58345183",
"end": "58353492",
"orientation": "minus"
},
"sequenceName": "Chromosome 19 Reference GRCh38.p14 Primary Assembly"
},
{
"exons": [
{
"begin": "61449819",
"end": "61449907",
"order": 1
},
{
"begin": "61449707",
"end": "61449742",
"order": 2
},
{
"begin": "61449343",
"end": "61449612",
"order": 3
},
{
"begin": "61448698",
"end": "61448970",
"order": 4
},
{
"begin": "61447805",
"end": "61448101",
"order": 5
},
{
"begin": "61446784",
"end": "61447065",
"order": 6
},
{
"begin": "61443768",
"end": "61444055",
"order": 7
},
{
"begin": "61441599",
"end": "61443445",
"order": 8
}
],
"genomicAccessionVersion": "NC_060943.1",
"genomicRange": {
"begin": "61441599",
"end": "61449907",
"orientation": "minus"
},
"sequenceName": "Chromosome 19 Alternate T2T-CHM13v2.0"
}
],
"genomicRange": {
"accessionVersion": "NC_000019.10",
"range": [
{
"begin": "58345183",
"end": "58353492",
"orientation": "minus"
}
]
},
"length": 3382,
"protein": {
"accessionVersion": "NP_570602.2",
"ensemblProtein": "ENSP00000263100.2",
"length": 495,
"name": "alpha-1B-glycoprotein precursor"
},
"type": "PROTEIN_CODING"
}
],
"type": "PROTEIN_CODING"
}
GeneDescriptor Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
symbol | symbol | Symbol | string | gene symbol | GNAS |
description | description | Description | string | gene name | GNAS complex locus |
taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
commonName | common-name | Common Name | string | Common name of the organism | human |
type | gene-type | Gene Type | GeneDescriptor.GeneType | ||
rnaType | rna-type | RNA Type | GeneDescriptor.RnaType | ||
orientation | orientation | Orientation | Orientation | ||
genomicRanges repeated | genomic-range- | Genomic Range | SeqRangeSet | ||
referenceStandards repeated | ref-standard- | Reference Standard | GenomicRegion | Clinical reference standard (NG_ accession) | |
genomicRegions repeated | genomic-region- | Genomic Region | GenomicRegion | Pseudogene, non-genic regulatory element, or other genomic region (NG_ accession) | |
transcripts repeated | transcript- | Transcript | Transcript | RefSeq coding and non-coding transcript accessions | |
proteins repeated | protein- | Protein | Protein | Only for proteins directly annotated on the Gene, without any intermediary transcript | |
chromosomes repeated | chromosomes | Chromosomes | string | | 1 X,Y |
nomenclatureAuthority | name- | Nomenclature | NomenclatureAuthority | ||
swissProtAccessions repeated | swissprot-accessions | SwissProt Accessions | string | ||
ensemblGeneIds repeated | ensembl-geneids | Ensembl GeneIDs | string | ||
omimIds repeated | omim-ids | OMIM IDs | string | ||
synonyms repeated | synonyms | Synonyms | string | ||
replacedGeneId | replaced-gene-id | Replaced NCBI GeneID | uint64 | The NCBI Gene ID for the gene that was merged into the current gene record | |
annotations repeated | annotation- | Annotation | Annotation |
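To illustrate how the Field and Table Column Name columns relate, the sketch below maps the scalar fields of a parsed record to their column names. It is an illustration only, not how the dataformat tool is implemented. The names `COLUMN_NAMES` and `summary_row` are hypothetical, and the sample values are the scalar fields of the sample report.

```python
# Field -> Table Column Name, copied from the GeneDescriptor table.
COLUMN_NAMES = {
    "geneId": "NCBI GeneID",
    "symbol": "Symbol",
    "description": "Description",
    "taxId": "Taxonomic ID",
    "taxname": "Taxonomic Name",
    "commonName": "Common Name",
    "type": "Gene Type",
    "rnaType": "RNA Type",
    "orientation": "Orientation",
}


def summary_row(record: dict) -> dict:
    """Map the scalar fields present in a gene record to their column names."""
    return {
        column: record[field]
        for field, column in COLUMN_NAMES.items()
        if field in record  # fields may be absent from a record, e.g. rnaType
    }


# Scalar fields taken from the sample report (nested structures omitted).
sample = {
    "geneId": "1",
    "symbol": "A1BG",
    "description": "alpha-1-B glycoprotein",
    "taxId": "9606",
    "taxname": "Homo sapiens",
    "commonName": "human",
    "type": "PROTEIN_CODING",
    "orientation": "minus",
}
row = summary_row(sample)
```

Fields that are missing from a record, such as `rnaType` in the sample, are simply left out of the row.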
AnnotatedAssemblies Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | ||
name | name | Name | string |
Annotation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
releaseName | release-name | Release Name | string | ||
releaseDate | release-date | Release Date | string | ||
assembliesInScope repeated | assemblies-in-scope- | Assemblies in Scope | AnnotatedAssemblies |
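As a small sketch of how the Annotation and AnnotatedAssemblies structures nest, the helper below collects the assembly accessions covered by each annotation release. The function name `assemblies_by_release` is hypothetical, and the data is copied from the sample report.

```python
def assemblies_by_release(record: dict) -> dict:
    """Map each annotation releaseName to the assembly accessions in its scope."""
    return {
        annotation.get("releaseName", ""): [
            assembly["accession"]
            for assembly in annotation.get("assembliesInScope", [])
        ]
        for annotation in record.get("annotations", [])
    }


# The "annotations" array from the sample report.
sample = {
    "annotations": [
        {
            "assembliesInScope": [
                {"accession": "GCF_000001405.40", "name": "GRCh38.p14"}
            ],
            "releaseDate": "2024-08-23",
            "releaseName": "GCF_000001405.40-RS_2024_08",
        },
        {
            "assembliesInScope": [
                {"accession": "GCF_009914755.1", "name": "T2T-CHM13v2.0"}
            ],
            "releaseDate": "2024-08-23",
            "releaseName": "GCF_009914755.1-RS_2024_08",
        },
    ]
}
releases = assemblies_by_release(sample)
```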
GenomicLocation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
genomicAccessionVersion | accession | Accession | string | ||
sequenceName | seq-name | Seq Name | string | ||
genomicRange | range- | Range | Range | ||
exons repeated | exon- | Exons | Range |
GenomicRegion Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneRange | gene-range- | Gene Range | SeqRangeSet | The range of this Gene record on this genomic region. | |
type | genomic-region-type | Genomic Region Type | GenomicRegion.GenomicRegionType |
MaturePeptide Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Accession | string | ||
name | name | Name | string | ||
length | length | Length | uint32 |
NomenclatureAuthority Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
authority | authority | Authority | string | The nomenclature authority for this gene record | HGNC |
identifier | id | ID | string | The nomenclature authority identifier for this gene record | HGNC:4392 |
Protein Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Accession | string | RefSeq protein accession with version | NP_001296812.1 |
name | name | Name | string | Protein name | protein ALEX |
length | length | Length | uint32 | Protein length in amino acids | 626 |
isoformName | isoform | Isoform | string | Protein isoform name | isoform Alex |
ensemblProtein | ensembl-protein | Ensembl Protein | string | Ensembl protein accession with version | ENSP00000302237.3 |
maturePeptides repeated | mat-peptide- | Mature Peptide | MaturePeptide |
Range Structure
A 1-based range on a sequence record.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
begin | start | Start | uint64 | ||
end | stop | Stop | uint64 | ||
orientation | orientation | Orientation | Orientation | ||
order | order | Order | uint32 |
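The sketch below computes the length of a Range. It assumes that `begin` and `end` are both inclusive, which the table does not state. The assumption is consistent with the sample report, where the exon lengths of NM_130786.4 sum to the reported transcript length of 3382. Note that `begin` and `end` appear as JSON strings in the sample, so they are converted with `int()`. The function name `range_length` is hypothetical.

```python
def range_length(rng: dict) -> int:
    """Length of a 1-based range, assuming begin and end are both inclusive."""
    return int(rng["end"]) - int(rng["begin"]) + 1


# Exons of NM_130786.4 on NC_000019.10, copied from the sample report.
exons = [
    {"begin": "58353404", "end": "58353492", "order": 1},
    {"begin": "58353292", "end": "58353327", "order": 2},
    {"begin": "58352928", "end": "58353197", "order": 3},
    {"begin": "58352283", "end": "58352555", "order": 4},
    {"begin": "58351391", "end": "58351687", "order": 5},
    {"begin": "58350370", "end": "58350651", "order": 6},
    {"begin": "58347353", "end": "58347640", "order": 7},
    {"begin": "58345183", "end": "58347029", "order": 8},
]

# Sum of exon lengths; matches the transcript "length" field in the sample.
total = sum(range_length(exon) for exon in exons)
```

The same arithmetic applied to the sample CDS range (56 to 1543) gives 1488 nucleotides, or 496 codons, which agrees with the 495-residue protein plus a stop codon.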
SeqRangeSet Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
range repeated | range- | Range | Range | Series of intervals on the sequence identified by accessionVersion | |
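A record can carry one SeqRangeSet per annotated sequence, as the `genomicRanges` array in the sample report does for NC_000019.10 and NC_060943.1. The sketch below looks up the gene's overall span on one accession. The function name `span_on` is hypothetical, and taking the orientation from the first interval is an assumption made for brevity.

```python
from typing import Optional, Tuple


def span_on(record: dict, accession: str) -> Optional[Tuple[int, int, str]]:
    """Return (begin, end, orientation) covering all intervals on one accession."""
    for range_set in record.get("genomicRanges", []):
        if range_set.get("accessionVersion") != accession:
            continue
        intervals = range_set.get("range", [])
        if not intervals:
            return None
        begin = min(int(interval["begin"]) for interval in intervals)
        end = max(int(interval["end"]) for interval in intervals)
        # Assumption: all intervals in a set share the first one's orientation.
        return begin, end, intervals[0].get("orientation", "none")
    return None


# The "genomicRanges" array from the sample report.
gene = {
    "genomicRanges": [
        {
            "accessionVersion": "NC_000019.10",
            "range": [
                {"begin": "58345183", "end": "58353492", "orientation": "minus"}
            ],
        },
        {
            "accessionVersion": "NC_060943.1",
            "range": [
                {"begin": "61441599", "end": "61449907", "orientation": "minus"}
            ],
        },
    ]
}
t2t_span = span_on(gene, "NC_060943.1")
```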
Transcript Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Accession | string | RefSeq transcript accession with version | |
name | name | Transcript Name | string | RefSeq transcript name | transcript variant 12 |
length | length | Transcript Length | uint32 | RefSeq transcript length in nucleotides | 3180 |
cds | cds- | CDS | SeqRangeSet | ||
genomicLocations repeated | genomic-location- | Genomic | GenomicLocation | ||
ensemblTranscript | ensembl-transcript | Ensembl Transcript | string | Ensembl transcript accession with version | ENST00000306120.3 |
protein | protein- | Protein | Protein | ||
type | transcript-type | Type | Transcript.TranscriptType |
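The following sketch walks the `transcripts` array and pairs each transcript with its protein product. The function name `transcript_products` is hypothetical. Treating a missing `protein` object as "no product" is an assumption about how non-coding transcripts are reported.

```python
def transcript_products(record: dict) -> list:
    """List (transcript accession, type, length, protein accession) tuples."""
    rows = []
    for transcript in record.get("transcripts", []):
        # Assumption: transcripts without a product omit the "protein" object.
        protein = transcript.get("protein", {})
        rows.append(
            (
                transcript.get("accessionVersion"),
                transcript.get("type"),
                transcript.get("length"),
                protein.get("accessionVersion"),
            )
        )
    return rows


# Trimmed copy of the transcript in the sample report.
gene = {
    "transcripts": [
        {
            "accessionVersion": "NM_130786.4",
            "ensemblTranscript": "ENST00000263100.8",
            "length": 3382,
            "protein": {
                "accessionVersion": "NP_570602.2",
                "length": 495,
                "name": "alpha-1B-glycoprotein precursor",
            },
            "type": "PROTEIN_CODING",
        }
    ]
}
products = transcript_products(gene)
```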
GeneDescriptor.GeneType Enumeration
NB: GeneType values match Entrez Gene
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
tRNA | 1 | |
rRNA | 2 | |
snRNA | 3 | |
scRNA | 4 | |
snoRNA | 5 | |
PROTEIN_CODING | 6 | |
PSEUDO | 7 | These genes have NG_ or NR_ accessions |
TRANSPOSON | 8 | |
miscRNA | 9 | |
ncRNA | 10 | |
BIOLOGICAL_REGION | 11 | These genes have NG_ accessions |
OTHER | 255 |
GeneDescriptor.RnaType Enumeration
Name | Number | Description |
---|---|---|
rna_UNKNOWN | 0 | |
premsg | 1 | |
tmRna | 2 |
GenomicRegion.GenomicRegionType Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
REFSEQ_GENE | 1 | |
PSEUDOGENE | 2 | |
BIOLOGICAL_REGION | 3 | |
OTHER | 4 |
Orientation Enumeration
Name | Number | Description |
---|---|---|
none | 0 | |
plus | 1 | |
minus | 2 |
Transcript.TranscriptType Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
PROTEIN_CODING | 1 | |
NON_CODING | 2 | |
PROTEIN_CODING_MODEL | 3 | |
NON_CODING_MODEL | 4 |
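In the sample report, enumeration values appear by name (for example "minus" and "PROTEIN_CODING") rather than by number. The sketch below mirrors the Orientation and Transcript.TranscriptType tables as Python enums so that names can be mapped to their numbers. The helper `parse_enum` and the fallback behaviour are hypothetical choices, not part of the report format.

```python
from enum import IntEnum


class Orientation(IntEnum):
    none = 0
    plus = 1
    minus = 2


class TranscriptType(IntEnum):
    UNKNOWN = 0
    PROTEIN_CODING = 1
    NON_CODING = 2
    PROTEIN_CODING_MODEL = 3
    NON_CODING_MODEL = 4


def parse_enum(enum_cls, name, default):
    """Look up an enum member by its JSON name, falling back to a default."""
    try:
        return enum_cls[name]
    except KeyError:
        return default


# Values as they appear in the sample report.
orientation = parse_enum(Orientation, "minus", Orientation.none)
transcript_type = parse_enum(
    TranscriptType, "PROTEIN_CODING", TranscriptType.UNKNOWN
)
```

Falling back to a default keeps parsing tolerant of names added to the schema later.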
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
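In the sample report, fields typed uint64 (such as geneId, taxId, begin, and end) appear as JSON strings, while uint32 fields (such as length and order) appear as JSON numbers. This matches the protocol buffers JSON mapping, which encodes 64-bit integers as strings. The sketch below converts the string-encoded fields to Python integers. The names `UINT64_FIELDS` and `coerce_uint64` are hypothetical, and the field list covers only the uint64 fields named in the tables.

```python
# Field names typed uint64 in the structure tables.
UINT64_FIELDS = {"geneId", "taxId", "replacedGeneId", "begin", "end"}


def coerce_uint64(value):
    """Recursively convert string-encoded uint64 fields to Python ints."""
    if isinstance(value, dict):
        converted = {}
        for key, item in value.items():
            if key in UINT64_FIELDS and isinstance(item, str):
                converted[key] = int(item)
            else:
                converted[key] = coerce_uint64(item)
        return converted
    if isinstance(value, list):
        return [coerce_uint64(item) for item in value]
    return value  # numbers, strings, booleans, and None pass through


# Fragment of the sample report.
fragment = {
    "geneId": "1",
    "taxId": "9606",
    "symbol": "A1BG",
    "transcripts": [
        {"length": 3382, "cds": {"range": [{"begin": "56", "end": "1543"}]}}
    ],
}
typed = coerce_uint64(fragment)
```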