Gene product report
Gene record identifiers, genomic locations, transcripts, and products
The downloaded gene package contains a gene product report in JSON Lines format in the file:
ncbi_dataset/data/product_report.jsonl
Each line of the gene product report file is a hierarchical JSON object that represents a single gene record. The schema of the gene record is defined in the tables below, where each row describes either a single field in the report or a sub-structure, which is a collection of fields.
The outermost structure of the report is ProductDescriptor.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
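
Because each line of the file is an independent JSON object, the report can be streamed one record at a time instead of being loaded whole. The following is a minimal Python sketch, assuming the package has been unzipped into the current directory; the helper name `iter_product_reports` is illustrative and not part of any NCBI tool.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_product_reports(path: str) -> Iterator[dict]:
    """Yield one gene record (a dict) per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the text above; adjust if the package was unzipped elsewhere.
    report_path = "ncbi_dataset/data/product_report.jsonl"
    if Path(report_path).exists():
        for record in iter_product_reports(report_path):
            print(record["geneId"], record["symbol"], record.get("transcriptCount", 0))
```

Reading lazily keeps memory use flat even when the package covers many genes, since only one record is held at a time.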
Sample report
{
  "commonName": "human",
  "description": "alpha-1-B glycoprotein",
  "geneId": "1",
  "proteinCount": 1,
  "symbol": "A1BG",
  "taxId": "9606",
  "taxname": "Homo sapiens",
  "transcriptCount": 1,
  "transcriptTypeCounts": [
    {
      "count": 1,
      "type": "PROTEIN_CODING"
    }
  ],
  "transcripts": [
    {
      "accessionVersion": "NM_130786.4",
      "cds": {
        "accessionVersion": "NM_130786.4",
        "range": [
          {
            "begin": "56",
            "end": "1543"
          }
        ]
      },
      "ensemblTranscript": "ENST00000263100.8",
      "genomicLocations": [
        {
          "exons": [
            {
              "begin": "58353404",
              "end": "58353492",
              "order": 1
            },
            {
              "begin": "58353292",
              "end": "58353327",
              "order": 2
            },
            {
              "begin": "58352928",
              "end": "58353197",
              "order": 3
            },
            {
              "begin": "58352283",
              "end": "58352555",
              "order": 4
            },
            {
              "begin": "58351391",
              "end": "58351687",
              "order": 5
            },
            {
              "begin": "58350370",
              "end": "58350651",
              "order": 6
            },
            {
              "begin": "58347353",
              "end": "58347640",
              "order": 7
            },
            {
              "begin": "58345183",
              "end": "58347029",
              "order": 8
            }
          ],
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "Chromosome 19 Reference GRCh38.p14 Primary Assembly"
        },
        {
          "exons": [
            {
              "begin": "61449819",
              "end": "61449907",
              "order": 1
            },
            {
              "begin": "61449707",
              "end": "61449742",
              "order": 2
            },
            {
              "begin": "61449343",
              "end": "61449612",
              "order": 3
            },
            {
              "begin": "61448698",
              "end": "61448970",
              "order": 4
            },
            {
              "begin": "61447805",
              "end": "61448101",
              "order": 5
            },
            {
              "begin": "61446784",
              "end": "61447065",
              "order": 6
            },
            {
              "begin": "61443768",
              "end": "61444055",
              "order": 7
            },
            {
              "begin": "61441599",
              "end": "61443445",
              "order": 8
            }
          ],
          "genomicAccessionVersion": "NC_060943.1",
          "genomicRange": {
            "begin": "61441599",
            "end": "61449907",
            "orientation": "minus"
          },
          "sequenceName": "Chromosome 19 Alternate T2T-CHM13v2.0"
        }
      ],
      "length": 3382,
      "protein": {
        "accessionVersion": "NP_570602.2",
        "ensemblProtein": "ENSP00000263100.2",
        "length": 495,
        "name": "alpha-1B-glycoprotein precursor"
      },
      "type": "PROTEIN_CODING"
    }
  ],
  "type": "PROTEIN_CODING"
}
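
The nesting in the sample can be walked with ordinary dictionary access. The sketch below parses a trimmed copy of the sample record (only fields that appear in the sample are used) and pairs each transcript with its protein accession. `transcript_protein_pairs` is a hypothetical helper name, and the use of `.get` assumes that some transcripts, such as non-coding ones, may omit `protein`.

```python
import json
from typing import List, Optional, Tuple

# Trimmed copy of the sample record as it would appear on one line of the file.
SAMPLE_LINE = (
    '{"geneId": "1", "symbol": "A1BG", "transcripts": ['
    '{"accessionVersion": "NM_130786.4", "length": 3382, "type": "PROTEIN_CODING", '
    '"protein": {"accessionVersion": "NP_570602.2", "length": 495, '
    '"name": "alpha-1B-glycoprotein precursor"}}]}'
)


def transcript_protein_pairs(record: dict) -> List[Tuple[str, Optional[str]]]:
    """Return (transcript accession, protein accession or None) for each transcript."""
    pairs = []
    for transcript in record.get("transcripts", []):
        protein = transcript.get("protein")  # assumption: may be absent
        pairs.append(
            (transcript["accessionVersion"],
             protein["accessionVersion"] if protein else None)
        )
    return pairs


record = json.loads(SAMPLE_LINE)
print(transcript_protein_pairs(record))  # [('NM_130786.4', 'NP_570602.2')]
```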
ProductDescriptor Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
symbol | symbol | Symbol | string | gene symbol | GNAS |
description | description | Description | string | gene name | GNAS complex locus |
taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
commonName | common-name | Common Name | string | Common name of the organism | human |
type | gene-type | Gene Type | GeneType | ||
rnaType | rna-type | RNA Type | RnaType | ||
transcripts repeated | transcript- | Transcript | Transcript | RefSeq coding and non-coding transcript accessions | |
transcriptCount | transcript-count | Transcript Count | uint32 | ||
proteinCount | protein-count | Protein Count | uint32 | ||
transcriptTypeCounts repeated | | | TranscriptTypeCount | | |
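
In the sample report, the `uint64` fields `geneId` and `taxId` appear as JSON strings ("1", "9606"), while the `uint32` counts appear as JSON numbers. This is consistent with the standard protobuf JSON mapping, which writes 64-bit integers as strings. The sketch below converts those fields back to integers; the helper name and the tuple of fields to convert are illustrative choices, not part of the schema.

```python
# Top-level fields typed uint64 in the ProductDescriptor table.
UINT64_FIELDS = ("geneId", "taxId")


def with_integer_ids(record: dict) -> dict:
    """Return a shallow copy of the record with uint64 fields converted to int."""
    converted = dict(record)
    for field in UINT64_FIELDS:
        if field in converted:
            converted[field] = int(converted[field])
    return converted


sample = {"geneId": "1", "taxId": "9606", "symbol": "A1BG", "proteinCount": 1}
print(with_integer_ids(sample))
```

Converting early avoids surprises such as `"9606" == 9606` evaluating to false when filtering records by taxonomy ID.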
GenomicLocation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
genomicAccessionVersion | accession | Accession | string | ||
sequenceName | seq-name | Seq Name | string | ||
genomicRange | range- | | Range | | |
exons repeated | exon- | Exons | Range |
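
A transcript can carry more than one GenomicLocation, as in the sample report, where NM_130786.4 is placed on both NC_000019.10 (GRCh38.p14) and NC_060943.1 (T2T-CHM13v2.0). The sketch below picks one placement by accession and returns its exons sorted by `order`. `exons_on` is a hypothetical helper, and the data is abbreviated from the sample, with the exon list deliberately shuffled to show the sort.

```python
from typing import List, Tuple

# Abbreviated from the sample report: only the first two exons of the
# GRCh38 placement and the first exon of the T2T placement are kept.
GENOMIC_LOCATIONS = [
    {
        "genomicAccessionVersion": "NC_000019.10",
        "genomicRange": {"begin": "58345183", "end": "58353492", "orientation": "minus"},
        "exons": [
            {"begin": "58353292", "end": "58353327", "order": 2},
            {"begin": "58353404", "end": "58353492", "order": 1},
        ],
    },
    {
        "genomicAccessionVersion": "NC_060943.1",
        "genomicRange": {"begin": "61441599", "end": "61449907", "orientation": "minus"},
        "exons": [{"begin": "61449819", "end": "61449907", "order": 1}],
    },
]


def exons_on(locations: List[dict], accession: str) -> List[Tuple[int, int]]:
    """Return (begin, end) pairs for one placement, sorted by the 'order' field."""
    for location in locations:
        if location["genomicAccessionVersion"] == accession:
            ordered = sorted(location["exons"], key=lambda exon: exon["order"])
            return [(int(exon["begin"]), int(exon["end"])) for exon in ordered]
    return []  # the transcript has no placement on this accession


print(exons_on(GENOMIC_LOCATIONS, "NC_000019.10"))
```

Note that on a minus-strand placement, as here, exon 1 has the highest coordinates, so sorting by `order` and sorting by `begin` give opposite results.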
MaturePeptide Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Accession | string | ||
name | name | Name | string | ||
length | length | Length | uint32 |
Protein Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Accession | string | RefSeq protein accession with version | NP_001296812.1 |
name | name | Name | string | Protein name | protein ALEX |
length | length | Length | uint32 | Protein length in amino acids | 626 |
isoformName | isoform | Isoform | string | Protein isoform name | isoform Alex |
ensemblProtein | ensembl-protein | Ensembl Protein | string | Ensembl protein accession with version | ENSP00000302237.3 |
maturePeptides repeated | mat-peptide- | Mature Peptide | MaturePeptide |
Range Structure
A 1-based range on a sequence record.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
begin | start | Start | uint64 | ||
end | stop | Stop | uint64 | ||
orientation | orientation | Orientation | Orientation | ||
order | order | Order | uint32 | ||
ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. |
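
Because coordinates are 1-based, and assuming both `begin` and `end` are inclusive, a range spans `end - begin + 1` bases. The sample report is consistent with that reading: the eight exons of NM_130786.4 on NC_000019.10 add up to the reported transcript length of 3382. The sketch below checks this; `range_length` is an illustrative helper name.

```python
def range_length(rng: dict) -> int:
    """Length of a 1-based range, assuming begin and end are both inclusive."""
    return int(rng["end"]) - int(rng["begin"]) + 1


# Exons of NM_130786.4 on NC_000019.10, copied from the sample report.
EXONS = [
    {"begin": "58353404", "end": "58353492", "order": 1},
    {"begin": "58353292", "end": "58353327", "order": 2},
    {"begin": "58352928", "end": "58353197", "order": 3},
    {"begin": "58352283", "end": "58352555", "order": 4},
    {"begin": "58351391", "end": "58351687", "order": 5},
    {"begin": "58350370", "end": "58350651", "order": 6},
    {"begin": "58347353", "end": "58347640", "order": 7},
    {"begin": "58345183", "end": "58347029", "order": 8},
]

total = sum(range_length(exon) for exon in EXONS)
print(total)  # 3382, the "length" reported for the transcript
```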
SeqRangeSet Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
range repeated | range- | | Range | Series of intervals on the above accessionVersion | |
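
For a CDS, the ranges are positions on the transcript named by `accessionVersion`. In the sample report the CDS of NM_130786.4 spans 56 to 1543, which is 1488 nucleotides or 496 codons: the 495 amino acids reported as the protein length plus a stop codon. The sketch below reproduces that arithmetic, assuming inclusive 1-based ranges and a CDS that includes the stop codon; `cds_length` is a hypothetical helper.

```python
def cds_length(cds: dict) -> int:
    """Total nucleotides covered by all ranges of a SeqRangeSet."""
    return sum(int(r["end"]) - int(r["begin"]) + 1 for r in cds["range"])


# The "cds" object from the sample report.
SAMPLE_CDS = {
    "accessionVersion": "NM_130786.4",
    "range": [{"begin": "56", "end": "1543"}],
}

nucleotides = cds_length(SAMPLE_CDS)
codons, remainder = divmod(nucleotides, 3)
print(nucleotides, codons, remainder)  # 1488 496 0
print(codons - 1)  # 495, matching the protein length in the sample
```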
Transcript Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Accession | string | RefSeq transcript accession with version | |
name | name | Transcript Name | string | RefSeq transcript name | transcript variant 12 |
length | length | Transcript Length | uint32 | RefSeq transcript length in nucleotides | 3180 |
cds | cds- | CDS | SeqRangeSet | ||
genomicLocations repeated | genomic-location- | Genomic | GenomicLocation | ||
ensemblTranscript | ensembl-transcript | Ensembl Transcript | string | Ensembl transcript accession with version | ENST00000306120.3 |
protein | protein- | Protein | Protein | ||
type | transcript-type | Type | Transcript.TranscriptType |
TranscriptTypeCount Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
type | | | Transcript.TranscriptType | | |
count | coming soon | coming soon | uint32 |
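
In the sample report, `transcriptTypeCounts` tallies the entries of `transcripts` by `type`, and the tallies add up to `transcriptCount`. The sketch below recomputes the tally from a record, assuming that relationship holds in general (the schema itself does not state it); `tally_transcript_types` is an illustrative name.

```python
from collections import Counter
from typing import List


def tally_transcript_types(record: dict) -> List[dict]:
    """Count transcripts by their 'type', in the shape of transcriptTypeCounts."""
    counts = Counter(t["type"] for t in record.get("transcripts", []))
    return [{"count": n, "type": name} for name, n in sorted(counts.items())]


# Abbreviated from the sample report.
RECORD = {
    "transcriptCount": 1,
    "transcriptTypeCounts": [{"count": 1, "type": "PROTEIN_CODING"}],
    "transcripts": [{"accessionVersion": "NM_130786.4", "type": "PROTEIN_CODING"}],
}

recomputed = tally_transcript_types(RECORD)
print(recomputed == RECORD["transcriptTypeCounts"])
print(sum(item["count"] for item in recomputed) == RECORD["transcriptCount"])
```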
GeneType Enumeration
NB: GeneType values match Entrez Gene
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
tRNA | 1 | |
rRNA | 2 | |
snRNA | 3 | |
scRNA | 4 | |
snoRNA | 5 | |
PROTEIN_CODING | 6 | |
PSEUDO | 7 | these will have NG or NR |
TRANSPOSON | 8 | |
miscRNA | 9 | |
ncRNA | 10 | |
BIOLOGICAL_REGION | 11 | these will have NG |
OTHER | 255 |
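
The JSON report carries enumeration values by name (the sample has `"type": "PROTEIN_CODING"`). If the numeric values are needed, the table can be mirrored in code. The sketch below uses Python's `enum.IntEnum`; the class is written by hand from the table above and is not generated from NCBI's schema.

```python
from enum import IntEnum


class GeneType(IntEnum):
    """Hand-written mirror of the GeneType enumeration table."""
    UNKNOWN = 0
    tRNA = 1
    rRNA = 2
    snRNA = 3
    scRNA = 4
    snoRNA = 5
    PROTEIN_CODING = 6
    PSEUDO = 7
    TRANSPOSON = 8
    miscRNA = 9
    ncRNA = 10
    BIOLOGICAL_REGION = 11
    OTHER = 255


# Look up by the name found in the JSON report, or by number.
print(GeneType["PROTEIN_CODING"].value)  # 6
print(GeneType(7).name)  # PSEUDO
```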
Orientation Enumeration
Name | Number | Description |
---|---|---|
none | 0 | |
plus | 1 | |
minus | 2 |
RnaType Enumeration
Name | Number | Description |
---|---|---|
rna_UNKNOWN | 0 | |
premsg | 1 | |
tmRna | 2 |
Transcript.TranscriptType Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
PROTEIN_CODING | 1 | |
NON_CODING | 2 | |
PROTEIN_CODING_MODEL | 3 | |
NON_CODING_MODEL | 4 |
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
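
The size notes in this table follow from how protocol buffers encode integers in the binary wire format: a varint stores 7 bits of payload per byte, so values of 2^28 and above need five bytes (hence the `fixed32` note), and a negative `int32` is sign-extended to 64 bits and takes ten bytes unless it is ZigZag-encoded as `sint32`. These details concern the binary encoding only, not the JSON report. The sketch below works through the arithmetic; all function names are hypothetical.

```python
def varint_size(value: int) -> int:
    """Bytes needed to varint-encode a non-negative integer (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("encode negative numbers via int32_size or sint32_size")
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def int32_size(value: int) -> int:
    """int32: negative values are sign-extended to 64 bits before encoding."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def zigzag32(value: int) -> int:
    """ZigZag mapping used by sint32: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def sint32_size(value: int) -> int:
    return varint_size(zigzag32(value))


print(varint_size(2**28 - 1), varint_size(2**28))  # 4 5 (fixed32 is always 4)
print(varint_size(2**56 - 1), varint_size(2**56))  # 8 9 (fixed64 is always 8)
print(int32_size(-1), sint32_size(-1))             # 10 1
```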