Gene product report

Gene record identifiers, genomic locations, transcripts, and products

The downloaded gene package contains a gene product report in JSON Lines format in the file:

ncbi_dataset/data/product_report.jsonl

Each line of the gene product report file is a hierarchical JSON object that represents a single gene record. The schema of the gene record is defined in the tables below, where each row describes either a single field in the report or a sub-structure (a collection of fields). The outermost structure of the report is ProductDescriptor.
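
Because each line is a standalone JSON object, the report can be streamed one record at a time. The following is a minimal Python sketch using only the standard library. The helper name iter_gene_records and the two-line demo file are illustrative assumptions, not part of the NCBI tooling.

```python
import json
import tempfile
from pathlib import Path


def iter_gene_records(path):
    """Yield one parsed gene record (a dict) per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Demo on a tiny stand-in file; with a real package, pass
# ncbi_dataset/data/product_report.jsonl instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "product_report.jsonl"
    demo_path.write_text(
        '{"geneId": "1", "symbol": "A1BG"}\n'
        '{"geneId": "2778", "symbol": "GNAS"}\n',
        encoding="utf-8",
    )
    records = list(iter_gene_records(demo_path))

for record in records:
    print(record["geneId"], record["symbol"])
```

Reading line by line keeps memory use bounded by the largest single record rather than by the size of the whole file.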

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform gene product reports from JSON Lines to tabular formats.
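
To illustrate how a Table Field Mnemonic relates a JSON field to a table column name, here is a small Python sketch that flattens records into tab-separated text. It is not the dataformat tool and makes no claim about that tool's exact output. The COLUMNS mapping copies three rows from the ProductDescriptor table, and records_to_tsv is a hypothetical helper.

```python
import csv
import io

# mnemonic -> (JSON field, table column name), copied from the
# ProductDescriptor table; only three rows are included here.
COLUMNS = {
    "gene-id": ("geneId", "NCBI GeneID"),
    "symbol": ("symbol", "Symbol"),
    "tax-name": ("taxname", "Taxonomic Name"),
}


def records_to_tsv(records, mnemonics):
    """Render the chosen top-level fields as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row uses the human-readable column names.
    writer.writerow([COLUMNS[m][1] for m in mnemonics])
    for record in records:
        # Missing fields become empty cells.
        writer.writerow([record.get(COLUMNS[m][0], "") for m in mnemonics])
    return buffer.getvalue()


sample = [{"geneId": "1", "symbol": "A1BG", "taxname": "Homo sapiens"}]
tsv_text = records_to_tsv(sample, ["gene-id", "symbol", "tax-name"])
print(tsv_text, end="")
```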

Sample report

{
  "commonName": "human",
  "description": "alpha-1-B glycoprotein",
  "geneId": "1",
  "proteinCount": 1,
  "symbol": "A1BG",
  "taxId": "9606",
  "taxname": "Homo sapiens",
  "transcriptCount": 1,
  "transcriptTypeCounts": [
    {
      "count": 1,
      "type": "PROTEIN_CODING"
    }
  ],
  "transcripts": [
    {
      "accessionVersion": "NM_130786.4",
      "cds": {
        "accessionVersion": "NM_130786.4",
        "range": [
          {
            "begin": "56",
            "end": "1543"
          }
        ]
      },
      "ensemblTranscript": "ENST00000263100.8",
      "genomicLocations": [
        {
          "exons": [
            {
              "begin": "58353404",
              "end": "58353492",
              "order": 1
            },
            {
              "begin": "58353292",
              "end": "58353327",
              "order": 2
            },
            {
              "begin": "58352928",
              "end": "58353197",
              "order": 3
            },
            {
              "begin": "58352283",
              "end": "58352555",
              "order": 4
            },
            {
              "begin": "58351391",
              "end": "58351687",
              "order": 5
            },
            {
              "begin": "58350370",
              "end": "58350651",
              "order": 6
            },
            {
              "begin": "58347353",
              "end": "58347640",
              "order": 7
            },
            {
              "begin": "58345183",
              "end": "58347029",
              "order": 8
            }
          ],
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "Chromosome 19 Reference GRCh38.p14 Primary Assembly"
        },
        {
          "exons": [
            {
              "begin": "61449819",
              "end": "61449907",
              "order": 1
            },
            {
              "begin": "61449707",
              "end": "61449742",
              "order": 2
            },
            {
              "begin": "61449343",
              "end": "61449612",
              "order": 3
            },
            {
              "begin": "61448698",
              "end": "61448970",
              "order": 4
            },
            {
              "begin": "61447805",
              "end": "61448101",
              "order": 5
            },
            {
              "begin": "61446784",
              "end": "61447065",
              "order": 6
            },
            {
              "begin": "61443768",
              "end": "61444055",
              "order": 7
            },
            {
              "begin": "61441599",
              "end": "61443445",
              "order": 8
            }
          ],
          "genomicAccessionVersion": "NC_060943.1",
          "genomicRange": {
            "begin": "61441599",
            "end": "61449907",
            "orientation": "minus"
          },
          "sequenceName": "Chromosome 19 Alternate T2T-CHM13v2.0"
        }
      ],
      "length": 3382,
      "protein": {
        "accessionVersion": "NP_570602.2",
        "ensemblProtein": "ENSP00000263100.2",
        "length": 495,
        "name": "alpha-1B-glycoprotein precursor"
      },
      "type": "PROTEIN_CODING"
    }
  ],
  "type": "PROTEIN_CODING"
}
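
The numbers in this sample are internally consistent, which a short Python check can show. The sketch below copies values from the sample (first genomic location only) and rests on two assumptions inferred from this one record rather than stated in the schema: range ends are inclusive, and the CDS range includes the stop codon.

```python
# Values copied from the sample report above (GRCh38 location only).
transcript = {
    "length": 3382,
    "cds": {"range": [{"begin": "56", "end": "1543"}]},
    "protein": {"length": 495},
    "exons": [
        {"begin": "58353404", "end": "58353492"},
        {"begin": "58353292", "end": "58353327"},
        {"begin": "58352928", "end": "58353197"},
        {"begin": "58352283", "end": "58352555"},
        {"begin": "58351391", "end": "58351687"},
        {"begin": "58350370", "end": "58350651"},
        {"begin": "58347353", "end": "58347640"},
        {"begin": "58345183", "end": "58347029"},
    ],
}


def span(r):
    """Length of a range, assuming 1-based coordinates with an inclusive end."""
    return int(r["end"]) - int(r["begin"]) + 1


exon_total = sum(span(e) for e in transcript["exons"])
cds_total = sum(span(r) for r in transcript["cds"]["range"])
codons_without_stop = cds_total // 3 - 1  # assumes the CDS ends with a stop codon

print(exon_total)           # 3382, equal to the transcript length
print(codons_without_stop)  # 495, equal to the protein length
```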

ProductDescriptor Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
| symbol | symbol | Symbol | string | gene symbol | GNAS |
| description | description | Description | string | gene name | GNAS complex locus |
| taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
| taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
| commonName | common-name | Common Name | string | Common name of the organism | human |
| type | gene-type | Gene Type | GeneType | | |
| rnaType | rna-type | RNA Type | RnaType | | |
| transcripts (repeated) | transcript- | Transcript | Transcript | RefSeq coding and non-coding transcript accessions | |
| transcriptCount | transcript-count | Transcript Count | uint32 | | |
| proteinCount | protein-count | Protein Count | uint32 | | |
| transcriptTypeCounts (repeated) | | | TranscriptTypeCount | | |

GenomicLocation Structure

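| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| genomicAccessionVersion | accession | Accession | string | | |
| sequenceName | seq-name | Seq Name | string | | |
| genomicRange | range- | | Range | | |
| exons (repeated) | exon- | Exons | Range | | |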

MaturePeptide Structure

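| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Accession | string | | |
| name | name | Name | string | | |
| length | length | Length | uint32 | | |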

Protein Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Accession | string | RefSeq protein accession with version | NP_001296812.1 |
| name | name | Name | string | Protein name | protein ALEX |
| length | length | Length | uint32 | Protein length in amino acids | 626 |
| isoformName | isoform | Isoform | string | Protein isoform name | isoform Alex |
| ensemblProtein | ensembl-protein | Ensembl Protein | string | Ensembl protein accession with version | ENSP00000302237.3 |
| maturePeptides (repeated) | mat-peptide- | Mature Peptide | MaturePeptide | | |

Range Structure

A 1-based range on a sequence record.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| begin | start | Start | uint64 | | |
| end | stop | Stop | uint64 | | |
| orientation | orientation | Orientation | Orientation | | |
| order | order | Order | uint32 | | |
| ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. | |
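
Since Range coordinates are 1-based, they need shifting before they can be used as Python slice indices. The sketch below assumes the end coordinate is inclusive, which the schema does not state explicitly. The function to_half_open, the toy sequence, and the toy range are hypothetical and chosen only to make the arithmetic visible.

```python
def to_half_open(r):
    """Convert a 1-based Range dict to a 0-based (start, stop) slice pair.

    Assumes "end" is inclusive; begin/end arrive as strings in the JSON.
    """
    return int(r["begin"]) - 1, int(r["end"])


toy_sequence = "ACGTACGTAC"             # made-up sequence for illustration
toy_range = {"begin": "3", "end": "6"}  # 1-based positions 3 through 6

start, stop = to_half_open(toy_range)
subsequence = toy_sequence[start:stop]
print(start, stop)   # 2 6
print(subsequence)   # GTAC
```

The sketch ignores orientation; a range with orientation minus would additionally need strand-aware handling.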

SeqRangeSet Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
| range (repeated) | range- | | Range | Series of intervals on above accession_version | |

Transcript Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Accession | string | RefSeq transcript accession with version | |
| name | name | Transcript Name | string | RefSeq transcript name | transcript variant 12 |
| length | length | Transcript Length | uint32 | RefSeq transcript length in nucleotides | 3180 |
| cds | cds- | CDS | SeqRangeSet | | |
| genomicLocations (repeated) | genomic-location- | Genomic | GenomicLocation | | |
| ensemblTranscript | ensembl-transcript | Ensembl Transcript | string | Ensembl transcript accession with version | ENST00000306120.3 |
| protein | protein- | Protein | Protein | | |
| type | transcript-type | Type | Transcript.TranscriptType | | |

TranscriptTypeCount Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| type | | | Transcript.TranscriptType | | |
| count | coming soon | coming soon | uint32 | | |
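
In the sample report, transcriptTypeCounts agrees with a per-type tally of the transcripts list. The Python sketch below recomputes that tally. It assumes the counts are meant to summarise the transcripts array, which holds for the sample but is not stated in the schema. The record is an abbreviated copy of the sample.

```python
from collections import Counter

# Abbreviated copy of the sample report: only the fields needed here.
record = {
    "transcriptCount": 1,
    "transcriptTypeCounts": [{"count": 1, "type": "PROTEIN_CODING"}],
    "transcripts": [
        {"accessionVersion": "NM_130786.4", "type": "PROTEIN_CODING"},
    ],
}

# Tally the type of each transcript actually present in the record.
observed = Counter(t["type"] for t in record["transcripts"])
# Reshape the reported counts into a plain {type: count} dict.
reported = {c["type"]: c["count"] for c in record["transcriptTypeCounts"]}

counts_match = dict(observed) == reported
total_matches = sum(reported.values()) == record["transcriptCount"]
print(counts_match, total_matches)
```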

GeneType Enumeration

NB: GeneType values match Entrez Gene

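| Name | Number | Description |
| --- | --- | --- |
| UNKNOWN | 0 | |
| tRNA | 1 | |
| rRNA | 2 | |
| snRNA | 3 | |
| scRNA | 4 | |
| snoRNA | 5 | |
| PROTEIN_CODING | 6 | |
| PSEUDO | 7 | these will have NG or NR |
| TRANSPOSON | 8 | |
| miscRNA | 9 | |
| ncRNA | 10 | |
| BIOLOGICAL_REGION | 11 | these will have NG |
| OTHER | 255 | |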

Orientation Enumeration

| Name | Number | Description |
| --- | --- | --- |
| none | 0 | |
| plus | 1 | |
| minus | 2 | |

RnaType Enumeration

| Name | Number | Description |
| --- | --- | --- |
| rna_UNKNOWN | 0 | |
| premsg | 1 | |
| tmRna | 2 | |

Transcript.TranscriptType Enumeration

| Name | Number | Description |
| --- | --- | --- |
| UNKNOWN | 0 | |
| PROTEIN_CODING | 1 | |
| NON_CODING | 2 | |
| PROTEIN_CODING_MODEL | 3 | |
| NON_CODING_MODEL | 4 | |
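
Client code may find it convenient to mirror this enumeration locally. The sketch below uses Python's enum.IntEnum with the names and numbers listed above. The class TranscriptType and the helper parse_transcript_type are hypothetical names, and the assumption that the JSON carries the enum name (as the sample report does with "PROTEIN_CODING") rather than the number is based on that sample only.

```python
from enum import IntEnum


class TranscriptType(IntEnum):
    """Local mirror of the Transcript.TranscriptType enumeration."""
    UNKNOWN = 0
    PROTEIN_CODING = 1
    NON_CODING = 2
    PROTEIN_CODING_MODEL = 3
    NON_CODING_MODEL = 4


def parse_transcript_type(value):
    """Accept either the enum name (a string) or its number (an int)."""
    if isinstance(value, str):
        return TranscriptType[value]   # lookup by name
    return TranscriptType(value)       # lookup by number


by_name = parse_transcript_type("PROTEIN_CODING")
by_number = parse_transcript_type(4)
print(by_name.value)    # 1
print(by_number.name)   # NON_CODING_MODEL
```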

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
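
In the sample report, fields typed uint64 (geneId, taxId, begin, end) appear as quoted strings, while uint32 fields (length, order, count) appear as bare numbers. This matches the protocol buffers JSON mapping, which writes 64-bit integers as strings. The Python sketch below converts those strings back to integers after parsing. The set UINT64_FIELDS and the helper coerce_uint64 are hypothetical, and the field list covers only the uint64 fields named in the tables above.

```python
import json

# JSON keys whose schema type is uint64 in the tables above.
UINT64_FIELDS = {"geneId", "taxId", "begin", "end"}


def coerce_uint64(value):
    """Recursively convert string-encoded uint64 fields to Python ints."""
    if isinstance(value, dict):
        return {
            key: int(item)
            if key in UINT64_FIELDS and isinstance(item, str)
            else coerce_uint64(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [coerce_uint64(item) for item in value]
    return value  # numbers, strings, booleans, None pass through


line = (
    '{"geneId": "1", "taxId": "9606", "transcripts": '
    '[{"length": 3382, "cds": {"range": [{"begin": "56", "end": "1543"}]}}]}'
)
record = coerce_uint64(json.loads(line))
cds_range = record["transcripts"][0]["cds"]["range"][0]
print(record["geneId"], record["taxId"], cds_range["begin"], cds_range["end"])
```

Python integers have arbitrary precision, so no value in the uint64 range is lost in the conversion.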
Generated November 25, 2024