Gene report

Gene record identifiers, genomic locations, transcripts, and products

The downloaded gene package contains a gene data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the gene data report file is a hierarchical JSON object that represents a single gene record. The tables below define the schema of the gene record: each row describes either a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is GeneDescriptor.
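As a minimal sketch of how such a file can be consumed, the Python below parses a JSON Lines report one record at a time. The function names `iter_gene_records` and `read_gene_report` are hypothetical helpers introduced for illustration, and the default path is the one given above, relative to wherever the package was unpacked.

```python
import json
from typing import Iterator, List, TextIO


def iter_gene_records(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed gene record (a GeneDescriptor as a dict) per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def read_gene_report(path: str = "ncbi_dataset/data/data_report.jsonl") -> List[dict]:
    """Load every gene record from a report file into a list (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_gene_records(handle))
```

Iterating line by line keeps memory use flat for large packages; `read_gene_report` is only a convenience for small ones.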

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform gene data reports from JSON Lines to tabular formats.

Sample report

{
  "annotations": [
    {
      "assembliesInScope": [
        {
          "accession": "GCF_000001405.40",
          "name": "GRCh38.p14"
        }
      ],
      "releaseDate": "2024-08-23",
      "releaseName": "GCF_000001405.40-RS_2024_08"
    },
    {
      "assembliesInScope": [
        {
          "accession": "GCF_009914755.1",
          "name": "T2T-CHM13v2.0"
        }
      ],
      "releaseDate": "2024-08-23",
      "releaseName": "GCF_009914755.1-RS_2024_08"
    }
  ],
  "chromosomes": [
    "19"
  ],
  "commonName": "human",
  "description": "alpha-1-B glycoprotein",
  "ensemblGeneIds": [
    "ENSG00000121410"
  ],
  "geneGroups": [
    {
      "id": "1",
      "method": "NCBI Ortholog"
    }
  ],
  "geneId": "1",
  "genomicRanges": [
    {
      "accessionVersion": "NC_000019.10",
      "range": [
        {
          "begin": "58345183",
          "end": "58353492",
          "orientation": "minus"
        }
      ]
    },
    {
      "accessionVersion": "NC_060943.1",
      "range": [
        {
          "begin": "61441599",
          "end": "61449907",
          "orientation": "minus"
        }
      ]
    }
  ],
  "nomenclatureAuthority": {
    "authority": "HGNC",
    "identifier": "HGNC:5"
  },
  "omimIds": [
    "138670"
  ],
  "orientation": "minus",
  "swissProtAccessions": [
    "P04217"
  ],
  "symbol": "A1BG",
  "synonyms": [
    "A1B",
    "ABG",
    "GAB",
    "HYST2477"
  ],
  "taxId": "9606",
  "taxname": "Homo sapiens",
  "transcripts": [
    {
      "accessionVersion": "NM_130786.4",
      "cds": {
        "accessionVersion": "NM_130786.4",
        "range": [
          {
            "begin": "56",
            "end": "1543"
          }
        ]
      },
      "ensemblTranscript": "ENST00000263100.8",
      "exons": {
        "accessionVersion": "NC_000019.10",
        "range": [
          {
            "begin": "58353404",
            "end": "58353492",
            "order": 1
          },
          {
            "begin": "58353292",
            "end": "58353327",
            "order": 2
          },
          {
            "begin": "58352928",
            "end": "58353197",
            "order": 3
          },
          {
            "begin": "58352283",
            "end": "58352555",
            "order": 4
          },
          {
            "begin": "58351391",
            "end": "58351687",
            "order": 5
          },
          {
            "begin": "58350370",
            "end": "58350651",
            "order": 6
          },
          {
            "begin": "58347353",
            "end": "58347640",
            "order": 7
          },
          {
            "begin": "58345183",
            "end": "58347029",
            "order": 8
          }
        ]
      },
      "genomicLocations": [
        {
          "exons": [
            {
              "begin": "58353404",
              "end": "58353492",
              "order": 1
            },
            {
              "begin": "58353292",
              "end": "58353327",
              "order": 2
            },
            {
              "begin": "58352928",
              "end": "58353197",
              "order": 3
            },
            {
              "begin": "58352283",
              "end": "58352555",
              "order": 4
            },
            {
              "begin": "58351391",
              "end": "58351687",
              "order": 5
            },
            {
              "begin": "58350370",
              "end": "58350651",
              "order": 6
            },
            {
              "begin": "58347353",
              "end": "58347640",
              "order": 7
            },
            {
              "begin": "58345183",
              "end": "58347029",
              "order": 8
            }
          ],
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "Chromosome 19 Reference GRCh38.p14 Primary Assembly"
        },
        {
          "exons": [
            {
              "begin": "61449819",
              "end": "61449907",
              "order": 1
            },
            {
              "begin": "61449707",
              "end": "61449742",
              "order": 2
            },
            {
              "begin": "61449343",
              "end": "61449612",
              "order": 3
            },
            {
              "begin": "61448698",
              "end": "61448970",
              "order": 4
            },
            {
              "begin": "61447805",
              "end": "61448101",
              "order": 5
            },
            {
              "begin": "61446784",
              "end": "61447065",
              "order": 6
            },
            {
              "begin": "61443768",
              "end": "61444055",
              "order": 7
            },
            {
              "begin": "61441599",
              "end": "61443445",
              "order": 8
            }
          ],
          "genomicAccessionVersion": "NC_060943.1",
          "genomicRange": {
            "begin": "61441599",
            "end": "61449907",
            "orientation": "minus"
          },
          "sequenceName": "Chromosome 19 Alternate T2T-CHM13v2.0"
        }
      ],
      "genomicRange": {
        "accessionVersion": "NC_000019.10",
        "range": [
          {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          }
        ]
      },
      "length": 3382,
      "protein": {
        "accessionVersion": "NP_570602.2",
        "ensemblProtein": "ENSP00000263100.2",
        "length": 495,
        "name": "alpha-1B-glycoprotein precursor"
      },
      "type": "PROTEIN_CODING"
    }
  ],
  "type": "PROTEIN_CODING"
}
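In the sample above, the 64-bit integer fields (geneId, taxId, begin, end) appear as JSON strings, while 32-bit fields such as length appear as JSON numbers. The sketch below, which assumes that pattern holds for other records, converts the string-encoded values with `int()` and pulls out the genomic placements and transcript/protein pairs. `SAMPLE` is a trimmed copy of the record above, and `summarize` is a hypothetical helper, not part of any NCBI tool.

```python
# Trimmed copy of the sample record; all values are copied from the report above.
SAMPLE = {
    "geneId": "1",
    "symbol": "A1BG",
    "taxId": "9606",
    "genomicRanges": [
        {"accessionVersion": "NC_000019.10",
         "range": [{"begin": "58345183", "end": "58353492", "orientation": "minus"}]},
        {"accessionVersion": "NC_060943.1",
         "range": [{"begin": "61441599", "end": "61449907", "orientation": "minus"}]},
    ],
    "transcripts": [
        {"accessionVersion": "NM_130786.4", "length": 3382,
         "protein": {"accessionVersion": "NP_570602.2", "length": 495}},
    ],
}


def summarize(record: dict) -> dict:
    """Collect identifiers, placements, and transcript/protein pairs from one record."""
    locations = [
        (rs["accessionVersion"], int(r["begin"]), int(r["end"]), r.get("orientation"))
        for rs in record.get("genomicRanges", [])
        for r in rs.get("range", [])
    ]
    # A transcript without a protein sub-structure yields None for the protein accession.
    products = [
        (t["accessionVersion"], t.get("protein", {}).get("accessionVersion"))
        for t in record.get("transcripts", [])
    ]
    return {
        "gene_id": int(record["geneId"]),
        "symbol": record.get("symbol"),
        "locations": locations,
        "products": products,
    }


summary = summarize(SAMPLE)
```

Using `.get()` with defaults is a deliberate choice here: fields that are empty for a given gene may simply be absent from its JSON object.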

GeneDescriptor Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
| symbol | symbol | Symbol | string | gene symbol | GNAS |
| description | description | Description | string | gene name | GNAS complex locus |
| taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
| taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
| commonName | common-name | Common Name | string | Common name of the organism | human |
| type | gene-type | Gene Type | GeneDescriptor.GeneType | | |
| rnaType | rna-type | RNA Type | GeneDescriptor.RnaType | | |
| orientation | orientation | Orientation | Orientation | | |
| genomicRanges (repeated) | genomic-range- | Genomic Range | SeqRangeSet | | |
| referenceStandards (repeated) | ref-standard- | Reference Standard | GenomicRegion | Clinical reference standard NG | |
| genomicRegions (repeated) | genomic-region- | Genomic Region | GenomicRegion | Pseudogene, non-genic regulatory element and other genomic region NG | |
| transcripts (repeated) | transcript- | Transcript | Transcript | RefSeq coding and non-coding transcript accessions | |
| proteins (repeated) | protein- | Protein | Protein | Only for proteins directly annotated on the Gene, without any intermediary transcript | |
| chromosomes (repeated) | chromosomes | Chromosomes | string | | 1; X,Y |
| nomenclatureAuthority | name- | Nomenclature | NomenclatureAuthority | | |
| swissProtAccessions (repeated) | swissprot-accessions | SwissProt Accessions | string | | |
| ensemblGeneIds (repeated) | ensembl-geneids | Ensembl GeneIDs | string | | |
| omimIds (repeated) | omim-ids | OMIM IDs | string | | |
| synonyms (repeated) | synonyms | Synonyms | string | | |
| replacedGeneId | replaced-gene-id | Replaced NCBI GeneID | uint64 | The NCBI Gene ID for the gene that was merged into the current gene record | |
| annotations (repeated) | annotation- | Annotation | Annotation | | |

AnnotatedAssemblies Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | | |
| name | name | Name | string | | |

Annotation Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| releaseName | release-name | Release Name | string | | |
| releaseDate | release-date | Release Date | string | | |
| assembliesInScope (repeated) | assemblies-in-scope- | Assemblies in Scope | AnnotatedAssemblies | | |

GenomicLocation Structure

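| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| genomicAccessionVersion | accession | Accession | string | | |
| sequenceName | seq-name | Seq Name | string | | |
| genomicRange | range- | | Range | | |
| exons (repeated) | exon- | Exons | Range | | |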

GenomicRegion Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| geneRange | gene-range- | Gene Range | SeqRangeSet | The range of this Gene record on this genomic region. | |
| type | genomic-region-type | Genomic Region Type | GenomicRegion.GenomicRegionType | | |

MaturePeptide Structure

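| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Accession | string | | |
| name | name | Name | string | | |
| length | length | Length | uint32 | | |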

NomenclatureAuthority Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| authority | authority | Authority | string | The nomenclature authority for this gene record | HGNC |
| identifier | id | ID | string | The nomenclature authority identifier for this gene record | HGNC:4392 |

Protein Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Accession | string | RefSeq protein accession with version | NP_001296812.1 |
| name | name | Name | string | Protein name | protein ALEX |
| length | length | Length | uint32 | Protein length in amino acids | 626 |
| isoformName | isoform | Isoform | string | Protein isoform name | isoform Alex |
| ensemblProtein | ensembl-protein | Ensembl Protein | string | Ensembl protein accession with version | ENSP00000302237.3 |
| maturePeptides (repeated) | mat-peptide- | Mature Peptide | MaturePeptide | | |

Range Structure

A 1-based range on a sequence record.

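| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| begin | start | Start | uint64 | | |
| end | stop | Stop | uint64 | | |
| orientation | orientation | Orientation | Orientation | | |
| order | order | Order | uint32 | | |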
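The Range description says coordinates are 1-based but does not say whether end is inclusive. The sample report is consistent with closed (inclusive) intervals: summing end - begin + 1 over the eight exons of NM_130786.4 gives 3382, which equals that transcript's reported length. The sketch below assumes that reading; `range_length` is a hypothetical helper.

```python
def range_length(rng: dict) -> int:
    """Length of a 1-based interval, assuming both begin and end are inclusive."""
    # begin and end arrive as strings in the JSON report, so convert first.
    return int(rng["end"]) - int(rng["begin"]) + 1


# Exons of NM_130786.4 on NC_000019.10, copied from the sample report.
EXONS = [
    {"begin": "58353404", "end": "58353492", "order": 1},
    {"begin": "58353292", "end": "58353327", "order": 2},
    {"begin": "58352928", "end": "58353197", "order": 3},
    {"begin": "58352283", "end": "58352555", "order": 4},
    {"begin": "58351391", "end": "58351687", "order": 5},
    {"begin": "58350370", "end": "58350651", "order": 6},
    {"begin": "58347353", "end": "58347640", "order": 7},
    {"begin": "58345183", "end": "58347029", "order": 8},
]

spliced_length = sum(range_length(exon) for exon in EXONS)
print(spliced_length)  # 3382, the transcript length in the sample report
```

Applied to the sample CDS range (56 to 1543), the same function gives 1488 nucleotides, or 496 codons, which is consistent with the 495-residue protein plus a stop codon.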

SeqRangeSet Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
| range (repeated) | range- | | Range | Series of intervals on above accession_version | |

Transcript Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Accession | string | RefSeq transcript accession with version | |
| name | name | Transcript Name | string | RefSeq transcript name | transcript variant 12 |
| length | length | Transcript Length | uint32 | RefSeq transcript length in nucleotides | 3180 |
| cds | cds- | CDS | SeqRangeSet | | |
| genomicLocations (repeated) | genomic-location- | Genomic | GenomicLocation | | |
| ensemblTranscript | ensembl-transcript | Ensembl Transcript | string | Ensembl transcript accession with version | ENST00000306120.3 |
| protein | protein- | Protein | Protein | | |
| type | transcript-type | Type | Transcript.TranscriptType | | |

GeneDescriptor.GeneType Enumeration

NB: GeneType values match Entrez Gene

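| Name | Number | Description |
| --- | --- | --- |
| UNKNOWN | 0 | |
| tRNA | 1 | |
| rRNA | 2 | |
| snRNA | 3 | |
| scRNA | 4 | |
| snoRNA | 5 | |
| PROTEIN_CODING | 6 | |
| PSEUDO | 7 | these will have NG or NR |
| TRANSPOSON | 8 | |
| miscRNA | 9 | |
| ncRNA | 10 | |
| BIOLOGICAL_REGION | 11 | these will have NG |
| OTHER | 255 | |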

GeneDescriptor.RnaType Enumeration

| Name | Number | Description |
| --- | --- | --- |
| rna_UNKNOWN | 0 | |
| premsg | 1 | |
| tmRna | 2 | |

GenomicRegion.GenomicRegionType Enumeration

| Name | Number | Description |
| --- | --- | --- |
| UNKNOWN | 0 | |
| REFSEQ_GENE | 1 | |
| PSEUDOGENE | 2 | |
| BIOLOGICAL_REGION | 3 | |
| OTHER | 4 | |

Orientation Enumeration

| Name | Number | Description |
| --- | --- | --- |
| none | 0 | |
| plus | 1 | |
| minus | 2 | |
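In the sample report, enumeration fields carry the value's name (for example "orientation": "minus") rather than its number. As a rough sketch, the Orientation table can be mirrored in a Python `IntEnum` so that names read from a record map back to the numbers listed here. The class and the `strand_sign` helper are illustrative only and are not part of any NCBI library.

```python
from enum import IntEnum


class Orientation(IntEnum):
    """Mirror of the Orientation enumeration table (names and numbers as listed)."""
    none = 0
    plus = 1
    minus = 2


def strand_sign(name: str) -> int:
    """Map an orientation name from a record to +1, -1, or 0 (hypothetical helper)."""
    orientation = Orientation[name]  # raises KeyError for a name not in the table
    return {Orientation.plus: 1, Orientation.minus: -1}.get(orientation, 0)
```

Looking the value up by name keeps the code aligned with what the JSON contains; the numeric values matter mainly when working with the underlying protocol buffer messages.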

Transcript.TranscriptType Enumeration

| Name | Number | Description |
| --- | --- | --- |
| UNKNOWN | 0 | |
| PROTEIN_CODING | 1 | |
| NON_CODING | 2 | |
| PROTEIN_CODING_MODEL | 3 | |
| NON_CODING_MODEL | 4 | |

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
Generated November 25, 2024