Prokaryote gene location report
Prokaryote gene location record identifiers, organism, and genomic locations
The downloaded prokaryote package contains a prokaryote gene location data report in JSON Lines format in the file:
ncbi_dataset/data/annotation_report.jsonl
Each line of the prokaryote gene location data report file is a hierarchical JSON object that represents a single prokaryote gene location record. The schema of the record is defined in the tables below, where each row describes either a single field in the report or a sub-structure, which is a collection of fields.
The outermost structure of the report is ProkaryoteGeneLocation.
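As a minimal sketch of how a JSON Lines report can be consumed, the Python below parses one JSON object per line and reads two top-level fields. The `read_records` helper is a hypothetical name introduced here, and the in-memory string stands in for the report file, using values abridged from the sample report in this document.

```python
import io
import json

# Stand-in for ncbi_dataset/data/annotation_report.jsonl: one JSON object per line.
# The record is abridged from the sample report; the blank line shows that
# empty lines are skipped rather than parsed.
SAMPLE_JSONL = (
    '{"proteinAccession": "WP_001435165.1", '
    '"organism": {"organismName": "Escherichia coli", "taxId": 562}}\n'
    "\n"
)


def read_records(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_records(io.StringIO(SAMPLE_JSONL)))
for record in records:
    print(record["proteinAccession"], record["organism"]["organismName"])
```

To read a real download, the same generator could be given an open file handle, for example `open("ncbi_dataset/data/annotation_report.jsonl", encoding="utf-8")`.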
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform prokaryote gene location data reports from JSON Lines to tabular formats.
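For readers who want to see what a JSON-to-tabular transformation involves, here is a rough Python sketch that flattens one record into tab-separated rows. It is not the dataformat tool and does not reproduce its output: the `flatten` helper and the column names are assumptions made for this illustration, and the record is abridged from the sample report in this document.

```python
import csv
import io

# Abridged from the sample report in this document.
RECORD = {
    "proteinAccession": "WP_001435165.1",
    "organism": {"organismName": "Escherichia coli", "taxId": 562},
    "refseqGenomicLocation": {
        "assemblyAccession": "GCF_964021255.1",
        "sequenceRange": {
            "accessionVersion": "NZ_CAXHYQ010000002.1",
            "range": [{"begin": "64120", "end": "64533", "orientation": "minus"}],
        },
    },
}

# Hypothetical column names for this sketch; dataformat emits its own headers.
COLUMNS = [
    "protein_accession",
    "organism_name",
    "assembly_accession",
    "sequence_accession",
    "start",
    "stop",
    "orientation",
]


def flatten(record, location_key="refseqGenomicLocation"):
    """Yield one flat row per range interval of the chosen genomic location."""
    location = record.get(location_key, {})
    seq_range = location.get("sequenceRange", {})
    for interval in seq_range.get("range", []):
        yield {
            "protein_accession": record.get("proteinAccession", ""),
            "organism_name": record.get("organism", {}).get("organismName", ""),
            "assembly_accession": location.get("assemblyAccession", ""),
            "sequence_accession": seq_range.get("accessionVersion", ""),
            "start": interval.get("begin", ""),
            "stop": interval.get("end", ""),
            "orientation": interval.get("orientation", ""),
        }


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(flatten(RECORD))
tsv_text = buffer.getvalue()
print(tsv_text, end="")
```

Because `range` is a repeated field, the sketch emits one row per interval; passing `location_key="genbankGenomicLocation"` would flatten the GenBank mapping instead, when the record contains one.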
Sample report
{
  "genbankGenomicLocation": {
    "assemblyAccession": "GCA_964021255.1",
    "sequenceRange": {
      "accessionVersion": "CAXHYQ010000002.1",
      "range": [
        {
          "begin": "64120",
          "end": "64533",
          "orientation": "minus"
        }
      ]
    }
  },
  "organism": {
    "organismName": "Escherichia coli",
    "taxId": 562
  },
  "proteinAccession": "WP_001435165.1",
  "refseqGenomicLocation": {
    "assemblyAccession": "GCF_964021255.1",
    "sequenceRange": {
      "accessionVersion": "NZ_CAXHYQ010000002.1",
      "range": [
        {
          "begin": "64120",
          "end": "64533",
          "orientation": "minus"
        }
      ]
    }
  }
}
ProkaryoteGeneLocation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|
proteinAccession | protein-accession | Protein Accession | string | The RefSeq WP_ prefixed accession for the protein sequence. | WP_000443665.1 |
refseqGenomicLocation | refseq-genomic-location- | RefSeq Genomic Location | SeqRangeWithAssembly | The RefSeq nucleotide mapping for this protein | |
genbankGenomicLocation | genbank-genomic-location- | GenBank Genomic Location | SeqRangeWithAssembly | The equivalent GenBank nucleotide mapping for this protein | |
organism | organism- | Organism | Organism | The species level taxonomy information | |
completeness | completeness | Completeness | ProkaryoteGeneLocation.Completeness | Whether the assembly is complete or partial | |
chromosomeName | chromosome_name | Chromosome | string | The name of the chromosome, if there is one. | |
LineageOrganism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|
taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
name | coming soon | coming soon | string | Scientific name | Coronaviridae |
Organism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|
taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606, 2697049 |
organismName | organism-name | Organism Name | string | Scientific name | Homo sapiens, Severe acute respiratory syndrome coronavirus 2 |
commonName | common-name | Common Name | string | Common name | human, pangolin, MERS, SARS2 |
lineage repeated | | | LineageOrganism | Lineage ordered from the superkingdom level down to increasingly specific taxonomic entries | |
strain | strain | Strain | string | | SE11 |
pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |
Range Structure
A 1-based range on a sequence record.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|
begin | start | Start | uint64 | | |
end | stop | Stop | uint64 | | |
orientation | orientation | Orientation | Orientation | | |
order | order | Order | uint32 | | |
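The coordinate convention can be illustrated with a short Python sketch that computes the length of a range and converts it to a 0-based, half-open interval. It assumes that `end` is inclusive, which the description above does not state outright. The helper names `range_length` and `to_zero_based_half_open` are hypothetical, and the values come from the sample report.

```python
def range_length(rng):
    """Length in bases of a 1-based range, assuming `end` is inclusive."""
    begin = int(rng["begin"])  # begin and end arrive as strings in the JSON report
    end = int(rng["end"])
    if end < begin:
        raise ValueError(f"end ({end}) is smaller than begin ({begin})")
    return end - begin + 1


def to_zero_based_half_open(rng):
    """Convert to the (start, stop) form used by Python slicing."""
    return int(rng["begin"]) - 1, int(rng["end"])


# Range taken from the sample report in this document.
sample_range = {"begin": "64120", "end": "64533", "orientation": "minus"}

length = range_length(sample_range)
start, stop = to_zero_based_half_open(sample_range)
print(length, start, stop)  # 414 64119 64533
```

Under the inclusive assumption the sample range spans 414 bases, a multiple of three, as would be expected for a protein-coding interval.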
SeqRangeSet Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|
accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
range repeated | range- | | Range | Series of intervals on the sequence identified by accessionVersion above | |
SeqRangeWithAssembly Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|
assemblyAccession | assembly-accession | Assembly Accession | string | The genomic assembly associated with the sequence location of this protein | GCF_000010385.1 |
sequenceRange | seq-range- | | SeqRangeSet | The genomic sequence location of this protein | |
Orientation Enumeration
Name | Number | Description |
---|
none | 0 | |
plus | 1 | |
minus | 2 | |
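One way to work with these values in client code is to mirror the enumeration in Python, as in the sketch below. The JSON report carries the enumeration name (for example "minus" in the sample report), so the lookup is done by name. The `STRAND_SYMBOL` mapping is an assumption added for illustration: it borrows the "+", "-" and "." strand symbols common in annotation file formats and is not part of this schema.

```python
from enum import IntEnum


class Orientation(IntEnum):
    """Mirror of the Orientation enumeration: names and numbers as tabulated."""

    none = 0
    plus = 1
    minus = 2


# Hypothetical display mapping, not defined by the report schema.
STRAND_SYMBOL = {
    Orientation.none: ".",
    Orientation.plus: "+",
    Orientation.minus: "-",
}

# The report stores the enumeration name as a string.
orientation = Orientation["minus"]
print(orientation.name, int(orientation), STRAND_SYMBOL[orientation])  # minus 2 -
```

Looking up an unknown name raises `KeyError`, which surfaces unexpected values early instead of silently passing them through.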
ProkaryoteGeneLocation.Completeness Enumeration
Name | Number | Description |
---|
complete | 0 | |
partial | 1 | |
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
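In the sample report, the uint64 fields begin and end appear as quoted strings, while the uint32 field taxId appears as a bare number. This is consistent with the protobuf JSON mapping, which writes 64-bit integers as strings. The sketch below shows one way to normalize both forms in Python; `as_int` is a hypothetical helper name, not part of any NCBI tool.

```python
def as_int(value):
    """Return a Python int from a JSON number or a decimal string."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("booleans are not valid integer fields")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        return int(value)
    raise TypeError(f"cannot interpret {value!r} as an integer")


tax_id = as_int(562)     # uint32 field, a JSON number in the sample report
begin = as_int("64120")  # uint64 field, a JSON string in the sample report
print(tax_id, begin)  # 562 64120
```

Python integers have arbitrary precision, so the full uint64 range is preserved after conversion.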
Generated November 25, 2024