Retrieve ortholog data and metadata
Quick overview
Gene orthologs can be retrieved by gene-id, accession or symbol using the --ortholog flag.
- gene-id and accession are unique identifiers, so the associated taxon is implied.
- symbol is not a unique identifier (human and cat use the same symbol, BRCA1), so it is necessary to specify a taxon. datasets uses human as the default species.
For example, the table below lists the identifiers for the human BRCA1 DNA repair associated gene and its gene orthologs in cat and Florida manatee:
Species | gene-id | accession | symbol |
---|---|---|---|
Human | 672 | NM_007297.4 | BRCA1 |
Cat | 101081937 | XM_019817934.2 | BRCA1 |
Florida manatee | 101356605 | XM_023725233.1 | LOC101356605 |
In the examples below, we will use the datasets download and datasets summary commands.
In short, datasets summary returns only metadata in JSON or JSON-Lines format, while datasets download retrieves a gene data package that includes both metadata and sequence files.
The --ortholog flag
The --ortholog flag serves two purposes:
- It explicitly requests an ortholog set for a gene-id, accession or symbol.
- It defines the taxonomic scope of the ortholog set.
The --ortholog flag requires an argument. The options are:
- --ortholog all: returns the complete ortholog set available for the requested gene, with no filter.
- --ortholog <any taxon>: restricts the requested ortholog set to the given taxonomic range. An example further down shows how to filter an ortholog set by taxon.
Simplest example: retrieve one gene ortholog set
All of the following commands will download the same gene ortholog set:
datasets download gene gene-id 672 --ortholog all
datasets download gene symbol brca1 --ortholog all
datasets download gene accession NM_007297.4 --ortholog all
Retrieve multiple gene ortholog sets based on a gene list
NCBI datasets can retrieve multiple ortholog sets based on a list of symbols, accessions or gene-ids. Currently, datasets does not separate each ortholog set into its own files; all sets are saved in a single data package.
For example, if we provide a list of gene-ids (one per line or comma-separated) with the --inputfile flag, datasets will iterate over them and save the results as a single data package.
$ cat genelist.txt
672
4157
3206
$ datasets download gene gene-id --inputfile genelist.txt --ortholog all --filename ort.zip
$ unzip ort.zip -d ort
$ tree ort
ort
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- data_table.tsv
        |-- dataset_catalog.json
        |-- protein.faa
        `-- rna.fna
If we want each ortholog data package to be saved separately, we can use a loop instead:
Command:
while read -r GENE; do
  datasets download gene gene-id "${GENE}" --ortholog all --filename "${GENE}.zip"
done < genelist.txt
Result:
Collecting 319 records [===============================================>] 100% 318/319
Collecting 318 records [================================================] 100% 318/318
Downloading: 672.zip 4.11MB done
Collecting 271 records [================================================] 100% 271/271
Collecting 271 records [================================================] 100% 271/271
Downloading: 4157.zip 312kB done
Collecting 431 records [===============================================>] 100% 430/431
Collecting 430 records [================================================] 100% 430/430
Downloading: 3206.zip 587kB done
In this case, the list of genes must have one gene-id per line.
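If the gene list is comma-separated, it has to be converted to one gene-id per line before it can drive the loop. The following is a minimal sketch of that conversion, not part of the datasets tooling: the variable names, the sample list and the temporary file are assumptions made for illustration. It only prints the download commands (a dry run), so it can be tried without network access.

```shell
# Hypothetical input: the gene-ids from genelist.txt, written comma-separated.
ids="672, 4157,3206"

# Hypothetical output file that will hold one gene-id per line.
list_file=$(mktemp)

# Split on commas, strip spaces, tabs and carriage returns, and keep only
# lines that consist entirely of digits.
printf '%s\n' "$ids" | tr ',' '\n' | tr -d ' \t\r' | grep -E '^[0-9]+$' > "$list_file"

# Dry run: print the command that would be run for each gene-id.
# Remove the leading "echo" to actually download the packages.
planned=$(while IFS= read -r gene; do
  echo datasets download gene gene-id "$gene" --ortholog all --filename "${gene}.zip"
done < "$list_file")
printf '%s\n' "$planned"
```

Filtering on digits also drops blank lines, which would otherwise reach the loop as empty gene-ids.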
Filter an ortholog gene set by taxon
NCBI datasets can filter the ortholog set by taxon (at any rank) when the taxon is specified after the --ortholog flag. For example, you can restrict the BRCA1 (gene-id 672) ortholog set to members of the family Mustelidae (otters, weasels and relatives).
To list the Mustelidae species for which gene orthologs of human BRCA1 have been calculated, use datasets summary with dataformat:
datasets summary gene gene-id 672 --ortholog mustelidae --as-json-lines | dataformat tsv gene --fields tax-name
Output:
Taxonomic Name
Enhydra lutris kenyoni
Mustela erminea
Lontra canadensis
Neogale vison
Mustela putorius furo
Meles meles
Lutra lutra
Mustela lutreola
Mustela nigripes
The full BRCA1 ortholog set includes 306 species, while the Mustelidae set has only 9 species.
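The species count for a filtered set can be checked by counting the non-header lines of the TSV output. The sketch below does this on a copy of the Mustelidae output shown above; the temporary file and variable names are assumptions for illustration, and in practice the input would be the saved output of the datasets summary and dataformat pipeline.

```shell
# Sample input copied from the output shown above.
mustelidae_tsv=$(mktemp)
cat > "$mustelidae_tsv" <<'EOF'
Taxonomic Name
Enhydra lutris kenyoni
Mustela erminea
Lontra canadensis
Neogale vison
Mustela putorius furo
Meles meles
Lutra lutra
Mustela lutreola
Mustela nigripes
EOF

# Skip the header line, collapse duplicate names, and count what is left.
species_count=$(tail -n +2 "$mustelidae_tsv" | sort -u | wc -l | tr -d ' ')
echo "species: $species_count"
```

Using sort -u means a taxon that appears on more than one row is counted once.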
Alternatively, you can download a data package for these Mustelidae gene orthologs:
datasets download gene gene-id 672 --ortholog mustelidae --filename mustelidae.zip
Collecting 9 gene records [================================================] 100% 9/9
Downloading: mustelidae.zip 171kB valid zip archive
Validating package files [================================================] 100% 5/5
Retrieve an ortholog set by symbol using the --taxon flag
By default, datasets assumes the taxon is human (Taxonomy ID: 9606) when an ortholog set is requested by symbol. If the ortholog set for the requested symbol contains no human gene, the query returns an empty result unless the --taxon flag is supplied.
For example, when we query by the mouse gene symbol syna, we get the following result:
$ datasets summary gene symbol syna --ortholog all
{"total_count": 0}
If we specify mouse (TaxId: 10090) with the --taxon flag, datasets returns the syna ortholog set:
$ datasets summary gene symbol syna --taxon 10090 --ortholog all
How to retrieve ortholog metadata
Using dataformat
In addition to datasets, the dataformat command-line tool can extract metadata from the gene data report that is included in the data packages or returned by the datasets summary command.
Create a TSV file from the datasets summary JSON-Lines output using dataformat:
datasets summary gene symbol brca1 --ortholog all --as-json-lines | \
dataformat tsv gene --fields tax-name,gene-id,symbol,group-id > brca1.tsv
head brca1.tsv
Result:
Taxonomic Name NCBI GeneID Symbol Gene Group Identifier
Sus scrofa 100049662 BRCA1 672
Equus caballus 100051990 BRCA1 672
Taeniopygia guttata 100224649 BRCA1 672
Oryctolagus cuniculus 100347269 BRCA1 672
Callithrix jacchus 100388186 BRCA1 672
Pongo abelii 100439533 BRCA1 672
Ailuropoda melanoleuca 100480891 BRCA1 672
Anolis carolinensis 100553919 brca1 672
Nomascus leucogenys 100580360 BRCA1 672
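Once the metadata is in a TSV file, standard text tools can query it. The sketch below uses awk to list the rows whose symbol is not exactly BRCA1, such as the lowercase brca1 entry above. It runs on a few rows copied from the result, rebuilt with tab separators; the temporary file and variable names are assumptions for illustration, and in practice the input would be brca1.tsv.

```shell
# Sample rows copied from the result above, written with tab separators.
sample_tsv=$(mktemp)
{
  printf 'Taxonomic Name\tNCBI GeneID\tSymbol\tGene Group Identifier\n'
  printf 'Sus scrofa\t100049662\tBRCA1\t672\n'
  printf 'Pongo abelii\t100439533\tBRCA1\t672\n'
  printf 'Anolis carolinensis\t100553919\tbrca1\t672\n'
  printf 'Nomascus leucogenys\t100580360\tBRCA1\t672\n'
} > "$sample_tsv"

# Skip the header, then print taxon and symbol for every row whose
# symbol column (column 3) is not exactly "BRCA1".
nonstandard=$(awk -F '\t' 'NR > 1 && $3 != "BRCA1" { print $1 "\t" $3 }' "$sample_tsv")
printf '%s\n' "$nonstandard"
```

The comparison is case-sensitive on purpose, so symbols that differ only in capitalization are reported.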
Generated November 25, 2024