Get representative protein sequences from Ortholog sets
Each ortholog gene may have many transcripts. Here we describe how you can filter the resulting metadata to only the longest transcript.
Get representative protein sequences from Ortholog sets
Each ortholog gene may have many transcripts. Here we describe how you can filter the resulting metadata to only the longest transcript.
Get one representative protein sequence from a set of orthologs
# [START ortholog_longest_transcript]
from typing import Iterator, List
from ncbi.datasets.openapi import ApiClient as DatasetsApiClient
from ncbi.datasets.openapi import ApiException as DatasetsApiException
from ncbi.datasets import GeneApi as DatasetsGeneApi
input_symbol = "ACE2"
input_taxon = "human"
def find_longest_transcript(gene):
if not gene.transcripts:
return None
num_trs = len(gene.transcripts)
if num_trs == 0:
return None
if num_trs == 1:
return gene.transcripts[0]
def get_tr_length(tr):
return tr.length
tr_by_len = sorted(gene.transcripts, key=get_tr_length, reverse=True)
maxlen = get_tr_length(tr_by_len[0])
longest = []
for tr in tr_by_len:
trlen = get_tr_length(tr)
if trlen < maxlen:
break
longest.append(tr)
if len(longest) == 1:
return longest[0]
# select ealiest accession.version (oldest sequence)
return sorted(longest, key=lambda tr: tr.accession_version)[0]
def _genes_for_ortholog(ortholog_set) -> Iterator:
for gene_match in ortholog_set.genes.genes:
if gene_match.messages:
for m in gene_match.messages:
if m.error:
print(f"error: [{m.error.reason}] {m.error.message}")
if m.warning:
print(f"warning: [{m.warning.reason}] {m.warning.message}")
if gene_match.warnings:
for m in gene_match.warnings:
print(f"warning: [{m.reason}] {m.message}")
if gene_match.gene:
yield gene_match.gene
def retrieve_orthologs_by_gene_id(gene_api: DatasetsGeneApi, gene_id: str):
ortholog_set = gene_api.gene_orthologs_by_id(int(gene_id))
for gene in _genes_for_ortholog(ortholog_set):
longest_tr = find_longest_transcript(gene)
print(
{
"gene_id": gene.gene_id,
"symbol": gene.symbol,
"tr-accession": longest_tr.accession_version,
"tr-name": longest_tr.name,
"tr-length": longest_tr.length,
"prot-accession": longest_tr.protein.accession_version,
"prot-name": longest_tr.protein.name,
"prot-length": longest_tr.protein.length,
}
)
def gene_ids_for_symbol_and_taxon(gene_api: DatasetsGeneApi, symbols: List[str], taxon: str) -> List[int]:
"""Return a list of gene-ids for a set of symbols and taxon"""
gene_ids_for_symbols: List[int] = []
try:
gene_reply = gene_api.gene_metadata_by_tax_and_symbol(
symbols=symbols,
taxon=taxon,
)
for gene in gene_reply.genes:
gene_ids_for_symbols.append(gene.gene.gene_id)
except DatasetsApiException as e:
print(f"Exception when calling GeneApi: {e}\n")
return gene_ids_for_symbols
def retrieve_orthologs_for_symbol(symbol: str, taxon: str):
with DatasetsApiClient() as api_client:
gene_api = DatasetsGeneApi(api_client)
try:
gene_ids = gene_ids_for_symbol_and_taxon(gene_api, [symbol], taxon)
if len(gene_ids) != 1:
raise Exception(f"Unexpected number of gene-ids: {gene_ids}")
retrieve_orthologs_by_gene_id(gene_api, gene_ids[0])
except DatasetsApiException as e:
print(f"Exception when calling GeneApi: {e}\n")
# [END ortholog_longest_transcript]
if __name__ == "__main__":
retrieve_orthologs_for_symbol(symbol=input_symbol, taxon=input_taxon)
- Download a gene ortholog data package. In this example, we are using the
--taxon-filter
flag to restrict the data to calculated orthologs from ferret (TaxId: 9669), mouse (TaxId: 10090) and human (TaxID: 9606).
datasets download ortholog gene-id 672 \
--taxon-filter 9669,10090,9606 \
--filename brca1_example.zip
unzip brca1_example.zip -d brca1_example
- Using
dataformat
, let’s create a tsv file with the following fields: taxonomic ID, taxonomic name, transcript accession and length, protein accession and length and isoform name. We’ll be working from thedata
directory:
cd brca1_example/ncbi_dataset/data
dataformat tsv gene \
--inputfile data_report.jsonl \
--fields tax-id,tax-name,transcript-accession,transcript-length,transcript-protein-accession,transcript-protein-length,transcript-protein-isoform > transcript_protein.tsv
cat transcript_protein.tsv
Result:
Taxonomic ID Taxonomic Name Transcript Accession Transcript Transcript Length Transcript Protein Accession Transcript Protein Length Transcript Protein Isoform
9669 Mustela putorius furo XM_004772611.3 6768 XP_004772668.1 1869 isoform X5
9669 Mustela putorius furo XM_004772612.3 6929 XP_004772669.1 1869 isoform X5
9669 Mustela putorius furo XM_045080745.1 6937 XP_044936680.1 1869 isoform X5
9669 Mustela putorius furo XM_013045927.2 6902 XP_012901381.1 1822 isoform X6
9669 Mustela putorius furo XM_013045926.2 6950 XP_012901380.1 1877 isoform X4
9669 Mustela putorius furo XM_004772609.3 6953 XP_004772666.1 1878 isoform X2
9669 Mustela putorius furo XM_004772610.3 6953 XP_004772667.1 1878 isoform X3
9669 Mustela putorius furo XM_004772608.3 6956 XP_004772665.1 1879 isoform X1
9669 Mustela putorius furo XM_013045929.2 3642 XP_012901383.1 775 isoform X9
9669 Mustela putorius furo XM_004772616.3 3642 XP_004772673.1 775 isoform X8
9669 Mustela putorius furo XM_013045928.2 3645 XP_012901382.1 776 isoform X7
10090 Mus musculus XR_004936704.1 6499
10090 Mus musculus XM_006532068.4 3363 XP_006532131.1 707 isoform X4
10090 Mus musculus XM_006532064.5 6543 XP_006532127.1 1767 isoform X2
10090 Mus musculus NM_009764.3 6648 NP_033894.3 1812
10090 Mus musculus XM_036156253.1 6542 XP_036012146.1 1765 isoform X3
10090 Mus musculus XM_030245495.2 6780 XP_030101355.1 1812 isoform X1
9606 Homo sapiens NR_027676.2 7152
9606 Homo sapiens NM_007297.4 7028 NP_009228.2 1816 isoform 3
9606 Homo sapiens NM_007299.4 3696 NP_009230.2 699 isoform 5
9606 Homo sapiens NM_007294.4 7088 NP_009225.1 1863 isoform 1
9606 Homo sapiens NM_007300.4 7151 NP_009231.2 1884 isoform 2
9606 Homo sapiens NM_007298.3 3699 NP_009229.2 759 isoform 4
- Now let’s create a list of TaxIds using dataformat
dataformat tsv gene \
--inputfile data_report.jsonl \
--fields tax-id \
--elide-header > taxids.txt
- Now let’s create two files: longest_transcript.txt and longest_protein.txt with the accessions of the longest sequences for each type of data.
cat taxids.txt | while read IDS; do
LONGEST=$(grep -w "^$IDS" transcript_protein.tsv | sed 's/\t/|/g' | sort -Vr -k6,6 -t'|' | head -n1);
TRANSCRIPT=$(echo $LONGEST | cut -f3 -d'|');
PROTEIN=$(echo $LONGEST | cut -f5 -d'|');
printf "$TRANSCRIPT\n" >> transcript.list;
printf "$PROTEIN\n" >> protein.list;
done
- Extract the sequences from the original files
rna.fna
andprotein.faa
using the lists we created and the programseqtk
- transcript:
seqtk subseq rna.fna transcript.list > rna_longest.fna
- protein:
seqtk subseq protein.faa protein.list > protein_longest.faa