Get genome metadata
Get genome metadata by accession, bioproject, or taxonomic name
Get genome metadata from NCBI Datasets using the command-line tool or a programming language such as Python or R.
Using a taxonomic name
Get genome metadata for all assemblies for an organism and its subspecies using the organism name or NCBI Taxonomy ID.
Run the following command to get metadata in JSON format:
datasets summary genome taxon human
Use quotes for taxon names that include spaces, such as mus musculus:
datasets summary genome taxon 'mus musculus'
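When the same command is assembled from a script, the quoting can be left to the Python standard library instead of being typed by hand. The sketch below is only an illustration: `taxon_summary_command` is a hypothetical helper, not part of NCBI Datasets, and it builds the command shown above without running it.

```python
import shlex
from typing import List


def taxon_summary_command(taxon: str) -> List[str]:
    """Return the argument list for `datasets summary genome taxon <taxon>`.

    Hypothetical helper for illustration; it only builds the command.
    """
    return ["datasets", "summary", "genome", "taxon", taxon]


# shlex.join adds quotes only where the shell needs them.
print(shlex.join(taxon_summary_command("human")))
print(shlex.join(taxon_summary_command("mus musculus")))
```

If the argument list is passed directly to `subprocess.run`, no shell quoting is needed at all, because each list element is delivered to the program as a single argument.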
For more information, see the Datasets Python API reference documentation.
Use the get_assembly_metadata_by_taxon method from ncbi-datasets-pylib to get all genome metadata for a single taxon.
from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_taxon
taxon_name = "human"
# Retrieve and print genomic metadata for assemblies belonging to the specified taxon
for assembly in get_assembly_metadata_by_taxon(taxon_name):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])
api.genome_instance <- GenomeApi$new()
result_genome <- api.genome_instance$AssemblyDescriptorsByTaxon('human')
prettify(result_genome$toJSONString())
Using BioProject accession
Get genome metadata for genome assemblies belonging to an NCBI BioProject, for example, the Sanger 25 Genomes Project, PRJEB33226.
Run the following command to get metadata in JSON format:
datasets summary genome accession PRJEB33226
For more information, see the Datasets Python API reference documentation.
Use the get_assembly_metadata_by_bioproject_accessions method from ncbi-datasets-pylib to get genome metadata for all genomes associated with the provided BioProject accessions.
from typing import List
from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_bioproject_accessions
bioprojects: List[str] = ["PRJEB33226"]
# Retrieve and print genome metadata for a list of bioproject accessions
for assembly in get_assembly_metadata_by_bioproject_accessions(bioprojects):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])
api.genome_instance <- GenomeApi$new()
result_genome <- api.genome_instance$AssemblyDescriptorsByBioproject('PRJEB33226')
prettify(result_genome$toJSONString())
Using an Assembly accession
Get metadata using an NCBI Assembly accession, for example, GCF_000001405.40 for the human reference assembly, GRCh38.
Run the following command to get metadata in JSON format:
datasets summary genome accession GCF_000001405.40
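Assembly and BioProject accessions are both passed to the same `accession` subcommand, so a script that accepts mixed input may want to tell them apart before deciding which Python or R method to call. The sketch below does this by pattern. The regular expressions reflect the usual accession layouts (GCA_ or GCF_ followed by nine digits and an optional version; PRJ followed by two letters and digits) and are an assumption, not an official validator; `classify_accession` is a hypothetical name.

```python
import re

# Assumed accession layouts; these patterns are illustrative, not authoritative.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")
BIOPROJECT_RE = re.compile(r"^PRJ[A-Z]{2}\d+$")


def classify_accession(accession: str) -> str:
    """Return 'assembly', 'bioproject', or 'unknown' for an accession string."""
    if ASSEMBLY_RE.match(accession):
        return "assembly"
    if BIOPROJECT_RE.match(accession):
        return "bioproject"
    return "unknown"


for value in ["GCF_000001405.40", "PRJEB33226", "human"]:
    print(value, classify_accession(value))
```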
For more information, see the Datasets Python API reference documentation.
Use the get_assembly_metadata_by_asm_accessions method from ncbi-datasets-pylib to get genome metadata for all genomes with the provided NCBI Assembly accessions.
from typing import List
from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_asm_accessions
genome_assembly_accessions: List[str] = ["GCF_000001405.40"]
# Retrieve and print genome metadata for a list of assembly accessions
for assembly in get_assembly_metadata_by_asm_accessions(genome_assembly_accessions):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])
api.genome_instance <- GenomeApi$new()
result_genome <- api.genome_instance$AssemblyDescriptorsByAccessions('GCF_000001405.40')
prettify(result_genome$toJSONString())
Filtering by genome assembly properties
When getting genome metadata by taxon, Assembly accession, or BioProject accession, you can filter the results by genome assembly properties, including the following:
- reference status
- annotation status
- assembly level
- year released
- infraspecies name
- assembly name
- submitter name
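Get metadata for the human reference genome:
datasets summary genome taxon human --reference
Get metadata for annotated human assemblies:
datasets summary genome taxon human --annotated
Get metadata for human assemblies at the complete genome level:
datasets summary genome taxon human --assembly-level complete_genome
Get metadata for human assemblies released since January 1, 2020:
datasets summary genome taxon human --released-since 01/01/2020
Get metadata for human assemblies whose species, infraspecies, assembly name, or submitter matches the text "T2T Consortium":
datasets summary genome taxon human --search 'T2T Consortium'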
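A script that chooses filters at run time can assemble these commands from keyword options. The following is a minimal sketch that uses only the flags shown above. `filtered_taxon_command` is a hypothetical helper, and the idea that several filter flags can be combined in one call is an assumption to check against the command-line help.

```python
import shlex
from typing import List, Optional


def filtered_taxon_command(
    taxon: str,
    reference: bool = False,
    annotated: bool = False,
    assembly_level: Optional[str] = None,
    released_since: Optional[str] = None,
    search: Optional[str] = None,
) -> List[str]:
    """Build (but do not run) a filtered `datasets summary genome taxon` command."""
    command = ["datasets", "summary", "genome", "taxon", taxon]
    if reference:
        command.append("--reference")
    if annotated:
        command.append("--annotated")
    if assembly_level is not None:
        command += ["--assembly-level", assembly_level]
    if released_since is not None:
        command += ["--released-since", released_since]
    if search is not None:
        command += ["--search", search]
    return command


# Print shell-ready versions of two of the commands listed above.
print(shlex.join(filtered_taxon_command("human", reference=True)))
print(shlex.join(filtered_taxon_command("human", search="T2T Consortium")))
```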
For more information, see the Datasets Python API reference documentation.
All of the genome metadata retrieval functions support filtering. The examples here use the get_assembly_metadata_by_taxon method from ncbi-datasets-pylib to get genome metadata for all genomes that match the selected taxon and filter criteria.
from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_taxon
taxon_name = "human"
print(f"Reference assemblies for {taxon_name}:")
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_reference_only=True):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])
print(f"\nAnnotated assemblies for {taxon_name}:")
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_has_annotation=True):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])
# valid assembly levels are: ['chromosome', 'scaffold', 'contig', 'complete_genome']
print(f"\n{taxon_name} assemblies with complete (all chromosomes are gapless) genomes:")
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_assembly_level=["complete_genome"]):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])
print(f"\nassemblies for {taxon_name} released in 2017:")
for assembly in get_assembly_metadata_by_taxon(
    taxon_name,
    filters_first_release_date="2017-01-01T00:00:00.000Z",
    filters_last_release_date="2017-12-31T00:00:00.000Z",
):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])
# filters_search_text searches the species, infraspecies, assembly name, and submitter fields
print(f'\n{taxon_name} assemblies including text "T2T Consortium"')
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_search_text=["T2T Consortium"]):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "submitter", "assembly_level", "seq_length"])
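The release-date filters in the example above take timestamp strings with millisecond precision and a trailing Z. Rather than typing them by hand, they can be generated with the standard library. This is a sketch under stated assumptions: `release_timestamp` and `year_range` are hypothetical helpers, and the bounds simply mirror the 2017 example rather than documenting how the service treats the end of the range.

```python
from datetime import datetime, timezone
from typing import Tuple


def release_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SS.mmmZ in UTC."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def year_range(year: int) -> Tuple[str, str]:
    """Return timestamps for midnight UTC on January 1 and December 31 of a year."""
    first = datetime(year, 1, 1, tzinfo=timezone.utc)
    last = datetime(year, 12, 31, tzinfo=timezone.utc)
    return release_timestamp(first), release_timestamp(last)


first_release, last_release = year_range(2017)
print(first_release, last_release)
```

The two strings can then be supplied as the filters_first_release_date and filters_last_release_date arguments used above.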