Get genome metadata

Get genome metadata by accession, bioproject, or taxonomic name

Get assembled genome metadata from NCBI Datasets. Most genome metadata is included in the genome data report. Sequences that comprise an assembled genome are listed in a separate sequence report.

Using a taxonomic name

Get genome metadata for all the assembled genomes of an organism and its subspecies using the organism name or NCBI Taxonomy ID.

Run the following command to get genome metadata for all human genomes in JSON format:

datasets summary genome taxon human

Use quotes for taxon names that include spaces, such as Mus musculus:

datasets summary genome taxon 'mus musculus'
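The organism can also be given as an NCBI Taxonomy ID instead of a name. The following is a minimal sketch using 9606, the Taxonomy ID for Homo sapiens. The function name genome_summary_by_taxid and the DATASETS_BIN variable are hypothetical helpers, not part of the datasets tool; they are added here only so that the command can be previewed with echo before it is run.

```shell
# Hypothetical wrapper: query genome metadata by NCBI Taxonomy ID.
# DATASETS_BIN is an assumed override for previewing the command;
# it defaults to the real datasets executable.
genome_summary_by_taxid() {
  "${DATASETS_BIN:-datasets}" summary genome taxon "$1"
}
```

Call genome_summary_by_taxid 9606 to run the query, or DATASETS_BIN=echo genome_summary_by_taxid 9606 to print the arguments without contacting NCBI.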

Using a BioProject accession

Get genome metadata for assembled genomes belonging to an NCBI BioProject, for example, the Sanger 25 Genomes Project, PRJEB33226:

datasets summary genome accession PRJEB33226

Using an Assembly accession

Get genome metadata for the human reference genome, GRCh38, using an NCBI Assembly accession.

Run the following command to get genome metadata for GRCh38 in JSON format:

datasets summary genome accession GCF_000001405.40

Filtering by genome assembly properties

When getting genome metadata by taxon, Assembly accession, or BioProject accession, you can filter the results by genome assembly properties, including the following:

reference status
annotation status
assembly level
year released
infraspecies name
assembly name
submitter name

Get metadata for the human reference genome:

datasets summary genome taxon human --reference

Get metadata for annotated human genomes:

datasets summary genome taxon human --annotated

Get metadata for human genomes with the Assembly level of "complete genome" (all chromosomes are gapless):

datasets summary genome taxon human --assembly-level complete

Get metadata for human genomes released after January 1, 2020:

datasets summary genome taxon human --released-after 01/01/2020

Get metadata for human genomes submitted by the T2T Consortium:

datasets summary genome taxon human --search 'T2T Consortium'
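The filters above can be combined in a single command, in which case a genome is expected to appear only if it satisfies all of them. The following is a sketch that reuses three flags from the examples above. The function name filtered_human_genomes and the DATASETS_BIN variable are hypothetical helpers introduced for illustration.

```shell
# Hypothetical wrapper: annotated, complete human genomes released
# after January 1, 2020. Every flag is taken from the examples above.
# DATASETS_BIN is an assumed override for previewing the command.
filtered_human_genomes() {
  "${DATASETS_BIN:-datasets}" summary genome taxon human \
    --annotated \
    --assembly-level complete \
    --released-after 01/01/2020
}
```

Run DATASETS_BIN=echo filtered_human_genomes to check the assembled arguments before sending the request.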

Get metadata by taxonomic name and generate a table using dataformat

Get a table of selected metadata for shark genomes annotated by NCBI:

datasets summary genome taxon 'sharks' --assembly-source refseq --as-json-lines | dataformat tsv genome --fields accession,assminfo-name,annotinfo-name,annotinfo-release-date,organism-name

Output:

Assembly Accession	Assembly Name	Annotation Name	Annotation Release Date	Organism Name
GCF_017639515.1	sCarCar2.pri	NCBI Carcharodon carcharias Annotation Release 100	2021-04-27	Carcharodon carcharias
GCF_004010195.1	ASM401019v1	NCBI Chiloscyllium plagiosum Annotation Release 100	2021-09-13	Chiloscyllium plagiosum
GCF_020745735.1	sHemOce1.pat.X.cur.	GCF_020745735.1-RS_2023_11	2023-11-02	Hemiscyllium ocellatum
GCF_021869965.1	sRhiTyp1.1	NCBI Rhincodon typus Annotation Release 101	2022-05-27	Rhincodon typus
GCF_902713615.1	sScyCan1.1	NCBI Scyliorhinus canicula Annotation Release 100	2020-12-29	Scyliorhinus canicula
GCF_030684315.1	sSteTig4.hap1	GCF_030684315.1-RS_2023_09	2023-09-12	Stegostoma tigrinum

Note: Always use --as-json-lines when piping data from datasets to dataformat, because dataformat reads JSON Lines input.
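Because dataformat writes plain tab-separated text, the table can be filtered further with standard tools. The following is a sketch that uses awk on three rows copied from the output above to keep only genomes whose annotation was released in 2023 or later. The variable names shark_tsv and recent are illustrative.

```shell
# Three sample rows copied from the table above, rebuilt as
# tab-separated text with the header line first.
shark_tsv=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  'Assembly Accession' 'Assembly Name' 'Annotation Name' 'Annotation Release Date' 'Organism Name' \
  'GCF_017639515.1' 'sCarCar2.pri' 'NCBI Carcharodon carcharias Annotation Release 100' '2021-04-27' 'Carcharodon carcharias' \
  'GCF_020745735.1' 'sHemOce1.pat.X.cur.' 'GCF_020745735.1-RS_2023_11' '2023-11-02' 'Hemiscyllium ocellatum' \
  'GCF_030684315.1' 'sSteTig4.hap1' 'GCF_030684315.1-RS_2023_09' '2023-09-12' 'Stegostoma tigrinum')

# Skip the header (NR > 1) and compare column 4 as a string;
# ISO dates (YYYY-MM-DD) sort correctly in string order.
recent=$(printf '%s\n' "$shark_tsv" |
  awk -F'\t' 'NR > 1 && $4 >= "2023-01-01" { print $1 "\t" $5 }')

printf '%s\n' "$recent"
```

With these sample rows, the command prints the accession and organism name for Hemiscyllium ocellatum and Stegostoma tigrinum. To filter live results, pipe the dataformat output into the same awk command.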

Get the set of nucleotide accessions for one or more genome assemblies

Get the GenBank and RefSeq nucleotide accessions for a given genome assembly from the genome sequence report using dataformat:

datasets summary genome accession GCF_000006945.2 --report sequence --as-json-lines | dataformat tsv genome-seq --fields accession,genbank-seq-acc,refseq-seq-acc,chr-name

Output:

Assembly Accession	GenBank seq accession	RefSeq seq accession	Chromosome name
GCF_000006945.2	AE006468.2	NC_003197.2	chromosome
GCF_000006945.2	AE006471.2	NC_003277.2	pSLT
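To get the sequence report for more than one assembly, list several accessions in the same command. The following is a sketch that reuses the two Assembly accessions from this page. The function name sequence_report and the DATASETS_BIN variable are hypothetical helpers introduced for illustration.

```shell
# Hypothetical wrapper: request the sequence report for every
# Assembly accession given as an argument, as JSON Lines.
# DATASETS_BIN is an assumed override for previewing the command.
sequence_report() {
  "${DATASETS_BIN:-datasets}" summary genome accession "$@" \
    --report sequence --as-json-lines
}
```

Pipe the result to dataformat as in the example above: sequence_report GCF_000006945.2 GCF_000001405.40 | dataformat tsv genome-seq --fields accession,genbank-seq-acc,refseq-seq-acc,chr-name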

Generated November 25, 2024
