Get genome metadata
Get genome metadata by accession, BioProject, or taxonomic name
Get assembled genome metadata from NCBI Datasets. Most genome metadata is included in the genome data report. Sequences that comprise an assembled genome are listed in a separate sequence report.
Using a taxonomic name
Get genome metadata for all the assembled genomes of an organism and its subspecies using the organism name or NCBI Taxonomy ID.
Run the following command to get genome metadata for all human genomes in JSON format:
datasets summary genome taxon human
Use quotes for taxon names that include spaces, such as mus musculus:
datasets summary genome taxon 'mus musculus'
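The effect of the quotes can be checked in the shell itself. The sketch below uses a hypothetical helper, count_args, which is not part of the datasets CLI and only reports how many arguments the shell hands to a command:

```shell
# count_args is a hypothetical helper for illustration only;
# it is not part of the datasets CLI.
count_args() {
  echo "$#"
}

count_args mus musculus      # unquoted: the shell passes two arguments
count_args 'mus musculus'    # quoted: the shell passes one argument
```

The first call prints 2 and the second prints 1, which is why an unquoted multi-word name would not reach datasets as a single taxon.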
Using a BioProject accession
Get genome metadata for assembled genomes belonging to an NCBI BioProject, for example, the Sanger 25 Genomes Project, PRJEB33226:
datasets summary genome accession PRJEB33226
Using an Assembly accession
Get genome metadata for the human reference genome, GRCh38, using an NCBI Assembly accession.
Run the following command to get genome metadata for GRCh38 in JSON format:
datasets summary genome accession GCF_000001405.40
Filtering by genome assembly properties
When getting genome metadata by taxon, Assembly accession, or BioProject accession, you can filter the results by genome assembly properties, including the following:
- reference status
- annotation status
- assembly level
- year released
- infraspecies name
- assembly name
- submitter name
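Get metadata for the human reference genome:
datasets summary genome taxon human --reference
Get metadata for annotated human genomes:
datasets summary genome taxon human --annotated
Get metadata for complete human genome assemblies:
datasets summary genome taxon human --assembly-level complete
Get metadata for human genomes released after January 1, 2020:
datasets summary genome taxon human --released-after 01/01/2020
Get metadata for human genomes that match the search text 'T2T Consortium', for example in the submitter name:
datasets summary genome taxon human --search 'T2T Consortium'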
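To compare results across assembly levels, the same command can be repeated with a different --assembly-level value each time. The following is a dry-run sketch that only prints the commands rather than running them. The function name print_level_commands is hypothetical, and the level names other than complete are assumptions that should be checked against the output of datasets summary genome taxon --help:

```shell
# Dry run: print one command per assembly level instead of executing it.
# print_level_commands is a hypothetical helper. The level names other than
# "complete" are assumptions; confirm them with the CLI help before use.
print_level_commands() {
  taxon="$1"
  for level in complete chromosome scaffold contig; do
    printf "datasets summary genome taxon '%s' --assembly-level %s\n" "$taxon" "$level"
  done
}

print_level_commands human
```

Printing the commands first makes it easy to review them before sending any requests to NCBI.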
Get metadata by taxonomic name and generate a table using dataformat
Get a table of selected metadata for shark genomes annotated by NCBI:
datasets summary genome taxon 'sharks' --assembly-source refseq --as-json-lines | dataformat tsv genome --fields accession,assminfo-name,annotinfo-name,annotinfo-release-date,organism-name
Output:
Assembly Accession	Assembly Name	Annotation Name	Annotation Release Date	Organism Name
GCF_017639515.1	sCarCar2.pri	NCBI Carcharodon carcharias Annotation Release 100	2021-04-27	Carcharodon carcharias
GCF_004010195.1	ASM401019v1	NCBI Chiloscyllium plagiosum Annotation Release 100	2021-09-13	Chiloscyllium plagiosum
GCF_020745735.1	sHemOce1.pat.X.cur.	GCF_020745735.1-RS_2023_11	2023-11-02	Hemiscyllium ocellatum
GCF_021869965.1	sRhiTyp1.1	NCBI Rhincodon typus Annotation Release 101	2022-05-27	Rhincodon typus
GCF_902713615.1	sScyCan1.1	NCBI Scyliorhinus canicula Annotation Release 100	2020-12-29	Scyliorhinus canicula
GCF_030684315.1	sSteTig4.hap1	GCF_030684315.1-RS_2023_09	2023-09-12	Stegostoma tigrinum
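Because dataformat writes tab-separated values, the table can be filtered further with standard tools. The sketch below is one possible approach, not part of the datasets tooling: sample_table is a hypothetical stand-in that reproduces three rows of the output above, and awk keeps only the rows whose annotation was released in 2023 or later:

```shell
# sample_table is a hypothetical stand-in for the datasets | dataformat pipeline;
# its rows are copied from the output above, with tab-separated columns.
sample_table() {
  printf 'Assembly Accession\tAssembly Name\tAnnotation Name\tAnnotation Release Date\tOrganism Name\n'
  printf 'GCF_017639515.1\tsCarCar2.pri\tNCBI Carcharodon carcharias Annotation Release 100\t2021-04-27\tCarcharodon carcharias\n'
  printf 'GCF_020745735.1\tsHemOce1.pat.X.cur.\tGCF_020745735.1-RS_2023_11\t2023-11-02\tHemiscyllium ocellatum\n'
  printf 'GCF_030684315.1\tsSteTig4.hap1\tGCF_030684315.1-RS_2023_09\t2023-09-12\tStegostoma tigrinum\n'
}

# Skip the header row (NR > 1), then print the accession and organism name
# for annotations released in 2023 or later. Dates in YYYY-MM-DD form sort
# correctly as plain strings, so a string comparison is sufficient.
sample_table | awk -F '\t' 'NR > 1 && $4 >= "2023-01-01" { print $1 "\t" $5 }'
```

With real data, the sample_table call would be replaced by the datasets and dataformat pipeline shown earlier.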
Note: Always use --as-json-lines when piping data from datasets to dataformat.
Get the set of nucleotide accessions for one or more genome assemblies
Get the GenBank and RefSeq nucleotide accessions for a given genome assembly from the genome sequence report using dataformat:
datasets summary genome accession GCF_000006945.2 --report sequence --as-json-lines | dataformat tsv genome-seq --fields accession,genbank-seq-acc,refseq-seq-acc,chr-name
Output:
Assembly Accession	GenBank seq accession	RefSeq seq accession	Chromosome name
GCF_000006945.2	AE006468.2	NC_003197.2	chromosome
GCF_000006945.2	AE006471.2	NC_003277.2	pSLT
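If only the RefSeq accessions are needed, for example as input to another tool, the header row can be dropped and a single column kept. A minimal sketch, assuming tab-separated output; sample_seq_report is a hypothetical stand-in that reproduces the rows above:

```shell
# sample_seq_report is a hypothetical stand-in for the datasets | dataformat
# pipeline; its rows are copied from the output above, tab-separated.
sample_seq_report() {
  printf 'Assembly Accession\tGenBank seq accession\tRefSeq seq accession\tChromosome name\n'
  printf 'GCF_000006945.2\tAE006468.2\tNC_003197.2\tchromosome\n'
  printf 'GCF_000006945.2\tAE006471.2\tNC_003277.2\tpSLT\n'
}

# Drop the header row, then keep only the third column (RefSeq accessions).
sample_seq_report | tail -n +2 | cut -f3
```

This prints NC_003197.2 and NC_003277.2, one per line.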