NCBI Datasets Taxonomy Package

Taxonomic metadata for a set of requested taxa.

The NCBI Datasets Taxonomy Data Package contains metadata for the requested taxa, which can be specified by NCBI TaxID, scientific name or common name. In addition to the taxonomy report, the data package can be customized to include the names report in JSON Lines format and a subset of the metadata in tabular format.

Package Content

This example shows the contents of the taxonomy data package for the genus Drosophila (taxid 7215):

datasets download taxonomy taxon 7215 --filename 7215.zip
unzip 7215.zip -d 7215
tree 7215

7215
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- dataset_catalog.json
        |-- taxonomy_report.jsonl
        `-- taxonomy_summary.tsv

Taxonomy report

The taxonomy report contains metadata describing the taxonomic classification, parent and child nodes (when applicable), and counts of assemblies, genes and other genomic features. The file is in JSON Lines format, where each line is the metadata for one taxonomic entity. Use the dataformat tool to convert selected fields to a tabular format.

  • Path: ncbi_dataset/data/taxonomy_report.jsonl
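
Because each line of the report is an independent JSON document, the file can be read one record at a time. The following is a minimal Python sketch, assuming the package was unzipped into the 7215 directory as in the example above. The helper name read_jsonl is hypothetical, and no field names are assumed because the report's schema is not listed on this page.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location: the package unzipped into ./7215 as in the example above.
    report = Path("7215/ncbi_dataset/data/taxonomy_report.jsonl")
    if report.exists():
        for record in read_jsonl(report):
            # Print only the top-level keys; the schema itself is not assumed here.
            print(sorted(record))
```

Reading lazily with a generator keeps memory use flat even when many taxa were requested.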

Taxonomy summary

The taxonomy summary table is a tabular representation of a subset of metadata in the taxonomy report. Each row of the data table represents one NCBI Taxonomic ID. The columns in the data table are listed below:

Taxid
Tax name
Authority
Rank
Basionym
Basionym authority
Curator common name
Has type material
Group name
Superkingdom name
Superkingdom taxid
Kingdom name
Kingdom taxid
Phylum name
Phylum taxid
Class name
Class taxid
Order name
Order taxid
Family name
Family taxid
Genus name
Genus taxid
Species name
Species taxid

  • Path: ncbi_dataset/data/taxonomy_summary.tsv
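
The summary table can be loaded with any tab-separated reader. The sketch below uses Python's csv module and assumes that the header row of the file uses the column names exactly as listed above (for example "Rank"); that assumption, the unzip location 7215, and the helper names read_summary and count_by_column are illustrative rather than part of the package specification.

```python
import csv
import os


def read_summary(path):
    """Return the summary table as a list of dicts keyed by column header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def count_by_column(rows, column):
    """Count rows per value of one column; missing or empty values count under ''."""
    counts = {}
    for row in rows:
        value = row.get(column) or ""
        counts[value] = counts.get(value, 0) + 1
    return counts


if __name__ == "__main__":
    # Assumed location and assumed header name "Rank" (taken from the column list).
    path = "7215/ncbi_dataset/data/taxonomy_summary.tsv"
    if os.path.exists(path):
        print(count_by_column(read_summary(path), "Rank"))
```

If the header in a real file differs from the names listed here, row.get returns nothing for that column and every row is counted under the empty string, which makes a mismatch easy to spot.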

Names report

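The names report describes the current scientific name, type material, basionym and authority, as well as the rank and taxonomic ID. The file is in JSON Lines format, where each line describes one NCBI taxonomic ID. The names report is optional and is present only when the package is customized to include it, so it does not appear in the default listing shown under Package Content.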

  • Path: ncbi_dataset/data/names_report.jsonl

README.md

The README contains a general project description common to all data packages.

  • Path: README.md

Dataset catalog

The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.

  • Path: ncbi_dataset/data/dataset_catalog.json

MD5 checksum file

The MD5 checksum file contains MD5 hash values for each file contained in the data package after decompression. These hash values can be used as a checksum to verify that a file has not changed as the result of an error during download or decompression. Each line of the MD5 checksum file corresponds to a file in the package after decompression, where the first column contains the MD5 hash value and the second column contains the path to the file.

  • Path: md5sum.txt
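
The check described above can be scripted. The following Python sketch recomputes each hash with hashlib and compares it with the recorded value. It assumes that the two columns are separated by whitespace and that the paths in md5sum.txt are relative to the directory containing that file; the function names md5_of and verify_package are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(root, checksum_file="md5sum.txt"):
    """Return the listed paths that are missing or whose MD5 does not match."""
    root = Path(root)
    failures = []
    lines = (root / checksum_file).read_text(encoding="utf-8").splitlines()
    for line in lines:
        if not line.strip():
            continue
        # First column: hash value. Second column: path (assumed relative to root).
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip().lstrip("*")  # drop md5sum's binary-mode marker
        target = root / rel_path
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    # Assumed location: the package unzipped into ./7215 as in the example above.
    if Path("7215/md5sum.txt").exists():
        print(verify_package("7215") or "all files match")
```

An empty result means every listed file is present and unchanged; any returned path should be downloaded or decompressed again.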

Generated November 25, 2024