Download large genome data packages
Use the command line to get large NCBI Datasets Genome Data Packages
If you want to download genome data for more than 1,000 genomes, or if the genome data package exceeds 15 GB, you’ll need to use the datasets command-line tool.
Use the datasets command-line tool to download a large NCBI Datasets Genome Data Package as a dehydrated zip archive that contains only metadata and the location of the data on NCBI servers. You can get the data in three steps:
- Download the dehydrated zip archive.
- Unzip the downloaded zip archive.
- Rehydrate the extracted directory to retrieve the data.
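For scripted use, the three steps can be chained in a single shell function. The following is a minimal sketch, not an official workflow: `fetch_genome_package` is a hypothetical helper name, and the archive name is derived from the output directory purely for illustration. It uses only the datasets and unzip commands shown in the steps on this page.

```shell
#!/usr/bin/env bash
# Sketch: chain download, unzip and rehydrate for one assembly accession.
# fetch_genome_package is a hypothetical helper, not part of the datasets tool.
fetch_genome_package() {
  local accession="$1"            # e.g. GCF_000001405.40
  local out_dir="${2%/}"          # e.g. my_human_dataset (trailing slash removed)
  local archive="${out_dir}.zip"  # assumed naming convention for the archive

  # 1. Download the dehydrated archive (metadata and file locations only).
  datasets download genome accession "$accession" --dehydrated --filename "$archive" || return 1
  # 2. Unzip it into the target directory (-q suppresses the file listing).
  unzip -q "$archive" -d "$out_dir" || return 1
  # 3. Rehydrate: retrieve the files referenced by the dehydrated package.
  datasets rehydrate --directory "$out_dir" || return 1
}

# Usage:
#   fetch_genome_package GCF_000001405.40 my_human_dataset
```

Each step returns early on failure, so an interrupted download does not leave the script attempting to rehydrate a directory that was never extracted.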
1. Download
Download a dehydrated data package (< 5 KB) for the human GRCh38 RefSeq genome using the datasets command-line tool:
datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
2. Unzip
Unzip the dehydrated zip archive to a directory, for example my_human_dataset:
unzip human_GRCh38_dataset.zip -d my_human_dataset
The output will look like this:
Archive: human_GRCh38_dataset.zip
inflating: my_human_dataset/README.md
inflating: my_human_dataset/ncbi_dataset/data/GCF_000001405.40/assembly_data_report.jsonl
inflating: my_human_dataset/ncbi_dataset/data/dataset_catalog.json
inflating: my_human_dataset/ncbi_dataset/fetch.txt
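Before rehydrating, it can be useful to check how many files the package will retrieve. The sketch below counts the entries in fetch.txt, assuming that each non-empty line of that file describes one file to download; this layout is an assumption and is not stated on this page. `count_fetch_entries` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count the entries listed in a dehydrated package's fetch.txt.
# Assumes one file entry per non-empty line (not confirmed by this page).
count_fetch_entries() {
  local fetch_file="${1%/}/ncbi_dataset/fetch.txt"

  if [ ! -f "$fetch_file" ]; then
    echo "fetch.txt not found: $fetch_file" >&2
    return 1
  fi

  # NF is non-zero only for lines that contain at least one field.
  awk 'NF { n++ } END { print n + 0 }' "$fetch_file"
}

# Usage:
#   count_fetch_entries my_human_dataset
```

If the assumption holds, the number printed should correspond to the file count that the rehydrate command reports when it starts.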
3. Rehydrate
Run the rehydrate command to get the full genome data package, including genome sequences and annotation:
datasets rehydrate --directory my_human_dataset/
A progress meter shows the number of files to retrieve and how many have been completed. When rehydration is complete, the output looks like this:
Found 43 files for rehydration
Completed 43 of 43 [================================================] 100%
Downloading: my_human_dataset/ncbi_dataset/data/GCF_000001405.40/chr6.fna 173MB done
Downloading: my_human_dataset/ncbi_dataset/data/GCF_000001405.40/chr5.fna 184MB done
Downloading: my_human_dataset/ncbi_dataset/data/GCF_000001405.40/chrX.fna 158MB done
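After rehydration, a quick sanity check is to count the sequence files that arrived on disk. This sketch counts files with the .fna extension, which is taken from the sample output above; other file types in the package are not counted. `count_fasta_files` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count rehydrated FASTA (.fna) files under ncbi_dataset/data.
count_fasta_files() {
  local data_dir="${1%/}/ncbi_dataset/data"

  if [ ! -d "$data_dir" ]; then
    echo "data directory not found: $data_dir" >&2
    return 1
  fi

  # awk prints the number of paths that find emitted (0 if none).
  find "$data_dir" -type f -name '*.fna' | awk 'END { print NR }'
}

# Usage:
#   count_fasta_files my_human_dataset
#   du -sh my_human_dataset   # total size of the rehydrated package
```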