File validation
This page describes the two ways to validate the files included in an NCBI Datasets data package and explains how to confirm the integrity of the data you downloaded.
The two options are:
- Built-in client validation run by the datasets command-line tool immediately after the completion of a download
- User-initiated validation using the MD5 checksum file that is included in every data package
Built-in client validation
To use built-in client validation, no action is necessary. Since version 16.31.0, validation of the data package structure and all included files is turned on by default. This validation process automatically runs after the completion of a download.
For example, after downloading the human reference genome, validation runs automatically:
datasets download genome accession GCF_000001405.40 --filename GRCh38.zip
Collecting 1 genome record [================================================] 100% 1/1
Downloading: GRCh38.zip 973MB valid data package
Validating package files [================================================] 100% 5/5
When validation completes, “valid data package” is printed to the right of the size of the downloaded zip file.
Because file validation adds time to the download process, you may want to skip it by using the --fast-zip-validation flag. When this flag is used, the zip archive structure is still validated, but the included files are not checked.
For example:
datasets download genome accession GCF_000001405.40 --filename GRCh38.zip --fast-zip-validation
Collecting 1 genome record [================================================] 100% 1/1
Downloading: GRCh38.zip 973MB valid zip structure -- files not checked
Validating package [================================================] 100% 5/5
After validation of the zip archive structure completes, “valid zip structure -- files not checked” is printed to the right of the size of the downloaded zip file.
User-initiated validation using the MD5 checksum file
You can also validate the integrity of the files in the data package using the md5sum tool with the MD5 checksum file, md5sum.txt, that is now included in all NCBI Datasets data packages. The md5sum tool is included in most Linux distributions.
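On systems where md5sum is not installed, a small wrapper can pick whichever MD5 tool is present. The sketch below is an illustration under stated assumptions, not part of NCBI Datasets: file_md5 is a hypothetical helper name, and the fallback assumes the BSD/macOS md5 command with its -q option.

```shell
# file_md5 is a hypothetical helper name used only for this sketch.
# It prints the MD5 hash of one file using whichever tool is available.
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        # md5sum prints "<hash>  <file>"; keep only the hash.
        md5sum "$1" | cut -d ' ' -f 1
    elif command -v md5 >/dev/null 2>&1; then
        # Assumption: BSD/macOS md5, where -q prints only the hash.
        md5 -q "$1"
    else
        echo "no MD5 tool found" >&2
        return 1
    fi
}
```

Once a package has been extracted, running file_md5 on one of its files prints a hash that can be compared with the corresponding entry in md5sum.txt.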
To validate the files yourself using the MD5 checksum file, follow the steps below.
After downloading a data package as shown above, first unzip the archive:
unzip GRCh38.zip -d GRCh38
Archive: GRCh38.zip
inflating: GRCh38/README.md
inflating: GRCh38/ncbi_dataset/data/assembly_data_report.jsonl
inflating: GRCh38/ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna
inflating: GRCh38/ncbi_dataset/data/dataset_catalog.json
inflating: GRCh38/md5sum.txt
Then, change your working directory to the directory containing the extracted archive:
cd GRCh38
Next, run md5sum to calculate the checksum of each file and compare it to the MD5 hash value in md5sum.txt:
md5sum -c md5sum.txt
ncbi_dataset/data/assembly_data_report.jsonl: OK
ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna: OK
ncbi_dataset/data/dataset_catalog.json: OK
The text “OK” is shown after each file to indicate that the MD5 hash value calculated for that file matches the hash value listed in md5sum.txt.
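For scripted workflows, the md5sum check can be wrapped so that only failures are reported and the result is available as an exit status. This is a minimal sketch, not an NCBI-provided tool: verify_package is a hypothetical name, and it assumes GNU md5sum, whose --quiet option suppresses the OK lines.

```shell
# verify_package is a hypothetical helper name, not part of the datasets CLI.
# Argument: the directory into which a data package was extracted.
verify_package() {
    pkg_dir="$1"
    # Use a subshell so the caller's working directory is left unchanged.
    (
        cd "$pkg_dir" || exit 2
        # Paths in md5sum.txt are relative to the package root.
        # --quiet prints only files that fail; the exit status is
        # non-zero if any file is missing or does not match.
        md5sum -c --quiet md5sum.txt
    )
}
```

Calling verify_package GRCh38 && echo "all files match" from the directory where the archive was extracted prints the message only if every file listed in md5sum.txt matches. This may be useful after a download made with --fast-zip-validation, where the included files were not checked.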