File validation

NCBI Datasets offers two ways to validate downloaded files

NCBI Datasets provides two ways to validate the files included in a data package. This page describes both options and explains how to confirm the integrity of the data you downloaded.

The two options for validating a downloaded data package are:

  1. Built-in client validation run by the datasets command-line tool immediately after the completion of a download
  2. User-initiated validation using the MD5 checksum file that is included in every data package

Built-in client validation

Built-in client validation requires no action. Since version 16.31.0, validation of the data package structure and all included files is turned on by default and runs automatically as soon as a download completes.

For example, after downloading the human reference genome, validation runs automatically:

datasets download genome accession GCF_000001405.40 --filename GRCh38.zip
Collecting 1 genome record [================================================] 100% 1/1
Downloading: GRCh38.zip    973MB valid data package
Validating package files [================================================] 100% 5/5

When validation completes, “valid data package” is printed to the right of the size of the downloaded zip file.

Because file validation can lengthen the download process, you may want to skip it by using the --fast-zip-validation flag. With this flag, the zip archive structure is still validated, but the included files are not checked. For example:

datasets download genome accession GCF_000001405.40 --filename GRCh38.zip --fast-zip-validation
Collecting 1 genome record [================================================] 100% 1/1
Downloading: GRCh38.zip    973MB valid zip structure -- files not checked
Validating package [================================================] 100% 5/5

After validation of the zip archive structure completes, “valid zip structure -- files not checked” is printed to the right of the size of the downloaded zip file.

User-initiated validation using the MD5 checksum file

You can also validate the integrity of the files in the data package using the md5sum tool with the MD5 checksum file, md5sum.txt, that is now included in all NCBI Datasets data packages. The md5sum tool is included in most Linux distributions.

To validate the files yourself using the MD5 checksum file, follow these steps.

After downloading a data package as shown above, first unzip the archive:

unzip GRCh38.zip -d GRCh38
Archive:  GRCh38.zip
  inflating: GRCh38/README.md
  inflating: GRCh38/ncbi_dataset/data/assembly_data_report.jsonl
  inflating: GRCh38/ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna
  inflating: GRCh38/ncbi_dataset/data/dataset_catalog.json
  inflating: GRCh38/md5sum.txt

Then, change your working directory to the directory containing the extracted archive:

cd GRCh38

Next, run md5sum to calculate the checksum of each file and compare it to the corresponding MD5 hash value in md5sum.txt:

md5sum -c md5sum.txt
ncbi_dataset/data/assembly_data_report.jsonl: OK
ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna: OK
ncbi_dataset/data/dataset_catalog.json: OK

The text “OK” after each file name indicates that the MD5 hash value calculated for that file matches the hash value recorded in md5sum.txt.
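For scripted use, the manual check above can be wrapped in a small shell function. The following is a minimal sketch, not part of the datasets command-line tool: the function name verify_package is hypothetical, and it assumes the GNU coreutils version of md5sum, whose --quiet option suppresses the per-file "OK" lines and which exits with a nonzero status when any file fails the check.

```shell
# Hypothetical helper (not provided by NCBI Datasets): verify an extracted
# data package directory against the md5sum.txt file it contains.
# Assumes GNU coreutils md5sum.
verify_package() {
    pkg_dir="$1"

    if [ ! -f "$pkg_dir/md5sum.txt" ]; then
        echo "md5sum.txt not found in $pkg_dir" >&2
        return 2
    fi

    # Run in a subshell so the caller's working directory is unchanged.
    # --quiet hides the "OK" lines; mismatched files are still reported.
    if (cd "$pkg_dir" && md5sum --quiet -c md5sum.txt); then
        echo "all files match md5sum.txt"
    else
        echo "checksum verification failed" >&2
        return 1
    fi
}
```

With the archive extracted as shown above, calling verify_package GRCh38 should print a single confirmation line when every file matches, and return a nonzero status otherwise, which lets a pipeline stop before using corrupted data.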

Generated November 25, 2024