Health
Pathogen Detection
Help
GCP
Isolates Browser data at Google Cloud Platform

Isolates Browser data at Google Cloud Platform

ALPHA RELEASE -- This is under active development and while we strive to maintain correctness, it is possible results may be unstable, unavailable, or incorrect at times. Please contact us by email at [email protected] before relying on this data for production analyses.

What data is available on the Google Cloud?
- Pathogen Detection Resources available on the Google Cloud
- Update Frequency
Getting started with BigQuery
ncbi-pathogen-detect.pdbrowser.isolates
ncbi-pathogen-detect.pdbrowser.isolate_exceptions
Complex fields
Linking to MicroBIGG-E data
Example searches
Example searches that link between MicroBIGG-E and Isolates Browser data in BigQuery
- Join with the MicroBIGG-E table to find isolates that are carbapenem resistant but don't have a known carbapenem resistance gene or allele

What data is available on the Google Cloud?

For a list of all resources see Pathogen Detection Resources at Google Cloud Platform

Data from the Pathogen Detection Isolates Browser is now available at Google Cloud Platform (GCP) in the ncbi-pathogen-detect.pdbrowser.isolates and ncbi-pathogen-detect.pdbrowser.isolate_exceptions tables at Google BigQuery. This data includes all fields available in the web browser and can be searched using Google Standard SQL instead of the SOLR Query Language. This also permits programmatic access and more complex queries. BigQuery will also allow you to download tables exceeding the 100,000 row limit for the Isolates Browser web download. NCBI is piloting this in BigQuery to help users leverage the benefits of elastic scaling and parallel execution of queries. BigQuery has a large collection of client libraries that can be used within your workflow. You can also interact with it on a web browser as described below.

Pathogen Detection Resources available on the Google Cloud

Update Frequency

The isolates and isolate_exceptions tables at Google Cloud BigQuery are updated daily. For this reason the contents may not agree exactly with those shown in the Isolates Browser. If you see unexpected discrepancies please let us know by emailing us at [email protected].

Getting started with BigQuery

Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.

`ncbi-pathogen-detect.pdbrowser.isolates`

This data corresponds to the main table of the Pathogen Detection Isolates Browser. Detailed information on the contents of the fields in this table is included in the documentation for the Isolates Browser web interface. Some fields (AMR_genotypes, virulence_genotypes, stress_genotypes, and AST_phenotypes are complex and contain AST_phenotypes and contain arrays of structs. The computed_types field contains a single struct. See the examples below for ideas of how to search those fields using Google Standard SQL.

`ncbi-pathogen-detect.pdbrowser.isolate_exceptions`

The isolate_exceptions table contains information on isolates that failed to process in our system becuase of quality control (QC) failures. See the Exceptions table documentation for the Isolates Browser web interface for more details.

Complex fields

The AMR_genotypes, stress_genotypes, and virulence_genotypes fields are arrays of structs, which require syntax that you might not be familiar with coming from other SQL-based databases. The AST_phenoypes field is a struct which has fields that are accessed by appending the subfield value to the field value like AST_phenotypes.serotype See the examples below for how to search based on those fields.

Linking to MicroBIGG-E data

NCBI Pathogen Detection also has MicroBIGG-E data in Google Cloud BigQuery and links can be made between the ncbi-pathogen-detect.pdbrowser.microbigge table and the ncbi-pathogen-detect.pdbrowser.isolates tables. Several fields are in common, but generally we recommend joining the two tables using the target_acc field. See below for examples.

Example searches

Find the AMR genes from an isolate of interest

SELECT biosample_acc, asm_acc, target_acc, mindiff, AMR_genotypes,
    stress_genotypes, virulence_genotypes, computed_types
FROM `ncbi-pathogen-detect.pdbrowser.isolates`
WHERE biosample_acc = 'SAMN08848639'

Find an isolate in isolate_exceptions

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.isolate_exceptions`
WHERE biosample_acc = 'SAMN16279178'

Find all Salmonella Newport isolates

SELECT target_acc, biosample_acc, erd_group, computed_types
FROM `ncbi-pathogen-detect.pdbrowser.isolates`
WHERE computed_types.serotype = 'Newport'
AND taxgroup_name LIKE 'Salmonella%'

Find the most common SNP clusters for Salmonella Newport isolates

SELECT erd_group, count(*) num_newport_isolates
FROM `ncbi-pathogen-detect.pdbrowser.isolates`
WHERE computed_types.serotype = 'Newport'
    AND taxgroup_name LIKE 'Salmonella%'
GROUP BY erd_group
ORDER BY num_newport_isolates DESC
LIMIT 5

Find isolates that are carbapenem resistant but don't have a KPC or NDM beta-lactamase

SELECT target_acc 
FROM `ncbi-pathogen-detect.pdbrowser.isolates` isolates
WHERE
(SELECT COUNT(1)
  FROM UNNEST(isolates.AST_phenotypes) 
  WHERE antibiotic LIKE '%penem' AND phenotype = 'resistant'
) >= 1
AND
(SELECT COUNT(1)
  FROM UNNEST(isolates.AMR_genotypes)
  WHERE element LIKE 'blaKPC%' OR element LIKE 'blaNDM%'
) = 0

Find all clusters where >75% have a blaKPC

SELECT kpc_count.erd_group, round(kpc_count.num / erd_size.num * 100) as pct_with_kpc, erd_size.num as snp_cluster_size
FROM
  (SELECT i1.erd_group, count(*) num
      FROM `ncbi-pathogen-detect.pdbrowser.isolates` i1
      WHERE (
        (SELECT COUNT(1)
        FROM UNNEST(AMR_genotypes) AS AMR_genotypes
        WHERE element LIKE 'blaKPC%'
        ) >= 1
      )
      GROUP BY erd_group
  ) kpc_count
  LEFT JOIN
    (SELECT erd_group, count(*) num
      FROM `ncbi-pathogen-detect.pdbrowser.isolates`
      GROUP BY erd_group
    ) erd_size
  ON kpc_count.erd_group = erd_size.erd_group
WHERE
  kpc_count.num / erd_size.num > 0.75
ORDER BY
  pct_with_kpc DESC, snp_cluster_size DESC

Example searches that link between MicroBIGG-E and Isolates Browser data in BigQuery

Join with the MicroBIGG-E table to find isolates that are carbapenem resistant but don't have a known carbapenem resistance gene or allele

SELECT isolates.target_acc,
  ARRAY(select AS STRUCT antibiotic, phenotype from UNNEST(AST_phenotypes) WHERE  antibiotic LIKE "%penem") AST
FROM `ncbi-pathogen-detect.pdbrowser.isolates` isolates
LEFT JOIN `ncbi-pathogen-detect.pdbrowser.microbigge` microbigge
  ON isolates.target_acc = microbigge.target_acc
  AND microbigge.subclass = 'CARBAPENEM' -- Only carbapenem genes / point mutations
WHERE
  (SELECT count(1) FROM unnest(AST_phenotypes) AS ast
    WHERE antibiotic like "%penem" AND phenotype = 'resistant') >= 1
  AND isolates.amrfinderplus_version IS NOT NULL -- AMRFinderPlus was run on this target
  AND isolates.asm_acc IS NOT NULL -- AMRFinderPlus results should be in MicroBIGG-E because assembly is public
  AND microbigge.subclass IS NULL -- There are no rows in MicroBIGG-E with subclass = CARBAPENEM
ORDER BY isolates.target_acc