Isolates Browser data at Google Cloud Platform
ALPHA RELEASE -- This is under active development and while we strive to maintain correctness, it is possible results may be unstable, unavailable, or incorrect at times. Please contact us by email at [email protected] before relying on this data for production analyses.
- What data is available on the Google Cloud?
- Getting started with BigQuery
- ncbi-pathogen-detect.pdbrowser.isolates
- ncbi-pathogen-detect.pdbrowser.isolate_exceptions
- Complex fields
- Linking to MicroBIGG-E data
- Example searches
- Find the AMR genes from an isolate of interest
- Find an isolate in isolate_exceptions
- Find all Salmonella Newport isolates
- Find the most common SNP clusters for Salmonella Newport isolates
- Find isolates that are carbapenem resistant but don't have a KPC or NDM beta-lactamase
- Find all clusters where >75% have a blaKPC
- Example searches that link between MicroBIGG-E and Isolates Browser data in BigQuery
What data is available on the Google Cloud?
For a list of all resources see Pathogen Detection Resources at Google Cloud Platform
Data from the Pathogen Detection Isolates Browser is now available at Google Cloud Platform (GCP) in the ncbi-pathogen-detect.pdbrowser.isolates
and ncbi-pathogen-detect.pdbrowser.isolate_exceptions
tables at Google BigQuery. This data includes all fields available in the web browser and can be searched using Google Standard
SQL
instead of the SOLR Query Language. This
also permits programmatic access and more complex queries.
BigQuery will also allow you to download tables exceeding the 100,000 row limit
for the Isolates Browser web
download. NCBI is
piloting this in BigQuery to help users leverage the benefits of elastic
scaling and parallel execution of queries. BigQuery has a large collection of
client libraries that can be used within your workflow. You can also interact
with it on a web browser as described below.
Pathogen Detection Resources available on the Google Cloud
- Pathogen Detection Resources at Google Cloud Platform
- Getting started with BigQuery
- MicroBIGG-E table in BigQuery
- MicroBIGG-E contig sequences in Google Storage buckets
- MicroBIGG-E protein sequences in Google Storage buckets
- Isolates Browser table in BigQuery
- Isolate Exceptions table in BigQuery
- BioProject Hierarchy in BigQuery
Update Frequency
The isolates and isolate_exceptions tables at Google Cloud BigQuery are updated daily. For this reason the contents may not agree exactly with those shown in the Isolates Browser. If you see unexpected discrepancies please let us know by emailing us at [email protected].
Getting started with BigQuery
Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.
ncbi-pathogen-detect.pdbrowser.isolates
This data corresponds to the main table of the Pathogen Detection Isolates
Browser. Detailed information on the contents of the
fields in this table is included in the documentation for the Isolates Browser
web interface. Some
fields (AMR_genotypes
, virulence_genotypes
, stress_genotypes
, and
AST_phenotypes
are complex and contain AST_phenotypes
and contain arrays of
structs. The computed_types field contains a single struct. See the examples
below for ideas of how to search those fields using Google Standard
SQL.
ncbi-pathogen-detect.pdbrowser.isolate_exceptions
The isolate_exceptions
table contains information on isolates that failed to process in our system becuase of quality control (QC) failures. See the Exceptions table documentation for the Isolates Browser web interface for more details.
Complex fields
The AMR_genotypes
, stress_genotypes
, and virulence_genotypes
fields are
arrays
of structs, which require syntax that you might not be familiar with coming
from other SQL-based databases. The AST_phenoypes
field is a struct which has
fields that are accessed by appending the subfield value to the field value
like AST_phenotypes.serotype
See the examples below for how to search based
on those fields.
Linking to MicroBIGG-E data
NCBI Pathogen Detection also has MicroBIGG-E data in Google Cloud BigQuery and links can be made between the ncbi-pathogen-detect.pdbrowser.microbigge
table and the ncbi-pathogen-detect.pdbrowser.isolates
tables. Several fields are in common, but generally we recommend joining the two tables using the target_acc
field. See below for examples.
Example searches
Find the AMR genes from an isolate of interest
SELECT biosample_acc, asm_acc, target_acc, mindiff, AMR_genotypes,
stress_genotypes, virulence_genotypes, computed_types
FROM `ncbi-pathogen-detect.pdbrowser.isolates`
WHERE biosample_acc = 'SAMN08848639'
Find an isolate in isolate_exceptions
SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.isolate_exceptions`
WHERE biosample_acc = 'SAMN16279178'
Find all Salmonella Newport isolates
SELECT target_acc, biosample_acc, erd_group, computed_types
FROM `ncbi-pathogen-detect.pdbrowser.isolates`
WHERE computed_types.serotype = 'Newport'
AND taxgroup_name LIKE 'Salmonella%'
Find the most common SNP clusters for Salmonella Newport isolates
SELECT erd_group, count(*) num_newport_isolates
FROM `ncbi-pathogen-detect.pdbrowser.isolates`
WHERE computed_types.serotype = 'Newport'
AND taxgroup_name LIKE 'Salmonella%'
GROUP BY erd_group
ORDER BY num_newport_isolates DESC
LIMIT 5
Find isolates that are carbapenem resistant but don't have a KPC or NDM beta-lactamase
SELECT target_acc
FROM `ncbi-pathogen-detect.pdbrowser.isolates` isolates
WHERE
(SELECT COUNT(1)
FROM UNNEST(isolates.AST_phenotypes)
WHERE antibiotic LIKE '%penem' AND phenotype = 'resistant'
) >= 1
AND
(SELECT COUNT(1)
FROM UNNEST(isolates.AMR_genotypes)
WHERE element LIKE 'blaKPC%' OR element LIKE 'blaNDM%'
) = 0
Find all clusters where >75% have a blaKPC
SELECT kpc_count.erd_group, round(kpc_count.num / erd_size.num * 100) as pct_with_kpc, erd_size.num as snp_cluster_size
FROM
(SELECT i1.erd_group, count(*) num
FROM `ncbi-pathogen-detect.pdbrowser.isolates` i1
WHERE (
(SELECT COUNT(1)
FROM UNNEST(AMR_genotypes) AS AMR_genotypes
WHERE element LIKE 'blaKPC%'
) >= 1
)
GROUP BY erd_group
) kpc_count
LEFT JOIN
(SELECT erd_group, count(*) num
FROM `ncbi-pathogen-detect.pdbrowser.isolates`
GROUP BY erd_group
) erd_size
ON kpc_count.erd_group = erd_size.erd_group
WHERE
kpc_count.num / erd_size.num > 0.75
ORDER BY
pct_with_kpc DESC, snp_cluster_size DESC
Example searches that link between MicroBIGG-E and Isolates Browser data in BigQuery
Join with the MicroBIGG-E table to find isolates that are carbapenem resistant but don't have a known carbapenem resistance gene or allele
SELECT isolates.target_acc,
ARRAY(select AS STRUCT antibiotic, phenotype from UNNEST(AST_phenotypes) WHERE antibiotic LIKE "%penem") AST
FROM `ncbi-pathogen-detect.pdbrowser.isolates` isolates
LEFT JOIN `ncbi-pathogen-detect.pdbrowser.microbigge` microbigge
ON isolates.target_acc = microbigge.target_acc
AND microbigge.subclass = 'CARBAPENEM' -- Only carbapenem genes / point mutations
WHERE
(SELECT count(1) FROM unnest(AST_phenotypes) AS ast
WHERE antibiotic like "%penem" AND phenotype = 'resistant') >= 1
AND isolates.amrfinderplus_version IS NOT NULL -- AMRFinderPlus was run on this target
AND isolates.asm_acc IS NOT NULL -- AMRFinderPlus results should be in MicroBIGG-E because assembly is public
AND microbigge.subclass IS NULL -- There are no rows in MicroBIGG-E with subclass = CARBAPENEM
ORDER BY isolates.target_acc