MicroBIGG-E data at Google Cloud Platform
BETA RELEASE -- This is under active development and while we strive to maintain correctness, it is possible results may be unstable, unavailable, or incorrect at times. Please contact us by email at [email protected] before relying on this data for production analyses.
- What data is available on the Google Cloud?
- Getting started with BigQuery
- Linking to Isolates Browser data in BigQuery
- Example searches
- Find all carbapenem resistance genes or point mutations in the database
- Find all carbapenem resistance genes in the database
- Find all AMRFinderPlus results from Salmonella genomes for further analysis
- Find elements on contigs that have both blaKPC-2 and blaTEM-1 genes
- Find the five most common known parC resistance mutations in Pathogen Detection analyzed isolates
- Find the five most common AMR genes associated with quinolone resistance
- Contig sequences
- Protein sequences
What data is available on the Google Cloud?
For a list of all resources see Pathogen Detection Resources at Google Cloud Platform
The Microbial Browser for Genomic and Genetic Elements (MicroBIGG-E) data is now publicly available in the ncbi-pathogen-detect.pdbrowser.microbigge table at Google BigQuery. This data includes all the fields available in the browser and can be searched using Google Standard SQL instead of the SOLR Query Language. This also permits programmatic access and more complex queries. MicroBIGG-E at BigQuery also lets you download tables exceeding the 100,000 row limit of the MicroBIGG-E web download. NCBI is piloting this in BigQuery to help users leverage the benefits of elastic scaling and parallel execution of queries. BigQuery has a large collection of client libraries that can be used within your workflow. You can also interact with it in a web browser as described below.
We also store the contig sequences and protein sequences for MicroBIGG-E hits in Google Storage buckets. See Contig sequences and Protein sequences below for more information.
Pathogen Detection Resources available on the Google Cloud
- Pathogen Detection Resources at Google Cloud Platform
- Getting started with BigQuery
- MicroBIGG-E table in BigQuery
- MicroBIGG-E contig sequences in Google Storage buckets
- MicroBIGG-E protein sequences in Google Storage buckets
- Isolates Browser table in BigQuery
- Isolate Exceptions table in BigQuery
- BioProject Hierarchy in BigQuery
Update Frequency
The microbigge table at Google Cloud BigQuery is updated daily. For this reason the contents may not agree exactly with those shown in the MicroBIGG-E web browser. If you see unexpected discrepancies please let us know by emailing us at [email protected].
Getting started with BigQuery
Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.
Linking to Isolates Browser data in BigQuery
NCBI Pathogen Detection also maintains Isolates Browser data in the BigQuery table ncbi-pathogen-detect.pdbrowser.isolates. There are several fields in common between the two tables, but we generally recommend joining on the target_acc field. See Isolates Browser Data at Google Cloud Platform for examples of joining the two tables.
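As a rough illustration of the recommended join on target_acc, the sketch below stores a query in a shell variable and wraps the bq call in a function. The variable name JOIN_SQL and the helper run_join_query are hypothetical, and the isolates columns scientific_name and collection_date are assumptions that should be checked against the table schema before use.

```shell
# Sketch: join MicroBIGG-E hits to Isolates Browser metadata on target_acc.
# Assumption: the isolates table has scientific_name and collection_date
# columns; verify them in the table schema before relying on this query.
JOIN_SQL='
SELECT mb.target_acc, mb.element_symbol, iso.scientific_name, iso.collection_date
FROM `ncbi-pathogen-detect.pdbrowser.microbigge` mb
JOIN `ncbi-pathogen-detect.pdbrowser.isolates` iso
  ON mb.target_acc = iso.target_acc
WHERE mb.element_symbol = "blaKPC-2"
LIMIT 10
'

# Hypothetical helper: the query runs only when this function is called.
run_join_query() {
    bq query --use_legacy_sql=false --format=csv "$JOIN_SQL"
}
```

Calling run_join_query after authenticating would print up to ten joined rows as CSV. The LIMIT keeps the output small while exploring.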
Example searches
Find all carbapenem resistance genes or point mutations in the database
SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc
Find all carbapenem resistance genes in the database
SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
AND subtype = 'AMR'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc
Find all AMRFinderPlus results from Salmonella genomes for further analysis
SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE taxgroup_name = 'Salmonella enterica'
Find elements on contigs that have both blaKPC-2 and blaTEM-1 genes
SELECT mb.*
FROM `ncbi-pathogen-detect.pdbrowser.microbigge` mb
JOIN
    (SELECT DISTINCT mb1.contig_acc
    FROM `ncbi-pathogen-detect.pdbrowser.microbigge` mb1
    JOIN `ncbi-pathogen-detect.pdbrowser.microbigge` mb2
    ON mb1.element_symbol = 'blaTEM-1'
    AND mb1.contig_acc = mb2.contig_acc
    AND mb2.element_symbol = 'blaKPC-2') contigs
ON contigs.contig_acc = mb.contig_acc
Find the five most common known parC resistance mutations in Pathogen Detection analyzed isolates
SELECT element_symbol, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol like 'parC_%'
GROUP BY element_symbol
ORDER BY num_found DESC
LIMIT 5
Find the five most common AMR genes associated with quinolone resistance
SELECT element_symbol, subclass, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%QUINOLONE%'
AND subtype = 'AMR'
GROUP BY element_symbol, subclass
ORDER BY num_found DESC
LIMIT 5
Contig sequences
Contig sequences in gzipped FASTA format are stored and accessible in the Google Storage bucket ncbi-pathogen-assemblies, and the paths to those contigs are listed in the contig_url field of ncbi-pathogen-detect.pdbrowser.microbigge. These can be accessed using the gsutil command-line program included with the Google Cloud CLI (Installation instructions) or through the GCP BigQuery web interface. See Getting started with BigQuery for more information on how to use BigQuery.
Get the contig sequence for a contig with a point mutation in a specific assembly
First find the contig_url using BigQuery
SELECT contig_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';
The results should be:
contig_url |
gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz |
Copy the gzipped contig file using gsutil
Enter the following at a Unix shell command line to copy the gzipped contig FASTA file to your computer. See the Google docs for more information on the gsutil program.
gsutil cp gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz .
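Once the file is copied, it can be inspected without unpacking it to disk. The following is a minimal sketch using standard tools; the function names count_fasta_records and fasta_headers are hypothetical helpers, not part of any NCBI or Google tooling.

```shell
# Hypothetical helper: count the FASTA records in a gzipped file.
count_fasta_records() {
    gunzip -c "$1" | grep -c '^>'
}

# Hypothetical helper: print only the FASTA header lines of a gzipped file.
fasta_headers() {
    gunzip -c "$1" | grep '^>'
}
```

For example, count_fasta_records NZ_CP008827.1.fna.gz reports how many sequences the downloaded file contains, and fasta_headers NZ_CP008827.1.fna.gz shows their definition lines.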
Protein sequences
Protein sequences in gzipped FASTA format are stored and accessible in the Google Storage bucket ncbi-pathogen-proteins, and the paths to those files are listed in the protein_url field of ncbi-pathogen-detect.pdbrowser.microbigge. These can be accessed using the gsutil command-line program included with the Google Cloud CLI (Installation instructions) or through the GCP BigQuery web interface. See Getting started with BigQuery for more information on how to use BigQuery.
Get the sequence of a single protein from MicroBIGG-E
Find the protein URL using BigQuery
SELECT protein_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';
The results should be:
protein_url |
gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz |
Copy the gzipped protein FASTA file using gsutil
Enter the following at a Unix shell command line to copy the gzipped protein FASTA file to your computer. See the Google docs for more information on the gsutil program.
gsutil cp gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz .
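A quick sanity check on a downloaded protein file is to print each sequence identifier with its length. This is a sketch using gunzip and awk; fasta_lengths is a hypothetical helper name, and it assumes plain FASTA with Unix line endings.

```shell
# Hypothetical helper: print "<id><TAB><length>" for each record
# in a gzipped FASTA file (assumes Unix line endings).
fasta_lengths() {
    gunzip -c "$1" | awk '
        /^>/ { if (id != "") print id "\t" len
               id = substr($1, 2); len = 0; next }
             { len += length($0) }
        END  { if (id != "") print id "\t" len }
    '
}
```

Running fasta_lengths WP_004151112.1.faa.gz would list the accession from the definition line followed by the number of residues.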
Download all QUINOLONE resistance genes
This example uses a Linux or macOS command line, the Google Cloud CLI, and the bash shell. See the Install the Google Cloud CLI documentation from Google for instructions on how to install the CLI.
Authenticate the CLI to give it permissions on your Google Cloud project
See Initializing the gcloud CLI for more information.
gcloud auth login
Follow the instructions to authenticate to Google Cloud
Download a list of URLs using bq
bq query --use_legacy_sql=false --format=csv --max_rows 300000 '
select distinct protein_url
from `ncbi-pathogen-detect.pdbrowser.microbigge`
where class = "QUINOLONE"
' > all_quinolone_urls.csv
Split the list to smaller lists
We do this because Unix file systems can become slow or hard to work with when a single directory holds too many files
split -d -l 3500 all_quinolone_urls.csv batch.
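The CSV written by bq normally begins with a header line holding the column name, and rows without a protein_url may appear as empty lines. As an alternative to the split command above, the sketch below keeps only lines that start with gs:// before splitting. The function name split_url_list is hypothetical, and it uses the default alphabetic suffixes (batch.aa, batch.ab, ...) rather than -d so that it does not depend on GNU split.

```shell
# Hypothetical helper: keep only gs:// lines from a CSV of URLs, then
# split them into batch files.
# Arguments: <csv file> <lines per batch> <output prefix>
split_url_list() {
    grep '^gs://' "$1" | split -l "$2" - "$3"
}
```

For example, split_url_list all_quinolone_urls.csv 3500 batch. produces files that still match the batch.* pattern used in the download loop.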
Use a shell loop to download the protein files
for file in batch.*
do
    mkdir "$file.asm"
    cat "$file" | gcloud alpha storage cp --read-paths-from-stdin "$file.asm/"
done
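After the loop finishes, the downloaded files sit in one directory per batch. If a single FASTA file is more convenient for downstream tools, one possible approach is sketched below. The function name merge_protein_fasta is hypothetical, and the order in which files are concatenated is whatever find returns.

```shell
# Hypothetical helper: decompress every *.faa.gz found under a directory
# and concatenate the results into one FASTA file.
# Arguments: <directory to search> <output FASTA file>
merge_protein_fasta() {
    find "$1" -type f -name '*.faa.gz' -exec gunzip -c {} + > "$2"
}
```

For example, merge_protein_fasta . all_quinolone_proteins.faa gathers the proteins from every batch directory below the current one. Using find with -exec avoids the argument-length limits a plain shell glob can hit with many thousands of files.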