MicroBIGG-E data at Google Cloud Platform

BETA RELEASE -- This is under active development and while we strive to maintain correctness, it is possible results may be unstable, unavailable, or incorrect at times. Please contact us by email at [email protected] before relying on this data for production analyses.

What data is available on the Google Cloud?

For a list of all resources see Pathogen Detection Resources at Google Cloud Platform

The Microbial Browser for Genomic and Genetic Elements data is now publicly available in the ncbi-pathogen-detect.pdbrowser.microbigge table at Google BigQuery. This data includes all the fields available in the browser and can be searched using Google Standard SQL instead of the SOLR Query Language. This also permits programmatic access and more complex queries. MicroBIGG-E at BigQuery will also allow you to download tables exceeding the 100,000 row limit for the MicroBIGG-E web download. NCBI is piloting this in BigQuery to help users leverage the benefits of elastic scaling and parallel execution of queries. BigQuery has a large collection of client libraries that can be used within your workflow. You can also interact with it on a web browser as described below.

We also store the contig sequences and protein sequences for MicroBIGG-E hits in Google Storage buckets. See Contig sequences and Protein sequences below for more information.

Pathogen Detection Resources available on the Google Cloud

Update Frequency

The microbigge table at Google Cloud BigQuery is updated daily. For this reason the contents may not agree exactly with those shown in the MicroBIGG-E web browser. If you see unexpected discrepancies please let us know by emailing us at [email protected].

Getting started with BigQuery

Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.

Linking to Isolates Browser data in BigQuery

NCBI Pathogen Detection also maintains Isolates Browser data in the BigQuery table ncbi-pathogen-detect.pdbrowser.isolates. There are several fields in common between the two tables, but we generally recommend joining on the target_acc field. See Isolates Browser Data at Google Cloud Platform for examples of joining the two tables.

Example searches

Find all carbapenem resistance genes or point mutations in the database

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc

Find all carbapenem resistance genes in the database

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
AND   subtype = 'AMR'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc

Find all AMRFinderPlus results from Salmonella genomes for further analysis

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE taxgroup_name = 'Salmonella enterica'

Find elements on contigs that have both blaKPC-2 and blaTEM-1 genes

SELECT
    mb.contig_acc,
    mb.element_symbol
FROM
    `ncbi-pathogen-detect.pdbrowser.microbigge` mb
    JOIN ( SELECT DISTINCT
            mb1.contig_acc
        FROM
            `ncbi-pathogen-detect.pdbrowser.microbigge` mb1
            JOIN `ncbi-pathogen-detect.pdbrowser.microbigge` mb2 
                ON mb1.element_symbol = 'blaTEM-1'
                    AND mb1.contig_acc = mb2.contig_acc
                    AND mb2.element_symbol = 'blaKPC-2') contigs 
        ON contigs.contig_acc = mb.contig_acc
ORDER BY
    mb.contig_acc,
    mb.start_on_contig

Find the five most common known parC resistance mutations in Pathogen Detection analyzed isolates

SELECT element_symbol, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol like 'parC_%'
GROUP BY element_symbol
ORDER BY num_found DESC
LIMIT 5

Find the five most common AMR genes associated with quinolone resistance

SELECT element_symbol, subclass, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%QUINOLONE%'
AND   subtype = 'AMR'
GROUP BY element_symbol, subclass
ORDER BY num_found DESC
LIMIT 5

Contig sequences

Contig sequences in gzipped FASTA format are stored and accessible in the Google storage bucket ncbi-pathogen-assemblies and the paths to those contigs are listed in the ncbi-pathogen-detect.pdbrowser.microbigge field contig_url.

These can be accessed using the gsutil command-line program included with the Google Cloud CLI (Installation instructions), or through the GCP BigQuery web interface. See Getting started with BigQuery for more information on how to use BigQuery.
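
Each contig_url value is an ordinary gs:// URL: the bucket name followed by the object path. Google Cloud Storage also exposes objects over HTTPS at storage.googleapis.com, so the same path can be rewritten for tools such as curl. The sketch below shows that rewrite only. The helper name gs_to_https is hypothetical, and whether these buckets allow anonymous HTTPS reads is an assumption that this page does not confirm.

```shell
# gs_to_https URL
# Rewrite gs://BUCKET/OBJECT as https://storage.googleapis.com/BUCKET/OBJECT.
# (Hypothetical helper name. Whether anonymous HTTPS reads are permitted on
# these buckets is an assumption, not something this page states.)
gs_to_https() {
    case $1 in
        gs://*) printf 'https://storage.googleapis.com/%s\n' "${1#gs://}" ;;
        *)      printf 'not a gs:// URL: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Usage:
#   curl -O "$(gs_to_https gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz)"
```

If the HTTPS request is refused, fall back to gsutil with an authenticated Google Cloud CLI.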

Example:

Get the contig sequence for a contig with a point mutation in a specific assembly

First find the contig_url using BigQuery
SELECT contig_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';

The results should be:

contig_url
gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz
Copy the gzipped contig file using gsutil

Enter the following at a Unix shell command line to copy the gzipped contig FASTA file to your computer. See the Google docs for more information on the gsutil program.

gsutil cp gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz .
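
After the copy finishes, a quick sanity check is to count the sequences and total residues without decompressing the file to disk. This is a minimal sketch, assuming a POSIX shell with gunzip and awk available. The helper name fasta_summary is hypothetical and not part of gsutil or any NCBI tool.

```shell
# fasta_summary FILE.gz
# Print the number of sequences and the total number of residues in a
# gzipped FASTA file. (Hypothetical helper name, for illustration only.)
fasta_summary() {
    gunzip -c "$1" | awk '
        /^>/ { nseq++; next }          # header line: count one sequence
             { nres += length($0) }    # sequence line: add its length
        END  { printf "sequences=%d residues=%d\n", nseq, nres }
    '
}

# Usage, after the gsutil cp command above:
#   fasta_summary NZ_CP008827.1.fna.gz
```

The same function works on the gzipped protein FASTA files, since it only distinguishes header lines from sequence lines.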

Protein sequences

Protein sequences in gzipped FASTA format are stored and accessible in the Google Storage bucket ncbi-pathogen-proteins, and the paths to those files are listed in the ncbi-pathogen-detect.pdbrowser.microbigge field protein_url.

These can be accessed using the gsutil command-line program included with the Google Cloud CLI (Installation instructions), or through the GCP BigQuery web interface. See Getting started with BigQuery for more information on how to use BigQuery.

Example:

Get the sequence of a single protein from MicroBIGG-E

Find the protein URL using BigQuery
SELECT protein_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';

The results should be:

protein_url
gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz
Copy the gzipped protein FASTA file using gsutil

Enter the following at a Unix shell command line to copy the gzipped protein FASTA file to your computer. See the Google docs for more information on the gsutil program.

gsutil cp gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz .

Download all QUINOLONE resistance genes

This example uses a Linux or macOS command line, the Google Cloud CLI, and the bash shell. See the Install the Google Cloud CLI documentation from Google for instructions on how to install the CLI.

Authenticate the CLI to give it permissions on your Google Cloud project

See Initializing the gcloud CLI for more information.

gcloud auth login

Follow the instructions to authenticate to Google Cloud.

Download a list of URLs using bq
bq query --use_legacy_sql=false --format=csv --max_rows 300000 '
select distinct protein_url
from `ncbi-pathogen-detect.pdbrowser.microbigge`
where class = "QUINOLONE"
' > all_quinolone_urls.csv
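
The CSV written by bq normally begins with a header line holding the column name (protein_url), and that line is not a valid path to copy. Below is a minimal sketch of one way to drop it, along with any blank lines, before the list is used. It assumes every real entry starts with gs://. The helper name keep_gs_urls and the output file name all_quinolone_urls.txt are illustrative and not part of the NCBI instructions.

```shell
# keep_gs_urls FILE
# Print only the lines of FILE that look like Google Storage URLs, dropping
# the CSV header line and any blank lines. (Hypothetical helper name.)
keep_gs_urls() {
    grep '^gs://' "$1"
}

# Usage: write a cleaned list next to the original and work from that file.
#   keep_gs_urls all_quinolone_urls.csv > all_quinolone_urls.txt
```

Filtering on the gs:// prefix rather than deleting the first line keeps the step safe to rerun on a file that has already been cleaned.
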
Split the list into smaller lists

We do this because Unix directories tend to have problems when there are too many files in one directory.

split -d -l 3500 all_quinolone_urls.csv batch.
Use a shell loop to download the protein files
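for file in batch.*
do
    # Create one download directory per batch list
    mkdir "$file.asm"
    # Copy every URL listed in the batch into that directory
    cat "$file" | gcloud alpha storage cp --read-paths-from-stdin "$file.asm/"
done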
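
Large batch downloads can be interrupted, so it helps to compare each batch list against what actually arrived. The sketch below assumes each object is saved under its base name in the batch directory. The helper name missing_downloads is hypothetical.

```shell
# missing_downloads LIST DIR
# For each gs:// URL in LIST, print the URL if no file with the same base
# name exists in DIR. (Hypothetical helper name, for illustration only.)
missing_downloads() {
    grep '^gs://' "$1" | while IFS= read -r url
    do
        # ${url##*/} strips everything up to the last slash, leaving the file name
        [ -e "$2/${url##*/}" ] || printf '%s\n' "$url"
    done
}

# Usage: report anything that did not arrive, batch by batch.
#   for file in batch.*; do
#       [ -f "$file" ] && missing_downloads "$file" "$file.asm"
#   done
```

The `[ -f "$file" ]` guard in the usage loop skips the batch.*.asm directories, which match the same glob once they exist. Any URLs that are printed can be saved to a file and fed back into the same copy command.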