BioProject Hierarchy at Google Cloud Platform
ALPHA RELEASE: This resource is under active development. While we strive to maintain correctness, results may be unstable, unavailable, or incorrect at times. Please contact us by email at [email protected] before relying on this data for production analyses.
What data is available on the Google Cloud?
For a list of all resources, see Pathogen Detection Resources at Google Cloud Platform.
A dump of the BioProject hierarchy is available at Google Cloud Platform (GCP) in the ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy table at Google BigQuery. The table covers all data BioProjects for isolates in the Pathogen Detection browsers, as well as any parent umbrella BioProjects they are linked to, so you can identify all the isolates for a given parent BioProject. The same data is also available on our FTP site (see the ReadMe.txt for details).
Pathogen Detection Resources available on the Google Cloud
- Pathogen Detection Resources at Google Cloud Platform
- Getting started with BigQuery
- MicroBIGG-E table in BigQuery
- MicroBIGG-E contig sequences in Google Storage buckets
- MicroBIGG-E protein sequences in Google Storage buckets
- Isolates Browser table in BigQuery
- Isolate Exceptions table in BigQuery
- BioProject Hierarchy in BigQuery
Update Frequency
The bioproject_hierarchy table at Google Cloud BigQuery is updated daily. The same information is also updated daily on our FTP site (https://ftp.ncbi.nlm.nih.gov/pathogen/Results/BioProject_Hierarchy/), where latest.bioproject_hierarchy.txt holds the most recent dump.
Getting started with BigQuery
Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.
What is ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy?
The bioproject_hierarchy table contains information about BioProjects and their parents. A given BioProject may have multiple parents and each parent may have multiple children, so the hierarchy is not strictly a tree. The organization of the BioProjects and their membership is determined by submitters, not generated by NCBI or Pathogen Detection, so data and BioProject labelling may be inconsistent.
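Because a BioProject can have more than one parent, walking upward from a BioProject can reach several umbrella BioProjects by different routes. The following is a minimal Python sketch of that traversal, not NCBI code. The accessions (`PRJ_TOP1`, `PRJ_MID`, and so on) are made up, and the sketch assumes the hierarchy is supplied as one (bioproject_acc, parent_bioproject_acc) pair per parent link, with `None` standing in for NULL.

```python
from collections import defaultdict

# Hypothetical (bioproject_acc, parent_bioproject_acc) pairs; None stands in for NULL.
# Assumption: a BioProject with two parents appears once per parent link.
ROWS = [
    ("PRJ_TOP1", None),
    ("PRJ_TOP2", None),
    ("PRJ_MID", "PRJ_TOP1"),
    ("PRJ_MID", "PRJ_TOP2"),    # PRJ_MID has two parents
    ("PRJ_DATA1", "PRJ_MID"),
    ("PRJ_DATA2", "PRJ_MID"),
    ("PRJ_DATA2", "PRJ_TOP2"),  # also linked directly to an umbrella
]


def build_parent_index(rows):
    """Map each BioProject accession to the set of its direct parents."""
    parents = defaultdict(set)
    for child, parent in rows:
        if parent is not None:
            parents[child].add(parent)
    return parents


def ancestors(rows, acc):
    """Return every direct or indirect parent of `acc`, each listed once."""
    parents = build_parent_index(rows)
    seen = set()
    stack = [acc]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # skip parents already reached by another route
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors(ROWS, "PRJ_DATA2")))
```

The `seen` set is what keeps the result free of repeats when the same umbrella is reachable along more than one path, which a strict tree would never require.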
Fields
Column | Description |
---|---|
bioproject_id | BioProject ID |
bioproject_acc | BioProject accession |
bioproject_name | BioProject name |
bioproject_title | BioProject title |
top_organization | "Top organization" or primary organization associated with this submission |
parent_bioproject_id | BioProject ID of the parent BioProject, or NULL if there is none |
parent_bioproject_acc | BioProject accession of the parent BioProject, or NULL if there is none |
parent_bioproject_name | BioProject name of the parent BioProject, or NULL if there is none |
parent_bioproject_title | BioProject title of the parent BioProject, or NULL if there is none |
parent_top_organization | Top organization of the parent BioProject, or NULL if there is none |
Examples
Search for all the isolates belonging to a given umbrella bioproject
The following Google BigQuery Standard SQL query identifies all the isolates for umbrella BioProject PRJNA514048:
-- Build every (ancestor, descendant) pair by walking the hierarchy downward.
WITH RECURSIVE child_bioprojects AS (
  -- Base case: direct parent/child links
  SELECT parent_bioproject_acc, bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy`
  UNION ALL
  -- Recursive step: attach the children of each descendant to the same ancestor
  SELECT b.parent_bioproject_acc, a.bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` a
  JOIN child_bioprojects b
    ON b.bioproject_acc = a.parent_bioproject_acc
)
-- A BioProject with several parents has several rows in the hierarchy table,
-- so the joins below can return repeated rows; add DISTINCT if that matters.
SELECT cb.parent_bioproject_acc, parent_bp.top_organization AS parent_organization,
       cb.bioproject_acc, child_bp.bioproject_name, child_bp.top_organization,
       isolates.target_acc, isolates.taxgroup_name, isolates.biosample_acc
FROM child_bioprojects cb
JOIN `ncbi-pathogen-detect.pdbrowser.isolates` isolates
  ON isolates.bioproject_acc = cb.bioproject_acc
JOIN `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` parent_bp
  ON parent_bp.bioproject_acc = cb.parent_bioproject_acc
JOIN `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` child_bp
  ON child_bp.bioproject_acc = cb.bioproject_acc
WHERE cb.parent_bioproject_acc = 'PRJNA514048'
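
To see how the recursive part of the query expands a hierarchy, it can help to run the same pattern on a tiny made-up dataset. The sketch below uses Python's built-in sqlite3 module as a local stand-in for BigQuery. The two tables are reduced to the columns the recursion needs, all accessions (`PRJ_UMBRELLA`, `PDT_1`, and so on) are hypothetical, and the `DISTINCT` and `ORDER BY` are additions for a stable result. SQLite and BigQuery differ in other respects, so treat this as an illustration of the technique rather than a replacement for the query above.

```python
import sqlite3

# In-memory toy database; table and column names mirror the BigQuery tables,
# but the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioproject_hierarchy (bioproject_acc TEXT, parent_bioproject_acc TEXT);
CREATE TABLE isolates (target_acc TEXT, bioproject_acc TEXT);
INSERT INTO bioproject_hierarchy VALUES
  ('PRJ_UMBRELLA', NULL),
  ('PRJ_SUB', 'PRJ_UMBRELLA'),
  ('PRJ_DATA1', 'PRJ_SUB'),
  ('PRJ_DATA2', 'PRJ_UMBRELLA'),
  ('PRJ_OTHER', NULL);
INSERT INTO isolates VALUES
  ('PDT_1', 'PRJ_DATA1'),
  ('PDT_2', 'PRJ_DATA2'),
  ('PDT_3', 'PRJ_OTHER');
""")

QUERY = """
WITH RECURSIVE child_bioprojects AS (
    -- Base case: direct parent/child links
    SELECT parent_bioproject_acc, bioproject_acc FROM bioproject_hierarchy
    UNION ALL
    -- Recursive step: carry the ancestor down one more level
    SELECT b.parent_bioproject_acc, a.bioproject_acc
    FROM bioproject_hierarchy a JOIN child_bioprojects b
      ON b.bioproject_acc = a.parent_bioproject_acc
)
SELECT DISTINCT cb.bioproject_acc, i.target_acc
FROM child_bioprojects cb
JOIN isolates i ON i.bioproject_acc = cb.bioproject_acc
WHERE cb.parent_bioproject_acc = ?
ORDER BY i.target_acc
"""


def isolates_under(umbrella_acc):
    """Return (bioproject_acc, target_acc) rows for every descendant of umbrella_acc."""
    return conn.execute(QUERY, (umbrella_acc,)).fetchall()


if __name__ == "__main__":
    for row in isolates_under("PRJ_UMBRELLA"):
        print(row)
```

In this toy data, `PRJ_DATA1` sits two levels below `PRJ_UMBRELLA`, so its isolate is only found because the recursive step carries the umbrella accession down through `PRJ_SUB`.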