Expression profiling by array; Third-party reanalysis
Summary
The purpose of our study was to define robust glioma subtypes by applying rigorous preprocessing and validation steps to 1,952 microarray samples aggregated from public data repositories for 16 prior studies. We evaluated each sample for quality-control issues, normalized high-quality samples using the Single-Channel Array Normalization (SCAN) algorithm (PMID: 22959562), corrected for probe-composition biases and inter-platform variability, and adjusted for intra- and inter-study batch effects. The data deposited in GEO comprise the 1,841 microarray samples that passed quality-control tests and underwent normalization and batch-effect adjustment.
Where available, we retrieved treatment, histological, and clinical data, such as tumor grade, histopathology, age at diagnosis, and survival time after diagnosis, for these samples. Using a training/testing validation design, we identified six transcriptional subtypes in the training set and evaluated clinically observable characteristics in the test set. Three of our clusters contained a heterogeneous mix of histopathological subtypes and tumor grades. We evaluated age, survival, and treatment patterns across our test samples and observed highly significant differences among the clusters. We also observed that gene expression patterns have the potential to further our understanding of the biological mechanisms that drive gliomagenesis in each subtype. Our findings provide clinical and biological insights that may not be apparent with alternative approaches or smaller data sets, and our approach serves as an example for gene-expression meta-analyses that can be applied to other complex diseases.
Overall design
A total of 1,841 microarray samples, aggregated from public data repositories for 16 prior studies, were used to define six robust glioma subtypes through rigorous preprocessing and validation steps.
We collected raw microarray data from publicly available repositories for patients with histologically defined glioma. We downloaded 11 of the data sets from general-purpose databases, either NCBI GEO (http://ncbi.nlm.nih.gov/geo) or ArrayExpress (http://www.ebi.ac.uk/arrayexpress), and 5 of the data sets from disease-focused databases. We focused on data sets that used the Affymetrix Human Genome U133A and U133 Plus 2.0 platforms because they account for the majority of available microarray samples used to profile glioma patients, and because the two platforms share many probes.
Step 1: We performed quality control tests, SCAN normalization and batch effect adjustment. We excluded low-quality samples.
Step 2: We separated data sets into training and testing sets according to clinical data availability. We performed unsupervised clustering analysis and internal validation on the training data to determine the optimal number of clusters.
Step 3: We assigned the samples in the test data set to clusters and examined clinical characteristics across the transcriptional clusters.
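Steps 2 and 3 can be illustrated with a minimal sketch. This record does not name the clustering algorithm or the internal validation index, so the sketch substitutes stand-ins: k-means with farthest-first initialization, the mean silhouette width for choosing the number of clusters, and nearest-centroid assignment for test samples. All function names and the toy data are hypothetical and are not taken from the study.

```python
import math


def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(sample, centroids):
    """Index of the centroid closest to one sample."""
    return min(range(len(centroids)), key=lambda c: dist(sample, centroids[c]))


def kmeans(samples, k, max_iter=100):
    """Stand-in clustering (assumption): returns (centroids, labels)."""
    # Farthest-first initialization: deterministic, no random seed needed.
    centroids = [list(samples[0])]
    while len(centroids) < k:
        far = max(samples, key=lambda s: min(dist(s, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(max_iter):
        labels = [nearest(s, centroids) for s in samples]
        updated = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            # Keep the old centroid if a cluster loses all of its members.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(s, centroids) for s in samples]


def silhouette(samples, labels):
    """Mean silhouette width, used here as the internal validation index."""
    clusters = {}
    for s, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(s)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for s, lab in zip(samples, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(s, o) for o in own) / (len(own) - 1)
        b = min(sum(dist(s, o) for o in other) / len(other)
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(train, k_values):
    """Step 2: pick the cluster count with the best validation score."""
    return max(k_values, key=lambda k: silhouette(train, kmeans(train, k)[1]))


def assign_to_clusters(test, centroids):
    """Step 3: give each test sample the label of its nearest centroid."""
    return [nearest(s, centroids) for s in test]


if __name__ == "__main__":
    # Toy two-gene profiles standing in for normalized log2 intensities.
    train = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    k = choose_k(train, [2, 3])
    centroids, _ = kmeans(train, k)
    print(k, assign_to_clusters([[0.5, 0.5], [10.5, 10.5]], centroids))
```

Fixing the centroids on the training set before touching the test set keeps the test-set clinical comparisons independent of the choice of cluster count, which is the point of the training/testing design described above.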
Results are reported as normalized log2 signal intensities. Probe-set IDs from the Affymetrix Human Genome U133A and U133 Plus 2.0 platforms were mapped to 12,078 human Entrez Gene IDs (File: GSE55918_Matrix_GliomaClusteringAnalysis.txt).
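A gene-by-sample matrix of this kind could be read with a short parser such as the sketch below. The file layout is an assumption, not something this record states: the sketch supposes a tab-delimited file whose first row holds sample IDs and whose first column holds Entrez Gene IDs. The function name is hypothetical, and the layout should be checked against the deposited file before use.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-delimited matrix (assumed layout: header row of sample
    IDs, then one row per gene with the gene ID in the first column).

    Returns (sample_ids, {gene_id: [log2 intensity per sample]}).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row[1:]]
        if len(values) != len(sample_ids):
            raise ValueError(f"row {row[0]!r} has {len(values)} values, "
                             f"expected {len(sample_ids)}")
        matrix[row[0]] = values
    return sample_ids, matrix


if __name__ == "__main__":
    # Made-up gene IDs, sample IDs, and values, for illustration only.
    demo = io.StringIO("gene\tS1\tS2\n1\t7.5\t8.25\n2\t6.0\t5.5\n")
    samples, matrix = read_expression_matrix(demo)
    print(samples, matrix)
```

With the real file, the handle would come from `open("GSE55918_Matrix_GliomaClusteringAnalysis.txt", newline="")`, again assuming the layout described above.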