Contest datasets

As is traditional in CAMDA contests, neither we nor the producers of the data can advise individuals on the datasets, because dealing with the files forms part of the analysis challenge. There is, however, an open forum where participants can freely discuss the contest data sets, and you are encouraged to take part.

We look forward to a lively contest!


This year's conference will focus on the promise of gaining better insight from the integration of heterogeneous large-scale data. As the contest data set, we have identified the Glioblastoma multiforme subset of The Cancer Genome Atlas (TCGA) as a particularly interesting challenge.

This repository is unusual in that it provides publicly, for several hundred patients, profiles of

  • gene transcript expression (435 cancer patients versus 11 controls)
  • miRNA expression (426 tumour samples versus 10 controls)
  • genomic DNA methylation (256 tumour samples versus 1 control)
  • copy number variation (465 tumour samples versus 430 controls [402 matched normals])

complemented by a variety of clinical parameters and survival outcomes. Sometimes, additional results are available from alternative technologies / platforms. Note that the data can be downloaded at different abstraction levels, from raw (Level 1) via normalized (Level 2) to processed (Level 3), also facilitating integration by non-domain experts.

A large number of interesting questions can be addressed in the context of this data collection, and analyses may focus on subsets of the data. The outline below is meant as inspiration and makes no claim to be comprehensive!

Practical challenges / insight
  • For what questions and what kind of data does the analysis outcome benefit from integrating an additional -omics data source?
  • For a particular type of investigation, what combination of -omics data and/or clinical parameters gives the best analysis outcome?
  • What kind of novel biological insights into tumour biology can be gained through such a large and heterogeneous profile collection?
  • What role can such large patient-based repositories play in translating molecular findings into potential personalized treatment?
Statistical challenges
  • How can we best deal with different noise / power characteristics in different -omics data sets?
  • What meaningful approaches are there to dealing with the further considerable increase in dimensionality relative to the sample size?
  • How do we correctly take into account the expected correlations among features across multiple -omics data?
  • What are robust approaches for compiling consolidated data from multiple technical platforms; for example, mRNA expression from different types of arrays; methylation data from microarrays and next generation sequencing?
  • How can we ensure that batch effects or other confounding variables have been dealt with sufficiently so as not to jeopardize subsequent biomarker identification or prediction steps?
Software engineering challenges
  • Improvements to existing interfaces for querying the TCGA data
  • Making primary and secondary information more accessible to clinicians
  • What can we learn about good user interface design for such large-scale data?
  • How do we deal with versioning considering data sizes and complex dependencies?

Data download

The data is available online from the TCGA portal for this collection:

Feel free to browse the site; all publicly available data relating to this collection can be used as part of the contest. Some relevant pointers:


This selection of papers provides a first impression of typical analyses already performed on the contest data set.

(1) The Cancer Genome Atlas Research Network. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455(7216):1061-1068.

(2) Freire, P., Vilela, M., Deus, H., Kim, Y.W., Koul, D., Colman, H., Aldape, K.D., Bogler, O., Yung, W.K.A., Coombes, K., et al. (2008) Exploratory analysis of the copy number alterations in glioblastoma multiforme. PLoS One 3(12):e4076.

(3) Cerami, E., Demir, E., Schultz, N., Taylor, B.S. and Sander, C. (2010) Automated network analysis identifies core pathways in glioblastoma. PLoS One 5(2):e8918.

(4) Bredel, M., Scholtens, D.M., Harsh, G.R., Bredel, C., Chandler, J.P., Renfrow, J.J., Yadav, A.K., Vogel, H., Scheck, A.C., Tibshirani, R., et al. (2009) A network model of a cooperative genetic landscape in brain tumors. JAMA 302(3):261-275.

Other relevant article references can be found in the TCGA list of publications.

contest_dataset.txt · Last modified: 2011/05/26 13:39 by dkreil