BMC Cancer (Jun 2009)
Intrinsic bias in breast cancer gene expression data sets
Abstract
Background
While global breast cancer gene expression data sets have considerable commonality in their data content, the populations they represent and the data collection methods used can be quite disparate. We sought to assess the extent and consequences of these systematic differences with respect to identifying clinically significant prognostic groups.
Methods
Using four well-characterized breast cancer data sets, we ascertained how effectively unsupervised clustering based on randomly generated sets of genes could segregate tumors into prognostic groups.
Results
Using a common set of 5,000 randomly generated lists (70 genes/list), the percentages of clusters with significant differences in metastasis latencies (HR p-value
Conclusion
A statistically significant association between a given gene list and prognosis is highly likely to be found in the NKI2 data set, owing to its large sample size and the interrelationship between ER-α expression and markers of proliferation. In most respects, the TRANSBIG data set produced outcomes similar to those of the NKI2 data set, although its smaller sample size led to fewer statistically significant results.
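The random-gene-list procedure described in the Methods can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the clustering algorithm or the survival test, so a basic two-cluster k-means and a two-group log-rank test are swapped in here, the expression values and follow-up times are synthetic, and the function names (`random_gene_lists`, `two_means`, `logrank_p`, `fraction_significant`) are hypothetical. The demo draws far fewer lists than the 5,000 used in the study so that it runs quickly.

```python
import math
import random


def random_gene_lists(n_genes, n_lists, list_size, rng):
    """Draw n_lists random lists of list_size distinct gene indices."""
    return [rng.sample(range(n_genes), list_size) for _ in range(n_lists)]


def two_means(rows, rng, n_iter=10):
    """Split samples into two clusters with a basic k-means (k = 2).

    Stand-in for the unspecified unsupervised clustering step.
    """
    centers = [list(r) for r in rng.sample(rows, 2)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assign each sample to the nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: sum((x - c) ** 2 for x, c in zip(r, centers[k])))
            for r in rows
        ]
        # Recompute each center as the mean of its members.
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def logrank_p(times, events, labels):
    """Two-group log-rank test; returns the chi-square (1 df) p-value."""
    obs_minus_exp = 0.0
    var = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [lab for ti, lab in zip(times, labels) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for lab in at_risk if lab == 1)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(
            1 for ti, e, lab in zip(times, events, labels)
            if ti == t and e and lab == 1
        )
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0  # one group is empty or no informative event times
    chi2 = obs_minus_exp ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def fraction_significant(expr, times, events, gene_lists, rng, alpha=0.05):
    """Fraction of gene lists whose two clusters differ in event latency."""
    hits = 0
    for genes in gene_lists:
        rows = [[sample[g] for g in genes] for sample in expr]
        labels = two_means(rows, rng)
        if logrank_p(times, events, labels) < alpha:
            hits += 1
    return hits / len(gene_lists)


if __name__ == "__main__":
    rng = random.Random(0)
    n_samples, n_genes = 40, 200  # synthetic sizes, not those of any real data set
    expr = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
    times = [rng.expovariate(0.1) for _ in range(n_samples)]
    events = [rng.random() < 0.7 for _ in range(n_samples)]
    lists = random_gene_lists(n_genes, n_lists=20, list_size=70, rng=rng)
    print(fraction_significant(expr, times, events, lists, rng))
```

Running the same set of random lists against each data set and comparing the resulting fractions is the kind of comparison the abstract describes. With purely random synthetic data, as here, the fraction should stay near the nominal false-positive rate.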