Recently, I needed to retrieve some renal cell carcinoma records referenced in papers by Zhao et al. and Higgins et al. The former’s records are hosted on NCBI’s Gene Expression Omnibus (GEO), while the latter’s were uploaded to EBI’s ArrayExpress database. Reusing data published by others in your own analysis is a form of meta-analysis, and it’s often used to validate methods and algorithms against different data sets.
The problem is, getting the **right** data is not always easy. I spent the whole afternoon yesterday trying to figure out how I could retrieve data that had already been analyzed (usually you only get the processed, i.e. normalized, data). From GEO I could download individual sample data (something I didn’t need) or the whole data set (a whopping 1.6 GB) in SOFT text format. Biopython has a SOFT parser, but the set was so big that parsing it crashed my machine. Of course, the data wasn’t available in tabular format.
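One way around the memory problem is to read the SOFT file line by line instead of loading it whole. The sketch below is not Biopython's parser. It is a minimal stand-alone reader that assumes the usual SOFT family layout: `^SAMPLE = ...` lines open a sample, and the values sit between `!sample_table_begin` and `!sample_table_end`, tab-separated, with `ID_REF` and `VALUE` columns. The function names and the `wanted_samples` parameter are made up for illustration.

```python
from collections import defaultdict


def iter_sample_values(handle, value_column="VALUE"):
    """Yield (sample_id, probe_id, value) from a SOFT family file, one line at a time."""
    sample_id = None
    in_table = False
    columns = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        lowered = line.lower()
        if line.startswith("^SAMPLE"):
            # Entity line, e.g. "^SAMPLE = GSM12345"
            sample_id = line.split("=", 1)[1].strip()
        elif lowered.startswith("!sample_table_begin"):
            in_table, columns = True, None
        elif lowered.startswith("!sample_table_end"):
            in_table = False
        elif in_table and line:
            fields = line.split("\t")
            if columns is None:
                # First row inside the table is the header
                columns = fields
                continue
            row = dict(zip(columns, fields))
            yield sample_id, row["ID_REF"], row.get(value_column, "")


def build_matrix(handle, wanted_samples=None):
    """Collect values into {probe_id: {sample_id: value}}, keeping only wanted samples."""
    matrix = defaultdict(dict)
    for sample_id, probe_id, value in iter_sample_values(handle):
        if wanted_samples is None or sample_id in wanted_samples:
            matrix[probe_id][sample_id] = value
    return matrix
```

Passing an open file handle (for example `open("family.soft")`, a placeholder name) keeps only one line in memory at a time, plus whatever ends up in the matrix. Values are left as strings, since missing entries are not always numeric.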
ArrayExpress wasn’t any better in that respect. Perhaps I don’t fully understand the format used by two-color arrays, but again, it was impossible to group the samples the way I wanted, and the sample information file was missing (a critical requirement, since I needed to select only clear cell histotypes), though with some fiddling I managed to get the right files. Of course, they included only a normalized mean of the log2 ratio of the two channels, and I didn’t want to run an analysis (such as SAM) myself…
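Once a sample information file is in hand, picking out the clear cell samples is straightforward. Here is a rough sketch, assuming the file is tab-delimited with a header row, as ArrayExpress sample tables are. The column name `Characteristics[histology]` and the function name are hypothetical, and the real annotation column may well be called something else.

```python
import csv


def select_samples(handle, column, wanted):
    """Return rows of a tab-delimited sample table whose `column` mentions `wanted`."""
    reader = csv.DictReader(handle, delimiter="\t")
    wanted_lower = wanted.lower()
    # Case-insensitive substring match; rows missing the column are skipped
    return [row for row in reader if wanted_lower in (row.get(column) or "").lower()]
```

A call such as `select_samples(open("samples.txt", newline=""), "Characteristics[histology]", "clear cell")` would return the matching rows as dictionaries, from which the sample names can be pulled to group the data files.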
Science is all about being able to reproduce results. It’s a shame that sometimes doing so is so hard.
Luca Beltrame SCIENCE