Tag Archives: R

Performance and R

I often wonder why people resort only to R when working with microarrays. I understand that Bioconductor offers a plethora of packages and that R’s statistical functions come in handy for many applications, but I still think people underestimate the impact of performance.

R is not a fast language at all, it doesn’t parallelize well on HPC systems (at least judging from the talks I’ve had with people studying the matter), and in general it is a memory and resource hog. For example, RMA takes much longer in R than in RMAExpress (which is a C++ application), and the latter also makes better use of memory. I can understand the complexity of some statistical procedures, but what about parsing GEO files?

The surprising aspect is that, aside from a few exceptions (like the aforementioned RMAExpress), no one has tried to write faster implementations of certain algorithms. I for one would welcome a non-R implementation of SAM (the original implementation works in Excel… ugh) or of similar algorithms. Otherwise we will be stuck with programs that are interesting, but way too memory hungry (AMDA comes to mind).

Data clustering with Python

Following up on my recent post, I’ve been looking for alternatives to TMeV. So far I’ve found the R package pvclust and the Pycluster library, part of Biopython. The former also performs bootstrapping (I’m not sure if it’s similar to what support trees do, but it’s still better than no resampling at all). I’ve found another Python project, but it is still too basic to do what I need.

SOFT file woes

Today I started working on a data set published on GEO. As the sample data were somewhat inconsistent (they mentioned 23 controls when I found 28), I decided to parse the SOFT file from GEO in order to get the exact sample information.

I made a grave mistake. First of all, Biopython’s SOFT parser is horribly broken (it doesn’t work at all) and largely undocumented: I could work around the lack of documentation (API docs), but not around the fact that it wouldn’t work. So I turned to R, which offers a GEO query module through Bioconductor.

Again, that proved to be a terrible mistake. For a file containing 183 samples, the analysis has been running for four hours with no sign of completing anytime soon (not to mention a possible memory leak). After this, I gave up. I’m going to get the reduced data sheet and write a small parser in Python myself.
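To give an idea of what such a small parser could look like, here is a minimal sketch. It assumes the usual SOFT conventions (lines starting with "^" open a new entity, lines starting with "!" carry "key = value" attributes, and data tables sit between "!sample_table_begin" and "!sample_table_end"). The function name parse_soft_samples, the accessions and the "group:" labels in the example are made up for illustration; this is not the parser I ended up with, just the general shape of one. It reads the file line by line and skips the data tables entirely, so it only keeps the sample annotations in memory.

```python
from collections import defaultdict
import io


def parse_soft_samples(handle):
    """Collect the '!' attributes of every '^SAMPLE' block in a SOFT file.

    Returns a dict mapping sample accession -> {attribute: [values]}.
    Attributes may repeat, so every value is stored in a list.
    Data table rows are skipped.
    """
    samples = {}
    current = None     # attributes of the sample being read, if any
    in_table = False   # True while between *_table_begin and *_table_end
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("^"):
            # New entity: only SAMPLE blocks are kept in this sketch.
            entity, _, value = line[1:].partition("=")
            if entity.strip().upper() == "SAMPLE":
                current = defaultdict(list)
                samples[value.strip()] = current
            else:
                current = None
            in_table = False
        elif line.startswith("!"):
            key, _, value = line[1:].partition("=")
            key = key.strip()
            if key.lower().endswith("_table_begin"):
                in_table = True
            elif key.lower().endswith("_table_end"):
                in_table = False
            elif current is not None and not in_table:
                current[key].append(value.strip())
        # Anything else (column descriptions, table rows) is ignored.
    return samples


# Tiny made-up example; a real file would be opened with open(path).
example = (
    "^SAMPLE = GSM0001\n"
    "!Sample_title = control 1\n"
    "!Sample_characteristics_ch1 = group: control\n"
    "!sample_table_begin\n"
    "ID_REF\tVALUE\n"
    "probe_1\t7.25\n"
    "!sample_table_end\n"
    "^SAMPLE = GSM0002\n"
    "!Sample_title = tumor 1\n"
    "!Sample_characteristics_ch1 = group: tumor\n"
)
samples = parse_soft_samples(io.StringIO(example))

# Count the controls by looking at the (hypothetical) characteristics field.
controls = [
    accession
    for accession, attrs in samples.items()
    if "group: control" in attrs["Sample_characteristics_ch1"]
]
print(len(samples), "samples,", len(controls), "controls")
```

With something like this, checking whether a data set really has 23 or 28 controls boils down to filtering on whichever characteristics field the submitters used, which varies from one series to the next.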

What is frustrating is the lack of quality: I could concentrate on my own work rather than reinventing the wheel for the nth time if the existing implementations worked. What’s the point of releasing non-working software? I could understand bugs, but this goes one step further.