Category Archives: Science

Everything related to my work.

data.frames in Python - DataMatrix

  

For a long time I have tried to handle text files in Python the way R’s data.frame does - that is, with direct access to the columns and rows of a loaded text file. As I don’t like R at all, I struggled to find a Pythonic equivalent, and since I found none, I decided to eat my own dog food and write an implementation, which is what you’ll find below.

Read More »
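
To make the idea of column and row access concrete, here is a minimal sketch that uses only the Python standard library. It is not the DataMatrix implementation: the class name `TinyTable`, its methods, and the sample data are all hypothetical, and it assumes a tab-delimited file whose first line holds the column names.

```python
import csv
import io


class TinyTable:
    """Hypothetical illustration: a text table with row and column access."""

    def __init__(self, handle, delimiter="\t"):
        reader = csv.reader(handle, delimiter=delimiter)
        self.header = next(reader)  # assumption: first line holds column names
        self.rows = [row for row in reader if row]  # skip empty lines

    def column(self, name):
        """Return every value in the named column, top to bottom."""
        index = self.header.index(name)
        return [row[index] for row in self.rows]

    def row(self, number):
        """Return one row as a dict mapping column name to value."""
        return dict(zip(self.header, self.rows[number]))


# Made-up sample data standing in for a loaded text file.
sample = io.StringIO("probe\tcontrol\ttreated\np1\t2.5\t3.1\np2\t1.0\t0.7\n")
table = TinyTable(sample)
print(table.column("treated"))
print(table.row(0))
```

Values stay as strings here; a fuller version would probably convert numeric columns on load.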

Commercial applications, public funding

  

I wanted to write this earlier, but I couldn’t: I’m now in a hotel in Maastricht, Netherlands, waiting to travel back tomorrow. I’ve been attending the 4th NuGO hands-on advanced microarray data analysis course and I even wanted to blog about it… but the hotel’s connection would not load any non-European web page until late today.

Read More »

FOSS and research

  

I’ve been wondering why FOSS is so often compared to the academic world when, at least in my limited experience, I see few people in research who grasp its concept. At first glance, developing FOSS in a research environment would be very good: not only would you get publicly available results when you publish, but you could also make sure that, in an extreme case, your application will be carried on by someone else should you not be able to continue development.

At least in the life sciences, it’s hard to find such a mentality. I can understand the publish or perish frenzy, but at the same time, don’t we all remember software that was published and then left unmaintained? For me, the FOSS approach would be optimal. Once the paper is out, you can release your software (GPL would be best) and make sure someone will improve or maintain it. Of course you won’t be able to publish for each upgrade you do, but I would generally think of that as a bad policy anyway, one made just to increase the publication count.

Does something like that happen with FOSS in other research areas?

Performance and R

  

I often wonder why people resort only to R when working with microarrays. I can understand that Bioconductor offers a plethora of different packages and that R’s statistical functions come in handy for many applications, but still, I think people underestimate the impact of performance.

R is not a fast language at all, it doesn’t parallelize well on HPC systems (at least judging from the talks I’ve had with people studying the matter), and in general it is a memory and resource hog. For example, it takes much longer to perform RMA in R than with RMAExpress (which is a C++ application), and the latter also makes better use of memory. I can understand the complexity of some statistical procedures, but what about parsing GEO files?

The surprising aspect is that, aside from a few exceptions (like the aforementioned RMAExpress), no one has tried to write faster implementations of certain algorithms. I for one would welcome a non-R implementation of SAM (the original implementation works in Excel… ugh) or similar algorithms. Otherwise we will be stuck with programs that are interesting, but way too memory hungry (AMDA comes to mind).