Gene search applet: suggestions and code review needed

For months I’ve wanted to write a small Plasma applet to help me with some boring tasks as a bioinformatician. One example (for the non-scientific crowd out there) is when my analysis work turns up a specific gene that I want to take a look at. I am often lazy, so instead of firing up the browser to look at the online resources, I wanted to write something that could access said resources programmatically.


FOSS and research

I’ve been wondering why FOSS is often compared to the academic world when, at least in my limited experience, I see few people in research who grasp its concept. At first glance, developing FOSS in a research environment would be very good: not only would you get publicly available results when you publish, but you could also make sure that, in an extreme case, your application would be carried on by someone else should you not be able to continue development.

At least in the life sciences, it’s hard to see such a mentality. I can understand , but at the same time, ? For me, such an idea would be optimal. Once the paper is out, you can release your software (GPL would be best) and make sure someone will improve or maintain it. Of course you won’t be able to publish for each upgrade you make, but I would generally think of that as a bad policy anyway, one made just to increase the publication count.

Does something like that happen with FOSS in other research areas?

Performance and R

I often wonder why people resort only to R when working with microarrays. I can understand that Bioconductor offers a plethora of different packages and that R’s statistical functions come in handy for many applications, but still, I think people underestimate the impact of performance.

R is not a fast language at all, it doesn’t parallelize well when using HPC (at least from the talks I’ve had with people studying the matter), and in general it is a memory and resource hog. For example, it takes much longer to perform RMA via R than with RMAExpress (which is a C++ application), and the latter also makes better use of memory. I can understand the complexity of some statistical procedures, but what about ?

The surprising aspect is that, aside from a few exceptions (like the aforementioned RMAExpress), no one has tried to write faster implementations of certain algorithms. I for one would welcome a non-R implementation of SAM (the original implementation works in Excel… ugh) or similar algorithms. Otherwise we will be stuck with programs that are interesting, but way too memory hungry (AMDA comes to mind).

Gene identifiers

While working today on an annotation class in Python I stumbled on a problem. Normally I work with lists of genes that are consistent, i.e. all Entrez Gene IDs (or RefSeq IDs, or Genome Browser IDs…), but today I had a list of mixed identifiers.

The subsequent idea was “let’s implement auto-detection of common identifiers in the class”. The problem is… is there any actual documentation on how identifiers are made? So far, using regular expressions, I’ve tracked down a few:

  • RefSeq
  • GenBank
  • Entrez Gene
  • UCSC Genome Browser
  • Ensembl

However, I have no idea whether I have covered all the variants of these IDs. Does anyone know where I can look this information up?
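To make the idea concrete, here is a minimal sketch of regex-based auto-detection. It is not the code from my annotation class: the function name `detect_id_type`, the labels, and above all the patterns themselves are assumptions based on commonly seen forms of each identifier, and they almost certainly miss some variants (which is exactly the problem described above).

```python
import re

# Hypothetical patterns: illustrative guesses at common identifier shapes,
# not an authoritative specification of any database's format.
ID_PATTERNS = [
    # RefSeq: two-letter prefix, underscore, digits, optional version (NM_000546.5)
    ("refseq", re.compile(r"(?:NM|NR|NP|NC|NG|NT|NW|XM|XR|XP)_\d+(?:\.\d+)?")),
    # Ensembl: ENS, optional species letters, feature letter, 11 digits
    ("ensembl", re.compile(r"ENS[A-Z]*[GTP]\d{11}(?:\.\d+)?")),
    # UCSC Genome Browser: "uc", 3 digits, 3 lowercase letters, optional version
    ("ucsc", re.compile(r"uc\d{3}[a-z]{3}(?:\.\d+)?")),
    # GenBank nucleotide: 1 letter + 5 digits or 2 letters + 6 digits
    ("genbank", re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?")),
    # Entrez Gene: a plain integer
    ("entrez_gene", re.compile(r"\d+")),
]


def detect_id_type(identifier):
    """Return the label of the first pattern matching the whole string, or None."""
    candidate = identifier.strip()
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(candidate):
            return label
    return None


def group_by_type(identifiers):
    """Split a mixed list of identifiers into a dict keyed by detected type."""
    groups = {}
    for identifier in identifiers:
        groups.setdefault(detect_id_type(identifier), []).append(identifier)
    return groups
```

With a mixed list, `group_by_type` collects anything unrecognized under the `None` key, which at least makes it easy to see which identifiers the patterns fail to cover.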

(On a related note: my thesis defense will be on January 14th, 2008, so I have to get the printing going)