Tag Archives: microarray

Performance and R

I’m often wondering why people only resort to R when working with microarrays. I can understand that Bioconductor offers a plethora of different packages and that R’s statistical functions come in handy for many applications, but still, I think people underestimate the impact of performance.

R is not a performing language at all, it doesn’t parallelize well when using HPC (at least from the talks I’ve had with people studying the matter), and in general is a memory and resource hog. For example, it takes much more to perform RMA via R that with RMAExpress (which is a C++ application): the latter works also better with regards to memory utilization. I can understand the complexity of some statistical procedures, but what about parsing GEO files?

The surprising aspect is that aside by a few exceptions (like the aforementioned RMAExpress) no one has tried to write more performing implementations of certain algorithms. I for one would welcome a non-R implementation of SAM (the original implementation works in Excel… ugh) or similar algorithms. Otherwise we would be stuck with programs that are interesting, but way too memory hungry (AMDA comes to mind).

Follow up on meta-analysis

Fourteen days since my last post. Quite a while, indeed. Mostly I’ve been stumbled with work and some health related issues. Anyway, I thought I’d follow up on the meta analysis matter I discussed in my last post.

It turns out that it’s a fault of both limma and the data sets, because apparently the raw data found in the Stanford Microarray Database have different length, gene-wise (a result of not all spots on the array being good?) and limma itself does need equal length tables to form a single object (I stumbled upon the same problem when doing my thesis, but I used a hack to work around it), and does not perform any checking.

According to the documentation, the “merge” command should be used to deal with these cases, but here’s what I get:

[sourcecode lang='c']

>> RG1 = read.maimages(file=”file1.txt”,source=”smd”)
Read file1.txt
>> RG2 = read.maimages(file=”file2.txt”,source=”smd”)
Read file2.txt
>> merge(RG1,RG2)
Error in merge(RG1,RG2): Need row names to align on
>> rownames(RG1)
NULL
>> rownames(RG2)
NULL [/sourcecode]

I’m going to ask the Bioconductor ML and see what they tell me.

Meta analysis difficulty increasing

Again in the past days I’ve been banging my head thanks to the fact that doing meta-analysis with microarray data is more difficult than what it seems.

The problem sometimes lies in the data, sometimes lies in the analysis software and sometimes in a combination of factors. When doing work on a public data set (Zhao et al., 2005), I had to start analysis from raw data. Now, I tried using both the limma and marray Bioconductor packages, but both of them bail out with cryptic error messages. From what I’ve learnt by googling around, it seems that R doesn’t like batch loading of tables of different length.

I have 177 samples and I have to normalize them all together. Apparently this is a quirk of marray and limma (or worse, R itself) which is preventing me to work properly. And this is not the first time it happens, either: in the past year I’ve lost a lot of time dealing with software issues rather than performing real analsis. The problem has been posted already on some R mailing lists (and on BioC, too), but judging from the responses I doubt I’ll see a solution.

I guess I’ll have to work around this somehow (and of course, this doesn’t improve the idea I have of R…).

Gene identifiers

While working today on an annotation class in Python I stumbled on a problem. Normally I work with lists of genes that are consistent, i.e. all Entrez Gene IDs (or RefSeq IDs, or Genome Browser IDs…), but today I had a list of mixed identifiers.

The subsequent idea was “let’s implement auto-detection of common identifiers in the class”. The problem is… is there any actual documentation on how identifiers are made? So far, using regular expressions, I’ve tracked down a few:

  • RefSeq
  • GenBank
  • Entrez Gene
  • UCSC Genome Browser
  • Ensembl

However, I have no idea if I have implemented all types of these IDs. Does anyone know a place where to look these information up?

(On a related note: my thesis defense will be on January 14th, 2008, so I have to get the printing going)