Recently I had to annotate a large number (10,000+) of genes identified by Entrez Gene IDs. My goal was to avoid the “annotation files” (basically CSV files) that part of the wet lab group likes, because I wanted to stay up to date without having to remember to update them. So the obvious solution was to use a service available on the web, in an automated way. For the record, I just wanted to attach gene symbol, gene name, chromosome and cytoband to each ID.
I tried many services:
- UCSC Genome Browser: it has a public MySQL server, but it’s rather slow and I did not want to clog it up. Using their tables and .sql files I managed to get a first shot at annotation, but about 2,000 genes were left without annotation!
- NCBI’s own Entrez Gene: this needs EUtils, and Biopython has no parser for Entrez Gene XML entries. I had to scrap the idea for lack of time.
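(For the curious: even without a dedicated Biopython parser, the fields I needed can be pulled out of the efetch XML with the standard library. The sketch below parses a toy fragment that mimics the Entrezgene layout — tag names follow my reading of the Entrezgene DTD, and the record itself is made up for illustration, so treat this as a rough sketch rather than a tested recipe.)

```python
import xml.etree.ElementTree as ET

# Toy fragment mimicking the Entrez Gene XML layout; the tag names follow
# the Entrezgene DTD as I remember it, the record itself is invented.
SAMPLE = """<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_gene>
      <Gene-ref>
        <Gene-ref_locus>TP53</Gene-ref_locus>
        <Gene-ref_desc>tumor protein p53</Gene-ref_desc>
        <Gene-ref_maploc>17p13.1</Gene-ref_maploc>
      </Gene-ref>
    </Entrezgene_gene>
  </Entrezgene>
</Entrezgene-Set>"""

def parse_entrez_gene(xml_text):
    """Extract symbol, name and cytoband from an Entrez Gene XML dump."""
    root = ET.fromstring(xml_text)
    records = []
    for gene in root.iter("Entrezgene"):
        ref = gene.find("Entrezgene_gene/Gene-ref")
        records.append({
            "symbol": ref.findtext("Gene-ref_locus"),
            "name": ref.findtext("Gene-ref_desc"),
            "cytoband": ref.findtext("Gene-ref_maploc"),
        })
    return records

print(parse_entrez_gene(SAMPLE))
```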
- Ensembl: I decided to use the Biomart service, through Rpy. There were missing genes, and sometimes the IDs were “converted” into something else (I had no time to figure out what was happening). Also, some perfectly valid genes (in Entrez Gene) were not present in Ensembl.
In the end I just grabbed Bioconductor’s “org.Hs.eg.db” package and used its sqlite gene database (built from Entrez Gene) to annotate the list, ending up with only 97 missing IDs (mostly genes whose identifiers had changed). However, this effort revealed a problem: the annotations are not consistent between databases. This is a real pain when doing microarray-based analysis, because you often deal with large numbers of genes, and a perceived lack of annotation might lead to a number of them being discarded.
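The nice thing about the sqlite approach is that annotation becomes a plain join against a local file, and missing IDs fall out for free. Here is a sketch of the idea against an in-memory stand-in for the package database — the schema below (genes/gene_info/chromosomes/cytogenetic_locations keyed on an internal `_id`) is how I recall org.Hs.eg.db being laid out, but table and column names may differ between package versions, so this is illustrative only.

```python
import sqlite3

# In-memory stand-in for the org.Hs.eg.db sqlite file; schema and data
# are illustrative, not copied from the real package.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (_id INTEGER PRIMARY KEY, gene_id TEXT);
CREATE TABLE gene_info (_id INTEGER, symbol TEXT, gene_name TEXT);
CREATE TABLE chromosomes (_id INTEGER, chromosome TEXT);
CREATE TABLE cytogenetic_locations (_id INTEGER, cytogenetic_location TEXT);
INSERT INTO genes VALUES (1, '7157'), (2, '1956');
INSERT INTO gene_info VALUES (1, 'TP53', 'tumor protein p53'),
                             (2, 'EGFR', 'epidermal growth factor receptor');
INSERT INTO chromosomes VALUES (1, '17'), (2, '7');
INSERT INTO cytogenetic_locations VALUES (1, '17p13.1'), (2, '7p11.2');
""")

def annotate(ids):
    """Map each Entrez ID to (symbol, name, chromosome, cytoband), or None."""
    out = {}
    for gid in ids:
        row = conn.execute("""
            SELECT gi.symbol, gi.gene_name, c.chromosome,
                   cl.cytogenetic_location
            FROM genes g
            JOIN gene_info gi ON gi._id = g._id
            LEFT JOIN chromosomes c ON c._id = g._id
            LEFT JOIN cytogenetic_locations cl ON cl._id = g._id
            WHERE g.gene_id = ?""", (gid,)).fetchone()
        out[gid] = row  # None flags a retired or changed identifier
    return out

print(annotate(['7157', '99999']))
```

The `None` entries are exactly the “97 missing IDs” situation: identifiers the database no longer knows about, collected instead of silently dropped.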
I thought the situation was better than this. If I annotate genes in different databases with the same ID, I expect to get identical results. I mean, it’s not like Gene or Ensembl have little resources… or am I wrong?
I often wonder why people resort only to R when working with microarrays. I can understand that Bioconductor offers a plethora of different packages and that R’s statistical functions come in handy for many applications, but still, I think people underestimate the impact of performance.
R is not a performant language at all, it doesn’t parallelize well on HPC systems (at least judging from the talks I’ve had with people studying the matter), and in general it is a memory and resource hog. For example, performing RMA via R takes much longer than with RMAExpress (which is a C++ application): the latter also behaves better with regard to memory utilization. I can understand the complexity of some statistical procedures, but what about…?
The surprising thing is that, aside from a few exceptions (like the aforementioned RMAExpress), no one has tried to write more performant implementations of these algorithms. I for one would welcome a non-R implementation of SAM (the original implementation runs in Excel… ugh) or similar algorithms. Otherwise we’ll be stuck with programs that are interesting, but way too memory hungry (AMDA comes to mind).
Fourteen days since my last post. Quite a while, indeed. Mostly I’ve been swamped with work and some health-related issues. Anyway, I thought I’d follow up on the meta-analysis matter I discussed in my last post.
It turns out that the fault lies with both limma and the data sets: apparently the raw data files from the Stanford Microarray Database have different lengths, gene-wise (perhaps a result of not all spots on the array being good?), while limma needs equal-length tables to form a single object and does not perform any checking. (I stumbled upon the same problem when doing my thesis, but I used a hack to work around it.)
According to the documentation, the “merge” command should be used to deal with these cases, but here’s what I get:
> RG1 = read.maimages(files="file1.txt", source="smd")
> RG2 = read.maimages(files="file2.txt", source="smd")
> RG = merge(RG1, RG2)
Error in merge(RG1, RG2): Need row names to align on
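The hack I used back in my thesis boiled down to aligning the tables myself: restrict both data sets to the genes they have in common, in a common order, and only then combine them. In Python terms the idea looks like this (purely illustrative — the real fix would of course be done on RGList objects in R, and `align_tables` is a made-up helper, not a limma function):

```python
# Sketch of the alignment step that merge() seems to expect: keep only
# the row names (gene IDs) shared by both tables, in a common order.

def align_tables(t1, t2):
    """t1, t2: dicts mapping gene ID -> intensity value."""
    common = sorted(set(t1) & set(t2))   # genes present in both arrays
    return ({g: t1[g] for g in common},
            {g: t2[g] for g in common})

a = {'geneA': 1.2, 'geneB': 3.4, 'geneC': 5.6}
b = {'geneB': 2.0, 'geneC': 4.0, 'geneD': 8.0}
a2, b2 = align_tables(a, b)
print(sorted(a2))  # only the shared genes survive
```

The obvious downside is that any gene missing from even one array is thrown away, which is exactly the kind of silent data loss I was complaining about above.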
I’m going to ask the Bioconductor ML and see what they tell me.
Again in the past days I’ve been banging my head against the fact that doing meta-analysis with microarray data is harder than it seems.
The problem sometimes lies in the data, sometimes in the analysis software, and sometimes in a combination of factors. When working on a public data set (Zhao et al., 2005), I had to start the analysis from raw data. I tried using both the limma and marray Bioconductor packages, but both of them bail out with cryptic error messages. From what I’ve learnt by googling around, it seems that R doesn’t like batch loading of tables with different lengths.
I have 177 samples and I have to normalize them all together. Apparently this is a quirk of marray and limma (or worse, of R itself) which is preventing me from working properly. And this is not the first time, either: in the past year I’ve lost a lot of time dealing with software issues rather than performing real analysis. The problem has already been posted on some R mailing lists (and on BioC, too), but judging from the responses I doubt I’ll see a solution.
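With 177 files the pairwise trick generalizes naturally: intersect the gene lists across the whole batch, so every table ends up the same length and can be loaded and normalized together. A sketch of the idea (file parsing omitted, and again this illustrates the alignment logic, not marray’s or limma’s internals):

```python
from functools import reduce

# Keep only the genes reported in every sample, so all tables end up
# the same length.  Toy data; a real run would parse 177 SMD files.

def common_genes(tables):
    """tables: list of dicts mapping gene ID -> measurement."""
    return reduce(lambda acc, t: acc & set(t), tables[1:], set(tables[0]))

samples = [
    {'geneA': 1.0, 'geneB': 2.0, 'geneC': 3.0},
    {'geneB': 1.5, 'geneC': 2.5, 'geneD': 3.5},
    {'geneA': 0.5, 'geneB': 0.7, 'geneC': 0.9},
]
shared = common_genes(samples)
trimmed = [{g: t[g] for g in sorted(shared)} for t in samples]
print(sorted(shared))  # ['geneB', 'geneC']
```

Of course, the more samples you add, the smaller the shared gene set gets — which is part of why this kind of workaround is a poor substitute for the software handling ragged tables itself.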
I guess I’ll have to work around this somehow (and of course, this doesn’t improve the idea I have of R…).