The plague of cross-database annotations

Recently I had to annotate a large number (10,000+) of genes identified by Entrez Gene IDs. My goal was to avoid the "annotation files" (basically CSV files) that part of our wet-lab group likes, because I wanted to stay up-to-date without having to remember to refresh them. So the obvious solution was to use a service available on the web, in an automated way. For reference, I only needed to attach gene symbol, gene name, chromosome and cytoband.
I tried many services:

  • UCSC Genome Browser: it has a public MySQL server, but it's rather slow and I did not want to clog it up. Using their tables and .sql files I managed a first pass at annotation, but about 2,000 genes came back without any annotation!

  • NCBI's own Entrez Gene: this requires EUtils, and Biopython does not ship a parser for Entrez Gene XML records. I had to scrap the idea because I did not have time to write one.

  • Ensembl: I decided to use the BioMart service, through Rpy. There were missing genes, and sometimes the IDs were "converted" into something else (I had no time to figure out what was happening). Also, some perfectly valid Entrez Gene IDs were simply not present in Ensembl.
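For the EUtils route, one workaround for the missing Entrez Gene parser is to use ESummary instead of the full gene records: its output is flat enough for the standard library's XML parser. The sketch below is only that, a sketch; the tag names (`Name`, `Description`, `Chromosome`, `MapLocation`) are my assumption about the version 2.0 ESummary format, so check them against a live response before trusting it (the sample XML here is illustrative, not fetched from NCBI).

```python
# Sketch: annotate Entrez Gene IDs via NCBI ESummary, which returns flat
# per-gene summaries that xml.etree can handle without a dedicated parser.
import urllib.parse
import xml.etree.ElementTree as ET

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def esummary_url(gene_ids):
    """Build an ESummary request URL for a batch of Entrez Gene IDs."""
    params = {"db": "gene", "version": "2.0", "id": ",".join(gene_ids)}
    return ESUMMARY + "?" + urllib.parse.urlencode(params)

def parse_gene_summaries(xml_text):
    """Extract symbol, name, chromosome and cytoband from ESummary XML.
    Tag names below are assumptions -- verify against a real response."""
    root = ET.fromstring(xml_text)
    out = {}
    for doc in root.iter("DocumentSummary"):
        out[doc.get("uid")] = {
            "symbol": doc.findtext("Name"),
            "name": doc.findtext("Description"),
            "chromosome": doc.findtext("Chromosome"),
            "cytoband": doc.findtext("MapLocation"),
        }
    return out

# Illustrative response fragment (not a live NCBI reply):
SAMPLE = """<eSummaryResult><DocumentSummarySet>
<DocumentSummary uid="7157">
  <Name>TP53</Name>
  <Description>tumor protein p53</Description>
  <Chromosome>17</Chromosome>
  <MapLocation>17p13.1</MapLocation>
</DocumentSummary>
</DocumentSummarySet></eSummaryResult>"""

if __name__ == "__main__":
    print(esummary_url(["7157", "672"]))
    print(parse_gene_summaries(SAMPLE))
```

Batching IDs into one request (NCBI accepts comma-separated lists) also keeps you under their rate limits when annotating 10,000+ genes.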

In the end I just grabbed Bioconductor's "org.Hs.eg.db" package and used its sqlite gene database (built from Entrez Gene) to annotate the list, with only 97 missing IDs (mostly genes whose identifiers had changed). However, this effort revealed a problem: the annotations are not consistent between databases. This is a real pain when doing microarray-based analysis, because you often have a large number of genes, and a perceived lack of annotation may lead to a number of them being discarded.
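Since org.Hs.eg.db is just a sqlite file, you can query it from any language, not only R. The join below shows the general idea; the table and column names (`genes`, `gene_info`, `chromosomes`, `cytogenetic_locations`, joined on an internal `_id`) are my recollection of the package's schema, so run `.schema` on the real file first. The demo builds a tiny in-memory stand-in rather than shipping the actual database.

```python
# Sketch: annotating Entrez Gene IDs against an org.Hs.eg.db-style sqlite
# schema. Table/column names are assumptions -- check ".schema" on the
# real file shipped with the Bioconductor package.
import sqlite3

QUERY = """
SELECT g.gene_id, i.symbol, i.gene_name, c.chromosome, l.cytogenetic_location
FROM genes g
LEFT JOIN gene_info i ON i._id = g._id
LEFT JOIN chromosomes c ON c._id = g._id
LEFT JOIN cytogenetic_locations l ON l._id = g._id
WHERE g.gene_id = ?
"""

def annotate(conn, entrez_ids):
    """Map Entrez ID -> (symbol, name, chromosome, cytoband).
    Unknown IDs map to None, so missing genes are easy to count."""
    result = {}
    for eid in entrez_ids:
        row = conn.execute(QUERY, (eid,)).fetchone()
        result[eid] = row[1:] if row else None
    return result

def demo_db():
    """Tiny in-memory stand-in for the real database (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE genes (_id INTEGER, gene_id TEXT);
        CREATE TABLE gene_info (_id INTEGER, gene_name TEXT, symbol TEXT);
        CREATE TABLE chromosomes (_id INTEGER, chromosome TEXT);
        CREATE TABLE cytogenetic_locations (_id INTEGER, cytogenetic_location TEXT);
        INSERT INTO genes VALUES (1, '7157');
        INSERT INTO gene_info VALUES (1, 'tumor protein p53', 'TP53');
        INSERT INTO chromosomes VALUES (1, '17');
        INSERT INTO cytogenetic_locations VALUES (1, '17p13.1');
    """)
    return conn

if __name__ == "__main__":
    conn = demo_db()
    print(annotate(conn, ["7157", "999999"]))
```

The LEFT JOINs matter: a gene that lacks, say, a cytoband still gets its symbol and chromosome instead of dropping out of the result entirely, which is exactly the discarding problem described above.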

I thought the situation was better than this. If I look up the same ID in different databases, I expect to get identical results. It's not as if Entrez Gene or Ensembl lack resources… or am I wrong?
