The tower of Babel of bioinformatics

The title of this post hints at a problem I’ve stumbled upon many times when doing microarray data analysis: the plethora of different file formats. In “conventional” (as I call it) bioinformatics this is less of a problem, as formats such as FASTA and PDB are quite standardized by now.

In microarray studies, I keep on seeing attempts to reinvent the wheel. Some are caused by technology: a log2 ratio works with two-color arrays but makes absolutely no sense on one-color Affymetrix arrays (well, except on SNP arrays when used for copy number analysis, mostly to mimic arrayCGH). Others, however, are created for the needs of specific software. The problem is that, like in the tower of Babel, these formats don’t play well together. I ran into this when working on my thesis, as I got three data sets in rather different formats. The only solution was to write scripts that would handle the conversions. When you have dozens of different data sources, this quickly becomes annoying.
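To make the two-color case concrete, here is a minimal sketch of the kind of conversion script I mean. The record layout (probe ID, Cy5 intensity, Cy3 intensity) and the function names are assumptions for illustration only; real exports differ per scanner and software, which is exactly the problem.

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3) for one spot, or None if an intensity is not positive."""
    if cy5 <= 0 or cy3 <= 0:
        # The logarithm is undefined here; flag the spot instead of guessing a value.
        return None
    return math.log2(cy5 / cy3)

def convert_rows(rows):
    """Turn (probe_id, cy5, cy3) records into (probe_id, log2 ratio) records.

    The three-field layout is hypothetical, not any vendor's actual format.
    """
    return [(probe_id, log2_ratio(cy5, cy3)) for probe_id, cy5, cy3 in rows]
```

A one-color array has no second channel to divide by, so a script like this simply has nothing to compute there.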

A related problem is annotation. Affymetrix uses its own IDs, so does Illumina, others use UniGene clusters or Entrez Gene IDs… and again meta-analysis becomes a daunting task. Luckily a few programs convert all the IDs to a single format (e.g., Ensembl), but not always perfectly (FatiGO Plus, for example, discards “ambiguous” genes, that is, genes with more than one Ensembl ID). Even raw data suffers from this, considering that there are different normalization and quantification algorithms (MAS5, PLIER, RMA, just to name a few).
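The ambiguity issue is easy to illustrate. The sketch below is not how FatiGO Plus or any other tool actually works; it only shows, with made-up identifiers and hypothetical function names, how a one-to-many mapping can be reported instead of silently dropped.

```python
from collections import defaultdict

def build_mapping(pairs):
    """Collect (source_id, target_id) pairs into a source -> set of targets table."""
    mapping = defaultdict(set)
    for source_id, target_id in pairs:
        mapping[source_id].add(target_id)
    return mapping

def map_ids(source_ids, mapping):
    """Split IDs into uniquely mapped, ambiguous (several targets) and unmapped."""
    unique, ambiguous, unmapped = {}, {}, []
    for source_id in source_ids:
        targets = mapping.get(source_id, set())
        if len(targets) == 1:
            unique[source_id] = next(iter(targets))
        elif targets:
            # Keep every candidate so the caller decides what to do with it.
            ambiguous[source_id] = sorted(targets)
        else:
            unmapped.append(source_id)
    return unique, ambiguous, unmapped

# Placeholder identifiers, not real probe or gene IDs.
table = build_mapping([
    ("probe_A", "gene_1"),
    ("probe_B", "gene_2"),
    ("probe_B", "gene_3"),
])
unique, ambiguous, unmapped = map_ids(["probe_A", "probe_B", "probe_C"], table)
```

Returning the ambiguous and unmapped IDs separately at least makes it visible how much of a data set is lost in translation.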

Brazma et al. proposed the MIAME specification a few years ago. It was a step in the right direction, though MAGE-ML in my opinion was somewhat overkill (therefore I’m not surprised to see MAGE-TAB, a tab-delimited format, being pushed now). However, it didn’t help that, unlike ArrayExpress, Gene Expression Omnibus did not adhere to MIAME at first and instead pursued its own format (SOFT).

I don’t see a clear solution to this problem arriving soon, but I still believe that we, the bioinformatics people, should work hard to make life easier for our colleagues by re-using existing standards and formats wherever possible.
*[SNP]: Single Nucleotide Polymorphism
*[arrayCGH]: array Comparative Genomic Hybridization
