DataMatrix 0.7 has been released

Finally a new entry! I’ve been extremely busy with other things, which is why I have not had time to write more. One of the main reasons is related to an important landmark in my professional career, but I’ll write more about it after January 1st (hint: those who follow my Twitter updates may have already understood).

As a nice way to break the hiatus, I’m releasing a new version of DataMatrix, my implementation of R’s data.frame in Python. Although the version bump is small, there are loads of improvements. First of all, there is proper support for file-like objects, as well as support for appending and inserting both rows and columns. writeMatrix has been substantially improved and now writes files correctly, and I have added (experimental) support for a DataMatrix object that does not require files – EmptyMatrix. Also, there is now proper documentation. Last but not least, unit tests have been added, a good way to watch out for regressions in the code.

Finally, this version marks the entrance of dalloliogm as a contributor to the code. He gave quite a number of helpful hints, especially with regard to unit tests.

I’m quite satisfied with how DataMatrix behaves – as a matter of fact I use it extensively in a number of internal projects.

You can grab DataMatrix 0.7 as a source package or as a Windows installer. Comments are welcome.

The plague of cross-database annotations

Recently I had to annotate a large number (10,000+) of genes identified by Entrez Gene IDs. My goal was to avoid the “annotation files” (basically CSV files) that part of the wet lab group likes, because I wanted to stay up-to-date without having to remember to update them. So the obvious solution was to use a service available on the web, in an automated way. For reference, I was only trying to attach gene symbol, gene name, chromosome and cytoband.
I tried many services:

  • UCSC Genome Browser: it has a MySQL server but it’s rather slow and I did not want to clog it up. Using their tables and .sql files I managed to get a first shot at annotation, but about 2,000 genes were without annotation!
  • NCBI’s own Entrez Gene: This needs EUtils, and in Biopython there is no parser for Entrez Gene XML entries. I had to scrap the idea because I did not have time.
  • Ensembl: I decided to use the Biomart service, through Rpy. There were missing genes, and sometimes the IDs were “converted” into something else (I had no time to figure out what was happening). Also some perfectly valid genes (in Entrez Gene) were not present in Ensembl.

In the end I just grabbed Bioconductor’s “org.Hs.eg.db” package and used its sqlite gene database (from Entrez Gene) to annotate the list, with only 97 missing IDs (mostly genes that had changed identifiers). However, this effort revealed a problem: the annotations are not consistent between databases. This is a real pain when doing microarray-based analysis, because you often have a large number of genes, and a perceived lack of annotation might lead to a number of them being discarded.
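To make the idea concrete, here is a minimal sketch of annotating a list of Entrez Gene IDs from a local sqlite file and keeping track of the IDs that come back empty. Note that the table name `gene_annotation` and its columns are assumptions made up for this illustration. They are not the actual schema of the org.Hs.eg.db sqlite file, which you would need to inspect first. The demo database is built in memory so that the example runs on its own.

```python
import sqlite3

FIELDS = ("symbol", "gene_name", "chromosome", "cytoband")


def annotate(conn, entrez_ids):
    """Return (annotations, missing) for a list of Entrez Gene IDs.

    Assumes a hypothetical table gene_annotation(gene_id, symbol,
    gene_name, chromosome, cytoband); adapt the query to the real schema.
    """
    query = (
        "SELECT symbol, gene_name, chromosome, cytoband "
        "FROM gene_annotation WHERE gene_id = ?"
    )
    annotations = {}
    for gene_id in entrez_ids:
        row = conn.execute(query, (gene_id,)).fetchone()
        if row is not None:
            annotations[gene_id] = dict(zip(FIELDS, row))
    # IDs with no row at all are reported instead of silently dropped.
    missing = [g for g in entrez_ids if g not in annotations]
    return annotations, missing


def build_demo_db():
    """Create a tiny in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_annotation ("
        "gene_id TEXT PRIMARY KEY, symbol TEXT, gene_name TEXT, "
        "chromosome TEXT, cytoband TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene_annotation VALUES (?, ?, ?, ?, ?)",
        [
            ("7157", "TP53", "tumor protein p53", "17", "17p13.1"),
            ("672", "BRCA1", "BRCA1 DNA repair associated", "17", "17q21.31"),
        ],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    # "999999999" is a placeholder ID that is not in the demo table.
    found, missing = annotate(demo, ["7157", "672", "999999999"])
    print(found)
    print("missing:", missing)
```

Keeping the list of missing IDs explicit is the point here: it lets you check why a gene has no annotation (for example, a changed identifier) before it gets discarded from the analysis.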

I thought the situation was better than this. If I annotate genes in different databases with the same ID, I expect to get identical results. I mean, it’s not like Gene or Ensembl have few resources… or am I wrong?

DataMatrix 0.5

At last, since it’s been like ages, I decided to put out a new version of DataMatrix. For those who haven’t seen my previous post, DataMatrix is a Pythonic implementation of R’s data.frame. It enables you to manipulate a text file by columns or rows, to your liking, using a dictionary-like syntax.
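For readers who have not seen the older post, the following is a rough sketch of what dictionary-like access to the columns of a text file can look like. It is not DataMatrix's code or API: the class `ColumnTable` and its attributes are hypothetical names invented for this illustration, and it uses only the standard library.

```python
import csv
import io


class ColumnTable:
    """Hypothetical stand-in showing dictionary-like column access."""

    def __init__(self, handle, delimiter="\t"):
        reader = csv.reader(handle, delimiter=delimiter)
        self.header = next(reader)
        # Skip blank lines, which csv.reader returns as empty lists.
        self.rows = [row for row in reader if row]

    def __getitem__(self, column):
        """Return all values of a column, looked up by its header name."""
        index = self.header.index(column)
        return [row[index] for row in self.rows]

    def __setitem__(self, column, values):
        """Replace an existing column or append a new one."""
        values = list(values)
        if len(values) != len(self.rows):
            raise ValueError("column length does not match number of rows")
        if column in self.header:
            index = self.header.index(column)
            for row, value in zip(self.rows, values):
                row[index] = value
        else:
            self.header.append(column)
            for row, value in zip(self.rows, values):
                row.append(value)


if __name__ == "__main__":
    text = "gene\tchrom\nTP53\t17\nBRCA1\t17\n"
    table = ColumnTable(io.StringIO(text))
    print(table["gene"])
    table["band"] = ["17p13.1", "17q21.31"]
    print(table.header)
```

Storing the data row by row and resolving the column name on access keeps the sketch short; a real implementation might index columns up front if speed matters.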

This new version brings a few improvements and fixes for a couple of bugs (for example, saveMatrix did not really save), plus the start (only a stub at the moment) of an append function to add more columns (I’ll also think about a function to add rows).

DataMatrix is licensed under the GNU GPL, version 2 only. You can download the installer (Windows) or the source distribution (Linux and other *nixes). The only requirement is Python 2.5 or later installed on your system.

The README is currently a stub, but you can browse the pydoc-generated documentation, which details how to instantiate and use DataMatrix objects (or you can turn to my older post).

Also, since git is the new “cool feature of the day”, DataMatrix is hosted on GitHub, and you can grab the source with

git clone git://github.com/cswegger/datamatrix.git

Comments and suggestions are welcome. I’ll be putting a static page on DataMatrix tomorrow, if time permits.

data.frames in Python – DataMatrix

For a long time I have tried to handle text files in Python in the same way that R’s data.frame does – that is, direct access to columns and rows of a loaded text file. As I don’t like R at all, I struggled to find a Pythonic equivalent, and since I found none, I decided to eat my own food and write an implementation, which is what you’ll find below.

Continue reading