data.frames in Python – DataMatrix

For a long time I have wanted to handle text files in Python the way R’s data.frame does, that is, with direct access to the columns and rows of a loaded text file. As I don’t like R at all, I looked for a Pythonic equivalent, and since I found none, I decided to write my own implementation, which is what you’ll find below.

The idea is to store the values of the text file as a dictionary of columns, where each column is a list of (row name, row value) tuples. This way you can access the columns by name (I still need to see if it’s workable to also use numbers), or you can view specific rows, including all or a subset of the columns. It’s reasonably fast and it allows non-sequential access, which you can’t do when reading a file (or a file-like object) line by line.
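
To make the layout concrete, here is a minimal standalone sketch of the idea. It is not the module's actual code: the names load_columns, get_column and get_row are hypothetical, it assumes a tab-delimited file with a header line and the row names in the first column, it leaves values as strings, and it is written for a current Python 3 rather than the 2.5 the module targets.

```python
import csv
import io


def load_columns(handle, delimiter="\t"):
    """Read a delimited file into {column name: [(row name, value), ...]}."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    # The first header cell labels the row names; the rest are data columns.
    columns = {name: [] for name in header[1:]}
    for record in reader:
        row_name = record[0]
        for name, value in zip(header[1:], record[1:]):
            columns[name].append((row_name, value))
    return columns


def get_column(columns, name):
    """Return only the values of one column, without the row names."""
    return [value for _, value in columns[name]]


def get_row(columns, index, names=None):
    """Return [row name, value, ...] for one row, optionally for a subset of columns."""
    names = list(columns) if names is None else names
    row_name = columns[names[0]][index][0]
    return [row_name] + [columns[name][index][1] for name in names]


# Hypothetical sample data, held in memory instead of a real file.
sample = io.StringIO("GeneID\tExp1\tExp2\nGene1\t56.34\t4.56\nGene2\t2.55\t7.10\n")
table = load_columns(sample)
```

Because every column is an ordinary list, any row can be reached by index in any order once the file has been read.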

Requirements

I have tested this on Python 2.5.1. Older versions may or may not work. All modules called by this one should be shipped with Python itself.

Download and installation

Download the py file directly. Currently there is no installation mechanism, so copy it wherever Python can find it. There’s some API documentation generated with pydoc.

This module is licensed under the GNU General Public License, version 2.

Usage

First of all, import the module:


import datamatrix

Then open a file and instantiate a DataMatrix object:


fh = open("somefile.txt")
data = datamatrix.DataMatrix(fh)

By default no column with row names is specified, so if you have one, you have to specify it:

data = datamatrix.DataMatrix(fh, row_names=1)

More options are in the documentation.

Once the DataMatrix is initialized, you can see which columns it contains, access a column by name, and view rows with the getRow method:


>>> data.columns
["GeneID", "Great_Exp1", "Great_Exp2"]

>>> data["Great_Exp1"]
[("Gene1", 56.34),
...
]

>>> data.getRow(5)
["NOT_EXISTENT", "56.545", "4.56"]

Sometimes you want only the values of a column, without the row identifiers, and that’s where getColumn comes in:


>>> data.getColumn("Great_Exp1")
[56.34, 2.55, ...]

Should you want to save a DataMatrix instance, you can use the writeMatrix function:


datamatrix.writeMatrix(data,fname="/path/to/somewhere/file.txt")
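
For readers curious what writing such a structure involves, here is a minimal sketch under stated assumptions. It is not the module's writeMatrix: the function write_columns and its row_label parameter are hypothetical, it works on a plain dictionary of (row name, value) lists rather than a DataMatrix instance, and it assumes every column lists the same rows in the same order. It targets a current Python 3.

```python
import csv
import io


def write_columns(columns, handle, row_label="GeneID", delimiter="\t"):
    """Write {column name: [(row name, value), ...]} as a delimited table."""
    names = list(columns)
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow([row_label] + names)
    # Every column holds the same row names in the same order,
    # so the columns can be walked in parallel, one row at a time.
    for entries in zip(*(columns[name] for name in names)):
        row_name = entries[0][0]
        writer.writerow([row_name] + [value for _, value in entries])


# Hypothetical sample data; a StringIO buffer stands in for an open file.
table = {
    "Exp1": [("Gene1", 56.34), ("Gene2", 2.55)],
    "Exp2": [("Gene1", 4.56), ("Gene2", 7.1)],
}
buffer = io.StringIO()
write_columns(table, buffer)
```

Passing an open file handle instead of the buffer writes the same table to disk.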

That’s all. Questions and suggestions, especially on coding and improvements, are very welcome.

9 thoughts on “data.frames in Python – DataMatrix”

  1. Tony

    Nice work.

    Have you played around with numpy? This is pretty similar to Record Arrays [http://www.scipy.org/RecordArrays] and recarrays (which are similar to Record Arrays, but slightly different) in numpy.

  2. Einar

    Good catch. I found the scipy web page a little disorganized, which is probably why I did not find it. I will see if I can integrate the file management part (AFAIK numpy provides only simplistic ways of converting files into arrays) with either recarrays or Record Arrays.
    Thanks a lot for the suggestion!
    EDIT: I have briefly looked at numpy and it looks like you can’t really add to arrays in an iteration like my module does. The only alternative would be to use fromfile to load the file directly and build the record array on the spot, but I’m not sure if I should use that, since numpy’s documentation says that the function is not very robust.

  3. Mike

    Hi,

    I found your module quite useful! I really like it.
    However, one thing I don’t get is how to modify some entries in a few columns and then write the whole matrix back to a file with the modifications I made. Say I have a file with 10 columns and each column has a few thousand entries. If I would like to replace entries whose value is <10 with “NA” for only some columns, I do not see a way to write the modification to a file. Can you give me some examples for this purpose?

    Thanks!

  4. Mike

    Hi,

    I found your module quite useful! I really like it.
    However, one thing I don’t get is how to modify some entries in a few columns and then write the whole matrix back to a file with the modifications I made. Say I have a file with 10 columns and each column has a few thousand entries. If I would like to replace entries whose value is <10 with “NA” for only some columns, I do not see a way to write the modification to a file. Can you give me some examples for this purpose?

    Thanks!

  5. Einar

    DataMatrix entries are lists, hence mutable. The latest version (in git) supports __setitem__, so you could make a list of tuples in the same format as the entries you read (i.e. a list of (rowname, value) tuples) and reassign it with datamatrix[columnname] = list.
    Then you can use the writeMatrix function (use the one in the git repository, the one in the 0.5 version is broken as far as I remember) to write the modified DataMatrix to a file.
    I have made loads of enhancements with respect to this version, so I guess I should put a new version out soon.
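
    A minimal sketch of that reassignment, under stated assumptions: a plain dictionary stands in for the DataMatrix instance, the helper replace_below is hypothetical, and the values are assumed to be numbers already. It is written for a current Python 3 and does not use the module itself.

```python
def replace_below(entries, threshold, replacement="NA"):
    """Return a new (row name, value) list with values under `threshold` replaced."""
    return [
        (row_name, replacement if value < threshold else value)
        for row_name, value in entries
    ]


# A plain dict stands in for the DataMatrix instance here (hypothetical data).
table = {
    "Exp1": [("Gene1", 56.34), ("Gene2", 2.55)],
    "Exp2": [("Gene1", 4.56), ("Gene2", 17.1)],
}
for name in ["Exp1"]:  # only the columns you want to change
    # Reassign the whole column, in the same (row name, value) format.
    table[name] = replace_below(table[name], 10)
```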

  6. Mike

    Hi Einar,

    Thanks for the reply. I am new to Python, so I am not sure how what you described works.

    Here is part of the code I wrote:
    ===========================
    for num in column:
    (indent) col_data=data_obj.getColumn(col_names[int(num)-1])
    (indent)(indent) for i in range(1,len(col_data)):
    (indent)(indent)(indent) col_data[i]=col_data[i].replace(old,new)
    (indent)(indent) datamatrix(col_names[int(num)-1])=col_data
    ===========================
    NOTE:
    (indent)-> represents indentation since the blog does not support it
    column-> a list of columns to which I would like to apply the substitution
    col_names-> a list of column names
    data_obj-> datamatrix object
    old-> some pattern I do not want to keep
    new-> substitute old by the new pattern

    The last line of the code, where I tried to mimic what you said about reassigning with datamatrix[columnname] = list, does not work because it assigns to a function call.

    I think my question comes down to how to make changes within column entries and then write the datamatrix object to a file.

    I really appreciate any comments and suggestions you can give.

    Mike

  7. Einar

    The last line should be
    data_obj[col_names[int(num)-1]] = col_data

    and to save, the writeMatrix function is not part of the class, so assuming you have imported “datamatrix”:

    datamatrix.writeMatrix(datamatrix_object, open("somefile.txt", "w"), header=True)

    Notice that this will work only with the git version of DataMatrix (since it has changed a bit internally)

    With the old version, I think you should do

    datamatrix.writeMatrix(datamatrix_object, "somefile.txt")

    Where datamatrix_object is your DataMatrix instance.

  8. Jindra Vavruska

    Everything fine, except the package name which is pretty confusing. Datamatrix is a name (sometimes spelled DataMatrix) of a kind of 2D optical code (like a barcode, but in 2D, made of dots instead of bars).

    Would it be possible to rename the package? I was googling for “datamatrix” and got this, instead of what I wanted…

    Jindra

  9. Tyberius Prime

    There’s now also a pydataframe module over at pypi that closely mimics R’s dataframe.
