Databases

By einar

April 25, 2007 - Comments

As I’ve been working to get some results done for my Ph.D. thesis, I’ve stumbled across the problem of having different data obtained through different software. Even if it’s just a matter of text files, the fields are all different and even if dealing with the same data, trying to infer relationships is a pain.

Therefore I decided to create a small database to host the data of my work and query it accordingly. I didn’t want to run a database server, so I settled for SQLite, a lightweight file-driven database, I don’t handle enormous amount of data so it should be ok. Up to now I’ve inserted parts of the Entrez Gene database. First of all I downloaded the gene_info.gz from NCBI’s FTP, which contains data such as gene name, gene symbol, and so on. Then it was a matter of filtering out non-human entries, and to do so I wrote a small script called taxon_filter.py:

[code lang=“python”] #!/usr/bin/env python

import gzip import sys import csv

“““Filters NCBI annotation files by human taxon (9606). Works directly from the source gzipped file and outputs a tab-delimited file.”””

class ncbi: delimiter = ‘\t’ quotechar = ‘"’ escapechar = None doublequote = True skipinitialspace = False lineterminator = ‘\n’ quoting = csv.QUOTE_NONE

def init(self):

Dialect registration

csv.register_dialect(“ncbi”, self)

ncbi()

if len(sys.argv) < 3: print “Not enough command line arguments” sys.exit(0)

try: compressed_file = gzip.open(sys.argv[1]) except IOError: print “Could not open file!” sys.exit(-1)

delim = csv.reader(compressed_file,dialect=“ncbi”)

try: destination_file = open(sys.argv[2],“wb”) except IOError: print “Can’t open destination file!” sys.exit(-1)

write_delim = csv.writer(destination_file,dialect=“ncbi”)

write_delim.writerow(delim.next())

for row in delim: if row[0] == “9606”: write_delim.writerow(row)

print “Complete!” compressed_file.close() destination_file.close() sys.exit(0)

[/code]

(apologies, tabs are messed up)

That filtered the file for entries belonging to taxon 9606 (_Homo sapiens) . _Then I had to keep only the interesting bits in the file, so I cut the leading comment and selected only the correct fields:

[code lang=“bash”] sed ‘1d’ gene_info_human | cut -f2,9,3,7-8 > entrez_gene.txt [/code]

SQLite has Python bindings (officially part of Python since 2.5) but those don’t allow the direct import of text files, so I fired up the command line sqlite3 command and created the relevant table, called entrez_gene, and imported the data:

[code lang=“bash”] .separator “\t” .import datafiles/File_NCBI/entrez_gene.txt entrez_gene [/code]

Done! This is the first step, then I’ll work on creating tables for my own data.

Comments