November 15, 2007 – 20:57
While working today on an annotation class in Python I stumbled on a problem. Normally I work with lists of genes that are consistent, i.e. all Entrez Gene IDs (or RefSeq IDs, or Genome Browser IDs…), but today I had a list of mixed identifiers.
The subsequent idea was “let’s implement auto-detection of common identifiers in the class”. The problem is… is there any actual documentation on how identifiers are made? So far, using regular expressions, I’ve tracked down a few:
- RefSeq
- GenBank
- Entrez Gene
- UCSC Genome Browser
- Ensembl
However, I have no idea whether I have covered every variant of these IDs. Does anyone know where I could look this information up?
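To make the regular-expression idea concrete, here is a rough sketch of what such auto-detection could look like. The patterns are simplified assumptions based on the commonly seen public formats (for example `NM_000546` for RefSeq or `ENSG00000141510` for Ensembl) and are certainly not exhaustive; the function name `detect_id_type` is made up for illustration and is not the code from my class.

```python
import re

# Simplified, non-exhaustive patterns (assumptions, not an official spec).
# Order matters only for readability: the patterns do not overlap.
ID_PATTERNS = [
    # RefSeq: common two-letter prefixes, underscore, digits, optional version
    ("RefSeq", re.compile(r"(?:[NX][MRP]|N[CGTW])_\d+(?:\.\d+)?")),
    # Ensembl: ENS, optional species code, feature letter, 11 digits
    ("Ensembl", re.compile(r"ENS[A-Z]{0,4}[GTPE]\d{11}(?:\.\d+)?")),
    # UCSC known genes: "uc", 3 digits, 3 lowercase letters, version
    ("UCSC Genome Browser", re.compile(r"uc\d{3}[a-z]{3}\.\d+")),
    # GenBank nucleotide: 1 letter + 5 digits or 2 letters + 6 digits
    ("GenBank", re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?")),
    # Entrez Gene: a plain integer
    ("Entrez Gene", re.compile(r"\d+")),
]


def detect_id_type(identifier):
    """Return the name of the first matching ID type, or None if unknown."""
    identifier = identifier.strip()
    for name, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return None
```

With something like this, a mixed list can be grouped by type before annotation, and anything that returns `None` can be reported instead of silently dropped.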
(On a related note: my thesis defense will be on January 14th, 2008, so I have to get the printing going)
Following up on my recent post, I’ve been looking for alternatives to TMeV. So far I’ve found the R package pvclust and the Pycluster library, which is part of Biopython. The first one also performs bootstrapping (I’m not sure if it’s similar to what support trees do, but it’s still better than no resampling at all). I’ve found another Python project, but it is still too basic to do what I need.
The other day I was thinking about how to make screencaps for the anime I watch. Windows users often use Media Player Classic, which can create a video contact sheet (i.e., a series of captures) out of a movie file. I had two problems with this:
- The biggest is that it runs on Windows, and I don’t use Windows;
- The frames needed to be manually cropped every time, which was slow.
Therefore, inspired by a video contact sheet script for Linux, I decided to write a small piece of code to make captures. It works in a rather simple way, taking a snapshot every X minutes, where X is an integer.
The code is here: it requires bash, python (just for checks) and mplayer to work correctly. It should work with every format mplayer supports. It’s hackish, but if you find it useful, let me know.
EDIT: I changed a line (thanks, greg) because this syntax highlighter messes up some formatting.
#!/bin/bash
# (C) 2007 Luca Beltrame - licensed under the terms of the GPL v2
# Simple script to output video frames with MPlayer. It takes the file
# and a step argument giving the interval, in minutes, between captures.
# The step must be a positive integer!
if [ $# -ne 2 ]
then
    echo "Usage: $0 <file> <step>"
    exit 1
fi
file=$1
step=$2
i=1
# Python is used only to check that the step is a positive integer
if ! python -c "import sys; s = sys.argv[1]; sys.exit(0 if s.isdigit() and int(s) > 0 else 1)" "$step"
then
    echo "Step must be an integer!"
    # Stop here, otherwise the loop below would run with a bad step
    exit 1
fi
# Ask mplayer for the length of the video (in seconds)
length=`mplayer -benchmark -ao null -vo null -identify -frames 0 -quiet "$file" 2>/dev/null | grep ID_LENGTH | cut -f2 -d'='`
# Convert to whole minutes (bc truncates the division)
end=`echo "$length/60" | bc`
while [ $i -lt $end ]
do
    minutes="00:"$i":00"
    name="capture_"$i"min.png"
    # We take two captures as the first will be always black - mplayer bug?
    mplayer -sws 9 -ao null -quiet -benchmark -vo "png:z=0" -frames 2 -ss $minutes "$file" &> /dev/null
    mv 00000002.png "$name"
    rm -f 00000001.png
    i=$((i + step))
done
echo "Screenshots saved."
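For reference, the capture schedule used by the loop can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the video length is given in seconds (as reported by `ID_LENGTH`); the function name `capture_times` is hypothetical and not part of the script.

```python
def capture_times(length_seconds, step_minutes):
    """Return the seek positions ("00:MM:00") at which captures are taken.

    Mirrors the shell loop: start at minute 1, advance by the step, and
    stop before the (truncated) length of the video in minutes.
    """
    if step_minutes < 1:
        raise ValueError("step must be a positive integer")
    end = int(length_seconds // 60)  # whole minutes, like the bc division
    return ["00:%d:00" % minute for minute in range(1, end, step_minutes)]
```

For a 25-minute episode with a step of 5, this gives captures at minutes 1, 6, 11, 16 and 21.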
Today I started working on a data set published on GEO. As the sample data were somewhat inconsistent (they mentioned 23 controls when I found 28), I decided to parse the SOFT file from GEO in order to get the exact sample information.
I made a grave mistake. First of all, Biopython’s SOFT parser is horribly broken (it doesn’t work at all) and largely undocumented: I could work around the lack of documentation (API docs), but not around the fact that it wouldn’t work. So I turned to R, which offers a GEO query module through Bioconductor.
Again, that proved to be a terrible mistake. For a file containing 183 samples, the analysis has been running for four hours with no sign of completing anytime soon (not to mention a possible memory leak). After this, I gave up. I’m going to get the reduced data sheet and write a small parser in Python myself.
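A small parser of that kind does not need much. The sketch below is an assumption about how it could be done, not the code I ended up with: it relies only on the SOFT conventions that entity lines start with `^` (for example `^SAMPLE = GSM...`) and attribute lines start with `!`, and it ignores the data tables entirely. The name `parse_soft_samples` is hypothetical.

```python
def parse_soft_samples(lines):
    """Collect the attributes of every SAMPLE entity in a SOFT file.

    `lines` is any iterable of text lines (e.g. an open file). Returns a
    dict mapping sample accession -> {attribute name: [values]}. Values
    are lists because SOFT attributes may be repeated.
    """
    samples = {}
    current = None  # attribute dict of the sample being read, if any
    for raw in lines:
        line = raw.strip()
        if line.startswith("^"):
            # Entity line: "^SAMPLE = GSM123", "^PLATFORM = GPL96", ...
            entity, _, name = line[1:].partition("=")
            if entity.strip().upper() == "SAMPLE":
                current = samples.setdefault(name.strip(), {})
            else:
                current = None  # skip attributes of other entities
        elif current is not None and line.startswith("!") and "=" in line:
            # Attribute line: "!Sample_title = control 1"
            key, _, value = line[1:].partition("=")
            current.setdefault(key.strip(), []).append(value.strip())
        # Table rows and "!sample_table_begin/end" markers are ignored
    return samples
```

Counting the controls then comes down to looping over the returned dictionary and checking whichever attribute carries the sample status in the file at hand.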
What is frustrating is the lack of quality: I could concentrate on my own work rather than reinventing the wheel for the nth time if the existing implementations worked. What’s the point in releasing non-working software? I could understand bugs, but this is one step further.