While reading the statistics for my blog, I noticed that a number of searches looked for hierarchical clustering with Python, which I covered quite a while ago. Today I’d like to present an updated version which uses more robust techniques.
My Akademy talk proposal was not accepted, but the organizers were kind enough to offer me the chance to hold a BoF on the same subject. Now I bet you wonder what I’m going to discuss, and I think the title already gives you an idea:
KDE and bioinformatics: the missing link
Although in the KDE community we have our fair share of scientists (hey there, Stuart!), my BoF will focus on the adoption of KDE in the field of bioinformatics (my day job, not-so-by-chance) on the "outsiders" front and how to improve the current situation. To elaborate further, bioinformatics is a rather broad field where biological data are treated with computational methods. The oldest and most famous branch of bioinformatics is sequence analysis and related fields, where sequences of DNA are analyzed, for example, to find common ancestors among several species, or to reconstruct the genome sequence of an organism by comparing it to a related species. A more recent example is related to high-throughput technologies, which produce huge amounts of data from a very small number of experiments ("ultramassive sequencing" and DNA microarrays are examples of such technologies).
Either way, bioinformaticians have to deal with large amounts of data all the time, and usually there’s no "shrink-wrap" solution to the problems they have to face, software-wise. That’s because we do research, so we need to find something new. The solution is often to write algorithms, or re-implement existing ones in a form that is suited to the task at hand. So bioinformaticians also write software, although they’re (usually) by no means professional coders: some have a mathematical or statistical background, others (like me) come from experience at the lab bench. What kind of programs do bioinformaticians write? Normally scripts and small stuff, but in certain cases even full-blown algorithms and applications. Some become so famous that they are even trend-setters.
Which brings us to the heart of the matter: how does KDE stand in all of this? Sadly, not too well. I’ve done some research in the published literature, but only one proper hit came back: a KDE application for neuroscience (based on the 3.5.x Development Platform) published in 2008. I know that big research places like CERN use KDE, but to my knowledge smaller entities such as research groups code, in the majority of cases, for Windows or for web-based solutions. Given that at least a significant portion of bioinformaticians use UNIX-like operating systems, the question we need to answer is: why?
The first and foremost problem is related to market share. Research groups don’t even know that KDE exists, so it’s unlikely they will develop something using the Development Platform (even now that it’s becoming more cross-platform). This is where some promo efforts could help. Secondly, the problem lies in the "difficulty" (notice the quotes!) of developing with the KDE Development Platform: most bioinformaticians, as I wrote, are not professional coders, and few of them know C++. The most used languages in bioinformatics are Perl and Java (with some Python and Ruby thrown into the mix). Thus, the need for proper bindings. The bindings are there, thanks to the excellent work of the kde-bindings team, but documentation is still lacking (mainly in the examples department, but also in tutorials and getting started guides that aren’t aimed at C++). Some documentation is auto-generated, and while the KDE API docs are usually not too hard to read, they can still scare off newcomers. Of course this is not the fault of the kde-bindings team: more help is simply needed.
Promo efforts and better bindings are the keys to spreading KDE further in the field of bioinformatics. This is what my BoF is about, plus an informal discussion on the use of FOSS in academia and related matters.
Interested? If you are, come to the BoF, which will be held on Tuesday, 6th July at 15.00 in Area 2 of the main room at Demola.
I’ll also be around afterwards until the following morning (sadly, two days is the most I can manage) in case you’re interested in a chat.
At last, after months of inactivity, I pushed out a new release of DataMatrix. Although the version bump is small (0.8), there are a lot of changes since the last release. The most notable include:
- Ability to apply functions to elements of the matrix
- Ability to filter rows by column contents
- Ability to transpose the matrix (swapping rows and columns)
- An option to load text files produced by R (which are, by design, broken)
- Removed the getter for columns in favor of direct dictionary-like syntax
- A lot of bug fixes
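To give an idea of what the operations listed above mean in practice, here is a minimal sketch in plain Python. It is not DataMatrix's actual API: the function names (`load_r_table`, `apply_to_elements`, `filter_rows`, `transpose`) and the sample data are made up for illustration. The loader assumes a tab-separated file as written by R's `write.table` with row names enabled, where the header line has one field fewer than the data lines because the row-name column gets no header of its own.

```python
import csv
import io


def load_r_table(handle, delimiter="\t"):
    """Read an R-style table: header is one field shorter than data lines.

    Returns (column_names, rows), where rows maps row name -> {column: value}.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    rows = {}
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header) + 1:
            raise ValueError("not an R-style line: %r" % (fields,))
        # The first field is the row name, the rest line up with the header
        rows[fields[0]] = dict(zip(header, fields[1:]))
    return header, rows


def apply_to_elements(rows, func):
    """Apply func to every element, returning a new table."""
    return {name: {col: func(value) for col, value in row.items()}
            for name, row in rows.items()}


def filter_rows(rows, column, predicate):
    """Keep only the rows whose value in `column` satisfies predicate."""
    return {name: row for name, row in rows.items()
            if predicate(row[column])}


def transpose(columns, rows):
    """Swap rows and columns: the result is keyed by column name first."""
    return {col: {name: row[col] for name, row in rows.items()}
            for col in columns}


# Hypothetical sample data in the R layout (note the short header line)
sample = "ctrl\ttreated\ngeneA\t1.5\t3.0\ngeneB\t2.0\t0.5\n"

columns, rows = load_r_table(io.StringIO(sample))
numeric = apply_to_elements(rows, float)          # strings -> floats
high = filter_rows(numeric, "treated", lambda v: v > 1.0)
flipped = transpose(columns, numeric)
```

Nested dictionaries keep the sketch short and mirror the dictionary-like column access mentioned in the list; the real module may well store things differently.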
The download links on the project page have been updated, along with the documentation. There is another change as well: from now on the official Git repository is hosted on gitorious.org, and no longer on github, because gitorious (the software) is free, while github.com’s is not. It’s mainly a philosophical issue (the same that prompted me to switch from twitter to identi.ca).
As of today, DataMatrix is also officially hosted on the Python Package Index (with the name “datamatrix”), meaning that you can use easy_install to quickly install it.
If you use this module, let me know what you think (including bugs, if you find them).
For the past few months I’ve wanted to write a small Plasma applet to help me with some boring tasks as a bioinformatician. One example (for the non-scientific crowd out there) is when my analysis work turns up a specific gene that I want to take a look at. I am often lazy, so instead of firing up the browser to look at the online resources, I wanted to write something which could access said resources programmatically.