While reading the statistics for my blog, I noticed that a number of searches looked for hierarchical clustering with Python, which I covered quite a while ago. Today I’d like to present an updated version which uses more robust techniques.
Defining the problem
Since Eisen’s original paper on clustering, this form of analysis has been widely used by researchers. However, as is well known, such analyses can be susceptible to an ordering bias: in other words, the order of the samples and/or genes might influence the final result. That’s why popular software such as TMeV offers alternative approaches based on bootstrapping.
In this specific form of bootstrapping, the samples and/or genes are randomly shuffled a number of times (1000 or more iterations are a good starting point) and the resulting dendrograms checked for consistency and robustness of partitioning. In other words, a p-value is calculated, our null hypothesis being that the arrangement of samples/genes is merely due to chance. Depending on the software, this value might be expressed either as a p-value or as a percentage (TMeV calls it support).
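To make the idea concrete, here is a toy sketch of ordinary bootstrap support in plain Python with scipy (this is my own simplification, not pvclust: `bootstrap_support` is a hypothetical helper, and cutting the tree into two flat clusters is an arbitrary choice for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def bootstrap_support(data, pair, n_boot=200, seed=0):
    """Rough bootstrap 'support' for two samples (columns) clustering
    together: resample the genes (rows) with replacement, recluster,
    and count how often the pair lands in the same of two flat clusters."""
    rng = np.random.default_rng(seed)
    n_genes = data.shape[0]
    hits = 0
    for _ in range(n_boot):
        # resample the rows (genes) with replacement
        resampled = data[rng.integers(0, n_genes, n_genes), :]
        # average-linkage clustering of the columns on correlation distance
        tree = linkage(resampled.T, method="average", metric="correlation")
        labels = fcluster(tree, t=2, criterion="maxclust")
        hits += int(labels[pair[0]] == labels[pair[1]])
    return hits / n_boot
```

With 1000 or more iterations the fraction stabilizes; this is essentially the quantity that pvclust reports as its BP (bootstrap probability) value.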
A few years ago, I came across an interesting method developed by Hidetoshi Shimodaira: the technique, called multiscale bootstrap resampling, aims at computing more accurate p-values from the bootstrap. Shimodaira calls the resulting value an AU value, where AU stands for “approximately unbiased”: a more precise p-value than the one obtained through ordinary bootstrapping alone.
In addition to this nice algorithm, an R package named pvclust was also provided (it’s available on your favorite CRAN mirror), and that’s exactly what we’ll use for this exercise.
Some of the readers of this blog might remember my disdain for R: while I need it for Bioconductor, I’m often annoyed by its weird syntax and hard-to-understand error messages. Luckily, thanks to the hard work of Laurent Gautier and contributors, there’s rpy2, a nice R-to-Python bridge. All the examples here require this package, version 2.1 or newer (I’d recommend the release candidate of 2.2; it’s really nice). Unfortunately, this means that Windows users are out of luck, as there’s no version of rpy2 2.1 or 2.2 available for that platform.
Also, don’t forget to have the pvclust package installed in R.
Loading and preparing the data
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
base = importr("base")
pvclust = importr("pvclust")
dataframe = robjects.DataFrame.from_csvfile("GSE4984.txt", sep="\t", row_names=1)
rows = robjects.IntVector(range(1, 501))
subset = dataframe.rx(rows, True)
The rx call is the equivalent of this R code:

subset <- dataframe[rows, ]
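If you want to sanity-check the file on the Python side first, pandas can mirror the same load-and-subset step (a sketch: `load_subset` is a hypothetical helper of mine, and the separator and index settings simply mirror the from_csvfile call above):

```python
import pandas as pd

def load_subset(path, n_rows=500):
    """Read a tab-separated expression table (first column holds the
    gene names) and keep the first n_rows genes, mirroring the
    dataframe[rows, ] subset in R."""
    frame = pd.read_csv(path, sep="\t", index_col=0)
    return frame.iloc[:n_rows]
```

This is handy for spotting non-numeric columns before handing the data to R, since pvclust will choke on them later.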
Once we have the data, it’s time to do some serious clustering on it:
result = pvclust.pvclust(subset, nboot=100, method_dist="correlation", method_hclust="average")
We’re using a small number of permutations (100) because the computation takes a long time. You can change the distance metric and the linkage method with the method_dist and method_hclust arguments (rpy2 translates the underscores into the dots that R expects). Internally the data.frame is converted to a matrix, so ensure you have valid (i.e. numeric) data before proceeding.
Notice that by default this will only cluster the columns. If we want to cluster genes, we have to transpose the data.frame: first convert it to a matrix, then transpose it:
matrix = base.as_matrix(subset)
subset_transposed = matrix.transpose()
result_rows = pvclust.pvclust(subset_transposed, nboot=100, method_dist="correlation", method_hclust="average")
Once the computation is done, we have a pvclust object which holds information on the results. What we’re most interested in is the hclust attribute, as it holds a dendrogram object we can use for plotting (either standalone or via a heat map). We can also manipulate the object with the pvpick function, for example to color the trees of the dendrogram based on their AU values.
To get a quick look, we can simply plot the object as a dendrogram, which will show the AU and BP values for each element of the cluster:
graphics = importr("graphics")
graphics.plot(result)
If we want to save the plot to a PDF file instead, we can use the grDevices package:

graphics = importr("graphics")
grdevices = importr("grDevices")
grdevices.pdf("myresult.pdf", paper="a4")
graphics.plot(result)
grdevices.dev_off()
Finally, we can draw a heat map that uses both clustering results to order rows and columns. heatmap.2 expects a numeric matrix and dendrogram objects, so we convert the hclust results with as.dendrogram first:

stats = importr("stats")
gplots = importr("gplots")
row_dendrogram = stats.as_dendrogram(result_rows.rx2("hclust"))
column_dendrogram = stats.as_dendrogram(result.rx2("hclust"))
gplots.heatmap_2(base.as_matrix(subset), Rowv=row_dendrogram, Colv=column_dendrogram, col=gplots.greenred(255), density_info="none")
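If you prefer a quick look without going through gplots, a rough Python-side equivalent is to reorder the matrix by scipy’s dendrogram leaf order and draw it with matplotlib (a sketch: `clustered_heatmap` is my own helper, and the colormap is only an approximation of gplots’ greenred):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

def clustered_heatmap(data, path):
    """Draw `data` with rows and columns reordered by average-linkage
    clustering on correlation distance, roughly mimicking heatmap.2."""
    row_order = leaves_list(linkage(data, method="average", metric="correlation"))
    col_order = leaves_list(linkage(data.T, method="average", metric="correlation"))
    ordered = data[np.ix_(row_order, col_order)]
    fig, ax = plt.subplots()
    ax.imshow(ordered, aspect="auto", cmap="RdYlGn_r")
    fig.savefig(path)
    plt.close(fig)
```

Note that this uses scipy’s own clustering rather than the pvclust trees, so it is only a visual shortcut, not a substitute for the AU/BP analysis.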
pvclust as-is is quite slow. There is, however, a parallelized version, called parPvclust, which uses snow to parallelize the clustering, either across multiple machines or across multiple cores. Setting snow up properly is beyond the scope of this tutorial, but it may be worth the investment if you cluster a lot of data.