Doing the clustering yourself

Hierarchical clustering with bootstrapping

The pvclust package performs hierarchical clustering (which produces a dendrogram) with bootstrapping. Bootstrapping means R repeats the clustering many times (e.g. 100 times, as set by the nboot argument), each time on a resampled version of the data. This generates bootstrap values that show how consistent each branch is (= in how many of these 100 runs did these samples end up in the same cluster?).

The function to do this is pvclust(). Its main arguments are:

  • method.hclust: the linkage method (“average”, “complete”, “centroid”…)
  • method.dist: the distance measure used to calculate how (dis)similar two samples are (“euclidean”, “manhattan”, “correlation”, …)
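
The two arguments above can be tried out on a small simulated data set. This is only a sketch: the matrix, the sample names A1 to B3 and the choice of nboot = 100 are assumptions made for illustration. Note that pvclust() clusters the columns of its input, so the samples have to be in the columns (transpose with t() if they are in the rows).

```r
library(pvclust)

# Simulated matrix (illustrative only): 50 features in rows, 6 samples in columns
set.seed(1)
data <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(NULL, c("A1", "A2", "A3", "B1", "B2", "B3")))
# Add the same feature-wise signal to the three B samples so they resemble each other
data[, 4:6] <- data[, 4:6] + rnorm(50, sd = 2)

# Hierarchical clustering of the columns with 100 bootstrap replicates
fit <- pvclust(data,
               method.hclust = "average",
               method.dist = "correlation",
               nboot = 100)

plot(fit)                     # dendrogram annotated with the bootstrap values
pvrect(fit, alpha = 0.95)     # box the clusters with an AU value of at least 0.95
fit$edges[, c("au", "bp")]    # AU and BP values, one row per branch
```

The bp column holds the plain bootstrap probability of each branch; au is pvclust's "approximately unbiased" version of it.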

k-means clustering

Hierarchical clustering is not the only clustering method; another one is k-means clustering. The function kmeans() performs k-means clustering. Its main arguments are:

  • centers: the number of clusters you want
  • nstart: the number of random starting configurations to try; the best result is kept
c <- kmeans(data,centers=4,nstart=5)
c$cluster

The result is a list. Its cluster element stores the clustering result: for each sample, the number of the cluster it belongs to.
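
A slightly fuller sketch of the same call, using the built-in iris measurements as stand-in data (an assumption, any numeric matrix with the samples in rows works). The result is stored in km rather than c, to avoid masking R's c() function, and a seed is set because the starting configurations are random.

```r
# Stand-in data: the four numeric iris columns, scaled so each variable counts equally
set.seed(42)
data <- scale(iris[, 1:4])

# 3 clusters, 25 random starts; kmeans() keeps the start with the lowest
# total within-cluster sum of squares
km <- kmeans(data, centers = 3, nstart = 25)

table(km$cluster)   # number of samples per cluster
km$centers          # coordinates of the cluster centers
km$tot.withinss     # total within-cluster sum of squares of the kept solution
```

The cluster numbers themselves are arbitrary labels: rerunning with another seed can give the same grouping with the numbers swapped.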

The factoextra package contains a function fviz_cluster() that can read and visualize the results generated by kmeans().
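
A minimal sketch of such a plot, again with the scaled iris measurements as assumed example data. For kmeans() results, fviz_cluster() also needs the original data through its data argument; when the data has more than two variables, the samples are drawn on the first two principal components.

```r
library(factoextra)

# Assumed example data and clustering (illustrative only)
set.seed(42)
data <- scale(iris[, 1:4])
km <- kmeans(data, centers = 3, nstart = 25)

# fviz_cluster() returns a ggplot object; printing it draws the plot
p <- fviz_cluster(km, data = data)
print(p)
```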

Select best cluster shape and characteristics

The Mclust() function from the mclust package fits and compares models with different numbers of clusters and different cluster shapes (spherical, diagonal, ellipsoidal). You can use summary() on the output of Mclust() to see which cluster number/shape is best according to the BIC (Bayesian information criterion). Note that mclust defines the BIC so that higher is better: the best clustering is the one with the highest BIC score.

library(mclust)
c <- Mclust(data)
c$classification

Mclust() returns a list; its classification element stores the cluster assignment of each sample.

summary(c)

This displays the best cluster number/shape (the one with the highest BIC score, since mclust defines the BIC so that higher is better).

?mclustModelNames

The documentation shows the different cluster shapes that are assessed.
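
To look at the model comparison in more detail, the pieces of the Mclust() result can be inspected one by one. The sketch below uses R's built-in faithful data set as assumed example data; the name fit is chosen only for illustration.

```r
library(mclust)

# Assumed example data: eruption length and waiting time of the Old Faithful geyser
data <- faithful
fit <- Mclust(data)

fit$G                       # number of clusters in the selected model
fit$modelName               # code of the selected shape (see ?mclustModelNames)
fit$bic                     # BIC of the selected model (mclust: higher is better)
table(fit$classification)   # number of samples per cluster

# BIC of every number of clusters / shape combination that was tried
plot(fit, what = "BIC")
```

In the BIC plot each line is one cluster shape and the x-axis is the number of clusters; the selected model is the highest point.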