Wednesday, May 12, 2021

An extremely brief introduction to coloring data points in scatter plots by cluster in R.



Title: An extremely brief introduction to coloring data points in scatter plots by cluster in R.
Topics: PCA, K-means clustering, hierarchical clustering, scatter plots, heat maps.
Created: 2021-05-12
By: John Urban
For: Anyone interested (Spradling Lab, Carnegie Emb, or otherwise)






You're here because you want to learn how to color data points in scatter plots by the cluster to which they belong. Heck, you want to learn how to cluster columns or rows in a data frame. Maybe you even want to learn to plot data points by their first and second principal components. You might accidentally learn any of these things in this tutorial.

Maybe you're here because you have been making PCA plots, and circling or coloring the clusters yourself by eye for PowerPoint presentations. The world is a crazy place. It happens.

Perhaps you want to make the world just a little less crazy by coloring the data points using the output of k-means or hierarchical clustering.

I'm here to give you the briefest quick start possible. The rest is up to you.


Let's pretend we are in R / RStudio.

Let's pretend we did RNA-seq on 3 sample types (3 technical replicates each), and only care about 4 genes.



#1. Make a toy data frame that has 3 replicates each of 3 different sample types (A, B, and C) and 4 genes. Here we are just simulating technical replicates using the rnorm function. (If you want your numbers to be reproducible, call set.seed() with any number first.)

> gdf <- data.frame(sample = c("A1","A2","A3","B1","B2","B3","C1","C2","C3"),
+                   g1 = c(rnorm(3,10,1), rnorm(3,20,2), rnorm(3,30,2)),
+                   g2 = c(rnorm(3,30,3), rnorm(3,25,3), rnorm(3,20,3)),
+                   g3 = c(rnorm(3,15,1), rnorm(3,45,5), rnorm(3,60,5)),
+                   g4 = c(rnorm(3, 5,1), rnorm(3,10,2), rnorm(3,15,1)))





#2. Go ahead. Take a look at it. You know you want to. Heck, I want to watch you do it.
> gdf





#3. Also look at it this way!
> t(gdf[,2:5])





#4. Now run PCA on it, and take a look at the output.
> p <- prcomp(t(gdf[,2:5]))
> p
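Before plotting, it is worth checking how much of the variance PC1 and PC2 actually capture; summary() on a prcomp object reports this. With the object above you would simply run summary(p); below is a stand-alone sketch that rebuilds a toy matrix (arbitrary seed, my own choice) so the snippet runs on its own.

```r
# Genes as rows, samples as columns -- the same shape as t(gdf[,2:5]).
set.seed(5)  # arbitrary seed, just for reproducibility
m <- rbind(g1 = c(rnorm(3,10,1), rnorm(3,20,2), rnorm(3,30,2)),
           g2 = c(rnorm(3,30,3), rnorm(3,25,3), rnorm(3,20,3)),
           g3 = c(rnorm(3,15,1), rnorm(3,45,5), rnorm(3,60,5)),
           g4 = c(rnorm(3, 5,1), rnorm(3,10,2), rnorm(3,15,1)))
p2 <- prcomp(m)
summary(p2)  # the "Proportion of Variance" row shows each PC's share
```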





#5. Go ahead and plot that.
> plot(p$rotation[,1], p$rotation[,2], pch=19, cex=2)


#6. If you don't yet know how to color by clusters found with algorithms, use your biological neural network (BNN) to cluster by eye: 
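If you want to capture what your BNN sees, you can hand-build a "cluster vector" and pass it to col=. Below is a stand-alone sketch (arbitrary seed and simulated values, my own choices); with the objects above, you would color p$rotation the same way.

```r
set.seed(1)  # arbitrary seed for reproducibility
# 4 "genes" (rows) x 9 "samples" (columns), in 3 groups of 3
m <- sapply(rep(c(10, 20, 30), each = 3), function(mu) rnorm(4, mu, 1))
p2 <- prcomp(m)  # same trick as step 4: genes as rows
# The hand-made cluster vector: one entry per sample, assigned by eye;
# here we assume the three groups separate cleanly, in order.
eye <- rep(1:3, each = 3)
plot(p2$rotation[,1], p2$rotation[,2], pch = 19, cex = 2,
     xlab = "PC1", ylab = "PC2", main = "Colored by eye.", col = eye)
```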





#7. Your BNN may have been extremely powerful, but let's try some clustering algorithms. The first is k-means clustering. Remember to take a peek at the result.

## Note: k-means is not deterministic: the end result depends on a few things, such as the initial starting points, the number of iterations, and the number of times you restart the algorithm to see if you can do better. Nonetheless, it usually does quite well. Just provide it the number of clusters, k, that you are expecting. If you don't know k, how to select it is a whole other topic (elbow plots, silhouette scores, and so on). Here we know we have 3 sample types, so we expect 3 clusters.


> k <- kmeans(x = gdf[,2:5], centers = 3)
> k
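One practical note on that non-determinism, sketched below with stand-alone toy values (arbitrary seed, my own choice): set.seed() makes a k-means run reproducible, and the nstart argument tells kmeans() to restart itself that many times and keep the best partition, i.e. the one with the lowest total within-cluster sum of squares.

```r
set.seed(1)  # arbitrary seed; makes the random starts reproducible
vals <- c(rnorm(3, 10, 1), rnorm(3, 20, 1), rnorm(3, 30, 1))
# nstart = 25: run 25 random restarts, keep the best result
k2 <- kmeans(matrix(vals, ncol = 1), centers = 3, nstart = 25)
k2$cluster       # the clustering vector, one label per data point
k2$tot.withinss  # the objective the 25 restarts competed on
```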





#8. Now use the "clustering vector" to color your data points in the PCA plot. One way to do it is this:

> plot(p$rotation[,1], p$rotation[,2], pch=19, cex=2, xlab="PC1", ylab="PC2", main="Colored by K-means clustering.", col=k$cluster)


#9. Another clustering approach is "hierarchical". There are 3 steps here instead of just 1: creating a distance matrix, clustering based on the distance matrix, and cutting the "tree" to give you k=3 clusters.

## Note: hierarchical clustering is deterministic.

## Step 1: creating a distance matrix
> d <- dist(gdf[,2:5])
> d




## Step 2: clustering based on the distance matrix
> h <- hclust(d)
> h




## Step 3: cutting the tree to give you k=3 clusters
> hc <- cutree(h, k=3)
> hc
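A quick sanity check once you have both cluster vectors: cross-tabulate them. The labels themselves are arbitrary (k-means cluster 1 need not be hclust cluster 1), so agreement shows up as exactly one non-zero cell per row and column. With the objects above that is table(k$cluster, hc); here is a stand-alone sketch (arbitrary seed, my own choice):

```r
set.seed(2)  # arbitrary seed for reproducibility
vals <- c(rnorm(3, 10, 1), rnorm(3, 20, 1), rnorm(3, 30, 1))
km  <- kmeans(matrix(vals, ncol = 1), centers = 3, nstart = 25)$cluster
hc2 <- cutree(hclust(dist(vals)), k = 3)
# Rows: k-means labels; columns: hierarchical labels.
table(kmeans = km, hclust = hc2)
```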



#10. Now use the hc "cluster vector" to color your data points in the PCA plot. One way to do it is this:

> plot(p$rotation[,1], p$rotation[,2], pch=19, cex=2, xlab="PC1", ylab="PC2", main="Colored by hierarchical clustering.", col=hc)
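Since hierarchical clustering builds a tree, you can also just look at the tree: plot() draws the dendrogram, and rect.hclust() boxes the k=3 clusters. With the objects above that is plot(h) followed by rect.hclust(h, k=3). A stand-alone sketch (arbitrary seed, my own choice):

```r
set.seed(3)  # arbitrary seed for reproducibility
vals <- c(rnorm(3, 10, 1), rnorm(3, 20, 1), rnorm(3, 30, 1))
h2 <- hclust(dist(vals))
plot(h2, main = "Dendrogram, cut into k = 3 clusters")
# rect.hclust() draws the boxes and (invisibly) returns the
# members of each cluster as a list.
groups <- rect.hclust(h2, k = 3, border = c("red", "green", "blue"))
length(groups)  # 3, one list element per cluster
```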

#11. Note: it is also possible to do hierarchical clustering and visualization of the dataframe as a heatmap in a single step. One way to do that is shown below.


## First just plot the dataframe as a heatmap with no clustering
> heatmap(t(gdf[,2:5]), Rowv = NA, Colv = NA, labCol = gdf$sample)



## To make it easier on the eye to track the technical replicates from each sample, here I am also showing how to use the "known clusters" (the technical replicates of a given sample type) to make "color bars" at the top.
> heatmap(t(gdf[,2:5]), Rowv = NA, Colv = NA, labCol = gdf$sample, ColSideColors = c("red","green","blue")[c(1,1,1,2,2,2,3,3,3)])

## Next, use clustering on the columns to show how the samples relate to each other.
> heatmap(t(gdf[,2:5]), Rowv = NA, labCol = gdf$sample, ColSideColors = c("red","green","blue")[c(1,1,1,2,2,2,3,3,3)])
 

## Finally, you can optionally cluster on the rows (genes) too.
> heatmap(t(gdf[,2:5]), labCol = gdf$sample, ColSideColors = c("red","green","blue")[c(1,1,1,2,2,2,3,3,3)])
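One more knob worth knowing about: by default heatmap() centers and scales each row (scale = "row"), which is usually what you want when the rows are genes measured on very different scales; scale = "none" shows the raw values instead. Below is a stand-alone sketch (arbitrary seed, my own choice); with the objects above, you would just add scale = "none" to the calls in #11.

```r
set.seed(4)  # arbitrary seed for reproducibility
# 4 "genes" (rows) x 9 "samples" (columns), mirroring t(gdf[,2:5])
m <- rbind(g1 = c(rnorm(3,10,1), rnorm(3,20,2), rnorm(3,30,2)),
           g2 = c(rnorm(3,30,3), rnorm(3,25,3), rnorm(3,20,3)),
           g3 = c(rnorm(3,15,1), rnorm(3,45,5), rnorm(3,60,5)),
           g4 = c(rnorm(3, 5,1), rnorm(3,10,2), rnorm(3,15,1)))
colnames(m) <- paste0(rep(c("A", "B", "C"), each = 3), 1:3)
# scale = "none" plots raw values; the default scale = "row"
# z-scores each gene across the samples first.
heatmap(m, scale = "none",
        ColSideColors = c("red", "green", "blue")[rep(1:3, each = 3)])
```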




=================================================================


That concludes this very brief introduction.

This is only a quick start.

Have fun googling further for more advanced guides!



























