
Sunday, November 1, 2020 –
I’ve been marinating in a lot of data recently (hence the lack of posts!). I have been trying my best to build some theoretical background for the analysis we do in microbiome research, so that I can adapt and implement new methods rather than just doing what’s been published without understanding why. I used to be very scared of quantitative biology because of big terms like “eigenvector.” Other people would explain their projects to me, and they would throw in unfamiliar terminology, and I would just walk away fixated on “eigenvector” bouncing around in my small brain, with no idea what their project was about whatsoever. But! I have recently learned that quantitative biology is not that scary, and what we should really be scared of is nasopharyngeal swabs. I am here to tell you that it’s not THAT scary and how very awesome qBio is!

Microbiome count data, like many other types of sequencing data, is challenging because there is so much of it (think about the counts for each of 400 taxa across each of 50 samples – and this would be considered a “small” dataset). Let’s say Taxon 1 is present in only 50% of samples and has an average relative abundance of 0.001%, whereas Taxon 2 is present in all samples, has an average relative abundance of 1%, and is at greater relative abundance in 40% of samples compared to the other 60%, and so on… all the way to Taxon 400. How can we possibly draw conclusions when there’s so much data and information to preserve!?!? Cue brain explosion… and enter data reduction, bam!

A sub-figure from one of my boss’s papers (linked) in 2016, demonstrating that microbiome composition differs by IBD status (no IBD, ulcerative colitis, and Crohn’s disease). (Jacobs JP et al., 2016)


One popular “data reduction” method in the field is called principal component analysis (PCA), in which we represent each sample as a dot, and we view dots that are clustered together as samples that are similar in terms of microbiome composition, and dots that are far apart as samples that are distinct in terms of microbiome composition. PCA is one type of ordination method, a set of operations that “represent sample and species relationships as faithfully as possible in a low-dimensional space.” [1] Now, for microbial 16S rRNA data, we take it a step further and calculate a distance matrix prior to performing the ordination, and that’s called principal coordinates analysis (PCoA).

So, to organize the most common ways we like to “reduce” microbiome data (a code sketch follows the list):

  • Ordination method: PCoA
    • PCoA of Euclidean distances (this is also called PCA)
    • PCoA of Jaccard distances
    • PCoA of Bray-Curtis dissimilarities
    • PCoA of UniFrac distances
    • PCoA of Jensen-Shannon distances
    • PCoA of Aitchison distances
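
If you like seeing ideas as code, here is a minimal sketch of what computing several of these distances and running PCoA could look like on a samples-by-taxa count table. The count table below is simulated, and I’m assuming you have scipy and scikit-bio installed; UniFrac is left out because it additionally requires a phylogenetic tree.

```python
# A minimal sketch: compute several distance metrics on a (simulated)
# samples-by-taxa count table, then run PCoA on each distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from skbio import DistanceMatrix
from skbio.stats.ordination import pcoa

rng = np.random.default_rng(42)
counts = rng.poisson(5, size=(50, 400)).astype(float)  # 50 samples x 400 taxa (toy data)
rel = counts / counts.sum(axis=1, keepdims=True)       # relative abundances

inputs = {
    "euclidean": rel,       # PCoA of Euclidean distances == PCA
    "braycurtis": rel,
    "jensenshannon": rel,
    "jaccard": counts > 0,  # Jaccard works on presence/absence
}
for metric, data in inputs.items():
    dm = DistanceMatrix(squareform(pdist(data, metric=metric)))
    ordination = pcoa(dm)  # non-Euclidean metrics may warn about negative eigenvalues
    print(metric, ordination.proportion_explained.iloc[:2].round(3).tolist())

# Aitchison distance: Euclidean distance after a centered log-ratio (clr)
# transform, with a pseudocount of 1 so zeros don't blow up the log.
clr = np.log(counts + 1)
clr -= clr.mean(axis=1, keepdims=True)
aitchison = DistanceMatrix(squareform(pdist(clr, metric="euclidean")))
```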

Starting with Euclidean distances, since that’s probably what we’re most familiar with from algebra, here’s how you would calculate the distance between two points in a two-dimensional Euclidean space:

d = root((x2 − x1)^2 + (y2 − y1)^2)
Image source: Wikipedia. The distance between two points in a 2D space.
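
To make the formula concrete with two made-up points: the distance between (1, 2) and (4, 6) is root((4 − 1)^2 + (6 − 2)^2) = root(9 + 16) = 5.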

Looks familiar, doesn’t it? Now let’s talk about how it applies to PCA. Here are all the steps of PCA, drawn out on my whiteboard late one night. We start with a simple dataset: four samples and two taxa. By plotting T1 on the x-axis and T2 on the y-axis, since our dataset is so minimal, we can already see that S3 is closer to S1, and S4 is closer to S2. But for the sake of learning, we forge onwards with PCA. After calculating the center of the dataset, which is (average x-coordinate, average y-coordinate), we center the data on the origin by subtracting the average x from every x-coordinate and the average y from every y-coordinate. We then find, or have the computer find, the best-fitting line through these points by minimizing the residuals to the line (which is equivalent to maximizing the variance of the projected points, shown as “c^2” on my whiteboard). Alex Williams’ blog post on PCA has a really great graphic showing why these two concepts are equivalent; I’d say my post is a good primer for reading his. We then call this best-fitting line PC1, which has its own equation. For simplicity, on my whiteboard, PC1 is y = x. The slope is 1, which means that if I draw a right triangle with legs of 1 unit along T1 and 1 unit along T2, I get a hypotenuse of root(2). So far, so good, right? I now calculate the unit vector of PC1 by dividing each leg by the hypotenuse: 1/root(2) for T1 and 1/root(2) for T2. My unit vector (the eigenvector) is now represented by two components: (1/root(2), 1/root(2)). To get PC2, I find the line orthogonal to PC1 and calculate its eigenvector. Finally, for visualization, I rotate PC1 and PC2 so that they are horizontal and vertical.
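
If it helps to see those whiteboard steps as code, here’s a toy numpy version. The four sample points are made up (the post doesn’t include the whiteboard’s actual numbers), but they’re chosen so that PC1 comes out as the y = x line, i.e., the eigenvector (1/root(2), 1/root(2)).

```python
# Toy re-creation of the whiteboard walkthrough: 4 samples x 2 taxa.
import numpy as np

X = np.array([[1.0, 1.5],   # S1 (T1, T2)
              [4.0, 4.5],   # S2
              [1.5, 1.0],   # S3
              [4.5, 4.0]])  # S4

Xc = X - X.mean(axis=0)  # center the data on the origin

# The best-fitting line through the origin is the leading eigenvector of the
# covariance matrix (equivalent to minimizing residuals to the line).
cov = (Xc.T @ Xc) / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]  # unit eigenvector, here ~(1/root(2), 1/root(2)) up to sign
pc2 = eigvecs[:, -2]  # orthogonal to PC1

# "Rotating" the data is just expressing samples in PC1/PC2 coordinates.
scores = Xc @ eigvecs[:, ::-1]
print(pc1, scores)
```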

Now, what happens when I’ve got a larger dataset, say 4 samples and 6 taxa? Same thing: find the best-fitting line, PC1, by maximizing the sum of (distance from each sample’s projected point on the line to the origin)^2, then calculate the eigenvector (you’ll get 6 components for 6 taxa). Conceptually, calculating the eigenvector is important because it scales all the components so that they are comparable to one another. The largest component in the eigenvector corresponds to the taxon that most heavily influences PC1 (and therefore drives the distinction between samples). Calculate PC2, PC3, PC4, PC5, and PC6: PC2 is orthogonal to PC1, PC3 is perpendicular to the plane in which PC1 and PC2 lie and passes through the origin, and so on and so forth. (With only 4 samples, at most 3 of these can carry nonzero variance, but the procedure is the same.) Lastly, calculate the variation in the data accounted for by each principal component; for PC1, this is Var_PC1 = C^2/(sample size − 1). Construct a scree plot: the percent of explained variance vs. each principal component. If you are going to reduce data, you want to ensure that the majority of the information is captured in your first two components. Congratulations! Now you’re done.
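
And here’s a sketch of that scree-plot step, using scikit-learn’s PCA and matplotlib (both assumptions on my part; any PCA implementation that reports explained variance ratios works the same way):

```python
# Sketch of the scree-plot step on a toy 4-samples x 6-taxa table.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.poisson(4, size=(4, 6)).astype(float)  # 4 samples x 6 taxa (toy data)

pca = PCA()                    # scikit-learn centers the data for you
scores = pca.fit_transform(X)  # samples expressed in PC coordinates
# Var_PCk / total variance; with 4 samples, at most 3 ratios are nonzero.
ratios = pca.explained_variance_ratio_

plt.bar(range(1, len(ratios) + 1), 100 * ratios)
plt.xlabel("Principal component")
plt.ylabel("Percent of explained variance")
plt.title("Scree plot")
plt.show()
```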

Picture explaining why maximizing the squared distance of each sample’s projected point to the origin is equivalent to minimizing residuals when locating the best-fitting line. From Alex Williams’ blog.
Image source: Julianne’s whiteboard. I invested in my education and bought it from Amazon. Follow the arrows!
Scree plot: the percentage of variance (information) explained by each PC. You want the majority of variation to be captured in your first two principal components.

One main disadvantage of this sort of PCA (PCoA of Euclidean distances) is that PC1 is heavily affected by the presence/absence of taxa (such as when 5 of 50 samples have a bacterial species present but the other 45 do not). This can be an issue if that particular species is present or absent due to parameters not of interest to your study, such as interindividual variation, yet still heavily influences your PC1. Now imagine if you had a lot of such presence/absence taxa, which is often the case with microbiome count data.

This is why we often do principal coordinates analysis on other distance metrics. A subsequent post on those other distance metrics to follow, hopefully before the end of 2020! The other thing to keep in mind is that PCoA is just one of several ordination methods, and ordination methods are just one way to reduce high-dimensional data. I know, right?! I could spend eons learning data science.


To marinating in data together,
the microbepipettor

Resources/References/Extra reading:
[1] http://ordination.okstate.edu/overview.htm
[2] https://towardsdatascience.com/visualizing-high-dimensional-microbiome-data-eacf02526c3a
[3] https://microbiome.github.io/tutorials/
[4] https://builtin.com/data-science/step-step-explanation-principal-component-analysis
[5] http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/
