first foray into bioinformatics with R

Published on 16 Mar 2018

Before we go straight into the content of this short write-up, here is a back story worth reading (I guess). When I was rotating back in the fall semester of 2017, I asked my then-PI (principal investigator, a.k.a. research advisor) to suggest a second rotation lab for me that was doing viral genomics. The response was “no, you should find yourself”.

Miraculously, perhaps unbeknownst to me, I got what I wanted. I enrolled in a class that I did not know would be a bit bioinformatics-heavy. At first all we did was try not to get lost while navigating the UCSC Genome Browser, then the assignments morphed into trying our best to navigate the confusing R console / RStudio IDE (it is still confusing, but far less so than when I first started).

In hindsight, I am thankful. A brutal assignment on analyzing the transcription profiles of 50k genes from 860 human samples (data from GTEx) in R taught me a number of things.

Here is a list of things to consider, based on my experience.

List 01: Differential expression analysis (DEA)

  • The basic requirement for an RNA-seq differential expression analysis (DEA) is at least 3 biological replicates per group: at least 3 for your control group and at least 3 for your treatment group, so 6 biological samples in total. More is better, so if you can, aim for more than 10 per group.
  • The good thing about RNA-seq is that it is kind of a bandwagon, so everyone is hyped about it. Hence, there are quite a lot of online courses just a few clicks away. This one from Weill Cornell Medical College is a good one.
  • Know your tools before doing the analysis. I first started with DESeq2, then I played around with edgeR. I think I like edgeR, but with DESeq2 you type fewer commands in the console. Then I played with limma when analyzing a microarray dataset.
  • Know your source datasets. Is it a raw count table (which DESeq2 and edgeR can take as input)? Is it a SummarizedExperiment (SE) object (for DESeq2)? Is it a DGEList object (for edgeR)? Is it an ExpressionSet (for limma)? If you are working with GEO records (e.g. GDS, GPL, GSE), you probably need to do some cleaning.
  • If you are afraid of R and typing in the R console, consider Galaxy. A free online training course is available here.
  • You got ~50k genes from human samples? Consider removing low-count genes. From the biological perspective, a gene is considered expressed if it has at least some number N of counts, where N is chosen somewhat arbitrarily and is usually around 10. From the statistical perspective, a gene with very low counts carries too little information to come out as significant.
  • Running the DESeq() function on more than 100 samples can be an excruciating pain because it can take a very long time. A dataset that I worked with had 860 human samples (2 groups). DESeq() took around 7 to 8 hours (without dropping low-count genes). With edgeR, I got the data processed within an hour. But are the results different? Well, consider reading this & this.
  • Knowledge of PCA, MA, and volcano plots is very useful. They are good plots for diagnostic checks. A good grasp of how to interpret a p-value histogram is also good to have.
  • A good understanding of the model.matrix() function and the intercept is useful. I am still struggling with this.
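
To make the low-count filtering and model.matrix() points a bit more concrete, here is a minimal sketch in base R. Everything in it is an assumption for illustration: the counts are simulated (not GTEx), the gene and sample names are made up, and the "at least 10 counts in at least as many samples as the smallest group" rule is just one common rule of thumb. edgeR's filterByExpr() does a more careful version of the same idea.

```r
# Simulated data only: 1000 made-up genes x 6 samples (3 control, 3 treated).
set.seed(1)
n_genes <- 1000
group <- factor(rep(c("control", "treated"), each = 3),
                levels = c("control", "treated"))

# Negative-binomial counts; each gene has its own mean, and many means are low.
gene_means <- rexp(n_genes, rate = 1 / 20)
counts <- matrix(
  rnbinom(n_genes * length(group),
          mu = rep(gene_means, times = length(group)),
          size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_along(group)))
)

# Rule of thumb (an assumption, not a standard): keep a gene if it has
# at least 10 counts in at least as many samples as the smallest group.
min_count <- 10
min_samples <- min(table(group))
keep <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

# Design matrix WITH an intercept: the first column is all 1s and stands
# for the reference level (control); "grouptreated" is 1 only for treated
# samples, so its coefficient is the treated-vs-control difference.
design <- model.matrix(~ group)

# Design matrix WITHOUT an intercept: one indicator column per group,
# so each coefficient is that group's own mean.
design_no_int <- model.matrix(~ 0 + group)

print(dim(counts))
print(dim(filtered))
print(colnames(design))
print(colnames(design_no_int))
```

The way I read it: with `~ group`, the intercept is the baseline (whatever the first factor level is), and every other column is a difference from that baseline. That is why setting the factor levels explicitly matters; otherwise R picks the reference level alphabetically.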

List 02: R, RStudio, and R Markdown

  • Exporting to PDF with (La)TeX. When I first read about how to do it, the recommendation was to download MacTeX (on macOS). The problem is that MacTeX is huge, sitting at around 2GB, and a typical R user might use around 1% of it. The alternatives are MiKTeX (modular, installs stuff when you need it) and TinyTeX, which is similar to MiKTeX but can be managed from within R because there is a wrapper package.
  • On macOS, we can manage our R installation with brew. On Windows, I tried using scoop to manage my R installation, but I bumped into a problem where RStudio was not able to find my R binary. So I installed R by downloading the official installer from the R website. How do I manage my R now? I am currently using the installr R package. Run install.R() to install the latest version of R. This must be done outside RStudio.
  • Deploy a website with R Markdown. I think this would be a great way for me to start authoring documents in R Markdown, which could be useful for my career progression as a data scientist.
  • Datasets to play around with. A good package for those who want to start learning how to make great plots in R.
  • For first-timers, when the console is perhaps doing you no good, maybe a GUI for R could help. There are Rattle and R Commander, which are available as R packages. There is also R Analytic Flow, which you have to install separately (like installing RStudio). However, R Analytic Flow requires Java, which could be a turn-off. In my opinion, RStudio is better for 2 reasons: the interface is well polished as an IDE, and the community looks strong and stable.
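
For the TinyTeX route from the first bullet of this list, the setup from within R can be sketched roughly like this. The helper name `setup_tinytex()` is my own invention (it is not part of any package), and `report.Rmd` is only a placeholder file name. The calls it wraps, `tinytex::is_tinytex()`, `tinytex::install_tinytex()`, and `rmarkdown::render()`, come from the tinytex and rmarkdown packages.

```r
# Hypothetical helper (not part of any package): install TinyTeX only
# when the LaTeX distribution currently in use is not already TinyTeX.
setup_tinytex <- function() {
  # Install the tinytex R wrapper package if it is missing.
  if (!requireNamespace("tinytex", quietly = TRUE)) {
    install.packages("tinytex")
  }
  # Download and install the TinyTeX distribution itself if needed.
  if (!tinytex::is_tinytex()) {
    tinytex::install_tinytex()
  }
  invisible(tinytex::is_tinytex())
}

# Nothing is downloaded until you call the helper yourself:
# setup_tinytex()
# rmarkdown::render("report.Rmd", output_format = "pdf_document")
```

I kept the actual calls commented out because the install step downloads a fair amount of data, so it is better to run it deliberately than by accident.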

This list is updated from time to time.