first foray into bioinformatics with R

Before we get into the content of this short write-up, here is a back story worth reading (I guess). When I was rotating back in the fall semester of 2017, I asked my then-PI (principal investigator, a.k.a. research advisor) to suggest a second rotation lab that was doing viral genomics. The response was “no, you should find yourself”.

Miraculously, perhaps unbeknownst to me, I got what I wanted. I enrolled in a class that I did not know would be a bit bioinformatics-heavy. At first all we did was try not to get lost while navigating the UCSC Genome Browser, then the assignments morphed into trying our best to navigate the confusing R console / RStudio IDE (it is still confusing, but far less than when I first started).

In hindsight, I am thankful. A brutal assignment on analyzing the transcription profile of 50k genes from 860 human samples (data from GTEx) in R taught me a number of things.

Here is a list of things to consider, as far as my experience is concerned.

List 01: Differential expression analysis (DEA)

The basic requirement for an RNA-seq differential expression analysis (DEA) is at least 3 biological replicates per control/treatment group. That means at least 3 for your control group and at least 3 for your treatment group, so 6 biological samples in total. More is always better, so aim for more than 10 per group if you can.
The good thing about RNA-seq is that it is kind of a bandwagon, so everyone is hyped about it. Hence, there are quite a lot of online courses just a few clicks away. This one from Weill Cornell Medical College is a good one.
Know your tools before doing the analysis. I first started with DESeq2, then I played around with edgeR. I think I like edgeR, but with DESeq2 you type fewer commands in the console. Then I played with limma when analyzing a microarray dataset.
Know your source datasets. Is it a raw count table (which DESeq2 & edgeR can take as input)? Is it a SummarizedExperiment (SE) object (for DESeq2)? Is it a DGEList object (for edgeR)? Is it an ExpressionSet (for limma)? If you are working with GEO records (e.g. GDS, GPL, GSE), you probably need to do some cleaning.
If you are afraid of R and typing in the R console, consider Galaxy. A free online training course is available here.
You got ~50k genes from human samples? Consider removing low-count genes. From the biological perspective, a gene is considered expressed if it has at least some number N of counts, usually (and arbitrarily) more than 10. From the statistical perspective, genes with very low counts rarely carry enough information to come out as significant.
Running the DESeq() function on more than 100 samples can be an excruciating pain because it can take a very long time. A dataset that I worked with had 860 human samples (2 groups). The DESeq() function took around 7 to 8 hours (without dropping low-count genes). With edgeR, I got the data processed within an hour. But are the results different? Well, consider reading this & this.
Knowledge of PCA, MA, and volcano plots is very useful. They are good plots for diagnostic checks. A good grasp of how to interpret a p-value histogram is also a good thing to have.
A good understanding of the model.matrix() function & the intercept is useful. I am still struggling with this.
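To make the low-count filtering and model.matrix() points a bit more concrete, here is a minimal sketch in base R. Everything in it is an assumption for illustration: the counts are simulated, the gene and sample names are made up, and the rule "more than 10 counts in at least as many samples as the smallest group" is just one arbitrary choice, not the rule that DESeq2 or edgeR applies internally.

```r
set.seed(1)

# Simulated raw count matrix: 1000 hypothetical genes x 6 samples
# (3 control, 3 treatment). Real data would come from your own count table.
n_genes <- 1000
group <- factor(rep(c("control", "treatment"), each = 3),
                levels = c("control", "treatment"))
counts <- matrix(rnbinom(n_genes * length(group), mu = 20, size = 0.5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_along(group))))

# Keep genes with more than 10 counts in at least 3 samples
# (3 = size of the smallest group; both thresholds are arbitrary).
min_count <- 10
min_samples <- min(table(group))
keep <- rowSums(counts > min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

# Design matrix with an intercept: the first column is the baseline
# (control), the second is the treatment-vs-control difference.
design <- model.matrix(~ group)

# Design matrix without an intercept: one column per group.
design_no_int <- model.matrix(~ 0 + group)

cat("genes kept:", nrow(filtered), "of", n_genes, "\n")
print(colnames(design))
print(colnames(design_no_int))
```

The two design matrices describe the same experiment. With the intercept, the coefficient for the second column is already the treatment-vs-control comparison; without it, you get one coefficient per group and have to build the comparison yourself as a contrast.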

List 02: R, RStudio, and Rmarkdown

Exporting to PDF with (La)TeX. When I first read about how to do it, the recommendation was to download MacTeX (on macOS). The problem is that MacTeX is huge, sitting at around 2GB, and a typical R user might use around 1% of it. The alternatives are MiKTeX (modular, installs stuff when you need it) and TinyTeX, which is similar to MiKTeX but can be managed from within R because there is a wrapper package.
On macOS, we can manage our R installation with brew. On Windows, I tried using scoop to manage my R installation, but I bumped into a problem where RStudio was not able to find my R binary. So I installed R by downloading the official installer from the R website. How do I manage my R now? I am currently using the installr R package. Run install.R() to install the latest version of R. This must be done outside RStudio.
Deploy a website with R Markdown. I think this would be a great way for me to start authoring documents in R Markdown, which could be useful for my career progression as a data scientist.
Datasets to play around with. A good package for those who want to start learning how to make great plots in R.
For first-timers, when the console is doing you no good, maybe a GUI for R could help. There are Rattle and R Commander, which are available as R packages. There is also R Analytic Flow, which you have to install separately (like installing RStudio). However, R Analytic Flow requires Java, which could be a turn-off. In my opinion, RStudio is better for 2 reasons: the interface is well polished as an IDE, and the community looks strong and stable.
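For the PDF export point in this list, here is a small sketch of how to check whether R can already see a LaTeX engine before installing anything. It only reports what it finds and does not install or change anything. The list of engine names is my own choice; the install command it prints uses install_tinytex() from the tinytex package.

```r
# Is a LaTeX engine already on the PATH that R sees?
engines <- c("pdflatex", "xelatex", "lualatex")
found <- Sys.which(engines)   # "" for engines that are not found
has_latex <- any(nzchar(found))

# Is the tinytex R package (the wrapper) installed?
has_tinytex_pkg <- requireNamespace("tinytex", quietly = TRUE)

if (has_latex) {
  message("LaTeX engine found: ",
          paste(names(found)[nzchar(found)], collapse = ", "))
} else if (has_tinytex_pkg) {
  message("No LaTeX engine found, but the tinytex package is installed.")
  message("One option: tinytex::install_tinytex()")
} else {
  message("No LaTeX engine found. One option is TinyTeX:")
  message('  install.packages("tinytex"); tinytex::install_tinytex()')
}
```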

This list is updated from time to time.