Before diving into this short write-up, here is a back story worth reading (I guess). When I was doing lab rotations back in the fall semester of 2017, I asked my then-PI (principal investigator, a.k.a. research advisor) to suggest a second rotation lab that was doing viral genomics. The response was “no, you should find yourself”.
Miraculously, perhaps unbeknownst to me, I got what I wanted. I enrolled in a class that I did not know would be fairly bioinformatics-heavy. At first all we did was try not to get lost while navigating the UCSC Genome Browser; then the assignments morphed into trying our best to navigate the confusing R console / RStudio IDE (it is still confusing, but far less so than when I first started).
In hindsight, I am thankful. A brutal assignment on analyzing the transcription profiles of 50k genes from 860 human samples (data from GTEx) in R taught me a number of things.
Here is a list of things to consider, as far as my experience is concerned.
List 01: Differential expression analysis (DEA)
- The basic requirement for an RNA-seq differential expression analysis (DEA) is at least 3 biological replicates per group. That means at least 3 for your control group and at least 3 for your treatment group, so 6 biological samples in total. That is the bare minimum; more is always better, so if you can, aim for more than 10 per group.
- The good thing about RNA-seq is that it is kind of a bandwagon, so everyone is hyped about it. Hence, quite a lot of online courses are just a few clicks away. This one from Weill Cornell Medical College is a good one.
- Know your tools before doing the analysis. I first started with DESeq2, then I played around with edgeR. I think I like edgeR, but with DESeq2 you type fewer commands in the console. Then I played with limma when analyzing a microarray dataset.
- Know your source datasets. Is it a raw count table (which DESeq2 and edgeR can take as input)? Is it a SummarizedExperiment (SE) object (for DESeq2)? Is it a DGEList object (for edgeR)? Is it an ExpressionSet (for limma)? If you are working with a GEO dataset (e.g. GDS, GDL, GSE), you probably need to do some cleaning.
- If you are afraid of R and of typing in the R console, consider Galaxy. A free online training course is available here.
- You got ~50k genes from human samples? Consider removing low-count genes. From the biological perspective, a gene is considered expressed if it has more than N counts, with N usually (and arbitrarily) set to 10. From the statistical perspective, a gene with very low counts carries too little information to come out as significant, so keeping it mostly adds to the multiple-testing burden.
- Running `DESeq()` on more than 100 samples can be an excruciating pain because it can take a very long time. A dataset that I worked with had 860 human samples (2 groups). `DESeq()` took around 7 to 8 hours (without dropping low-count genes). With edgeR, I got the data processed within an hour. But are the results different? Well, consider reading this & this.
- Knowledge of PCA, MA, and volcano plots is very useful. They are good plots for diagnostic checks. A good grasp of how to interpret a p-value histogram is also a good thing to have.
- A good understanding of `model.matrix()` and the intercept is useful. I am still struggling with this.
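To make the low-count filter and the intercept question a bit more concrete, here is a minimal sketch in base R only. Everything in it is an assumption for illustration: the toy counts are simulated, the names `counts`, `group`, `min_count` and `min_samples` are made up, and the "at least 10 counts in at least 3 samples" rule is just one common, arbitrary choice. It is not how DESeq2 or edgeR filter internally.

```r
set.seed(1)

# Hypothetical toy data: 1000 genes x 6 samples (3 control, 3 treated).
n_genes <- 1000
group <- factor(rep(c("control", "treated"), each = 3),
                levels = c("control", "treated"))
counts <- matrix(rnbinom(n_genes * 6, mu = 20, size = 1),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:6)))

# Keep genes with at least `min_count` counts in at least `min_samples`
# samples. Here min_samples is the size of the smallest group (3).
# Both thresholds are arbitrary and should be adapted to the dataset.
min_count <- 10
min_samples <- min(table(group))
keep <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

# Design WITH an intercept: the first column is the reference level
# (control), the second column is the treated-minus-control difference.
design_int <- model.matrix(~ group)

# Design WITHOUT an intercept: one column per group mean. Comparing
# treated against control then needs an explicit contrast.
design_noint <- model.matrix(~ 0 + group)

colnames(design_int)    # "(Intercept)"  "grouptreated"
colnames(design_noint)  # "groupcontrol" "grouptreated"
```

The way I read it: with `~ group` the coefficient you test is already "treated vs control", while with `~ 0 + group` you get the two group means and have to build the comparison yourself. edgeR also ships a helper, `filterByExpr()`, that does a more careful version of the filtering step.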
List 02: R, RStudio, and Rmarkdown
- Exporting to PDF with (La)TeX. When I first read up on how to do it, the recommendation was to download MacTeX (on macOS). The problem is that MacTeX is huge, sitting at around 2 GB, and a typical R user might use around 1% of it. The alternatives are MiKTeX (modular, installs stuff when you need it) and TinyTeX, which works like MiKTeX but can be managed from within R because there is a wrapper.
- On macOS, we can manage our R installation with `brew`. On Windows, I tried using `scoop` to manage my R installation, but I bumped into a problem where RStudio was not able to find my R binary. So I installed R by downloading the official installer from the R website. How do I manage my R now? I am currently using the `installr` R package. Run `install.R()` to install the latest version of R. This must be done outside RStudio.
- Deploy a website with Rmarkdown. I think this would be a great way for me to start authoring documents in Rmarkdown, which I think could be useful for my career progression as a data scientist.
- Datasets to play around with. A good package for those who aspire to learn how to make great plots in R.
- For first-timers, when the console is perhaps doing you no good, maybe these could help: GUIs for R. There are Rattle and R Commander, which are available as R packages. There is also R Analytic Flow, which you have to install separately (like installing RStudio). However, R Analytic Flow requires Java, which could be a turn-off. In my opinion, RStudio is better for 2 reasons: the interface is well polished as an IDE, and the community looks strong and stable.
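For the TinyTeX route mentioned in the list above, a rough sketch of what "managing it within R" can look like is below. It assumes the `tinytex` package from CRAN and a working internet connection; the helper name `setup_tinytex()` is made up for this post, and the function is only defined here, not called, because the installation downloads a fair amount of data.

```r
# Hypothetical helper: install the tinytex R package and the TinyTeX
# distribution only if they are not already there.
setup_tinytex <- function() {
  # Install the R wrapper package from CRAN if it is missing.
  if (!requireNamespace("tinytex", quietly = TRUE)) {
    install.packages("tinytex")
  }
  # Install the TinyTeX distribution itself unless the LaTeX
  # installation currently in use is already TinyTeX.
  if (!tinytex::is_tinytex()) {
    tinytex::install_tinytex()
  }
  # Report (invisibly) whether TinyTeX is the distribution in use;
  # a restart of R may be needed before this turns TRUE.
  invisible(tinytex::is_tinytex())
}
```

Calling `setup_tinytex()` once should be enough on a fresh machine. After that, knitting an Rmarkdown file to PDF should pull in missing LaTeX packages on demand, which is the main reason I would pick it over a full 2 GB install.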
This list is updated from time to time.