A short review of RNA sequencing and its applications

What are the omics sciences?

Omics sciences are targeting quantification of whole biomolecules such as RNA and proteins at organism, tissue, or a single-cell level. Omics sciences are separated into several branches such as genomics, transcriptomics, and proteomics¹.

What is transcriptomics?

Transcriptomics is one of the omics sciences dissecting the organism’s transcriptome which is the sum of all of its RNA molecules^2,3.

What is RNA sequencing?

RNA sequencing (RNA-seq) is a technique providing quantification of all RNAs in bulk tissues or each cell. The transcript amounts of each gene across samples are calculated by using this technique. It is utilizing next-generation sequencing (NGS) platforms deciphering the sequencing of biomolecules such as DNA and RNA^4,5.

What are the kinds of RNA-seq?

Bulk tissue RNA-seq

The whole transcriptome of target bulk tissues is sequenced to make transcriptomics analyses. Here, target bulk tissue can contain various cell types, and therefore, the whole transcriptome is mixed with RNAs of those cells. This approach is the most common usage of RNA-seq and is performed for some aims such as elucidating of diseases⁷.

Single-cell RNA-seq

In contrast to bulk tissue RNA-seq, single-cell RNA-seq (scRNA-seq) is performed in individual cells. The whole transcriptome of each cell in a tissue is sequenced to make transcriptomics analysis. The scRNA-seq has revealed that the transcriptome of each cell in a tissue is different from each other and individual cells can be separated into specific clusters according to its transcriptomic signature. The scRNA-seq has helped the discovery of some cells such as ionocyte cells, which could be relevant to the pathology of cystic fibrosis^7,8.

Spatial RNA-seq

The relationship between cells and their relative locations within a tissue sample can be critical to understanding disease pathology. Spatial transcriptomics is a technology that allows the measurement of all the gene activity in a tissue sample and map where the activity is occurring. This technique is utilized in the understanding of biological processes and disease. Spatial RNA-seq can be performed at intact tissue sections as well as a single-cell level. The general aim of this technique is a combination of gene expression and morphological information and providing information on tissue architecture and micro-environment for the generation of sub-cellular data. Current bulk and scRNA-seq methods provide users with highly detailed data regarding tissues or cell populations but do not capture spatial information^7,9,10.

RNA-seq analysis work-flow

1) Experimental design

There are many various library types in RNA-seq resulted in sequencing reads (sequenced transcripts) with different characteristics. For instance, reads can be single-end in which a transcript is read from its only an end (5’ or 3’), however, in the paired-end libraries, a transcript is read from both its 5’ and 3’ end. Paired-end sequencing can additionally help disambiguate read mappings and is preferred for alternative-exon quantification, fusion transcript detection, particularly when working with poorly annotated transcriptomes⁷. In addition to that, libraries can be stranded or unstranded. The strandedness for libraries is important to determine which DNA strand reads coming from and it is utilized to assign reads to relevant genes. If strandedness information of libraries is misused, then reads are not assigned to true genes, thus gene expression results gonna be wrong¹¹. Besides, technical replicates can be utilized in this process in which one sample is sequenced more than one by using the same high-throughput platform to increase the elimination of technical bias.

2) Laboratory performance

After RNA extraction from all samples, libraries are prepared for sequencing according to the selected library type. After detection of library type, libraries are sequenced to read depth of 10–30 million reads per sample on a high-throughput platform⁷.

3) Data analysis

After sequencing has been completed, the starting point for analysis is the data files, which contain base-called sequencing reads, usually in the form of FASTQ. The reads having poor quality in FASTQ files are eliminated before the alignment process in which raw sequences are aligned to a reference genome to find their relevant genes. Each sequence read is converted to one or more genomic coordinates and Sequence Alignment Map (SAM) files containing those coordinates are obtained after alignment process^7,12. This process has traditionally been accomplished using distinct alignment tools, such as TopHat¹³, STAR¹⁴, or HISAT¹⁵, which rely on a reference genome. The SAM files are converted to Binary Alignment Map (BAM) files for further analyses because of their large size and this process is carried out by using Samtools¹⁶. After alignment and file conversation steps, reads (transcripts) quantification across samples is performed by using some tools such as featureCounts¹⁷ to obtain expression matrix in which each row corresponds to individual genes, however, each column corresponds to individual samples⁷. Normalization of transcripts abundance across samples is made by using expression matrix to lessen range-based gene expression differences between samples^7,18,19. Normalization methods are shown in (Figure 1)²⁰.

**Figure 1.** Normalization methods that are used in RNA-seq analyses.

After normalization step, genes with low expression across samples are filtered to prevent statistical noise⁷, and then statistically meaningful genes (namely, differentially expressed genes) can be detected by using some tools such as edgeR²¹, DESeq2²². In the end, obtained genes can be used for enrichment analyses such as KEGG and Reactome to find out which pathways are affected. RNA-seq technology is utilized for distinct aims, some of which are shown in (Figure 2). The representations of RNA-seq results are shown in (Figure 3).

**Figure 3. Representation of differential expression, splicing, and co-expression results.** In differential expression figure, each row represents the expression amount of a gene, however, each column represents each sample. Red color shows higher expressions, but the yellow color shows lower expressions. In the co-expression figure, a network containing the interaction of each gene with other genes is depicted. In the differential alternative splicing figure, differential usage of E010 exon between control and knockdown groups is depicted.

A detailed RNA-seq work-flow is shown in (Figure 4)¹².

**Figure 4.** An example of differential expression work-flow.

The various tools that are used for RNA-seq and their tutorials were listed below as well as visualization tools that are used for high-throughput data.

Table 1. List of RNA-seq tool and their usage fields.

Tool names	Usage	Tutorial Link
DESeq2²²	Differential expression	https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
edgeR²¹	Differential expression	https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
DEXSeq²³	Differential splicing	https://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.html
WGCNA²⁴	Co-expression	https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/
GATK²⁵	Variant-calling	https://gatk.broadinstitute.org/hc/en-us

Table 2. List of high-throughput visualization and enrichment tools.

Tool names	Usage
pheatmap²⁶	Heatmap plot for differentially expressed genes
ggplot2²⁷	Most various visualizations ranging from bar charts to violin plots
igraph²⁸	Network visualization for co-expression networks and other network types
Enrichr²⁹	Enrichment analysis of genes
DAVID³⁰	Enrichment analysis of genes

Note/ Most of the listed tools are dependent on the R statistical computing environment.

Table 3. Examples of differential expression work-flows.

Examples	Links
Example 1	https://www.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html
Example 2	https://digibio.blogspot.com/2017/11/rna-seq-analysis-hisat2-featurecounts.html
Example 3	https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.html
Example 4	https://uclouvain-cbio.github.io/BSS2019/rnaseq_gene_summerschool_belgium_2019.html

In addition to differential expression pipelines above, If you want to examine my pipeline containing differential expression analysis with DESeq2, you can visit this https://github.com/kaanokay/Differential-Expression-Analysis/blob/master/HISAT2-featureCounts-DESeq2-workflow.md website address in which I attached my Linux and R scripts.

Transcriptome researches in autism spectrum disorder

Autism Spectrum Disorder (ASD) is an early-onset neuropsychiatric disorder. ASD is clinically described with behavioural abnormalities such as restrictive interest and repetitive behaviour. ASD is genetically heterogeneous and heritable (~50%) and 80% of its genetic background is unclear. Aberrations in autistic brains take mostly place in cortex regions (Figure 5) rather than cerebellum. When ASD is compared with other neuropsychiatric disorders such as schizophrenia and bipolar disorder, it has a higher heritability-rate than them, which means that it appears with the more strong genetic background than schizophrenia and bipolar disorder. Studies have revealed that ASD-related genes are enriched in brain-development, neuronal activity, signalling, and transcription regulation. Wnt signalling, synaptic function, and translational regulation are pathways that are affected by mutations in ASD-related genes³¹.

**Figure 5.** Brain regions most affected in autism.

Transcriptome studies have shown that mRNA, microRNA (miRNA), small nucleolar RNA (snoRNA), and long non-coding RNA (lncRNAs) misexpression occurred in autistic brains. Genes with mRNA misregulation are especially enriched in immune and neuronal pathways, briefly neuronal development and immune system activation are both misregulated in the brains of individuals with ASD. Misregulated miRNAs in autistic brains target mostly genes with synaptic function. Additionally, alternative splicing is misregulated in splicing regulators and this causes mis-splicing patterns in autistic individuals³¹.

To summarize, RNA-seq is strong technology for understanding diseases and it can be used for various aims.

That’s all 🙂

If you have any questions about this short review and my differential expression pipeline in GitHub, you feel free to contact me via kaan.okay@msfr.ibg.edu.tr e-mail address.

Very thanks for your interest and time!

REFERENCES

1) https://en.wikipedia.org/wiki/Omics.

2) https://en.wikipedia.org/wiki/Transcriptomics_technologies.

3) https://en.wikipedia.org/wiki/Transcriptome.

4) Kadakkuzha, B. M., Liu, X. an, Swarnkar, S. & Chen, Y. Genomic and proteomic mechanisms and models in toxicity and safety evaluation of nutraceuticals. in Nutraceuticals: Efficacy, Safety and Toxicity 227–237 (Elsevier Inc., 2016). doi:10.1016/B978-0-12-802147-7.00018-8.

5) Behjati, S. & Tarpey, P. S. What is next generation sequencing? Arch. Dis. Child. Educ. Pract. Ed. 98, 236–238 (2013).

6) https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/performing-rna-seq.

7) Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. 20, 631–656 (2019).

8) https://en.wikipedia.org/wiki/Single_cell_sequencing.

9) https://www.10xgenomics.com/spatial-transcriptomics/.

10) https://www.diva-portal.org/smash/get/diva2:1068517/FULLTEXT01.pdf.

11) https://salmon.readthedocs.io/en/latest/library_type.html.

12) https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.html.

13) Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).

14) Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

15) Kim, D., Langmead, B. & Salzberg, S. L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).

16) Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

17) Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

18) Evans, C., Hardin, J. & Stoebel, D. M. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief. Bioinform. 19, 776–792 (2018).

19) Liu, X. et al. Normalization Methods for the Analysis of Unbalanced Transcriptome Data: A Review. Front. Bioeng. Biotechnol. 7, 1–11 (2019).

20) https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html.

21) Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2009).

22) Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, (2014).

23) Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-Seq data. Nat. Preced. 1–30 (2012) doi:10.1038/npre.2012.6837.2.

24) Langfelder, P. & Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics 9, (2008).

25) McKenna, A. et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

26) https://cran.r-project.org/web/packages/pheatmap/pheatmap.pdf.

27) https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf.

28) https://cran.r-project.org/web/packages/igraph/igraph.pdf.

29) https://amp.pharm.mssm.edu/Enrichr/.

30) https://david.ncifcrf.gov/.

31) Quesnel-Vallières, M., Weatheritt, R. J., Cordes, S. P. & Blencowe, B. J. Autism spectrum disorder: insights into convergent mechanisms from transcriptomics. Nat. Rev. Genet. 20, 51–63 (2019).

Leave a Reply Cancel Reply