Blog

A short review of RNA sequencing and its applications

What are the omics sciences?

The omics sciences aim to quantify entire classes of biomolecules, such as RNA and proteins, at the organism, tissue, or single-cell level. They are divided into several branches, including genomics, transcriptomics, and proteomics [1].

What is transcriptomics?

Transcriptomics is the omics science that studies an organism’s transcriptome, which is the sum of all of its RNA molecules [2,3].

What is RNA sequencing?

RNA sequencing (RNA-seq) is a technique that quantifies all RNAs in bulk tissues or in individual cells. It is used to calculate the transcript abundance of each gene across samples. It relies on next-generation sequencing (NGS) platforms, which determine the sequence of biomolecules such as DNA and RNA [4,5].

What are the kinds of RNA-seq?

Bulk tissue RNA-seq

The whole transcriptome of a bulk tissue sample is sequenced for transcriptomic analysis. Because bulk tissue can contain various cell types, the resulting transcriptome is a mixture of the RNAs of all those cells. This is the most common use of RNA-seq and serves aims such as elucidating diseases [7].

Single-cell RNA-seq

In contrast to bulk tissue RNA-seq, single-cell RNA-seq (scRNA-seq) is performed on individual cells: the whole transcriptome of each cell in a tissue is sequenced for transcriptomic analysis. scRNA-seq has revealed that the cells in a tissue differ from each other in their transcriptomes and that individual cells can be grouped into specific clusters according to their transcriptomic signatures. It has also helped in the discovery of cell types such as ionocytes, which could be relevant to the pathology of cystic fibrosis [7,8].

Spatial RNA-seq

The relationship between cells and their relative locations within a tissue sample can be critical to understanding disease pathology. Spatial transcriptomics is a technology that measures all the gene activity in a tissue sample and maps where that activity is occurring. It is used to understand biological processes and disease. Spatial RNA-seq can be performed on intact tissue sections as well as at the single-cell level. Its general aim is to combine gene expression with morphological information, providing information on tissue architecture and the micro-environment and generating sub-cellular data. Current bulk and scRNA-seq methods provide highly detailed data on tissues or cell populations but do not capture spatial information [7,9,10].

RNA-seq analysis work-flow

1) Experimental design

RNA-seq offers many library types, which yield sequencing reads (sequenced transcripts) with different characteristics. For instance, in single-end libraries a transcript is read from only one end (5’ or 3’), whereas in paired-end libraries it is read from both its 5’ and 3’ ends. Paired-end sequencing can additionally help disambiguate read mappings and is preferred for alternative-exon quantification and fusion transcript detection, particularly when working with poorly annotated transcriptomes [7]. Libraries can also be stranded or unstranded. Strandedness indicates which DNA strand a read comes from and is used to assign reads to the relevant genes. If the strandedness of a library is specified incorrectly, reads are not assigned to the true genes and the gene expression results will be wrong [11]. In addition, technical replicates, in which one sample is sequenced more than once on the same high-throughput platform, can be used to reduce technical bias.

2) Laboratory performance

After RNA is extracted from all samples, libraries are prepared for sequencing according to the selected library type. The libraries are then sequenced to a read depth of 10–30 million reads per sample on a high-throughput platform [7].

3) Data analysis

After sequencing has been completed, the starting point for analysis is the data files, which contain base-called sequencing reads, usually in FASTQ format. Poor-quality reads in the FASTQ files are removed before alignment, the step in which the raw sequences are aligned to a reference genome to find their relevant genes. Each sequence read is converted to one or more genomic coordinates, and Sequence Alignment Map (SAM) files containing those coordinates are obtained after alignment [7,12]. This step has traditionally been accomplished with alignment tools such as TopHat [13], STAR [14], or HISAT [15], which rely on a reference genome. Because SAM files are large, they are converted to Binary Alignment Map (BAM) files for further analyses using Samtools [16]. After the alignment and file conversion steps, reads (transcripts) are quantified across samples with tools such as featureCounts [17] to obtain an expression matrix in which each row corresponds to a gene and each column corresponds to a sample [7]. Transcript abundance is then normalized across samples using the expression matrix to reduce range-based differences in gene expression between samples [7,18,19]. Normalization methods are shown in Figure 1 [20].


Figure 1. Normalization methods that are used in RNA-seq analyses.
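As a rough illustration of what normalization does to an expression matrix, the sketch below computes counts per million (CPM) in plain Python. CPM is chosen here only because it is simple; the gene names and counts are made up for the example, and tools such as DESeq2 and edgeR apply more elaborate methods.

```python
def counts_per_million(counts):
    """Scale raw counts so that every sample (column) sums to one million.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total number of counted reads in each sample.
    library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] * 1e6 / library_sizes[j] for j in range(n_samples)]
        for gene, row in counts.items()
    }


# Hypothetical expression matrix: two genes (rows) and two samples (columns).
raw_counts = {"geneA": [10, 20], "geneB": [90, 180]}
cpm = counts_per_million(raw_counts)
print(cpm)
```

In this toy matrix the second sample was sequenced twice as deeply as the first, so the raw counts differ while the CPM values are the same in both samples.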

After normalization, genes with low expression across samples are filtered out to reduce statistical noise [7], and statistically significant genes (namely, differentially expressed genes) can then be detected with tools such as edgeR [21] or DESeq2 [22]. Finally, the resulting genes can be used in enrichment analyses, such as KEGG and Reactome, to find out which pathways are affected. RNA-seq is used for many different aims, some of which are shown in Figure 2. Representations of RNA-seq results are shown in Figure 3.


Figure 2. RNA-seq usage fields.



Figure 3. Representation of differential expression, splicing, and co-expression results. In the differential expression figure, each row represents the expression level of a gene and each column represents a sample. Red shows higher expression and yellow shows lower expression. In the co-expression figure, a network of the interactions of each gene with other genes is depicted. In the differential alternative splicing figure, the differential usage of exon E010 between the control and knockdown groups is depicted.

A detailed RNA-seq work-flow is shown in Figure 4 [12].


Figure 4. An example of differential expression work-flow.

The tools used for RNA-seq and their tutorials are listed below, along with visualization tools used for high-throughput data.

Table 1. List of RNA-seq tools and their uses.

Tool name | Usage | Tutorial link
DESeq2 [22] | Differential expression | https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
edgeR [21] | Differential expression | https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
DEXSeq [23] | Differential splicing | https://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.html
WGCNA [24] | Co-expression | https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/
GATK [25] | Variant calling | https://gatk.broadinstitute.org/hc/en-us

Table 2. List of high-throughput visualization and enrichment tools.

Tool name | Usage
pheatmap [26] | Heatmap plots for differentially expressed genes
ggplot2 [27] | A wide range of visualizations, from bar charts to violin plots
igraph [28] | Network visualization for co-expression networks and other network types
Enrichr [29] | Enrichment analysis of genes
DAVID [30] | Enrichment analysis of genes

Note: Most of the listed tools depend on the R statistical computing environment.

Table 3. Examples of differential expression work-flows.

Examples Links
Example 1 https://www.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html
Example 2 https://digibio.blogspot.com/2017/11/rna-seq-analysis-hisat2-featurecounts.html
Example 3 https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.html
Example 4 https://uclouvain-cbio.github.io/BSS2019/rnaseq_gene_summerschool_belgium_2019.html

In addition to the differential expression pipelines above, if you want to examine my own pipeline for differential expression analysis with DESeq2, you can visit https://github.com/kaanokay/Differential-Expression-Analysis/blob/master/HISAT2-featureCounts-DESeq2-workflow.md, where I have posted my Linux and R scripts.

Transcriptome research in autism spectrum disorder

Autism Spectrum Disorder (ASD) is an early-onset neuropsychiatric disorder. It is clinically characterized by behavioural abnormalities such as restricted interests and repetitive behaviour. ASD is genetically heterogeneous and heritable (~50%), and 80% of its genetic background remains unclear. Aberrations in autistic brains occur mostly in cortical regions (Figure 5) rather than in the cerebellum. ASD has a higher heritability than other neuropsychiatric disorders such as schizophrenia and bipolar disorder, which means that it has a stronger genetic background than they do. Studies have revealed that ASD-related genes are enriched in brain development, neuronal activity, signalling, and transcription regulation. Wnt signalling, synaptic function, and translational regulation are pathways affected by mutations in ASD-related genes [31].


Figure 5. Brain regions most affected in autism.

Transcriptome studies have shown that mRNAs, microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), and long non-coding RNAs (lncRNAs) are misexpressed in autistic brains. Genes with mRNA misregulation are especially enriched in immune and neuronal pathways; in brief, neuronal development and immune system activation are both misregulated in the brains of individuals with ASD. Misregulated miRNAs in autistic brains mostly target genes with synaptic function. Additionally, the alternative splicing of splicing regulators is misregulated, and this causes mis-splicing patterns in autistic individuals [31].

To summarize, RNA-seq is a powerful technology for understanding diseases, and it can be used for various aims.

That’s all 🙂

If you have any questions about this short review or my differential expression pipeline on GitHub, feel free to contact me by e-mail at kaan.okay@msfr.ibg.edu.tr.

Thank you very much for your interest and time!

REFERENCES

1) https://en.wikipedia.org/wiki/Omics.

2) https://en.wikipedia.org/wiki/Transcriptomics_technologies.

3) https://en.wikipedia.org/wiki/Transcriptome.

4) Kadakkuzha, B. M., Liu, X. an, Swarnkar, S. & Chen, Y. Genomic and proteomic mechanisms and models in toxicity and safety evaluation of nutraceuticals. in Nutraceuticals: Efficacy, Safety and Toxicity 227–237 (Elsevier Inc., 2016). doi:10.1016/B978-0-12-802147-7.00018-8.

5) Behjati, S. & Tarpey, P. S. What is next generation sequencing? Arch. Dis. Child. Educ. Pract. Ed. 98, 236–238 (2013).

6) https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/performing-rna-seq.

7) Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. 20, 631–656 (2019).

8) https://en.wikipedia.org/wiki/Single_cell_sequencing.

9) https://www.10xgenomics.com/spatial-transcriptomics/.

10) https://www.diva-portal.org/smash/get/diva2:1068517/FULLTEXT01.pdf.

11) https://salmon.readthedocs.io/en/latest/library_type.html.

12) https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.html.

13) Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).

14) Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

15) Kim, D., Langmead, B. & Salzberg, S. L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).

16) Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

17) Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

18) Evans, C., Hardin, J. & Stoebel, D. M. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief. Bioinform. 19, 776–792 (2018).

19) Liu, X. et al. Normalization Methods for the Analysis of Unbalanced Transcriptome Data: A Review. Front. Bioeng. Biotechnol. 7, 1–11 (2019).

20) https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html.

21) Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2009).

22) Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, (2014).

23) Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-Seq data. Nat. Preced. 1–30 (2012) doi:10.1038/npre.2012.6837.2.

24) Langfelder, P. & Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics 9, (2008).

25) McKenna, A. et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

26) https://cran.r-project.org/web/packages/pheatmap/pheatmap.pdf.

27) https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf.

28) https://cran.r-project.org/web/packages/igraph/igraph.pdf.

29) https://amp.pharm.mssm.edu/Enrichr/.

30) https://david.ncifcrf.gov/.

31) Quesnel-Vallières, M., Weatheritt, R. J., Cordes, S. P. & Blencowe, B. J. Autism spectrum disorder: insights into convergent mechanisms from transcriptomics. Nat. Rev. Genet. 20, 51–63 (2019).

Evolution and Unprecedented Variants of the Mitochondrial Genetic Code in a Lineage of Green Algae – David Žihala

Presenter




David Žihala

Abstract

Mitochondria of diverse eukaryotes have evolved various departures from the standard genetic code, but the breadth of possible modifications and their phylogenetic distribution are known only incompletely. Furthermore, it is possible that some codon reassignments in previously sequenced mitogenomes have been missed, resulting in inaccurate protein sequences in databases. Considering the distribution of codons at conserved amino acid positions in mitogenome-encoded proteins, mitochondria of the green algal order Sphaeropleales exhibit a diversity of codon reassignments, including previously missed ones and some that are unprecedented in any translation system examined so far, necessitating redefinition of existing translation tables and creating at least seven new ones. We resolve a previous controversy concerning the meaning of the UAG codon in Hydrodictyaceae, which beyond any doubt encodes alanine. We further demonstrate that AGG, sometimes together with AGA, encodes alanine instead of arginine in diverse sphaeroplealeans. Further newly detected changes include Arg-to-Met reassignment of the AGG codon and Arg-to-Leu reassignment of the CGG codon in particular species. Analysis of tRNAs specified by sphaeroplealean mitogenomes provides direct support for and molecular underpinning of the proposed reassignments. Furthermore, we point to unique mutations in the mitochondrial release factor mtRF1a that correlate with changes in the use of termination codons in Sphaeropleales, including the two independent stop-to-sense UAG reassignments, the reintroduction of UGA in some Scenedesmaceae, and the sense-to-stop reassignment of UCA widespread in the group. Codon disappearance seems to be the main drive of the dynamic evolution of the mitochondrial genetic code in Sphaeropleales.

Date: April 28th, 2020 – 7:00 pm (GMT+3)

Language: English

To register for the webinar, visit this link:
https://www.bigmarker.com/bioinfonet/Evolution-and-Unprecedented-Variants-of-the-Mitochondrial-Genetic-Code-in-a-Lineage-of-Green-Algae

Accessing Multi-omics Data for the Purposes of Tumour Profiling – Aashil A. Batavia

Presenter

Aashil A. Batavia

Aashil Batavia received his undergraduate degree from the University of Manchester obtaining a B.Sc. in Biomedical Sciences in 2014. During his dissertation, he implemented in silico experimental evolution to gain insights into the relationship between mutation rate plasticity, evolvability and robustness; exposing him to computational approaches for biomedical research for the first time. In 2015, he elected to return to the University of Manchester where he obtained an M.Sc in Bioinformatics and Systems Biology. Here he completed two research projects, one of which assessed the impact of human variants on the structure and function of Prpf8; mutations in which have been shown to cause retinitis pigmentosa. This work paved the way for his move to Switzerland in 2017 where he would begin his PhD at the Institute of Pathology and Molecular Pathology, USZ and the Department of Biosystems Science and Engineering, ETH Zurich. With a foot in both the computational and experimental worlds, his current work is focused on the multi-omics assessment of a rare form of renal cell carcinoma termed wild-type von Hippel-Lindau (wtVHL) clear cell renal cell carcinoma.

Abstract

Cancers are a very complex and heterogeneous set of diseases, and therefore cancer research is by no means trivial. The greater our understanding of the molecular landscape of a particular tumour type, the better equipped we will become to combat its growth and spread. Publicly available multi-omic datasets provide a valuable resource to further this understanding. These datasets are commonly used for the identification of novel areas of study, the validation of results, and the benchmarking/assessment of novel statistical methods. The Cancer Genome Atlas (TCGA) provides one such dataset, with its repository consisting of 11,000 patients across 33 cancer types. This rich resource assists research in both tumour-specific and pan-cancer settings. In this webinar, I will introduce the various ways of accessing The Cancer Genome Atlas repository, navigating the multiple data types available, and the tools I use for the multi-omics assessment (single and integrated) of my tumours of interest: renal cell carcinomas.

Date: May 5th, 2020 – 3:00 pm (GMT+3)

Language: English

To register for the webinar, visit this link:
https://www.bigmarker.com/bioinfonet/Accessing-Multi-omics-Data-for-the-Purposes-of-Tumour-Profiling

Connecting to a Virtual Machine from Windows Using PuTTY (3 Steps)

Big data requires big infrastructure. If your computer cannot handle big data, you need to connect to a server or virtual machine to store and process your data.

I have been participating in COVID19-bh20. If you are a newbie like me at such events and inexperienced in handling big data in such a big hackathon, here is the first thing you need to know about managing such data: connecting to the Virtual Machine (VM) via PuTTY.

  • First, download PuTTY
  • Then open the PuTTY Key Generator

Step-1

  • Generate the public and private keys in the format requested by the admin, such as RSA, shown in the yellow box
  • Save both keys
  • After generation, share the public key, shown in the red box, with the admin of the virtual machine/server
  • Also set a key passphrase, shown in the green box

Step-2

  • Next, type the IP address into the host name/IP address box, shown in the purple box
  • (Do not click Open before changing the Connection settings, which is done in the following steps)
  • Then provide the private key for accessing the VM by changing the Connection settings, shown with an orange arrow

Step-3

  • Click Connection, shown with an orange arrow
  • Next, click SSH, shown with an orange arrow
  • Then select Auth, shown with an orange arrow
  • Under Auth, add the path of the private key by browsing to it, shown in the red box
  • Now click Open to connect, shown with a green arrow
  • The username is given by the admin in the form username@IP_address, highlighted in bold
  • The password is the key passphrase you set while generating the key

I hope you find this post useful.

For detailed information, you can check the Microsoft Azure page.

PS: Although my labmates had shown me how to do this before, I forgot it. Thanks to the hackathon, I had a chance to refresh my memory. In case you are a newbie like me, this post might be useful.

All the best with your analysis!

The Mapping Pipeline of the Next Generation Sequencing Data

Next-generation sequencing (NGS) enables high-throughput detection of DNA sequences in genomic research. NGS technologies are used for several applications, including whole-genome sequencing, de novo assembly sequencing, resequencing, and transcriptome sequencing at the DNA or RNA level. To sequence longer sections of DNA, a new approach called shotgun sequencing (Venter et al., 2003; Margulies et al., 2005; Shendure et al., 2005) was developed during the Human Genome Project (HGP). In this approach, genomic DNA is enzymatically or mechanically broken down into smaller fragments, which are cloned into sequencing vectors so that the cloned DNA fragments can be sequenced individually. With the NGS approach, it is possible to detect abnormalities across the entire genome (whole-genome sequencing only), including substitutions, deletions, insertions, duplications, copy number changes (gene and exon), and chromosome inversions/translocations. Shotgun sequencing thus has significant advantages over the original sequencing methodology, Sanger sequencing, which requires a specific primer to start the read at a specific location along the DNA template and records the different labels for each nucleotide within the sequence.

The aim of this study is to build a general workflow for mapping the short-read sequences produced by an NGS machine.

Before the analysis of NGS data with publicly or commercially available algorithms and tools, we need to know about some features of the NGS raw data.

The raw data from a sequencing machine are most widely provided as FASTQ (unaligned sequences) files, which include sequence information, similar to FASTA files, but additionally contain further information, including sequence quality information. A FASTQ file consists of blocks, corresponding to reads, and each block consists of four elements in four lines.  

Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description
Line 2 is the raw sequence letters
Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence

For instance;
@HS2000-306_201:6:1204:19922:79127/

Column | Brief description
HS2000-306_201 | The instrument name
6 | Flowcell lane
1204 | Tile number within the flowcell lane
19922 | x-coordinate of the cluster within the tile
79127 | y-coordinate of the cluster within the tile
1 | The member of a pair, 1 or 2 (paired-end)

ACGTCTGGCCTAAAGCACTTTTTCTGAATTC…  Sequence
+
BC@DFDFFHHHHHJJJIJJJJJJJJJJJJJJJJJJJJJH…  Base Qualities
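The four-line block structure described above can be read with a few lines of Python. This is a minimal sketch that assumes an uncompressed file with exactly four lines per record (no wrapped sequences); the function name and the two example reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        # Line 4 must contain one quality symbol per base in line 2.
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip("\n"), sequence, quality


# Two made-up reads held in memory instead of a real file.
example = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
print(records)
```

For a real file, the same function can be given an open file handle, for example the result of `open("reads.fastq")`.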

1. Quality Control

Quality control (QC) is the most important step in improving raw data, because it removes identifiable errors from it. Applying QC at the beginning of the analysis reduces the chance that contamination, imprecision, errors, and missing data carry over into later steps.

Quality (Q) is proportional to the negative base-10 logarithm of the probability (e) that a sequence base is wrong:
Phred-scaled Q = -10 * log10(e)
Base quality character = ASCII(33 + Phred-scaled Q)
e: base-calling error probability
This encoding, which is also the one used in SAM files, adds 33 to the value because ASCII 33 is the first visible character.

source: https://drive5.com/usearch/manual/quality_score.html



Phred Quality Score | Probability of Incorrect Base Call | Base Call Accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%
60 | 1 in 1,000,000 | 99.9999%
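The relationship between quality characters, Phred scores, and error probabilities can be checked with a short Python sketch. It assumes the offset of 33 described above; the function names are only illustrative.

```python
import math


def phred_from_error(error_probability):
    """Phred-scaled quality for a base-calling error probability."""
    return -10 * math.log10(error_probability)


def error_from_phred(q):
    """Base-calling error probability for a Phred-scaled quality."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


# First five quality characters of the example read shown earlier.
scores = decode_quality("BC@DF")
print(scores)
print([error_from_phred(q) for q in scores])
```

Here 'B' has ASCII code 66, so its Phred score is 66 - 33 = 33, which corresponds to an error probability of about 1 in 2000.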

The most commonly used tool for assessing and visualizing the quality of FASTQ data is FastQC (Babraham Bioinformatics, n.d.), which provides comprehensive information about data quality, including base sequence quality scores, GC content information, sequence duplication levels, and overrepresented sequences. Alternatives to FastQC include PRINSEQ, fastqp, NGS QC Toolkit, and QC-Chain.

Running FastQC

1- To run the FastQC program on desktop, you can use File > Open to select the sequence file you want to check.

2- To run the FastQC program in the cluster, we would normally have to tell our computer where the program is located.

$ which fastqc

/usr/local/bin/fastqc

FastQC can accept multiple filenames as input, so we can use the *.fastq.gz wildcard to run FastQC on all of the FASTQ files in this directory.

$ fastqc *.fastq.gz

You will see an automatically updating output message telling you the progress of the analysis. It will start like this:

Started analysis of SRR2584863_1.fastq
Approx 5% complete for SRR2584687_1.fastq
Approx 10% complete for SRR2584687_1.fastq
Approx 15% complete for SRR2584687_1.fastq
Approx 20% complete for SRR2584687_1.fastq
Approx 25% complete for SRR2584687_1.fastq
Approx 30% complete for SRR2584687_1.fastq
Approx 35% complete for SRR2584687_1.fastq
Approx 40% complete for SRR2584687_1.fastq
Approx 45% complete for SRR2584687_1.fastq

For each input FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. We’ll be working with these output files soon. The .html file is a stable webpage displaying the summary report for each of our samples.

We want to keep our data files and our results files separate, so we will move these output files into a new directory within our results/ directory. If this directory does not exist, we will have to create it.

## -p flag stops a message from appearing if the directory already exists
$ mkdir -p ~/kaya/results
$ mv *.html ~/kaya/results/
$ mv *.zip ~/kaya/results/

It can be quite tedious to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.

After each .zip archive has been unzipped, the .html files and the original .zip files are still present, but we now also have a new directory for each of our samples. We can confirm that these are directories by using the -F flag for ls.

$ ls -F

SRR2584869_1_fastqc/      SRR2584866_1_fastqc/      SRR2589044_1_fastqc/
SRR2584869_1_fastqc.html  SRR2584866_1_fastqc.html  SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip   SRR2584866_1_fastqc.zip   SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc/      SRR2584866_2_fastqc/      SRR2589044_2_fastqc/
SRR2584863_2_fastqc.html  SRR2584866_2_fastqc.html  SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip   SRR2584866_2_fastqc.zip   SRR2589044_2_fastqc.zip

Let’s see what files are present within one of these output directories.

$ ls -F SRR2584869_1_fastqc/

fastqc_data.txt  fastqc.fo fastqc_report.html Icons/ Images/  summary.txt

Use less to preview the summary.txt file for this sample.

$ less SRR2584869_1_fastqc/summary.txt 

PASS    Basic Statistics        SRR2584869_1.fastq
PASS    Per base sequence quality       SRR2584869_1.fastq
PASS    Per tile sequence quality       SRR2584869_1.fastq
PASS    Per sequence quality scores     SRR2584869_1.fastq
WARN    Per base sequence content       SRR2584869_1.fastq
WARN    Per sequence GC content SRR2584869_1.fastq
PASS    Per base N content      SRR2584869_1.fastq
PASS    Sequence Length Distribution    SRR2584869_1.fastq
PASS    Sequence Duplication Levels     SRR2584869_1.fastq
PASS    Overrepresented sequences       SRR2584869_1.fastq
WARN    Adapter Content SRR25848

Finally, we can make a report of the results we got for all our samples by concatenating all of our summary.txt files into a single file using the cat command.

$ cat */summary.txt > ~/kaya/results/fastqc_summaries.txt
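Once the summaries are concatenated, the combined file can be scanned for modules that did not pass. The sketch below is one possible way to do this in Python; it assumes each line has three tab-separated fields (status, module name, file name), and the example text and sample names are made up rather than taken from a real run.

```python
from collections import Counter


def parse_summary(text):
    """Return (status, module, filename) tuples from FastQC summary text."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or incomplete lines
        rows.append(tuple(parts))
    return rows


# Hypothetical content of a concatenated summary file.
example_text = (
    "PASS\tBasic Statistics\tsampleA.fastq\n"
    "WARN\tPer base sequence content\tsampleA.fastq\n"
    "FAIL\tAdapter Content\tsampleB.fastq\n"
)
rows = parse_summary(example_text)
status_counts = Counter(status for status, _, _ in rows)
flagged = [(filename, module) for status, module, filename in rows if status != "PASS"]
print(status_counts)
print(flagged)
```

To use it on the file created above, the text could be read with `open(path).read()`, where `path` points to fastqc_summaries.txt.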

For more information, please see the FastQC documentation.

Additionally, the MultiQC tool is designed to combine QC reports into a single report that is easy to analyze.

$ multiqc .
$ multiqc --help

Another way to check your NGS data quality is to work in RStudio.
The fastqcr package can be installed from CRAN as follows.

install.packages("fastqcr")

Figure: examples of good-quality and bad-quality data.

2. Trimming Low-quality Reads and Adapters

Trimming is the second step in analyzing NGS data. It has been broadly adopted in recent NGS studies, specifically prior to genome assembly, transcriptome assembly, metagenome reconstruction, gene expression, epigenetic studies, and comparative genomics. Neglecting low-quality base calls may be harmful to any NGS analysis, as it may add unreliable and potentially random sequences to the dataset. This can be a problem for any downstream analysis pipeline and lead to incorrect interpretation of the data. Adapter contamination can also lead to NGS alignment errors and an increased number of unaligned reads, since adapter sequences are synthetic and do not occur in the genomic sequence. In some applications (e.g., small RNA sequencing), adapter trimming is essential: with a fragment size of around 24 nucleotides, one will definitely sequence into the 3′ adapter. In other applications (transcriptome sequencing, whole-genome sequencing, etc.), adapter contamination can be expected to be so small (due to appropriate size selection) that one could consider skipping adapter removal to save time and effort. Many tools are available for trimming, namely AfterQc, Cutadapt, Trimmomatic, Erne-Filter, ConDeTri, Sickle, SolexaQA, AlienTrimmer, Skewer, BBDuk, Fastx Toolkit, and Trim Galore.

In this post, we describe the basic commands for improving the quality and reliability of your NGS data with the Cutadapt trimming tool.

Cutadapt can trim both reads of paired-end data in a single run. To do so, provide two input files and a second output file with the -p option (this is the short form of --paired-output). This is the basic command-line syntax:

$ cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq

Here, the input reads are in reads.1.fastq and reads.2.fastq, and the result will be written to out.1.fastq and out.2.fastq.

In paired-end mode, the options -a, -b, -g and -u that also exist in single-end mode are applied to the forward reads only. To modify the reverse read, these options have uppercase versions -A, -B, -G and -U that work just like their counterparts. In the example above, ADAPTER_FWD will therefore be trimmed from the forward reads and ADAPTER_REV from the reverse reads.

The -q (or --quality-cutoff) parameter can be used to trim low-quality ends from reads. If you specify a single cutoff value, the 3’ end of each read is trimmed:

$ cutadapt -q 20 -o output.fastq input.fastq

It is also possible to trim from the 5’ end by specifying two comma-separated cutoffs as 5’ cutoff, 3’ cutoff. For example,

$ cutadapt -q 15,10 -o output.fastq input.fastq

will quality-trim the 5’ end with a cutoff of 15 and the 3’ end with a cutoff of 10. To only trim the 5’ end, use a cutoff of 0 for the 3’ end, as in -q 15,0.
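To give an idea of how a quality cutoff turns into a trimming position, here is a simplified Python sketch of 3’ quality trimming: the cutoff is subtracted from every quality, partial sums are computed from each position to the end of the read, and the read is cut where that sum is smallest. It follows the general description of this approach and is not Cutadapt's actual code, so results may differ in edge cases; the function name is an assumption.

```python
def quality_trim_index(qualities, cutoff):
    """Return the number of leading bases to keep after 3' quality trimming.

    qualities: list of Phred scores, one per base, in read order.
    """
    running_sum = 0
    minimum = 0
    keep = len(qualities)  # by default nothing is trimmed
    # Walk from the 3' end towards the 5' end, summing (quality - cutoff).
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum < minimum:
            minimum = running_sum
            keep = i
    return keep


qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(qualities, cutoff=10)
print(keep)
```

With a cutoff of 10, the partial sum is smallest at the fifth base, so only the first four bases are kept. The single base with quality 11 near the end is still removed, because it is surrounded by low-quality bases.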

Interleaved paired-end reads

Paired-end reads can be read from a single FASTQ file in which the entries for the first and second read from each pair alternate. The first read in each pair comes before the second. Enable this file format by adding the --interleaved option to the command line. For example:

$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.fastq reads.fastq

To read from an interleaved file, but write regular two-file output, provide the second output file as usual with the -p option:

$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq -p trimmed.2.fastq reads.fastq

Reading two-file input and writing interleaved is also possible by providing a second input file:

$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq reads.1.fastq reads.2.fastq
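If the interleaved layout is new to you, the following sketch shows what it looks like by converting between two-file and interleaved FASTQ with plain awk. It assumes four lines per record and mates listed in the same order in both files; the function names interleave and deinterleave and the demo files are hypothetical. Cutadapt does this conversion itself, so the code is only meant to illustrate the format.

```shell
# interleave R1 R2: print each 4-line record of R1 followed by its mate from R2.
interleave() {
    awk -v mate="$2" '{
        print
        if (NR % 4 == 0) {
            for (i = 0; i < 4; i++) {
                if ((getline line < mate) > 0) print line
            }
        }
    }' "$1"
}

# deinterleave IN OUT1 OUT2: send alternating records to OUT1 and OUT2.
deinterleave() {
    awk -v o1="$2" -v o2="$3" '{
        if (int((NR - 1) / 4) % 2 == 0) { print > o1 } else { print > o2 }
    }' "$1"
}

# Tiny hypothetical files, created only for this demonstration.
tmp=$(mktemp -d)
printf '@pair1/1\nACGT\n+\nIIII\n' > "$tmp/reads.1.fastq"
printf '@pair1/2\nTGCA\n+\nIIII\n' > "$tmp/reads.2.fastq"
interleave "$tmp/reads.1.fastq" "$tmp/reads.2.fastq" > "$tmp/reads.fastq"
deinterleave "$tmp/reads.fastq" "$tmp/back.1.fastq" "$tmp/back.2.fastq"
cmp "$tmp/reads.1.fastq" "$tmp/back.1.fastq" && echo "round trip OK"
```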

Trimming paired-end reads separately

If you want to quality-trim the first read in each pair with a threshold of 20 and the second read in each pair with a threshold of 10, you can process the two files separately:

$ cutadapt -q 20 -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
$ cutadapt -q 10 -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq

If one end of a paired-end read contains more than 5% ‘N’ bases, the whole pair can be removed. Cutadapt provides the following options for dealing with N bases in your reads:

--max-n COUNT
Discard reads containing more than COUNT N bases. A fractional COUNT between 0 and 1 can also be given and will be treated as the proportion of maximally allowed N bases in the read.
--trim-n
Remove flanking N bases from each read. That is, a read such as this:

NNACGTACGTNNNN
is trimmed to just ACGTACGT. This option is applied after adapter trimming. If you want to get rid of N bases before adapter removal, use quality trimming: N bases typically also have a low quality value associated with them.
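The two N-related options can be pictured with a few lines of shell. This is a minimal sketch of the idea, assuming upper-case sequences; the helper names n_fraction and trim_n are hypothetical and the code is not how Cutadapt implements the options.

```shell
# Fraction of N bases in a sequence, printed with three decimals.
n_fraction() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        seq = $0
        n = gsub(/N/, "", seq)      # gsub returns the number of replacements
        printf "%.3f\n", (length($0) > 0 ? n / length($0) : 0)
    }'
}

# Remove leading and trailing N bases, mirroring the --trim-n description.
trim_n() {
    printf '%s\n' "$1" | sed -e 's/^N*//' -e 's/N*$//'
}

n_fraction NNACGTACGTNNNN   # 6 of 14 bases are N, prints 0.429
trim_n NNACGTACGTNNNN       # prints ACGTACGT
```

With a fractional limit such as 0.05, a read with an N fraction of 0.429 would clearly be discarded.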

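Finally, Cutadapt can work with several adapter pairs at once. With the --pair-adapters option, the first -a adapter is paired with the first -A adapter, the second with the second, and so on.

An example:

$ cutadapt --pair-adapters -a AAAAA -a GGGG -A CCCCC -A TTTT -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq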

Here, the adapter pairs are (AAAAA, CCCCC) and (GGGG, TTTT). That is, paired-end reads will only be trimmed if either

  • AAAAA is found in R1 and CCCCC is found in R2,
  • or GGGG is found in R1 and TTTT is found in R2.

For detailed information, please see the Cutadapt documentation

3. Aligned sequences – SAM/BAM format

Now the filtered reads of each sample are ready to be mapped to their exact locations on the corresponding reference genome. Alternatively, the reads can be assembled de novo.

A reference genome is a collection of contigs
● A contig refers to overlapping DNA reads encoded as A, G, C, T or N
● Typically comes in FASTA format:
○ “>” line contains information on contig
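Before aligning, it can help to look at which contigs a reference FASTA file actually contains. The sketch below lists each contig name with its sequence length using awk. The function name contig_lengths and the small demo file are hypothetical; on real data you would point it at your own reference, for example the file downloaded in the next step.

```shell
# Print "name<TAB>length" for every contig in a FASTA file.
contig_lengths() {
    awk '/^>/ {
             if (name != "") print name "\t" len
             name = substr($1, 2)      # text after ">" up to the first space
             len = 0
             next
         }
         { len += length($0) }         # sequence may span several lines
         END { if (name != "") print name "\t" len }' "$1"
}

# Tiny hypothetical reference, created only for this demonstration.
tmp=$(mktemp -d)
printf '>contig1 example description\nACGTACGT\nNNAC\n>contig2\nGGCC\n' > "$tmp/demo.fasta"
contig_lengths "$tmp/demo.fasta"
```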

There are a number of aligners to choose from and, while there is no golden rule, some are better suited to particular NGS analyses; to name a few: BWA, Bowtie2, SOAP, Novoalign, and MUMmer. Alignment produces a Sequence Alignment Map (SAM) file, a format for storing large nucleotide sequence alignments. The binary version of a SAM file is termed a Binary Alignment Map (BAM) file; BAM files store aligned reads and are technology independent. A SAM/BAM file consists of a header and an alignment section.

We will be using the Burrows Wheeler Aligner (BWA), which is a software package for mapping short-read sequences against a reference genome.

The alignment process consists of two steps:

  1. Indexing the reference genome
  2. Aligning the reads to the reference genome

Firstly, we create a new folder and download our reference genome from our source.

$ cd ~/kaya
$ mkdir -p data/ref_genome

$ curl -L -o data/ref_genome/ecoli_ref.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
$ gunzip data/ref_genome/ecoli_ref.fasta.gz

We will also download a set of trimmed FASTQ files to work with.

$ curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
$ tar xvf sub.tar.gz
$ mv sub/ ~/kaya/data/trimmed_fastq_small

You also need to create directories for the results that will be generated as part of this workflow.

$ mkdir -p results/sam results/bam

Index the reference genome

Our first step is to index the reference genome for use by BWA. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment. Indexing the reference only has to be run once. The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different tool for alignment.

$ bwa index data/ref_genome/ecoli_ref.fasta

## While the index is created, you will see output that looks something like this:

[bwa_index] Pack FASTA… 0.04 sec
[bwa_index] Construct BWT for the packed sequence…
[bwa_index] 1.05 seconds elapse.
[bwa_index] Update BWT… 0.03 sec
[bwa_index] Pack forward-only FASTA… 0.02 sec
[bwa_index] Construct SA from BWT and Occ… 0.57 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 1.765 sec; CPU: 1.715 sec

Align reads to reference genome

The alignment process consists of choosing a suitable reference genome to map our reads against and then choosing an aligner. We will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries as it is faster and more accurate.

An example of what a bwa command looks like is below. This command will not run, as we do not have the files ref_genome.fasta, input_file_R1.fastq, or input_file_R2.fastq.

$ bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam

We are running bwa with the default parameters here; your use case might require different parameters. NOTE: Always read the manual page for any tool before using it, and make sure the options you use are appropriate for your data.

We’re going to start by aligning the reads from just one of the samples in our dataset (SRR2584687). Later, we’ll be iterating this whole process on all of our sample files.

$ bwa mem data/ref_genome/ecoli_ref.fasta data/trimmed_fastq_small/SRR2584687_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584687_2.trim.sub.fastq > results/sam/SRR2584687.aligned.sam

##You will see output that starts like this:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 77446 sequences (10000033 bp)…
[M::process] read 77296 sequences (10000182 bp)…
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (48, 36728, 21, 61)
[M::mem_pestat] analyzing insert size distribution for orientation FF…
[M::mem_pestat] (25, 50, 75) percentile: (420, 660, 1774)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4482)
[M::mem_pestat] mean and std.dev: (784.68, 700.87)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 5836)
[M::mem_pestat] analyzing insert size distribution for orientation FR…

SAM/BAM format

The SAM file is a tab-delimited text file that contains information for each individual read and its alignment to the genome.
The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access to the data contained within the file.

The file begins with an optional header. The header describes the source of the data, the reference sequence, the method of alignment, etc.; its content changes depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner-specific information. The main fields of an example SAM entry are described below.

Read name (QNAME)
Sequence of the read (SEQ)
Encoded sequence quality (QUAL)
Chromosome to which the read aligns (RNAME)
Leftmost position in the chromosome to which the read aligns (POS)
Alignment information, the “CIGAR string” (CIGAR):
100M: continuous match of 100 bases (perfect match or mismatch)
28M1D72M: 28 bases continuously match, 1 deletion from reference, 72 bases match
Bit FLAG (FLAG): TRUE/FALSE for pre-defined read criteria, such as: is it paired? is it a duplicate? See https://broadinstitute.github.io/picard/explain-flags.html
Paired read position and insert size (ISIZE)
User-defined flags (optional fields)
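The FLAG and CIGAR fields are the least intuitive ones, so here is a small sketch that decodes them by hand. The bit meanings follow the SAM specification, but the function names explain_flag and cigar_ref_length and the short labels are our own. The sketch assumes a well-formed CIGAR string.

```shell
# Print one label per bit that is set in a SAM FLAG value.
explain_flag() {
    flag=$1
    for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                 16:reverse_strand 32:mate_reverse_strand 64:read1 128:read2 \
                 256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "$name"
        fi
    done
}

# Number of reference bases covered by a CIGAR string (M, D, N, = and X count).
cigar_ref_length() {
    printf '%s\n' "$1" | awk '{
        total = 0
        cigar = $0
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op ~ /[MDN=X]/) total += len
            cigar = substr(cigar, RLENGTH + 1)
        }
        print total
    }'
}

explain_flag 99              # paired, proper_pair, mate_reverse_strand, read1
cigar_ref_length 28M1D72M    # prints 101: the deletion also consumes a reference base
```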


We will convert the SAM file to BAM format using the samtools program with the view command and tell this command that the input is in SAM format (-S) and to output BAM format (-b):

$ samtools view -S -b results/sam/SRR2584687.aligned.sam > results/bam/SRR2584687.aligned.bam

Sort BAM file by coordinates

Next, we sort the BAM file using the sort command from samtools. -o  tells the command where to write the output.


$ samtools sort -o results/bam/SRR2584687.aligned.sorted.bam results/bam/SRR2584687.aligned.bam

To get summary statistics for your sorted BAM file, use the flagstat command:

$ samtools flagstat results/bam/SRR2584687.aligned.sorted.bam

#OUTPUT
351169 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1169 + 0 supplementary
0 + 0 duplicates
351103 + 0 mapped (99.98% : N/A)
350000 + 0 paired in sequencing
175000 + 0 read1
175000 + 0 read2
346688 + 0 properly paired (99.05% : N/A)
349876 + 0 with itself and mate mapped
58 + 0 singletons (0.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
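If you save the flagstat report to a file, the mapping rate can be recomputed from the "in total" and "mapped" lines, which is a handy sanity check when comparing many samples. The sketch below assumes the line layout shown above. The function name mapped_percent is hypothetical, and the two-line report in the demo uses made-up numbers purely for illustration.

```shell
# Read a flagstat report on standard input and print mapped / total as a percentage.
mapped_percent() {
    LC_ALL=C awk '$4 == "in" && $5 == "total" { total = $1 }
                  $4 == "mapped"              { mapped = $1 }
                  END { if (total > 0) printf "%.2f\n", 100 * mapped / total }'
}

# Hypothetical two-line report, for demonstration only.
printf '%s\n' \
    '200 + 0 in total (QC-passed reads + QC-failed reads)' \
    '150 + 0 mapped (75.00% : N/A)' | mapped_percent     # prints 75.00
```

On real data you could run something like samtools flagstat on the sorted BAM file and pipe its output into mapped_percent.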

v2- Align the reads to the contigs using BWA

$ bwa index kaya/LS0566-contigs.fa
$ bwa mem -t 2 kaya/LS0566-contigs.fa 25KLUK_4_1.fq.gz 25KLUK_4_2.fq.gz > kaya/25KLUK_4.sam
$ samtools sort -@ 2 -o kaya/25KLUK_4.bam kaya/25KLUK_4.sam
$ samtools index kaya/25KLUK_4.bam

Index the assembly FASTA file.

$ samtools faidx kaya/LS0566-contigs.fa

Viewing BAM file using samtools tview.


$ samtools tview kaya/25KLUK_4.bam kaya/LS0566-contigs.fa

You can browse your BAM file with IGV

4. Viewing with IGV

IGV is a genome browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like Ensembl or the UCSC browser, are slower but provide more functionality.

Locally on your own Mac or Windows computer

We need to open the IGV software. If you haven’t done so already, you can download IGV from the Broad Institute’s software page, double-click the .zip file to unzip it, and then drag the program into your Applications folder.

  1. Open IGV.
  2. Load our reference genome file (ecoli_ref.fasta) into IGV using the “Load Genomes from File…“ option under the “Genomes” pull-down menu.
  3. Load our BAM file (SRR2584687.aligned.sorted.bam) using the “Load from File…“ option under the “File” pull-down menu.

To load data from an HTTP URL:

  1. Select File>Load from URL.
  2. Enter the HTTP or FTP URL for a data file or sample information file.
  3. If the file is indexed, enter the index file name in the field provided.
  4. Click OK.

To load a file from Google Cloud Storage, enter the path to the file with the “gs://” prefix.
Load the following indexed and sorted BAM file with File > Load from URL: http://faculty.xxx.edu/~kaya/Workshop/results/SRR20372154.fastq.bam

Controlling IGV from R

You can open IGV from within R with startIGV("lm"). Note that this may not work on all systems. A test URL (xxx.edu) is given below; you can try it with your cluster URL.

library(SRAdb)
urls <- readLines("http://xxxx.edu/data/samples/bam_urls.txt")
#startIGV("lm") # opens IGV
sockiv <- IGVsocket()
session <- IGVsession(files=urls,
                      sessionFile="session.xml",
                      genome="A. thaliana (TAIR10)")
IGVload(sockiv, session)
IGVgoto(sockiv, 'Chr2:67296-3521')

I hope you find this tutorial useful to analyze your NGS data.
I would like to thank Dilek Koptekin ( @dilekopter ) for reviewing the pipeline. If you have any questions, please get in touch with us without hesitation.

References

Babraham Bioinformatics website. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 2013 Dec 1

Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38: 1767-1771. doi: 10.1093/nar/gkp1137. PubMed: 20015970.

Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM (2013) An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE 8(12): e85024. doi:10.1371/journal.pone.0085024

Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 10-12.

Zhang J, Chiodini R, Badr A, Zhang G. The impact of next-generation sequencing on genomics. J Genet Genomics. 2011;38(3):95–109. doi:10.1016/j.jgg.2011.02.003

miRDeep2 – miRNA Sequencing Analysis, Example Run by Using Ubuntu Terminal

microRNAs (miRNAs) belong to the family of small non-coding RNAs and regulate many processes in the body by regulating mRNAs. They are small RNAs, 20 to 25 nucleotides long. Because they are short and their sequences can differ by only a few nucleotides (e.g., by a single nucleotide, as in the case of SNPs), deep sequencing with high coverage is required to detect miRNAs and to identify novel sequences sensitively.

There are different tools available to investigate miRNAs, their structures, their expression profiles, and so on. Although RNA-sequencing technology is still in its teenage years (Stark et al., 2019), miRNA sequencing is even more “immature” than RNA-seq or scRNA-seq, and so are the tools available for miRNA-sequencing data analysis. Moreover, only a limited number of tools are available for the bioinformatics analysis of miRNA sequencing data (Motameny et al., 2010; Kang and Friedländer, 2015; Chen et al., 2019). miRDeep2 (Mackowiak, 2011; Friedländer et al., 2012; Yang et al., 2011) is one of the most commonly used and recently updated tools for detecting known (canonical) and novel (non-canonical) miRNA sequences. Although pipelines are available for miRNA sequencing, such as the ENCODE Project pipelines, bioinformatics tools such as miRDeep2 are easier to use for people coming from different scientific backgrounds.

Tutorials are provided on the miRDeep2 GitHub pages. There are two GitHub links (old, new) and therefore two different tutorials (old, new). Please make sure that you follow the tutorial on the newest GitHub page.

Although the tutorial is shared on the GitHub page, a practical example run might be useful for people who plan to use this tool for the first time. Therefore, I will share the required commands with you, with the warning that you need to be extra cautious.

Step 1: Download Ubuntu Terminal

This tool requires a Linux working environment. So, if you are using Windows, you need to install a program such as Ubuntu Terminal or a virtual machine (e.g., VirtualBox) to run the miRDeep2 package. To do this, open the Microsoft Store and choose to download Ubuntu (not the LTS ones but the terminal).

Step 2: Downloading miRDeep2 with conda install

If you try to install miRDeep2 without conda, you might encounter some problems. I strongly recommend using conda install. After installation, do not forget to test the Perl script mapper.pl.

 dincaslan@D:~$ sudo apt-get update
 dincaslan@D:~$ sudo apt-get upgrade
 dincaslan@D:~$ cd /mnt/c/Users/USER/Downloads/

#You need to open a new terminal here. You can follow the instructions given in this link. Because I want to download the files to Downloads in Windows instead of Linux, I specified the paths with "/mnt/c/Users/...".

 dincaslan@D:~$ sha256sum  /mnt/c/Users/USER/Downloads/Anaconda3-2019.10-Linux-x86_64.sh 
 dincaslan@D:/mnt/c/Users/USER/Downloads$ bash /mnt/c/Users/USER/Downloads/Anaconda3-2019.10-Linux-x86_64.sh
 dincaslan@D:/mnt/c/Users/USER/Downloads$ source ~/.bashrc
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda config --set auto_activate_base True
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda list
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda install -c bioconda mirdeep2
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ mapper.pl 

Step 3: Running the Tutorial for MiRDeep2

Before running your analysis, it would be better to test the tutorial run to make sure that everything is alright with the tool. You can download the mature and hairpin miRNA files from miRBase.

(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ cd drmirdeep.github.io-master/

#The cd command changes into the given path/directory. You need to choose the directory where you downloaded the tutorial files.
#ls is to list the files in the given folder

(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master$ ls
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master$ cd drmirdeep.github.io-master/
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ ls

#grep to check how many of the reads have the adapter sequence
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ grep -c TGGAATTC example_small_rna_file.fastq
2001
#do not forget to extract the relevant entries from the mature and hairpin miRNA files you downloaded from miRBase.

(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/mature.fa hsa > /mnt/c/Users/USER/Downloads/mature_hsa.fa  
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/hairpin.fa hsa > /mnt/c/Users/USER/Downloads/hairpin_hsa.fa  
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/mature.fa mmu,chi > /mnt/c/Users/USER/Downloads/mature_other_hsa.fa 

#to build index file via bowtie1
#make sure that you do not use the same name for the file you give as input, reference genome, and indexed output.

(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ bowtie-build refdb.fa refdb.fa

#to map the sample sequencing reads against the indexed genome file
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ mapper.pl example_small_rna_file.fastq -e -h -i -j -k TGGAATTC -l 18 -m -p refdb.fa -s reads_collapsed.fa -t reads_vs_refdb.arf -v -o 4

#to run the mirdeep2 analysis. You can find the detailed information regarding the parameters in the paper and the tutorial page.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ miRDeep2.pl reads_collapsed.fa refdb.fa reads_vs_refdb.arf mature_ref.fa mature_other.fa hairpin_ref.fa -t hsa 2>report.log
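A note on the adapter check earlier in this step: grep -c counts every line that contains the adapter, including header or quality lines that happen to match. The sketch below is a slightly stricter variant that looks only at the sequence line of each FASTQ record and also reports the total number of reads. It assumes four lines per record; the function name adapter_stats and the demo file are hypothetical.

```shell
# adapter_stats ADAPTER FASTQ: count reads whose sequence line contains ADAPTER.
adapter_stats() {
    awk -v adapter="$1" 'NR % 4 == 2 {
            total++
            if (index($0, adapter) > 0) hits++
        }
        END { printf "%d of %d reads contain %s\n", hits, total, adapter }' "$2"
}

# Tiny hypothetical file, created only for this demonstration.
tmp=$(mktemp -d)
printf '@r1\nACGTTGGAATTCAA\n+\nIIIIIIIIIIIIII\n@r2\nACGTACGTACGTAC\n+\nIIIIIIIIIIIIII\n' > "$tmp/demo.fastq"
adapter_stats TGGAATTC "$tmp/demo.fastq"   # prints: 1 of 2 reads contain TGGAATTC
```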

Step 4: Running the miRDeep2 for your sample

Before running miRDeep2, you might want to check the quality of your FASTQ files with FastQC. Although miRDeep2 has a built-in adapter trimming function, you might still need to use Cutadapt, depending on your data’s specific needs. I will share example commands showing how to install these tools and do the adapter trimming.

#for fastqc
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ sudo apt-get update
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ sudo apt-get install fastqc
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ fastqc --extract /mnt/c/Users/USER/Downloads/S26.fastq.gz -o /mnt/c/Users/USER/Downloads/fastqc_results

#for cutadapt and fastqc after
#Let's say your adapter sequence is this: TAGCTGATCGATCTGAAACT
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda install -c bioconda cutadapt
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ cutadapt -a TAGCTGATCGATCTGAAACT /mnt/c/Users/USER/Downloads/S26.fastq > /mnt/c/Users/USER/Downloads/outputS26.fastq
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ fastqc --extract /mnt/c/Users/USER/Downloads/outputS26.fastq -o /mnt/c/Users/USER/Downloads 

#before this step, you need to download a reference file in fasta/fa format.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ bowtie-build ucsc_hg19.fasta ucschg19

#You do not need to add .fa extension to file that you index
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ mapper.pl S26.fastq -e -h -i -j -k TAGCTGATCGATCTGAAACT -l 18 -m -p ucschg19 -s R___collapsed.fa -t R___refdb.arf -v -o 4

#You need to use index file as a reference here
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ miRDeep2.pl R___collapsed.fa ucsc_hg19.fasta R___refdb.arf mature_hsa.fa mature_other_hsa.fa hairpin_hsa.fa -t hsa 2> report.log

I hope you find this tutorial run useful. In addition to the websites given, whenever you have problems with a miRDeep2 run, I strongly recommend reading the documentation on the new GitHub page and the article, and checking Biostars for similar questions or problems (and asking there if necessary).

I would like to thank my dear labmate Daniel Muliaditan for helping me to remember/learn the basics of Linux and practice the miRDeep2 run in Ubuntu Terminal (with a convenient way of handling such problems: using conda install). I would also like to thank #AcademicTwitter, especially Dr. Ming Tang, for his extremely useful answer to my question 🙂

References:

Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genet 20, 631–656 (2019). https://doi.org/10.1038/s41576-019-0150-2

Motameny, S.; Wolters, S.; Nürnberg, P.; Schumacher, B. Next Generation Sequencing of miRNAs – Strategies, Resources and Methods. Genes 2010, 1, 70-84. https://doi.org/10.3390/genes1010070

Kang W, Friedländer MR. (2015) Computational prediction of miRNA genes from small RNA sequencing data. Front Bioeng Biotechnol 3: 7. https://doi.org/10.3389/fbioe.2015.00007

Liang Chen, Liisa Heikkinen, Changliang Wang, Yang Yang, Huiyan Sun, Garry Wong, Trends in the development of miRNA bioinformatics tools, Briefings in Bioinformatics, Volume 20, Issue 5, September 2019, Pages 1836–1852, https://doi.org/10.1093/bib/bby054

Mackowiak, S. D. Identification of novel and known miRNAs in deep-sequencing data with miRDeep2. Curr Protoc Bioinformatics Chapter 12, Unit 12.10, https://doi.org/10.1002/0471250953.bi1210s36 (2011).

Xiaozeng Yang, Lei Li, miRDeep-P: a computational tool for analyzing the microRNA transcriptome in plants, Bioinformatics, Volume 27, Issue 18, 15 September 2011, Pages 2614–2615, https://doi.org/10.1093/bioinformatics/btr430

Marc R. Friedländer, Sebastian D. Mackowiak, Na Li, Wei Chen, Nikolaus Rajewsky, miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades, Nucleic Acids Research, Volume 40, Issue 1, 1 January 2012, Pages 37–52, https://doi.org/10.1093/nar/gkr688

https://www.encodeproject.org/microrna/microrna-seq/

Molecular Simulations as an in silico Experiment – Seyit Kale

Presenter

Seyit Kale

Seyit Kale received his Bachelor of Science in Physics from İhsan Doğramacı Bilkent University in 2006, and his Ph.D. in Biophysics and Structural Biology from Brandeis University in 2012. He continued his postdoctoral studies at the University of Chicago, developing methods for computational chemistry. Later in 2015, he became a visiting fellow at the National Institutes of Health in Bethesda, Maryland, where he developed and pursued an interest in the physics of chromatin and epigenetics. He joined Izmir Biomedicine and Genome Center in late 2019 as a research group leader where he is currently running a lab in computational biophysics.

Abstract

Structural studies in biology provide invaluable insights into what the molecular machines inside our cells look like, yet the stories are often far from over. The set of atomic coordinates of a macromolecule is like a picture: it’s worth a thousand words. Then again, a picture lacks the temporal information which underlies the dynamic personalities of the molecule. Over the last several decades, exponential growth in computing power drew increasingly more physical scientists toward questions of life sciences. Faster algorithms and more accurate interaction potentials have been developed to propagate the Newtonian equations of motion in length- and timescales relatable to biological phenomena. In this lecture, I will discuss an often frowned upon analogy, i.e., how a molecular simulation can be thought of as an in silico experiment, by providing a historical and physical perspective.

Date: March 5th, 2020 – 5:00 pm (GMT+3)

Language: English

To register for the webinar, you can visit this link:
https://www.bigmarker.com/bioinfonet/Molecular-Simulations-as-an-in-silico-Experiment

Do we have Big Data in Life Sciences? – Nikolay Oskolkov

Presenter

Nikolay Oskolkov

I am a SciLifeLab bioinformatician from Lund University doing various types of analyses in Computational Biology. Originally from Theoretical Physics (PhD 2007), I switched to Biological and Life Sciences in 2012 and worked in biomedical research; I am now expanding towards evolutionary science and data science.

Abstract

Growing amounts of Next Generation Sequencing (NGS) data in Life Sciences provide new opportunities as well as pose a number of analytical challenges for Computational Biology and Bioinformatics. One of them is application of advanced methodologies such as Machine and Deep Learning that are ideally suited to address the massive amounts of data. In this webinar I will give an overview of some applications of Artificial Neural Networks (ANNs) to Single Cell Genomics, Microscopy Imaging and Genomics / Ancient DNA research areas.

Date: December 17th, 2019 – 4:00 pm

Language: English

Workshop

Scientific Figure Design workshop presentation is available now!

You can download the PDF file from here.
Also, if you want to access the bookdown version, you can click here.
created by: Handan Melike Dönertaş

Starting with our next student symposium, we are planning to organize workshops. Let us know your favorite workshop topics and help us organize something that interests you!

Aspects of High Throughput Molecular Data Analysis

Title:

Aspects of High Throughput Molecular Data Analysis

Presenter:

Arif Harmancı

Abstract:

Molecular information acquisition is gaining a strong presence in every aspect of life. Much of this stems from the decreasing cost of DNA sequencing and computational power. Coupled with the data, computational methods are generating waves of interesting results. Interestingly, we are far from making sense of the extremely complicated molecular data. The more data we generate, the more we realize the complexity of cellular information processing. These developments bring many exciting opportunities and challenges. In this presentation, I will review different aspects of how triage of genomics, transcriptomics, epigenetics is changing the way we understand how molecular health translates to individual health. I will also review some of the future challenges related to high throughput data acquisition and data analysis.

Date: 7 January 2019 – 8:00 PM on BioInfoNet

Language: English

Youtube: https://www.youtube.com/watch?v=byv0BT9shwA&w=560&h=315

RSG-Turkey is a member of The International Society for Computational Biology (ISCB) Student Council (SC) Regional Student Groups (RSG). We are a non-profit community composed of early career researchers interested in computational biology and bioinformatics.

Contact: turkey.rsg@gmail.com

Follow us on social media!