REVIEW: A Brief Introduction to Microencapsulation

Introduction

The containment of a core material inside a small capsule is called microencapsulation. A polymeric material coats liquid or solid substances to protect them from the surrounding environment1. Microcapsule sizes vary between 50 nm and 2 mm [2]. A microcapsule's size and structure differ according to whether the core material is solid, liquid, or gas, as shown in Figure 1 [2].

Figure 1: (a) Mononuclear microcapsules carrying solid material, (b) Aggregated microcapsules carrying liquid material2.
Figure 2: Schematic presentation of a microcapsule2.

The coating material must adhere to the core material in order to cover it properly. It must also complement the core material in terms of the required strength, flexibility, impermeability, optical properties, and stability, and its release must be controllable under the required conditions1.

Figure 3: Coating material examples1

Water Soluble Materials | Water Insoluble Materials | Waxes and Lipid Materials
Gelatin | Calcium alginate | Paraffin
Gum Arabic | Polyethylene | Carnauba
Starch | Polyamide (Nylon) | Spermaceti
Polyvinylpyrrolidone | Silicones | Beeswax
Polyacrylic acid | Polymethacrylate | Stearic acid
Carboxymethyl-cellulose | Cellulose nitrate | Glyceryl stearates
Figure 4: Alginate-coated adipose stem cells extracted from (A) rat and (B) human3

Figure 5: Confocal laser scanning microscope image of rhodamine-labeled hydrogel microcapsules4.

Method

The microencapsulation of adipose stem cells with an alginate coating is shown in Figure 6. The cross-linking solution contains calcium chloride and glucose and is buffered with HEPES. Calcium chloride provides divalent cations to alginate during cross-linking. Glucose maintains the physiological osmolality of the cross-linking solution for the adipose stem cells. HEPES is used to maintain the pH at or below 7.3 [3].

Figure 6: Schematic presentation of the method used for microencapsulation of adipose stem cells3.

The generation of hydrogel microcapsules with a microfluidic system is shown in Figure 7. Oligosaccharides and peptide–starPEG were introduced through two distinct channels. The flow rates of the oil phase and of the oligosaccharide and peptide–starPEG solutions were set to obtain the required droplet formation4.

Figure 7: Scheme of the microfluidic system used for hydrogel microcapsule generation4.

Conclusion 

Microencapsulation can be used to encapsulate different materials; therefore, it is useful for the treatment of different diseases that occur in various tissues. There are various methods for making microcapsules, and the generation method must be chosen carefully according to the materials the microcapsule is made of. Microcapsules can be used to deliver drug molecules and various cell types into the targeted tissue. As technology improves, microencapsulation methods will also improve and become more effective.

References

1. MICROENCAPSULATION. Int J Pharm Sci Rev Res. 2010;5(2):58-62.

2. Singh MN, Hemant KSY, Ram M, et al. Microencapsulation: A promising technique for controlled drug delivery. Res Pharm Sci. 2010;5(2):65-77.

3. Leslie SK, Kinney RC, Schwartz Z, et al. Microencapsulation of Stem Cells for Therapy. In: Vol 1479. 2017:225-235. doi:10.1007/978-1-4939-6364-5

4.  Wieduwild R, Krishnan S, Chwalek K, et al. Noncovalent Hydrogel Beads as Microcarriers for Cell Culture. Angew Chemie. 2015;127(13):4034-4038. doi:10.1002/ange.201411400

A short review of RNA sequencing and its applications

What are the omics sciences?

Omics sciences target the quantification of entire classes of biomolecules, such as RNAs and proteins, at the organism, tissue, or single-cell level. Omics sciences are separated into several branches such as genomics, transcriptomics, and proteomics1.

What is transcriptomics?

Transcriptomics is one of the omics sciences; it dissects an organism's transcriptome, which is the sum of all of its RNA molecules2,3.

What is RNA sequencing?

RNA sequencing (RNA-seq) is a technique for quantifying all RNAs in bulk tissue or in individual cells. The transcript amounts of each gene across samples are calculated by using this technique. It utilizes next-generation sequencing (NGS) platforms, which decipher the sequences of biomolecules such as DNA and RNA4,5.

What are the kinds of RNA-seq?

Bulk tissue RNA-seq

The whole transcriptome of the target bulk tissue is sequenced for transcriptomics analyses. Here, the target bulk tissue can contain various cell types, and therefore the whole transcriptome is a mixture of the RNAs of those cells. This approach is the most common usage of RNA-seq and is performed for aims such as elucidating diseases7.

Single-cell RNA-seq

In contrast to bulk tissue RNA-seq, single-cell RNA-seq (scRNA-seq) is performed on individual cells. The whole transcriptome of each cell in a tissue is sequenced for transcriptomics analysis. scRNA-seq has revealed that the transcriptome of each cell in a tissue differs from the others and that individual cells can be separated into specific clusters according to their transcriptomic signatures. scRNA-seq has helped the discovery of cell types such as ionocytes, which could be relevant to the pathology of cystic fibrosis7,8.

Spatial RNA-seq

The relationship between cells and their relative locations within a tissue sample can be critical to understanding disease pathology. Spatial transcriptomics is a technology that allows the measurement of all gene activity in a tissue sample and maps where that activity is occurring. This technique is utilized in the understanding of biological processes and disease. Spatial RNA-seq can be performed on intact tissue sections as well as at the single-cell level. The general aim of this technique is to combine gene expression with morphological information, providing information on tissue architecture and micro-environment for the generation of sub-cellular data. Current bulk and scRNA-seq methods provide users with highly detailed data regarding tissues or cell populations but do not capture spatial information7,9,10.

RNA-seq analysis work-flow

1) Experimental design

There are various library types in RNA-seq, resulting in sequencing reads (sequenced transcripts) with different characteristics. For instance, reads can be single-end, in which a transcript fragment is read from only one end (5’ or 3’); in paired-end libraries, however, a fragment is read from both its 5’ and 3’ ends. Paired-end sequencing can additionally help disambiguate read mappings and is preferred for alternative-exon quantification and fusion transcript detection, particularly when working with poorly annotated transcriptomes7. In addition, libraries can be stranded or unstranded. The strandedness of a library determines which DNA strand reads come from and is used to assign reads to the relevant genes. If the strandedness information of a library is specified incorrectly, reads are not assigned to the correct genes, and thus the gene expression results will be wrong11. Besides, technical replicates, in which one sample is sequenced more than once on the same high-throughput platform, can be used to help eliminate technical bias.
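As an illustration of how strandedness is declared downstream, the counting tool featureCounts17 exposes it through its -s option (0 = unstranded, 1 = stranded, 2 = reversely stranded); a minimal, hedged sketch in which annotation.gtf and sample.bam are placeholder file names:

# Count read pairs (-p) from a reversely stranded library (-s 2) against a GTF annotation
featureCounts -p -s 2 -a annotation.gtf -o counts.txt sample.bam

If the wrong -s value is given, many reads fail to be assigned to their genes, which is one quick way to notice a strandedness mistake.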

2) Laboratory performance

After RNA extraction from all samples, libraries are prepared for sequencing according to the selected library type. Once the library type has been chosen, libraries are sequenced to a read depth of 10–30 million reads per sample on a high-throughput platform7.

3) Data analysis

After sequencing has been completed, the starting point for analysis is the data files, which contain base-called sequencing reads, usually in FASTQ format. Reads of poor quality in the FASTQ files are eliminated before the alignment process, in which raw sequences are aligned to a reference genome to find their relevant genes. Each sequence read is converted to one or more genomic coordinates, and Sequence Alignment Map (SAM) files containing those coordinates are obtained after the alignment process7,12. This process has traditionally been accomplished using distinct alignment tools, such as TopHat13, STAR14, or HISAT15, which rely on a reference genome. The SAM files are converted to Binary Alignment Map (BAM) files for further analyses because of their large size, and this conversion is carried out using Samtools16. After the alignment and file conversion steps, transcript quantification across samples is performed using tools such as featureCounts17 to obtain an expression matrix in which each row corresponds to a gene and each column to a sample7. Normalization of transcript abundance across samples is then performed on the expression matrix to lessen technical differences in gene expression ranges between samples7,18,19. Normalization methods are shown in (Figure 1)20.
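A minimal command-line sketch of this alignment, conversion, and counting flow, assuming a paired-end sample, a prebuilt HISAT2 index (genome_index), and placeholder file names:

# Align reads to the reference genome with HISAT2; output is in SAM format
hisat2 -x genome_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -S sample.sam

# Convert SAM to the smaller, indexable BAM format and sort by coordinate
samtools view -bS sample.sam | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam

# Count read pairs per gene to build one column of the expression matrix
featureCounts -p -a annotation.gtf -o counts.txt sample.sorted.bam

Repeating the featureCounts call with all sample BAM files in one command produces the full expression matrix described above.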


Figure 1. Normalization methods that are used in RNA-seq analyses.

After the normalization step, genes with low expression across samples are filtered out to prevent statistical noise7, and then statistically meaningful genes (namely, differentially expressed genes) can be detected using tools such as edgeR21 and DESeq222. In the end, the obtained genes can be used for enrichment analyses, such as KEGG and Reactome, to find out which pathways are affected. RNA-seq technology is utilized for distinct aims, some of which are shown in (Figure 2). Representations of RNA-seq results are shown in (Figure 3).


Figure 2. RNA-seq usage fields.



Figure 3. Representation of differential expression, splicing, and co-expression results. In the differential expression figure, each row represents the expression level of a gene and each column represents a sample. The red color shows higher expression, while the yellow color shows lower expression. In the co-expression figure, a network containing the interactions of each gene with other genes is depicted. In the differential alternative splicing figure, differential usage of the E010 exon between control and knockdown groups is depicted.

A detailed RNA-seq work-flow is shown in (Figure 4)12.


Figure 4. An example of differential expression work-flow.

The various tools that are used for RNA-seq and their tutorials are listed below, as well as visualization tools that are used for high-throughput data.

Table 1. List of RNA-seq tools and their usage fields.

Tool names Usage Tutorial Link
DESeq222 Differential expression https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
edgeR21 Differential expression https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
DEXSeq23 Differential splicing https://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.html
WGCNA24 Co-expression https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/
GATK25 Variant-calling https://gatk.broadinstitute.org/hc/en-us

Table 2. List of high-throughput visualization and enrichment tools.

Tool names Usage
pheatmap26 Heatmap plot for differentially expressed genes
ggplot227 Most various visualizations ranging from bar charts to violin plots
igraph28 Network visualization for co-expression networks and other network types
Enrichr29 Enrichment analysis of genes
DAVID30 Enrichment analysis of genes

Note: Most of the listed tools are dependent on the R statistical computing environment.

Table 3. Examples of differential expression work-flows.

Examples Links
Example 1 https://www.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html
Example 2 https://digibio.blogspot.com/2017/11/rna-seq-analysis-hisat2-featurecounts.html
Example 3 https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.html
Example 4 https://uclouvain-cbio.github.io/BSS2019/rnaseq_gene_summerschool_belgium_2019.html

In addition to the differential expression pipelines above, if you want to examine my pipeline containing differential expression analysis with DESeq2, you can visit https://github.com/kaanokay/Differential-Expression-Analysis/blob/master/HISAT2-featureCounts-DESeq2-workflow.md, where I have attached my Linux and R scripts.

Transcriptome researches in autism spectrum disorder

Autism Spectrum Disorder (ASD) is an early-onset neuropsychiatric disorder. ASD is clinically described by behavioural abnormalities such as restricted interests and repetitive behaviour. ASD is genetically heterogeneous and heritable (~50%), and 80% of its genetic background is unclear. Aberrations in autistic brains take place mostly in cortical regions (Figure 5) rather than the cerebellum. When ASD is compared with other neuropsychiatric disorders such as schizophrenia and bipolar disorder, it has a higher heritability rate, which means that it arises from a stronger genetic background than schizophrenia and bipolar disorder. Studies have revealed that ASD-related genes are enriched in brain development, neuronal activity, signalling, and transcriptional regulation. Wnt signalling, synaptic function, and translational regulation are pathways that are affected by mutations in ASD-related genes31.


Figure 5. Brain regions most affected in autism.

Transcriptome studies have shown that misexpression of mRNAs, microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), and long non-coding RNAs (lncRNAs) occurs in autistic brains. Genes with mRNA misregulation are especially enriched in immune and neuronal pathways; in brief, neuronal development and immune system activation are both misregulated in the brains of individuals with ASD. Misregulated miRNAs in autistic brains mostly target genes with synaptic functions. Additionally, alternative splicing of splicing regulators is misregulated, and this causes mis-splicing patterns in autistic individuals31.

To summarize, RNA-seq is a powerful technology for understanding diseases, and it can be used for various aims.

That’s all 🙂

If you have any questions about this short review or my differential expression pipeline on GitHub, please feel free to contact me via the kaan.okay@msfr.ibg.edu.tr e-mail address.

Many thanks for your interest and time!

REFERENCES

1) https://en.wikipedia.org/wiki/Omics.

2) https://en.wikipedia.org/wiki/Transcriptomics_technologies.

3) https://en.wikipedia.org/wiki/Transcriptome.

4) Kadakkuzha, B. M., Liu, X. an, Swarnkar, S. & Chen, Y. Genomic and proteomic mechanisms and models in toxicity and safety evaluation of nutraceuticals. in Nutraceuticals: Efficacy, Safety and Toxicity 227–237 (Elsevier Inc., 2016). doi:10.1016/B978-0-12-802147-7.00018-8.

5) Behjati, S. & Tarpey, P. S. What is next generation sequencing? Arch. Dis. Child. Educ. Pract. Ed. 98, 236–238 (2013).

6) https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/performing-rna-seq.

7) Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. 20, 631–656 (2019).

8) https://en.wikipedia.org/wiki/Single_cell_sequencing.

9) https://www.10xgenomics.com/spatial-transcriptomics/.

10) https://www.diva-portal.org/smash/get/diva2:1068517/FULLTEXT01.pdf.

11) https://salmon.readthedocs.io/en/latest/library_type.html.

12) https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/RNA-SeqIntro/RNAseq-using-a-genome.html.

13) Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).

14) Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

15) Kim, D., Langmead, B. & Salzberg, S. L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).

16) Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

17) Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

18) Evans, C., Hardin, J. & Stoebel, D. M. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief. Bioinform. 19, 776–792 (2018).

19) Liu, X. et al. Normalization Methods for the Analysis of Unbalanced Transcriptome Data: A Review. Front. Bioeng. Biotechnol. 7, 1–11 (2019).

20) https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html.

21) Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2009).

22) Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, (2014).

23) Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-Seq data. Nat. Preced. 1–30 (2012) doi:10.1038/npre.2012.6837.2.

24) Langfelder, P. & Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics 9, (2008).

25) McKenna, A. et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

26) https://cran.r-project.org/web/packages/pheatmap/pheatmap.pdf.

27) https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf.

28) https://cran.r-project.org/web/packages/igraph/igraph.pdf.

29) https://amp.pharm.mssm.edu/Enrichr/.

30) https://david.ncifcrf.gov/.

31) Quesnel-Vallières, M., Weatheritt, R. J., Cordes, S. P. & Blencowe, B. J. Autism spectrum disorder: insights into convergent mechanisms from transcriptomics. Nat. Rev. Genet. 20, 51–63 (2019).

The Mapping Pipeline of the Next Generation Sequencing Data

Next-generation sequencing (NGS) enables high-throughput detection of DNA sequences in genomic research. NGS technologies are implemented for several applications, including whole-genome sequencing, de novo assembly sequencing, resequencing, and transcriptome sequencing at the DNA or RNA level. In order to sequence longer sections of DNA, a new approach called shotgun sequencing (Venter et al., 2003; Margulies et al., 2005; Shendure et al., 2005) was developed during the Human Genome Project (HGP). In this approach, genomic DNA is enzymatically or mechanically broken down into smaller fragments and cloned into sequencing vectors in which the cloned DNA fragments can be sequenced individually. Detecting abnormalities across the entire genome (whole-genome sequencing only), including substitutions, deletions, insertions, duplications, copy number changes (gene and exon) and chromosome inversions/translocations, is possible with the help of the NGS approach. Thus, shotgun sequencing has significant advantages over the original sequencing methodology, Sanger sequencing, which requires a specific primer to start the read at a specific location along the DNA template and records a different label for each nucleotide within the sequence.

The aim of this study is to build a general workflow for mapping the short-read sequences that come from an NGS machine.

Before the analysis of NGS data with publicly or commercially available algorithms and tools, we need to know about some features of the NGS raw data.

The raw data from a sequencing machine are most widely provided as FASTQ (unaligned sequences) files, which include sequence information, similar to FASTA files, but additionally contain further information, including sequence quality information. A FASTQ file consists of blocks, corresponding to reads, and each block consists of four elements in four lines.  

Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description
Line 2 is the raw sequence letters
Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence

For instance:
@HS2000-306_201:6:1204:19922:79127/1

Column | Brief Description
HS2000-306_201 | The instrument name
6 | Flowcell lane
1204 | Tile number within the flowcell lane
19922 | x-coordinate of the cluster within the tile
79127 | y-coordinate of the cluster within the tile
1 | The member of a pair, 1 or 2 (paired-end)

ACGTCTGGCCTAAAGCACTTTTTCTGAATTC…  Sequence
+
BC@DFDFFHHHHHJJJIJJJJJJJJJJJJJJJJJJJJJH…  Base Qualities

1. Quality Control

Quality control is the most important step in improving raw data by removing any identifiable errors from it. By applying QC at the beginning of the analysis, the chances that contamination, imprecision, errors, and missing data will affect the results are reduced.

Quality (Q) is related to the probability (e) of a sequenced base being wrong:
Phred-scaled Q = -10*log10(e)
Encoded base quality = ASCII character with code (Phred-scaled Q + 33)
e: base-calling error probability
The SAM/FASTQ encoding adds 33 to the value because ASCII 33 is the first visible character.

source: https://drive5.com/usearch/manual/quality_score.html



Phred Quality Score | Probability of Incorrect Base Call | Base Call Accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1,000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%
60 | 1 in 1,000,000 | 99.9999%
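To see these scores for a real file, the encoded quality characters can be decoded directly on the command line; a minimal sketch in which reads.fastq is a placeholder file name:

# Every 4th line of a FASTQ record is the quality string; ord(character) - 33 gives the Phred score
$ perl -ne 'if ($. % 4 == 0) { chomp; print join(" ", map { ord($_) - 33 } split //), "\n"; }' reads.fastq | head -n 1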

The most commonly used tool for assessing and visualizing the quality of FASTQ data is FastQC (Babraham Bioinformatics, n.d.), which provides comprehensive information about data quality, including base sequence quality scores, GC content information, sequence duplication levels, and overrepresented sequences. There are some alternatives to FastQC, such as PRINSEQ, fastqp, NGS QC Toolkit, and QC-Chain.

Running FastQC

1- To run the FastQC program on the desktop, you can use File > Open to select the sequence file you want to check.

2- To run the FastQC program on a cluster, we would normally have to tell our computer where the program is located:

$ which fastqc

/usr/local/bin/fastqc

FastQC can accept multiple filenames as input, so we can use the *.fastq.gz wildcard to run FastQC on all of the FASTQ files in this directory.

$ fastqc *.fastq.gz

You will see an automatically updating output message telling you the progress of the analysis. It will start like this:

Started analysis of SRR2584863_1.fastq
Approx 5% complete for SRR2584687_1.fastq
Approx 10% complete for SRR2584687_1.fastq
Approx 15% complete for SRR2584687_1.fastq
Approx 20% complete for SRR2584687_1.fastq
Approx 25% complete for SRR2584687_1.fastq
Approx 30% complete for SRR2584687_1.fastq
Approx 35% complete for SRR2584687_1.fastq
Approx 40% complete for SRR2584687_1.fastq
Approx 45% complete for SRR2584687_1.fastq

For each input FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. We’ll be working with these output files soon. The .html file is a stable webpage displaying the summary report for each of our samples.

We want to keep our data files and our results files separate, so we will move these output files into a new directory within our results/  directory. If this directory does not exist, we will have to create it.

## -p flag stops a message from appearing if the directory already exists
$ mkdir -p ~/kaya/results
$ mv *.html ~/kaya/results/
$ mv *.zip ~/kaya/results/

It can be quite tedious to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.
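Before inspecting the reports on the command line, the .zip archives need to be decompressed; a small loop (using the results/ path from above) does this for every sample:

## unzip handles one archive at a time, so loop over all of them
$ cd ~/kaya/results/
$ for filename in *.zip; do unzip "$filename"; done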

The .html files and the uncompressed .zip files are still present, but now we also have a new directory for each of our samples. We can see for sure that it’s a directory if we use the -F flag for ls.

$ ls -F

SRR2584869_1_fastqc/      SRR2584866_1_fastqc/      SRR2589044_1_fastqc/
SRR2584869_1_fastqc.html  SRR2584866_1_fastqc.html  SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip   SRR2584866_1_fastqc.zip   SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc/      SRR2584866_2_fastqc/      SRR2589044_2_fastqc/
SRR2584863_2_fastqc.html  SRR2584866_2_fastqc.html  SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip   SRR2584866_2_fastqc.zip   SRR2589044_2_fastqc.zip

Let’s see what files are present within one of these output directories.

$ ls -F SRR2584869_1_fastqc/

fastqc_data.txt  fastqc.fo fastqc_report.html Icons/ Images/  summary.txt

Use less to preview the summary.txt file for this sample.

$ less SRR2584869_1_fastqc/summary.txt 

PASS    Basic Statistics        SRR2584869_1.fastq
PASS    Per base sequence quality       SRR2584869_1.fastq
PASS    Per tile sequence quality       SRR2584869_1.fastq
PASS    Per sequence quality scores     SRR2584869_1.fastq
WARN    Per base sequence content       SRR2584869_1.fastq
WARN    Per sequence GC content SRR2584869_1.fastq
PASS    Per base N content      SRR2584869_1.fastq
PASS    Sequence Length Distribution    SRR2584869_1.fastq
PASS    Sequence Duplication Levels     SRR2584869_1.fastq
PASS    Overrepresented sequences       SRR2584869_1.fastq
WARN    Adapter Content SRR2584869_1.fastq

Finally, we can make a report of the results we got for all our samples by concatenating all of our summary.txt files into a single file using the cat command.

$ cat */summary.txt > ~/kaya/results/fastqc_summaries.txt
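To quickly spot problematic modules across all samples without opening each report, one option is to search the combined summary for warnings and failures:

## List every FastQC module flagged as WARN or FAIL, together with its sample name
$ grep -E 'WARN|FAIL' ~/kaya/results/fastqc_summaries.txt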

For more information, please see the FastQC documentation here

Additionally, the MultiQC tool has been designed to combine QC reports into a single report that is easy to analyze:

$ multiqc .
$ multiqc --help

Another way to check your NGS data quality is to work in RStudio.
fastqcr can be installed from CRAN as follows:

install.packages("fastqcr")

(Figures: example FastQC per-base quality plots — good quality vs. bad quality.)

2. Trimming Low-quality Reads and Adapters

Trimming is the second step in analyzing NGS data. It has been broadly embraced in most recent NGS studies, specifically prior to genome assembly, transcriptome assembly, metagenome reconstruction, gene expression analysis, epigenetic studies, and comparative genomics. Neglecting the presence of low-quality base calls may, in fact, be harmful to any NGS analysis, as it may add unreliable and potentially random sequences to the dataset. This may constitute a relevant problem for any downstream analysis pipeline and lead to false conclusions from the data. Also, adapter contamination can lead to NGS alignment errors and an increased number of unaligned reads, since the adapter sequences are synthetic and do not occur in the genomic sequence. There are applications (e.g., small RNA sequencing) where adapter trimming is highly necessary: with a fragment size of around 24 nucleotides, one will definitely sequence into the 3′ adapter. But there are also applications (transcriptome sequencing, whole-genome sequencing, etc.) where adapter contamination can be expected to be so small (due to an appropriate size selection) that one could consider skipping the adapter removal and thereby save time and effort. There are many tools to handle trimming, namely AfterQC, Cutadapt, Trimmomatic, ERNE-FILTER, ConDeTri, Sickle, SolexaQA, AlienTrimmer, Skewer, BBDuk, the FASTX-Toolkit, and Trim Galore.

In the present work, we want to describe the basic commands for improving your NGS data quality and authenticity using the Cutadapt trimming tool.

When processing paired-end data, Cutadapt trims both reads of each pair. To facilitate this, provide two input files and a second output file with the -p option (this is the short form of --paired-output). This is the basic command-line syntax:

$ cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq

Here, the input reads are in reads.1.fastq and reads.2.fastq, and the result will be written to out.1.fastq and out.2.fastq.

In paired-end mode, the options -a, -b, -g and -u that also exist in single-end mode are applied to the forward reads only. To modify the reverse read, these options have uppercase versions -A, -B, -G and -U that work just like their counterparts. In the example above, ADAPTER_FWD will therefore be trimmed from the forward reads and ADAPTER_REV from the reverse reads.

The -q (or --quality-cutoff) parameter can be used to trim low-quality ends from reads. If you specify a single cutoff value, the 3’ end of each read is trimmed:

$ cutadapt -q 20 -o output.fastq input.fastq

It is also possible to trim from the 5’ end by specifying two comma-separated cutoffs as 5’ cutoff, 3’ cutoff. For example,

$ cutadapt -q 15,10 -o output.fastq input.fastq

will quality-trim the 5’ end with a cutoff of 15 and the 3’ end with a cutoff of 10. To only trim the 5’ end, use a cutoff of 0 for the 3’ end, as in -q 15,0.

Interleaved paired-end reads

Paired-end reads can be read from a single FASTQ file in which the entries for the first and second read from each pair alternate. The first read in each pair comes before the second. Enable this file format by adding the –interleaved option to the command-line. For example:

$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.fastq reads.fastq

To read from an interleaved file, but write regular two-file output, provide the second output file as usual with the -p option:

$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq -p trimmed.2.fastq reads.fastq

Reading two-file input and writing interleaved is also possible by providing a second input file:

$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq reads.1.fastq reads.2.fastq

Trimming paired-end reads separately

If you want to quality-trim the first read in each pair with a threshold of 20 and the second read in each pair with a threshold of 10, the commands could be:

$ cutadapt -q 20 -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
$ cutadapt -q 10 -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq

If one end of a paired-end read has more than 5% ‘N’ bases, the read pair can be removed. For this, Cutadapt provides the following options for dealing with N bases in your reads:

--max-n COUNT
Discard reads containing more than COUNT N bases. A fractional COUNT between 0 and 1 can also be given and will be treated as the proportion of maximally allowed N bases in the read.
--trim-n
Remove flanking N bases from each read. That is, a read such as this:

NNACGTACGTNNNN
is trimmed to just ACGTACGT; only the flanking Ns are removed. This option is applied after adapter trimming. If you want to get rid of N bases before adapter removal, use quality trimming: N bases typically also have a low quality value associated with them.
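For instance, both options can be combined with adapter trimming in a single call; a minimal sketch with placeholder file and adapter names:

# Trim flanking Ns, discard reads with more than 5% N bases, and remove the 3' adapter
$ cutadapt --trim-n --max-n 0.05 -a ADAPTER_FWD -o output.fastq input.fastq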

Finally, with the --pair-adapters option, Cutadapt can treat two sets of adapters (one for the forward reads, one for the reverse reads) as matched pairs:

An example:

$ cutadapt --pair-adapters -a AAAAA -a GGGG -A CCCCC -A TTTT -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq

Here, the adapter pairs are (AAAAA, CCCCC) and (GGGG, TTTT). That is, paired-end reads will only be trimmed if either

  • AAAAA is found in R1 and CCCCC is found in R2,
  • or GGGG is found in R1 and TTTT is found in R2.

For detailed information, please see the Cutadapt documentation

3. Aligned sequences – SAM/BAM format

Now, the filtered reads of each sequencing sample are ready to be assigned their exact locations on the corresponding reference genome. Alternatively, you can find these locations using de novo assembly.

A reference genome is a collection of contigs.
● A contig is a contiguous consensus sequence assembled from overlapping DNA reads, encoded as A, G, C, T or N
● A reference genome typically comes in FASTA format:
○ the “>” line contains information on the contig

There are a number of tools to choose from and, while there is no golden rule, some tools are better suited to particular NGS analyses; to name a few: BWA, Bowtie2, SOAP, Novoalign, and MUMmer. After aligning, a Sequence Alignment Map (SAM) file is produced. This format is used for storing large nucleotide sequence alignments. The binary version of a SAM file is termed a Binary Alignment Map (BAM) file; BAM files store aligned reads and are technology independent. The SAM/BAM file consists of a header and an alignment section.

We will be using the Burrows Wheeler Aligner (BWA), which is a software package for mapping short-read sequences against a reference genome.

The alignment process consists of two steps:

  1. Indexing the reference genome
  2. Aligning the reads to the reference genome

Firstly, we create a new folder and download our reference genome from our source.

$ cd ~/kaya
$ mkdir -p data/ref_genome

$ curl -L -o data/ref_genome/ecoli_ref.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
$ gunzip data/ref_genome/ecoli_ref.fasta.gz

We will also download a set of trimmed FASTQ files to work with.

$ curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
$ tar xvf sub.tar.gz
$ mv sub/ ~/kaya/data/trimmed_fastq_small

You also need to create multiple directories for the results that will be generated as part of this workflow.

$ mkdir -p results/sam results/bam

Index the reference genome

Our first step is to index the reference genome for use by BWA. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment. Indexing the reference only has to be run once. The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different tool for alignment.

$ bwa index data/ref_genome/ecoli_ref.fasta

## While the index is created, you will see output that looks something like this:

[bwa_index] Pack FASTA… 0.04 sec
[bwa_index] Construct BWT for the packed sequence…
[bwa_index] 1.05 seconds elapse.
[bwa_index] Update BWT… 0.03 sec
[bwa_index] Pack forward-only FASTA… 0.02 sec
[bwa_index] Construct SA from BWT and Occ… 0.57 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 1.765 sec; CPU: 1.715 sec

Align reads to reference genome

The alignment process consists of choosing a suitable reference genome to map our reads against and then choosing on an aligner. We will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries as it is faster and more accurate.

An example of what a bwa command looks like is below. This command will not run, as we do not have the files ref_genome.fasta, input_file_R1.fastq, or input_file_R2.fastq.

$ bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam

We are running bwa with the default parameters here; your use case might require a change of parameters. NOTE: Always read the manual page for any tool before using it and make sure the options you use are appropriate for your data.

We’re going to start by aligning the reads from just one of the samples in our dataset (SRR2584687). Later, we’ll be iterating this whole process on all of our sample files.

$ bwa mem data/ref_genome/ecoli_ref.fasta data/trimmed_fastq_small/SRR2584687_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584687_2.trim.sub.fastq > results/sam/SRR2584687.aligned.sam

##You will see output that starts like this:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 77446 sequences (10000033 bp)…
[M::process] read 77296 sequences (10000182 bp)…
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (48, 36728, 21, 61)
[M::mem_pestat] analyzing insert size distribution for orientation FF…
[M::mem_pestat] (25, 50, 75) percentile: (420, 660, 1774)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4482)
[M::mem_pestat] mean and std.dev: (784.68, 700.87)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 5836)
[M::mem_pestat] analyzing insert size distribution for orientation FR…

SAM/BAM format

The SAM file is a tab-delimited text file that contains information for each individual read and its alignment to the genome.
The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.

The file begins with a header, which can be optional. The header is used to describe the source of data, a reference sequence, method of alignment, etc., this will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 necessary fields for essential mapping information and a variable number of other fields for aligner specific information. An example entry from a SAM file is displayed below with the different fields highlighted.

The key fields include:
● Read name (QNAME)
● The sequence of the read (SEQ)
● Encoded sequence quality (QUAL)
● (RNAME) Chromosome to which the read aligns
● (POS) Position in the chromosome to which the 5′ end of the read aligns
● Alignment information – the “CIGAR string”, e.g.:
  100M – continuous match of 100 bases (perfect match or mismatch)
  28M1D72M – 28 bases continuously match, 1 deletion from the reference, 72 bases match
● Bit FLAG – TRUE/FALSE for pre-defined read criteria, like: is it paired? a duplicate? (see https://broadinstitute.github.io/picard/explain-flags.html)
● ISIZE – paired read position and insert size
● User-defined/optional fields
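To inspect these fields directly, the SAM file produced above can be examined on the command line, and samtools can decode a bit FLAG value; a minimal sketch:

# Header lines start with '@'; alignment records follow, one read per line
$ head -n 5 results/sam/SRR2584687.aligned.sam

# First alignment record: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, ... are tab-separated
$ grep -v '^@' results/sam/SRR2584687.aligned.sam | head -n 1

# Decode a numeric FLAG into its meaning (e.g. paired, proper pair, first in pair)
$ samtools flags 99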


We will convert the SAM file to BAM format using the samtools program with the view command and tell this command that the input is in SAM format (-S) and to output BAM format (-b):

$ samtools view -S -b results/sam/SRR2584687.aligned.sam > results/bam/SRR2584687.aligned.bam

Sort BAM file by coordinates

Next, we sort the BAM file using the sort command from samtools. -o  tells the command where to write the output.


$ samtools sort -o results/bam/SRR2584687.aligned.sorted.bam results/bam/SRR2584687.aligned.bam

If you want to see statistics about your sorted BAM file:

$ samtools flagstat results/bam/SRR2584687.aligned.sorted.bam

# OUTPUT
231341 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1169 + 0 supplementary
0 + 0 duplicates
351103 + 0 mapped (99.98% : N/A)
350000 + 0 paired in sequencing
175000 + 0 read1
175000 + 0 read2
346688 + 0 properly paired (99.05% : N/A)
349876 + 0 with itself and mate mapped
58 + 0 singletons (0.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

v2- Align the reads to the contigs using BWA

$ bwa index kaya/LS0566-contigs.fa
$ bwa mem -t2 kaya/LS0566-contigs.fa 25KLUK_4_1.fq.gz 25KLUK_4_2.fq.gz > kaya/25KLUK_4.sam
$ samtools sort -@2 -o kaya/25KLUK_4.bam kaya/25KLUK_4.sam
$ samtools index kaya/25KLUK_4.bam

Index the assembly FASTA file.

$ samtools faidx kaya/LS0566-contigs.fa

Viewing BAM file using samtools tview.


$ samtools tview kaya/25KLUK_4.bam kaya/LS0566-contigs.fa

You can browse your BAM file with IGV
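IGV, like samtools tview above, expects the BAM file to be coordinate-sorted and accompanied by a .bai index; for the SRR2584687 example this is one extra command:

$ samtools index results/bam/SRR2584687.aligned.sorted.bam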

4. Viewing with IGV

IGV is a genome browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like Ensembl or the UCSC browser, are slower but provide more functionality.

Locally on your own Mac or Windows computer

We need to open the IGV software. If you haven’t done so already, you can download IGV from the Broad Institute’s software page, double-click the .zip file to unzip it, and then drag the program into your Applications folder.

  1. Open IGV.
  2. Load our reference genome file (ecoli_ref.fasta) into IGV using the “Load Genomes from File…“ option under the “Genomes” pull-down menu.

Load our BAM file (SRR2584687.aligned.sorted.bam) using the “Load from File…“ option under the “File” pull-down menu.

To load data from an HTTP URL:

  1. Select File>Load from URL.
  2. Enter the HTTP or FTP URL for a data file or sample information file.
  3. If the file is indexed, enter the index file name in the field provided.
  4. Click OK.

To load a file from Google Cloud Storage, enter the path to the file with the “gs://” prefix.
Upload the following indexed/sorted BAM file with File -> Load from URL: http://faculty.xxx.edu/~kaya/Workshop/results/SRR20372154.fastq.bam

Controlling IGV from R

You can open IGV from within R with startIGV("lm"). Note this may not work on all systems. The testing URL (xxx.edu) is given below. You can try with your cluster URL.

library(SRAdb)
urls <- readLines("http://xxxx.edu/data/samples/bam_urls.txt")
# startIGV("lm") # opens IGV
sockiv <- IGVsocket()
session <- IGVsession(files=urls,
sessionFile="session.xml",
genome="A. thaliana (TAIR10)")
IGVload(sockiv, session)
IGVgoto(sockiv, "Chr2:67296-3521")

I hope you find this tutorial useful to analyze your NGS data.
I would like to thank Dilek Koptekin ( @dilekopter ) for reviewing the pipeline. If you have any questions, please get in touch with us without hesitation.

References

Babraham Bioinformatics website. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 2013 Dec 1.

Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/ Illumina FASTQ variants. Nucleic Acids Res 38: 1767-1771. doi: 10.1093/nar/gkp1137. PubMed: 20015970.

Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM (2013) An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE 8(12): e85024. doi:10.1371/journal.pone.0085024

Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 10-12.

Zhang J, Chiodini R, Badr A, Zhang G. The impact of next-generation sequencing on genomics. J Genet Genomics. 2011;38(3):95–109. doi:10.1016/j.jgg.2011.02.003

miRDeep2 – miRNA Sequencing Analysis, Example Run by Using Ubuntu Terminal

microRNAs (miRNAs) belong to the family of small non-coding RNAs and regulate many processes in the body by regulating mRNAs. They are 20- to 25-nucleotide-long small RNAs. Since they are short, and their sequences may differ by only a small number of nucleotides (e.g. one, as in the case of SNPs), deep sequencing with high coverage is required to detect miRNAs and to identify novel sequences sensitively.

There are different tools available to investigate miRNAs, miRNA structures, their expression profiles, and so on. Although RNA-sequencing technology is still in its teenage years (Stark et al., 2019), miRNA sequencing technology is even more “immature” than RNA-seq or scRNA-seq, as are the tools available for miRNA-sequencing data analysis. Besides, there are a limited number of tools available for bioinformatics analysis of miRNA sequencing (Motameny et al., 2010; Kang and Friedländer, 2015; Chen et al., 2019). miRDeep2 (Mackowiak, 2011; Friedländer et al., 2012; Yang et al., 2011) is one of the most commonly used and recently updated tools for detecting known (canonical) and novel (non-canonical) miRNA sequences. Although pipelines are available for miRNA sequencing, as in the case of the ENCODE Project pipelines, bioinformatics tools such as miRDeep2 are easier to use for people coming from different scientific backgrounds.

There are tutorials provided on the miRDeep2 GitHub pages. There are two GitHub links (old, new) and thus two different tutorials (old, new) available. Please make sure that you follow the tutorial provided on the most recent GitHub page.

Although the tutorial is shared on the GitHub page, a practical example run might be useful for people who are planning to use this tool for the first time. Therefore, I will share the required commands with you, with the warning that you need to be extra cautious.

Step 1: Download Ubuntu Terminal

This tool requires a Linux working environment. So, if you are using Windows, you need to download a program such as the Ubuntu Terminal or VirtualBox/a virtual machine to run the miRDeep2 package. For this, you need to open the Microsoft Store and choose to download Ubuntu (not the LTS versions but the terminal).

Step 2: Downloading miRDeep2 with conda install

If you try to install miRDeep2 without conda, you might encounter some problems, so I strongly recommend using conda install. After installation, do not forget to test the Perl script mapper.pl.

 dincaslan@D:~$ sudo apt-get update
 dincaslan@D:~$ sudo apt-get upgrade
 dincaslan@D:~$ cd /mnt/c/Users/USER/Downloads/

#You need to open a new terminal here. You can follow the instructions given in this link. Because I want to download the files to Downloads in Windows instead of Linux, I specified the paths with "mnt/c/Users/...".

 dincaslan@D:~$ sha256sum  /mnt/c/Users/USER/Downloads/Anaconda3-2019.10-Linux-x86_64.sh 
 dincaslan@D:/mnt/c/Users/USER/Downloads$ bash /mnt/c/Users/USER/Downloads/Anaconda3-2019.10-Linux-x86_64.sh
 dincaslan@D:/mnt/c/Users/USER/Downloads$ source ~/.bashrc
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda config --set auto_activate_base
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda config --set auto_activate_base True
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda list
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda install -c bioconda mirdeep2
 (base) dincaslan@D:/mnt/c/Users/USER/Downloads$ mapper.pl 

Step 3: Running the Tutorial for MiRDeep2

Before running your analysis, it would be better to test the tutorial run to make sure that everything is alright with the tool. You can download the mature and hairpin miRNA files from miRBase.

(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ cd drmirdeep.github.io-master/

#cd is used to change to the given path/directory. You need to choose the directory to which you downloaded the tutorial files.
#ls is used to list the files in the given folder

(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master$ ls
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master$ cd drmirdeep.github.io-master/
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ ls

#grep to check how many of the reads have the adapter sequence
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ grep -c TGGAATTC example_small_rna_file.fastq
2001
#do not forget to extract the relevant species from the mature and hairpin miRNA files you downloaded from miRBase.

(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/mature.fa hsa > /mnt/c/Users/USER/Downloads/mature_hsa.fa  
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/hairpin.fa hsa > /mnt/c/Users/USER/Downloads/hairpin_hsa.fa  
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/mature.fa mmu,chi > /mnt/c/Users/USER/Downloads/mature_other_hsa.fa 

#to build index file via bowtie1
#make sure that you do not use the same name for the file you give as input, reference genome, and indexed output.

(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ bowtie-build refdb.fa refdb.fa

#to map the sample sequencing reads against the indexed genome file
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ mapper.pl example_small_rna_file.fastq -e -h -i -j -k TGGAATTC -l 18 -m -p refdb.fa -s reads_collapsed.fa -t reads_vs_refdb.arf -v -o 4

#to run the mirdeep2 analysis. You can find the detailed information regarding the parameters in the paper and the tutorial page.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ miRDeep2.pl reads_collapsed.fa refdb.fa reads_vs_refdb.arf mature_ref.fa mature_other.fa hairpin_ref.fa -t hsa 2>report.log

Step 4: Running the miRDeep2 for your sample

Before running miRDeep2, you might want to check the quality of your FASTQ files with FastQC. Although miRDeep2 has an intrinsic adapter-trimming function, you might still need to use cutadapt based on your data’s specific needs. I will share example commands showing how to install the tools and do the adapter trimming.

#for fastqc
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ sudo apt-get update
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ sudo apt-get install fastqc
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ fastqc --extract /mnt/c/Users/USER/Downloads/S26.fastq.gz -o /mnt/c/Users/USER/Downloads/fastqc_results

#for cutadapt and fastqc after
#Lets say your adapter sequence is this: TAGCTGATCGATCTGAAACT
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda install -c bioconda cutadapt
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ cutadapt -a TAGCTGATCGATCTGAAACT /mnt/c/Users/USER/Downloads/S26.fastq > /mnt/c/Users/USER/Downloads/outputS26.fastq
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ fastqc --extract /mnt/c/Users/USER/Downloads/outputS26.fastq -o /mnt/c/Users/USER/Downloads 

#before this step, you need to download a reference file in fasta/fa format.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ bowtie-build ucsc_hg19.fasta ucschg19

#You do not need to add .fa extension to file that you index
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ mapper.pl S26.fastq -e -h -i -j -k TAGCTGATCGATCTGAAACT -l 18 -m -p ucschg19 -s R___collapsed.fa -t R___refdb.arf -v -o 4

#You need to use index file as a reference here
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ miRDeep2.pl R___collapsed.fa ucsc_hg19.fasta R___refdb.arf mature_hsa.fa mature_other_hsa.fa hairpin_hsa.fa -t hsa 2> report.log

I hope you find this tutorial run useful. In addition to the websites given, whenever you have problems with the miRDeep2 run, I strongly recommend reading the documentation on the new GitHub page and the article, and checking, and if necessary asking, questions on Biostars.

I would like to thank my dear labmate Daniel Muliaditan for helping me to remember/learn the basics of Linux and practice the miRDeep2 run in the Ubuntu Terminal (by the convenient way of handling such problems: using conda install). I would also like to thank #AcademicTwitter, especially Dr. Ming Tang, for his extremely useful answer to my question 🙂

References:

Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genet 20, 631–656 (2019). https://doi.org/10.1038/s41576-019-0150-2

Motameny, S.; Wolters, S.; Nürnberg, P.; Schumacher, B. Next Generation Sequencing of miRNAs – Strategies, Resources and Methods. Genes 2010, 1, 70-84. https://doi.org/10.3390/genes1010070

Kang W, Friedländer MR. (2015) Computational prediction of miRNA genes from small RNA sequencing data. Front Bioeng Biotechnol 3: 7 10.3389/fbioe.2015.00007

Liang Chen, Liisa Heikkinen, Changliang Wang, Yang Yang, Huiyan Sun, Garry Wong, Trends in the development of miRNA bioinformatics tools, Briefings in Bioinformatics, Volume 20, Issue 5, September 2019, Pages 1836–1852, https://doi.org/10.1093/bib/bby054

Mackowiak, S. D. Identification of novel and known miRNAs in deep-sequencing data with miRDeep2. Curr Protoc Bioinformatics Chapter 12, Unit 12.10, 10.1002/0471250953.bi1210s36 (2011).

Xiaozeng Yang, Lei Li, miRDeep-P: a computational tool for analyzing the microRNA transcriptome in plants, Bioinformatics, Volume 27, Issue 18, 15 September 2011, Pages 2614–2615, https://doi.org/10.1093/bioinformatics/btr430

Marc R. Friedländer, Sebastian D. Mackowiak, Na Li, Wei Chen, Nikolaus Rajewsky, miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades, Nucleic Acids Research, Volume 40, Issue 1, 1 January 2012, Pages 37–52, https://doi.org/10.1093/nar/gkr688

https://www.encodeproject.org/microrna/microrna-seq/

Workshop

Scientific Figure Design workshop presentation is available now!

You can download the PDF file from here.
Also, if you want to access the bookdown version, you can click here.
created by: Handan Melike Dönertaş

Starting with our next student symposium, we are planning to organize workshops. Let us know your favorite workshop topics and help us organize something that interests you!

Resources to Learn Computational Biology

We have started to compile a list of resources to learn or improve computational biology skills: Resources for Computational Biology & Bioinformatics

You can add new resources to relevant sheets or just check the list to find your new favourite book/course!

The list includes different types of resources for learning programming languages, a specific type of analysis, or pure theory. We also added a sheet for databases, which we hope will soon be full of exciting databases – you can add yours as well 🙂

Don’t forget to add the resources you found useful and share with your circle!

Please contact turkey.rsg@gmail.com for any suggestion/comment.

ISMB2018 – 26th Conference on Intelligent Systems for Molecular Biology

We as RSG-Turkey are so proud to be part of the great organizations ISCB and ISCB-SC. ISMB2018, one of the conferences organized by ISCB, was held in Chicago between 6-11 July. I was nominated for the ISCB-SC RSG Leadership Travel Fellowship and had the opportunity to attend the conference. Although this post is overdue, it has a bunch of highlights which should be recorded.

The first thing worth mentioning is the ISCB Communities of Special Interest (COSIs), which are topically focused collaborative communities of shared interest wherein scientists communicate with one another on research problems and/or opportunities in specific areas of computational biology. For detailed information about the sixteen COSIs of ISCB, click on the link. One of my favorites is the SysMod COSI, where I got a chance to present my Ph.D. project and meet great scientists as well as future collaborators.

On the first day of the conference, Thomas Lengauer, the ISCB president, welcomed over 1,600 delegates to Chicago and started the tight schedule of ISMB2018. During the event, the ISCB Conferences mobile application helped each participant create their own program.
The conference hosted very successful and interesting talks, including the keynotes.

The conference-leading keynote was by Steven Salzberg from the Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins University. His keynote, titled “25 years of human gene finding: are we there yet?”, focused on how The Human Genome Project was launched with the promise of revealing all of our genes, the “code” that would help explain our biology. The publication of the human genome in 2001 provided only a very rough answer to this question. For more than a decade following, the number of protein-coding genes steadily shrank, but the introduction of RNA sequencing revealed a vast new world of splice variants and RNA genes. His talk reviewed where we’ve been and where we are today, and described a new effort to use an unprecedentedly large RNA sequencing resource to create a comprehensive new human gene catalog.

The ISCB Overton Prize keynote speaker, Cole Trapnell of the University of Washington, gave an engaging and informative talk titled “Reconstructing and deforming developmental landscapes”, which focused on how developing embryos are comprised of highly plastic individual cells that shift from one functional state to another, often reversibly so. A cell executes a different gene expression program for each of its possible roles, switching between them as needed throughout its life. How does the genome encode the developmentally intended sequence of program switches? Which gene regulatory events are crucial for a given cell fate decision? Quantifying each gene’s contribution in governing even one developmental step is a staggeringly difficult challenge. However, massively scalable single-cell transcriptome and epigenome profiling offer a way to quantitatively dissect developmental regulatory circuits. He discussed new assays and algorithms developed by his laboratory to realize this goal and offered some lessons from several recent projects.

Martha L. Bulyk from Brigham & Women’s Hospital and Harvard Medical School in Boston was the next day’s keynote speaker. Her talk, titled “Transcription factors and cis-regulatory elements”, focused on mapping the impact of unique variants on the expression of transcription factors. Specifically, it highlighted that similar target sequences can have far-reaching impacts when mutated. The difficulties associated with establishing a proper background were also addressed. The engaging talk culminated in an informative question and answer period.

Madan Babu of the MRC Laboratory of Molecular Biology in the United Kingdom was another keynote speaker. His talk focused on understanding how the amino acid sequence of a protein contributes to its function (the sequence-function relationship) and the foundation for the sequence–structure–function paradigm. He presented IDR-Screen, a high-throughput experimental and computational approach for discovering functional disordered regions in a biologically relevant context and identifying features of functional sequences through statistical learning.

The final keynote of the conference, by the ISCB Accomplishments by a Senior Scientist Award winner, Ruth Nussinov, was an inspiring talk entitled “A woman’s computational biology journey”. It focused on her journey through the field, beginning when revolutionary sequencing methods produced the first long DNA sequences, with the development of an efficient algorithm to fold RNA, followed by pioneering bioinformatic DNA sequence analyses.

Throughout the conference days, attendees were able to meet and seek out information on new technologies, platforms, and ideas. In addition to having an opportunity to meet with exhibitors, attendees could view the poster presentations of the day to seek out new ideas and approaches. With nearly 300 attendee participants interacting with 15 recruiting entities, the ISCB Career Fair was also a notable event. The Career Fair allowed a designated time for engaging discussion among talented candidates seeking positions in the fields of computational biology and bioinformatics.

Attending an ISCB conference is also a good chance to understand the ISCB organizational structure; transparency is clearly one of its strengths. Bruno Gaeta, ISCB Treasurer, reviewed the Society’s financial statements and current membership numbers. Scott Markel, the Nominations Co-Chair, reminded members to vote and gave a brief overview of the nominations process. The Student Council delivered their annual report and highlighted this year’s ISMB Student Symposium.

At every ISMB, ISCB offers poster and oral presentations and a number of travel fellowship opportunities, as well as competitions like Art in Science and the Wikipedia competition. At ISMB2018, the 2017-2018 Wikipedia winners were announced, the 2018 Art in Science winners were announced, and over 40 students and post-docs were recognized as ISMB travel fellowship recipients. The ISCB aims to improve the communication of scientific knowledge to the public at large, so the ISCB Wikipedia Competition aims to improve the quality of Wikipedia articles relating to computational biology. Entries to the competition are open now; the competition closes on 31 Dec 2018. Prizes of up to $500 will be awarded to the best contributions as chosen by a judging panel of experts; these will be awarded at the ISMB/ECCB conference in Basel, Switzerland in July 2019. Detailed information is on this link. Another annual event is the Art in Science competition, which offers a way to show the beauty of science in art form. The winners are presented with a USD 200 prize, as well as being featured as the cover image of the ISCB Fall Newsletter.

I have written up just some highlights from the conference; however, more information about the conference is also available in the ISCB newsletter. If you are curious about selected works from ISMB2018 presenters, you can find the special issue in Bioinformatics.

If, after reading this post, you feel sorry that you missed this breathtaking event, no worries: you can watch the presentations online by clicking the link.

Moreover, please save the date for next ISMB in Basel, Switzerland between July 21 – July 25, 2019.

7 reasons why HiBiT 2017 was the best — Ribosome News

I have always felt the need to contribute to the social environment of the organisations I have been a part of, ever since I was a kid. I always took part in student groups and wanted to make a difference for everyone who would be in my position in the future. I think I have managed to do […]

via 7 reasons why HiBiT 2017 was the best — Ribosome News

Webinar Project – Bioinfonet

The Bioinfonet project is an ISCB Student Council supported webinar project. Our main aim is to build bridges between bioinformatics professionals and young bioinformatics students, mainly those who live in developing countries and have difficulty joining the aura of the community. But anyone who wants to get an idea of how bioinformatics is done, who those bioinformaticians are and what they do, or who is seeking a collaboration can find their “remedy” here.
Our main activity is on our community page. When you register to our community, you get an e-mail notice whenever we organize a new webinar – and nothing more (we hate spamming!). The other way to keep in touch with us is to follow us on Facebook and LinkedIn.
If you have any questions, thoughts or recommendations please contact us: turkey.rsg@gmail.com

Welcome to ISCB SC RSG Turkey’s Official Website!

Welcome to ISCB SC RSG Turkey’s official website!

Most of the people who contact us say, “Do you still continue organising webinars with rshfjskh Turkey?”. This “rshfjskh” feeling arises, I have come to conclude, because they do not know the abbreviation 🙂 So, here is the step-by-step tutorial for saying ISCB SC RSG Turkey without any struggle:

ISCB: International Society for Computational Biology
ISCB SC: International Society for Computational Biology Student Council
ISCB SC RSG: International Society for Computational Biology Student Council Regional Student Group
And here we are: ISCB SC RSG Turkey!


But why do we have such a long name?

Let me explain: ISCB is the main organisation, within which the Student Council functions. RSGs are connected to ISCB through the SC, not directly. This is why. For further information, please refer to the links I mentioned at the beginning.

RSG-Turkey is a member of The International Society for Computational Biology (ISCB) Student Council (SC) Regional Student Groups (RSG). We are a non-profit community composed of early career researchers interested in computational biology and bioinformatics.

Contact: turkey.rsg@gmail.com

Follow us on social media!