The containment of a core material inside a small capsule is called microencapsulation. A polymeric material coats liquid or solid substances to protect the core material from the surrounding environment1. Microcapsule sizes vary between 50 nm and 2 mm2. A microcapsule's size and structure differ according to whether the core material is solid, liquid, or gas, as shown in figure 12.
The coating material must be adhesive to the core material in order to cover it properly. Coating materials must work as a harmonious aid to the core material, providing the required strength, flexibility, impermeability, optical properties, and stability. Their release must be controllable under the required conditions1.
Figure 3: Coating material examples1 — water-soluble materials, water-insoluble materials, and waxes and lipid materials.
The microencapsulation of adipose stem cells with an alginate coating is shown in figure 6. The cross-linking solution contains calcium chloride and glucose and is buffered with HEPES. Calcium chloride provides divalent cations to alginate during cross-linking. Glucose is useful for maintaining the physiological osmolality of the cross-linking solution for the adipose stem cells. HEPES is used to maintain the pH at or below pH 7.33.
The generation of hydrogel microcapsules with a microfluidic system is shown in figure 7. Oligosaccharide and peptide–starPEG solutions were introduced through two distinct channels. The flow rates of the oil phase and of the oligosaccharide and peptide–starPEG solutions were set to obtain the required droplet formation4.
Microencapsulation can be used to encapsulate different materials; therefore, it is useful for the treatment of different diseases that occur in various tissues. There are various methods to make microcapsules, and the generation method must be chosen carefully according to the materials the microcapsule is made of. Microcapsules can be used to deliver drug molecules and various cell types into the targeted tissue. As technology improves, microencapsulation methods will also improve and become more effective.
1. Microencapsulation. Int J Pharm Sci Rev Res. 2010;5(2):58-62.
2. Singh MN, Hemant KSY, Ram M, Shivakumar HG. Microencapsulation: A promising technique for controlled drug delivery. Res Pharm Sci. 2010;5(2):65-77.
3. Leslie SK, Kinney RC, Schwartz Z, Boyan BD. Microencapsulation of Stem Cells for Therapy. In: Methods Mol Biol. Vol 1479. 2017:225-235. doi:10.1007/978-1-4939-6364-5
4. Wieduwild R, Krishnan S, Chwalek K, et al. Noncovalent Hydrogel Beads as Microcarriers for Cell Culture. Angew Chemie. 2015;127(13):4034-4038. doi:10.1002/ange.201411400
Omics sciences target the quantification of whole classes of biomolecules, such as RNAs and proteins, at the organism, tissue, or single-cell level. Omics sciences are separated into several branches, such as genomics, transcriptomics, and proteomics1.
Transcriptomics is one of the omics sciences dissecting the organism’s transcriptome which is the sum of all of its RNA molecules2,3.
What is RNA sequencing?
RNA sequencing (RNA-seq) is a technique providing quantification of all RNAs in bulk tissues or in individual cells. The transcript amounts of each gene across samples are calculated by using this technique. It utilizes next-generation sequencing (NGS) platforms, which decipher the sequences of biomolecules such as DNA and RNA4,5.
What are the kinds of RNA-seq?
Bulk tissue RNA-seq
The whole transcriptome of target bulk tissues is sequenced to perform transcriptomics analyses. Here, the target bulk tissue can contain various cell types, and therefore the whole transcriptome is a mixture of the RNAs of those cells. This approach is the most common use of RNA-seq and is performed for aims such as the elucidation of diseases7.
In contrast to bulk tissue RNA-seq, single-cell RNA-seq (scRNA-seq) is performed on individual cells. The whole transcriptome of each cell in a tissue is sequenced to perform transcriptomics analysis. scRNA-seq has revealed that the transcriptome of each cell in a tissue differs from the others, and that individual cells can be separated into specific clusters according to their transcriptomic signatures. scRNA-seq has helped the discovery of cell types such as ionocytes, which could be relevant to the pathology of cystic fibrosis7,8.
The relationship between cells and their relative locations within a tissue sample can be critical to understanding disease pathology. Spatial transcriptomics is a technology that allows the measurement of all the gene activity in a tissue sample and maps where that activity is occurring. This technique is utilized in the understanding of biological processes and disease. Spatial RNA-seq can be performed on intact tissue sections as well as at the single-cell level. The general aim of this technique is to combine gene expression with morphological information, providing information on tissue architecture and micro-environment for the generation of sub-cellular data. Current bulk and scRNA-seq methods provide users with highly detailed data regarding tissues or cell populations but do not capture spatial information7,9,10.
There are various library types in RNA-seq, resulting in sequencing reads (sequenced transcripts) with different characteristics. For instance, reads can be single-end, in which a transcript is read from only one end (5' or 3'); in paired-end libraries, however, a transcript is read from both its 5' and 3' ends. Paired-end sequencing can additionally help disambiguate read mappings and is preferred for alternative-exon quantification and fusion transcript detection, particularly when working with poorly annotated transcriptomes7. In addition, libraries can be stranded or unstranded. Strandedness is important for determining which DNA strand reads come from, and it is utilized to assign reads to the relevant genes. If the strandedness information of a library is misused, reads are assigned to the wrong genes and the gene expression results will be wrong11. Besides, technical replicates can be utilized in this process, in which one sample is sequenced more than once on the same high-throughput platform, to help eliminate technical bias.
After RNA extraction from all samples, libraries are prepared for sequencing according to the selected library type. Once the library type is determined, libraries are sequenced to a depth of 10–30 million reads per sample on a high-throughput platform7.
3) Data analysis
After sequencing has been completed, the starting point for analysis is the data files, which contain base-called sequencing reads, usually in FASTQ format. Reads with poor quality in the FASTQ files are eliminated before the alignment process, in which raw sequences are aligned to a reference genome to find their relevant genes. Each sequence read is converted to one or more genomic coordinates, and Sequence Alignment Map (SAM) files containing those coordinates are obtained after the alignment process7,12. This process has traditionally been accomplished using distinct alignment tools, such as TopHat13, STAR14, or HISAT15, which rely on a reference genome. Because of their large size, the SAM files are converted to Binary Alignment Map (BAM) files for further analyses; this conversion is carried out using Samtools16. After the alignment and file conversion steps, transcript quantification across samples is performed using tools such as featureCounts17 to obtain an expression matrix in which each row corresponds to a gene and each column to a sample7. Normalization of transcript abundance across samples is performed on the expression matrix to lessen range-based gene expression differences between samples7,18,19. Normalization methods are shown in (Figure 1)20.
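As a toy illustration of the normalization step, the sketch below applies counts-per-million (CPM) scaling, one of the simplest schemes; the gene names and counts are invented, and real tools such as DESeq2 use more robust methods:

```python
# Toy counts-per-million (CPM) normalization of an expression matrix.
# Rows are genes, columns are samples; all names and numbers are invented.
counts = {
    "geneA": [10, 20],    # counts in sample 1 and sample 2
    "geneB": [90, 180],
}

n_samples = 2
# Library size = total counts per sample (column sums).
lib_sizes = [sum(counts[g][s] for g in counts) for s in range(n_samples)]

# CPM: scale each count by its sample's library size, times one million.
cpm = {g: [1e6 * c / lib_sizes[s] for s, c in enumerate(row)]
       for g, row in counts.items()}

print(cpm["geneA"])  # [100000.0, 100000.0]
```

Although geneA's raw counts differ between the two samples (10 vs. 20), its CPM values are equal, because the difference is entirely explained by library size.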
After the normalization step, genes with low expression across samples are filtered out to prevent statistical noise7, and then statistically meaningful genes (namely, differentially expressed genes) can be detected using tools such as edgeR21 and DESeq222. In the end, the obtained genes can be used for enrichment analyses, such as KEGG and Reactome, to find out which pathways are affected. RNA-seq technology is utilized for distinct aims, some of which are shown in (Figure 2). Representations of RNA-seq results are shown in (Figure 3).
A detailed RNA-seq work-flow is shown in (Figure 4)12.
The various tools that are used for RNA-seq and their tutorials are listed below, as well as visualization tools that are used for high-throughput data.
Table 1. List of RNA-seq tools and their usage fields.
In addition to the differential expression pipelines above, if you want to examine my pipeline containing differential expression analysis with DESeq2, you can visit https://github.com/kaanokay/Differential-Expression-Analysis/blob/master/HISAT2-featureCounts-DESeq2-workflow.md, where I attached my Linux and R scripts.
RNA-seq research in autism spectrum disorder
Autism Spectrum Disorder (ASD) is an early-onset neuropsychiatric disorder. ASD is clinically described by behavioural abnormalities such as restricted interests and repetitive behaviour. ASD is genetically heterogeneous and heritable (~50%), and 80% of its genetic background is unclear. Aberrations in autistic brains take place mostly in cortical regions (Figure 5) rather than the cerebellum. When ASD is compared with other neuropsychiatric disorders such as schizophrenia and bipolar disorder, it has a higher heritability rate, which means that it appears with a stronger genetic background than schizophrenia and bipolar disorder. Studies have revealed that ASD-related genes are enriched in brain development, neuronal activity, signalling, and transcription regulation. Wnt signalling, synaptic function, and translational regulation are pathways that are affected by mutations in ASD-related genes31.
Transcriptome studies have shown that misexpression of mRNAs, microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), and long non-coding RNAs (lncRNAs) occurs in autistic brains. Genes with mRNA misregulation are especially enriched in immune and neuronal pathways; briefly, neuronal development and immune system activation are both misregulated in the brains of individuals with ASD. Misregulated miRNAs in autistic brains mostly target genes with synaptic functions. Additionally, alternative splicing is misregulated in splicing regulators, and this causes mis-splicing patterns in autistic individuals31.
To summarize, RNA-seq is a powerful technology for understanding diseases, and it can be used for various aims.
If you have any questions about this short review or my differential expression pipeline on GitHub, feel free to contact me via the firstname.lastname@example.org e-mail address.
B. M., Liu, X., Swarnkar, S. & Chen, Y. Genomic and proteomic mechanisms and models in toxicity and safety evaluation of nutraceuticals. in Nutraceuticals: Efficacy, Safety and Toxicity 227–237 (Elsevier Inc., 2016).
5) Behjati, S. & Tarpey, P. S. What is next generation sequencing? Arch. Dis. Child. Educ. Pract. Ed. 98, 236–238 (2013).
Next-generation sequencing (NGS) enables high-throughput detection of DNA sequences in genomic research. NGS technologies are implemented for several applications, including whole-genome sequencing, de novo assembly sequencing, resequencing, and transcriptome sequencing at the DNA or RNA level. In order to sequence longer sections of DNA, a new approach called shotgun sequencing (Venter et al., 2003; Margulies et al., 2005; Shendure et al., 2005) was developed during the Human Genome Project (HGP). In this approach, genomic DNA is enzymatically or mechanically broken down into smaller fragments and cloned into sequencing vectors, in which the cloned DNA fragments can be sequenced individually. Detecting abnormalities across the entire genome (whole-genome sequencing only), including substitutions, deletions, insertions, duplications, copy number changes (gene and exon), and chromosome inversions/translocations, is possible with the help of the NGS approach. Thus, shotgun sequencing has significant advantages over the original sequencing methodology, Sanger sequencing, which requires a specific primer to start the read at a specific location along the DNA template and records the different labels for each nucleotide within the sequence.
The aim of this study is to build a general workflow for mapping the short-read sequences that come from an NGS machine.
Before the analysis of NGS data with publicly or commercially available algorithms and tools, we need to know about some features of the NGS raw data.
The raw data from a sequencing machine are most widely provided as FASTQ (unaligned sequences) files, which include sequence information, similar to FASTA files, but additionally contain further information, including sequence quality information. A FASTQ file consists of blocks, corresponding to reads, and each block consists of four elements in four lines.
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description.
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
For instance:
@HS2000-306_201:6:1204:19922:79127/
HS2000-306_201: the instrument name
6: the flowcell lane
1204: the tile number within the flowcell lane
19922: the x-coordinate of the cluster within the tile
79127: the y-coordinate of the cluster within the tile
/: the member of a pair, 1 or 2 (paired-end reads only)
ACGTCTGGCCTAAAGCACTTTTTCTGAATTC… (sequence)
BC@DFDFFHHHHHJJJIJJJJJJJJJJJJJJJJJJJJJH… (base qualities)
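The four-line record structure can be parsed mechanically. A minimal Python sketch, using a record assembled from the example above (sequence and qualities truncated to 31 characters; the pair member /1 and the bare '+' separator line are assumptions):

```python
# Parse one FASTQ record (four lines) into its components.
record = """@HS2000-306_201:6:1204:19922:79127/1
ACGTCTGGCCTAAAGCACTTTTTCTGAATTC
+
BC@DFDFFHHHHHJJJIJJJJJJJJJJJJJJ"""

header, sequence, plus, qualities = record.splitlines()

assert header.startswith("@")           # line 1: identifier
assert plus.startswith("+")             # line 3: separator
assert len(sequence) == len(qualities)  # lines 2 and 4 must match in length

print(sequence[:10])  # ACGTCTGGCC
```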
Quality control is the most important step in the process of improving raw data by removing any identifiable errors from it. With the application of QC at the beginning of the analysis, the chance that contamination, imprecision, errors, or missing data go undetected is reduced.
Quality (Q) is derived from the probability (e) of the base call being wrong:
Phred-scaled Q = -10 * log10(e)
Base quality character = ASCII(Phred-scaled Q + 33)
SAM/FASTQ encoding adds 33 to the value because ASCII 33 is the first visible character.
Phred Quality Score    Probability of Incorrect Base Call    Base Call Accuracy
10                     1 in 10                               90%
20                     1 in 100                              99%
30                     1 in 1,000                            99.9%
40                     1 in 10,000                           99.99%
50                     1 in 100,000                          99.999%
60                     1 in 1,000,000                        99.9999%
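This encoding is easy to verify in code; a minimal sketch of the Phred+33 conversion:

```python
def ascii_to_phred(char):
    """Decode a Phred+33 (Sanger/Illumina 1.8+) quality character."""
    return ord(char) - 33

def phred_to_error(q):
    """Base-calling error probability from a Phred score: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# 'I' is ASCII 73, so it encodes Q = 73 - 33 = 40,
# i.e. a 1-in-10,000 chance that the base call is wrong.
q = ascii_to_phred("I")
print(q, phred_to_error(q))  # 40 0.0001
```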
The most commonly used tool for assessing and visualizing the quality of FASTQ data is FastQC (Babraham Bioinformatics, n.d.), which provides comprehensive information about data quality, including base sequence quality scores, GC content information, sequence duplication levels, and overrepresented sequences. There are some alternatives to FastQC, such as PRINSEQ, fastqp, NGS QC Toolkit, and QC-Chain.
1- To run the FastQC program on desktop, you can use File > Open to select the sequence file you want to check.
2- To run the FastQC program in the cluster, we would normally have to tell our computer where the program is located.
$ which fastqc
FastQC can accept multiple filenames as input, so we can use the *.fastq.gz wildcard to run FastQC on all of the FASTQ files in this directory.
$ fastqc *.fastq.gz
You will see an automatically updating output message telling you the progress of the analysis. It will start like this:
Started analysis of SRR2584687_1.fastq
Approx 5% complete for SRR2584687_1.fastq
Approx 10% complete for SRR2584687_1.fastq
Approx 15% complete for SRR2584687_1.fastq
Approx 20% complete for SRR2584687_1.fastq
Approx 25% complete for SRR2584687_1.fastq
Approx 30% complete for SRR2584687_1.fastq
Approx 35% complete for SRR2584687_1.fastq
Approx 40% complete for SRR2584687_1.fastq
Approx 45% complete for SRR2584687_1.fastq
For each input FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. We’ll be working with these output files soon. The .html file is a stable webpage displaying the summary report for each of our samples.
We want to keep our data files and our results files separate, so we will move these output files into a new directory within our results/ directory. If this directory does not exist, we will have to create it.
## -p flag stops a message from appearing if the directory already exists
$ mkdir -p ~/kaya/results
$ mv *.html ~/kaya/results/
$ mv *.zip ~/kaya/results/
It can be quite tedious to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.
The .html files and the uncompressed .zip files are still present, but now we also have a new directory for each of our samples. We can see for sure that it’s a directory if we use the -F flag for ls.
For more information, please see the FastQC documentation.
Additionally, the MultiQC tool has been designed for the task of combining QC reports into a single report that is easy to analyze.
$ multiqc .
$ multiqc --help
Another way to check your NGS data quality is to work in RStudio. fastqcr can be installed from CRAN as follows.
2. Trimming Low-quality Reads and Adapters
Trimming is the second step in analyzing NGS data. It has been broadly embraced in most recent NGS studies, specifically prior to genome assembly, transcriptome assembly, metagenome reconstruction, gene expression, epigenetic studies, and comparative genomics. Neglecting the presence of low-quality base calls may in fact be harmful to any NGS analysis, as it may add unreliable and potentially random sequences to the dataset. This may constitute a relevant problem for any downstream analysis pipeline and lead to false interpretations of the data. Also, adapter contamination can lead to NGS alignment errors and an increased number of unaligned reads, since the adapter sequences are synthetic and do not occur in the genomic sequence. There are applications (e.g., small RNA sequencing) where adapter trimming is highly necessary: with a fragment size of around 24 nucleotides, one will definitely sequence into the 3′ adapter. But there are also applications (transcriptome sequencing, whole-genome sequencing, etc.) where adapter contamination can be expected to be so small (due to an appropriate size selection) that one could consider skipping the adapter removal and thereby save time and effort. There are many tools to handle QC, namely AfterQC, Cutadapt, Trimmomatic, ERNE-FILTER, ConDeTri, Sickle, SolexaQA, AlienTrimmer, Skewer, BBDuk, FASTX-Toolkit, and Trim Galore.
In the present work, we describe the basic commands to improve your NGS data quality and authenticity with the Cutadapt trimming tool.
When processing paired-end data, Cutadapt trims both reads of a pair. To facilitate this, provide two input files and a second output file with the -p option (the short form of --paired-output). This is the basic command-line syntax:
$ cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq
Here, the input reads are in reads.1.fastq and reads.2.fastq, and the result will be written to out.1.fastq and out.2.fastq.
In paired-end mode, the options -a, -b, -g and -u that also exist in single-end mode are applied to the forward reads only. To modify the reverse read, these options have uppercase versions -A, -B, -G and -U that work just like their counterparts. In the example above, ADAPTER_FWD will therefore be trimmed from the forward reads and ADAPTER_REV from the reverse reads.
The -q (or --quality-cutoff) parameter can be used to trim low-quality ends from reads. If you specify a single cutoff value, the 3’ end of each read is trimmed:
$ cutadapt -q 20 -o output.fastq input.fastq
It is also possible to trim from the 5’ end by specifying two comma-separated cutoffs as 5’ cutoff, 3’ cutoff. For example,
$ cutadapt -q 15,10 -o output.fastq input.fastq
will quality-trim the 5’ end with a cutoff of 15 and the 3’ end with a cutoff of 10. To only trim the 5’ end, use a cutoff of 0 for the 3’ end, as in -q 15,0.
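Conceptually, 3’ quality trimming walks in from the end of the read while bases fall below the cutoff. The sketch below illustrates only this naive version; Cutadapt’s actual algorithm uses a BWA-style partial-sum scheme, so its results can differ:

```python
def trim_3prime(quals, cutoff):
    """Naive 3' quality trimming: return the kept read length.

    `quals` is a list of Phred scores. Bases are dropped from the 3' end
    while they fall below `cutoff`. (Cutadapt itself uses a BWA-style
    partial-sum algorithm, so this only approximates its behaviour.)
    """
    end = len(quals)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    return end

quals = [38, 37, 36, 30, 12, 9, 5]
print(trim_3prime(quals, 20))  # 4 -> the last three low-quality bases go
```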
Interleaved paired-end reads
Paired-end reads can be read from a single FASTQ file in which the entries for the first and second read of each pair alternate. The first read in each pair comes before the second. Enable this file format by adding the --interleaved option to the command line. For example:
$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.fastq reads.fastq
To read from an interleaved file, but write regular two-file output, provide the second output file as usual with the -p option:
$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq -p trimmed.2.fastq reads.fastq
Reading two-file input and writing interleaved is also possible by providing a second input file:
$ cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq reads.1.fastq reads.2.fastq
Trimming paired-end reads separately
If you want to quality-trim the first read in each pair with a threshold of 20, and the second read in each pair with a threshold of 10, then the commands could be:
$ cutadapt -q 20 -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
$ cutadapt -q 10 -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq
If one end of a paired-end read has more than 5% ‘N’ bases, the read pair can be removed. Cutadapt provides the following options to deal with N bases in your reads:
--max-n COUNT: Discard reads containing more than COUNT N bases. A fractional COUNT between 0 and 1 can also be given and will be treated as the proportion of maximally allowed N bases in the read.
--trim-n: Remove flanking N bases from each read. That is, a read such as NNACGTACGTNNNN is trimmed to ACGTACGT; only the flanking Ns are removed. This option is applied after adapter trimming. If you want to get rid of N bases before adapter removal, use quality trimming instead: N bases typically also have a low quality value associated with them.
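The effect of --trim-n and --max-n can be mimicked in a few lines of Python; a minimal sketch with invented reads and thresholds:

```python
def trim_flanking_n(read):
    """Remove N bases from both ends of a read, as --trim-n does."""
    return read.strip("N")

def too_many_n(read, max_frac):
    """True if the fraction of N bases exceeds max_frac (cf. --max-n)."""
    return read.count("N") / len(read) > max_frac

print(trim_flanking_n("NNACGTACGTNNNN"))  # ACGTACGT
print(too_many_n("ACGNNTAG", 0.05))       # True: 2/8 = 25% N bases
```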
Finally, Cutadapt can work with pairs of adapters via the --pair-adapters option:
$ cutadapt --pair-adapters -a AAAAA -a GGGG -A CCCCC -A TTTT -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq
Here, the adapter pairs are (AAAAA, CCCCC) and (GGGG, TTTT). That is, the reads of a pair are only trimmed if the adapters of one pair both match: for example, AAAAA found in the forward read together with CCCCC in the reverse read.
Now, the filtered reads of each sequencing sample are ready to be assigned their exact locations on the corresponding reference genome. Alternatively, you can find these locations using de novo assembly.
A reference genome is a collection of contigs.
● A contig refers to overlapping DNA reads encoded as A, G, C, T, or N
● Typically comes in FASTA format:
○ The “>” line contains information on the contig
There are a number of tools to choose from and, while there is no golden rule, some tools are better suited to particular NGS analyses; to name a few: BWA, Bowtie2, SOAP, Novoalign, and MUMmer. After aligning, a Sequence Alignment Map (SAM) file is produced. This is a format for storing large nucleotide sequence alignments. The binary version of a SAM file is termed a Binary Alignment Map (BAM) file; a BAM file stores aligned reads and is technology independent. The SAM/BAM file consists of a header and an alignment section.
You need to create multiple directories for the results that will be generated as part of this workflow.
$ mkdir -p results/sam results/bam
Index the reference genome
Our first step is to index the reference genome for use by BWA. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment. Indexing the reference only has to be run once. The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different tool for alignment.
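The payoff of indexing can be illustrated with a toy k-mer index: one-time preprocessing turns every seed lookup into a dictionary access instead of a scan over the whole reference. (Real BWA builds an FM-index on the Burrows–Wheeler transform, not a k-mer table; the reference string below is invented.)

```python
def build_kmer_index(reference, k):
    """Map every k-mer in `reference` to the positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

# Build once, then reuse for every query read.
reference = "ACGTACGTGGA"
index = build_kmer_index(reference, 4)
print(index["ACGT"])  # [0, 4] -> candidate alignment sites for a query seed
```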
$ bwa index data/ref_genome/ecoli_ref.fasta
## While the index is created, you will see output that looks something like this:
[bwa_index] Pack FASTA… 0.04 sec
[bwa_index] Construct BWT for the packed sequence…
[bwa_index] 1.05 seconds elapse.
[bwa_index] Update BWT… 0.03 sec
[bwa_index] Pack forward-only FASTA… 0.02 sec
[bwa_index] Construct SA from BWT and Occ… 0.57 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index data/ref_genome/ecoli_ref.fasta
[main] Real time: 1.765 sec; CPU: 1.715 sec
Align reads to reference genome
The alignment process consists of choosing a suitable reference genome to map our reads against and then choosing an aligner. We will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries, as it is faster and more accurate.
An example of what a bwa command looks like is below. This command will not run, as we do not have the files ref_genome.fa, input_file_R1.fastq, or input_file_R2.fastq.
$ bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam
We are running bwa with the default parameters here, your use case might require a change of parameters. NOTE: Always read the manual page for any tool before using and make sure the options you use are appropriate for your data.
We’re going to start by aligning the reads from just one of the samples in our dataset (SRR2584687). Later, we’ll be iterating this whole process on all of our sample files.
$ bwa mem data/ref_genome/ecoli_ref.fasta data/trimmed_fastq_small/SRR2584687_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584687_2.trim.sub.fastq > results/sam/SRR2584687.aligned.sam
##You will see output that starts like this:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 77446 sequences (10000033 bp)…
[M::process] read 77296 sequences (10000182 bp)…
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (48, 36728, 21, 61)
[M::mem_pestat] analyzing insert size distribution for orientation FF…
[M::mem_pestat] (25, 50, 75) percentile: (420, 660, 1774)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4482)
[M::mem_pestat] mean and std.dev: (784.68, 700.87)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 5836)
[M::mem_pestat] analyzing insert size distribution for orientation FR…
The SAM file is a tab-delimited text file that contains information for each individual read and its alignment to the genome. The compressed binary version of SAM is called a BAM file. We use this version to reduce file size and to allow for indexing, which enables efficient random access to the data contained within the file.
The file begins with an optional header. The header is used to describe the source of the data, the reference sequence, the method of alignment, etc.; this will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner-specific information. An example entry from a SAM file is displayed below with the different fields highlighted.
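The 11 mandatory fields can be pulled apart with a simple tab split; the alignment line below is invented for illustration:

```python
# Split one (invented) SAM alignment line into its 11 mandatory fields.
sam_line = "read1\t99\tchr1\t7\t60\t8M\t=\t37\t39\tTTAGATAA\tFFFFFFFF"

names = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
         "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
fields = sam_line.split("\t")
record = dict(zip(names, fields))

print(record["RNAME"], record["POS"], record["CIGAR"])  # chr1 7 8M
```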
We will convert the SAM file to BAM format using the samtools program with the view command, telling it that the input is in SAM format (-S) and that it should output BAM format (-b):
IGV is a genome browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like Ensembl or the UCSC browser, are slower but provide more functionality.
Locally on your own Mac or Windows computer
We need to open the IGV software. If you haven’t done so already, you can download IGV from the Broad Institute’s software page, double-click the .zip file to unzip it, and then drag the program into your Applications folder.
Load our reference genome file (ecoli_ref.fasta) into IGV using the “Load Genomes from File…“ option under the “Genomes” pull-down menu.
Load our BAM file (SRR2584687.aligned.sorted.bam) using the “Load from File…“ option under the “File” pull-down menu.
To load data from an HTTP URL:
Select File>Load from URL.
Enter the HTTP or FTP URL for a data file or sample information file.
If the file is indexed, enter the index file name in the field provided.
To load a file from Google Cloud Storage, enter the path to the file with the “gs://” prefix. Upload the following indexed/sorted BAM file with File -> Load from URL: http://faculty.xxx.edu/~kaya/Workshop/results/SRR20372154.fastq.bam
Controlling IGV from R
You can open IGV from within R with startIGV(“lm”) . Note this may not work on all systems. The testing URL (xxx.edu) is given below. You can try with your cluster URL.
I hope you find this tutorial useful to analyze your NGS data. I would like to thank Dilek Koptekin ( @dilekopter ) for reviewing the pipeline. If you have any questions, please get in touch with us without hesitation.
Babraham Bioinformatics website. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 2013 Dec 1.
Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/ Illumina FASTQ variants. Nucleic Acids Res 38: 1767-1771. doi: 10.1093/nar/gkp1137. PubMed: 20015970.
Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM (2013) An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE 8(12): e85024. doi:10.1371/journal.pone.0085024
Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 10-12.
Zhang J, Chiodini R, Badr A, Zhang G. The impact of next-generation sequencing on genomics. J Genet Genomics. 2011;38(3):95–109. doi:10.1016/j.jgg.2011.02.003
MicroRNAs (miRNAs) belong to the family of small non-coding RNAs and regulate many processes in the body by regulating mRNAs. They are 20- to 25-nucleotide-long small RNAs. Since they are short and their sequences can differ by only a small number of nucleotides (e.g., a single nucleotide, as in the case of SNPs), deep sequencing with high coverage is required to detect miRNAs and to identify novel sequences sensitively.
There are different tools available to investigate miRNAs, miRNA structures, their expression profiles, and so on. Although RNA-sequencing technology is still in its teenage years (Stark et al., 2019), miRNA sequencing technology is even more “immature” than RNA-seq or scRNA-seq, and so are the tools available for miRNA-sequencing data analysis. Besides, there are a limited number of tools available for the bioinformatics analysis of miRNA sequencing (Motameny et al., 2010; Kang and Friedlander, 2015; Chen et al., 2019). miRDeep2 (Mackowiak, S., 2011; Friedlander et al., 2012; Yang et al., 2011) is one of the most commonly used and recently updated tools to detect known (canonical) and novel (non-canonical) miRNA sequences. Although pipelines are available for miRNA sequencing, as in the case of the ENCODE Project pipelines, bioinformatics tools such as miRDeep2 are easier to use for people coming from different scientific backgrounds.
There are tutorials provided on the miRDeep2 GitHub pages. There are two GitHub links (old, new) and thus two different tutorials (old, new) available. Please make sure that you follow the tutorial provided on the most recent GitHub page.
Although the tutorial is shared on the GitHub page, a practical example run might be useful for people who are planning to use this tool for the first time. Therefore, I will share the required commands with you, with the warning that you need to be extra cautious.
Step 1: Download Ubuntu Terminal
This tool requires a Linux working environment. So, if you are using Windows, you need to download a program such as the Ubuntu Terminal or VirtualBox/a virtual machine to run the miRDeep2 package. For this, you need to open the Microsoft Store and choose to download Ubuntu (not the LTS ones but the terminal).
Step 2: Downloading miRDeep2 with conda install
If you try to install miRDeep2 without conda, you might encounter some problems. I strongly recommend using conda install. After installation, do not forget to test the Perl script mapper.pl.
#You need to open a new terminal here. You can follow the instructions given in this link. Because I want to download the files to Downloads on Windows instead of Linux, I specified the paths with "/mnt/c/Users/...".
Step 3: Testing miRDeep2 with the tutorial run
Before running your own analysis, it is better to do the tutorial run first to make sure that everything is alright with the tool. You can download the mature and hairpin miRNA files from miRBase.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ cd drmirdeep.github.io-master/
#cd is used to change to the given path/directory. You need to choose the directory where you downloaded the tutorial files.
#ls lists the files in the given folder
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master$ ls
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master$ cd drmirdeep.github.io-master/
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ ls
#grep to check how many of the reads have the adapter sequence
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ grep -c TGGAATTC example_small_rna_file.fastq
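To make the adapter check concrete, here is a tiny illustration with a made-up two-read FASTQ file (the file name and reads are invented for demonstration; in the real run you would use `example_small_rna_file.fastq`):

```shell
# Toy FASTQ with two reads; only read1 carries the TGGAATTC adapter
cat > toy.fastq << 'EOF'
@read1
ACGTACGTTGGAATTCAGTC
+
IIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
EOF

# grep -c counts the lines that contain the adapter sequence
grep -c TGGAATTC toy.fastq   # prints 1
```

A count close to the total number of reads tells you the adapter is present and should be clipped during mapping.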
#do not forget to extract the relevant (species-specific) sequences from the mature and hairpin miRNA files you downloaded from miRBase.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/mature.fa hsa > /mnt/c/Users/USER/Downloads/mature_hsa.fa
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/hairpin.fa hsa > /mnt/c/Users/USER/Downloads/hairpin_hsa.fa
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ extract_miRNAs.pl /mnt/c/Users/USER/Downloads/mature.fa mmu,chi > /mnt/c/Users/USER/Downloads/mature_other_hsa.fa
#to build the index file via bowtie1
#make sure that you do not use the same name for the input file, the reference genome, and the indexed output.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ bowtie-build refdb.fa refdb.fa
#to map the sample sequencing reads against the indexed genome file
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ mapper.pl example_small_rna_file.fastq -e -h -i -j -k TGGAATTC -l 18 -m -p refdb.fa -s reads_collapsed.fa -t reads_vs_refdb.arf -v -o 4
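One of the flags above, `-m`, makes mapper.pl collapse identical reads into unique sequences with read counts, which is why the output is called `reads_collapsed.fa`. Conceptually (this toy sketch is just an illustration, not what mapper.pl literally runs), collapsing is like counting duplicates:

```shell
# Three made-up reads, two of them identical
printf 'ACGTACGT\nACGTACGT\nTTTTAAAA\n' > reads.txt

# sort | uniq -c merges duplicate reads and prepends the copy number,
# and sort -rn puts the most abundant sequence first
sort reads.txt | uniq -c | sort -rn
```

The most abundant line comes out as `2 ACGTACGT`; mapper.pl stores this count in the FASTA header of each collapsed read.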
#to run the mirdeep2 analysis. You can find the detailed information regarding the parameters in the paper and the tutorial page.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads/drmirdeep.github.io-master/drmirdeep.github.io-master$ miRDeep2.pl reads_collapsed.fa refdb.fa reads_vs_refdb.arf mature_ref.fa mature_other.fa hairpin_ref.fa -t hsa 2>report.log
Step 4: Running the miRDeep2 for your sample
Before running miRDeep2, you might want to check the quality of your FASTQ files with FastQC. Although miRDeep2 has a built-in adapter trimming function, you might still need to use cutadapt depending on your data's specific needs. I will share example commands showing how to install the tools and do the adapter trimming.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ sudo apt-get update
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ sudo apt-get install fastqc
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ fastqc --extract /mnt/c/Users/USER/Downloads/S26.fastq.gz -o /mnt/c/Users/USER/Downloads/fastqc_results
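One pitfall worth flagging: FastQC does not create the directory passed to `-o` for you, and the run fails if it is missing. A minimal sketch (the directory name here is just an example):

```shell
# Create the output directory before the first FastQC run;
# -p makes this safe to repeat if the directory already exists.
mkdir -p fastqc_results

# then point FastQC's -o at it, e.g.:
# fastqc --extract S26.fastq.gz -o fastqc_results
```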
#to install cutadapt and run fastqc afterwards
#Let's say your adapter sequence is this: TAGCTGATCGATCTGAAACT
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ conda install -c bioconda cutadapt
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ cutadapt -a TAGCTGATCGATCTGAAACT /mnt/c/Users/USER/Downloads/S26.fastq > /mnt/c/Users/USER/Downloads/outputS26.fastq
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ fastqc --extract /mnt/c/Users/USER/Downloads/outputS26.fastq -o /mnt/c/Users/USER/Downloads
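To see what `cutadapt -a` is doing conceptually, here is a toy pure-shell illustration of 3' adapter trimming (this is not a substitute for cutadapt, which also handles mismatches and partial adapter matches; the read sequence is made up):

```shell
# A made-up read: 15 bases of insert, then the example adapter, then junk
read="ACGTACGTGATTACATAGCTGATCGATCTGAAACTNNNN"
adapter="TAGCTGATCGATCTGAAACT"

# ${read%%pattern} removes the longest matching suffix, i.e. everything
# from the first adapter occurrence onwards, keeping the insert sequence
trimmed="${read%%${adapter}*}"
echo "$trimmed"   # ACGTACGTGATTACA
```

After trimming, re-running FastQC (as in the command above) lets you confirm that the adapter content has disappeared from the report.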
#before this step, you need to download a reference genome file in FASTA (.fasta/.fa) format.
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ bowtie-build ucsc_hg19.fasta ucschg19
#You do not need to add the .fa extension to the index name
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ mapper.pl S26.fastq -e -h -i -j -k TAGCTGATCGATCTGAAACT -l 18 -m -p ucschg19 -s R___collapsed.fa -t R___refdb.arf -v -o 4
#You need to use index file as a reference here
(base) dincaslan@D:/mnt/c/Users/USER/Downloads$ miRDeep2.pl R___collapsed.fa ucsc_hg19.fasta R___refdb.arf mature_hsa.fa mature_other_hsa.fa hairpin_hsa.fa -t hsa 2> report.log
I hope you find this tutorial run useful. In addition to the websites given, whenever you have problems with the miRDeep2 run, I strongly recommend reading the documentation on the new GitHub page and the article, and checking (and, if necessary, asking) questions on Biostars.
I would like to thank my dear labmate Daniel Muliaditan for helping me remember/learn the basics of Linux and practice the miRDeep2 run in the Ubuntu Terminal (via the convenient way of handling such problems: using conda install). I would like to thank #AcademicTwitter, especially Dr. Ming Tang, for his extremely useful answer to my question 🙂
Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genet 20, 631–656 (2019). https://doi.org/10.1038/s41576-019-0150-2
Motameny, S., Wolters, S., Nürnberg, P. & Schumacher, B. Next generation sequencing of miRNAs – strategies, resources and methods. Genes 1, 70–84 (2010). https://doi.org/10.3390/genes1010070
Kang, W. & Friedländer, M. R. Computational prediction of miRNA genes from small RNA sequencing data. Front Bioeng Biotechnol 3, 7 (2015). https://doi.org/10.3389/fbioe.2015.00007
Chen, L., Heikkinen, L., Wang, C., Yang, Y., Sun, H. & Wong, G. Trends in the development of miRNA bioinformatics tools. Briefings in Bioinformatics 20, 1836–1852 (2019). https://doi.org/10.1093/bib/bby054
Mackowiak, S. D. Identification of novel and known miRNAs in deep-sequencing data with miRDeep2. Curr Protoc Bioinformatics Chapter 12, Unit 12.10 (2011). https://doi.org/10.1002/0471250953.bi1210s36
Friedländer, M. R., Mackowiak, S. D., Li, N., Chen, W. & Rajewsky, N. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Research 40, 37–52 (2012). https://doi.org/10.1093/nar/gkr688
You can add new resources to the relevant sheets or just check the list to find your new favourite book/course!
The list includes different types of resources for learning programming languages, a specific type of analysis, or pure theory. We also added a sheet for databases, which we hope will soon be full of exciting databases – you can add yours as well 🙂
Don’t forget to add the resources you found useful and share with your circle!
Please contact email@example.com for any suggestion/comment.
We as RSG-Turkey are so proud to be part of the great organizations ISCB and ISCB-SC. ISMB 2018, one of the conferences organized by ISCB, was held in Chicago between 6–11 July. I was nominated for the ISCB-SC RSG Leadership Travel Fellowship for the conference and had the opportunity to attend. Although this post is overdue, it has a bunch of highlights which should be recorded.
The first thing worth mentioning is the ISCB Communities of Special Interest (COSIs), which are topically focused collaborative communities of shared interest wherein scientists communicate with one another on research problems and/or opportunities in specific areas of computational biology. For detailed information about the sixteen COSIs of ISCB, click on the link. One of my favorites is the SysMod COSI, where I got a chance to present my Ph.D. project and meet great scientists as well as my future collaborators.
On the first day of the conference, Thomas Lengauer, the ISCB president, welcomed over 1,600 delegates to Chicago and started the tight schedule of ISMB 2018. During the event, the ISCB Conferences mobile application helped each participant create their own program.
The conference hosted very successful and interesting talks, including the keynotes. The opening keynote speaker was Steven Salzberg from the Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins University. His keynote, titled “25 years of human gene finding: are we there yet?”, focused on how the Human Genome Project was launched with the promise of revealing all of our genes, the “code” that would help explain our biology. The publication of the human genome in 2001 provided only a very rough answer to this question. For more than a decade afterwards, the number of protein-coding genes steadily shrank, but the introduction of RNA sequencing revealed a vast new world of splice variants and RNA genes. His talk reviewed where we’ve been and where we are today, and described a new effort to use an unprecedentedly large RNA sequencing resource to create a comprehensive new human gene catalog.

For the ISCB Overton Prize keynote, Cole Trapnell of the University of Washington gave an engaging and informative talk titled “Reconstructing and deforming developmental landscapes”, which focused on how developing embryos are comprised of highly plastic individual cells that shift from one functional state to another, often reversibly so. A cell executes a different gene expression program for each of its possible roles, switching between them as needed throughout its life. How does the genome encode the developmentally intended sequence of program switches? Which gene regulatory events are crucial for a given cell fate decision? Quantifying each gene’s contribution to governing even one developmental step is a staggeringly difficult challenge. However, massively scalable single-cell transcriptome and epigenome profiling offers a way to quantitatively dissect developmental regulatory circuits. He discussed new assays and algorithms developed by his laboratory to realize this goal, and offered some lessons from several recent projects.
Martha L. Bulyk from Brigham & Women’s Hospital and Harvard Medical School in Boston was the next day’s keynote speaker. Her talk, titled “Transcription factors and cis-regulatory elements”, focused on mapping the impact of unique variants on the expression of transcription factors. Specifically, it highlighted that similar target sequences can have far-reaching impacts when mutated. The difficulties associated with establishing a proper background were also addressed. The engaging talk culminated in an informative question and answer period.

Madan Babu of the MRC Laboratory of Molecular Biology in the United Kingdom was another keynote speaker. His talk focused on understanding how the amino acid sequence of a protein contributes to its function (the sequence–function relationship) and the foundation of the sequence–structure–function paradigm. He presented IDR-Screen, a high-throughput experimental and computational approach for discovering functional disordered regions in a biologically relevant context and identifying features of functional sequences through statistical learning.

The final keynote of the conference came from the ISCB Accomplishments by a Senior Scientist Award winner, Ruth Nussinov, whose inspiring talk, entitled “A woman’s computational biology journey”, traced her path through the field, beginning when revolutionary sequencing methods produced the first long DNA sequences and she developed an efficient algorithm to fold RNA, followed by pioneering bioinformatic DNA sequence analyses.
Throughout the conference days, attendees were able to meet and seek out information on new technologies, platforms, and ideas. In addition to having an opportunity to meet with exhibitors, attendees could view the poster presentations of the day to seek out new ideas and approaches. With nearly 300 attendees interacting with 15 recruiting entities, the ISCB Career Fair was also a notable event. The Career Fair provided a designated time for engaging discussion among talented candidates seeking positions in the fields of computational biology and bioinformatics.
Attending an ISCB conference is also a good chance to understand the ISCB organization structure; transparency is one of its strengths. Bruno Gaeta, ISCB Treasurer, reviewed the Society’s financial statements and current membership numbers. Scott Markel, the Nominations Co-Chair, reminded members to vote and gave a brief overview of the nominations process. The Student Council delivered their annual report and highlighted this year’s ISMB Student Symposium.
At every ISMB, ISCB offers poster and oral presentations and a number of travel fellowship opportunities, as well as competitions like Art in Science and the Wikipedia Competition. At ISMB 2018, the 2017–2018 Wikipedia winners were announced, the 2018 Art in Science winners were announced, and over 40 students and post-docs were recognized as ISMB travel fellowship recipients. The ISCB aims to improve the communication of scientific knowledge to the public at large, so the ISCB Wikipedia Competition aims to improve the quality of Wikipedia articles relating to computational biology. Entries to the competition are open now; the competition closes on 31 Dec 2018. Prizes of up to $500 will be awarded to the best contributions as chosen by a judging panel of experts; these will be awarded at the ISMB/ECCB conference in Basel, Switzerland in July 2019. Detailed information is available at this link. Another annual event is the Art in Science competition, which offers a way to show the beauty of science in art form. The winners are presented with a USD 200 prize, and their work becomes the feature cover image for the ISCB Fall Newsletter.
I have written up just some highlights from the conference; more information about the conference is also available in the ISCB newsletter. If you are curious about selected works from ISMB 2018 presenters, you can find the special issue in Bioinformatics.
If you feel sorry that you missed this breathtaking event after reading this post, no worries. You can watch the presentations online by clicking the link.
Moreover, please save the date for next ISMB in Basel, Switzerland between July 21 – July 25, 2019.
I have always felt the need to contribute to the social environment of the organisations I have been a part of, ever since I was a kid. I always took part in student groups and wanted to make a difference for everyone who would be in my position in the future. I think I have managed to do […]
The Bioinfonet project is an ISCB Student Council supported webinar project. Our main aim is to build bridges between bioinformatics professionals and young bioinformatics students, mainly those who live in developing countries and have difficulty joining the aura of the community. But anyone who wants to get an idea of how bioinformatics is done, who those bioinformaticians are and what they do, or who seeks a collaboration can find their “remedy” here.
Our main activity is on our community page. When you register to our community, you get an e-mail notice whenever we organize a new webinar – and nothing more (we hate spamming!). The other way to keep in touch with us is to follow us on Facebook and LinkedIn.
Most of the people who contact us ask, “Do you still continue organising webinars with rshfjskh Turkey?”. This “rshfjskh” feeling comes, I have concluded, because they do not know the abbreviation 🙂 So, here is the step-by-step tutorial for saying ISCB SC RSG Turkey without any struggle:
Let me explain why. ISCB is the main organisation, within which the Student Council functions. RSGs are connected to ISCB through the SC, not directly. For further information, please refer to the links I mentioned at the beginning.
RSG-Turkey is a member of The International Society for Computational Biology (ISCB) Student Council (SC) Regional Student Groups (RSG). We are a non-profit community composed of early career researchers interested in computational biology and bioinformatics.