Language Models Can Learn Complex Functional Properties of Proteins


Serbülent Ünsal

Serbulent Unsal received his B.Sc. degree in Statistics and Computer Sciences from Karadeniz Technical University in Turkey. Following his graduation, he continued with an M.Sc. degree in Medical Informatics at Middle East Technical University in Turkey. During the Master’s program he studied multiscale computational tumor modeling, developing a tumor progression model using cellular automata and partial differential equations with Dr. Aybar Can Acar. In 2014 he started his PhD at the same department on developing deep learning models for low-data protein function prediction. His thesis is also part of a large-scale research project on the discovery of new immune-escape mechanisms and drug repurposing against them. Currently, he is about to finish his PhD and works as a Senior ML Engineer at Antiverse, designing antibodies using machine learning and deep learning models.


Proteins are essential macromolecules for life. To understand and manipulate biological mechanisms, the functions of proteins should be understood, and this is possible through studying their relationship with the amino acid sequence and 3-D structure. So far, only a small percentage of proteins could be functionally characterized (currently ∼0.5% according to UniProt) due to the cost and time requirements of wet-lab-based procedures. Lately, protein function prediction (PFP), which can be defined as the annotation of proteins with functional definitions using statistical/computational methods, has gained importance as a way to explore the uncharacterized protein space and/or protein variants carrying function-altering changes. Among the many algorithmic approaches proposed so far, machine learning (ML), especially deep learning (DL), techniques have become popular in PFP due to their high predictive performance. The input data used by these ML/DL methods are numerical feature vectors representing the protein (i.e., protein representations), and they are mostly generated from amino acid sequences of proteins, which are readily available in databases (e.g., UniProt).
In this study, we evaluated protein representation methods for the prediction of functional attributes of proteins and benchmarked these methods in 4 challenging tasks, namely: (i) Semantic similarity inference (we calculated pairwise semantic similarities between human proteins using their gene ontology annotations and compared them with representation vector similarities to observe the correlation between them), (ii) Ontological protein function prediction (we built GO term categories based on term specificities and sample sizes, which reflect different levels of predictive difficulty, and evaluated representation methods by training/validating ML models on these datasets), (iii) Drug target protein family classification (five major target families were selected and methods were evaluated in terms of classifying proteins into families via ML models), and (iv) Protein-protein binding affinity estimation (we used the SKEMPI dataset to evaluate methods in estimating protein-protein binding affinity changes upon mutations). We evaluated 23 protein representation methods in total, including both classical approaches and cutting-edge representation learning methods, to observe whether these novel approaches have advantages over classical ones in terms of extracting high-level/complex properties of proteins that are hidden in their sequence. Finally, we provide an open-access tool, PROBE (Protein RepresentatiOn BEnchmark), where the user can assess new protein representation models over the above-mentioned benchmarking tasks with a single line of code.
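To make task (i) concrete, the comparison between representation-vector similarities and GO-based semantic similarities could be sketched roughly as follows. This is only an illustration and not the PROBE implementation: the choice of cosine similarity and Spearman rank correlation is an assumption, and the protein names, vectors and semantic similarity values are hypothetical toy data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical representation vectors for four proteins (toy data).
embeddings = {
    "P1": [0.9, 0.1, 0.0],
    "P2": [0.8, 0.2, 0.1],
    "P3": [0.0, 0.1, 0.9],
    "P4": [0.1, 0.0, 0.8],
}
# Hypothetical GO-based semantic similarities for each protein pair (toy data).
go_similarity = {
    ("P1", "P2"): 0.85, ("P1", "P3"): 0.10, ("P1", "P4"): 0.15,
    ("P2", "P3"): 0.20, ("P2", "P4"): 0.25, ("P3", "P4"): 0.90,
}

pairs = sorted(go_similarity)
vector_sims = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
semantic_sims = [go_similarity[p] for p in pairs]
rho = spearman(vector_sims, semantic_sims)
print(f"Spearman correlation: {rho:.3f}")
```

A representation whose vector similarities rank protein pairs in the same order as their GO-based similarities would score close to 1 under this kind of measure.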

Date: July 6th, 2022 – 18:00 (GMT+3)

Language: English

You can register for this webinar here!

Computational Challenges in Protein-RNA Interactions


Asst. Prof. Yaron Orenstein

Yaron Orenstein is a Senior Lecturer and the head of the Computational Biology lab at the School of Electrical and Computer Engineering at Ben-Gurion University of the Negev. Yaron completed his BSc summa cum laude in Electrical Engineering and Computer Science at Tel-Aviv University, where he continued on a direct MSc track under the supervision of Prof. Dana Ron. He then completed his PhD in Computer Science at Tel-Aviv University supervised by Prof. Ron Shamir, where he received numerous awards and fellowships, such as the Deutch prize and the Dan David fellowship. He completed his post-doctoral training at Massachusetts Institute of Technology with Prof. Bonnie Berger, and spent a semester as a Research Fellow at the Simons Institute for the Theory of Computing. In the last four and a half years, Yaron has been the head of a fruitful and productive lab with numerous publications, grants, and graduating students. He authored more than 40 journal manuscripts and conference proceedings papers, received grants from the ISF, BSF, NIH, ICA, and IIA, and mentored more than 15 graduate students. His main research interests include sequence design problems and application of deep neural networks in genomics.


Protein-RNA interactions play vital roles in many cellular processes, and as a result are the main focus of many biological studies. Biologists would like to efficiently measure protein-RNA interactions in high-throughput, and based on these high-throughput experimental measurements train accurate machine-learning models to predict interactions to new RNA sequences. In the talk, I will present solutions to both challenges: design of efficient high-throughput experiments, and training highly accurate predictive models on high-throughput genomic data. First, I will present DeCoDe, a new method based on Integer Linear Programming to design protein-coding templates to efficiently cover many proteins in a single high-throughput experiment. DeCoDe outperforms extant methods for the task, and newly enables features that were not possible before, such as covering variable-length proteins and optimizing globally over multiple templates. Second, I will present DeepUTR, a new method based on Deep Learning to predict mRNA degradation dynamics based on the 3’-UTR sequence of an mRNA. DeepUTR outperforms extant methods for the task, and newly enables prediction of mRNA levels at various time points. Moreover, we extended the Integrated Gradients interpretability approach to handle multiple input types, and using the extended approach discovered known and novel regulatory 3’-UTR elements associated with mRNA degradation. I will conclude my talk with future plans on both sequence design problems, and deep neural networks applications in genomics.

Date: June 14th, 2022 – 11:00 AM (GMT+3)

Language: English

You can register for this webinar here!

Open Student Webinars Acıbadem University


Alara Erenel

The Co-Occurrence of X, Y, and Z SNPs in BRCA1 Gene: an in silico Investigation


Breast Cancer (BC) is the most common cancer type seen in women and the third most common one worldwide, with an increasing rate of cases. Genomic studies suggest that the X, Y, and Z SNPs on Breast Cancer Gene 1 (BRCA1) may co-occur and affect BC formation. If they co-occur, investigating them individually would be misleading. The aim of this research was to investigate whether 1) X, Y, and Z are on the same haplotype and co-occur and 2) the co-occurrence of X, Y, and Z is pathogenic. The frequencies and conservation scores of the SNPs, haplotype status, linkage disequilibrium (LD), dual and triple co-occurrence statuses, BRCA1 transcripts, and possible protein changes were investigated through data portals and R. The association between the co-occurrence of X, Y, and Z and cancer/BC formation was assessed by comparing the healthy dataset (2,504), the general cancer dataset (296), and the BC dataset (98). Associations were tested with logistic regression, odds ratio, Fisher’s Exact Test, and Chi-Square test. All results were cross-checked with the variant classification guidelines for pathogenicity. The SNPs were consistent with the same haplotype pattern, and the co-occurrence analyses supported the co-occurrence of the three SNPs and strengthened the pathogenicity hypothesis. The odds of having cancer (Odds Ratio (OR): 34.28, probability value (p-value): 0.0006) and BC (OR: 52.15, p-value: 0.0041) were shown to be significantly higher for individuals with triple co-occurrence. More in vitro research needs to be done to strengthen the evidence obtained in silico.
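As an illustration of the odds ratio reported above, the calculation from a 2x2 carrier/non-carrier table might look like the sketch below. The counts are hypothetical and are not the study's data; the Woolf (logit) confidence interval and the 0.5 continuity correction are assumptions chosen for the sketch, since the abstract does not state how intervals were computed.

```python
import math

def odds_ratio_with_ci(carrier_cases, noncarrier_cases,
                       carrier_controls, noncarrier_controls, z=1.96):
    """Odds ratio and approximate 95% CI (Woolf logit method) for a 2x2 table."""
    a, b = float(carrier_cases), float(noncarrier_cases)
    c, d = float(carrier_controls), float(noncarrier_controls)
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction so that empty cells do not divide by zero.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * standard_error)
    high = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, low, high

# Hypothetical counts: 10 of 100 patients and 5 of 500 healthy individuals
# carry the triple co-occurrence.
or_value, low, high = odds_ratio_with_ci(10, 90, 5, 495)
print(f"OR = {or_value:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

An interval that excludes 1 would point in the same direction as a significant p-value from the tests listed in the abstract.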

Alper Bülbül

Variant pathogenicity prediction tool with 3-D and sequence analyses of protein-protein interactions


Motivation: A large number of patient samples can be analyzed with the developing next-generation sequencing and protein interaction technologies. In this way, we see that many genes are involved, especially in autoinflammatory monogenic diseases. At the same time, the number of variants associated with diseases is increasing. We used protein-protein interactions and 3D structure analysis for the classification of a large number of variants. Results: 3D docking scores, sequence-based interaction scores and delta delta Gibbs free energy (ddG) values from stability analysis were computed for binary protein interactions from the STRING and IntAct databases. ZDOCK and SPRINT values were weighted according to the HGPEC gene rank scores of genes with a variant in 36 monogenic autoinflammatory diseases. When the relationships between ZDOCK, SPRINT, and ddG values were examined in the benign and pathogenic variant groups, we found that the ZDOCK and SPRINT values were positively correlated with each other. In addition, ddG values were negatively correlated with ZDOCK and SPRINT values. A total of 702 missense disease-associated variants were retrieved from the Infevers database. Since there was an imbalance between the sample numbers of 130 benign and 572 pathogenic mutations, we created synthetic data with the SMOTE algorithm. The ROC AUC of the model, built with the Random Forest algorithm, is 97%.
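The core idea of SMOTE, generating synthetic minority-class samples by interpolating between a real sample and one of its nearest minority-class neighbours, can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline or a library implementation; the function name, the three-value feature vectors and the numbers are hypothetical. With 130 benign and 572 pathogenic variants, balancing the classes would require 442 synthetic benign samples.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the chosen sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(base, p)),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append([x + gap * (y - x) for x, y in zip(base, partner)])
    return synthetic

# Hypothetical feature vectors for a handful of benign variants (toy values).
benign = [
    [0.20, 0.35, -1.2],
    [0.25, 0.30, -0.9],
    [0.15, 0.40, -1.5],
    [0.30, 0.25, -1.0],
    [0.22, 0.33, -1.1],
]
new_benign = smote_like_oversample(benign, n_new=10)
print(len(benign) + len(new_benign), "benign samples after oversampling")
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.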

Ekin Köni

Integrated Analysis of Mutated Genes in Leptomeningeal Metastasis Caused by Breast Cancer


Background/aim: Leptomeningeal carcinomatosis (LMC) is a rare type of cancer that settles through metastasis from a tumor in the body to the meninges and affects the brain, spinal cord, and nerves, causing sudden neurological disorders and death. The solid tumors that most commonly cause LMC include breast cancer, lung cancer, and melanoma. The average life expectancy of LMC patients under the prescribed treatments is 6 months. Due to the unknown molecular mechanism and genetic state of the disease, next-generation sequencing (NGS), whole-exome sequencing (WES) and RNA sequencing (RNA-seq) are being performed to investigate the transcriptome properties of circulating tumor cells (CTCs) found in cerebrospinal fluid (CSF). Currently, the diversification of cancer treatment and prolonged patient survival have also led to increased LMC incidence. Therefore, molecular studies investigating the development of LMC are required. The aim of this study is to gather information about the genes that are mutated in Breast-LMC studies in order to analyze possible molecular interactions. Results: According to our results, there were 24 mutated genes in total in Breast cancer-LMC. Seven of these were seen only in Breast cancer-LMC, one was shared with melanoma-LMC, and 11 were shared with NSCLC-LMC. The PPI network constructed with STRING showed interactions among these 24 genes. In addition, pathway enrichment analysis performed with g:Profiler and Cytoscape revealed the enriched pathways. The networks contained 24 nodes and 87 edges. Chromatin organization and modification of cellular content were some of the enriched pathways. Moreover, transcription regulation and immune system development, activation and regulation pathways were some of the most important pathways in which the mutated genes were involved. Finally, drugs that interact with breast cancer genes and are either approved or in clinical trials were identified with DrugBank and online tools.

İrem Çongur

Bioinformatic Analysis of Mutated Genes in Leptomeningeal Carcinoma Caused by Non-Small Cell Lung Cancer


Background/aim: Leptomeningeal carcinoma (LM) is mostly seen as a result of metastasis caused by melanoma, breast cancer, and non-small cell lung cancer (NSCLC) and is formed by the placement of tumor cells in the meninges of the brain. As a result of this metastasis, tumor cells also leak into the cerebrospinal fluid (CSF). The average survival time for LM patients is less than one year. Mutations and gene expression changes in patients are being studied with next-generation sequencing (NGS), whole exome sequencing (WES) and single-cell RNA sequencing (scRNA-seq), but no study has yet been conducted to elucidate the molecular mechanism of this disease. Since LM has a small patient population, studies on candidate marker genes are limited. Therefore, there is a great lack of information in the literature about its mechanism. The aim of this study is to analyze genes mutated in NSCLC-LM in order to determine which pathways may be involved in the development of LM. Results: 87 genes were found to be mutated in NSCLC-LM patients after classifying the mutated genes from 11 articles. Among the 87 genes, 65 were mutated only in NSCLC-LM patients. There were common mutations: 5 with both breast and melanoma LM, 6 with melanoma-LM, and 11 with breast-LM patients. The PPI network of mutated genes in NSCLC-LM, constructed using the STRING database, was composed of 87 nodes and 1181 edges. The EnrichmentMap plug-in of Cytoscape was used to construct a network of enriched pathways to visualize the output of g:Profiler. The network contained 856 nodes and 31411 edges. Using the MCODE plug-in, 25 clusters were created. Some of the clusters included the following pathways: regulation of cell cycle, DNA damage and repair, cell adhesion, regulation of cytoskeleton and cellular response to environmental stimulus. Finally, drugs that interact with 8 NSCLC biomarkers were identified with DrugBank and publicly available articles.

Date: May 21st, 2022 – 11:00 AM (GMT+3)

Language: English

Don’t forget to register for this event!

Novel full-length transcriptome analysis workflow ‘Nexons’ to uncover the regulation of poison exons in splicing factors in human germinal centre B cells


Özge Gizlenci

Ozge Gizlenci received her B.Sc. degree in Molecular Biology and Genetics from Middle East Technical University in Turkey. Following her graduation in 2015, she pursued an M.Sc. degree in Molecular Biosciences with a major in Cancer Biology at the University of Heidelberg. During the Master’s program, she took a semester abroad to start a joint project in her specialized interests, gene editing and stem cells, in the laboratory of Dr. Christian Brendel and Dr. David A. Williams at Dana-Farber/Boston Children’s Cancer and Blood Disorders Center, where she returned to her work with Dr. Christian Brendel as a researcher prior to her graduate studies. At Dana-Farber, she used the base editing method to correct a disease-causing mutation in Shwachman-Diamond syndrome, with the later goal of applying it to gene therapy approaches. In October 2018, she started her PhD position funded by the Marie Skłodowska-Curie Actions of the European Union’s Horizon 2020 research and innovation programme of the COSMIC consortium in the Immunology Programme at the Babraham Institute. Her PhD project with Dr. Martin Turner is focused on understanding the changes in gene expression and alternative splicing in B cells in response to positive selection signals in the germinal centre using long-read next-generation sequencing technologies (e.g. Oxford Nanopore Technology). She aims to investigate the relationship between alternative splicing and abnormally functioning adaptive immune cells in B cell malignancy and Rheumatoid Arthritis using both computational and molecular biology approaches.


Alternative splicing (AS) plays a major role in the differentiation of immune cells during an immune response, as 29% of AS genes are specific to the immune system. Although the role of AS has been extensively investigated in T cells, its role in B cell activation is less well characterised. We sought to develop a workflow based on long-read Oxford Nanopore Technologies (ONT) sequencing to understand post-transcriptional regulation at both the gene and isoform levels in human germinal centre B cells. As one of the challenges of ONT is the accurate computational analysis of isoforms, we developed the ‘Nexons’ pipeline to identify differentially spliced transcript variants using long-read sequencing. An in-depth analysis of splicing regulators with Nexons revealed that poison exons of splicing factors (e.g. SRSF3) were preferentially spliced out upon activation, whereas naïve B cells expressed isoforms carrying the poison exon, leading to nonsense-mediated mRNA decay. Moreover, we identified novel spliced variants of these genes, which were difficult to deconvolute using short-read data due to the limitations of short-read technology. Altogether, our findings validate the combination of Nexons with ONT cDNA-PCR sequencing as a suitable method for the identification and quantification of complex isoforms.

Date: May 20th, 2022 – 10:30 AM (GMT+3)

Language: English

You can register for this webinar here!

Deep Learning for Medical Image Analysis


Prof. Çiğdem Gündüz Demir

Çiğdem Gündüz Demir received her B.S. and M.S. degrees in computer engineering from Boğaziçi University in 1999 and 2001, respectively, and her Ph.D. degree in computer science from Rensselaer Polytechnic Institute in 2005. She is currently a Professor of Computer Engineering and the Deputy Director of the KUIS AI Center at Koç University. Before joining Koç University, she was working as a faculty member at the Computer Engineering Department at Bilkent University. She was a visiting professor at Nanyang Technological University NTU, Singapore, in Fall 2009, and Stanford University in Spring 2013. Her main research interests and projects include development of new computational methods based on deep learning and computer vision for medical image analysis. Currently, her research group works on the interdisciplinary projects in collaborations with the Departments of Pathology and Biology for the microscopic analysis of histopathological images and in vitro fluorescence and live cell images and with the Departments of Ophthalmology and Radiology for the analysis of images acquired with in vivo imaging of CT, MR, and OCT. She was a recipient of Distinguished Young Scientist of the Turkish Academy of Sciences and CAREER Award of the National Scientific and Technological Research Council of Turkey.


Automated imaging systems are becoming important tools for medicine and biology research as they facilitate rapid analyses with better reproducibility. Segmenting regions of interest on a medical image is typically the first but one of the foremost steps of these systems, which greatly affects the success of the entire analysis. In this talk, I will briefly mention the main challenges associated with segmentation tasks in medical image analysis, and then present examples of the dense prediction networks that my research group designed and implemented to address these challenges. Particularly, I will talk about our proposed network architectures and loss functions that were specifically designed to facilitate better training of the segmentation networks. At the end, I will discuss future research possibilities towards the direction of developing more robust segmentation networks for medical image analysis.

Date: April 27th, 2021 – 6:00 PM (GMT+3)

Language: English

You can register for this webinar here!

Modelling Complex Microbial Communities Using Metagenomic Data


Assoc. Prof. Niranjan Nagarajan

Dr. Nagarajan is Associate Director and Senior Group Leader in the Genome Institute of Singapore, and Associate Professor in the Department of Medicine and Department of Computer Science at the National University of Singapore. His research focuses on developing cutting-edge genome analytic tools and using them to study the role of microbial communities in human health. His team conducts research at the interface of genetics, computer science and microbiology, in particular using a systems biology approach to understand host-microbiome-pathogen interactions in various disease conditions. Dr. Nagarajan received a B.A. in Computer Science and Mathematics from Ohio Wesleyan University in 2000, and a Ph.D. in Computer Science from Cornell University in 2006 (Advisor: Prof. Uri Keich). He did his postdoctoral work in the Center for Bioinformatics and Computational Biology at the University of Maryland working on problems in genome assembly and metagenomics (Advisor: Prof. Mihai Pop).


The structure and function of diverse microbial communities is underpinned by ecological interactions that remain uncharacterized. With rapid adoption of next-generation sequencing for studying microbiomes, data-driven inference of microbial interactions based on abundance correlations is widely used, but with the drawback that ecological interpretations may not be possible. Leveraging cross-sectional microbiome datasets for unravelling ecological structure in a scalable manner thus remains an open problem. We present an expectation-maximization algorithm (BEEM-Static) that can be applied to cross-sectional datasets to infer interaction networks based on an ecological model (generalized Lotka-Volterra). The method exhibits robustness to violations in model assumptions by using statistical filters to identify and remove corresponding samples. Benchmarking against 10 state-of-the-art correlation based methods showed that BEEM-Static can infer presence and directionality of ecological interactions even with relative abundance data (AUC-ROC > 0.85), a task that other methods struggle with (AUC-ROC < 0.63). In addition, BEEM-Static can tolerate a high fraction of samples (up to 40%) being not at steady state or coming from an alternate model. Applying BEEM-Static to a large public dataset of human gut microbiomes (n = 4,617) identified multiple stable equilibria that better reflect ecological enterotypes with distinct carrying capacities and interactions for key species.
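For readers unfamiliar with the ecological model named above, the generalized Lotka-Volterra equations are commonly written as below. The notation is an assumption made for this note (x_i for the abundance of species i, \mu_i for its intrinsic growth rate, \beta_{ij} for the effect of species j on species i) and may differ from the speaker's.

```latex
% Generalized Lotka-Volterra dynamics for species i = 1, ..., n
\frac{dx_i}{dt} = x_i \left( \mu_i + \sum_{j=1}^{n} \beta_{ij}\, x_j \right)

% At a steady state x^*, every species that is present (x_i^* > 0) satisfies
\mu_i + \sum_{j=1}^{n} \beta_{ij}\, x_j^* = 0

% In the absence of all other species, species i settles at its carrying capacity
K_i = -\frac{\mu_i}{\beta_{ii}} \qquad (\beta_{ii} < 0)
```

The steady-state relation is linear in the abundances, which gives some intuition for why samples assumed to be near equilibrium can carry information about the interaction terms even without time-series data.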

Date: April 13th, 2021 – 10:00 AM (GMT+3)

Language: English

You can register for this webinar here!

Open Student Webinars – Gebze Technical University


Dilara Uzuner

Transcriptional landscape of cellular networks reveal interactions driving the dormancy mechanisms in cancer


Primary cancer cells exert unique capacity to disseminate and nestle in distant organs. Once seeded in secondary sites, cancer cells may enter a dormant state, becoming resistant to current treatment approaches, and they remain silent until they reactivate and cause overt metastases. To illuminate the complex mechanisms of cancer dormancy, 10 transcriptomic datasets from the literature enabling 21 dormancy–cancer comparisons were mapped on protein–protein interaction networks and gene-regulatory networks to extract subnetworks that are enriched in significantly deregulated genes. The genes appearing in the subnetworks and significantly upregulated in dormancy with respect to proliferative state were scored and filtered across all comparisons, leading to a dormancy–interaction network for the first time in the literature, which includes 139 genes and 1974 interactions. The dormancy interaction network will contribute to the elucidation of cellular mechanisms orchestrating cancer dormancy, paving the way for improvements in the diagnosis and treatment of metastatic cancer.

Ecehan Abdik

Systematic investigation of mouse models of Parkinson’s disease by transcriptome mapping on a brain-specific genome-scale metabolic network


Genome-scale metabolic networks enable systemic investigation of metabolic alterations caused by diseases by providing interpretation of omics data. Although Mus musculus (mouse) is one of the most commonly used model organisms for neurodegenerative diseases, a brain-specific metabolic network model of mouse has not yet been reconstructed. Here we reconstructed the first brain-specific metabolic network model of mouse, iBrain674-Mm, by a homology-based approach, which consisted of 992 reactions controlled by 674 genes and distributed over 48 pathways. We validated the newly reconstructed network model by showing that it predicts healthy resting-state metabolic phenotypes of mouse brain compatible with literature. We later used iBrain674-Mm to interpret various experimental mouse models of Parkinson’s Disease (PD) at the transcriptome level. To this aim, we applied a constraint-based modelling based biomarker prediction method called TIMBR (Transcriptionally Inferred Metabolic Biomarker Response) to predict altered metabolite productions from transcriptomic data. Systemic analysis of seven different PD mouse models by TIMBR showed that neuronal levels of glutamate, lactate, creatine phosphate, neuronal acetylcholine, bilirubin and formate increased in most of PD mouse models whereas levels of melatonin, epinephrine, astrocytic formate and astrocytic bilirubin decreased. Although most of the predictions were consistent with the literature, there were some inconsistencies among different PD mouse models, signifying that there is no perfect experimental model to reflect PD metabolism. The newly reconstructed brain-specific genome-scale metabolic network model of mouse can make important contributions to the interpretation and development of experimental mouse models of PD and other neurodegenerative diseases.

Hatice Büşra Lüleci

iMAT application as an integration method in Alzheimer’s disease in order to predict reaction activity


Alzheimer’s disease (AD) is the most common cause of dementia. There is increasing evidence of a possible link between the incidence and progression of AD and metabolic dysfunction. Determining the changes in the activity of metabolic pathways should be of major interest in the treatment of AD. Mapping sample-based gene expression levels onto Human-GEM with the Integrative Metabolic Analysis Tool (iMAT) optimization algorithm led to personalized metabolic networks. Each personalized metabolic network for healthy and disease cases has a different number of reactions and genes. This variation across personalized models reveals the inherent heterogeneity of control and AD samples and justifies our personalized approach. Reactions in each model were converted to binary vectors. These categorical data were analyzed with Fisher’s exact test. Based on these calculations, significantly changed reactions and pathways were detected. Mapping biochemical alterations associated with AD is crucial to fill knowledge gaps on the disease mechanisms.
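The binary-vector comparison described above can be illustrated for a single reaction as follows. This is a minimal sketch, not the authors' code: the presence/absence vectors are hypothetical, and the two-sided p-value is computed with the common convention of summing all tables that are no more probable than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical binary vectors: 1 if the reaction is active in a personalized model.
ad_models = [1, 1, 1, 0, 1, 1]
control_models = [0, 0, 1, 0, 0, 1]

active_ad = sum(ad_models)
inactive_ad = len(ad_models) - active_ad
active_control = sum(control_models)
inactive_control = len(control_models) - active_control

p_value = fisher_exact_two_sided(active_ad, inactive_ad,
                                 active_control, inactive_control)
print(f"p-value for this reaction: {p_value:.4f}")
```

Repeating the test over every reaction in the network would, in practice, also call for a multiple-testing correction, which this sketch leaves out.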

Müberra Fatma Cesur

Network-based metabolism-centered screening of potential drug targets in Klebsiella pneumoniae at genome scale


Klebsiella pneumoniae is an opportunistic bacterial pathogen leading to life-threatening nosocomial infections. Emergence of highly resistant strains poses a major challenge in the management of the infections by healthcare-associated K. pneumoniae isolates. Thus, despite intensive efforts, the current treatment strategies remain insufficient to eradicate such infections. Failure of the conventional infection-prevention and treatment efforts explicitly indicates the requirement of new therapeutic approaches. This prompted us to systematically analyze the K. pneumoniae metabolism to investigate drug targets. Genome-scale metabolic networks (GMNs) facilitating the systematic analysis of the metabolism are promising platforms. Thus, we used a GMN of K. pneumoniae MGH 78578 to determine putative targets through gene- and metabolite-centric approaches. To develop more realistic infection models, we performed the bacterial growth simulations within different host-mimicking media, using an improved biomass formation reaction. We selected more suitable targets based on several property-based prioritization procedures. KdsA was identified as the high-ranked putative target satisfying most of the target prioritization criteria specified under the gene-centric approach. Through a structure-based virtual screening protocol, we identified potential KdsA inhibitors. In addition, the metabolite-centric approach extended the drug target list based on synthetic lethality. This revealed the importance of combined metabolic analyses for a better understanding of the metabolism. To our knowledge, this is the first comprehensive effort on the investigation of the K. pneumoniae metabolism for drug target prediction through the constraint-based analysis of its GMN in conjunction with several bioinformatic approaches. This study can guide the researchers for the future drug designs by providing initial findings regarding crucial components of the Klebsiella metabolism.

Date: April 8th, 2022 – 2:00 PM (GMT+3)

Language: English

Don’t forget to register for this event!


As one of the biggest student-driven organizations in Turkey, we are familiar with the challenges that students, especially post-graduate students, face. One of these problems is not having enough time or financial resources to present their data to fellow students and researchers in a scientific meeting, which hampers their visibility within the scientific community. As RSG Turkey, we have been conducting a webinar series called BioInfoNet for a long time. Over the years it has come to our attention that there was a rapid increase in the number of students who want to present their work, but we also noticed that many more have been refraining from presenting their data simply because BioInfoNet was seen as a project in which only postdocs and PIs could present their work. To overcome this misreading and to create a safe zone for all students, we have decided to start a new student webinar series: OPEN STUDENT WEBINARS.

In this new project, we aim for an open conference concept but with an online webinar approach, meaning all talks will be online and open to the public. However, instead of letting students give presentations in a random order, the students from the same university will be given a single day to present their work one after another in 30-40 minute presentations. Depending on the participation requests from the university, this can be rearranged as a two-day event. At the end of each talk there will be a discussion session in which all attendees can ask their questions. First, we aim to create a platform where students can present their work and improve their presentation skills. Second, we aim to encourage students from the same university to get to know each other’s work better and help each other out. Our third aim with this compact presentation concept is to let students and researchers from other universities learn about the bioinformatics-related studies of the presenting university as well as their approaches to these studies. And finally, our main and most important goal is to increase intra- and inter-university collaborations.

We will collect requests until the end of April and will arrange the presentation dates for each university based on a consensus on the availability of the students from that university.

qPCR Primer Design Tutorial

(The original tutorial I prepared with Microsoft Sway last year can be found here.)

PCR (Polymerase Chain Reaction) is a method widely used in the wet lab to amplify specific target sequences, either directly from DNA or from cDNA obtained by reverse-transcribing messenger RNA (mRNA); the latter is also known as RT-PCR.

Nowadays, it is also well known among the public because of its accuracy in detecting viral material during the pandemic caused by SARS-CoV-2 (COVID-19; for more information, click here).

Brief History of the Discovery of PCR

Shortly after oligonucleotides were first synthesized in the lab, the young scientist Kary B. Mullis was curious about making more copies of scarce genetic material for further studies. The question was “but how?”. This milestone technology was recognized with the Nobel Prize in Chemistry in 1993 (for more information, click here).

Advantages of PCR

  • It accurately produces millions of copies of a given target in a short period of time by taking advantage of the extremely heat-resistant (in contrast to human enzymes) Taq DNA polymerase from a thermophilic (heat-loving) bacterium.
  • It enables identification and modification of genetic materials.
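
The “millions of copies” figure follows from repeated doubling. Under the idealised assumption that every template is copied in every cycle, the copy number after n cycles is the starting number times 2^n. A minimal Python sketch (the function name and the `efficiency` parameter are illustrative, not taken from any library):

```python
def copies_after_cycles(start_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised PCR yield: each cycle multiplies the template by (1 + efficiency).

    efficiency=1.0 means perfect doubling. Real reactions plateau in late
    cycles, which this sketch deliberately ignores.
    """
    if cycles < 0 or not 0.0 <= efficiency <= 1.0:
        raise ValueError("cycles must be >= 0 and efficiency within [0, 1]")
    return start_copies * (1.0 + efficiency) ** cycles


# One starting molecule after 30 perfect cycles gives 2**30 copies.
print(copies_after_cycles(1, 30))
```

With less than perfect efficiency the same number of cycles yields far fewer copies, which is why primer efficiency matters for quantification.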

RT-qPCR is the real-time, quantitative version of regular PCR applied to reverse-transcribed nucleic acids such as transcripts. It allows amplification to be monitored during the cycles thanks to fluorescent dyes or probes.

It shares the same steps as regular PCR: denaturation, annealing and elongation.

How to amplify?

Apart from the DNA polymerase (the enzyme), a suitable buffer for the enzyme and the target nucleic acid sequence, you need proper primer sequences (forward and reverse) so that the DNA polymerase can bind and copy the target region of the sequence.

Primer pairs not only provide a docking and start site for the enzyme, but also determine the target site. Thus, it is important to design primers properly. Primers are usually 15-25 nucleotides (nt) long, single-stranded DNA sequences, which balances the trade-off between cost and specificity.
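
Because the forward primer matches the start of the target and the reverse primer binds the opposite strand at the other end, the reverse primer is written as the reverse complement of the target’s end. The sketch below shows that relationship together with the 15-25 nt length check. The function names and the toy sequence are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]


def has_preferred_length(primer: str, min_len: int = 15, max_len: int = 25) -> bool:
    """Check the 15-25 nt range mentioned in the text."""
    return min_len <= len(primer) <= max_len


# Toy target (hypothetical sequence, not ACTB).
target = "ATGGCCTTAGCAGGTCCATTGACCGTAAGCTTGGCATCGA"
forward = target[:18]                       # identical to the start of the target
reverse = reverse_complement(target[-18:])  # binds the opposite strand at the far end

print(forward, reverse, has_preferred_length(forward), has_preferred_length(reverse))
```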

Here, we will learn how to design a qPCR primer for a target sequence and how to analyze the data in the following slides.

Main objectives of the tutorial course are:

  1. Understanding the importance of q/PCR for biology
  2. Learning the basics of primer design for qPCR
  3. Introduction to NCBI
  4. Primer design: using NCBI-PrimerBLAST
  5. Primer design: using Primer3
  6. Importance of In Silico testing of primers
  7. In Silico PCR
  8. Learning about qPCR analysis: Reference Gene
  9. Learning about qPCR analysis: qPCR Data Analysis
  10. Additional notes about primer efficiency calculation and tissue specific expression

Using NCBI to find the detailed information about the target gene

NCBI is the website of the National Center for Biotechnology Information, hosted by the US government and open to everyone. It has many useful resources, including PubMed, BLAST, SRA for sequence data archiving and more. Today, we will focus on the Gene and Primer-BLAST features.

Type the target (gene) name into the NCBI “Search” box. For the sake of this tutorial, let’s continue with “ACTB” as the target. Search under the “Gene” category on the left.

Figure 1: Learn more about the target by typing its name in the “Search” box (e.g. ACTB). To make it easier, search in the “Gene” category.

You will see different options after you click search. These are either the genes corresponding to human ACTB in other species (defined as “orthologs”), or duplicates of the gene that arose during its evolutionary history within the same species and differ in sequence and/or function (defined as “paralogs”; for more information, please use this link).

Figure 2: Since our interest here is human ACTB gene, we will continue the tutorial based on it.

You can find various information about the target gene here, for example its exonic and intronic regions, the chromosome on which the gene is located, the tissue(s) in which it is expressed (in the Expression section), phenotypic relationships, single nucleotide polymorphisms (SNPs) and more. (UCSC Genome Browser can be an alternative tool at this point.)

Figure 3: Detailed location on the given chromosome.

When you scroll down a little bit, you can see all the transcripts (the expressed RNA) of the relevant gene.

Figure 4: There is a transcript regarding this gene. Let’s click on this and go to its sequence.

Let’s see the details of relevant mRNA of the gene simply from NCBI-Nucleotide instead of NCBI-Gene (there is more than one way of reaching the information).

Figure 5: Details of the mRNA. You can see the details of the CDS (coding sequence/region) when you scroll down.

There are two options here. The first option is the one shown in pink: Pick Primers. This directs you to Primer-BLAST to design primers. The second option, shown in the brown box, leads you to the FASTA sequence (the nucleotides in A, T, C, G format). You can copy and paste the sequence in FASTA format into alternative primer design tools such as Primer3. Let’s see the options in detail.
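
If you take the FASTA route, the text you copy consists of a header line starting with `>` followed by the sequence wrapped over several lines. A minimal parser, assuming a single record and using a made-up header and sequence, could look like this:

```python
def parse_fasta(text: str) -> tuple[str, str]:
    """Parse a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    header = lines[0][1:]                   # drop the leading '>'
    sequence = "".join(lines[1:]).upper()   # join wrapped lines, normalise case
    return header, sequence


# Hypothetical record for illustration only.
example = """>toy_transcript example mRNA
ATGGCCTTAGCAGG
tccattgaccgtaa
"""
header, sequence = parse_fasta(example)
print(header, len(sequence))
```

Joining the wrapped lines first makes it easier to check the sequence length or search it for primer sites before pasting it into another tool.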


If you have chosen “Pick Primers”, this will lead you to the page below:

Figure 6: Primer Blast, online primer design tool. The criteria for the primer design.

You can now change the parameters for your primers here. You can use the boxes with question marks on the right to learn more about each parameter.

Let’s look at some of them in detail.

  • PCR product size, shown in the red box, sets the length of the target region you want to amplify. If you are interested in specific transcripts of the gene, you might not need to amplify the whole sequence.

In this case, it is better to respect the optimal and maximum target lengths for SYBR Green (here, the fluorescent dye that binds double-stranded DNA, also known as dsDNA, and emits the signal used for quantification). Although 500 base pairs (bp) is given as the maximum detection limit for a qPCR target, it usually works best for targets of 100-200 nt.

  • You can use “# of primers to return”, shown in the orange box, to set the number of primer pairs you want returned (e.g. show me the 10 most suitable pairs).
  • We usually use both forward and reverse primers to amplify the target. It is important for these primers to have similar melting temperatures, Tm (which is used to set the annealing temperature). You can adjust this using the box in blue (e.g. show me primer pairs with a Tm of at least 57 and at most 63 °C, differing by at most 3 degrees from each other).
  • If you are not sure about your purpose, it is better to keep the exon-exon junction spanning option shown in the green box (you can find the details here).
  • You can open the results in a new window by selecting “show results in a separate window”, which saves some time if you need to change the parameters.
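
The Tm window set in the blue box can be mimicked offline. The sketch below uses the Wallace rule (2 °C per A/T, 4 °C per G/C), which is only a rough estimate for short oligos and is not the thermodynamic model that Primer-BLAST uses, so its numbers will differ from the tool’s report. The function names and example primers are hypothetical.

```python
def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def tm_compatible(fwd: str, rev: str, tm_min: int = 57, tm_max: int = 63,
                  max_diff: int = 3) -> bool:
    """True if both primers fall within [tm_min, tm_max] and differ by at most max_diff."""
    tm_f, tm_r = wallace_tm(fwd), wallace_tm(rev)
    in_range = tm_min <= tm_f <= tm_max and tm_min <= tm_r <= tm_max
    return in_range and abs(tm_f - tm_r) <= max_diff


fwd = "ATGGCCTTAGCAGGTCCATT"  # hypothetical 20-mer
rev = "TCGATGCCAAGCTTACGGT"   # hypothetical 19-mer
print(wallace_tm(fwd), wallace_tm(rev), tm_compatible(fwd, rev))
```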

You can always use “advanced parameters” if you know what you are doing with those parameters. Otherwise, go with the default.

Figure 7: Advanced parameters such as changing the algorithm, GC content, etc.

By using advanced parameters, you can optimize the “Primer size” and “Primer GC content” settings. (I prefer 40-60% for GC content and 18-20-23 nt (minimum, optimum, maximum) for primer length. Please keep in mind that primer efficiency might change with the GC/AT ratio because of the difference in hydrogen bond number: 3 for G-C, 2 for A-T.)
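
To see how GC content and hydrogen bonding relate for a given primer, both can be counted directly. This is a small sketch with illustrative function names and a hypothetical 20-mer:

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    p = primer.upper()
    if not p:
        raise ValueError("primer must not be empty")
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def hydrogen_bonds(primer: str) -> int:
    """Hydrogen bonds formed with a perfect complement: 3 per G-C pair, 2 per A-T pair."""
    p = primer.upper()
    gc = p.count("G") + p.count("C")
    at = p.count("A") + p.count("T")
    return 3 * gc + 2 * at


primer = "ATGGCCTTAGCAGGTCCATT"  # hypothetical 20-mer
print(gc_percent(primer), hydrogen_bonds(primer))
```

Two primers of the same length can therefore bind with different strength, which is one reason to keep both primers of a pair in a similar GC range.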

Let’s wait a little bit after the submission.

Figure 8: While waiting after the submission of primer design criteria for the target.

In the next window, you will see the region your primer pairs will be amplifying. You will also see the details of the primer pairs (e.g. annealing temperature, whether there is self-complementarity or not).

Figure 9: The graph showing the suitable primers on the target sequence.

The black boxes are exons. The yellow box shows the exon-exon spanning region. The red one shows the protein-coding region (not every transcript encodes a protein).

The paired blue arrows show the primer pairs on the target that fit the parameters we selected.

Please keep in mind that we do not want to select the 3′ end of the transcript (rightmost region) because of how RNA degrades: the 3′ end is particularly affected. The 5′ end (leftmost) might be safer.

Even if they seem okay, we need to look at them in detail.

Figure 10: Detailed report for the matching primer pairs.

When we look at the report, we see that the target length is 162 (within the ideal 100-200 range) and that the pair spans an exon-exon junction. Primer lengths are ideal, too: 19 and 20. Tm is around 60 °C. The GC ratio is okay: within 40-60%. There is a slight possibility of self-complementarity (unwanted primer dimers), but it might be in the acceptable range.
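
To get a feeling for what self-complementarity means, you can check whether the 3′ end of a primer can base-pair with itself or with its partner. The sketch below is a crude exact-match test, not the scoring that Primer-BLAST reports. The function names, the tail length `k` and both primers are assumptions for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def three_prime_dimer(primer_a: str, primer_b: str, k: int = 4) -> bool:
    """True if the last k bases of primer_a can pair perfectly with a stretch of primer_b."""
    tail = primer_a.upper()[-k:]
    # A stretch pairs with the tail when it equals the tail's reverse complement.
    return reverse_complement(tail) in primer_b.upper()


primer_x = "GGCATCGAGCTAGGAATT"    # hypothetical; ends in the palindrome AATT
primer_y = "ATGGCCTTAGCAGGTCCATT"  # hypothetical
print(three_prime_dimer(primer_x, primer_x), three_prime_dimer(primer_y, primer_y))
```

Passing the same primer twice tests for self-dimers, and passing the forward and reverse primers tests for cross-dimers.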

On the other hand, if the pair targets other variants or genes besides the target, then there is an issue. Unfortunately, we see that other regions can be amplified in addition to the target. If this is confirmed by in silico PCR, it means these pairs are not specific enough and therefore not suitable.

In this case, an alternative tool might be useful. For example, Primer3.


The home page of Primer3 is here. As a next step, just copy and paste the FASTA sequence from NCBI for the target gene you want to amplify. For ACTB, it is here.

Figure 11: Primer 3 intro page; copy and paste the mRNA FASTA sequence (generic task).

Here, you have similar parameters (e.g. primer size, target size, GC %, minimum-maximum Tm) that you can play with.

Figure 12: General Primer Selection Criteria.

Click “pick primers” after adjusting the parameters based on your need:

Figure 13: Primer 3 results after submission of the criteria.

In the yellow box, you see the details of the primer pair, while the red box shows where it amplifies.

When you scroll down, you can see 9 primer pair candidates that fit the parameters.

Figure 14: The primer pairs (forward and reverse). The maximum number of pairs to return was set to 9.

It would be better to choose primer pairs closer to the left side (unless you have a specific aim requiring the 3′ end) for the reasons explained above. Let’s choose the second alternative in the 100-200 nt target range (pairs 1 and 8 can be alternatives if pair 2 is not a good candidate).

Then, we will use in silico PCR to confirm.

In silico PCR

Before setting up qPCR with the primers or ordering them, it is better to test them in silico. This lets you see how specific your primers are for the target sequence in the latest genome assembly.

Figure 15: Home page of UCSC Genome Browser.

We could not choose the exon-exon spanning option in Primer3. However, this can also be checked in silico.

If there is at least one intron between the primers, a product longer than the target Primer3 shows, seen when you run an agarose gel after q/PCR, is most likely due to DNA contamination of the RNA sample, whereas exon-exon spanning regions exist only in transcripts.

If the primers are not specific and amplify sequences other than the target (multiple, non-specific targets), the In Silico PCR tool of UCSC will help you catch this (as a dry-lab tool).

Figure 16: In Silico PCR tool.

It would be better to choose the latest assembly of the human genome for a more accurate representation. Then copy and paste the forward and reverse primers for the given region.

Figure 17: The result for In Silico PCR.

The in silico PCR result shows that the primers amplify a region on chromosome 7, which is where ACTB resides. There is also only one amplified target, which is great. Besides, the genomic product contains at least one intronic region (it is longer than the target region, 182 vs. 163), which is absent from the transcriptome, where only exons are present. Therefore, if there is any DNA contamination and the wet-lab PCR product is longer than the target, you can easily spot the difference and its cause.
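
The idea behind this comparison can be sketched on toy sequences: find where the forward primer and the reverse complement of the reverse primer sit on a template, then measure the distance between them. The code below only handles exact matches on one strand, unlike the UCSC tool, and all sequences and names in it are made up.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def in_silico_pcr(template: str, fwd: str, rev: str) -> list[int]:
    """Return the product length for every exact forward/reverse primer match."""
    template, fwd = template.upper(), fwd.upper()
    rev_site = reverse_complement(rev.upper())  # reverse primer as it appears on this strand
    lengths = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_site, start + len(fwd))
        while end != -1:
            lengths.append(end + len(rev_site) - start)
            end = template.find(rev_site, end + 1)
        start = template.find(fwd, start + 1)
    return lengths


# Hypothetical two-exon gene: genomic DNA keeps the intron, cDNA does not.
exon1 = "ATGGCCTTAGCAGGTCCATT"
intron = "GTAAGTCCTTTCTCTGCAG"
exon2 = "GACCGTAAGCTTGGCATCGA"
genomic, cdna = exon1 + intron + exon2, exon1 + exon2

fwd = exon1[:17]
rev = reverse_complement(exon2[-18:])
print(in_silico_pcr(cdna, fwd, rev), in_silico_pcr(genomic, fwd, rev))
```

The genomic product is longer than the cDNA product by exactly the intron length, which is the signature of DNA contamination described above.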

Reference (Housekeeping) Gene

qPCR is done to measure gene expression. However, measured expression can vary for intrinsic or technical reasons (e.g. unequal amounts of starting material) during the experiments. To rule out such bias in the data, a reference (housekeeping) gene is used. These are genes that are supposed to have a similar expression profile regardless of the condition, the mutation of the target or the time point.

Having said that, it is important to choose the reference gene carefully. It should have a similar expression profile across tissues (if you are comparing a target across different tissues) and should not be affected by the specific modification (e.g. a mutation introduced in the target).

Actin and GAPDH were historically chosen regardless of the study type. However, recent studies have shown that they are not always stable in every tissue or experimental condition. Because of this, there are now studies that help you identify the most appropriate reference gene. It is better to keep up with the recent literature.
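
If you have Ct values for several candidate reference genes across your own samples, a very simple first look at stability is the spread of each gene’s Ct values. The sketch below ranks candidates by standard deviation. It is a simplistic stand-in and not a published stability algorithm, and the gene names and Ct values are invented.

```python
from statistics import stdev


def rank_reference_genes(ct_by_gene: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank candidate genes by the standard deviation of their Ct values (lowest first)."""
    spread = {gene: stdev(cts) for gene, cts in ct_by_gene.items()}
    return sorted(spread.items(), key=lambda item: item[1])


# Made-up Ct values measured in four samples.
candidates = {
    "gene_A": [18.0, 18.2, 17.9, 18.1],
    "gene_B": [20.0, 22.5, 19.0, 21.5],
}
ranking = rank_reference_genes(candidates)
print(ranking[0][0])
```

A candidate whose Ct barely moves between conditions is a better normalizer than one that shifts by several cycles.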

For example, there is an article providing reference genes for each cancer type:

Conventionally used reference genes are not outstanding for normalization of gene expression in human cancer research (suggestions: HNRNPL, PCBP1, RER1)

This article also provides good insights into how to choose a reference gene:

Human housekeeping genes, revisited

qPCR Data Analysis

What do you expect to see when you analyze the qPCR data? Relative expression (e.g. gene A is expressed 10 times more than gene B) or absolute expression (e.g. gene A is expressed this much, whereas gene B is expressed that much)?

There are two common methods for qPCR data analyses (delta methods) suggested by Livak et al.:

  • 2^(-𝚫CT)
  • 2^(-𝚫𝚫CT)

An example table with a suspected periodontitis risk factor gene, gbgt1l3, to further investigate.

Method 1: [2^(-𝚫𝚫CT)]

In this method, we first normalize the target (gbgt1l3) to the reference (ACTB, b-actin) in each condition. Next, the 𝚫CT of the diseased sample is normalized to the 𝚫CT of the healthy sample, also called the calibrator. The result is the relative expression as a fold-change.

𝚫CT (calibrator) = CT(gbgt1l3) - CT(b-actin)

𝚫CT (calibrator) = 17 - 18 = -1

𝚫CT (disease) = CT(gbgt1l3) - CT(b-actin)

𝚫CT (disease) = 14 - 19 = -5

𝚫𝚫CT = 𝚫CT(disease) - 𝚫CT(calibrator)

𝚫𝚫CT = -5 - (-1) = -4

2^(-𝚫𝚫CT) = Normalized expression = 2^(-(-4)) = 2^4 = 16

What does it mean? The gbgt1l3 gene is expressed 16 times more in the diseased condition than in healthy individuals.
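
The same calculation can be written as a short function, which is handy when you have many samples. This is a minimal sketch that assumes perfect doubling in every cycle, as the method itself does. The function and argument names are illustrative.

```python
def fold_change_ddct(ct_target_test: float, ct_ref_test: float,
                     ct_target_cal: float, ct_ref_cal: float) -> float:
    """Relative expression by the 2^(-ddCT) method."""
    dct_test = ct_target_test - ct_ref_test  # dCT in the tested (e.g. diseased) sample
    dct_cal = ct_target_cal - ct_ref_cal     # dCT in the calibrator (e.g. healthy) sample
    ddct = dct_test - dct_cal
    return 2.0 ** (-ddct)


# Ct values from the example: disease 14 (gbgt1l3) and 19 (b-actin),
# calibrator 17 (gbgt1l3) and 18 (b-actin).
print(fold_change_ddct(14, 19, 17, 18))
```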

Method 2: [2^(-𝚫CT)]

In this method, we first find the expression of the target relative to the reference (i.e. gbgt1l3 vs. b-actin) within each condition (i.e. healthy, diseased).

2^(-𝚫CT) = 2^(CT of b-actin - CT of gbgt1l3) = relative expression

For calibrator: 2^(CT of b-actin - CT of gbgt1l3) = 2^(18-17) = 2^1 = 2

For disease: 2^(CT of b-actin - CT of gbgt1l3) = 2^(19-14) = 2^5 = 32

Then, express each value relative to the calibrator:

Healthy = calibrator/calibrator = 2/2 = 1

Disease = disease/calibrator = 32/2 = 16 times more gbgt1l3 expression.
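
Written as code, this second route computes the relative expression per condition first and takes the ratio afterwards. The names below are illustrative, and perfect doubling per cycle is again assumed.

```python
def relative_expression(ct_target: float, ct_reference: float) -> float:
    """2^(-dCT), where dCT = CT(target) - CT(reference)."""
    return 2.0 ** (ct_reference - ct_target)


healthy = relative_expression(ct_target=17, ct_reference=18)  # calibrator
disease = relative_expression(ct_target=14, ct_reference=19)

# Ratio of the diseased sample to the calibrator.
print(healthy, disease, disease / healthy)
```

Both routes give the same fold-change because dividing two powers of 2 subtracts their exponents, which is exactly the 𝚫𝚫CT step.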

Additional Notes

Gene Expression in Different Tissues

To learn which genes are expressed in which tissues and at what level, these online tools can help you find out quickly: GeneCards and Expression Atlas.

Primer Efficiency Calculation

Apart from the calculations above, there are other calculations that take primer efficiency into account.

Let’s first understand what “PCR efficiency” means. The target region matched by the primers is supposed to double in every cycle. However, we do not really know whether this is the case. In the calculations above, we assumed that it is. If you wonder how efficient the primers are under the given PCR conditions, you can test them with serial dilutions of cDNA.

If you halve the amount of cDNA, the material available for PCR to amplify is also halved. With every dilution, the PCR reaches the threshold later, i.e. at a higher threshold cycle (Ct). When you plot the Ct values against the log of the cDNA amount in Excel or GraphPad, you will get a line. This line shows how efficient your primers are.

R: correlation. If the correlation value is close to +/-1, the points fall on a straight line, which is what you expect when the target is amplified by a constant factor (ideally doubled) in every cycle.
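
The line can also be fitted without a spreadsheet. In the sketch below, Ct is regressed on log2 of the relative cDNA amount. If each cycle multiplies the target by a factor F, the slope equals -1/log2(F), so F = 2^(-1/slope) and the efficiency is F - 1. The function names and the dilution data are made up for illustration.

```python
import math


def fit_line(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Least-squares fit; returns (slope, intercept, Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


def amplification_efficiency(relative_input: list[float], ct_values: list[float]) -> tuple[float, float]:
    """Return (efficiency, r); an efficiency of 1.0 means perfect doubling per cycle."""
    log_input = [math.log2(v) for v in relative_input]
    slope, _, r = fit_line(log_input, ct_values)
    factor = 2.0 ** (-1.0 / slope)  # per-cycle amplification factor
    return factor - 1.0, r


# Hypothetical two-fold dilution series: each halving delays Ct by one cycle.
dilutions = [1, 1 / 2, 1 / 4, 1 / 8]
cts = [20.0, 21.0, 22.0, 23.0]
efficiency, r = amplification_efficiency(dilutions, cts)
print(efficiency, r)
```

Here r is negative because Ct rises as the input falls. The slope, rather than r alone, is what carries the efficiency.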

Why log? Because,

 log2(2) = 1

 log2(1/2) = -1

On a log scale, every doubling or halving changes the value by the same amount: a halving is a -1 fold-change and a doubling is a +1 fold-change in log2. In short, the log gives you a straight line for both increases and decreases.
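
A few values make the symmetry visible. This tiny sketch uses only Python’s standard `math.log2`, and the helper name is illustrative.

```python
import math


def log2_fold_change(fold_change: float) -> float:
    """Convert a fold-change (e.g. 2 or 0.5) to log2 scale."""
    return math.log2(fold_change)


# Doublings and halvings become equally spaced steps around zero.
for fc in (4, 2, 1, 0.5, 0.25):
    print(fc, log2_fold_change(fc))
```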


I would like to thank my master’s thesis supervisor Dr. Ozlen Konu, and my dear friends (particularly Ayse Gokce Keskus, Said Tiryaki, and Seniye Targen) from KONU Lab, who shared their experience and knowledge with me at the beginning of my academic life while I was practicing the basics of bioinformatics and wet-lab work.


  • Turkish version of this post: (by Fatma Betül Dinçaslan)
  • A Nobel-winning method: PCR:
  • PCR and COVID:
  • NCBI website:
  • Ortholog vs Paralog: Jensen, R.A. Orthologs and paralogs – we need to get it right. Genome Biol 2, interactions1002.1 (2001).
  • Exon-exon junction:
  • Primer3:
  • UCSC:
  • In silico PCR:
  • Eisenberg E, Levanon EY. Human housekeeping genes, revisited [published correction appears in Trends Genet. 2014 Mar;30(3):119-20]. Trends Genet. 2013;29(10):569-574. doi:10.1016/j.tig.2013.05.010
  • Jo, J., Choi, S., Oh, J. et al. Conventionally used reference genes are not outstanding for normalization of gene expression in human cancer research. BMC Bioinformatics 20, 245 (2019). https://doi.org/10.1186/s12859-019-2809-2
  • qPCR analysis: Livak KJ, Schmittgen TD. Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods. 2001;25(4):402-408. doi:10.1006/meth.2001.1262
  • Gene Cards:
  • Expression Atlas:
  • Primer Efficiency Calculation:
  • An example graph about how to evaluate the efficiency of qPCR:

Discovering Coding lncRNAs Using Deep Learning Training Dynamics


Afshan Nabi

Afshan is a machine learning engineer at OccamzRazor. She completed her MS in Computer Science from Sabanci University and her BS in Molecular Biology & Genetics from Bilkent University. She is interested in applying machine learning to solve problems in computational biology.


Long non-coding RNAs (lncRNAs) are the largest class of non-coding RNAs (ncRNAs). However, recent experimental evidence has shown that some lncRNAs contain small open reading frames (sORFs) that are translated into functional micropeptides. Current methods to detect misannotated lncRNAs rely on ribosome-profiling (ribo-seq) experiments, which are expensive and cell-type dependent. We present a framework that leverages deep learning models’ training dynamics to determine whether a given lncRNA transcript is misannotated. Our deep sequential learning models achieve AUC scores >91% and AUPR >93% in classifying non-coding vs. coding sequences while allowing us to identify possible misannotated lncRNAs present in the dataset. Our results overlap significantly with a set of experimentally validated misannotated lncRNAs as well as with coding sORFs within lncRNAs found by a ribo-seq dataset. The methodology offers promising potential for assisting experimental efforts in characterizing the hidden proteome encoded by misannotated lncRNAs and for curating better datasets for building coding potential predictors.

Date: October 14th, 2021 – 6:00 PM (GMT+3)

Language: English

You can register for the webinar here!

RSG-Turkey is a member of The International Society for Computational Biology (ISCB) Student Council (SC) Regional Student Groups (RSG). We are a non-profit community composed of early career researchers interested in computational biology and bioinformatics.

