Proteins are essential macromolecules for life. To understand and manipulate biological mechanisms, the functions of proteins must be understood, which is possible by studying their relationship with amino acid sequence and 3-D structure. So far, only a small percentage of proteins have been functionally characterized (currently ∼0.5% according to UniProt) due to the cost and time requirements of wet-lab procedures. Recently, protein function prediction (PFP), which can be defined as the annotation of proteins with functional definitions using statistical/computational methods, has gained importance as a way to explore the uncharacterized protein space and/or protein variants carrying function-altering changes. Among the many algorithmic approaches proposed so far, machine learning (ML) techniques, especially deep learning (DL), have become popular in PFP due to their high predictive performance. The input data used by these ML/DL methods are numerical feature vectors representing the protein (i.e., protein representations), which are mostly generated from the amino acid sequences of proteins, as these are readily available in databases (e.g., UniProt).
In this study, we evaluated protein representation methods for the prediction of functional attributes of proteins and benchmarked these methods in four challenging tasks, namely: (i) Semantic similarity inference (we calculated pairwise semantic similarities between human proteins using their Gene Ontology (GO) annotations and compared them with representation vector similarities to observe the correlation between the two), (ii) Ontological protein function prediction (we built GO term categories based on term specificities and sample sizes, which reflect different levels of predictive difficulty, and evaluated representation methods by training/validating ML models on these datasets), (iii) Drug target protein family classification (five major target families were selected and methods were evaluated in terms of classifying proteins into families via ML models), and (iv) Protein-protein binding affinity estimation (we used the SKEMPI dataset to evaluate methods in estimating protein-protein binding affinity changes upon mutations). We evaluated 23 protein representation methods in total, including both classical approaches and cutting-edge representation learning methods, to observe whether these novel approaches have advantages over classical ones in extracting the high-level/complex properties of proteins that are hidden in their sequences. Finally, we provide an open-access tool, PROBE (Protein RepresentatiOn BEnchmark), with which the user can assess new protein representation models on the above-mentioned benchmarking tasks with a single line of code.
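The semantic similarity inference task in (i) can be illustrated with a minimal sketch. It assumes cosine similarity between representation vectors and Spearman rank correlation as the agreement measure; neither choice is stated in the abstract. The protein identifiers, vectors, and GO-based similarity scores are toy values invented for illustration, and this is not the PROBE implementation.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def similarity_correlation(embeddings, semantic_similarity):
    """Correlate vector similarities with annotation-based similarities."""
    pairs = list(combinations(sorted(embeddings), 2))
    vector_sims = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
    semantic_sims = [semantic_similarity[(a, b)] for a, b in pairs]
    return spearman(vector_sims, semantic_sims)


# Toy data (hypothetical): three proteins with 2-D representation vectors.
embeddings = {"P1": [1.0, 0.0], "P2": [0.9, 0.1], "P3": [0.0, 1.0]}
# Hypothetical GO-based semantic similarities for each unordered pair.
semantic_similarity = {("P1", "P2"): 0.8, ("P1", "P3"): 0.1, ("P2", "P3"): 0.2}

correlation = similarity_correlation(embeddings, semantic_similarity)
print(f"Spearman correlation: {correlation:.3f}")
```

A higher correlation would suggest that distances in the representation space track functional similarity more closely; in practice the pairwise comparison would run over many more protein pairs than in this toy example.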
Protein-RNA interactions play vital roles in many cellular processes and are therefore the focus of many biological studies. Biologists would like to measure protein-RNA interactions efficiently in high throughput and, based on these high-throughput experimental measurements, train accurate machine-learning models to predict interactions with new RNA sequences. In this talk, I will present solutions to both challenges: the design of efficient high-throughput experiments, and the training of highly accurate predictive models on high-throughput genomic data. First, I will present DeCoDe, a new method based on Integer Linear Programming that designs protein-coding templates to efficiently cover many proteins in a single high-throughput experiment. DeCoDe outperforms extant methods for the task and enables features that were not possible before, such as covering variable-length proteins and optimizing globally over multiple templates. Second, I will present DeepUTR, a new method based on Deep Learning that predicts mRNA degradation dynamics from the 3’-UTR sequence of an mRNA. DeepUTR outperforms extant methods for the task and enables prediction of mRNA levels at various time points. Moreover, we extended the Integrated Gradients interpretability approach to handle multiple input types and, using the extended approach, discovered known and novel regulatory 3’-UTR elements associated with mRNA degradation. I will conclude my talk with future plans for both sequence design problems and deep neural network applications in genomics.
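For readers unfamiliar with Integrated Gradients, the standard single-input form attributes a prediction to each feature by integrating the model gradient along a straight path from a baseline to the input. The sketch below shows only that standard form, not the multi-input extension described in the talk. The logistic toy model, its weights, the all-zero baseline, and the midpoint-rule integration are all assumptions chosen for illustration.

```python
import math

# Hypothetical weights of a toy logistic model (not DeepUTR).
WEIGHTS = [1.5, -2.0, 0.5]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x):
    """Toy differentiable model: sigmoid of a weighted sum."""
    return sigmoid(sum(w * v for w, v in zip(WEIGHTS, x)))


def model_gradient(x):
    """Analytic gradient of the toy model with respect to its input."""
    s = model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral of gradients with the midpoint rule."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between the baseline and the input.
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        grad = model_gradient(point)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input difference per feature.
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, total)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness property: attributions sum to model(x) - model(baseline).
print(attributions, sum(attributions), model(x) - model(baseline))
```

The completeness check at the end is a useful sanity test for any implementation: if the attributions do not sum to the difference in model output, the integration is too coarse or the gradient is wrong.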
Alternative splicing (AS) plays a major role in the differentiation of immune cells during an immune response, as 29% of AS genes are specific to the immune system. Although the role of AS has been extensively investigated in T cells, its role in B cell activation is less well characterised. We sought to develop a workflow based on the long-read technology of Oxford Nanopore Technologies (ONT) to understand post-transcriptional regulation at both the gene and isoform levels in human germinal centre B cells. As one of the challenges of ONT is the accurate computational analysis of isoforms, we developed the ‘Nexons’ pipeline to identify differentially spliced transcript variants using long-read sequencing. An in-depth analysis of splicing regulators with Nexons revealed that poison exons of splicing factors (e.g. SRSF3) were preferentially spliced out upon activation, whereas naïve B cells expressed isoforms carrying the poison exon, leading to nonsense-mediated mRNA decay. Moreover, we identified novel spliced variants of these genes, which were difficult to deconvolute using short-read data due to the limitations of short-read technology. Altogether, our findings validate the combination of Nexons with ONT cDNA-PCR sequencing as a suitable method for the identification and quantification of complex isoforms.
Automated imaging systems are becoming important tools for medical and biological research, as they facilitate rapid analyses with better reproducibility. Segmenting regions of interest in a medical image is typically the first, and one of the most important, steps in these systems, and it greatly affects the success of the entire analysis. In this talk, I will briefly outline the main challenges associated with segmentation tasks in medical image analysis, and then present examples of the dense prediction networks that my research group designed and implemented to address these challenges. In particular, I will talk about our proposed network architectures and loss functions, which were specifically designed to facilitate better training of segmentation networks. Finally, I will discuss future research possibilities for developing more robust segmentation networks for medical image analysis.
The structure and function of diverse microbial communities are underpinned by ecological interactions that remain uncharacterized. With the rapid adoption of next-generation sequencing for studying microbiomes, data-driven inference of microbial interactions based on abundance correlations is widely used, but with the drawback that ecological interpretations may not be possible. Leveraging cross-sectional microbiome datasets to unravel ecological structure in a scalable manner thus remains an open problem. We present an expectation-maximization algorithm (BEEM-Static) that can be applied to cross-sectional datasets to infer interaction networks based on an ecological model (generalized Lotka-Volterra). The method is robust to violations of model assumptions because it uses statistical filters to identify and remove the corresponding samples. Benchmarking against 10 state-of-the-art correlation-based methods showed that BEEM-Static can infer the presence and directionality of ecological interactions even with relative abundance data (AUC-ROC > 0.85), a task that other methods struggle with (AUC-ROC < 0.63). In addition, BEEM-Static can tolerate a high fraction of samples (up to 40%) that are not at steady state or that come from an alternate model. Applying BEEM-Static to a large public dataset of human gut microbiomes (n = 4,617) identified multiple stable equilibria that better reflect ecological enterotypes with distinct carrying capacities and interactions for key species.
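The generalized Lotka-Volterra (gLV) model referred to above describes each species' abundance by dx_i/dt = x_i (mu_i + sum_j beta_ij x_j), so a steady state in which all species coexist satisfies mu + beta x = 0. The sketch below solves for that steady state and checks it with a simple Euler simulation. It is not the BEEM-Static algorithm, which infers the parameters from data; the two-species growth rates and interaction matrix are toy values invented for illustration.

```python
def glv_derivative(x, mu, beta):
    """gLV right-hand side: dx_i/dt = x_i * (mu_i + sum_j beta_ij * x_j)."""
    n = len(x)
    return [x[i] * (mu[i] + sum(beta[i][j] * x[j] for j in range(n)))
            for i in range(n)]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def glv_steady_state(mu, beta):
    """Coexistence equilibrium: solve beta @ x = -mu."""
    return solve_linear(beta, [-g for g in mu])


def simulate(x0, mu, beta, dt=0.01, steps=5000):
    """Forward-Euler integration of the gLV equations."""
    x = list(x0)
    for _ in range(steps):
        dx = glv_derivative(x, mu, beta)
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


# Toy parameters (hypothetical): two weakly competing species.
mu = [1.0, 0.8]                        # intrinsic growth rates
beta = [[-1.0, -0.5], [-0.3, -1.0]]    # beta[i][j]: effect of species j on i

steady = glv_steady_state(mu, beta)
final = simulate([0.1, 0.1], mu, beta)
print("analytic steady state:", steady)
print("simulated abundances: ", final)
```

With these parameters the simulated abundances settle at the analytic equilibrium, which is the situation a cross-sectional sample at steady state is assumed to represent. Samples that violate this assumption are the ones the abstract says are filtered out.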
RSG-Turkey is a member of the International Society for Computational Biology (ISCB) Student Council (SC) Regional Student Groups (RSG). We are a non-profit community of early-career researchers interested in computational biology and bioinformatics.