Language Models Can Learn Complex Functional Properties of Proteins


Serbülent Ünsal

Serbülent Ünsal received his B.Sc. degree in Statistics and Computer Sciences from Karadeniz Technical University in Turkey. Following his graduation, he pursued his M.Sc. degree in Medical Informatics at Middle East Technical University in Turkey. During the Master’s program, he studied multiscale computational tumor modeling, developing a tumor progression model using cellular automata and partial differential equations with Dr. Aybar Can Acar. In 2014, he started his PhD in the same department, working on deep learning models for low-data protein function prediction. His thesis is also part of a large-scale research project on the discovery of new immune-escape mechanisms and drug repurposing against them. Currently, he is about to finish his PhD and is working as a Senior ML Engineer at Antiverse, designing antibodies using machine learning and deep learning models.


Proteins are essential macromolecules for life. To understand and manipulate biological mechanisms, the functions of proteins must be understood, and this is possible by studying their relationship with the amino acid sequence and 3-D structure. So far, only a small percentage of proteins have been functionally characterized (currently ∼0.5% according to UniProt) due to the cost and time requirements of wet-lab-based procedures. Lately, protein function prediction (PFP), which can be defined as the annotation of proteins with functional definitions using statistical/computational methods, has gained importance as a way to explore the uncharacterized protein space and/or protein variants carrying function-altering changes. Among the many algorithmic approaches proposed so far, machine learning (ML) techniques, especially deep learning (DL), have become popular in PFP due to their high predictive performance. The input data used by these ML/DL methods are numerical feature vectors representing the protein (i.e., protein representations), and they are mostly generated from the amino acid sequences of proteins, which are readily available in databases (e.g., UniProt).
In this study, we evaluated protein representation methods for the prediction of functional attributes of proteins and benchmarked these methods on 4 challenging tasks, namely: (i) Semantic similarity inference (we calculated pairwise semantic similarities between human proteins using their gene ontology annotations and compared them with representation vector similarities to observe the correlation between them), (ii) Ontological protein function prediction (we built GO term categories based on term specificities and sample sizes, which reflect different levels of predictive difficulty, and evaluated representation methods by training/validating ML models on these datasets), (iii) Drug target protein family classification (five major target families were selected and methods were evaluated in terms of classifying proteins into families via ML models), and (iv) Protein-protein binding affinity estimation (we used the SKEMPI dataset to evaluate methods in estimating protein-protein binding affinity changes upon mutations). We evaluated 23 protein representation methods in total, including both classical approaches and cutting-edge representation learning methods, to observe whether these novel approaches have advantages over classical ones in terms of extracting high-level/complex properties of proteins that are hidden in their sequence. Finally, we provide an open-access tool, PROBE (Protein RepresentatiOn BEnchmark), where the user can assess new protein representation models on the above-mentioned benchmarking tasks with only a line of code.
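To make task (i) more concrete, the following is a minimal Python sketch of how representation vector similarities could be compared with semantic similarities. It is not the PROBE implementation: the use of cosine similarity and Spearman rank correlation, the function names, and the toy protein IDs and scores are all assumptions made for illustration only.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def similarity_correlation(vectors, semantic_sim):
    """Correlate vector similarities with semantic similarities over all pairs.

    vectors: dict mapping protein ID -> representation vector
    semantic_sim: dict mapping (id_a, id_b), with id_a < id_b, -> similarity
    """
    pairs = list(combinations(sorted(vectors), 2))
    vec_sims = [cosine_similarity(vectors[a], vectors[b]) for a, b in pairs]
    sem_sims = [semantic_sim[pair] for pair in pairs]
    return spearman(vec_sims, sem_sims)


# Toy data (hypothetical IDs, vectors, and GO-based similarity scores).
toy_vectors = {"P1": [1.0, 0.0], "P2": [0.9, 0.1], "P3": [0.0, 1.0]}
toy_semantic = {("P1", "P2"): 0.8, ("P1", "P3"): 0.1, ("P2", "P3"): 0.3}
toy_score = similarity_correlation(toy_vectors, toy_semantic)
print(f"Spearman correlation on toy data: {toy_score:.3f}")
```

Under this sketch, a higher correlation would suggest that a representation places functionally similar proteins closer together. In practice the semantic similarities would come from a GO-based measure, and the vectors from the representation method under evaluation.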

Date: July 6th, 2022 – 18:00 (GMT+3)

Language: English

You can register for this webinar here!
