PhD project by José Juan Almagro Armenteros

Project title: The subcellular journey: Predicting the destination of proteins using deep learning
Group: BMEM

Supervisor: Henrik Nielsen

Project description:

Cells of eukaryotes (e.g. animals and plants) have many different internal subcellular compartments. Each compartment contains a characteristic set of proteins, which carry out their functions there. Some proteins are even exported to the outside of the cell to perform their function.

So how does the cell know where to put which proteins? In 1999, Günter Blobel received the Nobel Prize in Physiology or Medicine for the discovery that “proteins have intrinsic signals that govern their transport and localization in the cell”. In other words, there are “zip codes” in the proteins that inform the cell about their proper destination.

The amount of available protein sequence data is currently growing at a very high rate, while experimentally confirmed protein destinations accumulate much more slowly. There is therefore keen interest in predicting these “zip codes” from amino acid sequences, and this has been an important problem in bioinformatics ever since the dawn of the field.

This PhD thesis takes these prediction methods an important step forward by introducing modern artificial intelligence techniques to the area. Convolutional and recurrent neural networks have been used successfully in fields such as image recognition and natural language processing (including machine translation), but they are new in the context of protein subcellular sorting prediction.

In this thesis, José Juan Almagro Armenteros presents two new tools, DeepLoc and NetGPI, and two updates to well-known and successful tools, SignalP and TargetP. DeepLoc is a general predictor of protein subcellular localization in eukaryotes, while the others recognize specific “zip codes”. SignalP, the most cited method in the history of the Bioinformatics section at DTU, predicts secretory signal peptides. This “zip code” is responsible for sending proteins out of the cell in organisms from all domains of life (both eu- and prokaryotes). TargetP predicts transit peptides, which serve as signals for protein import into mitochondria and chloroplasts. NetGPI predicts the attachment of a particular lipid group, which can anchor proteins to the outside of cells.

Additionally, this thesis contains a manuscript about “the language of life” – an exploration of how predictable proteins are in general. When a deep learning-based language model is applied to protein sequences instead of sentences from a human language, it is to some degree possible to predict an amino acid from the neighbouring amino acids. The degree of predictability is shown to depend on the origin and quality of the sequence data. For instance, bacterial proteins seem to be generally more predictable than eukaryotic ones. A deep neural network trained in this way has an advanced internal representation of protein sequences, which may be beneficial in specific biological prediction tasks, including that of recognizing protein “zip codes”.


José Juan Almagro Armenteros
Guest Postdoc
DTU Sundhedsteknologi


Henrik Nielsen
DTU Sundhedsteknologi
45 25 20 98