I am Hirofumi Tsuruta, in charge of machine learning research and development at COGNANO. Recently, COGNANO published a paper and press release on alpaca antibodies effective against the novel coronavirus. Since it was widely covered on TV and in newspapers, some of you may have seen the term "alpaca antibody." The key to this achievement is the vast amount of antibody gene data produced by immunized alpacas. This data reflects extremely complex and still unexplained biological phenomena, and machine learning technologies such as deep learning can be expected to prove very powerful here. In this article, I will introduce the machine learning efforts currently underway at COGNANO.

Background: Immune response and antibody drugs

What exactly are antibodies? Before introducing COGNANO's efforts, I would like to briefly explain the background.

Humans have a defense system called immunity that protects the body from antigens, such as viruses and bacteria, that invade from the outside. This defense system produces a large number of antibodies, which bind to the invading antigens and attempt to eliminate them. This process of eliminating antigens is called an immune response. Because each antibody reacts only with a particular antigen, the body produces antibodies that correspond to the antigen that has invaded.

Drugs that use these antibodies to prevent or treat disease are called antibody drugs. By exploiting the specificity of antibodies, they can attack specific antigens with pinpoint accuracy, which makes them highly effective drugs with few side effects. The term "antibody" describes a function, namely the ability to bind to an antigen; physically, an antibody is a protein consisting of a chain of many amino acids. In the development of antibody drugs, finding proteins (antibodies) that bind specifically to the target antigen is a central problem.

Data generated by alpaca

The data held by COGNANO is generated by alpacas. Many of you may be wondering why COGNANO uses alpacas. The reason is that the VHH antibodies produced by alpacas have a simpler structure than the conventional antibodies of humans and most other animals, and can be converted into data using an experimental instrument called a next-generation sequencer. The data here refers to amino acid sequences, each of which is a sequence built from 20 different amino acids. Since each amino acid can be represented by a single letter of the alphabet, an amino acid sequence can be expressed as a string of characters.
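
To make the "sequence as a string" idea concrete, here is a minimal sketch of how such a string could be turned into integers that a model can consume. The standard one-letter amino acid codes are used; the function name and the example string are assumptions made for illustration and are not taken from COGNANO's pipeline.

```python
# The 20 standard amino acids in the conventional one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence: str) -> list[int]:
    """Map an amino acid string to a list of integer indices (0..19)."""
    sequence = sequence.strip().upper()
    unknown = sorted(set(sequence) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"Unexpected residue letters: {unknown}")
    return [AA_TO_INDEX[aa] for aa in sequence]


# Arbitrary toy string used only for illustration, not a real antibody.
print(encode_sequence("MKVLA"))  # [10, 8, 17, 9, 0]
```

Real sequencing output may also contain ambiguous or non-standard letters, which this sketch simply rejects.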

So how do we search for antibodies that bind specifically to a particular antigen? The answer is to use the immune response that occurs in the alpaca's body. When an antigen is injected into an alpaca, its immune system kicks in and produces a large number of antibodies to fight the antigen. These antibodies are then extracted and converted into data. Furthermore, COGNANO's proprietary method labels the extracted antibodies, separating those that actually bind to the antigen from those that do not. These labels serve as the ground-truth labels for machine learning. The actual process of data conversion and labeling is more complex, and we plan to publish those details in a future paper.

To review, the data consists of the amino acid sequence of the target antigen (string), the amino acid sequence of the drug candidate antibody (string), and the binding/non-binding label (binary) for each antigen/antibody pair. A binary classification model can therefore be built to predict whether binding occurs, using the amino acid sequences of the antigen and antibody as the explanatory variables and the label as the target variable. Furthermore, from the perspective of a machine learning specialist, this data has several desirable characteristics that make it a strong fit for machine learning, especially deep learning.
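
As a rough illustration of this setup, the sketch below stores each record as an (antigen, antibody, label) triple and fits a tiny logistic regression on amino acid composition features. Everything here is an assumption for illustration: the `Pair` record, the composition features, and the toy sequences are made up, and this crude baseline is a stand-in for the deep learning models discussed later, not COGNANO's method.

```python
import math
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


@dataclass
class Pair:
    antigen: str   # amino acid sequence of the target antigen
    antibody: str  # amino acid sequence of the candidate antibody
    binds: int     # 1 = binding, 0 = non-binding


def composition(sequence: str) -> list[float]:
    """Fraction of each of the 20 amino acids in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def featurize(pair: Pair) -> list[float]:
    """Concatenate antigen and antibody composition vectors (40 features)."""
    return composition(pair.antigen) + composition(pair.antibody)


def predict_proba(weights: list[float], bias: float, x: list[float]) -> float:
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs: list[Pair], epochs: int = 200, lr: float = 0.5):
    """Fit the weights with plain stochastic gradient descent on log loss."""
    weights, bias = [0.0] * 40, 0.0
    for _ in range(epochs):
        for pair in pairs:
            x = featurize(pair)
            error = predict_proba(weights, bias, x) - pair.binds
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy, made-up records purely to exercise the code; not real data.
toy_pairs = [
    Pair("MKVLAAGW", "AAWWKK", 1),
    Pair("MKVLAAGW", "GGSSTT", 0),
]
weights, bias = train(toy_pairs)
for pair in toy_pairs:
    print(pair.antibody, round(predict_proba(weights, bias, featurize(pair)), 2))
```

Composition features discard the order of the residues, which is exactly the information a sequence model such as a Transformer is meant to exploit.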

1. Huge amount of data.

A common fundamental problem when applying deep learning is that there is not enough data. In that case, no matter how high the quality of the data, it becomes difficult to extract useful information from it. Alpacas are excellent living computing machines: they produce a huge number of antibodies against a single antigen, which we then convert into data. Specifically, depending on the data processing conditions, we can obtain information on the order of hundreds of thousands to millions of antibody genes for a single antigen. Furthermore, by building a database across multiple antigens, we can generate a huge amount of data.

2. Dense data distribution.

Antibodies obtained from alpacas include many amino acid sequences with high similarity (homology). In other words, we obtain antibodies whose sequences are mostly identical and differ in only a few amino acids. This is very useful information for capturing antigen-antibody binding. From a machine learning perspective, regions of high data density generally lead to higher predictive performance and reliability of the model.
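
One simple way to picture "differ in only a few amino acids" is the Hamming distance, the number of positions at which two equal-length sequences differ. The sketch below is only an illustration with made-up sequences. Real antibody sequences vary in length, so an alignment-based similarity measure would likely be needed in practice.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbors(query: str, pool: list[str], max_diff: int = 2) -> list[str]:
    """Sequences in the pool within max_diff substitutions of the query."""
    return [s for s in pool
            if len(s) == len(query) and hamming_distance(query, s) <= max_diff]


# Toy, made-up sequences for illustration only.
pool = ["MKVLAAGW", "MKVLSAGW", "MKILSAGW", "GGSSTTPP"]
print(neighbors("MKVLAAGW", pool))  # ['MKVLAAGW', 'MKVLSAGW', 'MKILSAGW']
```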

3. Data with latent laws of immune response.

Since our data reflects the results of a real immune response occurring in the alpaca's body, it is likely that some law is inherent in the data. If there were no such laws and antibodies were produced at random, we would not obtain the highly similar amino acid sequences described in 2. If the length of an antibody's amino acid sequence is N, the number of possible amino acid sequences is 20 to the power of N (20^N). However, the world of the in vivo immune response is actually much smaller, and there may be only a limited repertoire of immune responses to the same antigen. If so, it may be possible to capture it with deep learning. Data like this, with a true phenomenon hidden behind it, makes a very attractive target for deep learning.
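
To get a feel for how large the space of possible sequences is compared with the hundreds of thousands to millions of antibody genes observed per antigen, the count for a sequence of length N (20 choices at each of N positions) can be computed directly. The length N = 100 below is an arbitrary illustrative value, not a figure from our data.

```python
import math

NUM_AMINO_ACIDS = 20


def sequence_space_size(length: int) -> int:
    """Number of distinct amino acid sequences of the given length (20^N)."""
    return NUM_AMINO_ACIDS ** length


# N = 100 is an arbitrary illustrative length, not a figure from the article.
n = 100
size = sequence_space_size(n)
print(f"20^{n} has {len(str(size))} decimal digits")          # 131 digits
print(f"log10(20^{n}) = {n * math.log10(NUM_AMINO_ACIDS):.1f}")  # 130.1
```

Even at this modest length the space has roughly 10^130 members, so any observed repertoire covers a vanishingly small part of it.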

The possibilities opened up by deep learning

Deep learning is being actively applied to amino acid sequence data, the building blocks of proteins. Interestingly, since amino acid sequences can be represented as strings, recent research tends to quickly adopt developments from the field of natural language processing. Two examples are the studies below. Both apply BERT, a language model proposed by Google in 2018, to protein amino acid sequences.

[2006.15222] BERTology Meets Biology: Interpreting Attention in Protein Language Models
Transformer architectures have proven to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. In this work, we demonstrate a set of methods for analyzing protein Transformer models through the lens of attention. We show that attention: (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We find this behavior to be consistent across three Transformer architectures (BERT, ALBERT, XLNet) and two distinct protein datasets. We also present a three-dimensional visualization of the interaction between attention and protein structure. Code for visualization and analysis is available at https://github.com/salesforce/provis.
https://arxiv.org/abs/2006.15222

ProteinBERT: A universal deep-learning model of protein sequence and function | bioRxiv
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme consists of masked language modeling combined with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to very large sequence lengths. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post translational modifications and biophysical attributes), despite using a far smaller model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.
https://www.biorxiv.org/content/10.1101/2021.05.24.445464v1

We are also working on applying deep learning techniques developed in the field of natural language processing, such as Transformer and BERT, to our own alpaca antibody data.
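
For readers unfamiliar with BERT, its core pretraining task is masked language modeling: some positions of the input are hidden and the model learns to recover them from the context. The sketch below shows only that input corruption step on an amino acid string. It is a simplification made under assumptions: the `#` mask symbol, the function name, and the toy sequence are invented for illustration, and the original BERT recipe also sometimes substitutes a random token or leaves the selected token unchanged instead of always masking it.

```python
import random

MASK = "#"  # placeholder symbol for a masked position (assumption)


def mask_sequence(sequence: str, mask_rate: float = 0.15, seed: int = 0):
    """Hide a random subset of residues and return (masked input, targets).

    targets maps each masked position to the original amino acid, which is
    what a masked language model is trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return "".join(tokens), targets


# Toy, made-up sequence for illustration only.
original = "MKVLAAGWQRSTNDEYHC"
masked, targets = mask_sequence(original)
print(masked)
print(targets)
```

Because this task needs no binding labels, it can in principle make use of every sequenced antibody, labeled or not.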

There are two main immediate challenges we want to solve using deep learning. One is to predict whether an antibody will bind to (interact with) a specific antigen. If we can build such a model, then whenever the amino acid sequence of a target antigen is known, we can predict its binding against COGNANO's vast collection of alpaca antibodies and screen for the antibody sequences likely to bind. Although binding does not necessarily mean that the antibody will be effective as a drug, the ability to narrow down drug candidates using only a computer will greatly improve the efficiency of the drug discovery process.
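
The screening step described here amounts to scoring every antibody in a library against one antigen and keeping the highest-scoring candidates. A minimal sketch follows, assuming a trained model is available as a function that returns a binding probability. The `dummy_predict` stand-in, the threshold, and the toy sequences are all hypothetical and carry no biological meaning.

```python
from typing import Callable


def screen(antigen: str,
           antibodies: list[str],
           predict: Callable[[str, str], float],
           threshold: float = 0.5) -> list[tuple[str, float]]:
    """Score every antibody against the antigen; keep and rank likely binders."""
    scored = [(ab, predict(antigen, ab)) for ab in antibodies]
    hits = [(ab, p) for ab, p in scored if p >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


def dummy_predict(antigen: str, antibody: str) -> float:
    """Stand-in for a trained model: share of antibody letters seen in the antigen."""
    return sum(aa in antigen for aa in antibody) / len(antibody)


# Toy, made-up sequences for illustration only.
library = ["AAWWKK", "GGSSTT", "MKGGLL"]
for antibody, score in screen("MKVLAAGW", library, dummy_predict):
    print(antibody, round(score, 2))
```

In practice the threshold would be chosen from validation data, trading off how many candidates go on to experiments against how many true binders are missed.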

Another challenge is to predict which site of an antigen an antibody will bind to. The site on the antigen to which an antibody specifically binds is called an epitope. Since an epitope is generally a small region consisting of a few amino acids, predicting it provides very useful information for designing antibodies that bind specifically to an antigen. This challenge can be approached from the perspective of machine learning interpretability, which has been actively studied in recent years. In particular, methods that present the contribution of each input feature to a model's prediction, such as LIME and SHAP, can be applied directly to epitope prediction. Incorporating this state-of-the-art interpretability research into epitope prediction is both very challenging and practical.
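
The idea of reading an epitope off a model's feature contributions can be sketched with occlusion, a simple perturbation method in the same family as LIME and SHAP: mask one antigen position at a time and record how much the predicted binding score drops. This is not LIME or SHAP itself, and the stand-in model, the `X` mask symbol, and the toy sequences below are assumptions for illustration only.

```python
from typing import Callable

MASK = "X"  # placeholder for an occluded residue (assumption)


def occlusion_scores(antigen: str,
                     antibody: str,
                     predict: Callable[[str, str], float]) -> list[float]:
    """Drop in predicted binding when each antigen position is masked in turn."""
    baseline = predict(antigen, antibody)
    scores = []
    for i in range(len(antigen)):
        occluded = antigen[:i] + MASK + antigen[i + 1:]
        scores.append(baseline - predict(occluded, antibody))
    return scores


def dummy_predict(antigen: str, antibody: str) -> float:
    """Stand-in model: predicts binding only if the motif 'LAAG' is intact."""
    return 0.9 if "LAAG" in antigen else 0.1


# Toy, made-up sequences for illustration only.
antigen = "MKVLAAGW"
scores = occlusion_scores(antigen, "AAWWKK", dummy_predict)
for residue, score in zip(antigen, scores):
    print(residue, round(score, 2))
```

Positions with a large drop are the ones the model relies on, which makes them candidate epitope residues. Whether they match the true epitope depends entirely on how well the model has learned real binding.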

What COGNANO can do

The efforts described above to build deep learning models using alpaca antibody gene data are unprecedented and possible only because of COGNANO. COGNANO has never doubted the potential of alpaca antibodies since it first began focusing on them, and has continued to accumulate data through repeated trial and error over a long period of time. To begin with, generating this kind of data is difficult without advanced experimental skills in biotechnology. By contrast, many of the existing studies mentioned above use small amounts of data whose labels rest on strong assumptions, or data generated virtually by simulation. COGNANO's data therefore has a significant advantage in both quantity and quality. No matter how sophisticated a deep learning model is, it is impossible to create a valuable and practical model without good data.

Mathematical approaches such as deep learning and statistical analysis are essential for extracting useful information from our data, in which the laws of the in vivo immune response are latent. At COGNANO, we have begun to extract value from data generated through the tireless efforts of our biotechnology experts by applying these mathematical approaches. One of the results is the paper on the novel coronavirus, and these efforts will accelerate from here. Please look forward to what comes next from COGNANO.

Summary

In this article, we introduced COGNANO's machine learning-related efforts. We use the immune response that occurs in alpacas when they are injected with target antigens to collect data on their antibody genes. By applying deep learning to this data, we are working to predict which antibodies will bind to specific antigens and which sites of antigens the antibodies will specifically bind to. We plan to publish more specific details of our research in a future paper.

Thank you for reading to the end, and if you have any questions about COGNANO, please contact CEO Akihiro Imura, CTO Ryosuke Matsumoto, or Hirofumi Tsuruta.