AVIDa-hIL6
AVIDa-hIL6 is a dataset of interactions between antigens and the variable domain of the heavy chain of heavy-chain antibodies (VHHs), produced from an alpaca immunized with the human interleukin-6 (IL-6) protein. VHHs have a simple structure, which makes it practical to identify their full-length amino acid sequences by DNA sequencing. Leveraging this, AVIDa-hIL6 contains 573,891 antigen-VHH pairs with amino acid sequences. Every antigen-VHH pair carries a reliable binding or non-binding label, generated by a novel labeling method. Furthermore, AVIDa-hIL6 includes the wild type and 30 mutants of the IL-6 protein as antigens, and it covers many sensitive cases in which point mutations in IL-6 enhance or inhibit antibody binding. We envision that AVIDa-hIL6 will serve as a valuable benchmark for machine learning research in the growing field of predicting antigen-antibody interactions.
Columns
A description of columns in the dataset CSV file.
| Column | Description |
| --- | --- |
| VHH_sequence | Amino acid sequence of VHH |
| Ag_sequence | Amino acid sequence of IL-6 protein |
| Ag_label | Type of IL-6 protein |
| label | Binary label represented by 1 for the binding pair and 0 for the non-binding pair |
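The column layout above can be checked when reading the file. The following is a minimal sketch using Python's standard `csv` module. The helper name `read_pairs` and the toy rows are assumptions made for illustration; the sequences and `Ag_label` values shown are placeholders, not actual dataset content.

```python
import csv
import io

# Column names as listed in the table above.
EXPECTED_COLUMNS = ["VHH_sequence", "Ag_sequence", "Ag_label", "label"]


def read_pairs(handle):
    """Yield one dict per antigen-VHH pair, with `label` converted to int."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {row['label']!r}")
        yield {**row, "label": label}


# Toy stand-in for the real CSV (placeholder values, not dataset content).
TOY_CSV = """VHH_sequence,Ag_sequence,Ag_label,label
QVQLVE,MNSFST,toy_wild_type,1
EVQLQA,MNSFST,toy_wild_type,0
QVQLVE,MNAFST,toy_mutant,0
"""

pairs = list(read_pairs(io.StringIO(TOY_CSV)))
print(len(pairs), "pairs read;", sum(p["label"] for p in pairs), "binding")
```

For the real file, the same function could be called on an open file handle, for example `open("AVIDa-hIL6.csv", newline="")`, where the file name is hypothetical.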
Pipeline
AVIDa-hIL6 is generated through the following workflow. The scripts highlighted in blue are available on GitHub.
Statistics
AVIDa-hIL6 contains 573,891 data samples, comprising 20,980 binding pairs and 552,911 non-binding pairs. The following figure shows the number of samples for each antigen type. AVIDa-hIL6 contains over 10,000 samples for each type of IL-6 protein, including at least 250 binder VHH sequences.
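The totals above imply a strongly imbalanced label distribution, with binding pairs making up roughly 3.7% of all samples. The sketch below checks that arithmetic and shows one way the per-antigen counts could be recomputed from rows shaped like the dataset columns. The function name `per_antigen_counts` and the toy rows are assumptions for illustration, and counting binders as distinct `VHH_sequence` values per `Ag_label` is one plausible reading of "binder VHH sequences", not a confirmed definition.

```python
from collections import Counter

# Totals quoted in the text above.
TOTAL, BINDING, NON_BINDING = 573_891, 20_980, 552_911
assert BINDING + NON_BINDING == TOTAL
binding_fraction = BINDING / TOTAL
print(f"binding fraction: {binding_fraction:.2%}")


def per_antigen_counts(rows):
    """Return {Ag_label: (n_samples, n_distinct_binder_vhh)}."""
    samples = Counter()
    binders = {}
    for row in rows:
        ag = row["Ag_label"]
        samples[ag] += 1
        if int(row["label"]) == 1:
            binders.setdefault(ag, set()).add(row["VHH_sequence"])
    return {ag: (n, len(binders.get(ag, ()))) for ag, n in samples.items()}


# Toy rows (placeholder values, not dataset content).
toy_rows = [
    {"VHH_sequence": "AAA", "Ag_label": "toy_wt", "label": "1"},
    {"VHH_sequence": "CCC", "Ag_label": "toy_wt", "label": "1"},
    {"VHH_sequence": "DDD", "Ag_label": "toy_wt", "label": "0"},
    {"VHH_sequence": "AAA", "Ag_label": "toy_mut", "label": "0"},
]
toy_counts = per_antigen_counts(toy_rows)
print(toy_counts)
```

On the full dataset, the text above suggests every antigen type would show more than 10,000 samples and at least 250 binders; a model evaluated on these labels would likely need a metric that accounts for the imbalance.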
Mother Libraries
Mother libraries were collected from a single alpaca immunized with a cocktail of 31 different IL-6 proteins four times at about two-week intervals. After each immunization, one blood sample and one or more lymph nodes from different body sites were collected, yielding a total of 12 libraries.
Sublibraries
Sublibraries were generated by performing affinity selection on each of the 12 mother libraries. As target molecules, we used the wild-type IL-6 protein (3 times), 30 different mutants, and a negative control sample that did not contain any IL-6 protein, as listed in the table below.