AVIDa-SARS-CoV-2

AVIDa-SARS-CoV-2 is a dataset of interactions between antigens and the variable domain of the heavy chain of heavy-chain antibodies (VHHs), obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. It includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. The dataset provides benchmarks for evaluating how well the representations learned by antibody language models support binding prediction, thereby facilitating the development of AI-driven antibody discovery.

Columns

Descriptions of the columns in the dataset CSV files.

AVIDa-SARS-CoV-2.csv

| Column | Description |
| --- | --- |
| VHH_sequence | Amino acid sequence of VHH |
| Ag_label | Antigen type |
| label | Binary label: 1 for a binding pair, 0 for a non-binding pair |
| subject_species | Species of the subject from which VHH was collected |
| subject_name | Name of the subject from which VHH was collected |
| subject_sex | Sex of the subject from which VHH was collected |
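
As a minimal sketch of how the columns above might be consumed, the following Python snippet reads rows with the standard `csv` module and checks that each `label` is 0 or 1. The sample rows are made-up placeholders rather than records from the dataset, and the helper name `parse_pairs` is hypothetical.

```python
import csv
import io

# Made-up placeholder rows in the column layout described above;
# the sequences are NOT real VHH sequences from the dataset.
SAMPLE_CSV = """VHH_sequence,Ag_label,label,subject_species,subject_name,subject_sex
QVQLVESGG,WT,1,Alpaca,Puta,Male
EVQLQASGG,Delta,0,Alpaca,Christy,Female
"""


def parse_pairs(handle):
    """Read VHH-antigen pairs and convert `label` to an int (0 or 1)."""
    rows = []
    for row in csv.DictReader(handle):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {row['label']!r}")
        row["label"] = label
        rows.append(row)
    return rows


pairs = parse_pairs(io.StringIO(SAMPLE_CSV))
print(pairs[0]["Ag_label"], pairs[0]["label"])
```

To read the real file, the `io.StringIO` object could be swapped for `open("AVIDa-SARS-CoV-2.csv", newline="")`, assuming the file sits in the working directory.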

antigen_sequences.csv

| Column | Description |
| --- | --- |
| Ag_label | Antigen type |
| Ag_sequence | Amino acid sequence of antigen |
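
Because `Ag_label` appears in both files, it can plausibly serve as a join key for attaching the antigen sequence to each VHH-antigen pair. The sketch below illustrates this with the standard `csv` module; the rows are made-up placeholders (real sequences are far longer), and the function name `attach_antigen_sequences` is hypothetical.

```python
import csv
import io

# Made-up placeholder content, NOT real dataset records.
PAIRS_CSV = """VHH_sequence,Ag_label,label
QVQLVESGG,WT,1
EVQLQASGG,OC43,0
"""
ANTIGENS_CSV = """Ag_label,Ag_sequence
WT,MFVFLVLLP
OC43,MFLILLISL
"""


def attach_antigen_sequences(pairs_handle, antigens_handle):
    """Add an `Ag_sequence` field to each pair by looking up its `Ag_label`."""
    antigen_by_label = {
        row["Ag_label"]: row["Ag_sequence"]
        for row in csv.DictReader(antigens_handle)
    }
    joined = []
    for row in csv.DictReader(pairs_handle):
        # A KeyError here would mean a pair references an unknown antigen type.
        row["Ag_sequence"] = antigen_by_label[row["Ag_label"]]
        joined.append(row)
    return joined


joined = attach_antigen_sequences(
    io.StringIO(PAIRS_CSV), io.StringIO(ANTIGENS_CSV)
)
print(joined[0]["VHH_sequence"], joined[0]["Ag_sequence"])
```

Keeping the antigen sequences in a separate lookup avoids repeating a long sequence on every row, which is presumably why the dataset ships them in their own file.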

Pipeline

AVIDa-SARS-CoV-2 was generated through the following workflow. The scripts highlighted in blue are available on GitHub.

[Figure: AVIDa-SARS-CoV-2 generation workflow]

Statistics

AVIDa-SARS-CoV-2 contains 77,003 data samples, comprising 22,002 binding pairs and 55,001 non-binding pairs. The following figure shows the number of data samples for each antigen type.

[Figure: number of data samples for each antigen type]
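
The per-antigen breakdown can be recomputed from the CSV rows. The following is a rough sketch, assuming rows are dictionaries keyed by the column names listed under Columns; the helper `count_labels_by_antigen` and the three sample rows are hypothetical and only illustrate the shape of the result. The last lines check the totals reported above.

```python
from collections import Counter


def count_labels_by_antigen(rows):
    """Return {antigen type: Counter({1: n_binding, 0: n_non_binding})}."""
    counts = {}
    for row in rows:
        counts.setdefault(row["Ag_label"], Counter())[int(row["label"])] += 1
    return counts


# Made-up placeholder rows, only to show the shape of the result.
rows = [
    {"Ag_label": "WT", "label": "1"},
    {"Ag_label": "WT", "label": "0"},
    {"Ag_label": "Omicron", "label": "0"},
]
per_antigen = count_labels_by_antigen(rows)
print(per_antigen["WT"][1], per_antigen["WT"][0])

# Totals reported in this section.
n_binding, n_non_binding = 22_002, 55_001
total = n_binding + n_non_binding
print(total, round(n_binding / total, 3))
```

The reported totals imply that binding pairs make up roughly 28.6% of the samples, so the classes are imbalanced and accuracy alone may be a misleading metric.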

Subjects

Two alpacas were used for dataset generation.

| Name | Species | Sex |
| --- | --- | --- |
| Christy | Alpaca | Female |
| Puta | Alpaca | Male |

Antigen Types

We used 13 types of antigens as targets, listed in the table below.

| Antigen Type | Panning | Description |
| --- | --- | --- |
| WT | cell | Wild-type (WT) SARS-CoV-2 identified in Wuhan |
| D614G | cell | Mutant with D614G mutation |
| Alpha | cell, bead | Mutant with representative mutations of Alpha variant with a C9 tag at the C-terminus |
| Alpha+K417N | cell | Mutant of antigen type “Alpha” with K417N mutation |
| Alpha+E484K | cell | Mutant of antigen type “Alpha” with E484K mutation |
| Beta | cell, bead | Mutant with representative mutations of Beta variant |
| Delta | cell, bead | Mutant with representative mutations of Delta variant |
| Kappa | bead | Mutant with representative mutations of Kappa variant |
| Lambda | bead | Mutant with representative mutations of Lambda variant |
| Omicron | cell, bead | Mutant with representative mutations of Omicron (BA.1) variant |
| PMS | bead | Polymutant spike (PMS) protein |
| S2-domain | bead | S2-domain of the WT |
| OC43 | bead | Human coronavirus OC43 (HCoV-OC43) |