VHHCorpus
VHHCorpus is a pre-training corpus of full-length amino acid sequences of the variable domain of the heavy chain of heavy-chain antibodies (VHHs) collected from alpacas. We have currently released VHHCorpus-2M, which contains over two million unlabeled VHH sequences and can be used to pre-train VHH-specific language models.
Columns
A description of columns in the dataset CSV file.
| Column | Description |
| --- | --- |
| VHH_sequence | Amino acid sequence of VHH |
| subject_species | Species of the subject from which VHH was collected |
| subject_name | Name of the subject from which VHH was collected |
| subject_sex | Sex of the subject from which VHH was collected |
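As a rough illustration of how these columns might be consumed, the sketch below reads records with Python's standard `csv` module and checks that the four documented columns are present. It assumes a comma-delimited file with a header row; the helper name `read_vhh_rows` and the sample rows are hypothetical, and the sequences are made-up placeholders rather than entries from VHHCorpus-2M.

```python
import csv
import io

# Column names taken from the table above.
EXPECTED_COLUMNS = ["VHH_sequence", "subject_species", "subject_name", "subject_sex"]


def read_vhh_rows(handle):
    """Read VHH records from an open CSV handle and check the header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Toy stand-in for the real file; the sequences are made-up placeholders.
SAMPLE_CSV = (
    "VHH_sequence,subject_species,subject_name,subject_sex\n"
    "QVQLVESGGG,Alpaca,Lucky,Female\n"
    "EVQLQESGGG,Alpaca,Marin,Male\n"
)

rows = read_vhh_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    # Print the subject and the length of its (placeholder) sequence.
    print(row["subject_name"], len(row["VHH_sequence"]))
```

For the downloaded file, the same helper could be called inside `with open(path, newline="") as handle:`, where `path` is wherever the CSV was saved locally.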
Pipeline
VHHCorpus was generated through the following workflow. The scripts highlighted in blue are available on GitHub.
Subjects
VHHCorpus-2M is a collection of unique VHH sequences obtained from five alpacas, none of which were used to generate AVIDa-SARS-CoV-2. Note that VHHCorpus-2M includes the publicly available AVIDa-hIL6 in addition to multiple datasets that have not been published as labeled binding datasets.
| Name | Species | Sex |
| --- | --- | --- |
| Lucky | Alpaca | Female |
| Marin | Alpaca | Male |
| Wizzy | Alpaca | Male |
| Yodel-Suri | Alpaca | Female |
| Yuki | Alpaca | Female |
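One way to sanity-check the subject metadata after loading the corpus is to count records per alpaca and confirm that the sequences are unique, as stated above. The following is a minimal sketch, assuming each record is a dictionary keyed by the documented column names; `summarize_subjects` is a hypothetical helper, and the toy records use made-up placeholder sequences.

```python
from collections import Counter


def summarize_subjects(rows):
    """Return per-subject record counts and the number of distinct sequences."""
    per_subject = Counter(row["subject_name"] for row in rows)
    unique_sequences = {row["VHH_sequence"] for row in rows}
    return per_subject, len(unique_sequences)


# Toy records; the sequences are placeholders, not real corpus entries.
toy_rows = [
    {"VHH_sequence": "QVQLVESGGG", "subject_name": "Lucky"},
    {"VHH_sequence": "EVQLQESGGG", "subject_name": "Lucky"},
    {"VHH_sequence": "QVQLQESGGA", "subject_name": "Yuki"},
]

per_subject, n_unique = summarize_subjects(toy_rows)
print(per_subject)                # record count for each subject name
print(n_unique == len(toy_rows))  # True when no sequence is repeated
```

On the full corpus, the per-subject counts would be expected to cover only the five names listed in the table, and the number of distinct sequences should match the number of rows if the file holds only unique sequences.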