
Datasets

VHHCorpus

VHHCorpus is a pre-training corpus of full-length amino acid sequences of the variable domain of the heavy chain of heavy-chain antibodies (VHHs) collected from alpacas. We have released VHHCorpus-2M, which contains over two million unlabeled VHH sequences and can be used to pre-train VHH-specific language models.

Columns

The dataset CSV file has the following columns.

| Column | Description |
| --- | --- |
| VHH_sequence | Amino acid sequence of the VHH |
| subject_species | Species of the subject from which the VHH was collected |
| subject_name | Name of the subject from which the VHH was collected |
| subject_sex | Sex of the subject from which the VHH was collected |
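
As a minimal sketch of reading a file with these columns, the following uses Python's standard `csv` module. The helper name `read_vhh_rows` and the sample rows are hypothetical and made up for illustration (the sequences are short placeholders, not real full-length VHHs); the actual CSV file name is not given here, so the example reads from an in-memory string instead.

```python
import csv
import io

# Column names as listed in the column description above.
EXPECTED_COLUMNS = ["VHH_sequence", "subject_species", "subject_name", "subject_sex"]


def read_vhh_rows(handle):
    """Read all rows from an open CSV handle after checking the header.

    Hypothetical helper for illustration; not part of the VHHCorpus scripts.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up sample data standing in for the real file.
SAMPLE_CSV = """VHH_sequence,subject_species,subject_name,subject_sex
QVQLVESGGG,Alpaca,Lucky,Female
EVQLVESGGG,Alpaca,Marin,Male
"""

rows = read_vhh_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["subject_name"])
```

To read a real file, the same function could be given a handle from `open(path, newline="")`, where `path` points to the downloaded CSV.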

Pipeline

VHHCorpus was generated through the following workflow. The scripts highlighted in blue are available on GitHub.

[Figure: VHHCorpus generation workflow]

Subjects

VHHCorpus-2M is a collection of unique VHH sequences obtained from five alpacas, all different from the alpacas used to generate AVIDa-SARS-CoV-2. Note that VHHCorpus-2M includes the publicly available AVIDa-hIL6 dataset as well as several datasets that have not been published as labeled binding datasets.

| Name | Species | Sex |
| --- | --- | --- |
| Lucky | Alpaca | Female |
| Marin | Alpaca | Male |
| Wizzy | Alpaca | Male |
| Yodel-Suri | Alpaca | Female |
| Yuki | Alpaca | Female |
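
Because each row records the subject it came from, per-subject counts and a uniqueness check are easy to compute. The sketch below is illustrative only: the function name `summarize` and the example rows are hypothetical, the sequences are short placeholders, and it assumes "unique" means no sequence appears twice anywhere in the file.

```python
from collections import Counter

# Subject names and sexes as listed in the subject table above.
SUBJECTS = {
    "Lucky": "Female",
    "Marin": "Male",
    "Wizzy": "Male",
    "Yodel-Suri": "Female",
    "Yuki": "Female",
}


def summarize(rows):
    """Return (sequence count per subject, True if no sequence repeats).

    Hypothetical helper for illustration; not part of the VHHCorpus scripts.
    """
    per_subject = Counter(row["subject_name"] for row in rows)
    unknown = set(per_subject) - set(SUBJECTS)
    if unknown:
        raise ValueError(f"unexpected subject names: {sorted(unknown)}")
    sequences = [row["VHH_sequence"] for row in rows]
    return per_subject, len(sequences) == len(set(sequences))


# Made-up rows standing in for rows parsed from the dataset CSV.
example_rows = [
    {"VHH_sequence": "QVQLVESGGG", "subject_name": "Lucky"},
    {"VHH_sequence": "EVQLVESGGG", "subject_name": "Yuki"},
    {"VHH_sequence": "QLQLVESGGG", "subject_name": "Yuki"},
]

counts, all_unique = summarize(example_rows)
print(dict(counts), all_unique)
```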