alchemab/her2-binding-prediction
Source: Hugging Face. Updated 2023-09-12; indexed 2025-04-12.
Download link: https://hf-mirror.com/datasets/alchemab/her2-binding-prediction
---
dataset_info:
  features:
  - name: vh
    dtype: string
  - name: vl
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': non-binder
          '1': binder
  splits:
  - name: train
    num_bytes: 4428189
    num_examples: 18223
  - name: eval
    num_bytes: 553554
    num_examples: 2278
  - name: test
    num_bytes: 553554
    num_examples: 2278
  download_size: 415916
  dataset_size: 5535297
---
# HER2 binding dataset
The HER2 binding data were obtained from the [GitHub repo](https://github.com/dahjan/DMS_opt) for [Mason et al. (2021)](https://www.nature.com/articles/s41551-021-00699-9).
Labels for the antibody sequences were generated using the scripts in that GitHub repo. The numbers of negatives and positives were balanced by random undersampling with [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn), and the sequences were deduplicated.
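
The balancing and deduplication steps can be illustrated with a small standard-library sketch. This is not the script used to build the dataset: the card names imbalanced-learn for the undersampling, while the code below swaps in `random.sample`. The function names, the toy records, the choice of `(vh, vl)` as the duplicate key, and the order of operations (undersample first, then deduplicate, following the order of the sentence above) are all assumptions.

```python
import random
from collections import defaultdict


def random_undersample(records, seed=0):
    """Randomly drop records from larger classes until every class
    has as many records as the smallest one."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[2]].append(record)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


def deduplicate(records):
    """Keep the first occurrence of each (vh, vl) pair (assumed key)."""
    seen = set()
    unique = []
    for vh, vl, label in records:
        if (vh, vl) not in seen:
            seen.add((vh, vl))
            unique.append((vh, vl, label))
    return unique


# Toy records as (vh, vl, label). The sequences are placeholders,
# not real antibodies.
toy = [
    ("VH_A", "VL_A", 1),
    ("VH_A", "VL_A", 1),  # exact duplicate of the record above
    ("VH_E", "VL_A", 1),
    ("VH_B", "VL_A", 0),
    ("VH_C", "VL_A", 0),
    ("VH_D", "VL_A", 0),
    ("VH_F", "VL_A", 0),
]

balanced = random_undersample(toy)  # 3 positives, 3 of the 4 negatives
final = deduplicate(balanced)       # the duplicate positive is dropped
print(len(balanced), len(final))
```

In this sketch, deduplicating after balancing leaves the classes slightly unequal. An odd total such as 22779 likewise cannot be split into two equal classes, although the card does not state the per-class counts.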
The dataset has:
* 39108 antibodies in total before undersampling and deduplication
* 22779 antibodies after undersampling and deduplication
* 18223 in the training set, 2278 in the evaluation set, and 2278 in the test set.
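
As a quick consistency check on the counts above, the following sketch verifies that the three splits add up to the post-processing total and computes the share of each split. The approximate 80/10/10 ratio is inferred from the numbers, not stated in the card, and the variable names are illustrative.

```python
# Counts taken from the list above.
total_before = 39108
total_after = 22779
splits = {"train": 18223, "eval": 2278, "test": 2278}

# The splits should partition the processed dataset exactly.
assert sum(splits.values()) == total_after

# Records removed by undersampling and deduplication combined.
removed = total_before - total_after

# Share of each split, rounded to three decimals.
fractions = {name: round(n / total_after, 3) for name, n in splits.items()}

print(removed)
print(fractions)
```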
## Example

| vh | vl | label |
| ------- | ------- | ----- |
| EVQ... | DIQ... | 1 |
| EVQ... | DIQ... | 0 |
| EVQ... | DIQ... | 0 |
| EVQ... | DIQ... | 1 |
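
A minimal sketch of reading rows like these, assuming the Hugging Face `datasets` library is installed and the dataset is reachable under the repository id `alchemab/her2-binding-prediction`. The column names `vh`, `vl`, and `label` and the label names come from the metadata at the top of this card; `load_splits` and `describe` are hypothetical helper names introduced for illustration.

```python
# Label names as declared in the card's class_label metadata.
LABEL_NAMES = {0: "non-binder", 1: "binder"}


def describe(example):
    """Summarise one row: chain lengths and the decoded label."""
    return "VH {} aa, VL {} aa, {}".format(
        len(example["vh"]), len(example["vl"]), LABEL_NAMES[example["label"]]
    )


def load_splits():
    """Download the dataset and return its splits (train, eval, test).

    Requires the third-party `datasets` package and network access,
    so the import is kept local to this function.
    """
    from datasets import load_dataset

    return load_dataset("alchemab/her2-binding-prediction")


# Placeholder row: the sequences are truncated stand-ins, not real data.
row = {"vh": "EVQ", "vl": "DIQ", "label": 1}
summary = describe(row)
print(summary)
```

With network access, `describe(load_splits()["train"][0])` would summarise the first training row in the same way.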

## References
* [Mason et al. (2021) paper](https://www.nature.com/articles/s41551-021-00699-9)
Provider: alchemab



