
alchemab/her2-binding-prediction

Source: Hugging Face. Updated: 2023-09-12. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/alchemab/her2-binding-prediction
Description:
---
dataset_info:
  features:
  - name: vh
    dtype: string
  - name: vl
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': non-binder
          '1': binder
  splits:
  - name: train
    num_bytes: 4428189
    num_examples: 18223
  - name: eval
    num_bytes: 553554
    num_examples: 2278
  - name: test
    num_bytes: 553554
    num_examples: 2278
  download_size: 415916
  dataset_size: 5535297
---

# HER2 binding dataset

HER2 binding antibodies have been obtained from the [Github repo](https://github.com/dahjan/DMS_opt) for [Mason et al. (2021)](https://www.nature.com/articles/s41551-021-00699-9).
Labels for antibody sequences were generated using scripts in the above Github repo. The numbers of negatives and positives were balanced through random undersampling using [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn), and sequences were deduplicated.

The dataset has:

* 39108 antibodies in total
* 22779 antibodies after undersampling and deduplication
* 18223 in the training set, 2278 in the evaluation set, and 2278 in the test set.

Example

| vh_full | vl_full | label |
| ------- | ------- | ----- |
| EVQ...  | DIQ...  | 1     |
| EVQ...  | DIQ...  | 0     |
| EVQ...  | DIQ...  | 0     |
| EVQ...  | DIQ...  | 1     |

References

* [Mason et al. (2021) paper](https://www.nature.com/articles/s41551-021-00699-9)
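
The preparation steps the card describes (random undersampling, deduplication, then a three-way split) can be illustrated with a minimal, standard-library sketch. This is not the authors' pipeline: the card says imbalanced-learn was used, whereas the `undersample` helper here is a plain-Python stand-in, and the step order, the seed, the 10% evaluation and test fractions, and the toy `VH…`/`VL…` records are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample(records, rng):
    """Randomly shrink every class to the size of the smallest class.

    Plain-Python stand-in for random undersampling; the card itself
    reports using imbalanced-learn for this step.
    """
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[2]].append(rec)
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_min))
    rng.shuffle(balanced)
    return balanced


def dedupe(records):
    """Drop repeated (vh, vl) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for vh, vl, label in records:
        if (vh, vl) not in seen:
            seen.add((vh, vl))
            unique.append((vh, vl, label))
    return unique


def split(records, eval_frac=0.1, test_frac=0.1):
    """Slice records into train/eval/test (fractions are assumed)."""
    n_eval = round(len(records) * eval_frac)
    n_test = round(len(records) * test_frac)
    n_train = len(records) - n_eval - n_test
    return {
        "train": records[:n_train],
        "eval": records[n_train:n_train + n_eval],
        "test": records[n_train + n_eval:],
    }


# Hypothetical toy records: (vh, vl, label), 1 = binder, 0 = non-binder.
toy = [(f"VH{i:03d}", "VL000", 1) for i in range(30)]
toy += [(f"VH{i:03d}", "VL000", 0) for i in range(30, 100)]
toy += toy[:5]  # inject five duplicated binder rows

balanced = undersample(toy, random.Random(0))
unique = dedupe(balanced)
parts = split(unique)

# An assumed 10%/10% hold-out applied to 22779 rows gives the sizes on the card.
card_sizes = {name: len(rows) for name, rows in split(list(range(22779))).items()}
print(card_sizes)  # {'train': 18223, 'eval': 2278, 'test': 2278}
```

Deduplicating after undersampling, as sketched here, can leave the classes slightly unequal, so a real pipeline might re-check the class counts at the end.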

Provider:
alchemab