wanglab/delbert_data
Hugging Face | Updated 2026-04-02 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/wanglab/delbert_data
Resource description:
---
dataset_info:
- config_name: DCAF7
  features:
  - name: COMPOUND_ID
    dtype: string
  - name: LIBRARY_ID
    dtype: string
  - name: LABEL
    dtype: int32
  - name: MW
    dtype: float32
  - name: ALOGP
    dtype: float32
  - name: ECFP4_indices
    sequence: uint16
  - name: ECFP4_values
    sequence: uint16
  - name: ECFP6_indices
    sequence: uint16
  - name: ECFP6_values
    sequence: uint16
  - name: FCFP4_indices
    sequence: uint16
  - name: FCFP4_values
    sequence: uint16
  - name: FCFP6_indices
    sequence: uint16
  - name: FCFP6_values
    sequence: uint16
  - name: MACCS_indices
    sequence: uint16
  - name: MACCS_values
    sequence: uint16
  - name: RDK_indices
    sequence: uint16
  - name: RDK_values
    sequence: uint16
  - name: AVALON_indices
    sequence: uint16
  - name: AVALON_values
    sequence: uint16
  - name: ATOMPAIR_indices
    sequence: uint16
  - name: ATOMPAIR_values
    sequence: uint16
  - name: TOPTOR_indices
    sequence: uint16
  - name: TOPTOR_values
    sequence: uint16
  splits:
  - name: train
    num_bytes: 4437627052
    num_examples: 512199
  download_size: 1809833636
  dataset_size: 4437627052
- config_name: LRRK2
  features:
  - name: COMPOUND_ID
    dtype: string
  - name: LIBRARY_ID
    dtype: string
  - name: LABEL
    dtype: int32
  - name: MW
    dtype: float32
  - name: ALOGP
    dtype: float32
  - name: ECFP4_indices
    list: uint16
  - name: ECFP4_values
    list: uint16
  - name: ECFP6_indices
    list: uint16
  - name: ECFP6_values
    list: uint16
  - name: FCFP4_indices
    list: uint16
  - name: FCFP4_values
    list: uint16
  - name: FCFP6_indices
    list: uint16
  - name: FCFP6_values
    list: uint16
  - name: MACCS_indices
    list: uint16
  - name: MACCS_values
    list: uint16
  - name: RDK_indices
    list: uint16
  - name: RDK_values
    list: uint16
  - name: AVALON_indices
    list: uint16
  - name: AVALON_values
    list: uint16
  - name: ATOMPAIR_indices
    list: uint16
  - name: ATOMPAIR_values
    list: uint16
  - name: TOPTOR_indices
    list: uint16
  - name: TOPTOR_values
    list: uint16
  splits:
  - name: train
    num_bytes: 2567986954
    num_examples: 320791
  download_size: 1057052171
  dataset_size: 2567986954
- config_name: SETDB1
  features:
  - name: COMPOUND_ID
    dtype: string
  - name: LIBRARY_ID
    dtype: string
  - name: LABEL
    dtype: int32
  - name: MW
    dtype: float32
  - name: ALOGP
    dtype: float32
  - name: ECFP4_indices
    list: uint16
  - name: ECFP4_values
    list: uint16
  - name: ECFP6_indices
    list: uint16
  - name: ECFP6_values
    list: uint16
  - name: FCFP4_indices
    list: uint16
  - name: FCFP4_values
    list: uint16
  - name: FCFP6_indices
    list: uint16
  - name: FCFP6_values
    list: uint16
  - name: MACCS_indices
    list: uint16
  - name: MACCS_values
    list: uint16
  - name: RDK_indices
    list: uint16
  - name: RDK_values
    list: uint16
  - name: AVALON_indices
    list: uint16
  - name: AVALON_values
    list: uint16
  - name: ATOMPAIR_indices
    list: uint16
  - name: ATOMPAIR_values
    list: uint16
  - name: TOPTOR_indices
    list: uint16
  - name: TOPTOR_values
    list: uint16
  splits:
  - name: train
    num_bytes: 3368828730
    num_examples: 419679
  download_size: 1372120106
  dataset_size: 3368828730
- config_name: WDR12
  features:
  - name: COMPOUND_ID
    dtype: string
  - name: LIBRARY_ID
    dtype: string
  - name: LABEL
    dtype: int32
  - name: MW
    dtype: float32
  - name: ALOGP
    dtype: float32
  - name: ECFP4_indices
    list: uint16
  - name: ECFP4_values
    list: uint16
  - name: ECFP6_indices
    list: uint16
  - name: ECFP6_values
    list: uint16
  - name: FCFP4_indices
    list: uint16
  - name: FCFP4_values
    list: uint16
  - name: FCFP6_indices
    list: uint16
  - name: FCFP6_values
    list: uint16
  - name: MACCS_indices
    list: uint16
  - name: MACCS_values
    list: uint16
  - name: RDK_indices
    list: uint16
  - name: RDK_values
    list: uint16
  - name: AVALON_indices
    list: uint16
  - name: AVALON_values
    list: uint16
  - name: ATOMPAIR_indices
    list: uint16
  - name: ATOMPAIR_values
    list: uint16
  - name: TOPTOR_indices
    list: uint16
  - name: TOPTOR_values
    list: uint16
  splits:
  - name: train
    num_bytes: 1092486474
    num_examples: 140808
  download_size: 447931553
  dataset_size: 1092486474
- config_name: WDR91
  features:
  - name: COMPOUND_ID
    dtype: string
  - name: LIBRARY_ID
    dtype: string
  - name: LABEL
    dtype: int32
  - name: MW
    dtype: float32
  - name: ALOGP
    dtype: float32
  - name: ECFP4_indices
    sequence: uint16
  - name: ECFP4_values
    sequence: uint16
  - name: ECFP6_indices
    sequence: uint16
  - name: ECFP6_values
    sequence: uint16
  - name: FCFP4_indices
    sequence: uint16
  - name: FCFP4_values
    sequence: uint16
  - name: FCFP6_indices
    sequence: uint16
  - name: FCFP6_values
    sequence: uint16
  - name: MACCS_indices
    sequence: uint16
  - name: MACCS_values
    sequence: uint16
  - name: RDK_indices
    sequence: uint16
  - name: RDK_values
    sequence: uint16
  - name: AVALON_indices
    sequence: uint16
  - name: AVALON_values
    sequence: uint16
  - name: ATOMPAIR_indices
    sequence: uint16
  - name: ATOMPAIR_values
    sequence: uint16
  - name: TOPTOR_indices
    sequence: uint16
  - name: TOPTOR_values
    sequence: uint16
  splits:
  - name: train
    num_bytes: 3084455761
    num_examples: 375595
  download_size: 1265745132
  dataset_size: 3084455761
- config_name: WDR91_test
  features:
  - name: RandomID
    dtype: string
  - name: SMILES
    dtype: string
  - name: MW
    dtype: float32
  - name: AlogP
    dtype: float32
  - name: ECFP4_indices
    list: uint16
  - name: ECFP4_values
    list: uint16
  - name: ECFP6_indices
    list: uint16
  - name: ECFP6_values
    list: uint16
  - name: FCFP4_indices
    list: uint16
  - name: FCFP4_values
    list: uint16
  - name: FCFP6_indices
    list: uint16
  - name: FCFP6_values
    list: uint16
  - name: MACCS_indices
    list: uint16
  - name: MACCS_values
    list: uint16
  - name: RDK_indices
    list: uint16
  - name: RDK_values
    list: uint16
  - name: AVALON_indices
    list: uint16
  - name: AVALON_values
    list: uint16
  - name: ATOMPAIR_indices
    list: uint16
  - name: ATOMPAIR_values
    list: uint16
  - name: TOPTOR_indices
    list: uint16
  - name: TOPTOR_values
    list: uint16
  splits:
  - name: train
    num_bytes: 1994680534
    num_examples: 339258
  download_size: 813330958
  dataset_size: 1994680534
configs:
- config_name: DCAF7
  data_files:
  - split: train
    path: DCAF7/train-*
- config_name: LRRK2
  data_files:
  - split: train
    path: LRRK2/train-*
- config_name: SETDB1
  data_files:
  - split: train
    path: SETDB1/train-*
- config_name: WDR12
  data_files:
  - split: train
    path: WDR12/train-*
- config_name: WDR91
  data_files:
  - split: train
    path: WDR91/train-*
- config_name: WDR91_test
  data_files:
  - split: train
    path: WDR91_test/train-*
---
Provider:
wanglab
Collected summary
Dataset introduction

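Construction
In computational medicinal chemistry, the delbert_data dataset brings together high-throughput screening data for several protein targets. Its five training configurations (DCAF7, LRRK2, SETDB1, WDR12, and WDR91) carry compound activity labels and range from 140,808 examples (WDR12) to 512,199 examples (DCAF7). A sixth configuration, WDR91_test, adds 339,258 examples without labels. Each compound is represented by nine fingerprint types (ECFP4, ECFP6, FCFP4, FCFP6, MACCS, RDK, AVALON, ATOMPAIR, and TOPTOR), and each fingerprint is stored sparsely as a pair of uint16 lists, one holding indices and one holding values. Every configuration ships as a single split named train, which gives supervised models a large pool of examples but leaves any validation split to the user.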
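The sparse index-value layout described above can be expanded into a dense vector when a model needs fixed-width input. The following is a minimal sketch, not the dataset authors' code. The helper name `to_dense` and the length of 2048 are assumptions: the dataset card does not state fingerprint lengths, and the only constraint visible in the schema is that indices are `uint16`, so they cannot exceed 65535.

```python
from typing import List, Sequence


def to_dense(indices: Sequence[int], values: Sequence[int], n_bits: int = 2048) -> List[int]:
    """Expand parallel index/value lists into a dense list of length n_bits.

    n_bits is an assumed fingerprint length, not a value from the dataset card.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0] * n_bits
    for idx, val in zip(indices, values):
        if not 0 <= idx < n_bits:
            raise ValueError(f"index {idx} is outside [0, {n_bits})")
        dense[idx] = val
    return dense


# Hypothetical row fragment shaped like the ECFP4_indices / ECFP4_values columns.
row = {"ECFP4_indices": [3, 17, 1024], "ECFP4_values": [1, 2, 1]}
vec = to_dense(row["ECFP4_indices"], row["ECFP4_values"])
print(len(vec), vec[17])
```

Densifying every fingerprint for every row multiplies memory use, so it is usually worth doing per batch rather than for a whole configuration at once.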
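Usage
To use delbert_data in drug discovery research, load a single target configuration, such as DCAF7 or LRRK2, through the Hugging Face datasets library. Each training row contains a compound identifier (COMPOUND_ID), a library identifier (LIBRARY_ID), an integer activity label (LABEL), two descriptors (MW and ALOGP), and the full set of fingerprint columns. Because LABEL is stored as int32, classification is the natural setup for predicting activity against a given target; the card does not document the label's meaning beyond its type. The fingerprints suit conventional machine learning models and neural networks that take vector input. Graph neural networks need molecular structures, which the training configurations do not include. The WDR91_test configuration provides SMILES strings and random identifiers (RandomID) but no LABEL column, so it fits prospective prediction; using it for external validation requires activity labels from another source.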
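Loading one configuration might look like the sketch below. It relies on `load_dataset` from the Hugging Face `datasets` package; the wrapper `load_target` and the `CONFIGS` tuple are hypothetical helpers introduced here, and the choice of streaming mode is an assumption made because each configuration is large (download sizes in the card range from roughly 0.45 GB to 1.8 GB).

```python
# Configuration names copied from the dataset card.
CONFIGS = ("DCAF7", "LRRK2", "SETDB1", "WDR12", "WDR91", "WDR91_test")


def load_target(config: str, streaming: bool = True):
    """Load one configuration of wanglab/delbert_data.

    Every configuration exposes a single split named "train".
    streaming=True avoids downloading all data files up front.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    return load_dataset("wanglab/delbert_data", config, split="train", streaming=streaming)
```

With network access, `next(iter(load_target("DCAF7")))` should return a dict keyed by the column names listed in the card, although this has not been verified against the hosted files.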
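Background and challenges
Background overview
In computational drug discovery, molecular activity prediction is central to speeding up early-stage research. The delbert_data dataset, published by wanglab, targets compound activity classification for specific protein targets (DCAF7, LRRK2, SETDB1, WDR12, and WDR91). Its five training configurations total about 1.77 million rows, each carrying several fingerprint families, including the ECFP, FCFP, and MACCS representations, and are intended as training data for machine learning models. The dataset is positioned as a resource for AI-based virtual screening and target-focused drug design.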
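The scale mentioned above can be checked directly against the `num_examples` fields in the dataset card. The short calculation below only restates those numbers; note that it counts rows, and the card does not say whether the same compound appears under more than one target.

```python
# num_examples per labeled training configuration, copied from the dataset card.
TRAIN_EXAMPLES = {
    "DCAF7": 512_199,
    "LRRK2": 320_791,
    "SETDB1": 419_679,
    "WDR12": 140_808,
    "WDR91": 375_595,
}
# The WDR91_test configuration, which has no LABEL column.
TEST_EXAMPLES = 339_258

total_train = sum(TRAIN_EXAMPLES.values())
total_all = total_train + TEST_EXAMPLES

print(total_train)  # 1769072
print(total_all)    # 2108330
```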
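Current challenges
The dataset addresses a core problem in activity prediction: identifying biologically active compounds from high-dimensional, sparse molecular fingerprints. Several difficulties follow. The number and dimensionality of the fingerprints complicate feature representation and raise both computational cost and the risk of overfitting. Heterogeneity across targets demands models that generalize well. Activity labels depend on costly experiments, so annotation is expensive and may be noisy. Together these factors limit how the dataset can be applied and extended.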
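One way to keep the cost of high-dimensional sparse fingerprints down is to compare molecules directly on their index-value pairs instead of densifying them. The sketch below computes the Tanimoto similarity in its count-vector form (sum of element-wise minima over sum of element-wise maxima), which reduces to the usual bit-vector Tanimoto when all values are 1. The function name is hypothetical, and the code assumes each index appears at most once per fingerprint, which the card does not state.

```python
from typing import Sequence


def sparse_tanimoto(
    idx_a: Sequence[int],
    val_a: Sequence[int],
    idx_b: Sequence[int],
    val_b: Sequence[int],
) -> float:
    """Tanimoto similarity between two sparse count fingerprints."""
    a = dict(zip(idx_a, val_a))
    b = dict(zip(idx_b, val_b))
    keys = a.keys() | b.keys()
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Two empty fingerprints are treated as dissimilar by convention here.
    return shared / total if total else 0.0


# Hypothetical fingerprints: minima sum to 2, maxima sum to 7.
sim = sparse_tanimoto([1, 5, 9], [2, 1, 1], [1, 5, 7], [1, 1, 3])
print(round(sim, 4))
```

The work per pair scales with the number of non-zero entries rather than with the fingerprint length, which matters when the same comparison is repeated across hundreds of thousands of rows.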
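Common scenarios
Academic problems addressed
The dataset helps with two problems in compound activity prediction: scarce training data and inconsistent molecular representations. By providing screening data for several target proteins in a shared schema, it supports research on cross-target generalization and on the mapping between molecular features and biological activity. It can serve as a reproducible benchmark for fingerprint-based algorithms such as deep forests and, where molecular structures are available, graph neural networks, and as a data basis for studying selectivity across targets.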
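Practical applications
In practical drug development, a dataset of this kind can speed up the early discovery stage. Models trained on it could screen large compound libraries for candidates likely to be active against a given target, reducing the cost and time of experimental screening. For example, for LRRK2, a target associated with Parkinson's disease, or for SETDB1, a prediction tool built on this data could help synthetic chemists prioritize compounds with a high predicted probability of activity.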
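The prioritization step described above amounts to ranking library compounds by a model's predicted probability of activity and keeping the top few. A minimal sketch follows; the identifiers and scores are made up for illustration, and `top_k` is a hypothetical helper rather than part of any published pipeline for this dataset.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k(scored: Iterable[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (identifier, score) pairs with the highest scores."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up identifiers and predicted probabilities, for illustration only.
predictions = [("id_a", 0.12), ("id_b", 0.91), ("id_c", 0.47), ("id_d", 0.78)]
shortlist = top_k(predictions, k=2)
print(shortlist)
```

`heapq.nlargest` accepts any iterable, so the same helper works on a stream of scored rows without holding the whole library in memory.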
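Recent research on the dataset
Latest research directions
With its broad set of fingerprint features, delbert_data is a useful resource for current work in computational drug discovery. It combines several molecular representations, such as ECFP, FCFP, and MACCS, giving deep learning models multiple views of molecular structure. Current research focuses on activity prediction and virtual screening with these features, particularly compound optimization for targets such as LRRK2 and SETDB1, where graph neural networks and attention mechanisms are combined to improve interpretability and generalization. As AI sees wider use in drug development, datasets like this one support more automated screening and faster lead discovery.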
The content above was collected and summarized by 遇见数据集.



