drug-target-activity
Source: ModelScope community (updated 2025-12-04, indexed 2025-12-06)
Download link:
https://modelscope.cn/datasets/eve-bio/drug-target-activity
# Introduction
This dataset, provided by [EvE Bio](https://evebio.org/), contains measurements of drug-target interactions.
It is actively being generated with a quantitative screening process, and data for new targets is added every other month. For each target, one or more types of activity (agonism, antagonism, etc.) are measured
for every drug in a 1,397-member compound library that primarily represents FDA-approved small-molecule drugs. Results are reported for every combination, whether active or inactive. Targets are members of three classes:
nuclear receptors (NRs), 7-transmembrane receptors (7TMs, also known as G-protein coupled receptors), and protein kinases. Data is also reported for cell viability, which is particularly important
in conjunction with the 7TM results, which are generated with live cell assays (the other classes use biochemical assays). More information can be found on [EvE's Data Page](https://data.evebio.org/),
including methods detailing the assays used and the data processing. The data provided in this dataset, as well as the raw data, can be viewed interactively in the data explorer available on that site.
Since drugs are typically selective for a small number of targets, active results are sparse, on the order of 1%. The key response variables are compound activity and potency. Binary activity and maximum observed activity are captured for every compound-assay combination
(`outcome_is_active`, `outcome_max_activity`). Activity is expressed as a percentage of maximum activity, relative to known standard compounds for each assay. For active compounds that have sufficient potency to be
measurable in the concentration range tested (`is_quantified`), four-parameter logistic curve fits yield a quantified potency, reported as pXC50 (`outcome_potency_pxc50`). pXC50 is the negative base-10 logarithm of the IC50/EC50,
the concentration at which half of the maximum activity is reached. A higher pXC50 indicates higher potency, and 5 is the lowest quantifiable pXC50 in the concentration range used.
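
The relationship between pXC50 and concentration can be made concrete with a short sketch. It assumes concentrations are expressed in mol/L, which is consistent with the 10 μM top concentration corresponding to the lowest quantifiable pXC50 of 5. The Hill-style parameterization below is one common form of a four-parameter logistic curve; the exact form used by EvE Bio is not stated here, and the function names are illustrative only.

```python
import math


def pxc50_to_molar(pxc50: float) -> float:
    """Convert a pXC50 value to the half-maximal concentration in mol/L."""
    return 10.0 ** (-pxc50)


def molar_to_pxc50(concentration_molar: float) -> float:
    """Convert a half-maximal concentration in mol/L to pXC50."""
    return -math.log10(concentration_molar)


def four_parameter_logistic(
    concentration_molar: float,
    pxc50: float,
    slope: float,
    asymp_min: float,
    asymp_max: float,
) -> float:
    """Response (% of reference activity) under an assumed Hill-style 4PL curve."""
    xc50 = pxc50_to_molar(pxc50)
    ratio = (xc50 / concentration_molar) ** slope
    return asymp_min + (asymp_max - asymp_min) / (1.0 + ratio)


# A compound with pXC50 = 6 reaches half of its maximum response at 1 μM.
half_max_response = four_parameter_logistic(1e-6, 6.0, 1.0, 0.0, 100.0)
```

The arguments `slope`, `asymp_min`, and `asymp_max` are named after the fitted-curve columns in the schema, on the assumption that those columns hold the corresponding curve parameters.
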
To collect these measurements, EvE uses a two-phase quantitative screening process. All combinations of compounds and assays are included in the screening phase,
which includes two replicates of three concentrations. A rules-based progression algorithm determines which compounds advance to the profiling phase,
where the 11-point concentration range is 10 μM to 10 pM. The full concentration response is effectively censored by the concentration range tested. For low-potency compounds,
this leads to results that are reported as active but not quantified.
In addition to cytotoxicity, compounds can interfere with assays in various ways, leading to potentially spurious results. Compounds with potential cytotoxicity, which indicates poor cell viability, are flagged when the cell viability assay has an observed maximum value greater than 15%. Compounds that appear with suspicious frequency (> 20 hits) for any given target class and mode are flagged as “high frequency”.
High-frequency compounds could be removed from the data before model development, but in some cases true activity will be lost in the process. Alternatively,
the frequency flag could be treated as a response in itself, in order to develop models that link compound and concentration response characteristics
with particular forms of interference. Columns that flag combinations where either cell viability or hit frequency merit consideration are included in the dataset
(`viability_flag`, `frequency_flag`).
The dataset contains one row per combination of target, compound, mode, and mechanism (currently there is only one mechanism per target class,
but this will change when data for both signaling pathways is added for 7TMs in 2026). NRs and 7TMs have two modes each, while protein kinases (PKs) and cell viability have one.
Multiple identifiers are included for both compounds and targets. For compounds: SMILES (a text-based chemical representation), InChIKey, CAS number, UNII, and DrugBank ID.
For targets: gene, UniProt ID, and mutant/wildtype indicators.
# Data Schema
- `assay_id` (string): EvE Bio assay identifier
- `target_id` (string): EvE Bio target identifier
- `compound_id` (string): EvE Bio compound identifier
- `mode` (string): Assay mode (Agonist, Antagonist, or Binding)
- `mechanism` (string): Assay mechanism of action (Barr2 Recruitment, Co-factor Recruitment, Competition Binding, ATP Production)
- `outcome_is_active` (bool): Flag to indicate whether the result is active
- `outcome_potency_pxc50` (float|null): pXC50 based on the fitted curve
- `outcome_max_activity` (float): Maximum Activity: The asymptotic max if a pXC50 was quantified, otherwise the highest average observed activity
- `observed_max` (float): Maximum Activity: The highest average observed activity (by concentration) for the final phase progressed to
- `is_quantified` (bool): Flag to indicate whether the result was quantified in the profiling phase
- `frequency_flag` (bool): Flag to indicate whether high frequency of 'hits' for the target class and mode should be considered when interpreting assay results
- `viability_flag` (bool): Flag to indicate whether viability data should be examined in conjunction with assay results
- `pxc50_modifier` (string): Modifier for the pXC50 value (fitting all replicates together)
- `slope` (float|null): Slope of the fitted curve (fitting all replicates together)
- `asymp_min` (float|null): Asymptotic minimum of the fitted curve (fitting all replicates together)
- `asymp_max` (float|null): Asymptotic maximum of the fitted curve (fitting all replicates together)
- `assay__technology` (string): Assay technology used (TR-FRET, FRET, Luminescence)
- `target__class` (string): Target class (7TM, NR, Kinase; the Quickstart Guide also filters on Viability for the cell viability assay)
- `target__gene` (string): Gene name
- `target__uniprot_id` (string): UniProt identifier (www.uniprot.org)
- `target__is_mutant` (bool): Flag to indicate whether the gene for a target is a mutant
- `target__wildtype_id` (string): Target_ID for the wildtype to link mutant assays with their associated wildtype assay
- `target__name` (string): Target full name
- `compound__name` (string): Compound name
- `compound__smiles` (string): SMILES
- `compound__drugbank_id` (string): DrugBank identifier
- `compound__cas` (string): CAS registry number
- `compound__unii` (string): FDA Unique Ingredient Identifier
- `compound__inchikey` (string): International Chemical Identifier key
- `progressed` (bool): Whether the compound progressed to the profiling phase
- `release` (string): EvE Bio data release number
# Quickstart Guide
This guide demonstrates how to load and work with the drug-target activity dataset, one target class at a time, and highlights the columns that are relevant to each class.
## Setup: Load the Dataset
```python
import polars as pl
from datasets import load_dataset
# Load dataset from Hugging Face
ds = load_dataset("eve-bio/drug-target-activity")
train_ds = ds['train']
df = pl.from_pandas(train_ds.to_pandas())
```
## Nuclear Receptors (NR)
Nuclear receptors are addressed with a biochemical assay in two modes - Agonist and Antagonist. Since these are biochemical assays, cell viability is not relevant. High frequency compounds are flagged based on activity levels by mode (`frequency_flag`).
```python
# Filter for nuclear receptor data
df_nr = df.filter(pl.col('target__class') == 'NR')
# Select relevant columns for NR analysis
df_nr_selected = df_nr.select([
    'target_id',
    'target__gene',
    'target__name',
    'target__uniprot_id',
    'compound_id',
    'compound__name',
    'compound__smiles',
    'mode',
    'mechanism',
    'outcome_is_active',
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'is_quantified',
    'frequency_flag'
])
```
## 7TM Receptors (7TMs/GPCRs)
7TM receptors (G-protein coupled receptors) are addressed with cell-based assays in two modes, Agonist and Antagonist. Because the assays are cell-based, cell viability is critical to data interpretation: cell death can masquerade as antagonism. Compounds with potential viability concerns are flagged (`viability_flag`), but proper interpretation requires directly comparing the potency and activity results from the viability assay with those from the 7TM assay. Some compounds that are flagged for viability are still truly active. High frequency compounds are flagged based on activity levels by mode (`frequency_flag`). 7TMs can signal through multiple mechanisms of action, an important pharmacological consideration for biased agonism. These mechanisms are captured in `mechanism` (e.g., Barr2). Data for multiple mechanisms will be available starting in 2026.
```python
# Filter for 7TM receptor data
df_7tm = df.filter(pl.col('target__class') == '7TM')
# Select relevant columns for 7TM analysis
df_7tm_selected = df_7tm.select([
    'target_id',
    'target__gene',
    'target__name',
    'target__uniprot_id',
    'compound_id',
    'compound__name',
    'compound__smiles',
    'mode',
    'mechanism',  # e.g., 'Barr2' for 7TM
    'outcome_is_active',
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'viability_flag',  # Important: flags compounds that affect cell viability
    'is_quantified',
    'frequency_flag'
])
```
## Protein Kinases
Protein kinases are addressed with a biochemical assay in a single mode, Binding. Since these are biochemical assays, cell viability is not relevant. High frequency compounds are flagged based on activity levels by mode (`frequency_flag`), excluding known promiscuous kinase binders. Protein kinase targets include both wildtype and mutant variants. Use `target__is_mutant` and `target__wildtype_id` to analyze mutant selectivity.
```python
# Filter for kinase data
df_kinase = df.filter(pl.col('target__class') == 'Kinase')
# Select relevant columns for kinase analysis
df_kinase_selected = df_kinase.select([
    'target_id',
    'target__gene',
    'target__name',
    'target__uniprot_id',
    'target__is_mutant',  # Boolean: is this a mutant kinase?
    'target__wildtype_id',  # Reference to wildtype target if mutant
    'compound_id',
    'compound__name',
    'compound__smiles',
    'mode',
    'outcome_is_active',
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'is_quantified',
    'frequency_flag'
])
```
## Cell Viability
For viability assays, `outcome_is_active = True` indicates the compound is potentially cytotoxic.
```python
# Filter for viability assay data
df_viability = df.filter(pl.col('target__class') == 'Viability')
# Select relevant columns for viability analysis
df_viability_selected = df_viability.select([
    'target_id',
    'compound_id',
    'compound__name',
    'compound__smiles',
    'outcome_is_active',  # Active = potentially cytotoxic
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'is_quantified'
])
```
## Joining Viability Data to Other Target Classes
You can join viability data to 7TM data to identify compounds with potential cytotoxicity concerns. Viability is relevant to this target class because of the cell-based assays, and it is particularly critical when interpreting antagonist data.
```python
# Prepare viability outcomes for joining
viability_outcomes = df.filter(
    pl.col('target__class') == 'Viability'
).select([
    'compound_id',
    pl.col('outcome_is_active').alias('viability_is_active'),
    pl.col('is_quantified').alias('viability_is_quantified'),
    pl.col('outcome_potency_pxc50').alias('viability_pxc50'),
    pl.col('outcome_max_activity').alias('viability_max_activity')
])
# Join viability data to 7TM results
df_7tm_with_viability = df_7tm_selected.join(
    viability_outcomes,
    on='compound_id',
    how='left'  # Left join keeps all 7TM records, adds viability data where available
)
# Identify 7TM actives that may have viability concerns
potential_concerns = df_7tm_with_viability.filter(
    (pl.col('mode') == 'Antagonist') &  # 7TM target in Antagonist mode only
    pl.col('outcome_is_active') &  # Active at 7TM target
    pl.col('viability_flag')  # Flagged for viability issues
)
```
# Citation
## Citation Information
EvE Bio, LLC (2025). Data Releases #1-#7. https://huggingface.co/datasets/eve-bio/drug-target-activity. Accessed yyyy-mm-dd.
Provider: maas
Created: 2025-11-25



