
eve-bio-drug-target-activity

ModelScope community, updated 2025-12-04, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/RealmSky/eve-bio-drug-target-activity
# Introduction

This dataset of drug-target interaction measurements is provided by [EvE Bio](https://evebio.org/). It is actively being generated with a quantitative screening process, and data for new targets is added every other month. For each target, one or more types of activity (agonism, antagonism, etc.) are measured for every drug in a 1,397-member compound library that primarily represents FDA-approved small-molecule drugs. Results are reported for every combination, whether active or inactive. Targets are members of three classes: nuclear receptors (NRs), 7-transmembrane receptors (7TMs, also known as G-protein coupled receptors), and protein kinases. Data is also reported for cell viability, which is particularly important in conjunction with the 7TM results because those are generated with live-cell assays (the other classes use biochemical assays). More information, including methods detailing the assays used and the data processing, can be found on [EvE's Data Page](https://data.evebio.org/). The data provided in this dataset, as well as the raw data, can be viewed interactively in the data explorer available on that site.

Since drugs are typically selective for a small number of targets, active results are sparse, on the order of 1%. The key response variables are compound activity and potency. Binary activity and maximum observed activity are captured for every compound-assay combination (`outcome_is_active`, `outcome_max_activity`). Activity is expressed as a percentage of maximum activity, in reference to known standard compounds for each assay. For active compounds that have sufficient potency to be measurable in the concentration range tested (`is_quantified`), four-parameter logistic curve fits yield a quantified potency, reported as pXC50 (`outcome_potency_pxc50`). pXC50 is the negative log of the IC50/EC50, the concentration at which half of the maximum activity is reached.
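
The pXC50 definition can be made concrete with a small conversion sketch. It assumes the usual convention of a base-10 logarithm over a molar concentration, which is consistent with the 10 μM top concentration corresponding to a pXC50 of 5. The function names are illustrative and are not part of the dataset or of any EvE Bio tooling.

```python
import math


def pxc50_to_molar(pxc50: float) -> float:
    """Convert a pXC50 to a molar XC50, assuming XC50 = 10 ** (-pXC50)."""
    return 10.0 ** (-pxc50)


def molar_to_pxc50(conc_molar: float) -> float:
    """Convert a molar XC50 back to pXC50 (negative base-10 log)."""
    if conc_molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(conc_molar)


# Print a few pXC50 values alongside the concentration in micromolar.
for pxc50 in (5.0, 7.0, 9.0):
    conc_um = pxc50_to_molar(pxc50) * 1e6  # molar -> micromolar
    print(f"pXC50 {pxc50:.1f} ~ {conc_um:g} uM")
```

Each unit of pXC50 is a tenfold change in concentration, so a compound with pXC50 7 reaches half-maximal activity at a concentration 100 times lower than one with pXC50 5.
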
Higher pXC50 values indicate higher potency, and 5 is the lowest quantifiable pXC50 in the concentration range used.

To collect these measurements, EvE uses a two-phase quantitative screening process. All combinations of compounds and assays are included in the screening phase, which includes two replicates of three concentrations. A rules-based progression algorithm determines which compounds advance to the profiling phase, where the 11-point concentration range is 10 μM to 10 pM. The full concentration response is effectively censored by the concentration range tested. For low-potency compounds, this leads to results that are reported as active but not quantified.

In addition to cytotoxicity, compounds can interfere with assays in various ways, leading to potentially spurious results. Compounds with potential cytotoxicity, indicating poor cell viability, are flagged when the cell viability assay has observed maximum values greater than 15%. Compounds that appear with suspicious frequency (> 20 hits) for any given target class and mode are flagged as “high frequency”. They could be removed from the data before model development, but in some cases true activity will be lost in the process. Alternatively, this frequency flag could be treated as a response in itself, in order to develop models that link compound and concentration-response characteristics with particular forms of interference. Columns that flag combinations where either cell viability or hit frequency merits consideration are included in the dataset (`viability_flag`, `frequency_flag`).

The dataset contains one row per combination of target, compound, mode, and mechanism (currently there is only one mechanism per target class, but this will change when data for both signaling pathways is added for 7TMs in 2026). NRs and 7TMs have two modes each, while protein kinases (PKs) and cell viability have one. Multiple identifiers are included for both compounds and targets.
For compounds: SMILES (a text-based chemical representation), InChIKey, CAS number, UNII, and DrugBank ID. For targets: gene, UniProt ID, and mutant/wildtype indicators.

# Data Schema

- `assay_id` (string): EvE Bio assay identifier
- `target_id` (string): EvE Bio target identifier
- `compound_id` (string): EvE Bio compound identifier
- `mode` (string): Assay mode (Agonist, Antagonist, or Binding)
- `mechanism` (string): Assay mechanism of action (Barr2 Recruitment, Co-factor Recruitment, Competition Binding, ATP Production)
- `outcome_is_active` (bool): Flag to indicate whether the result is active
- `outcome_potency_pxc50` (float|null): pXC50 based on the fitted curve
- `outcome_max_activity` (float): Maximum activity: the asymptotic maximum if a pXC50 was quantified, otherwise the highest average observed activity
- `observed_max` (float): Maximum activity: the highest average observed activity (by concentration) for the final phase progressed to
- `is_quantified` (bool): Flag to indicate whether the result was quantified in the profiling phase
- `frequency_flag` (bool): Flag to indicate whether a high frequency of "hits" for the target class and mode should be considered when interpreting assay results
- `viability_flag` (bool): Flag to indicate whether viability data should be examined in conjunction with assay results
- `pxc50_modifier` (string): Modifier for the pXC50 value (fitting all replicates together)
- `slope` (float|null): Slope of the fitted curve (fitting all replicates together)
- `asymp_min` (float|null): Asymptotic minimum of the fitted curve (fitting all replicates together)
- `asymp_max` (float|null): Asymptotic maximum of the fitted curve (fitting all replicates together)
- `assay__technology` (string): Assay technology used (TR-FRET, FRET, Luminescence)
- `target__class` (string): Target class (7TM, NR, Kinase)
- `target__gene` (string): Gene name
- `target__uniprot_id` (string): UniProt identifier (www.uniprot.org)
- `target__is_mutant` (bool): Flag to indicate whether the gene for a target is a mutant
- `target__wildtype_id` (string): `target_id` of the wildtype, used to link mutant assays with their associated wildtype assay
- `target__name` (string): Target full name
- `compound__name` (string): Compound name
- `compound__smiles` (string): SMILES
- `compound__drugbank_id` (string): DrugBank identifier
- `compound__cas` (string): CAS registry number
- `compound__unii` (string): FDA Unique Ingredient Identifier
- `compound__inchikey` (string): International Chemical Identifier key
- `progressed` (bool): Whether the compound progressed to the profiling phase
- `release` (string): EvE Bio data release number

# Quickstart Guide

This guide demonstrates how to load the drug-target activity dataset and work with it one target class at a time, highlighting the columns that matter for each class.

## Setup: Load the Dataset

```python
import polars as pl
from datasets import load_dataset

# Load dataset from Hugging Face
ds = load_dataset("eve-bio/drug-target-activity")
train_ds = ds['train']

# Convert the Hugging Face split to a polars DataFrame (via pandas)
df = pl.from_pandas(train_ds.to_pandas())
```

## Nuclear Receptors (NR)

Nuclear receptors are addressed with a biochemical assay in two modes: Agonist and Antagonist. Since these are biochemical assays, cell viability is not relevant. High frequency compounds are flagged based on activity levels by mode (`frequency_flag`).

```python
# Filter for nuclear receptor data
df_nr = df.filter(pl.col('target__class') == 'NR')

# Select relevant columns for NR analysis
df_nr_selected = df_nr.select([
    'target_id',
    'target__gene',
    'target__name',
    'target__uniprot_id',
    'compound_id',
    'compound__name',
    'compound__smiles',
    'mode',
    'mechanism',
    'outcome_is_active',
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'is_quantified',
    'frequency_flag'
])
```

## 7TM Receptors (7TMs/GPCRs)

7TM receptors (G-protein coupled receptors) are addressed with cell-based assays in two modes: Agonist and Antagonist.
The use of cell-based assays makes consideration of cell viability critical to data interpretation. Cell death can masquerade as antagonism. Compounds with potential viability concerns are flagged (`viability_flag`), but proper interpretation requires directly comparing the potency and activity results from the viability assay with those from the 7TM assays. Some compounds that are flagged for viability are still truly active. High frequency compounds are flagged based on activity levels by mode (`frequency_flag`).

7TMs can act through multiple mechanisms of action, and this is an important pharmacological consideration regarding biased agonism. These mechanisms are captured in `mechanism` (e.g., Barr2). Data for multiple mechanisms will be available starting in 2026.

```python
# Filter for 7TM receptor data
df_7tm = df.filter(pl.col('target__class') == '7TM')

# Select relevant columns for 7TM analysis
df_7tm_selected = df_7tm.select([
    'target_id',
    'target__gene',
    'target__name',
    'target__uniprot_id',
    'compound_id',
    'compound__name',
    'compound__smiles',
    'mode',
    'mechanism',  # e.g., 'Barr2' for 7TM
    'outcome_is_active',
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'viability_flag',  # Important: flags compounds that affect cell viability
    'is_quantified',
    'frequency_flag'
])
```

## Protein Kinases

Protein kinases are addressed with a biochemical assay in a single mode: Binding. Since these are biochemical assays, cell viability is not relevant. High frequency compounds are flagged based on activity levels by mode (`frequency_flag`), excluding known promiscuous kinase binders.

Protein kinase targets include both wildtype and mutant variants. Use `target__is_mutant` and `target__wildtype_id` to analyze mutant selectivity.
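
One way to use these two columns is to pair each mutant row with the wildtype row for the same compound and look at the potency difference. The following is a minimal plain-Python sketch of that idea, not EvE Bio's method. The rows, identifiers, and potency values are invented for illustration, and it assumes wildtype rows carry no `target__wildtype_id`. The helper name `mutant_vs_wildtype` is hypothetical.

```python
# Hypothetical kinase rows using the dataset's column names; values are invented.
rows = [
    {"target_id": "T-WT", "target__is_mutant": False, "target__wildtype_id": None,
     "compound_id": "C1", "outcome_potency_pxc50": 6.0},
    {"target_id": "T-MUT", "target__is_mutant": True, "target__wildtype_id": "T-WT",
     "compound_id": "C1", "outcome_potency_pxc50": 8.0},
    {"target_id": "T-WT", "target__is_mutant": False, "target__wildtype_id": None,
     "compound_id": "C2", "outcome_potency_pxc50": 7.0},
    {"target_id": "T-MUT", "target__is_mutant": True, "target__wildtype_id": "T-WT",
     "compound_id": "C2", "outcome_potency_pxc50": None},  # active or inactive, not quantified
]


def mutant_vs_wildtype(rows):
    """Pair each mutant row with its wildtype row for the same compound."""
    # Index wildtype potency by (target_id, compound_id).
    wildtype = {
        (r["target_id"], r["compound_id"]): r["outcome_potency_pxc50"]
        for r in rows
        if not r["target__is_mutant"]
    }
    paired = []
    for r in rows:
        if not r["target__is_mutant"]:
            continue
        mut_p = r["outcome_potency_pxc50"]
        wt_p = wildtype.get((r["target__wildtype_id"], r["compound_id"]))
        # A difference is only defined when both potencies were quantified.
        delta = mut_p - wt_p if mut_p is not None and wt_p is not None else None
        paired.append({
            "compound_id": r["compound_id"],
            "mutant_target_id": r["target_id"],
            "wildtype_target_id": r["target__wildtype_id"],
            "delta_pxc50": delta,  # positive: more potent on the mutant
        })
    return paired


for pair in mutant_vs_wildtype(rows):
    print(pair)
```

Because unquantified results have a null pXC50, a missing difference does not imply equal potency; those pairs need the activity columns to interpret.
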
```python
# Filter for kinase data
df_kinase = df.filter(pl.col('target__class') == 'Kinase')

# Select relevant columns for kinase analysis
df_kinase_selected = df_kinase.select([
    'target_id',
    'target__gene',
    'target__name',
    'target__uniprot_id',
    'target__is_mutant',  # Boolean: is this a mutant kinase?
    'target__wildtype_id',  # Reference to wildtype target if mutant
    'compound_id',
    'compound__name',
    'compound__smiles',
    'mode',
    'outcome_is_active',
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'is_quantified',
    'frequency_flag'
])
```

## Cell Viability

For viability assays, `outcome_is_active = True` indicates the compound is potentially cytotoxic.

```python
# Filter for viability assay data
df_viability = df.filter(pl.col('target__class') == 'Viability')

# Select relevant columns for viability analysis
df_viability_selected = df_viability.select([
    'target_id',
    'compound_id',
    'compound__name',
    'compound__smiles',
    'outcome_is_active',  # Active = potentially cytotoxic
    'outcome_potency_pxc50',
    'outcome_max_activity',
    'is_quantified'
])
```

## Joining Viability Data to Other Target Classes

You can join viability data to 7TM data to identify compounds with potential cytotoxicity concerns. Viability is relevant to this target class because of the cell-based assays, and is particularly critical for interpreting antagonist data.
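
Beyond the boolean flag, the potency comparison itself can be sketched in plain Python. This is only an illustration under stated assumptions: each row is taken to hold the 7TM potency in `outcome_potency_pxc50` and the viability potency in a `viability_pxc50` field, and the one-log-unit margin is a hypothetical threshold chosen for the example, not a rule from EvE Bio.

```python
def viability_margin(row, min_margin=1.0):
    """Label a 7TM result by how its potency compares with viability potency.

    min_margin is a hypothetical cutoff in pXC50 (log10) units.
    """
    target_p = row["outcome_potency_pxc50"]
    viab_p = row["viability_pxc50"]
    if target_p is None:
        return "target potency not quantified"
    if viab_p is None:
        return "no quantified viability potency"
    if target_p - viab_p >= min_margin:
        return "target potency clearly above viability potency"
    return "potencies overlap: inspect curves"


# Invented example rows, one per situation.
examples = [
    {"compound_id": "C1", "outcome_potency_pxc50": 8.2, "viability_pxc50": 5.4},
    {"compound_id": "C2", "outcome_potency_pxc50": 5.9, "viability_pxc50": 5.6},
    {"compound_id": "C3", "outcome_potency_pxc50": 7.0, "viability_pxc50": None},
    {"compound_id": "C4", "outcome_potency_pxc50": None, "viability_pxc50": 6.1},
]

for row in examples:
    print(row["compound_id"], "->", viability_margin(row))
```

A result whose potency sits well above the viability potency is less likely to be explained by cell death alone, whereas overlapping potencies call for a look at the underlying concentration-response curves.
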
```python
# Prepare viability outcomes for joining
viability_outcomes = df.filter(
    pl.col('target__class') == 'Viability'
).select([
    'compound_id',
    pl.col('outcome_is_active').alias('viability_is_active'),
    pl.col('is_quantified').alias('viability_is_quantified'),
    pl.col('outcome_potency_pxc50').alias('viability_pxc50'),
    pl.col('outcome_max_activity').alias('viability_max_activity')
])

# Join viability data to 7TM results
df_7tm_with_viability = df_7tm_selected.join(
    viability_outcomes,
    on='compound_id',
    how='left'  # Left join keeps all 7TM records, adds viability data where available
)

# Identify 7TM actives that may have viability concerns
potential_concerns = df_7tm_with_viability.filter(
    (pl.col('mode') == 'Antagonist')  # 7TM target in Antagonist mode only
    & pl.col('outcome_is_active')  # Active at 7TM target
    & pl.col('viability_flag')  # Flagged for viability issues
)
```

# Citation

## Citation Information

EvE Bio, LLC (2025). Data Releases #1-#7. https://huggingface.co/datasets/eve-bio/drug-target-activity. Accessed yyyy-mm-dd.

Provider: maas
Created: 2025-11-29