
openadmet/Octant_CYP_inhibition_reactivity_blog_release

Source: Hugging Face. Updated 2026-03-02; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/openadmet/Octant_CYP_inhibition_reactivity_blog_release
Resource description:
---
license: apache-2.0
configs:
- config_name: inhibition
  data_files: "inhibition.tsv"
- config_name: inhibition_detailed
  data_files: "inhibition_wells.tsv"
- config_name: will_it_fly
  data_files: "will_it_fly_in_mass_spec.tsv"
- config_name: reactivity
  data_files: "reactivity.tsv"
- config_name: reactivity_detailed
  data_files: "reactivity_wells.tsv"
---

# OpenADMET Octant CYP Inhibition & Reactivity

Data release from the [OpenADMET](https://openadmet.ghost.io) consortium, generated by [Octant Bio](https://www.octant.bio).

This dataset accompanies the blog post [**Building the OpenADMET Data Engine**](https://openadmet.org/Octant_CYP_blog_post/). Source code, assay protocols, and raw TSV files are on [GitHub](https://github.com/OpenADMET/Octant_CYP_blog_post).

## Overview

Cytochrome P450 (CYP) enzymes drive the oxidative metabolism of most drugs and are a primary cause of drug-drug interactions (DDIs). Despite their importance, public CYP datasets are sparse, noisy, and collected under inconsistent conditions, which makes them unreliable for machine learning.

This release provides **self-consistent, multi-endpoint CYP data** generated on a single platform under controlled conditions, with full well-level readouts and quality annotations. Approximately 1,200 compounds from a diversity chemical library were screened for:

- **CYP3A4 and CYP2J2 reaction phenotyping** (substrate identification via Echo acoustic ejection mass spectrometry, 2 µL, 1536-well format)
- **CYP3A4 inhibition** (fluorescence-based dose-response curves, 4 µL, 1536-well format, with a 30-minute active-enzyme pre-incubation to capture both reversible and time-dependent inhibitors)

The dataset is structured as two tiers: compound-level summaries for modeling, and well-level detail for QC, outlier analysis, and advanced modeling.

---

## Subsets

### `inhibition`

**1,340 rows: one row per compound**

Compound-level CYP3A4 inhibition summary derived from 12-point dose-response curves.
Use this for ML model training on inhibition potency.

| Column | Description |
|---|---|
| `ocnt_batch` | Compound identifier |
| `standardized_smiles` | Standardized SMILES string |
| `CYP3A4_pIC50` | Fitted pIC₅₀ (−log₁₀ IC₅₀ in M) |
| `CYP3A4_pIC50_se` | Standard error on pIC₅₀ |
| `CYP3A4_pIC50_ci_lower` / `_ci_upper` | 95% confidence interval bounds |
| `slope_log2` | Hill slope of the fitted DRC |
| `emax_log2fc` | Maximum effect (log₂ fold-change in fluorescence) |
| `activity_status` | Whether the compound shows detectable inhibition |
| `rollover_status` | Flag for hook-effect / rollover artifacts |
| `saturation_status` | Whether the curve reaches saturation |
| `direction` | Direction of fluorescence change |
| `drc_qc_status` / `drc_qc_flag` | Dose-response curve QC pass/fail |
| `qc_flag_primary` | Primary screen QC flag |
| `plate_qc_status` | Plate-level QC status |

**Why collected:** Provides quantitative inhibition potency for the full library. The active-enzyme pre-incubation means IC₅₀ values reflect combined reversible + time-dependent inhibition, which is important for DDI risk assessment but distinct from standard reversible-only IC₅₀ measurements.

---

### `inhibition_detailed`

**16,931 rows: well-level fluorescence from dose-response assays**

Raw fluorescence readouts underlying the `inhibition` summaries. Use this for QC analysis, outlier investigation, and training models on raw assay signals.

| Column | Description |
|---|---|
| `ocnt_batch` | Compound identifier |
| `standardized_smiles` | Standardized SMILES string |
| `compound_class` | Library compound, positive control, or negative control |
| `plate` | Plate identifier |
| `row` / `col` | Well position |
| `concentration_M` | Compound concentration in molar |
| `fluorescence` | Raw fluorescence signal |
| `fluorescence_norm` | Normalized fluorescence (log₂ fold-change relative to controls) |
| `outlier` | Whether this well was flagged as an outlier during curve fitting |

**Why collected:** Summary IC₅₀ values hide plate artifacts, outlier wells, and edge effects. Well-level data allows modelers to apply their own QC criteria, detect spatial plate effects, and train on richer experimental signals.

---

### `reactivity`

**2,442 rows: one row per compound per enzyme**

Compound-level CYP reactivity summary (substrate depletion). One row per compound-enzyme pair (CYP3A4 and CYP2J2). Use this for ML model training on metabolic substrate status.

| Column | Description |
|---|---|
| `ocnt_batch` | Compound identifier |
| `standardized_smiles` | Standardized SMILES string |
| `enzyme` | CYP enzyme tested (`CYP3A4` or `CYP2J2`) |
| `control` | Mean log₁₀ peak area in control wells |
| `treatment` | Mean log₁₀ peak area in enzyme-treated wells |
| `log10_control` / `log10_treatment` | Log₁₀ peak area (control / treatment) |
| `log10fc` | Log₁₀ fold-change (treatment vs control) |
| `log2fc` | Log₂ fold-change (treatment vs control) |
| `pct_remaining` | Percent compound remaining after enzyme incubation |

**Why collected:** Reaction phenotyping identifies whether a compound is a CYP substrate, a prerequisite for understanding metabolic clearance and DDI. CYP2J2 was prioritized alongside CYP3A4 because of its role in extra-hepatic metabolism and its relative absence in public datasets.
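
The fold-change columns in `reactivity` are related by simple log arithmetic. The following is a minimal sketch, assuming that `log10fc` is the log₁₀ ratio of treatment to control peak area and that `pct_remaining` equals 100 × 10^`log10fc`. The dataset card does not state the exact formula, so treat both as assumptions; the helper names are hypothetical.

```python
import math


def log10fc_to_log2fc(log10fc: float) -> float:
    """Change of base: log2(x) = log10(x) / log10(2)."""
    return log10fc / math.log10(2)


def log10fc_to_pct_remaining(log10fc: float) -> float:
    """Assumed relation: percent remaining = 100 * (treatment / control)."""
    return 100.0 * 10.0 ** log10fc


# Made-up example: the treatment signal is half of the control signal.
example_log10fc = math.log10(0.5)
print(round(log10fc_to_log2fc(example_log10fc), 3))
print(round(log10fc_to_pct_remaining(example_log10fc), 1))
```

Comparing values computed this way against the published `log2fc` and `pct_remaining` columns is a quick way to check whether the assumed definitions match the release.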

---

### `reactivity_detailed`

**19,344 rows: well-level Echo-MS peak areas**

Raw acoustic ejection mass spectrometry (Echo-MS) peak areas underlying the `reactivity` summaries. Use this for QC analysis, understanding measurement variability, and training on raw MS signals.

| Column | Description |
|---|---|
| `ocnt_batch` | Compound identifier |
| `standardized_smiles` | Standardized SMILES string |
| `enzyme` | CYP enzyme tested |
| `condition` | `control` (no enzyme) or `treatment` (with enzyme) |
| `plate` | Plate identifier |
| `well` | Well position |
| `time_start` / `time_end` | Echo-MS acquisition time window (minutes) |
| `mz_query` | Target m/z for the compound |
| `mz_observed` | Observed m/z |
| `mass_error_ppm` | Mass accuracy error (parts per million) |
| `area` | Integrated peak area |

**Why collected:** Echo-MS enables sub-2-second contactless sampling from 1536-well plates. Well-level peak areas with 4 biological replicates per condition provide statistical power for depletion detection and allow downstream reproducibility and noise modeling.

---

### `will_it_fly`

**11,353 rows: ionization buffer comparison**

Echo-MS peak areas for ~11,000 library compounds measured under two carrier solvent conditions (ammonium formate vs. ammonium fluoride), without enzyme. Used to pre-profile chemical libraries for MS compatibility before reactivity assays.

| Column | Description |
|---|---|
| `ocnt_batch` | Compound identifier |
| `standardized_smiles` | Standardized SMILES string |
| `ammonium_fluoride_area` | Peak area in 1 mM ammonium fluoride carrier |
| `ammonium_formate_area` | Peak area in 5 mM ammonium formate carrier |

**Why collected:** Not every molecule ionizes well under generic untargeted TOF-MS conditions. Pre-profiling identifies compounds that cannot be reliably detected ("won't fly") before they enter the assay, preventing false negatives.
This data also quantifies how switching from ammonium formate to ammonium fluoride expanded chemical coverage from ~50% to ~75% of the library, directly informing assay design decisions.

---

## Loading the Data

```python
from datasets import load_dataset

# Compound-level summaries (for ML)
inhibition = load_dataset("openadmet/Octant_CYP_inhibition_reactivity_blog_release", "inhibition")
reactivity = load_dataset("openadmet/Octant_CYP_inhibition_reactivity_blog_release", "reactivity")

# Well-level detail (for QC / advanced modeling)
inhib_wells = load_dataset("openadmet/Octant_CYP_inhibition_reactivity_blog_release", "inhibition_detailed")
react_wells = load_dataset("openadmet/Octant_CYP_inhibition_reactivity_blog_release", "reactivity_detailed")

# Ionization profiling
will_it_fly = load_dataset("openadmet/Octant_CYP_inhibition_reactivity_blog_release", "will_it_fly")
```

---

## Key Design Choices

- **Active-enzyme pre-incubation** in the inhibition assay: IC₅₀ values reflect reversible inhibition **plus** any time-dependent effects that develop during the 30-minute pre-incubation. This differs from standard reversible-only IC₅₀ assays.
- **1536-well miniaturization**: 4 µL (inhibition) and 2 µL (reactivity) assay volumes reduce cost ~100× vs. standard CRO formats while maintaining biological relevance using industry-standard Gentest Supersomes.
- **Echo-MS (acoustic ejection MS)**: Enables label-free, contactless sub-2-second sampling. Only compounds that ionize above background are reported; non-detecting compounds are excluded rather than reported as zero.
- **Full QC transparency**: Well-level data, plate maps, outlier flags, and fitted curve parameters are included so modelers can apply their own quality thresholds.
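
Since the card invites modelers to apply their own quality thresholds, one common first step is to filter on a QC column and convert pIC₅₀ back to a concentration. The following is a minimal standard-library sketch that works on rows represented as dictionaries, which is the shape you get when iterating over a loaded `datasets` split. The helper names and the accepted status string `"pass"` are hypothetical; inspect the actual values in `drc_qc_status` before relying on the filter.

```python
from typing import Iterable


def pic50_to_ic50_um(pic50: float) -> float:
    """pIC50 = -log10(IC50 in M), so IC50 in µM = 10 ** (6 - pIC50)."""
    return 10.0 ** (6.0 - pic50)


def keep_qc_passing(
    rows: Iterable[dict],
    column: str = "drc_qc_status",
    accepted: frozenset = frozenset({"pass"}),
) -> list:
    """Keep rows whose QC column holds an accepted value.

    The default accepted value "pass" is an assumption, not taken from the card.
    """
    return [row for row in rows if row.get(column) in accepted]


# Toy rows with made-up values, only to show the call pattern.
toy_rows = [
    {"ocnt_batch": "A", "CYP3A4_pIC50": 6.0, "drc_qc_status": "pass"},
    {"ocnt_batch": "B", "CYP3A4_pIC50": 5.0, "drc_qc_status": "fail"},
]
for row in keep_qc_passing(toy_rows):
    print(row["ocnt_batch"], pic50_to_ic50_um(row["CYP3A4_pIC50"]))
```

Because of the active-enzyme pre-incubation, IC₅₀ values obtained this way combine reversible and time-dependent inhibition and should not be pooled uncritically with reversible-only IC₅₀ data from other sources.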

---

## Related Resources

- **Blog post:** [Building the OpenADMET Data Engine](https://openadmet.org/Octant_CYP_blog_post/)
- **GitHub (source code, protocols, TSV files):** [OpenADMET/Octant_CYP_blog_post](https://github.com/OpenADMET/Octant_CYP_blog_post)
- **OpenADMET Discord:** [Join the conversation](https://discord.gg/ndgtXfYhJe)
- **Contact:** [openadmet@omsf.io](mailto:openadmet@omsf.io)

---

## Citation

If you use this dataset, please cite the accompanying blog post and link to this repository.

**License:** Apache 2.0

Provider:
openadmet