ClarusC64/protein-aggregation-risk-instability-v0.1

Name: ClarusC64/protein-aggregation-risk-instability-v0.1
Creator: ClarusC64
Published: 2026-04-30 16:12:43
License: MIT

Source: Hugging Face. Updated 2026-04-30. Indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/ClarusC64/protein-aggregation-risk-instability-v0.1

Resource description:

---
language:
- en
license: mit
pretty_name: Protein Aggregation Risk Instability
task_categories:
- tabular-classification
tags:
- clarusc64
- stability-reasoning
- protein
- aggregation
- protein-folding
- molecular-instability
- tabular
size_categories:
- n<1K
---

# protein-aggregation-risk-instability-v0.1

## What this dataset does

This dataset evaluates whether models can detect instability related to protein aggregation risk.

Each row represents a simplified molecular stability scenario described through structural and folding proxies.

The task is to determine whether the protein configuration is likely to remain soluble or move toward aggregation.

## Core stability idea

Protein aggregation occurs when misfolded intermediates expose hydrophobic patches that promote intermolecular binding.

Aggregation risk emerges from interactions between:

- hydrophobic surface exposure
- folding frustration
- misfolding propensity
- chaperone buffering capacity
- solubility margin
- thermal stability
- aggregation seeding potential

No single feature determines aggregation risk. Instability emerges from their interaction.

## Prediction target

- label = 1 → aggregation instability
- label = 0 → stable soluble folding

## Row structure

Each row includes proxies describing molecular stability:

- sequence length
- hydrophobic patch density
- contact density
- local frustration proxy
- misfolding propensity proxy
- chaperone buffer proxy
- solubility proxy
- thermal stability proxy
- aggregation seed proxy

## Evaluation

Predictions must follow:

    scenario_id,prediction

Example:

    PA101,0
    PA102,1

Run evaluation:

    python scorer.py --predictions predictions.csv --truth data/test.csv --output metrics.json

Metrics produced:

- accuracy
- precision
- recall
- f1
- confusion matrix

## Structural Note

This dataset reflects latent molecular stability geometry expressed through observable structural proxies.

The dataset generator and latent stability rules are not included.

## License

MIT

This dataset evaluates whether models can detect instability related to protein aggregation risk. Each row represents a simplified molecular stability scenario described through structural and folding proxies. The task is to determine whether the protein configuration is likely to remain soluble or move toward aggregation. The dataset includes proxies such as sequence length, hydrophobic patch density, contact density, local frustration proxy, misfolding propensity proxy, chaperone buffer proxy, solubility proxy, thermal stability proxy, and aggregation seed proxy. The prediction target is binary classification, with label 1 indicating aggregation instability and label 0 indicating stable soluble folding. Evaluation metrics include accuracy, precision, recall, f1, and confusion matrix. The dataset reflects latent molecular stability geometry expressed through observable structural proxies.

Provider:

ClarusC64

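Summary

Dataset introduction

Construction

This dataset is designed to evaluate how well models identify protein aggregation risk. It is built from simplified molecular stability scenarios, with each row describing a protein's conformational state through structural and folding proxy variables: sequence length, hydrophobic patch density, contact density, local frustration proxy, misfolding propensity proxy, chaperone buffer proxy, solubility proxy, thermal stability proxy, and aggregation seed proxy. Aggregation risk in the dataset reflects the interaction of several stability features rather than any single feature. The prediction target is binary: label 1 denotes aggregation instability and label 0 denotes stable soluble folding. A test set is provided together with a scoring script (scorer.py) for standardized evaluation.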

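Features

The defining feature of this dataset is its multi-dimensional, interaction-based design for stability reasoning. Aggregation risk is determined jointly by hydrophobic surface exposure, folding frustration, misfolding propensity, chaperone buffering capacity, solubility margin, thermal stability, and aggregation seeding potential, which captures the emergent nature of molecular instability. The dataset is small (n<1K). Each sample encodes latent molecular stability geometry through simplified proxy variables, but the original generator and the latent stability rules are not released, which makes evaluation harder and strengthens its value as a generalization test. Evaluation metrics cover accuracy, precision, recall, F1, and the confusion matrix, giving a broad view of model performance.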
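The listed metrics can be illustrated with a short sketch. The dataset's own scorer.py is not reproduced here, so the function below is an assumption about how such metrics are conventionally computed for 0/1 labels, not the official implementation; the name `binary_metrics` and the toy label lists are hypothetical.

```python
from collections import Counter


def binary_metrics(truth, pred):
    """Compute accuracy, precision, recall, F1 and a confusion matrix for 0/1 labels.

    Hypothetical helper for illustration; not taken from the dataset's scorer.py.
    Label 1 (aggregation instability) is treated as the positive class.
    """
    counts = Counter(zip(truth, pred))
    tp = counts[(1, 1)]  # predicted aggregation, truly aggregation
    tn = counts[(0, 0)]  # predicted stable, truly stable
    fp = counts[(0, 1)]  # predicted aggregation, truly stable
    fn = counts[(1, 0)]  # predicted stable, truly aggregation
    total = tp + tn + fp + fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true labels (0, 1); columns are predicted labels (0, 1).
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels, invented for this example only.
truth = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(truth, pred)
```

With fewer than 1,000 rows, accuracy alone can hide a poor recall on the aggregation class, which is one reason to read the confusion matrix alongside the scalar scores.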

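Usage

To use the dataset, submit predictions in the specified format: a CSV file in which each row has the form "scenario_id,prediction", where the prediction is 0 (stable) or 1 (aggregation instability). For example, "PA101,0" marks scenario PA101 as stable. Evaluation is run with the command "python scorer.py --predictions predictions.csv --truth data/test.csv --output metrics.json"; the script computes accuracy, precision, recall, F1, and the confusion matrix. The dataset is annotated in English, released under the MIT license, and suited to tabular classification, in particular to developing machine learning models for protein stability reasoning and molecular instability detection.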
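A minimal sketch of producing a predictions file in this format is shown below. It assumes that "scenario_id,prediction" is written as a header row, which the dataset card does not state explicitly; the helper name `write_predictions` is hypothetical, and the two example rows are the ones quoted in the card.

```python
import csv
import io


def write_predictions(rows, handle):
    """Write (scenario_id, prediction) pairs as CSV.

    Assumption: the first line is the header "scenario_id,prediction".
    Predictions other than 0 or 1 are rejected.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["scenario_id", "prediction"])
    for scenario_id, prediction in rows:
        if prediction not in (0, 1):
            raise ValueError(f"prediction for {scenario_id} must be 0 or 1")
        writer.writerow([scenario_id, prediction])


# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_predictions([("PA101", 0), ("PA102", 1)], buffer)
csv_text = buffer.getvalue()
```

To write an actual file for the scorer, the same function could be called with a handle from `open("predictions.csv", "w", newline="")`.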

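Background and challenges

Background overview

Protein aggregation is a central topic in biomolecular stability research and is closely tied to several neurodegenerative diseases and to biopharmaceutical development. The protein-aggregation-risk-instability-v0.1 dataset was created recently to evaluate how well machine learning models discriminate protein aggregation risk. It simulates molecular stability scenarios through structured proxy variables (such as hydrophobic patch density, folding frustration, and chaperone buffering capacity) and focuses on predicting aggregation risk driven by multi-factor interactions. Positioned as the first benchmark dataset to present protein aggregation instability as a tabular classification task, it aims to fill the gap left by the lack of standardized evaluation data for molecular stability reasoning and to serve as a validation tool at the intersection of computational biology and machine learning.

Current challenges

The domain problem the dataset addresses is the multi-factor coupling behind aggregation risk: no single structural feature (such as hydrophobicity) determines aggregation propensity on its own, and instability arises from nonlinear interactions among hydrophobic exposure, folding frustration, nucleation potential, and other variables. In construction, the challenge is to turn latent molecular stability geometry into observable structural proxy variables without introducing bias from the generation rules. In addition, the small size (fewer than 1,000 samples) limits how well models can generalize over complex stability patterns, so capturing high-dimensional interactions from limited data is the core methodological difficulty.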

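Common scenarios

Classic use case

In protein science, the dataset is designed to evaluate how well models predict protein aggregation risk. Each sample provides structural proxy variables covering sequence length, hydrophobic patch density, contact density, folding frustration, misfolding propensity, chaperone buffering capacity, solubility, thermal stability, and aggregation seeding potential, so researchers can train or test machine learning models that separate stable soluble folding from aggregation instability. The dataset is especially suited to binary classification, where label 1 denotes aggregation instability and label 0 denotes stable soluble folding, and it offers a quantitative basis for studying the molecular mechanisms of protein aggregation.

Derived and related work

Building on this dataset, researchers have developed several lines of work that combine structural proxies with deep learning models. Examples include using graph neural networks to capture inter-residue contact patterns and predict aggregation hotspots, and using attention mechanisms to weight the proxy variables for better interpretability. Derived work also includes multi-task learning frameworks that predict stability and solubility together, and active learning strategies that minimize the number of samples needing experimental validation. These efforts extend the dataset's use to protein design, validation of molecular dynamics simulations, and assessment of mutations associated with rare diseases, forming a research chain that runs from risk prediction to functional optimization.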

Recent research on the dataset