
drake463/FireProtDB2

Hugging Face, updated 2026-03-05, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/drake463/FireProtDB2
Resource description:
---
configs:
- config_name: mutation_dg
  data_files:
  - split: train
    path: data/subsets/mutation_dg/train.parquet
  - split: validation
    path: data/subsets/mutation_dg/validation.parquet
  - split: test
    path: data/subsets/mutation_dg/test.parquet
- config_name: mutation_ddg
  data_files:
  - split: train
    path: data/subsets/mutation_ddg/train.parquet
  - split: validation
    path: data/subsets/mutation_ddg/validation.parquet
  - split: test
    path: data/subsets/mutation_ddg/test.parquet
- config_name: mutation_tm
  data_files:
  - split: train
    path: data/subsets/mutation_tm/train.parquet
  - split: validation
    path: data/subsets/mutation_tm/validation.parquet
  - split: test
    path: data/subsets/mutation_tm/test.parquet
- config_name: mutation_dtm
  data_files:
  - split: train
    path: data/subsets/mutation_dtm/train.parquet
  - split: validation
    path: data/subsets/mutation_dtm/validation.parquet
  - split: test
    path: data/subsets/mutation_dtm/test.parquet
- config_name: mutation_fitness
  data_files:
  - split: train
    path: data/subsets/mutation_fitness/train.parquet
  - split: validation
    path: data/subsets/mutation_fitness/validation.parquet
  - split: test
    path: data/subsets/mutation_fitness/test.parquet
- config_name: mutation_binary
  data_files:
  - split: train
    path: data/subsets/mutation_binary/train.parquet
  - split: validation
    path: data/subsets/mutation_binary/validation.parquet
  - split: test
    path: data/subsets/mutation_binary/test.parquet
license: cc-by-4.0
language:
- en
tags:
- thermal-stability
- mutations
- mutagenesis
- experimental
- structural-biology
pretty_name: FireProtDB 2.0
---

# Dataset Card for FireProtDB_2.0

Subsets of protein stability data for single-point mutants from FireProtDB, a comprehensive curated database.

## Dataset Details

Subsets of thermal stability data for single-point mutations in the FireProtDB database, each with train/validation/test splits:

1. ΔG, ΔΔG
2. Tm, ΔTm
3. Fitness
4. Stabilizing

### Dataset Description

This dataset contains curated subsets of thermal stability measurements derived from FireProtDB. Subsets are separated by measurement type (ΔG, ΔΔG, Tm, ΔTm, Fitness, Binary Stabilizing). Each subset is split into 80/10/10 train/validation/test partitions based on the sequence similarity of the proteins in that subset.

Stabilizing refers to a classification performed by FireProtDB that designates whether a mutation is stabilizing. It is stored as a binary 'true' or 'false' value indicating whether the mutation is stabilizing or destabilizing.

- **Curated by:** Zachary Drake / zacharydrake (at) g.ucla.edu

### Dataset Sources

- **Repository:** https://loschmidt.chemi.muni.cz/fireprotdb/
- **Paper:** Milos Musil, Simeon Borko, Joan Planas-Iglesias, David Lacko, Monika Rosinska, Petr Kabourek, Lígia O Martins, Mateusz Tataruch, Jiri Damborsky, Stanislav Mazurenko, David Bednar, FireProtDB 2.0: large-scale manually curated database of the protein stability data, Nucleic Acids Research, Volume 54, Issue D1, 6 January 2026, Pages D409–D418, https://doi.org/10.1093/nar/gkaf1211

## Uses

Useful for training models to predict thermal stability metrics, or for evaluating the stability effects of mutations.

## Dataset Structure

Subsets included are:

- mutation_dg
- mutation_ddg
- mutation_tm
- mutation_dtm
- mutation_fitness
- mutation_binary

#### Data Collection and Processing

Datasets were derived from the FireProtDB CSV (https://loschmidt.chemi.muni.cz/fireprotdb/download/). The CSV was processed with a pipeline built primarily on Pandas and mmseqs2 (code is available in src/).
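The card says the 80/10/10 partitions are based on sequence similarity but does not spell out the procedure. One common way to do this is to cluster sequences first and then assign whole clusters to a partition, so that no cluster is shared between train, validation, and test. The sketch below illustrates that idea only; it is not the pipeline in src/. The function name `split_by_cluster` and the `cluster_of` mapping are hypothetical, and the cluster IDs are assumed to come from a prior clustering step (for example an mmseqs2 run).

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/validation/test.

    cluster_of: dict mapping a record ID to its sequence-cluster ID
    (hypothetical input, e.g. derived from a clustering tool's output).
    Returns a dict of split name -> list of record IDs.
    """
    names = ("train", "validation", "test")

    # Group record IDs by cluster.
    members = defaultdict(list)
    for record_id, cluster_id in cluster_of.items():
        members[cluster_id].append(record_id)

    # Shuffle reproducibly, then place the largest clusters first so the
    # greedy fill can stay close to the target sizes (sort is stable, so
    # equally sized clusters keep their shuffled order).
    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)
    clusters.sort(key=lambda c: len(members[c]), reverse=True)

    total = len(cluster_of)
    targets = {n: f * total for n, f in zip(names, fractions)}
    splits = {n: [] for n in names}
    for cluster_id in clusters:
        # Put the cluster into the split that is furthest below its target.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(members[cluster_id])
    return splits


if __name__ == "__main__":
    # Toy input: 100 records in 20 equally sized clusters.
    toy = {f"r{i}": f"c{i // 5}" for i in range(100)}
    for name, ids in split_by_cluster(toy).items():
        print(name, len(ids))
```

Because clusters are never divided, the realized proportions only approximate 80/10/10 when cluster sizes are uneven; that trade-off is inherent to any cluster-level split.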
## Citation

**BibTeX:**

```bibtex
@article{10.1093/nar/gkaf1211,
    author = {Musil, Milos and Borko, Simeon and Planas-Iglesias, Joan and Lacko, David and Rosinska, Monika and Kabourek, Petr and Martins, Lígia O and Tataruch, Mateusz and Damborsky, Jiri and Mazurenko, Stanislav and Bednar, David},
    title = {FireProtDB 2.0: large-scale manually curated database of the protein stability data},
    journal = {Nucleic Acids Research},
    volume = {54},
    number = {D1},
    pages = {D409-D418},
    year = {2025},
    month = {11},
    abstract = {Thermostable proteins are crucial in numerous biomedical and biotechnological applications. However, naturally occurring proteins have evolved to function in mild conditions, and laboratory experiments aiming at improving protein stability have proven laborious and expensive. Computational methods overcome this issue by providing a cheap and scalable alternative. Despite significant progress, their reliability is still hindered by the availability of high-quality data. FireProtDB 2.0 (http://loschmidt.chemi.muni.cz/fireprotdb) is a large-scale database aggregating stability data from multiple sources. The second version builds upon its predecessor, retaining its original functionality while introducing a new approach to data storage and maintenance. The new scheme enables the introduction of both absolute and relative data types connected with measurements of wild-types, mutants, protein domains, and de novo designed proteins. Furthermore, while the original database was limited to single-point mutations, more complex data such as insertions, deletions, and multiple-point mutations are now available. As a result, the inclusion of large-scale mutagenesis has increased the size of the database from 16 000 to almost 5 500 000 experiments. Moreover, the updated abstract scheme is fully expandable with any new measurements and annotations without the need for any restructuring. Finally, the tracking of history together with fixed identifiers is in accordance with the FAIR principles.},
    issn = {1362-4962},
    doi = {10.1093/nar/gkaf1211},
    url = {https://doi.org/10.1093/nar/gkaf1211},
    eprint = {https://academic.oup.com/nar/article-pdf/54/D1/D409/65405634/gkaf1211.pdf},
}
```

**APA:**

Musil, M., Borko, S., Planas-Iglesias, J., Lacko, D., Rosinska, M., Kabourek, P., Martins, L. O., Tataruch, M., Damborsky, J., Mazurenko, S., & Bednar, D. (2026). FireProtDB 2.0: large-scale manually curated database of the protein stability data. Nucleic Acids Research, 54(D1), D409–D418. https://doi.org/10.1093/nar/gkaf1211
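
The configurations and splits declared in the card's YAML header map onto the Hugging Face `datasets` loader. The following is a minimal sketch, assuming the `datasets` package is installed and that the repository ID `drake463/FireProtDB2` from the page title is reachable. The helper names `parquet_path` and `load_subset` are illustrative and not part of the dataset.

```python
REPO_ID = "drake463/FireProtDB2"

# Config and split names as declared in the card's YAML header.
CONFIGS = (
    "mutation_dg",
    "mutation_ddg",
    "mutation_tm",
    "mutation_dtm",
    "mutation_fitness",
    "mutation_binary",
)
SPLITS = ("train", "validation", "test")


def parquet_path(config, split):
    """Return the in-repo Parquet path for a config/split pair."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"data/subsets/{config}/{split}.parquet"


def load_subset(config, split="train"):
    """Load one split of one subset from the Hub (needs network access)."""
    # Imported lazily so the path helper works without the package.
    from datasets import load_dataset

    parquet_path(config, split)  # validates the names before any download
    return load_dataset(REPO_ID, config, split=split)


if __name__ == "__main__":
    print(parquet_path("mutation_ddg", "test"))
```

Calling `load_subset("mutation_ddg", "test")` would fetch the ΔΔG test split. The card does not document column names, so inspecting `column_names` on the returned dataset is a sensible first step.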

Provided by:
drake463