UR100P

Name: UR100P
Creator: maas
Published: 2025-12-05 16:25:08
License: no description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-03-01

Download link:

https://modelscope.cn/datasets/chandar-lab/UR100P


Description:

## UR100P

This dataset contains curated protein sequences from multiple sources and has been used to train and evaluate [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_350M), an efficient state-of-the-art protein language model. It combines data from [UniProt](https://www.uniprot.org), the [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/) (OAS), and the [Structural Classification of Proteins](https://www.ebi.ac.uk/pdbe/scop/) version 2 (SCOP 2) databases to enable task-specific validation of the models.

All data were collected in December 2023, and sequences containing ambiguous amino acids (B, J, O, U, X, Z) were removed. For OAS, only paired heavy and light chain sequences were included, and the dataset was augmented by incorporating sequences in both heavy|light (Hc|Lc) and light|heavy (Lc|Hc) chain arrangements, separated by a chainbreak token `|`. MMseqs2 was used to filter out sequences in the train sets with >90% sequence identity to the validation sets, preventing data leakage and ensuring a fair evaluation.

For more details, please refer to the [accompanying paper](https://www.biorxiv.org/content/10.1101/2024.09.23.614603v1).

## Get Started

```python
from datasets import load_dataset

# Load the entire dataset (all train and test sets)
# dataset = load_dataset("chandar-lab/UR100P")

# Load all test sets
# dataset = load_dataset("chandar-lab/UR100P", split="test")

# Load the UniProt test set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")
```

## Citations

If you find the models useful in your research, we ask that you cite the paper:

```bibtex
@article{Fournier2024.09.23.614603,
  title = {Protein Language Models: Is Scaling Necessary?},
  author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
  year = {2024},
  journal = {bioRxiv},
  publisher = {Cold Spring Harbor Laboratory},
  doi = {10.1101/2024.09.23.614603},
  url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
  elocation-id = {2024.09.23.614603},
  eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}
```

If you utilize UniProt, OAS, or SCOP, we ask that you cite their respective papers:

The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2023, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D523–D531, https://doi.org/10.1093/nar/gkac1052

Olsen TH, Boyles F, Deane CM. Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science. 2022; 31: 141–146. https://doi.org/10.1002/pro.4205

PDB-101: Educational resources supporting molecular explorations through biology and medicine. Christine Zardecki, Shuchismita Dutta, David S. Goodsell, Robert Lowe, Maria Voigt, Stephen K. Burley. (2022) Protein Science 31: 129-140 https://doi.org/10.1002/pro.4200

## License

UniProt, OAS, and SCOP are licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
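## Preprocessing sketch

The dataset description mentions two sequence-level steps: dropping sequences that contain ambiguous amino acids (B, J, O, U, X, Z) and augmenting paired OAS chains in both Hc|Lc and Lc|Hc order with the `|` chainbreak token. The following is a minimal sketch of those two steps, not the authors' actual pipeline. All names (`AMBIGUOUS`, `CHAINBREAK`, `has_ambiguous`, `filter_sequences`, `augment_pair`) and the toy sequences are hypothetical and chosen only for illustration. The MMseqs2 identity filtering is not reproduced here.

```python
# Hypothetical helper names; this is an illustration, not the AMPLIFY code.
AMBIGUOUS = set("BJOUXZ")   # ambiguous residue codes listed in the description
CHAINBREAK = "|"            # chainbreak token separating heavy and light chains


def has_ambiguous(seq: str) -> bool:
    """Return True if the sequence contains any ambiguous residue code."""
    return any(residue in AMBIGUOUS for residue in seq.upper())


def filter_sequences(seqs: list[str]) -> list[str]:
    """Keep only sequences free of ambiguous residues, preserving order."""
    return [s for s in seqs if not has_ambiguous(s)]


def augment_pair(heavy: str, light: str) -> list[str]:
    """Return both chain arrangements: heavy|light and light|heavy."""
    return [
        f"{heavy}{CHAINBREAK}{light}",
        f"{light}{CHAINBREAK}{heavy}",
    ]


if __name__ == "__main__":
    # Toy sequences, assumed for the example only.
    print(filter_sequences(["MKTAYIAK", "MKXAY", "ACDU"]))
    print(augment_pair("EVQL", "DIQM"))
```

Whether the original pipeline applied the ambiguity filter before or after pairing the OAS chains is not stated in the description, so the two helpers are kept independent here.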


Provider:

maas

Created:

2025-02-28
