BAAI/OPI

Name: BAAI/OPI
Creator: BAAI
Published: 2024-03-05 02:50:47
License: No description provided

Hugging Face · Updated 2024-03-05 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/BAAI/OPI


Resource overview:

---
extra_gated_heading: Acknowledge license to accept the repository
extra_gated_prompt: >
  The Beijing Academy of Artificial Intelligence (hereinafter referred to as "we" or "BAAI") provides you with an open-source dataset (hereinafter referred to as "dataset") through the OPI HuggingFace repository (https://huggingface.co/datasets/BAAI/OPI). You can download the dataset you need and use it for purposes such as learning and research while abiding by the usage rules of each original dataset.

  Before you acquire the open-source dataset (including but not limited to accessing, downloading, copying, distributing, using, or any other handling of the dataset), you should read and understand this "OPI Open-Source Dataset Usage Notice and Disclaimer" (hereinafter referred to as "this statement"). Once you acquire the open-source dataset, regardless of your method of acquisition, your actions will be regarded as acknowledgment of the full content of this statement.

  1. Ownership and Operation Rights

  You should fully understand that the ownership and operation rights of the OPI HuggingFace repository (including the current and all previous versions) belong to BAAI. BAAI has the final interpretation and decision rights over this platform/tool and the open-source dataset plan.

  You acknowledge and understand that due to updates and improvements in relevant laws and regulations and the need to fulfill our legal compliance obligations, we reserve the right to update, maintain, or even suspend or permanently terminate the services of this platform/tool from time to time. We will notify you of possible situations mentioned above reasonably such as through an announcement or email within a reasonable time. You should make corresponding adjustments and arrangements in a timely manner. However, we do not bear any responsibility for any losses caused to you by any of the aforementioned situations.

  2. Claim of Rights to Open-Source Datasets

  For the purpose of facilitating your dataset acquisition and use for learning and research, we have performed necessary steps such as format integration, data cleaning, labeling, categorizing, annotating, and other related processing on the third-party original datasets to form the open-source datasets for this platform/tool's users.

  You understand and acknowledge that we do not claim the proprietary rights of intellectual property to the open-source datasets. Therefore, we have no obligation to actively recognize and protect the potential intellectual property of the open-source datasets. However, this does not mean that we renounce the personal rights to claim credit, publication, modification, and protection of the integrity of the work (if any) of the open-source datasets. The potential intellectual property and corresponding legal rights of the original datasets belong to the original rights holders.

  In addition, providing you with open-source datasets that have been reasonably arranged, processed, and handled does not mean that we acknowledge the authenticity, accuracy, or indisputability of the intellectual property and information content of the original datasets. You should filter and carefully discern the open-source datasets you choose to use. You understand and agree that BAAI does not undertake any obligation or warranty responsibility for any defects or flaws in the original datasets you choose to use.

  3. Usage Restrictions for Open-Source Datasets

  Your use of the dataset must not infringe on our or any third party's legal rights and interests (including but not limited to copyrights, patent rights, trademark rights, and other intellectual property and other rights).

  After obtaining the open-source dataset, you should ensure that your use of the open-source dataset does not exceed the usage rules explicitly stipulated by the rights holders of the original dataset in the form of a public notice or agreement, including the range, purpose, and lawful purposes of the use of the original data. We kindly remind you here that if your use of the open-source dataset exceeds the predetermined range and purpose of the original dataset, you may face the risk of infringing on the legal rights and interests of the rights holders of the original dataset, such as intellectual property, and may bear corresponding legal responsibilities.

  4. Personal Information Protection

  Due to technical limitations and the public welfare nature of the open-source datasets, we cannot guarantee that the open-source datasets do not contain any personal information, and we do not bear any legal responsibility for any personal information that may be involved in the open-source datasets.

  If the open-source dataset involves personal information, we do not bear any legal responsibility for any personal information processing activities you may involve when using the open-source dataset. We kindly remind you here that you should handle personal information in accordance with the provisions of the "Personal Information Protection Law" and other relevant laws and regulations.

  To protect the legal rights and interests of the information subject and to fulfill possible applicable laws and administrative regulations, if you find content that involves or may involve personal information during the use of the open-source dataset, you should immediately stop using the part of the dataset that involves personal information and contact us as indicated in "6. Complaints and Notices."

  5. Information Content Management

  We do not bear any legal responsibility for any illegal and bad information that may be involved in the open-source dataset.

  If you find that the open-source dataset involves or may involve any illegal and bad information during your use, you should immediately stop using the part of the dataset that involves illegal and bad information and contact us in a timely manner as indicated in "6. Complaints and Notices."

  6. Complaints and Notices

  If you believe that the open-source dataset has infringed on your legal rights and interests, you can contact us at 010-50955974, and we will handle your claims and complaints in accordance with the law in a timely manner.

  To handle your claims and complaints, we may need you to provide contact information, infringement proof materials, and identity proof materials. Please note that if you maliciously complain or make false statements, you will bear all legal responsibilities caused thereby (including but not limited to reasonable compensation costs).

  7. Disclaimer

  You understand and agree that due to the nature of the open-source dataset, the dataset may contain data from different sources and contributors, and the authenticity, accuracy, and objectivity of the data may vary, and we cannot make any promises about the availability and reliability of any dataset.

  In any case, we do not bear any legal responsibility for any risks such as personal information infringement, illegal and bad information dissemination, and intellectual property infringement that may exist in the open-source dataset.

  In any case, we do not bear any legal responsibility for any loss (including but not limited to direct loss, indirect loss, and loss of potential benefits) you suffer or is related to the open-source dataset.

  8. Others

  The open-source dataset is in a constant state of development and change. We may update, adjust the range of the open-source dataset we provide, or suspend, pause, or terminate the open-source dataset service due to business development, third-party cooperation, changes in laws and regulations, and other reasons.
extra_gated_fields:
  Name: text
  Affiliation: text
  Country: text
  I agree to accept the license: checkbox
extra_gated_button_content: Acknowledge license
license: cc-by-nc-4.0
language:
- en
tags:
- biology
- protein
- instruction dataset
- instruction tuning
pretty_name: Open Protein Instructions(OPI)
size_categories:
- 1M<n<10M
task_categories:
- text-generation
---

![image.png](./OPI_logo.png)

# Dataset Card for Open Protein Instructions (OPI)

## Dataset Update

The previous version of the OPI dataset was based on **release 2022_01** of the UniProtKB/Swiss-Prot protein knowledgebase. OPI has since been updated to include the latest **release 2023_05**, which can be accessed via the dataset file [OPI_updated_160k.json](./OPI_DATA/OPI_updated_160k.json).

Reference:
- https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2022_01/knowledgebase/UniProtKB_SwissProt-relstat.html
- https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2023_05/knowledgebase/UniProtKB_SwissProt-relstat.html

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Open Protein Instructions (OPI) is the initial part of the Open Biology Instructions (OBI) project, together with the subsequent Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI) and Open Single-cell Instructions (OSCI). OBI is a project that aims to fully leverage the potential of Large Language Models (LLMs), especially scientific LLMs like Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope to provide a starting point for the community to bridge LLMs and biological domain knowledge.

## Dataset Structure

### Data Instances

```
instruction: What is the EC classification of the input protein sequence based on its biological function?
input:
MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS
LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL
QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV
RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK
FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT
ATERQYELQP
output: 2.7.10.2
```

### Data Splits

The OPI dataset folder structure is as follows:

```
./OPI_DATA/
├── AP
│   ├── Function
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_function_test.jsonl
│   │   │   ├── IDFilterSeq_function_test.jsonl
│   │   │   └── UniProtSeq_function_test.jsonl
│   │   └── train
│   │       ├── function_description_train.json
│   │       └── function_description_train_0.01.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_test.jsonl
│   │   │   ├── IDFilterSeq_go_test.jsonl
│   │   │   └── UniProtSeq_go_test.jsonl
│   │   └── train
│   │       ├── go_terms_train.json
│   │       └── go_terms_train_0.01.json
│   └── Keywords
│       ├── test
│       │   ├── CASPSimilarSeq_keywords_test.jsonl
│       │   ├── IDFilterSeq_keywords_test.jsonl
│       │   └── UniProtSeq_keywords_test.jsonl
│       └── train
│           ├── keywords_train.json
│           └── keywords_train_0.01.json
├── KM
│   ├── gSymbol2Cancer
│   │   ├── test
│   │   │   └── gene_symbol_to_cancer_test.jsonl
│   │   └── train
│   │       └── gene_symbol_to_cancer_train.json
│   ├── gName2Cancer
│   │   ├── test
│   │   │   └── gene_name_to_cancer_test.jsonl
│   │   └── train
│   │       └── gene_name_to_cancer_train.json
│   └── gSymbol2Tissue
│       ├── test
│       │   └── gene_symbol_to_tissue_test.jsonl
│       └── train
│           └── gene_symbol_to_tissue_train.json
└── SU
    ├── EC_number
    │   ├── test
    │   │   ├── CLEAN_EC_number_new_test.jsonl
    │   │   └── CLEAN_EC_number_price_test.jsonl
    │   └── train
    │       └── CLEAN_EC_number_train.json
    ├── Fold_type-Remote
    │   ├── test
    │   │   └── Remote_test.jsonl
    │   └── train
    │       └── Remote_train.json
    └── Subcellular_location
        ├── test
        │   └── location_test.jsonl
        └── train
            └── location_train.json
```

## Dataset Creation

The OPI dataset was curated by us by extracting key information from the [Swiss-Prot](https://www.uniprot.org/uniprotkb?facets=reviewed%3Atrue&query=%2A) database. The detailed construction pipeline is described in the supplementary material of our manuscript, which has been submitted to NeurIPS 2023 Datasets and Benchmarks. The following figure shows the general construction process.

![image.png](./OPI_data.png)

## License

The dataset is licensed under a Creative Commons Attribution Non Commercial 4.0 License. The use of this dataset should also abide by the original [License & Disclaimer](https://www.uniprot.org/help/license) and [Privacy Notice](https://www.uniprot.org/help/privacy) of UniProt.
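As a minimal sketch of how records with the `instruction`, `input`, and `output` fields shown under Data Instances might be read, the snippet below assumes that the `.jsonl` test files hold one JSON object per line (the usual JSON Lines convention, which the card does not confirm). `read_jsonl` and `demo` are hypothetical helpers, and the sample sequence is deliberately truncated for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def demo() -> list[dict]:
    """Write a one-record sample file and read it back."""
    sample = {
        "instruction": "What is the EC classification of the input protein "
                       "sequence based on its biological function?",
        "input": "MGLVSSKKPDKEKPIKEKDKGQW",  # truncated for illustration
        "output": "2.7.10.2",
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample_test.jsonl"  # hypothetical file name
        path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
        return list(read_jsonl(path))


records = demo()
print(records[0]["output"])
```

The same reader could be pointed at any of the `*_test.jsonl` files in the folder tree, provided the one-object-per-line assumption holds for them.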


Provider:

BAAI


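Collected summary

Dataset introduction

Construction

The OPI dataset was built by extracting key information from the Swiss-Prot database and annotating protein sequences according to their biological function, yielding an instruction dataset that covers nine protein-related tasks. The process involved data cleaning, format integration, labeling, and categorizing, with the aim of making the dataset consistent and usable.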
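The nine tasks can be read off the directory tree in the dataset card. The sketch below simply transcribes those folder names into a mapping and counts them. The group abbreviations (`AP`, `KM`, `SU`) are copied verbatim and not expanded, since this page does not define them; `OPI_TASKS` and `task_dirs` are hypothetical names.

```python
# Task folders as listed in the dataset card's directory tree.
# Group abbreviations are copied as-is; their expansions are not assumed.
OPI_TASKS = {
    "AP": ["Function", "GO", "Keywords"],
    "KM": ["gSymbol2Cancer", "gName2Cancer", "gSymbol2Tissue"],
    "SU": ["EC_number", "Fold_type-Remote", "Subcellular_location"],
}


def task_dirs(root: str = "./OPI_DATA") -> list[str]:
    """Return the train and test directory paths for every task."""
    return [
        f"{root}/{group}/{task}/{split}"
        for group, tasks in OPI_TASKS.items()
        for task in tasks
        for split in ("train", "test")
    ]


total_tasks = sum(len(tasks) for tasks in OPI_TASKS.values())
print(total_tasks)  # 9
```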

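Features

OPI covers a broad range of protein biology tasks, such as EC number prediction, fold type prediction, and subcellular location prediction, providing large language models with instructions and annotations for the protein domain. The accompanying usage notice addresses personal information and intellectual property, but it explicitly does not guarantee that the data is free of such issues; compliance remains the user's responsibility.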
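For the EC number task, the example output in the dataset card is `2.7.10.2`. EC numbers have four dot-separated levels, the first of which names the top-level enzyme class. The sketch below parses such a label and counts how many leading levels a prediction shares with a reference. `parse_ec` and `matching_levels` are hypothetical helpers, and this level-wise comparison is only an illustration, not the evaluation protocol used by the OPI authors.

```python
# Top-level EC classes as defined by the Enzyme Commission nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec(ec: str) -> tuple[str, ...]:
    """Split an EC number such as '2.7.10.2' into its four levels."""
    parts = tuple(ec.strip().split("."))
    if len(parts) != 4 or not parts[0].isdigit():
        raise ValueError(f"not a four-level EC number: {ec!r}")
    return parts


def matching_levels(predicted: str, reference: str) -> int:
    """Count how many leading levels of two EC numbers agree (0 to 4)."""
    count = 0
    for pred, ref in zip(parse_ec(predicted), parse_ec(reference)):
        if pred != ref:
            break
        count += 1
    return count


levels = parse_ec("2.7.10.2")
print(EC_CLASSES[int(levels[0])])               # Transferases
print(matching_levels("2.7.10.1", "2.7.10.2"))  # 3
```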

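Usage

Users of OPI must comply with the Creative Commons Attribution Non Commercial 4.0 License as well as UniProt's original License & Disclaimer and Privacy Notice. The dataset can be downloaded from the Hugging Face repository after acknowledging the license, and is intended for purposes such as learning and research. Users must also ensure that their use does not infringe any third party's rights and must properly handle any personal information they encounter.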
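One possible way to fetch a single file is the `hf_hub_download` function from the third-party `huggingface_hub` package. The sketch below assumes that the license has already been acknowledged on the repository page and that a Hugging Face access token is configured locally, because the repository is gated. `opi_file` and `download` are hypothetical helpers; the file path is built from the folder tree in the dataset card.

```python
from pathlib import PurePosixPath

REPO_ID = "BAAI/OPI"  # repository name as given on this page


def opi_file(group: str, task: str, split: str, filename: str) -> str:
    """Build a repository-relative path following the OPI_DATA layout."""
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    return str(PurePosixPath("OPI_DATA", group, task, split, filename))


def download(relative_path: str) -> str:
    """Download one file from the gated dataset repo; returns a local path.

    Requires `pip install huggingface_hub` and prior authentication.
    """
    from huggingface_hub import hf_hub_download  # imported lazily

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=relative_path,
        repo_type="dataset",
    )


path = opi_file("SU", "EC_number", "test", "CLEAN_EC_number_new_test.jsonl")
print(path)
# local_file = download(path)  # uncomment once access has been granted
```

Keeping the import inside `download` lets the path helper be used without the package installed.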

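Background and challenges

Background overview

The Open Protein Instructions (OPI) dataset, provided by the Beijing Academy of Artificial Intelligence (BAAI), aims to advance research on applying large language models to protein biology. It integrates information from third-party original datasets and, after format integration, data cleaning, labeling, categorizing, and annotating, forms an instruction set covering nine protein-related tasks. OPI is the initial part of the Open Biology Instructions (OBI) project, which aims to leverage large language models, especially scientific LLMs such as Galactica, to facilitate research in the life science community. The repository was published in 2024 (this listing records 2024-03-05), although the dataset card refers to a manuscript submitted to NeurIPS 2023. The main contributing institution is BAAI.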

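Current challenges

Building the OPI dataset posed several challenges. First, extracting key information from the Swiss-Prot database requires precise data processing. Second, the dataset must respect the copyright and usage rules of the original datasets and must not infringe any third party's intellectual property. Third, its construction has to take personal information protection into account and avoid processing private personal data. On the usage side, researchers need to interpret and apply the instructions correctly and to cope with data that may be inaccurate or incomplete.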
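A simple record check is one way to catch incomplete or malformed entries before training. The sketch below assumes each record is a dictionary with the `instruction`, `input`, and `output` fields from the dataset card, and that sequence inputs use single-letter IUPAC amino acid codes. `check_record` is a hypothetical helper; the sequence check is optional because some tasks (for example the gene-symbol tasks in the folder tree) presumably take non-sequence inputs.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")
# 20 standard amino acids plus the IUPAC extended and ambiguity codes.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" "BJOUXZ")


def check_record(record: dict, expect_sequence: bool = True) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    sequence = record.get("input")
    if expect_sequence and isinstance(sequence, str) and sequence.strip():
        # Drop whitespace, which may come from line-wrapped sequences.
        compact = "".join(sequence.split()).upper()
        unexpected = sorted(set(compact) - ALLOWED_RESIDUES)
        if unexpected:
            problems.append(
                "unexpected characters in sequence: " + "".join(unexpected)
            )
    return problems


good = {"instruction": "q", "input": "MGLV SSKK", "output": "2.7.10.2"}
print(check_record(good))  # []
```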

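Common scenarios

Typical use case

The typical use of OPI is as instruction tuning data for large language models (LLMs) on protein-related tasks. The dataset contains nine such tasks, including EC number prediction, fold type prediction, and subcellular location prediction, and supplies instruction examples and training scenarios for applying LLMs to protein biology.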
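For instruction tuning, each record is typically rendered into a prompt and a target completion. The sketch below shows one way to do that. The prompt template is an assumption chosen for illustration (it is not stated anywhere on this page and may differ from what the OPI authors used), and `to_training_pair` is a hypothetical helper.

```python
def to_training_pair(record: dict) -> dict:
    """Render one OPI-style record into a prompt/completion pair.

    The section markers below are an illustrative template only.
    """
    prompt = (
        f"### Instruction:\n{record['instruction'].strip()}\n\n"
        f"### Input:\n{record['input'].strip()}\n\n"
        "### Response:\n"
    )
    return {"prompt": prompt, "completion": record["output"].strip()}


example = {
    "instruction": "What is the EC classification of the input protein "
                   "sequence based on its biological function?",
    "input": "MGLVSSKKPDKEKPIKEKDKGQW",  # truncated for illustration
    "output": "2.7.10.2",
}
pair = to_training_pair(example)
print(pair["prompt"] + pair["completion"])
```

During fine-tuning, the loss is commonly computed only on the completion tokens, so keeping prompt and completion separate makes that masking straightforward.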

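Practical applications

In practice, OPI may be useful in areas such as drug design, bioinformatics research, and protein engineering. Models tuned on it could help researchers predict protein function and interactions with other molecules more accurately, which matters for disease treatment and drug development.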

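Derived and related work

According to this listing, follow-up work based on OPI includes evaluating LLM performance on protein tasks, improving instruction tuning methods, and combining OPI with other bioinformatics data sources for integrated analysis. The listing does not cite specific publications for these efforts.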

The content above was collected and summarized by 遇见数据集.
