---
extra_gated_heading: Acknowledge license to accept the repository
extra_gated_prompt: >
  The Beijing Academy of Artificial Intelligence (hereinafter referred to as
  "we" or "BAAI") provides you with an open-source dataset (hereinafter referred
  to as "dataset") through the OPI HuggingFace repository
  (https://huggingface.co/datasets/BAAI/OPI). You can download the dataset you
  need and use it for purposes such as learning and research while abiding by
  the usage rules of each original dataset.

  Before you acquire the open-source dataset (including but not limited to
  accessing, downloading, copying, distributing, using, or any other handling of
  the dataset), you should read and understand this "OPI Open-Source Dataset
  Usage Notice and Disclaimer" (hereinafter referred to as "this statement").
  Once you acquire the open-source dataset, regardless of your method of
  acquisition, your actions will be regarded as acknowledgment of the full
  content of this statement.

  1. Ownership and Operation Rights

  You should fully understand that the ownership and operation rights of the OPI
  HuggingFace repository (including the current and all previous versions)
  belong to BAAI. BAAI has the final interpretation and decision rights over
  this platform/tool and the open-source dataset plan.

  You acknowledge and understand that due to updates and improvements in
  relevant laws and regulations and the need to fulfill our legal compliance
  obligations, we reserve the right to update, maintain, or even suspend or
  permanently terminate the services of this platform/tool from time to time. We
  will notify you of any of the situations mentioned above by reasonable means,
  such as an announcement or email, within a reasonable time. You should make
  corresponding adjustments and arrangements in a timely manner. However, we do
  not bear any responsibility for any losses caused to you by any of the
  aforementioned situations.

  2. Claim of Rights to Open-Source Datasets

  For the purpose of facilitating your dataset acquisition and use for learning
  and research, we have performed necessary steps such as format integration,
  data cleaning, labeling, categorizing, annotating, and other related
  processing on the third-party original datasets to form the open-source
  datasets for this platform/tool's users.

  You understand and acknowledge that we do not claim the proprietary rights of
  intellectual property to the open-source datasets. Therefore, we have no
  obligation to actively recognize and protect the potential intellectual
  property of the open-source datasets. However, this does not mean that we
  renounce the personal rights to claim credit, publication, modification, and
  protection of the integrity of the work (if any) of the open-source datasets.
  The potential intellectual property and corresponding legal rights of the
  original datasets belong to the original rights holders.

  In addition, providing you with open-source datasets that have been reasonably
  arranged, processed, and handled does not mean that we acknowledge the
  authenticity, accuracy, or indisputability of the intellectual property and
  information content of the original datasets. You should filter and carefully
  discern the open-source datasets you choose to use. You understand and agree
  that BAAI does not undertake any obligation or warranty responsibility for any
  defects or flaws in the original datasets you choose to use.

  3. Usage Restrictions for Open-Source Datasets

  Your use of the dataset must not infringe on our or any third party's legal
  rights and interests (including but not limited to copyrights, patent rights,
  trademark rights, and other intellectual property and other rights).

  After obtaining the open-source dataset, you should ensure that your use of
  the open-source dataset does not exceed the usage rules explicitly stipulated
  by the rights holders of the original dataset in the form of a public notice
  or agreement, including the range, purpose, and lawful purposes of the use of
  the original data. We kindly remind you here that if your use of the
  open-source dataset exceeds the predetermined range and purpose of the
  original dataset, you may face the risk of infringing on the legal rights and
  interests of the rights holders of the original dataset, such as intellectual
  property, and may bear corresponding legal responsibilities.

  4. Personal Information Protection

  Due to technical limitations and the public welfare nature of the open-source
  datasets, we cannot guarantee that the open-source datasets do not contain any
  personal information, and we do not bear any legal responsibility for any
  personal information that may be involved in the open-source datasets.

  If the open-source dataset involves personal information, we do not bear any
  legal responsibility for any personal information processing activities you
  may involve when using the open-source dataset. We kindly remind you here that
  you should handle personal information in accordance with the provisions of
  the "Personal Information Protection Law" and other relevant laws and
  regulations.

  To protect the legal rights and interests of the information subject and to
  fulfill possible applicable laws and administrative regulations, if you find
  content that involves or may involve personal information during the use of
  the open-source dataset, you should immediately stop using the part of the
  dataset that involves personal information and contact us as indicated in "6.
  Complaints and Notices."

  5. Information Content Management

  We do not bear any legal responsibility for any illegal and bad information
  that may be involved in the open-source dataset.

  If you find that the open-source dataset involves or may involve any illegal
  and bad information during your use, you should immediately stop using the
  part of the dataset that involves illegal and bad information and contact us
  in a timely manner as indicated in "6. Complaints and Notices."

  6. Complaints and Notices

  If you believe that the open-source dataset has infringed on your legal rights
  and interests, you can contact us at 010-50955974, and we will handle your
  claims and complaints in accordance with the law in a timely manner.

  To handle your claims and complaints, we may need you to provide contact
  information, infringement proof materials, and identity proof materials.
  Please note that if you maliciously complain or make false statements, you
  will bear all legal responsibilities caused thereby (including but not limited
  to reasonable compensation costs).

  7. Disclaimer

  You understand and agree that due to the nature of the open-source dataset,
  the dataset may contain data from different sources and contributors, and the
  authenticity, accuracy, and objectivity of the data may vary, and we cannot
  make any promises about the availability and reliability of any dataset.

  In any case, we do not bear any legal responsibility for any risks such as
  personal information infringement, illegal and bad information dissemination,
  and intellectual property infringement that may exist in the open-source
  dataset.

  In any case, we do not bear any legal responsibility for any loss (including
  but not limited to direct loss, indirect loss, and loss of potential benefits)
  that you suffer from, or that is related to, the open-source dataset.

  8. Others

  The open-source dataset is in a constant state of development and change. We
  may update, adjust the range of the open-source dataset we provide, or
  suspend, pause, or terminate the open-source dataset service due to business
  development, third-party cooperation, changes in laws and regulations, and
  other reasons.
extra_gated_fields:
  Name: text
  Affiliation: text
  Country: text
  I agree to accept the license: checkbox
extra_gated_button_content: Acknowledge license
license: cc-by-nc-4.0
language:
- en
tags:
- biology
- protein
- instruction dataset
- instruction tuning
pretty_name: Open Protein Instructions (OPI)
size_categories:
- 1M<n<10M
task_categories:
- text-generation
---

# Dataset Card for Open Protein Instructions (OPI)
## Dataset Update
The previous version of the OPI dataset was based on **release 2022_01** of the UniProtKB/Swiss-Prot protein knowledgebase. OPI has since been updated to include the newer **release 2023_05**, which can be accessed via the dataset file [OPI_updated_160k.json](./OPI_DATA/OPI_updated_160k.json).
References:
- https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2022_01/knowledgebase/UniProtKB_SwissProt-relstat.html
- https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2023_05/knowledgebase/UniProtKB_SwissProt-relstat.html
## Dataset Description
### Dataset Summary
Open Protein Instructions (OPI) is the initial part of the Open Biology Instructions (OBI) project, which will be followed by Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI), and Open Single-cell Instructions (OSCI). OBI aims to fully leverage the potential of Large Language Models (LLMs), especially scientific LLMs such as Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope to provide a starting point for the community to bridge LLMs and biological domain knowledge.
## Dataset Structure
### Data Instances
```
instruction:
What is the EC classification of the input protein sequence based on its biological function?
input:
MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS
LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL
QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV
RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK
FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT
ATERQYELQP
output:
2.7.10.2
```
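
As a minimal sketch of how such an instance might be handled in code, the snippet below stores one record with the three fields shown above (`instruction`, `input`, `output`) and turns it into a single prompt string. The prompt template and the helper names `build_prompt` and `parse_ec_number` are illustrative assumptions, not part of OPI, and the sequence is truncated only to keep the example short.

```python
from typing import Dict, List

# One OPI-style record using the field names from the instance above.
# NOTE: the sequence is truncated for brevity; real records hold the full sequence.
record: Dict[str, str] = {
    "instruction": (
        "What is the EC classification of the input protein sequence "
        "based on its biological function?"
    ),
    "input": "MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPP",
    "output": "2.7.10.2",
}


def build_prompt(rec: Dict[str, str]) -> str:
    """Join instruction and input into one prompt (hypothetical template)."""
    return f"Instruction: {rec['instruction']}\nInput: {rec['input']}\nOutput:"


def parse_ec_number(ec: str) -> List[str]:
    """Split an EC number such as '2.7.10.2' into its hierarchical levels."""
    return ec.strip().split(".")


prompt = build_prompt(record)
levels = parse_ec_number(record["output"])
```

Here `prompt` is the text that would be fed to a model and `record["output"]` is the reference answer; splitting the EC number into levels can be handy when scoring partially correct predictions.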
### Data Splits
The OPI dataset folder structure is as follows:
```
./OPI_DATA/
├── AP
│   ├── Function
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_function_test.jsonl
│   │   │   ├── IDFilterSeq_function_test.jsonl
│   │   │   └── UniProtSeq_function_test.jsonl
│   │   └── train
│   │       ├── function_description_train.json
│   │       └── function_description_train_0.01.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_test.jsonl
│   │   │   ├── IDFilterSeq_go_test.jsonl
│   │   │   └── UniProtSeq_go_test.jsonl
│   │   └── train
│   │       ├── go_terms_train.json
│   │       └── go_terms_train_0.01.json
│   └── Keywords
│       ├── test
│       │   ├── CASPSimilarSeq_keywords_test.jsonl
│       │   ├── IDFilterSeq_keywords_test.jsonl
│       │   └── UniProtSeq_keywords_test.jsonl
│       └── train
│           ├── keywords_train.json
│           └── keywords_train_0.01.json
├── KM
│   ├── gSymbol2Cancer
│   │   ├── test
│   │   │   └── gene_symbol_to_cancer_test.jsonl
│   │   └── train
│   │       └── gene_symbol_to_cancer_train.json
│   ├── gName2Cancer
│   │   ├── test
│   │   │   └── gene_name_to_cancer_test.jsonl
│   │   └── train
│   │       └── gene_name_to_cancer_train.json
│   └── gSymbol2Tissue
│       ├── test
│       │   └── gene_symbol_to_tissue_test.jsonl
│       └── train
│           └── gene_symbol_to_tissue_train.json
└── SU
    ├── EC_number
    │   ├── test
    │   │   ├── CLEAN_EC_number_new_test.jsonl
    │   │   └── CLEAN_EC_number_price_test.jsonl
    │   └── train
    │       └── CLEAN_EC_number_train.json
    ├── Fold_type-Remote
    │   ├── test
    │   │   └── Remote_test.jsonl
    │   └── train
    │       └── Remote_train.json
    └── Subcellular_location
        ├── test
        │   └── location_test.jsonl
        └── train
            └── location_train.json
```
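
The layout above mixes `.json` training files with `.jsonl` test files. The following is a rough sketch of how both could be read and how the train/test files could be listed per task. It assumes (this is not stated in the card) that each `.json` file holds a single JSON array of records and each `.jsonl` file holds one JSON object per line; `load_records` and `list_splits` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_records(path):
    """Load records from a .json file (assumed: one JSON array)
    or a .jsonl file (assumed: one JSON object per non-empty line)."""
    path = Path(path)
    with path.open(encoding="utf-8") as f:
        if path.suffix == ".jsonl":
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)


def list_splits(root):
    """Map each task directory (e.g. 'SU/EC_number') to its train/test file names."""
    root = Path(root)
    splits = {}
    # Task data sits three levels below the root: <group>/<task>/<train|test>.
    for split_dir in sorted(root.glob("*/*/*")):
        if split_dir.is_dir() and split_dir.name in ("train", "test"):
            task = split_dir.parent.relative_to(root).as_posix()
            files = sorted(p.name for p in split_dir.iterdir() if p.is_file())
            splits.setdefault(task, {})[split_dir.name] = files
    return splits
```

With a local copy of the data, `list_splits("./OPI_DATA")` would give an overview of the tasks, and `load_records("./OPI_DATA/SU/EC_number/test/CLEAN_EC_number_new_test.jsonl")` would return that test set as a list of dictionaries, provided the format assumptions above hold.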
## Dataset Creation
We curated the OPI dataset ourselves by extracting key information from the [Swiss-Prot](https://www.uniprot.org/uniprotkb?facets=reviewed%3Atrue&query=%2A) database. The detailed construction pipeline is described in the supplementary material of our manuscript, which has been submitted to the NeurIPS 2023 Datasets and Benchmarks track. The following figure shows the general construction process.

## License
The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Use of this dataset should also abide by the original [License & Disclaimer](https://www.uniprot.org/help/license) and [Privacy Notice](https://www.uniprot.org/help/privacy) of UniProt.