OPI
Source: ModelScope community (updated 2026-01-09, indexed 2024-11-16)
Download link:
https://modelscope.cn/datasets/BAAI/OPI

# GitHub:
https://github.com/baaihealth/opi
# Paper:
[OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks](https://neurips.cc/virtual/2024/105921) has been accepted by [NeurIPS 2024 Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges](https://neurips.cc/virtual/2024/workshop/84714).
# Dataset Overview
**Dataset size:**
- **There are <u>1.64M samples</u> in the OPI dataset, comprising <u>training (1,615,661)</u> and <u>testing (26,607)</u> sets and covering 9 protein-related tasks.**
We are excited to announce the release of the **Open Protein Instructions (OPI)** dataset, a curated collection of instructions covering 9 tasks for adapting LLMs to protein biology. The dataset is designed to advance LLM-driven research in the field of protein biology. We welcome contributions and enhancements to this dataset from the community.
OPI is the first part of the Open Biology Instructions (OBI) project, to be followed by Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI), and Open Single-cell Instructions (OSCI). OBI aims to fully leverage the potential of Large Language Models (LLMs), especially scientific LLMs such as Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope to provide a starting point for the community to bridge LLMs and biological domain knowledge.
## Dataset Update
The previous version of the OPI dataset was based on **release 2022_01** of the UniProtKB/Swiss-Prot protein knowledgebase. OPI has since been updated to include **release 2023_05**, which can be accessed via the dataset file [OPI_updated_160k.json](./OPI_DATA/OPI_updated_160k.json).
References:
- https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2022_01/knowledgebase/UniProtKB_SwissProt-relstat.html
- https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2023_05/knowledgebase/UniProtKB_SwissProt-relstat.html
## OPI Dataset Construction Pipeline
We curated the OPI dataset ourselves by extracting key information from the [Swiss-Prot](https://www.uniprot.org/uniprotkb?facets=reviewed%3Atrue&query=%2A) database. The following figure shows the general construction process.

## OPI Dataset Folder Structure
The OPI dataset is organized into three subfolders (AP, KM, and SU) in the [OPI_DATA](https://huggingface.co/datasets/BAAI/OPI/tree/main/OPI_DATA) directory within this repository, where you can find a subset for each specific task as well as the full dataset file: [OPI_full_1.61M_train.json](https://huggingface.co/datasets/BAAI/OPI/blob/main/OPI_DATA/OPI_full_1.61M_train.json).
```
./OPI_DATA/
├── SU
│   ├── EC_number
│   │   ├── test
│   │   │   ├── CLEAN_EC_number_new_test.jsonl
│   │   │   └── CLEAN_EC_number_price_test.jsonl
│   │   └── train
│   │       └── CLEAN_EC_number_train.json
│   ├── Fold_type
│   │   ├── test
│   │   │   └── fold_type_test.jsonl
│   │   └── train
│   │       └── fold_type_train.json
│   └── Subcellular_localization
│       ├── test
│       │   └── subcell_loc_test.jsonl
│       └── train
│           └── subcell_loc_train.json
├── AP
│   ├── Keywords
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_keywords_test.jsonl
│   │   │   ├── IDFilterSeq_keywords_test.jsonl
│   │   │   └── UniProtSeq_keywords_test.jsonl
│   │   └── train
│   │       └── keywords_train.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_terms_test.jsonl
│   │   │   ├── IDFilterSeq_go_terms_test.jsonl
│   │   │   └── UniProtSeq_go_terms_test.jsonl
│   │   └── train
│   │       └── go_terms_train.json
│   └── Function
│       ├── test
│       │   ├── CASPSimilarSeq_function_test.jsonl
│       │   ├── IDFilterSeq_function_test.jsonl
│       │   └── UniProtSeq_function_test.jsonl
│       └── train
│           └── function_train.json
└── KM
    ├── gSymbol2Tissue
    │   ├── test
    │   │   └── gene_symbol_to_tissue_test.jsonl
    │   └── train
    │       └── gene_symbol_to_tissue_train.json
    ├── gSymbol2Cancer
    │   ├── test
    │   │   └── gene_symbol_to_cancer_test.jsonl
    │   └── train
    │       └── gene_symbol_to_cancer_train.json
    └── gName2Cancer
        ├── test
        │   └── gene_name_to_cancer_test.jsonl
        └── train
            └── gene_name_to_cancer_train.json
```
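Each task in this layout lives under `<type>/<task>/train` and `<type>/<task>/test`. The following is a minimal Python sketch for resolving those directories, assuming a local copy of `OPI_DATA`. The helper `task_dirs` and the `TASKS` mapping are illustrative names and are not part of the OPI repository.

```python
from pathlib import PurePosixPath

# Task type -> task folder names, copied from the directory tree.
TASKS = {
    "SU": ["EC_number", "Fold_type", "Subcellular_localization"],
    "AP": ["Keywords", "GO", "Function"],
    "KM": ["gSymbol2Tissue", "gSymbol2Cancer", "gName2Cancer"],
}


def task_dirs(root, task):
    """Return (train_dir, test_dir) for a task folder name (hypothetical helper)."""
    for task_type, names in TASKS.items():
        if task in names:
            base = PurePosixPath(root) / task_type / task
            return base / "train", base / "test"
    raise KeyError(f"unknown task: {task}")


train_dir, test_dir = task_dirs("OPI_DATA", "EC_number")
print(train_dir)  # OPI_DATA/SU/EC_number/train
print(test_dir)   # OPI_DATA/SU/EC_number/test
```

Looking a task up by its folder name means the caller does not have to remember which of the three categories it belongs to.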
## Dataset Examples
**An example of OPI training data:**
```
instruction:
What is the EC classification of the input protein sequence based on its biological function?
input:
MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS
LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL
QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV
RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK
FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT
ATERQYELQP
output:
2.7.10.2
```
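A training sample therefore has three fields: `instruction`, `input`, and `output`. As a rough sketch of how such a record could be turned into a prompt/target pair for instruction tuning, the snippet below assumes each record is a dictionary with those three keys. The prompt template and the `build_prompt` helper are assumptions made for illustration, not the format used by the OPI authors, and the sequence is shortened for brevity.

```python
def build_prompt(record):
    """Join instruction and input into one prompt string (template is an assumption)."""
    return f"{record['instruction']}\n\n{record['input']}\n\n"


record = {
    "instruction": "What is the EC classification of the input protein "
                   "sequence based on its biological function?",
    "input": "MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPP",  # shortened sequence
    "output": "2.7.10.2",
}

prompt = build_prompt(record)  # text fed to the model
target = record["output"]      # text the model is trained to produce
print(prompt + target)
```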
**An example of OPI testing data:**
```
{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", "instruction":
"Return the EC number of the protein sequence.", "instances": [{"input":
"MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFFLDDGTVYNTHTRHLVSATRFVVTHAMLYRTTGEARYQVGMRHALEFLRTAFLDPATGGY
AWLIDWQDGRATVQDTTRHCYGMAFVMLAYARAYEAGVPEARVWLAEAFDTAEQHFWQPAAGLYADEASPDWQLTSYRGQNANMHACEAMISAFRATGERR
YIERAEQLAQGICQRQAALSDRTHAPAAEGWVWEHFHADWSVDWDYNRHDRSNIFRPWGYQVGHQTEWAKLLLQLDALLPADWHLPCAQRLFDTAVERGWD
AEHGGLYYGMAPDGSICDDGKYHWVQAESMAAAAVLAVRTGDARYWQWYDRIWAYCWAHFVDHEHGAWFRILHRDNRNTTREKSNAGKVDYHNMGACYDVL
LWALDAPGFSKESRSAALGRP", "output": "5.3.1.7"}], "is_classification": false}
```
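In the testing format, the instruction sits at the top level and the input/output pairs are nested under `instances`. The sketch below flattens such records into `(instruction, input, output)` triples. It assumes that each line of a `.jsonl` file holds one complete JSON object (the example above is wrapped only for display), and it uses a shortened sequence; `iter_test_examples` is a hypothetical helper name.

```python
import json


def iter_test_examples(lines):
    """Yield (instruction, input, output) triples from JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        for instance in task["instances"]:
            yield task["instruction"], instance["input"], instance["output"]


# One JSONL line in the testing format, with a shortened sequence.
sample = [
    '{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", '
    '"instruction": "Return the EC number of the protein sequence.", '
    '"instances": [{"input": "MAIPPYPDFRSAAFLRQ", "output": "5.3.1.7"}], '
    '"is_classification": false}'
]

examples = list(iter_test_examples(sample))
print(examples[0][2])  # 5.3.1.7
```

To read a real file, the same function can be given an open file handle, since iterating over a text file yields its lines.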
## OPEval: Nine evaluation tasks using the OPI dataset
To assess the effectiveness of instruction tuning with the OPI dataset, we developed OPEval, which comprises three categories of evaluation tasks. Each category includes three specific tasks. The table below outlines the task types, names, and the corresponding sizes of the training and testing sets.
<table border="1" style="text-align:center; border-collapse:collapse;">
<tr>
<th style="text-align:center;">Task Type</th>
<th style="text-align:center;">Type Abbr.</th>
<th style="text-align:center;">Task Name</th>
<th style="text-align:center;">Task Abbr.</th>
<th style="text-align:center;">Training set size</th>
<th style="text-align:center;">Testing set size</th>
</tr>
<tr>
<td rowspan="3">Sequence Understanding</td>
<td rowspan="3">SU</td>
<td>EC Number Prediction</td>
<td>EC_number</td>
<td style="text-align:center;">227,362</td>
<td style="text-align:center;">392 (NEW-392), 149 (Price-149)</td>
</tr>
<tr>
<td>Fold Type Prediction</td>
<td>Fold_type</td>
<td style="text-align:center;">12,312</td>
<td style="text-align:center;">718 (Fold), 1,254 (Superfamily), 1,272 (Family)</td>
</tr>
<tr>
<td>Subcellular Localization Prediction</td>
<td>Subcellular_localization</td>
<td style="text-align:center;">11,230</td>
<td style="text-align:center;">2,772</td>
</tr>
<tr>
<td rowspan="3">Annotation Prediction</td>
<td rowspan="3">AP</td>
<td>Function Keywords Prediction</td>
<td>Keywords</td>
<td style="text-align:center;">451,618</td>
<td style="text-align:center;">184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq)</td>
</tr>
<tr>
<td>Gene Ontology (GO) Terms Prediction</td>
<td>GO</td>
<td style="text-align:center;">451,618</td>
<td style="text-align:center;">184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq)</td>
</tr>
<tr>
<td>Function Description Prediction</td>
<td>Function</td>
<td style="text-align:center;">451,618</td>
<td style="text-align:center;">184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq)</td>
</tr>
<tr>
<td rowspan="3">Knowledge Mining</td>
<td rowspan="3">KM</td>
<td>Tissue Location Prediction from Gene Symbol</td>
<td>gSymbol2Tissue</td>
<td style="text-align:center;">8,723</td>
<td style="text-align:center;">2,181</td>
</tr>
<tr>
<td>Cancer Prediction from Gene Symbol</td>
<td>gSymbol2Cancer</td>
<td style="text-align:center;">590</td>
<td style="text-align:center;">148</td>
</tr>
<tr>
<td>Cancer Prediction from Gene Name</td>
<td>gName2Cancer</td>
<td style="text-align:center;">590</td>
<td style="text-align:center;">148</td>
</tr>
</table>
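As a quick consistency check, the per-task sizes in the table can be tallied against the totals quoted in the dataset overview. The sketch below copies the numbers from the table; the dictionary names are illustrative. The training sizes sum to 1,615,661, matching the overview, while the testing sizes as listed sum to 26,608, one more than the 26,607 quoted there. The source does not indicate which figure is off.

```python
# Training set sizes per task, copied from the table.
TRAIN = {
    "EC_number": 227_362,
    "Fold_type": 12_312,
    "Subcellular_localization": 11_230,
    "Keywords": 451_618,
    "GO": 451_618,
    "Function": 451_618,
    "gSymbol2Tissue": 8_723,
    "gSymbol2Cancer": 590,
    "gName2Cancer": 590,
}

# Testing set sizes per task; tasks with several test sets list each one.
TEST = {
    "EC_number": [392, 149],
    "Fold_type": [718, 1_254, 1_272],
    "Subcellular_localization": [2_772],
    "Keywords": [184, 1_112, 4_562],
    "GO": [184, 1_112, 4_562],
    "Function": [184, 1_112, 4_562],
    "gSymbol2Tissue": [2_181],
    "gSymbol2Cancer": [148],
    "gName2Cancer": [148],
}

train_total = sum(TRAIN.values())
test_total = sum(sum(sizes) for sizes in TEST.values())
print(train_total)  # 1615661
print(test_total)   # 26608
```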
## License
The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) License. Use of this dataset must also abide by the original [License & Disclaimer](https://www.uniprot.org/help/license) and [Privacy Notice](https://www.uniprot.org/help/privacy) of UniProt.

Provider:
maas
Created:
2024-09-12



