im-sangwoon/protein-sft-uniprot
Hugging Face · Updated 2026-02-25 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/im-sangwoon/protein-sft-uniprot
Description:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- protein
- bioinformatics
- uniprot
- sft
size_categories:
- 1M<n<10M
---
# protein-sft-uniprot
An SFT (Supervised Fine-Tuning) dataset for training LLMs specialized in protein research.
It consists of Q&A-style conversation data extracted from the UniProt database and the protein literature.
## Dataset Summary
| | |
|---|---|
| **Total samples** | 1,551,711 |
| **Unique proteins** | 455,613 |
| **Format** | JSONL (chat messages) |
| **Size** | 462MB |
## Sources
| Source | Samples | Description |
|---|---|---|
| UniProtQA | 1,513,126 | Structured protein information extracted from the UniProt database |
| Protein2Text-QA | 38,585 | In-depth Q&A based on protein-related literature |
## Question Types
| Type | Samples | Example |
|---|---|---|
| Official names | 455,612 | "What are the official names of Hemoglobin subunit alpha?" |
| Protein family | 407,551 | "What is the protein family that Insulin belongs to?" |
| Function | 381,776 | "What is the function of Acetyl-CoA carboxylase?" |
| Subcellular location | 274,361 | "What are the subcellular locations of TP53?" |
| Sequence | 365 | Amino acid sequence related questions |
| Other (literature-based) | 32,046 | Genotype-phenotype relationships, expression significance, etc. |
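The example questions in the table suggest that the UniProt-derived types follow fixed question prefixes. As a rough sketch under that assumption, a question could be bucketed by its opening words. The function `question_type` and its prefix rules are hypothetical: they are inferred only from the examples above, not from the dataset's actual labeling procedure, and the table gives no sample wording for the Sequence type, so such questions fall into "Other" here.

```python
# Hypothetical prefix rules, inferred only from the example questions in the table.
PREFIX_TO_TYPE = {
    "what are the official names of": "Official names",
    "what is the protein family": "Protein family",
    "what is the function of": "Function",
    "what are the subcellular locations of": "Subcellular location",
}


def question_type(question: str) -> str:
    """Return a question-type label based on the question's opening words."""
    normalized = question.strip().lower()
    for prefix, label in PREFIX_TO_TYPE.items():
        if normalized.startswith(prefix):
            return label
    # Anything that matches no known prefix (e.g. literature-based questions).
    return "Other"


print(question_type("What are the subcellular locations of TP53?"))
```

Counting the labels over the whole dataset and comparing them with the table would show how well these assumed prefixes hold.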
## Data Format
Each sample is a chat structure in the `messages` format:
```json
{
  "messages": [
    {"role": "user", "content": "What is the function of Acetyl-coenzyme A carboxylase carboxyl transferase subunit beta?"},
    {"role": "assistant", "content": "Component of the acetyl coenzyme A carboxylase (ACC) complex..."}
  ],
  "source": "UniProtQA",
  "protein_id": "ACCD_HAEIE"
}
```
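A minimal sketch of reading one such record from a JSONL line with the standard library. The helper `parse_record` is a hypothetical name, and it assumes each record holds a single user/assistant exchange, as in the example above; with multi-turn records, later turns would overwrite earlier ones.

```python
import json

# One JSONL line, copied from the example above (the answer is truncated in the source).
line = (
    '{"messages": ['
    '{"role": "user", "content": "What is the function of Acetyl-coenzyme A '
    'carboxylase carboxyl transferase subunit beta?"}, '
    '{"role": "assistant", "content": "Component of the acetyl coenzyme A '
    'carboxylase (ACC) complex..."}], '
    '"source": "UniProtQA", "protein_id": "ACCD_HAEIE"}'
)


def parse_record(raw: str) -> dict:
    """Parse one JSONL line into a flat dict (assumes one user and one assistant turn)."""
    record = json.loads(raw)
    # Index message contents by role; valid only under the single-exchange assumption.
    by_role = {message["role"]: message["content"] for message in record["messages"]}
    return {
        "question": by_role["user"],
        "answer": by_role["assistant"],
        "source": record["source"],
        "protein_id": record["protein_id"],
    }


parsed = parse_record(line)
print(parsed["protein_id"], "|", parsed["question"])
```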
## Usage
```python
from datasets import load_dataset

# Downloads from the Hugging Face Hub; with no `split` argument this returns a DatasetDict.
dataset = load_dataset("im-sangwoon/protein-sft-uniprot")
```
## Intended Use
- Fine-tuning LLMs specialized in protein research
- Building bioinformatics Q&A systems
- Training protein knowledge-based chatbots
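For the fine-tuning use case, each `messages` list has to be rendered into a single training string. The sketch below shows one way to do that with a made-up plain-text template. The function `to_training_text` and the `<|role|>` tags are hypothetical, chosen only for illustration. An actual run would more likely rely on the target model's own chat template, for example through the tokenizer's `apply_chat_template` method in Hugging Face Transformers.

```python
def to_training_text(messages: list[dict]) -> str:
    """Render chat messages as one plain-text string using a hypothetical template."""
    parts = []
    for message in messages:
        # Hypothetical role tags; not the template of any particular model.
        parts.append(f"<|{message['role']}|>\n{message['content']}")
    return "\n".join(parts)


# Messages taken from the Data Format example (the answer is truncated in the source).
example = [
    {
        "role": "user",
        "content": "What is the function of Acetyl-coenzyme A carboxylase "
        "carboxyl transferase subunit beta?",
    },
    {
        "role": "assistant",
        "content": "Component of the acetyl coenzyme A carboxylase (ACC) complex...",
    },
]

text = to_training_text(example)
print(text)
```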
## Related Model
Model trained on this dataset: [im-sangwoon/chatprot-qwen2.5-32b-lora](https://huggingface.co/im-sangwoon/chatprot-qwen2.5-32b-lora)
Provider:
im-sangwoon



