
im-sangwoon/protein-sft-uniprot

Source: Hugging Face; updated 2026-02-25; indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/im-sangwoon/protein-sft-uniprot
Description:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- protein
- bioinformatics
- uniprot
- sft
size_categories:
- 1M<n<10M
---

# protein-sft-uniprot

An SFT (Supervised Fine-Tuning) dataset for training LLMs specialized in protein research. It consists of Q&A-style conversation data extracted from the UniProt database and the protein literature.

## Dataset Summary

| | |
|---|---|
| **Total samples** | 1,551,711 |
| **Unique proteins** | 455,613 |
| **Format** | JSONL (chat messages) |
| **Size** | 462MB |

## Sources

| Source | Samples | Description |
|---|---|---|
| UniProtQA | 1,513,126 | Structured protein information extracted from the UniProt database |
| Protein2Text-QA | 38,585 | In-depth Q&A based on protein-related literature |

## Question Types

| Type | Samples | Example |
|---|---|---|
| Official names | 455,612 | "What are the official names of Hemoglobin subunit alpha?" |
| Protein family | 407,551 | "What is the protein family that Insulin belongs to?" |
| Function | 381,776 | "What is the function of Acetyl-CoA carboxylase?" |
| Subcellular location | 274,361 | "What are the subcellular locations of TP53?" |
| Sequence | 365 | Amino acid sequence related questions |
| Other (literature-based) | 32,046 | Genotype-phenotype relationships, expression significance, etc. |

## Data Format

Each sample is a chat structure in the `messages` format:

```json
{
  "messages": [
    {"role": "user", "content": "What is the function of Acetyl-coenzyme A carboxylase carboxyl transferase subunit beta?"},
    {"role": "assistant", "content": "Component of the acetyl coenzyme A carboxylase (ACC) complex..."}
  ],
  "source": "UniProtQA",
  "protein_id": "ACCD_HAEIE"
}
```

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("im-sangwoon/protein-sft-uniprot")
```

## Intended Use

- Fine-tuning LLMs specialized in protein research
- Building bioinformatics Q&A systems
- Training protein knowledge-based chatbots

## Related Model

Model trained on this dataset: [im-sangwoon/chatprot-qwen2.5-32b-lora](https://huggingface.co/im-sangwoon/chatprot-qwen2.5-32b-lora)
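The record layout described under Data Format can be unpacked with the standard library alone. The sketch below is illustrative, not part of the dataset's tooling: the helper name `to_pair` is hypothetical, and it assumes each JSONL line holds one record with at least one `user` turn and one `assistant` turn, as in the sample shown in the card. Records with a different shape would need extra handling.

```python
import json


def to_pair(line: str) -> dict:
    """Flatten one JSONL record into a question/answer pair.

    Hypothetical helper: assumes the `messages` list contains a `user`
    turn and an `assistant` turn, as in the card's sample record.
    """
    record = json.loads(line)
    messages = record["messages"]
    # Take the first message of each role; raises StopIteration if absent.
    question = next(m["content"] for m in messages if m["role"] == "user")
    answer = next(m["content"] for m in messages if m["role"] == "assistant")
    return {
        "protein_id": record.get("protein_id"),
        "source": record.get("source"),
        "question": question,
        "answer": answer,
    }


# One line of JSONL, built from the sample record in the card.
sample_line = json.dumps(
    {
        "messages": [
            {
                "role": "user",
                "content": "What is the function of Acetyl-coenzyme A "
                "carboxylase carboxyl transferase subunit beta?",
            },
            {
                "role": "assistant",
                "content": "Component of the acetyl coenzyme A "
                "carboxylase (ACC) complex...",
            },
        ],
        "source": "UniProtQA",
        "protein_id": "ACCD_HAEIE",
    }
)

pair = to_pair(sample_line)
print(pair["protein_id"], "|", pair["question"])
```

Using `.get` for `source` and `protein_id` keeps the helper tolerant of records that omit those fields, which the card does not rule out.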
Provider:
im-sangwoon