Align-Anything-Instruction-100K-zh
ModelScope community · updated 2025-12-19 · indexed 2025-02-08
Download link:
https://modelscope.cn/datasets/PKU-Alignment/Align-Anything-Instruction-100K-zh
Description:
# Dataset Card for Align-Anything-Instruction-100K-zh
[[🏠 Homepage](https://github.com/PKU-Alignment/align-anything)]
[[🤗 Instruction-Dataset-100K(en)](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K)]
[[🤗 Instruction-Dataset-100K(zh)](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K-zh)]
[[🤗 Align-Anything Datasets](https://huggingface.co/datasets/PKU-Alignment/align-anything/)]
# Instruction-Dataset-100K(zh)
## Highlights
<div class="col-md-12">
<ul>
<li><b>Data sources:</b>
<a href="https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M" target="_blank">Firefly (47.8%)</a>,
<a href="https://huggingface.co/datasets/BAAI/COIG" target="_blank">COIG (2.9%)</a>,
and our meticulously constructed QA pairs (49.3%).
</li>
<li><b>100K QA pairs (zh):</b> 104,550 meticulously crafted instructions, selected and polished from various Chinese datasets, with QA pairs further enhanced using GPT-4.</li>
<li><b>Note:</b> This dataset has different data sources and polishing methods from <a href="https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K" target="_blank">Align-Anything-Instruction-100K(en)</a>. It is not directly translated from that dataset.</li>
</ul>
</div>
## Data Summary
This dataset is a sibling project of [Align-Anything](https://github.com/PKU-Alignment/align-anything).
We offer a high-quality Chinese instruction-following dataset of roughly 100K question-answer pairs. The entries cover a range of categories, including summarization, creative writing, extraction, classification, role-play, KnowledgeQA, OpenQA, reasoning, and brainstorming.
Of the 100K QA pairs in our dataset, 50.7% come from public datasets such as [Firefly](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) and [COIG](https://huggingface.co/datasets/BAAI/COIG). The instructions for the remaining 49.3% of QA pairs were written by us and annotated by GPT-4 under expert guidance, similar to the [PKU-SafeRLHF dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF).
Each QA pair is post-processed by GPT-4 according to specific guidelines. This comprehensive and detailed pipeline ensures a high-quality instruction-following dataset.
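As a rough sanity check on the composition figures above, the short sketch below converts the stated percentages into approximate per-source counts. It assumes the 104,550 total from the Highlights section; because the published shares are rounded to one decimal place, the resulting counts are estimates, not official numbers. The names `TOTAL_PAIRS`, `SOURCE_SHARES`, and `approximate_counts` are illustrative and not part of the dataset.

```python
# Approximate per-source counts derived from the percentages on this card.
# The shares are rounded on the card, so these counts are estimates only.
TOTAL_PAIRS = 104_550
SOURCE_SHARES = {
    "Firefly": 0.478,
    "COIG": 0.029,
    "Constructed QA pairs": 0.493,
}


def approximate_counts(total, shares):
    """Return the rounded number of pairs implied by each share."""
    return {name: round(total * share) for name, share in shares.items()}


counts = approximate_counts(TOTAL_PAIRS, SOURCE_SHARES)

# Share of pairs taken from public datasets (Firefly + COIG), in percent.
public_share = round((SOURCE_SHARES["Firefly"] + SOURCE_SHARES["COIG"]) * 100, 1)

for name, count in counts.items():
    print(f"{name}: ~{count} pairs")
print(f"Public datasets: {public_share}%")
```

The public-dataset share computed here matches the 50.7% quoted in the summary.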
## Dataset Comparison
We fine-tune several base models (Llama2-7B, Llama3-8B, Qwen2-7B) on 50K samples from Align-Anything-Instruction-100K-zh and, separately, on 50K samples from Firefly. We then evaluate the fine-tuned models on the [Just-Eval](https://huggingface.co/datasets/re-align/just-eval-instruct) benchmark, translating the evaluation prompts into Chinese before assessment. The models are evaluated across five dimensions: helpfulness, clarity, factuality, depth, and engagement. Models trained on our dataset perform well across all five dimensions.
<div align="center">
<img src="performance.png" width="70%"/>
</div>
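The card does not say how the 50K-sample subsets were drawn. As one possible way to build a reproducible subset of that size, the sketch below picks row indices uniformly at random with a fixed seed. Uniform sampling, the seed value, and the helper name `sample_indices` are assumptions made for illustration, not the authors' procedure.

```python
import random
from typing import List


def sample_indices(num_rows: int, sample_size: int, seed: int = 0) -> List[int]:
    """Pick `sample_size` distinct row indices uniformly at random."""
    if sample_size > num_rows:
        raise ValueError("sample_size cannot exceed num_rows")
    rng = random.Random(seed)  # local generator, so global state is untouched
    return sorted(rng.sample(range(num_rows), sample_size))


# 104,550 rows in total (per this card), 50K rows in the subset.
indices = sample_indices(104_550, 50_000, seed=0)
print(len(indices), indices[:5])
```

With the Hugging Face `datasets` library, such a list could be passed to `Dataset.select(indices)` on a loaded split to materialize the subset; which split name applies here is not stated on this card.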
## Usage
To load our dataset, use the `load_dataset()` function as follows:
```python
from datasets import load_dataset
dataset = load_dataset("PKU-Alignment/Align-Anything-Instruction-100K-zh")
```
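Since this card does not list the dataset's splits or column names, it can help to inspect the object returned by `load_dataset()` before writing any training code. The sketch below is a small, hypothetical helper (`describe_splits` is not part of the `datasets` library) that relies only on the `num_rows` and `column_names` attributes each split exposes.

```python
from typing import Any, Dict, Mapping


def describe_splits(dataset_dict: Mapping[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Summarize every split without assuming split or column names."""
    summary = {}
    for split_name, split in dataset_dict.items():
        summary[split_name] = {
            "num_rows": split.num_rows,
            "columns": list(split.column_names),
        }
    return summary


# Example usage (requires network access and the `datasets` package):
#   from datasets import load_dataset
#   dataset = load_dataset("PKU-Alignment/Align-Anything-Instruction-100K-zh")
#   for name, info in describe_splits(dataset).items():
#       print(name, info["num_rows"], info["columns"])
```

Checking the reported columns first avoids hard-coding field names that may differ from what the dataset actually provides.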
Provider:
maas
Created:
2025-02-07



