
Align-Anything-Instruction-100K-zh

ModelScope community, updated 2025-12-19, indexed 2025-02-08
Download link:
https://modelscope.cn/datasets/PKU-Alignment/Align-Anything-Instruction-100K-zh
Description:
# Dataset Card for Align-Anything-Instruction-100K-zh

[[🏠 Homepage](https://github.com/PKU-Alignment/align-anything)] [[🤗 Instruction-Dataset-100K(en)](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K)] [[🤗 Instruction-Dataset-100K(zh)](https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K-zh)] [[🤗 Align-Anything Datasets](https://huggingface.co/datasets/PKU-Alignment/align-anything/)]

# Instruction-Dataset-100K(zh)

## Highlights

<div class="col-md-12">
<ul>
<li><b>Data sources:</b> <a href="https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M" target="_blank">Firefly (47.8%)</a>, <a href="https://huggingface.co/datasets/BAAI/COIG" target="_blank">COIG (2.9%)</a>, and our own constructed QA pairs (49.3%).</li>
<li><b>100K QA pairs (zh):</b> 104,550 instructions, selected and polished from various Chinese datasets, with QA pairs further enhanced using GPT-4.</li>
<li><b>Note:</b> This dataset uses different data sources and polishing methods from <a href="https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K" target="_blank">Align-Anything-Instruction-100K(en)</a>. It is not a direct translation of that dataset.</li>
</ul>
</div>

## Data Summary

This dataset is a sibling project of [Align-Anything](https://github.com/PKU-Alignment/align-anything).

We offer a high-quality Chinese instruction-following dataset of about 100K question-answer pairs (104,550 in total). The entries cover a range of categories, including summarization, creation, extraction, classification, role-play (cosplay), KnowledgeQA, OpenQA, reasoning, brainstorming, and more.

Of the QA pairs in the dataset, 50.7% come from public datasets: [Firefly](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) (47.8%) and [COIG](https://huggingface.co/datasets/BAAI/COIG) (2.9%). The instructions for the remaining 49.3% were written by us and annotated by GPT-4 under expert guidance, similar to the [PKU-SafeRLHF dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF).

Every QA pair, regardless of source, is post-processed by GPT-4 according to specific guidelines. This pipeline is intended to keep the quality of the instruction-following data consistently high.

## Dataset Comparison

We fine-tune several base models (Llama2-7B, Llama3-8B, Qwen2-7B) separately on 50K samples from Align-Anything-Instruction-100K-zh and on 50K samples from Firefly. We then evaluate the fine-tuned models on the [Just-Eval](https://huggingface.co/datasets/re-align/just-eval-instruct) benchmark, translating the evaluation prompts into Chinese before assessment. The models are scored on five dimensions: helpfulness, clarity, factuality, depth, and engagement. Models trained on our dataset perform well on all five dimensions; the figure below shows the results.

<div align="center">
  <img src="performance.png" width="70%"/>
</div>

## Usage

To load our dataset, use the `load_dataset()` function as follows:

```python
from datasets import load_dataset

dataset = load_dataset("PKU-Alignment/Align-Anything-Instruction-100K-zh")
```
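The comparison above relies on 50K-sample subsets, but the card does not say how those samples were drawn. As a minimal sketch, assuming uniform random sampling without replacement (an assumption, not the authors' documented procedure), the following picks a reproducible set of row indices. The helper name `sample_indices` and the seed value are hypothetical and used only for illustration.

```python
import random

TOTAL_PAIRS = 104_550  # dataset size stated in the card
SUBSET_SIZE = 50_000   # subset size used in the comparison


def sample_indices(n_total: int, n_sample: int, seed: int = 0) -> list[int]:
    """Return n_sample distinct row indices drawn uniformly from range(n_total).

    Hypothetical helper: uniform sampling is an assumption, since the card
    does not describe how the 50K subsets were selected.
    """
    if not 0 <= n_sample <= n_total:
        raise ValueError("n_sample must be between 0 and n_total")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return sorted(rng.sample(range(n_total), n_sample))


indices = sample_indices(TOTAL_PAIRS, SUBSET_SIZE)
print(len(indices))
```

With the `datasets` library, such an index list could be passed to `Dataset.select`, for example `dataset["train"].select(indices)`, assuming the data sits in a split named `train` (the card does not confirm the split name).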

Provider:
maas
Created:
2025-02-07