
Alpaca-CoT

ModelScope community · updated 2026-05-23 · indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/swift/Alpaca-CoT
# Instruction-Finetuning Dataset Collection (Alpaca-CoT)

This repository continuously collects instruction-tuning datasets and standardizes them into a single format that can be loaded directly by the [code](https://github.com/PhoebusSi/alpaca-CoT) of the Alpaca model.

We have also conducted an empirical study of various instruction-tuning datasets based on the Alpaca model, as shown in [https://github.com/PhoebusSi/alpaca-CoT](https://github.com/PhoebusSi/alpaca-CoT).

If you find this dataset collection helpful, please `like` this dataset and `star` our [github project](https://github.com/PhoebusSi/alpaca-CoT)!

You are warmly welcome to provide us with any instruction-tuning datasets (or their sources) that we have not yet collected. We will convert them to the uniform format, train the Alpaca model on them, and open-source the model checkpoints.

# Contribute

Welcome to join us and become a contributor to this project! If you want to share a dataset, adjust the data to the following format:

```
example.json

[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
```

The folder should look like this:

```
Alpaca-CoT
|
|----example
|    |
|    |----example.json
|    |
|    ----example_context.json
...
```

Create a new pull request in [Community](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/discussions) and publish your branch when you are ready. We will merge it as soon as we can.

# Data Usage and Resources

## Data Format

All data in this folder is formatted with the same template, where each sample is as follows:

```
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
```

## alpaca

#### alpaca_data.json

> This dataset is published by [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca). It contains 52K English instruction-following samples obtained with the [Self-Instruct](https://github.com/yizhongw/self-instruct) technique.

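Because every file in the collection follows the instruction/input/output format described above, one small loader can be reused across datasets. The following is a minimal sketch, not the repository's own loading code: the function names `validate_samples` and `load_samples` are hypothetical, and it assumes that each sample carries all three keys as strings, with `input` allowed to be an empty string.

```python
import json
from pathlib import Path

# The three fields named in the shared data format.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_samples(samples):
    """Return `samples` unchanged if they match the shared format, else raise ValueError.

    Assumption: every key is present and holds a string; "input" may be "".
    """
    if not isinstance(samples, list):
        raise ValueError("top-level JSON value must be a list of samples")
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"sample {index} is not a JSON object")
        for key in REQUIRED_KEYS:
            if not isinstance(sample.get(key), str):
                raise ValueError(f"sample {index}: field {key!r} must be a string")
    return samples


def load_samples(path):
    """Read one dataset file (for example a local copy of alpaca_data.json)."""
    with Path(path).open(encoding="utf-8") as handle:
        return validate_samples(json.load(handle))


if __name__ == "__main__":
    demo = [{"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}]
    print(len(validate_samples(demo)), "valid sample(s)")
```

Validating up front is a design choice for this sketch: a malformed contribution then fails with a message that names the offending sample rather than surfacing later during training.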
#### alpaca_data_cleaned.json

> This dataset is obtained [here](https://github.com/tloen/alpaca-lora). It is a revised version of `alpaca_data.json` with various tokenization artifacts stripped.

## alpacaGPT4

#### alpaca_gpt4_data.json

> This dataset is published by [Instruction-Tuning-with-GPT-4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM). It contains 52K English instruction-following samples generated by GPT-4 using Alpaca prompts for fine-tuning LLMs.

#### alpaca_gpt4_data_zh.json

> This dataset is generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT.

<!--
## belle_cn

#### belle_data_cn.json

This dataset is published by [BELLE](https://github.com/LianjiaTech/BELLE). It contains 0.5M Chinese instruction-following samples, which are also generated with the [Self-Instruct](https://github.com/yizhongw/self-instruct) technique.

#### belle_data1M_cn.json

This dataset is published by [BELLE](https://github.com/LianjiaTech/BELLE). It contains 1M Chinese instruction-following samples. The data in `belle_data_cn.json` and `belle_data1M_cn.json` do not overlap.
-->

## Chain-of-Thought

#### CoT_data.json

> This dataset is obtained by formatting the combination of 9 CoT datasets published by [FLAN](https://github.com/google-research/FLAN). It contains 9 CoT tasks involving 74771 samples.

#### CoT_CN_data.json

> This dataset is obtained by translating `CoT_data.json` into Chinese, using Google Translate (en2cn).

#### formatted_cot_data folder

> This folder contains the formatted English data for each CoT dataset.

#### formatted_cot_data folder

> This folder contains the formatted Chinese data for each CoT dataset.

## CodeAlpaca

#### code_alpaca.json

> This dataset is published by [codealpaca](https://github.com/sahil280114/codealpaca). It covers code generation tasks and contains 20022 samples.

## finance

#### finance_en.json

> This dataset is collected from [here](https://huggingface.co/datasets/gbharti/finance-alpaca).
It contains 68912 finance-related instructions in English.

## firefly

#### firefly.json

> This dataset is collected from [here](https://github.com/yangjianxin1/Firefly). It contains 1649398 Chinese instructions across 23 NLP tasks.

## GPT4all

#### gpt4all.json

> This dataset is collected from [here](https://github.com/nomic-ai/gpt4all). It contains 806199 English instructions covering code, story, and dialogue tasks.

#### gpt4all_without_p3.json

> gpt4all with the BigScience/P3 data removed; it contains 437605 samples.

## GPTeacher

#### GPTeacher.json

> This dataset is collected from [here](https://github.com/teknium1/GPTeacher). It contains 29013 English instructions generated by GPT-4, covering General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.

## Guanaco

#### GuanacoDataset.json

> This dataset is collected from [here](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset). It contains 534610 English instructions generated by text-davinci-003. It builds on the 175 tasks from the Alpaca model by providing rewrites of the seed tasks in different languages and by adding new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.

#### Guanaco_additional_Dataset.json

> A new, larger additional dataset for different languages.

## HC3

#### HC3_ChatGPT.json/HC3_Human.json

> This dataset is collected from [here](https://huggingface.co/datasets/Hello-SimpleAI/HC3). It contains 37175 English/Chinese instructions with answers written by ChatGPT and by humans.

#### HC3_ChatGPT_deduplication.json/HC3_Human_deduplication.json

> The HC3 dataset with duplicate instructions removed.

## instinwild

#### instinwild_en.json & instinwild_cn.json

> The two datasets are obtained [here](https://github.com/XueFuzhao/InstructionWild). They contain 52191 English and 51504 Chinese instructions, collected from Twitter, where users tend to share interesting prompts, mostly of the generation, open QA, and brainstorming types.
(Colossal AI used these datasets to train the ColossalChat model.)

## instruct

#### instruct.json

> This dataset is obtained [here](https://huggingface.co/datasets/swype/instruct). It contains 888969 English instructions, produced by augmentation using the advanced NLP tools provided by AllenAI.

## Natural Instructions

#### natural-instructions-1700tasks.zip

> This dataset is obtained [here](https://github.com/allenai/natural-instructions). It contains 5040134 instructions collected from diverse NLP tasks.

## prosocial dialog

#### natural-instructions-1700tasks.zip

> This dataset is obtained [here](https://huggingface.co/datasets/allenai/prosocial-dialog). It contains 165681 English instructions, produced from GPT-3 rewrites of questions and human feedback.

## xP3

#### natural-instructions-1700tasks.zip

> This dataset is obtained [here](https://huggingface.co/datasets/bigscience/xP3). It contains 78883588 instructions, collected from prompts and datasets across 46 languages and 16 NLP tasks.

## Chinese-instruction-collection

> All datasets in the Chinese instruction collection.

## combination

#### alcapa_plus_belle_data.json

> This dataset is the combination of English `alpaca_data.json` and Chinese `belle_data_cn.json`.

#### alcapa_plus_cot_data.json

> This dataset is the combination of English `alpaca_data.json` and CoT `CoT_data.json`.

#### alcapa_plus_belle_cot_data.json

> This dataset is the combination of English `alpaca_data.json`, Chinese `belle_data_cn.json` and CoT `CoT_data.json`.

## Citation

Please cite the repo if you use the data collection, code, or experimental findings in this repo.
```
@misc{alpaca-cot,
  author = {Qingyi Si and Zheng Lin},
  school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
  title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
}
```

Please also cite the original Stanford Alpaca, BELLE, and FLAN papers.
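As a practical note on the combination files listed earlier: since all datasets share the instruction/input/output format, a combined file is in principle a concatenation of sample lists. The sketch below illustrates that idea only; it is not the script used to build the published files. The function name `combine_datasets` and its `drop_duplicates` option are hypothetical, and exact-match duplicate removal is an assumption of this example rather than a documented step.

```python
import json


def combine_datasets(*datasets, drop_duplicates=False):
    """Concatenate sample lists that share the instruction/input/output format.

    With drop_duplicates=True, a sample is skipped when an identical
    (instruction, input, output) triple has already been seen.
    """
    combined = []
    seen = set()
    for dataset in datasets:
        for sample in dataset:
            key = (sample["instruction"], sample["input"], sample["output"])
            if drop_duplicates and key in seen:
                continue
            seen.add(key)
            combined.append(sample)
    return combined


if __name__ == "__main__":
    # Tiny made-up samples standing in for alpaca_data.json and CoT_data.json.
    alpaca = [{"instruction": "Name a primary color.", "input": "", "output": "Red."}]
    cot = [{
        "instruction": "What is 2 + 3? Think step by step.",
        "input": "",
        "output": "2 + 3 = 5. The answer is 5.",
    }]
    merged = combine_datasets(alpaca, cot)
    print(json.dumps(merged, ensure_ascii=False, indent=2))
```

Samples keep their original order here, so shuffling, if wanted, is left to the training pipeline.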

Provider:
maas
Created:
2024-06-05