five

OIG

收藏
魔搭社区2025-12-04 更新2025-11-22 收录
下载链接:
https://modelscope.cn/datasets/laion/OIG
下载链接
链接失效反馈
官方服务:
资源简介:
# This is the Open Instruction Generalist Dataset This is our attempt to create a large instruction dataset of medium quality along with a smaller high quality instruciton dataset (OIG-small-chip2). The data is in the form of jsonl objects, with at least a 'text' field. Some datasets may also include a 'metadata' field. The 'text' field contains a string of the form of one or more of: - \<human\>: instruction\n\<bot\>: response - \<human\>: instruction\n\<bot\>: response .. \<human\>: instruction\n\<bot\>: response The purpose of the larger dataset is to perform continued pre-training, followed by a finetune on the smaller high quality dataset. The purpose of the smaller OIG-small-chip2 dataset is to make it easy to convert a language model pretrained on large amounts of text into an instruction following model using a small amount of additional compute via finetuning or softprompt tuning. Many additional datasets are being prepared by various community members and will be incorporated into this dataset as we are able to verify the quality and formatting of the data. Our goal is to make helpful and non-toxic instruction tuned models available to everyone. OIG is currently at 44M. We will continue to publish ever larger diverse instruction datasets with the goal of creating 1 trillion tokens of diverse instructions - enough to pretrain an LLM from scratch. It is best to download the individual jsonl files directly that you wish to use instead of using HF load_datasets. https://huggingface.co/datasets/laion/OIG/tree/main ## unified_abstract_infill.jsonl (~232000) dbpedia and wikipedia snippets combined with a small portion of https://github.com/google-research/dialog-inpainting ## unified_basic.jsonl (30) ## unified_conv_finqa.jsonl (~9000) https://github.com/czyssrs/ConvFinQA ## unified_cuad.jsonl (~500) https://www.atticusprojectai.org/cuad ## unified_essays.jsonl (~2000) - essays available on the public web ## unified_grade_school_math_instructions.jsonl (~9000) - https://github.com/openai/grade-school-math ## unified_hc3_human.jsonl (~58000) ## unified_image_prompts_instructions.jsonl (~15000) - A very small subset of LAION-400M ## unified_joke_explanations.jsonl (356) - Crawled from public internet. ## unified_mathqa_flanv2_kojma_cot.jsonl (~107000) - https://huggingface.co/datasets/math_qa, ## unified_merged_code_xp3.jsonl (~67000) - https://huggingface.co/datasets/bigscience/xP3 ## unified_multi_news.jsonl (~90000) - https://www.tensorflow.org/datasets/catalog/multi_news ## unified_multi_sum.jsonl (~1700000) ## unified_nq.jsonl (~307000) ## unified_openai_summarize_tldr.jsonl (~233000) - https://github.com/openai/summarize-from-feedback ## unified_oscar_en_sample_dialog.jsonl (~2670000) - https://oscar-project.org/ - https://huggingface.co/datasets/TurkuNLP/register_oscar ## unified_plot_screenplay_books_dialog.jsonl (~8000) - https://github.com/markriedl/WikiPlots extracted from Wikipedia, snippets from the Pile’s https://huggingface.co/datasets/the_pile_books3, and snippets of screenplays available on the public web. ## unified_sqlv1.jsonl (~17000) - public text 2 sql datasets. ## unified_sqlv2.jsonl(~24000) - public text 2 sql datasets. ## unified_squad_v2.jsonl (~19000) - https://rajpurkar.github.io/SQuAD-explorer/ ## unified_squad_v2_more_neg.jsonl (~19000) - https://rajpurkar.github.io/SQuAD-explorer/ ## unified_ul2_plus_oscar_en_sample_dialog.jsonl (~2900000) - https://oscar-project.org/ - https://huggingface.co/datasets/TurkuNLP/register_oscar ## unified_unifiedskg_instructions.jsonl (~223000) - https://github.com/HKUNLP/UnifiedSKG ## unified_unnatural_instructions.jsonl (~238000) - https://github.com/orhonovich/unnatural-instructions ## unified_xp3_sample.jsonl (~188000) - https://huggingface.co/datasets/bigscience/xP3 ## unified_canadian_parliament.jsonl(~301000) - https://openparliament.ca/data-download/ ## unified_poetry_2_song.jsonl (~12000) - https://huggingface.co/datasets/merve/poetry - https://huggingface.co/datasets/matthh/gutenberg-poetry-corpus ## unified_flan.jsonl (~2700000) - https://github.com/google-research/FLAN/tree/main/flan/v2 ## unified_ni.jsonl (~256000) - https://github.com/allenai/natural-instructions ## unified_p3.jsonl (~31000000) - https://huggingface.co/datasets/bigscience/P3 ## unified_soda_dialog.jsonl (~1200000) - https://huggingface.co/datasets/allenai/soda ## unified_rallio_soda_upgraded_2048.jsonl (~210000) - https://huggingface.co/datasets/allenai/soda - a newer version of the unified_soda_dialog dataset, with multiple dialogs on one line - recommend to use either the unified_soda_dailog.jsonl or unified_rallio_soda_upgraded_2048, and not both. ## unified_rallio_safety_and_prosocial.jsonl (~319000) - Generated from public datasets and generated from Wiki similar to the chip2 data - Find a full list in the end of the document - This dataset also includes https://huggingface.co/datasets/allenai/prosocial-dialog and https://huggingface.co/datasets/Anthropic/hh-rlhf ## unified-chip2.jsonl / OIG-small-chip2 (~210000): This dataset was created as part of the LAION OA effort by @rallio67 and other members of the LAION contributors. It is a high quality dataset intended to be mixed into a large pre-train dataset and can be used for a final finetune. Chip2 contains: ### Python Code Examples (~6,000): A set of instruction / response pairs where the User requests the agent to generate a python function. These examples were generated using a large language model and few shot prompting with python code verified to execute. There are also ~3000 examples of manually curated one line python code examples from the Conala publication (see: https://conala-corpus.github.io/) ### Natural Instruction Examples (~124,000): A balanced set of diverse natural and factual questions and answers made using few shot prompted UL2 20B and an instruction tuned GPT-NeoX-20B model (Chip) and then rejection sampled using multiple automatic evaluations to remove low quality outputs and to filter out factually inaccurate answers. Also includes some filtered natural instructions from Anthropic Helpful instructions (see: https://github.com/anthropics/hh-rlhf). ### Generic Harmless Instruction Examples (~6,500): A set of instruction / response pairs sourced from the Anthropic redteam paper github (see: https://github.com/anthropics/hh-rlhf). This dataset includes a lot of data regarding real humans trying to make the Anthropic language models say harmful/toxic/trolling things. For this dataset only examples that were rated lowly on the harmful scale (0,1,2 out of 4, where 4 is the most toxic) were included. Again, only the first lines of dialogue (instruction, first_agent_response) were retained. ### Instruction/Responses with Lists (~14,000): A set of filtered and reformatted instruction / response pairs where the agent response contains a list. Sourced from the Anthropic github (see: https://github.com/anthropics/hh-rlhf). Sourced from wikihow text lists created by b-mc2 (https://huggingface.co/datasets/b-mc2/wikihow_lists). And rejection filtered instruction response pairs generated by Chip20B that contained lists. All lists are formatted in a similar style. ### Follow-up questions (~12,500): Examples of instructions and responses where an appropriate response is to ask for more information from the prompter. These examples were generated from a combination of few shot prompted UL2 20B (to generate natural questions) and a large dialogue prompted language model to generate the responses containing follow-up questions. ### Wikipedia Toxic Adversarial Questions (~12,000): Questions and answers generated from wikipedia articles that discuss potentially sensitive topics (flagged as potentially toxic by an early toxicity detection model). ### Grade School Math GSM8K (~9,000): GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning. (https://github.com/openai/grade-school-math) ### Reasoning Instructions (~4,500): Examples from the Com2Sense and Strategy QA datasets that were reformatted into natural instructions using large language models with few shot prompting and additional quality filtering steps. ### Character and Scene Descriptions (~30,000): Examples of instructions and responses for the generation of character or scene descriptions. Scenes were sourced from video game wikis and reformatted into instruction / response format using large language models or generated by few shot prompting with large language models. ## Support this project Your contributions and feedback support the open source ecosystem, improve the bot and provide datasets for future AI research. To participate you can: Submit Github issues, track issues and help create datasets that need improvement. https://github.com/LAION-AI/Open-Instruction-Generalist Join our Discord to talk with other team members working on this! https://discord.gg/xBPBXfcFHd ## Update: March 20, 2023 - Added the metadata column to all datasets to alleviate issues with HF datasets loader. - Broke some of the p3 dialogs into parts for ease of loading. ## Disclaimer These datasets contain synthetic data and in some cases data that includes humans trying to get the language model to say toxic/offensive/trolling things. If you are concerned about the presence of this type of material in the dataset please make sure you carefully inspect each of the entries and filter appropriately. Our goal is for the model to be as helpful and non-toxic as possible and we are actively evaluating ways to reduce or eliminate undesirable content from the instruction tuning datasets. ## License The OIG dataset that is authored by LAION volunteers is released under an Apache 2.0 license. However, the data also includes content licensed under other permissive licenses such as Wikipedia data which is licensed under CC-BY-SA, or web-crawled data which is used under fair use principles. ## Acknowledgement - We would like to thank all of our amazing LAION volunteers including: @Rallio, @Jue, @Ce Zhang, @Player-1, @Laurel, @danielpatrickhug, @Jjmachan, @Mylo, @Khalid, @Coco.han, @Jordiclive, @Pszemraj, all volunteers from the Open Assistant project who initially created synthetic data, and many others. - We would like to thank Together for their tireless dedication to the open source and AI community and their contribution to many of the datasets. - We would like to thank AI Horde and user @Db0 for their incredible contribution of filtered data that were flagged as unethical. - Please check out our related project: https://github.com/LAION-AI/Open-Assistant for our work in human feedback gathering and RLHF. - Lastly, Ontocord.ai’s founders are grateful to have the opportunity to create a portion of the data augmentation and safety-moderation code for this project.

# 本数据集为开放指令通用数据集(Open Instruction Generalist Dataset, OIG) 本项目旨在构建一个中等质量的大规模指令数据集,以及一个体量更小的高质量指令数据集(OIG-small-chip2)。 本数据集以jsonl对象形式存储,每条数据至少包含一个`text`字段,部分数据集还会附带`metadata`元数据字段。`text`字段的字符串格式可包含以下一种或多种形式: - `<human>`: 指令 `<bot>`: 回复 - `<human>`: 指令 `<bot>`: 回复 … `<human>`: 指令 `<bot>`: 回复 该大规模数据集的使用场景为持续预训练,随后在小型高质量数据集上进行微调。 小型的OIG-small-chip2数据集的设计目标是:通过少量额外计算资源,经微调或软提示调优(softprompt tuning),将在海量文本上预训练的语言模型快速转换为遵循指令的模型。 目前已有多位社区成员正在筹备更多数据集,待完成质量与格式验证后将被纳入本数据集。我们的目标是将实用且无毒的指令微调模型开放给所有用户。 当前OIG数据集规模达44M Token,我们将持续发布体量更大、类型更多样的指令数据集,目标是构建总计1万亿Token的多样化指令数据——足够从零开始预训练一个大语言模型(Large Language Model, LLM)。 建议直接下载所需的单个jsonl文件,而非使用Hugging Face的`load_datasets`接口。相关文件可访问:https://huggingface.co/datasets/laion/OIG/tree/main ## 各数据集详情 ### unified_abstract_infill.jsonl(约232000条) 结合了DBpedia与维基百科片段,以及来自https://github.com/google-research/dialog-inpainting的小部分数据。 ### unified_basic.jsonl(共30条) ### unified_conv_finqa.jsonl(约9000条) 数据来源:https://github.com/czyssrs/ConvFinQA ### unified_cuad.jsonl(约500条) 数据来源:https://www.atticusprojectai.org/cuad ### unified_essays.jsonl(约2000条) - 公开网络上的随笔文本 ### unified_grade_school_math_instructions.jsonl(约9000条) - 数据来源:https://github.com/openai/grade-school-math ### unified_hc3_human.jsonl(约58000条) ### unified_image_prompts_instructions.jsonl(约15000条) - LAION-400M的极小子集 ### unified_joke_explanations.jsonl(共356条) - 从公开互联网爬取获取 ### unified_mathqa_flanv2_kojma_cot.jsonl(约107000条) - 数据来源:https://huggingface.co/datasets/math_qa ### unified_merged_code_xp3.jsonl(约67000条) - 数据来源:https://huggingface.co/datasets/bigscience/xP3 ### unified_multi_news.jsonl(约90000条) - 数据来源:https://www.tensorflow.org/datasets/catalog/multi_news ### unified_multi_sum.jsonl(约1700000条) ### unified_nq.jsonl(约307000条) ### unified_openai_summarize_tldr.jsonl(约233000条) - 数据来源:https://github.com/openai/summarize-from-feedback ### unified_oscar_en_sample_dialog.jsonl(约2670000条) - 数据来源:https://oscar-project.org/ - 数据来源:https://huggingface.co/datasets/TurkuNLP/register_oscar ### unified_plot_screenplay_books_dialog.jsonl(约8000条) - 数据来自https://github.com/markriedl/WikiPlots(从维基百科提取)、The Pile数据集的https://huggingface.co/datasets/the_pile_books3片段,以及公开网络上的剧本片段 ### unified_sqlv1.jsonl(约17000条) - 公开的文本转SQL数据集 ### unified_sqlv2.jsonl(约24000条) - 公开的文本转SQL数据集 ### unified_squad_v2.jsonl(约19000条) - 数据来源:https://rajpurkar.github.io/SQuAD-explorer/ ### unified_squad_v2_more_neg.jsonl(约19000条) - 数据来源:https://rajpurkar.github.io/SQuAD-explorer/ ### unified_ul2_plus_oscar_en_sample_dialog.jsonl(约2900000条) - 数据来源:https://oscar-project.org/ - 数据来源:https://huggingface.co/datasets/TurkuNLP/register_oscar ### unified_unifiedskg_instructions.jsonl(约223000条) - 数据来源:https://github.com/HKUNLP/UnifiedSKG ### unified_unnatural_instructions.jsonl(约238000条) - 数据来源:https://github.com/orhonovich/unnatural-instructions ### unified_xp3_sample.jsonl(约188000条) - 数据来源:https://huggingface.co/datasets/bigscience/xP3 ### unified_canadian_parliament.jsonl(约301000条) - 数据来源:https://openparliament.ca/data-download/ ### unified_poetry_2_song.jsonl(约12000条) - 数据来源:https://huggingface.co/datasets/merve/poetry - 数据来源:https://huggingface.co/datasets/matthh/gutenberg-poetry-corpus ### unified_flan.jsonl(约2700000条) - 数据来源:https://github.com/google-research/FLAN/tree/main/flan/v2 ### unified_ni.jsonl(约256000条) - 数据来源:https://github.com/allenai/natural-instructions ### unified_p3.jsonl(约31000000条) - 数据来源:https://huggingface.co/datasets/bigscience/P3 ### unified_soda_dialog.jsonl(约1200000条) - 数据来源:https://huggingface.co/datasets/allenai/soda ### unified_rallio_soda_upgraded_2048.jsonl(约210000条) - 数据来源:https://huggingface.co/datasets/allenai/soda - 该数据集是unified_soda_dialog.jsonl的更新版本,支持一行内存储多轮对话 - 建议仅选用unified_soda_dialog.jsonl或unified_rallio_soda_upgraded_2048.jsonl中的其一,勿同时使用二者 ### unified_rallio_safety_and_prosocial.jsonl(约319000条) - 数据来自公开数据集,以及类似chip2数据的维基百科生成内容 - 完整数据集列表详见文档末尾 - 本数据集还包含https://huggingface.co/datasets/allenai/prosocial-dialog与https://huggingface.co/datasets/Anthropic/hh-rlhf的数据 ### 对应文件unified-chip2.jsonl的OIG-small-chip2数据集(约210000条) 该数据集是LAION开放AI项目的一部分,由用户@rallio67与LAION的其他贡献者共同创建。它是一个高质量数据集,可混入大规模预训练数据集,也可用于最终微调环节。Chip2数据集包含以下内容: #### Python代码示例(约6000条) 该部分为指令/回复对集合,用户要求模型生成Python函数。这些示例通过大语言模型结合少样本提示生成,且对应的Python代码均经过可执行性验证。此外还包含约3000条来自Conala公开数据集的人工整理单行Python代码示例(详见:https://conala-corpus.github.io/)。 #### 自然指令示例(约124000条) 通过少样本提示UL2 20B模型与指令微调后的GPT-NeoX-20B模型(Chip)生成多样化的均衡自然事实问答对,随后通过多维度自动评估进行拒绝采样,剔除低质量输出与事实不准确的回复。该数据集还包含从Anthropic Helpful指令集中过滤得到的部分数据(详见:https://github.com/anthropics/hh-rlhf)。 #### 通用无害指令示例(约6500条) 该部分指令/回复对来自Anthropic红队测试论文的开源仓库(详见:https://github.com/anthropics/hh-rlhf)。该原始数据集包含大量真实人类尝试诱导Anthropic语言模型生成有害、冒犯性或引战内容的案例。本数据集仅保留了有害性评分较低的样本(评分0、1、2,满分4分,4分代表毒性最强),且仅保留了首轮对话(指令、首轮模型回复)。 #### 带列表的指令/回复对(约14000条) 该部分为经过过滤与重格式化的指令/回复对,其中模型回复包含列表内容。数据来源包括:Anthropic开源仓库(详见:https://github.com/anthropics/hh-rlhf)、由b-mc2整理的wikihow文本列表(https://huggingface.co/datasets/b-mc2/wikihow_lists),以及由Chip20B生成并经拒绝过滤的带列表指令回复对。所有列表均采用统一格式。 #### 跟进式提问示例(约12500条) 该类示例的指令与回复场景为:模型应向提问者询问更多信息以完善回复。这些示例通过结合少样本提示的UL2 20B模型(生成自然问句)与经过大规模对话提示的语言模型(生成包含跟进提问的回复)生成。 #### 维基百科敏感对抗性问题(约12000条) 从讨论敏感话题的维基百科文章中生成的问答对,且这些话题被早期毒性检测模型标记为潜在有害内容。 #### 小学年级数学GSM8K数据集(约9000条) GSM8K数据集包含8500条高质量、语言多样化的小学低年级数学应用题,由人工出题者创作。数据集分为7500条训练样本与1000条测试样本,解题需2至8个步骤,主要通过基础算术运算(+ − × ÷)完成最终求解,具备中等水平的中学生即可独立完成所有题目,可用于多步数学推理任务。(详见:https://github.com/openai/grade-school-math) #### 推理指令示例(约4500条) 该部分示例来自Com2Sense与Strategy QA数据集,通过大语言模型结合少样本提示与额外质量过滤步骤,将原始数据重格式化为自然指令形式。 #### 角色与场景描述示例(约30000条) 用于生成角色或场景描述的指令与回复对。场景数据来自电子游戏维基百科,并通过大语言模型重格式化为指令/回复对,或通过大语言模型少样本提示直接生成。 ## 支持本项目 您的贡献与反馈将助力开源生态建设,优化模型性能,并为未来的人工智能研究提供数据集。您可以通过以下方式参与: 1. 提交GitHub Issue、跟踪问题进度,并协助改进待完善的数据集:https://github.com/LAION-AI/Open-Instruction-Generalist 2. 加入我们的Discord社区,与项目团队成员交流协作:https://discord.gg/xBPBXfcFHd ## 更新记录(2023年3月20日) 1. 为所有数据集添加`metadata`元数据字段,以解决Hugging Face数据集加载器的兼容问题。 2. 将部分P3对话拆分为多个片段,便于加载。 ## 免责声明 本数据集包含合成数据,在部分样本中还包含人类尝试诱导语言模型生成有害、冒犯性或引战内容的案例。若您担忧此类内容出现在数据集中,请务必仔细检查每条数据并进行适当过滤。我们的目标是打造实用且低毒性的模型,目前正积极评估各种方案以减少或消除指令微调数据集中的不良内容。 ## 许可证 由LAION志愿者创作的OIG数据集采用Apache 2.0许可证开源。但本数据集还包含其他基于宽松许可证的内容:例如维基百科数据采用CC-BY-SA许可证,网络爬取数据则基于合理使用原则使用。 ## 致谢 1. 感谢所有出色的LAION志愿者,包括:@Rallio、@Jue、@Ce Zhang、@Player-1、@Laurel、@danielpatrickhug、@Jjmachan、@Mylo、@Khalid、@Coco.han、@Jordiclive、@Pszemraj,以及最初参与合成数据创建的Open Assistant项目的所有志愿者,以及其他众多贡献者。 2. 感谢Together团队对开源与人工智能社区的不懈投入,以及他们为本项目诸多数据集提供的支持。 3. 感谢AI Horde与用户@Db0贡献了经过过滤的、被标记为不伦理的数据集样本。 4. 欢迎访问我们的相关项目:https://github.com/LAION-AI/Open-Assistant,了解我们在人类反馈收集与基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)方面的工作。 5. 最后,Ontocord.ai的创始人感谢有机会为本项目开发部分数据增强与安全审核代码。
提供机构:
maas
创建时间:
2025-10-03
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作