databricks_dolly_15k

ModelScope community; updated 2025-12-05; indexed 2025-02-15
Download link:
https://modelscope.cn/datasets/HuggingFaceH4/databricks_dolly_15k
Resource description:
# Dataset Card for Dolly_15K

# Summary

`databricks-dolly-15k` is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT](https://arxiv.org/abs/2203.02155) paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).

Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: English

Version: 1.0

**Owner: Databricks, Inc.**

# Dataset Overview

`databricks-dolly-15k` is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and to select only questions they could reasonably be expected to answer correctly.

For certain categories, contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the `context` field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. `[42]`), which we recommend users remove for downstream applications.

# Intended Uses

While immediately valuable for instruction fine-tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short response, with the resulting text associated with the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

# Dataset

## Purpose of Collection

As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

## Sources

- **Human-generated data**: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
- **Wikipedia**: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization), contributors selected passages from Wikipedia. No guidance was given to annotators as to how to select the target passages.

## Annotator Guidelines

To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance to an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.

The annotation guidelines for each of the categories are as follows:

- **Creative Writing**: Write a question or instruction that requires a creative, open-ended written response. The instruction should be reasonable to ask of a person with general world knowledge and should not require searching. In this task, your prompt should give very specific instructions to follow. Constraints, instructions, guidelines, or requirements all work, and the more of them the better.
- **Closed QA**: Write a question or instruction that requires a factually correct response based on a passage of text from Wikipedia. The question can be complex and can involve human-level reasoning capabilities, but should not require special knowledge. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Open QA**: Write a question that can be answered using general world knowledge or at most a single search. This task asks for opinions and facts about the world at large and does not provide any reference text for consultation.
- **Summarization**: Give a summary of a paragraph from Wikipedia. Please don't ask questions that will require more than 3-5 minutes to answer. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Information Extraction**: These questions involve reading a paragraph from Wikipedia and extracting information from the passage. Everything required to produce an answer (e.g. a list, keywords, etc.) should be included in the passages. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Classification**: These prompts contain lists or examples of entities to be classified, e.g. movie reviews, products, etc. In this task the text or list of entities under consideration is contained in the prompt (i.e. there is no reference text). You can choose any categories for classification you like, the more diverse the better.
- **Brainstorming**: Think up lots of examples in response to a question asking to brainstorm ideas.

## Personal or Sensitive Data

This dataset contains public information (e.g., some information from Wikipedia). To our knowledge, it contains no personal identifiers of private individuals and no sensitive information.

## Language

American English

# Known Limitations

- Wikipedia is a crowdsourced corpus and the contents of this dataset may reflect the bias, factual errors and topical focus found in Wikipedia
- Some annotators may not be native English speakers
- Annotator demographics and subject matter may reflect the makeup of Databricks employees

# License/Attribution

**Copyright (2023) Databricks, Inc.**

This dataset was developed at Databricks (https://www.databricks.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/

Copyright © Wikipedia editors and contributors.
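
Two suggestions in the card lend themselves to a short illustration: removing bracketed citation numbers from the `context` field, and using contributor-written prompts as few-shot examples. The following is a minimal sketch, not a reference implementation. It assumes each record is a plain dictionary with `instruction` and `context` keys (only `context` is named in the card text), that citation markers are purely numeric like `[42]`, and that the prompt wording is a placeholder. The function names are hypothetical.

```python
import re
from typing import Dict, List

# Matches bracketed numeric citation markers such as "[42]".
# Assumption: markers are purely numeric; bracketed notes like "[a]" are left alone.
_CITATION = re.compile(r"\[\d+\]")


def strip_citations(text: str) -> str:
    """Remove bracketed citation numbers and tidy the spacing left behind."""
    cleaned = _CITATION.sub("", text)
    # Collapse runs of spaces or tabs created by the removal.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Drop a space stranded in front of punctuation, e.g. "Seine ." -> "Seine."
    cleaned = re.sub(r" +([.,;:!?])", r"\1", cleaned)
    return cleaned.strip()


def clean_record(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with a cleaned `context`; the input is not mutated."""
    cleaned = dict(record)
    cleaned["context"] = strip_citations(record.get("context", ""))
    return cleaned


def build_few_shot_prompt(records: List[Dict[str, str]], k: int = 3) -> str:
    """Format up to k instructions as numbered examples, ending with an open slot.

    The header wording is a placeholder, not taken from the dataset card
    or from the Self-Instruct paper.
    """
    shots = records[:k]
    lines = ["Write a new instruction in the style of these examples."]
    for i, rec in enumerate(shots, start=1):
        lines.append(f"Instruction {i}: {rec['instruction']}")
    # Leave the next numbered slot empty for the model to complete.
    lines.append(f"Instruction {len(shots) + 1}:")
    return "\n".join(lines)
```

Calling `clean_record` on each record before training or prompting keeps the original data untouched, which makes it easy to compare cleaned and raw `context` values when checking whether the pattern is too aggressive for a given passage.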

Provider: maas
Created: 2025-02-10