databricks-dolly-15k

Name: databricks-dolly-15k
Creator: maas
Published: 2026-01-08 15:32:50
License: No description available

ModelScope Community · Updated 2026-01-08 · Indexed 2024-06-22

Download link:

https://modelscope.cn/datasets/thomas/databricks-dolly-15k

Resource description:

# Summary

`databricks-dolly-15k` is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT](https://arxiv.org/abs/2203.02155) paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).

Supported Tasks:

- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: English

Version: 1.0

**Owner: Databricks, Inc.**

# Dataset Overview

`databricks-dolly-15k` is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.

For certain categories contributors were asked to provide reference texts copied from Wikipedia.
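To make the record layout concrete, the sketch below parses a few hand-written stand-in records and counts, per category, how many carry a reference text. Only the `context` field is named in this card; the other field names (`instruction`, `response`, `category`), the category labels such as `closed_qa`, and the JSON Lines storage format are assumptions made for illustration, and the sample rows are invented rather than taken from the dataset.

```python
import json
from collections import Counter

# Hand-written stand-ins for dataset records; these are NOT real rows.
# Field names other than "context" are assumptions for this sketch.
SAMPLE_JSONL = """\
{"instruction": "Name three uses for a paperclip.", "context": "", "response": "Holding paper, resetting a router, marking a page.", "category": "brainstorming"}
{"instruction": "When was the bridge opened?", "context": "The bridge opened in 1932.[3]", "response": "1932", "category": "closed_qa"}
{"instruction": "Summarize the passage.", "context": "Tea is a drink made from leaves.[1][2]", "response": "Tea is a leaf-based drink.", "category": "summarization"}
"""


def load_records(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def summarize(records):
    """Count records per category and how many in each carry reference text."""
    totals = Counter(r["category"] for r in records)
    with_context = Counter(r["category"] for r in records if r["context"].strip())
    return totals, with_context


records = load_records(SAMPLE_JSONL)
totals, with_context = summarize(records)
for category in sorted(totals):
    print(f"{category}: {totals[category]} record(s), {with_context[category]} with context")
```

Run against a real copy of the data, a tally like this would be one quick way to check which categories actually include reference text.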
Reference text (indicated by the `context` field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. `[42]`), which we recommend users remove for downstream applications.

# Intended Uses

While immediately valuable for instruction fine-tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short response, with the resulting text associated with the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

# Dataset

## Purpose of Collection

As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

## Sources

- **Human-generated data**: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
- **Wikipedia**: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization), contributors selected passages from Wikipedia. No guidance was given to annotators as to how to select the target passages.

## Annotator Guidelines

To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance with an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.

The annotation guidelines for each of the categories are as follows:

- **Creative Writing**: Write a question or instruction that requires a creative, open-ended written response. The instruction should be reasonable to ask of a person with general world knowledge and should not require searching. In this task, your prompt should give very specific instructions to follow. Constraints, instructions, guidelines, or requirements all work, and the more of them the better.
- **Closed QA**: Write a question or instruction that requires a factually correct response based on a passage of text from Wikipedia. The question can be complex and can involve human-level reasoning capabilities, but should not require special knowledge. To create a question for this task, include both the text of the question and the reference text in the form.
- **Open QA**: Write a question that can be answered using general world knowledge or at most a single search. This task asks for opinions and facts about the world at large and does not provide any reference text for consultation.
- **Summarization**: Give a summary of a paragraph from Wikipedia. Please don't ask questions that will require more than 3-5 minutes to answer. To create a question for this task, include both the text of the question and the reference text in the form.
- **Information Extraction**: These questions involve reading a paragraph from Wikipedia and extracting information from the passage. Everything required to produce an answer (e.g. a list, keywords, etc.) should be included in the passages. To create a question for this task, include both the text of the question and the reference text in the form.
- **Classification**: These prompts contain lists or examples of entities to be classified, e.g. movie reviews, products, etc. In this task the text or list of entities under consideration is contained in the prompt (i.e. there is no reference text). You can choose any categories for classification you like; the more diverse the better.
- **Brainstorming**: Think up lots of examples in response to a question asking to brainstorm ideas.

## Personal or Sensitive Data

This dataset contains public information (e.g., some information from Wikipedia). To our knowledge, there are no private person’s personal identifiers or sensitive information.
## Language

American English

# Known Limitations

- Wikipedia is a crowdsourced corpus and the contents of this dataset may reflect the bias, factual errors and topical focus found in Wikipedia
- Some annotators may not be native English speakers
- Annotator demographics and subject matter may reflect the makeup of Databricks employees

# Citation

```
@online{DatabricksBlog2023DollyV2,
    author    = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
    title     = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
    year      = {2023},
    url       = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
    urldate   = {2023-06-30}
}
```

# License/Attribution

**Copyright (2023) Databricks, Inc.**

This dataset was developed at Databricks (https://www.databricks.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/
Copyright © Wikipedia editors and contributors.
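The Dataset Overview recommends removing bracketed Wikipedia citation numbers (e.g. `[42]`) from the `context` field before downstream use. One possible way to do that is sketched below with a regular expression. The function name `strip_citations`, the example sentence, and the choice to remove only purely numeric markers are assumptions for illustration, not a procedure prescribed by the dataset authors. Note that a pattern like this would also remove legitimate bracketed numbers, so it is worth spot-checking the output.

```python
import re

# Matches bracketed numeric citation markers such as "[42]" or "[ 7 ]".
# Assumption: citations are purely numeric; markers like "[citation needed]"
# or "[a]" are deliberately left untouched.
CITATION_RE = re.compile(r"\[\s*\d+\s*\]")


def strip_citations(text):
    """Remove numeric citation markers and tidy the spacing they leave behind."""
    cleaned = CITATION_RE.sub("", text)
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)           # collapse doubled spaces
    cleaned = re.sub(r"[ \t]+([.,;:!?])", r"\1", cleaned)  # no space before punctuation
    return cleaned.strip()


example = "Paris is the capital of France.[42] It lies on the Seine [7], in the north."
print(strip_citations(example))
```

Applied to a loaded record, this would typically be called on the `context` value only, leaving the instruction and response text unchanged.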


Provider:

maas

Created:

2024-06-05

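Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

databricks-dolly-15k is an open source instruction-following dataset containing more than 15,000 records written by hand by thousands of Databricks employees. It covers categories such as brainstorming, classification, question answering, and summarization, and is intended for training large language models to achieve ChatGPT-like interactivity. The dataset is released under the CC BY-SA 3.0 license, which permits both academic and commercial use. Its sources are human-written data and Wikipedia reference passages. The language is English, and the contents may reflect Wikipedia's biases and the backgrounds of the contributing employees.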

The content above was collected and summarized by 遇见数据集.