
databricks/databricks-dolly-15k | Natural Language Processing Dataset | Text Generation Dataset

Source: hugging_face | Updated: 2023-06-30 | Indexed: 2024-03-04
Natural Language Processing
Text Generation
Download link:
https://hf-mirror.com/datasets/databricks/databricks-dolly-15k
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
- summarization
language:
- en
size_categories:
- 10K<n<100K
---

# Summary

`databricks-dolly-15k` is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT](https://arxiv.org/abs/2203.02155) paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).

Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: English
Version: 1.0

**Owner: Databricks, Inc.**

# Dataset Overview

`databricks-dolly-15k` is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.

For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the `context` field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. `[42]`), which we recommend users remove for downstream applications.

# Intended Uses

While immediately valuable for instruction fine-tuning large language models, as a corpus of human-generated instruction prompts this dataset also presents a valuable opportunity for synthetic data generation in the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short responses, with the resulting text associated to the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

# Dataset

## Purpose of Collection

As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

## Sources

- **Human-generated data**: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
- **Wikipedia**: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization), contributors selected passages from Wikipedia for particular subsets of instruction categories. No guidance was given to annotators as to how to select the target passages.

## Annotator Guidelines

To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance to an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.

The annotation guidelines for each of the categories are as follows:

- **Creative Writing**: Write a question or instruction that requires a creative, open-ended written response. The instruction should be reasonable to ask of a person with general world knowledge and should not require searching. In this task, your prompt should give very specific instructions to follow. Constraints, instructions, guidelines, or requirements all work, and the more of them the better.
- **Closed QA**: Write a question or instruction that requires a factually correct response based on a passage of text from Wikipedia. The question can be complex and can involve human-level reasoning capabilities, but should not require special knowledge. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Open QA**: Write a question that can be answered using general world knowledge or at most a single search. This task asks for opinions and facts about the world at large and does not provide any reference text for consultation.
- **Summarization**: Give a summary of a paragraph from Wikipedia. Please don't ask questions that will require more than 3-5 minutes to answer. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Information Extraction**: These questions involve reading a paragraph from Wikipedia and extracting information from the passage. Everything required to produce an answer (e.g. a list, keywords, etc.) should be included in the passages. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Classification**: These prompts contain lists or examples of entities to be classified, e.g. movie reviews, products, etc. In this task the text or list of entities under consideration is contained in the prompt (i.e. there is no reference text). You can choose any categories for classification you like; the more diverse the better.
- **Brainstorming**: Think up lots of examples in response to a question asking to brainstorm ideas.

## Personal or Sensitive Data

This dataset contains public information (e.g., some information from Wikipedia). To our knowledge, there are no private person's personal identifiers or sensitive information.

## Language

American English

# Known Limitations

- Wikipedia is a crowdsourced corpus, and the contents of this dataset may reflect the bias, factual errors, and topical focus found in Wikipedia.
- Some annotators may not be native English speakers.
- Annotator demographics and subject matter may reflect the makeup of Databricks employees.

# Citation

```
@online{DatabricksBlog2023DollyV2,
    author    = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
    title     = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
    year      = {2023},
    url       = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
    urldate   = {2023-06-30}
}
```

# License/Attribution

**Copyright (2023) Databricks, Inc.**

This dataset was developed at Databricks (https://www.databricks.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/
Copyright © Wikipedia editors and contributors.
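The card's recommendation to strip bracketed Wikipedia citation markers from the `context` field is easy to automate. Below is a minimal sketch, assuming the Hugging Face `datasets` library; the `context` field name comes from the card, while `instruction` and `category` are assumed field names used here for illustration.

```python
import re

from datasets import load_dataset  # pip install datasets

# Load the corpus from the Hugging Face Hub (to use the mirror linked above,
# point the HF_ENDPOINT environment variable at it before running).
ds = load_dataset("databricks/databricks-dolly-15k", split="train")

# The dataset card recommends removing bracketed Wikipedia citation numbers
# such as [42] from the `context` field before downstream use.
CITATION_MARKER = re.compile(r"\[\d+\]")

def clean_context(example):
    example["context"] = CITATION_MARKER.sub("", example["context"])
    return example

ds = ds.map(clean_context)
print(ds[0]["instruction"], ds[0]["category"])  # field names assumed
```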
Providing organization:
databricks
Original Information Summary

Dataset Overview

Name: databricks-dolly-15k

Description: The dataset contains more than 15,000 records generated by thousands of Databricks employees and is intended to enable large language models to exhibit ChatGPT-like interactivity. It covers eight different instruction categories, including creative writing, closed QA, open QA, summarization, information extraction, classification, and brainstorming (a per-category tally is sketched after this overview).

Language: English

Size: 10K<n<100K

License: Creative Commons Attribution-ShareAlike 3.0 Unported License
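As a quick check on how records are spread across the instruction categories, here is a minimal sketch, again assuming the `datasets` library and a `category` field on each record (an assumption, not documented in the card):

```python
from collections import Counter

from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

# Tally records per instruction category (field name `category` is assumed).
for category, count in Counter(ds["category"]).most_common():
    print(f"{category:25s}{count:6d}")
```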

Dataset Uses

  • Training LLMs: for instruction fine-tuning of large language models.
  • Synthetic data generation: using the human-written instruction prompts as seeds for generating new data (see the sketch after this list).
  • Data augmentation: restating each prompt or short response to provide a form of regularization over the dataset.
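The synthetic-data use above can be sketched as follows. The category label `open_qa` and the prompt wrapper text are illustrative assumptions, and the call to a generative model is left to the reader:

```python
import random

from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

# Sample a few human-written prompts from one category (label value assumed).
open_qa = [r for r in ds if r["category"] == "open_qa"]
seeds = random.sample(open_qa, k=3)

# Assemble a few-shot prompt to send to a generative model of your choice.
prompt = "Write new open-ended questions in the style of these examples.\n\n"
for record in seeds:
    prompt += f"Example: {record['instruction']}\n"
prompt += "Example:"
print(prompt)
```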

Purpose of Collection

  • As part of its continuing commitment to open source, Databricks developed the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit ChatGPT-like interactivity.

Data Sources

  • Human-generated data: prompt / response pairs created by Databricks employees.
  • Wikipedia: passages selected from Wikipedia for particular subsets of instruction categories.

Annotation Guidelines

  • Creative writing: a creative, open-ended written response.
  • Closed QA: a factually correct response based on a passage of text from Wikipedia.
  • Open QA: a question answerable with general world knowledge or at most a single search.
  • Summarization: a summary of a paragraph from Wikipedia.
  • Information extraction: extracting information from a Wikipedia passage.
  • Classification: classifying lists or examples of entities contained in the prompt.
  • Brainstorming: generating many ideas in response to a question.

Language

  • American English

Known Limitations

  • The contents of the dataset may reflect the bias, factual errors, and topical focus of Wikipedia.
  • Some annotators may not be native English speakers.
  • Annotator demographics and subject matter may reflect the makeup of Databricks employees.
AI-Collected Summary
Dataset Introduction
Construction
The databricks-dolly-15k dataset is a corpus of more than 15,000 instruction-following records generated by Databricks employees. The records span eight different instruction categories: the seven outlined in the InstructGPT paper plus an open-ended free-form category. Contributors were instructed to avoid using information from any web source other than Wikipedia and were explicitly told not to use generative AI when formulating instructions or responses. Halfway through the data generation process, contributors were also given the option of answering questions posed by other contributors, further enriching the dataset.
Characteristics
The dataset is notable for its openness and diversity and may be used for academic or commercial purposes. It contains not only human-written instruction and response pairs but also reference texts excerpted from Wikipedia, which support training models on tasks such as information extraction, closed QA, and summarization. The annotation guidelines were designed to encourage creative thinking and open-ended instructions, possibly at the cost of strict compliance with a rigorous annotation rubric.
Usage
With databricks-dolly-15k, users can train models according to the dataset's instruction categories: instruction fine-tuning of large language models, synthetic data generation, and data augmentation. Contributor-written prompts can serve as few-shot examples for generating large instruction corpora across the InstructGPT categories. Both instructions and responses are also suitable for data augmentation: a paraphrasing model can restate each prompt or short response, providing a form of regularization that can lead to more robust instruction-following behavior in models derived from such synthetic datasets.
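As one way to prepare records for instruction fine-tuning, the sketch below renders each record as a single training string. The template is an illustrative convention under the assumed field names, not an official Dolly format:

```python
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_training_text(record):
    # Include the Wikipedia reference text only when the record has one.
    if record["context"]:
        prompt = (
            f"Instruction: {record['instruction']}\n"
            f"Context: {record['context']}\n"
            "Response:"
        )
    else:
        prompt = f"Instruction: {record['instruction']}\nResponse:"
    return {"text": f"{prompt} {record['response']}"}

# Map every record to a {"text": ...} example ready for a fine-tuning pipeline.
train_texts = ds.map(to_training_text, remove_columns=ds.column_names)
print(train_texts[0]["text"][:300])
```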
Background and Challenges
Background
Language models are advancing rapidly. The `databricks-dolly-15k` dataset, released by Databricks in 2023, gathers more than 15,000 instruction-following records written by its employees and is intended to help large language models exhibit ChatGPT-like interactivity. Covering behavioral categories that include creative writing, classification, closed QA, generation, information extraction, open QA, and summarization, it is the first open source, human-generated instruction corpus designed specifically for large language models, open to both academic and commercial use, and of broad research and application value.
Current Challenges
Although `databricks-dolly-15k` has proven valuable for training large language models, its construction faced several challenges. The data comes almost entirely from Databricks employees, which risks a homogeneous data source. Part of the content is drawn from Wikipedia and may carry Wikipedia's biases and factual errors. In addition, some annotators may not be native English speakers, and the annotator population may not fully represent broader social diversity; these issues warrant attention in future research and applications.
Common Scenarios
Typical Use Cases
In natural language processing, the flagship use of `databricks-dolly-15k` is instruction fine-tuning of large language models. Its instruction-following records provide rich material for training models to follow instructions on tasks such as question answering and summarization, enabling ChatGPT-like interactivity.
Practical Applications
In practice, `databricks-dolly-15k` can be used for synthetic data generation, supplying abundant instruction examples for a variety of tasks. It can also support data augmentation: rewriting instructions and responses yields more diverse training data and improves model generalization.
Derived Work
Building on `databricks-dolly-15k`, researchers can develop new instruction-tuning frameworks, explore more efficient data-augmentation strategies, and build language models that better understand and execute complex human instructions.
The content above was collected and summarized by AI.