robinhad/databricks-dolly-15k-uk

Name: robinhad/databricks-dolly-15k-uk
Creator: robinhad
Published: 2023-06-13 05:43:34
License: cc-by-sa-3.0

Source: Hugging Face. Updated 2023-06-13. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/robinhad/databricks-dolly-15k-uk


Resource description:

---
license: cc-by-sa-3.0
task_categories:
- question-answering
- summarization
language:
- uk
size_categories:
- 10K<n<100K
---

# Summary

`databricks-dolly-15k-uk` is an open source dataset based on the [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) instruction-following dataset, machine translated with the [facebook/m2m100_1.2B](https://huggingface.co/facebook/m2m100_1.2B) model.

Tasks covered include brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

Expect this dataset to contain grammatical errors and the obvious pitfalls of machine translation.

<details>
<summary>Original Summary</summary>

# Summary

`databricks-dolly-15k` is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT](https://arxiv.org/abs/2203.02155) paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).

Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: Ukrainian

Version: 1.0

**Owner: Databricks, Inc.**

# Dataset Overview

`databricks-dolly-15k` is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category.
The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and to select only questions they could reasonably be expected to answer correctly.

For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the `context` field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. `[42]`) which we recommend users remove for downstream applications.

# Intended Uses

While immediately valuable for instruction fine-tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short response, with the resulting text associated with the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

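
The few-shot idea described under Intended Uses can be sketched as follows. This is only an illustration, not the method used by Databricks or by the Self-Instruct authors: the records are made up, the field names `instruction` and `category` are assumed to match the dataset schema, and `build_few_shot_prompt` is a hypothetical helper. The resulting string would still have to be sent to a language model of your choice.

```python
import random

# Hypothetical records for illustration only; the field names "instruction"
# and "category" are assumptions about the dataset schema.
SEED_RECORDS = [
    {"instruction": "List five uses for a paperclip.", "category": "brainstorming"},
    {"instruction": "Suggest names for a bakery.", "category": "brainstorming"},
    {"instruction": "Is a tomato a fruit or a vegetable?", "category": "classification"},
    {"instruction": "Give ideas for a rainy-day activity.", "category": "brainstorming"},
]


def build_few_shot_prompt(records, category, k=3, seed=0):
    """Build a prompt that lists k instructions from one category and leaves
    an open numbered slot for a model to fill with a new instruction."""
    pool = [r["instruction"] for r in records if r["category"] == category]
    if len(pool) < k:
        raise ValueError(f"need at least {k} records in category {category!r}")
    rng = random.Random(seed)  # fixed seed keeps the sampled examples reproducible
    shots = rng.sample(pool, k)
    lines = [f"Here are {k} example instructions in the category '{category}':"]
    lines += [f"{i}. {text}" for i, text in enumerate(shots, start=1)]
    lines.append(f"{k + 1}.")  # the model is expected to continue from here
    return "\n".join(lines)


prompt = build_few_shot_prompt(SEED_RECORDS, "brainstorming")
print(prompt)
```

Sampling within a single category keeps each generated instruction tied to one of the behavioral categories, which is what the passage above suggests.
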
# Dataset

## Purpose of Collection

As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

## Sources

- **Human-generated data**: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
- **Wikipedia**: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization) contributors selected passages from Wikipedia for particular subsets of instruction categories. No guidance was given to annotators as to how to select the target passages.

## Annotator Guidelines

To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance with an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.

The annotation guidelines for each of the categories are as follows:

- **Creative Writing**: Write a question or instruction that requires a creative, open-ended written response. The instruction should be reasonable to ask of a person with general world knowledge and should not require searching. In this task, your prompt should give very specific instructions to follow. Constraints, instructions, guidelines, or requirements all work, and the more of them the better.
- **Closed QA**: Write a question or instruction that requires a factually correct response based on a passage of text from Wikipedia.
  The question can be complex and can involve human-level reasoning capabilities, but should not require special knowledge. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Open QA**: Write a question that can be answered using general world knowledge or at most a single search. This task asks for opinions and facts about the world at large and does not provide any reference text for consultation.
- **Summarization**: Give a summary of a paragraph from Wikipedia. Please don't ask questions that will require more than 3-5 minutes to answer. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Information Extraction**: These questions involve reading a paragraph from Wikipedia and extracting information from the passage. Everything required to produce an answer (e.g. a list, keywords, etc.) should be included in the passages. To create a question for this task include both the text of the question as well as the reference text in the form.
- **Classification**: These prompts contain lists or examples of entities to be classified, e.g. movie reviews, products, etc. In this task the text or list of entities under consideration is contained in the prompt (i.e. there is no reference text). You can choose any categories for classification you like, the more diverse the better.
- **Brainstorming**: Think up lots of examples in response to a question asking to brainstorm ideas.

## Personal or Sensitive Data

This dataset contains public information (e.g., some information from Wikipedia). To our knowledge, there are no private person’s personal identifiers or sensitive information.

## Language

American English

# Known Limitations

- Wikipedia is a crowdsourced corpus and the contents of this dataset may reflect the bias, factual errors and topical focus found in Wikipedia
- Some annotators may not be native English speakers
- Annotator demographics and subject matter may reflect the makeup of Databricks employees

# License/Attribution

**Copyright (2023) Databricks, Inc.**

This dataset was developed at Databricks (https://www.databricks.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/

Copyright © Wikipedia editors and contributors.

</details>
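
The dataset card recommends removing bracketed Wikipedia citation numbers such as `[42]` from the `context` field before downstream use. One way to do that is sketched below, assuming records are plain Python dictionaries with a `context` key. The helper names `strip_citations` and `clean_record` are illustrative and not part of any library, and the whitespace tidying is an extra choice beyond what the card asks for.

```python
import re

# Matches bracketed numeric citation markers such as "[42]" or "[ 7 ]".
# Non-numeric brackets like "[note]" are deliberately left alone.
CITATION_RE = re.compile(r"\[\s*\d+\s*\]")


def strip_citations(text):
    """Remove Wikipedia-style numeric citation markers from a string."""
    cleaned = CITATION_RE.sub("", text)
    # Collapse runs of spaces or tabs left behind by a removed marker.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Drop a stray space that now sits directly before punctuation.
    cleaned = re.sub(r"[ \t]+([.,;:!?])", r"\1", cleaned)
    return cleaned.strip()


def clean_record(record):
    """Return a copy of a record with citations stripped from `context`."""
    cleaned = dict(record)
    cleaned["context"] = strip_citations(record.get("context", ""))
    return cleaned


# Made-up example record; only the `context` field name comes from the card.
example = {"context": "Kyiv is the capital of Ukraine.[1] It lies on the Dnieper River.[2][3]"}
print(clean_record(example)["context"])
```

The function returns a copy instead of mutating its input, so the original record stays available for comparison.
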

Provider:

robinhad

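Summary of Original Information

Dataset Overview

Dataset name

databricks-dolly-15k-uk

Dataset source

Based on the databricks/databricks-dolly-15k instruction-following dataset, machine translated with the facebook/m2m100_1.2B model.

Language

Ukrainian (uk)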

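Dataset size

10,000 < n < 100,000 records

Task categories

Question answering
Summarization

Tasks covered

Brainstorming
Classification
Closed QA
Generation
Information extraction
Open QA
Summarization

Expected issues

The dataset may contain grammatical errors and the obvious flaws of machine translation.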

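Original Dataset Overview

databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees. It covers several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

Supported tasks

Training large language models (LLMs)
Synthetic data generation
Data augmentation

Dataset purpose

The dataset is intended to be the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. It can be used for any purpose, including academic or commercial applications.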

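Data collection sources

Human-generated data: Databricks employees were invited to create prompt/response pairs in each of eight different instruction categories.
Wikipedia: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization), annotators selected passages from Wikipedia for particular subsets of instruction categories.

Annotator guidelines

The annotator guidelines were succinct by design to encourage a high task completion rate, possibly at the cost of rigorous compliance with an annotation rubric. They include specific instructions for the creative writing, closed QA, open QA, summarization, information extraction, classification, and brainstorming tasks.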

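Language

American English

Known limitations

Wikipedia is a crowdsourced corpus, and the contents of this dataset may reflect the bias, factual errors, and topical focus found in Wikipedia.
Some annotators may not be native English speakers.
Annotator demographics and subject matter may reflect the makeup of Databricks employees.

License/Attribution

The dataset is subject to the Creative Commons Attribution-ShareAlike 3.0 Unported License. Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) - https://www.wikipedia.org/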
