
togethercomputer/RedPajama-Data-Instruct

Source: Hugging Face. Updated 2023-06-06. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/togethercomputer/RedPajama-Data-Instruct
Description:
---
license: apache-2.0
---

# Dataset Summary

RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both [P3 (BigScience)](https://huggingface.co/datasets/bigscience/P3) and [Natural Instructions (AI2)](https://github.com/allenai/natural-instructions). We conduct aggressive decontamination against [HELM](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) in two steps:

1. We first run a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct dataset, and check tasks that have any returned instance overlapping (by 10-gram) with the validation example. We remove the entire task if the returned instance and the validation example correspond to the same task. In this step, we keep the task if the returned instance merely uses the same Wikipedia article as the validation example but asks different questions.
2. We then remove all instances that have any 10-gram overlap with any HELM validation example.

In total, we filtered out 137 tasks and 5.2M instances (out of 1069 tasks and 93.3M instances).

# QuickStart

The materialized version of P3 includes three main fields. The `inputs` field contains task instructions and data inputs, while the `targets` field holds the labels. The third field, `meta`, provides meta information.

```python
from datasets import load_dataset

# Repository ID as it appears in the download URL for this dataset
data = load_dataset('togethercomputer/RedPajama-Data-Instruct', data_files='data/P3_decontaminated.jsonl.zst', split='train')
```

For NI, the `definition` field holds the task instructions, while `inputs` holds the input data. The `targets` field holds the labels, and `meta` provides relevant meta information.

```python
from datasets import load_dataset

# Repository ID as it appears in the download URL for this dataset
data = load_dataset('togethercomputer/RedPajama-Data-Instruct', data_files='data/NI_decontaminated.jsonl.zst', split='train')
```

# Source Data

RedPajama-Instruct-Data is sourced from two prominent datasets:

- [Public Pool of Prompts](https://huggingface.co/datasets/bigscience/P3): A large dataset featuring various creative tasks obtained from crowdsourcing efforts.
- [Natural-Instructions](https://github.com/allenai/natural-instructions): An instruction-tuning dataset comprising a diverse set of tasks in natural languages.

# Languages

Primarily English.

# Licensing Information

This dataset is released under the Apache 2.0 license.
Provider:
togethercomputer
Summary of Original Information

Dataset Overview

Dataset Name

RedPajama-Instruct-Data

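Data Sources

  • P3 (BigScience): from the Public Pool of Prompts, a crowdsourced dataset featuring a variety of creative tasks.
  • Natural Instructions (AI2): from Natural-Instructions, an instruction-tuning dataset comprising a diverse set of natural language tasks.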

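Data Processing

The dataset is aggressively decontaminated against HELM in two steps:

  1. Run a semantic search using each HELM validation example as the query, retrieve the 100 most similar instances from the Instruct dataset, and check whether any returned instance overlaps (by 10-gram) with the validation example. If the returned instance and the validation example correspond to the same task, the entire task is removed.
  2. Remove all instances that have any 10-gram overlap with any HELM validation example.

In total, 137 tasks and 5.2M instances were filtered out (from an original 1069 tasks and 93.3M instances).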
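The instance-level filter in step 2 can be illustrated with a minimal sketch. This is not the authors' pipeline: the tokenization (lowercased whitespace splitting) and the function names `ngrams`, `build_blocklist`, and `decontaminate` are assumptions made for illustration, since the source only states that 10-gram overlap is used.

```python
from typing import Iterable, List, Set, Tuple

NGRAM_SIZE = 10  # the source specifies 10-gram overlap

Ngram = Tuple[str, ...]


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Ngram]:
    """Return the set of word n-grams in `text`.

    Assumption: tokens are lowercased, whitespace-separated words.
    Texts shorter than n tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_blocklist(validation_examples: Iterable[str], n: int = NGRAM_SIZE) -> Set[Ngram]:
    """Collect every n-gram that occurs in any validation example."""
    blocked: Set[Ngram] = set()
    for example in validation_examples:
        blocked |= ngrams(example, n)
    return blocked


def decontaminate(instances: Iterable[str], blocked: Set[Ngram], n: int = NGRAM_SIZE) -> List[str]:
    """Keep only instances that share no n-gram with the blocklist."""
    return [text for text in instances if ngrams(text, n).isdisjoint(blocked)]


# Toy data standing in for HELM validation examples and Instruct instances.
validation = ["the quick brown fox jumps over the lazy dog near the river bank"]
instances = [
    "Question: quick brown fox jumps over the lazy dog near the what?",
    "Translate this sentence into French: the cat sat on the mat",
    "Too short to overlap",
]
kept = decontaminate(instances, build_blocklist(validation))
```

Building the blocklist once and testing each instance with a set-disjointness check keeps the cost proportional to the total number of tokens, which matters at the scale of tens of millions of instances.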

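Data Structure

  • P3: three main fields: inputs (task instructions and data inputs), targets (labels), and meta (meta information).
  • NI: four fields: definition (task instructions), inputs (input data), targets (labels), and meta (relevant meta information).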
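To show how the two schemas differ in practice, the following sketch turns a record from either subset into a (prompt, target) pair. It assumes every field value is a plain string, and the helper names `p3_to_pair` and `ni_to_pair` and the blank-line separator are illustrative choices, not part of the dataset.

```python
from typing import Dict, Tuple


def p3_to_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """P3 records already carry the instruction inside `inputs`."""
    return record["inputs"], record["targets"]


def ni_to_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """NI records keep the instruction in `definition`, separate from `inputs`.

    Assumption: the two are joined with a blank line to form the prompt.
    """
    prompt = f"{record['definition'].strip()}\n\n{record['inputs'].strip()}"
    return prompt, record["targets"]


# Hypothetical records shaped like the fields described above.
p3_record = {"inputs": "Is this review positive? Great film.", "targets": "yes", "meta": "{}"}
ni_record = {
    "definition": "Classify the sentiment of the review.",
    "inputs": "Great film.",
    "targets": "positive",
    "meta": "{}",
}

p3_pair = p3_to_pair(p3_record)
ni_pair = ni_to_pair(ni_record)
```

The `meta` field is ignored here because the source describes it only as meta information.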

Languages

Primarily English.

Licensing Information

This dataset is released under the Apache 2.0 license.
