togethercomputer/RedPajama-Data-Instruct
Hugging Face · Updated 2023-06-06 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/togethercomputer/RedPajama-Data-Instruct
Resource description:
---
license: apache-2.0
---
# Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks drawn from both [P3 (BigScience)](https://huggingface.co/datasets/bigscience/P3) and [Natural Instructions (AI2)](https://github.com/allenai/natural-instructions).
We apply aggressive decontamination against [HELM](https://crfm.stanford.edu/helm/latest/?group=core_scenarios) in two steps:
(1) We first run a semantic search using each HELM validation example as the query, retrieve the top-100 most similar instances from the Instruct dataset, and inspect every task that has a returned instance overlapping with the validation example (measured by shared 10-grams).
We remove the entire task if the returned instance and the validation example correspond to the same task.
In this step, we keep the task if the returned instance merely uses the same Wikipedia article as the validation example but asks different questions.
(2) We then remove all instances that share any 10-gram with any HELM validation example.
In total, we filtered out 137 of 1,069 tasks (about 12.8%) and 5.2M of 93.3M instances (about 5.6%).
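The second decontamination step can be illustrated with a minimal sketch. The card does not say how text is tokenized or normalized, so the whitespace tokenization and lowercasing below are assumptions, and the function names `ngrams`, `build_index`, and `decontaminate` are hypothetical helpers rather than part of any released pipeline.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 10) -> Set[NGram]:
    """Return the set of word-level n-grams in `text`.

    Assumption: lowercased, whitespace-split tokens. The dataset card
    does not specify the tokenization actually used.
    """
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(validation_examples: Iterable[str], n: int = 10) -> Set[NGram]:
    """Collect every n-gram that occurs in any validation example."""
    index: Set[NGram] = set()
    for example in validation_examples:
        index |= ngrams(example, n)
    return index


def decontaminate(instances: Iterable[str], index: Set[NGram], n: int = 10) -> List[str]:
    """Keep only the instances that share no n-gram with the index."""
    return [text for text in instances if ngrams(text, n).isdisjoint(index)]


if __name__ == "__main__":
    validation = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    pool = [
        "question: the quick brown fox jumps over the lazy dog near what?",
        "translate this sentence into French please",
    ]
    print(decontaminate(pool, build_index(validation)))
```

Building the set of validation n-grams once keeps the per-instance check to a single set-disjointness test, which matters when the pool holds tens of millions of instances.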
# QuickStart
The materialized version of P3 includes three main fields. The `inputs` field contains the task instructions and data inputs, the `targets` field holds the labels, and the `meta` field provides meta information.
```python
from datasets import load_dataset
data = load_dataset('togethercomputer/RedPajama-Data-Instruct', data_files='data/P3_decontaminated.jsonl.zst', split='train')
```
For NI, the `definition` field holds the task instructions and `inputs` holds the input data. The `targets` field holds the labels, and `meta` provides relevant meta information.
```python
from datasets import load_dataset
data = load_dataset('togethercomputer/RedPajama-Data-Instruct', data_files='data/NI_decontaminated.jsonl.zst', split='train')
```
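The two field layouts described above can be turned into a common (prompt, target) pair with a small helper. This is only a sketch: `to_prompt_pair` is a hypothetical function, joining `definition` and `inputs` with a blank line is an assumed convention, and the code assumes every field value is a plain string, which the card does not confirm.

```python
from typing import Any, Dict, Tuple


def to_prompt_pair(record: Dict[str, Any]) -> Tuple[str, str]:
    """Map a P3-style or NI-style record to a (prompt, target) pair.

    Assumptions (not stated in the dataset card):
    - `inputs`, `targets`, and `definition` are plain strings;
    - for NI records, the prompt is the definition followed by the inputs.
    """
    if "definition" in record:
        # NI layout: instructions live in a separate `definition` field.
        prompt = f"{record['definition']}\n\n{record['inputs']}"
    else:
        # P3 layout: `inputs` already contains instructions and data.
        prompt = record["inputs"]
    return prompt, record["targets"]


if __name__ == "__main__":
    p3_like = {"inputs": "Is this review positive? Great movie!", "targets": "yes", "meta": {}}
    ni_like = {"definition": "Answer the question.", "inputs": "2 + 2 = ?", "targets": "4", "meta": {}}
    print(to_prompt_pair(p3_like))
    print(to_prompt_pair(ni_like))
```

The sample records here are made up for illustration and are not taken from the dataset.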
# Source Data
RedPajama-Instruct-Data is sourced from two prominent datasets:
- [Public Pool of Prompts](https://huggingface.co/datasets/bigscience/P3): A large dataset featuring various creative tasks obtained from crowdsourcing efforts.
- [Natural-Instructions](https://github.com/allenai/natural-instructions): An instruction-tuning dataset comprising a diverse set of tasks in natural languages.
# Languages
Primarily English.
# Licensing Information
This dataset is released under the Apache 2.0 license.
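Provider:
togethercomputer
Summary of original information
Dataset overview
Dataset name
RedPajama-Instruct-Data
Data sources
- P3 (BigScience): from the Public Pool of Prompts, a crowdsourced dataset covering a variety of creative tasks.
- Natural Instructions (AI2): from Natural-Instructions, an instruction-tuning dataset comprising a diverse set of natural-language tasks.
Data processing
The dataset undergoes aggressive decontamination in two steps: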
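- Each HELM validation example is used as the query for a semantic search that retrieves the 100 most similar instances from the Instruct dataset, and the returned instances are checked for overlap with the validation example (using 10-grams). If a returned instance and the validation example correspond to the same task, the entire task is removed.
- All instances that share any 10-gram with any HELM validation example are removed.
In total, 137 tasks and 5.2M instances were filtered out (from an original 1069 tasks and 93.3M instances).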
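Data structure
- P3: three main fields: inputs (task instructions and data inputs), targets (labels), and meta (meta information).
- NI: four fields: definition (task instructions), inputs (input data), targets (labels), and meta (relevant meta information).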
Language
Primarily English.
License information
This dataset is released under the Apache 2.0 license.



