Open-Orca/FLAN
Hugging Face · Updated 2023-08-02 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Open-Orca/FLAN
Resource description:
---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
datasets:
- Open-Orca/OpenOrca
size_categories:
- 1B<n<10B
---
<p><h1>🍮 The WHOLE FLAN Collection! 🍮</h1></p>

# Overview
This repository includes the full dataset from the [FLAN Collection](https://ai.googleblog.com/2023/02/the-flan-collection-advancing-open.html), totalling ~300GB as parquets.
Generated using the official seqio templating from the [Google FLAN Collection GitHub repo](https://github.com/google-research/FLAN/tree/main/flan/v2).
The data is subject to the same licenses as the component datasets.
To keep up with our continued work on OpenOrca and other exciting research, find our Discord here:
https://AlignmentLab.ai
# Motivation
This work was done as part of the requirements for the OpenOrca project.
No publicly generated subset of the FLAN Collection was large enough to subsample from for this work.
So, we opted to process the entire collection ourselves.
Generating this requires an understanding of seqio and a Linux server with 512GB of RAM, as well as fast drives and custom limits for many parameters beyond the defaults on Linux server distributions (e.g., allowing up to 45,000 threads to run at once).
It also involves downloading over 400GB of datasets, working around tfds bugs, and then processing the datasets over the course of several days.
We provide this repo as a resource to other ML researchers, as it spares them these time-consuming and laborious steps and puts the data into a more accessible format for further use.
# Data
## Organization
* JSON files at top level are used for subsampling in OpenOrca
* Parquets in subdirectories contain the entire FLAN collection in Dask-sharded folders by submix fractions
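
As a minimal sketch of how one subdirectory might be read without downloading the whole collection, the snippet below streams a single folder with the Hugging Face `datasets` library. The folder name `cot_zsopt_data` is taken from the folder listing later in this card; the helper names `remix_glob` and `stream_remix` are hypothetical, and the glob assumes the Parquet shards sit directly inside the folder, which has not been verified against the repository layout.

```python
REPO_ID = "Open-Orca/FLAN"  # repository name from this card


def remix_glob(folder: str) -> str:
    """Build a glob for the Parquet shards of one subdirectory.

    Assumption: shards end in `.parquet` and sit directly inside the folder.
    """
    return f"{folder.strip('/')}/*.parquet"


def stream_remix(folder: str):
    """Return a streaming dataset over one subdirectory (needs network access)."""
    # Imported lazily so the path helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID,
        data_files=remix_glob(folder),
        split="train",
        streaming=True,  # iterate over rows instead of downloading everything
    )
```

Calling `stream_remix("cot_zsopt_data")` and iterating over the first few rows is a cheap way to inspect the column layout before committing to one of the larger folders.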
## Zero-Shot vs Few-Shot and Options vs No-Options
The core sub-collections of FLAN are `CoT`, `Dialog`, `NIv2`, `T0`, and `flan2021`.
Within those sub-collections are four "remixes" of the data that are templated differently:
* `Zero-Shot` and `Few-Shot`
  * `Zero-Shot` provides a prompt, question, or challenge without any preceding exemplars
  * `Few-Shot` provides exemplars first
* `Options` and `No-Options`
  * `Options` provides a question or challenge with multiple-choice (e.g. A/B/C/D) answer options to select from
  * `No-Options` requires a free-form answer
For any given sub-collection, only some of the "remixes" may be officially provided. All available remixes have been generated in full, without any redaction or sub-sampling.
For example, the `t0_fsopt_data` folder contains the Few-Shot (FS), Options (OPT) remix of the `T0` sub-collection.
Notably, this is the largest "remix" and the one that necessitates 512GB of RAM to generate. The raw JSON output is nearly 200GB.
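
The folder naming convention described above can be decoded mechanically. The following is a small sketch, not part of the original tooling: the `Remix` class and `parse_remix_folder` function are hypothetical names, and the mapping of the `flan` prefix to the `flan2021` sub-collection is an assumption inferred from the folder names in this repository.

```python
from dataclasses import dataclass

# Folder prefixes mapped to the sub-collection names used in this card.
# Assumption: the `flan` prefix corresponds to the `flan2021` sub-collection.
SUB_COLLECTIONS = {
    "cot": "CoT",
    "dialog": "Dialog",
    "niv2": "NIv2",
    "t0": "T0",
    "flan": "flan2021",
}


@dataclass(frozen=True)
class Remix:
    sub_collection: str
    few_shot: bool  # True for `fs`, False for `zs`
    options: bool   # True for `opt`, False for `noopt`


def parse_remix_folder(folder: str) -> Remix:
    """Decode a folder name such as `t0_fsopt_data` into its components."""
    parts = folder.strip("./").split("_")
    if len(parts) != 3 or parts[2] != "data" or parts[0] not in SUB_COLLECTIONS:
        raise ValueError(f"unrecognized folder name: {folder!r}")
    prefix, code, _ = parts
    shot, opt = code[:2], code[2:]
    if shot not in ("zs", "fs") or opt not in ("opt", "noopt"):
        raise ValueError(f"unrecognized remix code: {code!r}")
    return Remix(SUB_COLLECTIONS[prefix], shot == "fs", opt == "opt")


print(parse_remix_folder("t0_fsopt_data"))
```

Keeping the result as a frozen dataclass makes it easy to filter a list of folders, for instance to select every zero-shot remix regardless of sub-collection.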
## Parquet Sizes
Each sub-collection's individual remixes are provided as [Parquet](https://huggingface.co/docs/datasets/loading#parquet) files which have been sharded by [Dask](https://huggingface.co/docs/datasets/main/en/filesystems#dask) into ~160MB chunks (starting from 256MB blocks of the source jsonl files).
The folder structure along with size sums is provided below.
```
$ du -h --max-depth=1 ./
9.1G ./niv2_fsopt_data
2.4G ./niv2_zsopt_data
59G ./flan_fsopt_data
984M ./dialog_zsopt_data
11G ./flan_zsopt_data
8.6G ./dialog_fsopt_data
16G ./t0_zsnoopt_data
149M ./cot_fsopt_data
20M ./cot_zsopt_data
17G ./t0_zsopt_data
11G ./flan_zsnoopt_data
101G ./t0_fsopt_data
25G ./flan_fsnoopt_data
39G ./t0_fsnoopt_data
296G ./
```
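
For planning downloads, the listing above can be turned into a rough shard count per folder. This is only an estimate under stated assumptions: the "~160MB" shard size is read as 160 MiB to match the binary units of `du -h`, real shards vary in size, and the function names are hypothetical.

```python
import math

# `du -h` reports binary units (K = 1024 bytes, M = 1024**2, ...).
UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
SHARD_BYTES = 160 * 1024**2  # "~160MB" per shard, read as MiB (assumption)


def parse_du_size(text: str) -> int:
    """Convert a `du -h` size such as '9.1G' or '984M' to bytes."""
    number, unit = text[:-1], text[-1].upper()
    return int(float(number) * UNIT_BYTES[unit])


def estimate_shards(du_size: str) -> int:
    """Estimate how many ~160MB shards make up a folder of the given size."""
    return math.ceil(parse_du_size(du_size) / SHARD_BYTES)


# A few entries copied from the listing above.
LISTING = {"t0_fsopt_data": "101G", "flan_fsopt_data": "59G", "cot_zsopt_data": "20M"}
for folder, size in LISTING.items():
    print(f"{folder}: roughly {estimate_shards(size)} shards")
```

Because `du -h` rounds each entry, the per-folder figures are approximate and do not need to sum exactly to the reported total.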
# Citations
```bibtex
@misc{goodson2023huggyflan,
title={Fine FLAN: Seqio to Parquet So You Don't Have To},
author={Bleys Goodson},
year={2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://huggingface.co/datasets/Open-Orca/FLAN}},
}
```
```bibtex
@misc{longpre2023flan,
title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts},
year={2023},
eprint={2301.13688},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
```bibtex
@misc{wei2022finetuned,
title={Finetuned Language Models Are Zero-Shot Learners},
author={Jason Wei and Maarten Bosma and Vincent Y. Zhao and Kelvin Guu and Adams Wei Yu and Brian Lester and Nan Du and Andrew M. Dai and Quoc V. Le},
year={2022},
eprint={2109.01652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
```bibtex
@misc{sanh2022multitask,
title={Multitask Prompted Training Enables Zero-Shot Task Generalization},
author={Victor Sanh and Albert Webson and Colin Raffel and Stephen H. Bach and Lintang Sutawika and Zaid Alyafeai and Antoine Chaffin and Arnaud Stiegler and Teven Le Scao and Arun Raja and Manan Dey and M Saiful Bari and Canwen Xu and Urmish Thakker and Shanya Sharma Sharma and Eliza Szczechla and Taewoon Kim and Gunjan Chhablani and Nihal Nayak and Debajyoti Datta and Jonathan Chang and Mike Tian-Jian Jiang and Han Wang and Matteo Manica and Sheng Shen and Zheng Xin Yong and Harshit Pandey and Rachel Bawden and Thomas Wang and Trishala Neeraj and Jos Rozen and Abheesht Sharma and Andrea Santilli and Thibault Fevry and Jason Alan Fries and Ryan Teehan and Tali Bers and Stella Biderman and Leo Gao and Thomas Wolf and Alexander M. Rush},
year={2022},
eprint={2110.08207},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
```bibtex
@misc{wang2022supernaturalinstructions,
title={Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks},
author={Yizhong Wang and Swaroop Mishra and Pegah Alipoormolabashi and Yeganeh Kordi and Amirreza Mirzaei and Anjana Arunkumar and Arjun Ashok and Arut Selvan Dhanasekaran and Atharva Naik and David Stap and Eshaan Pathak and Giannis Karamanolakis and Haizhi Gary Lai and Ishan Purohit and Ishani Mondal and Jacob Anderson and Kirby Kuznia and Krima Doshi and Maitreya Patel and Kuntal Kumar Pal and Mehrad Moradshahi and Mihir Parmar and Mirali Purohit and Neeraj Varshney and Phani Rohitha Kaza and Pulkit Verma and Ravsehaj Singh Puri and Rushang Karia and Shailaja Keyur Sampat and Savan Doshi and Siddhartha Mishra and Sujan Reddy and Sumanta Patro and Tanay Dixit and Xudong Shen and Chitta Baral and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi and Daniel Khashabi},
year={2022},
eprint={2204.07705},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
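Provider:
Open-Orca
Summary of original information
Dataset overview
Dataset name
- Name: FLAN Collection
- Source: Open-Orca/FLAN
Dataset attributes
- License: CC-BY-4.0
- Language: English
- Library name: transformers
- Pipeline tag: text-generation
- Size category: 1B<n<10B
Dataset contents
- Total size: ~300GB
- Format: Parquet
- Generation method: official seqio templating from the Google FLAN Collection GitHub repo
Data organization
- JSON files: used for subsampling in OpenOrca
- Parquet files: contain the entire FLAN collection, sharded into subdirectories by submix fractions
Data subsets
- Core sub-collections: CoT, Dialog, NIv2, T0, flan2021
- Remixes: Zero-Shot, Few-Shot, Options, No-Options
  - Zero-Shot: provides a prompt, question, or challenge without preceding exemplars
  - Few-Shot: provides exemplars first
  - Options: provides multiple-choice (e.g. A/B/C/D) answer options
  - No-Options: requires a free-form answer
Data size examples
- t0_fsopt_data: the Few-Shot, Options remix of the T0 sub-collection, with roughly 200GB of raw JSON output
- Parquet shard size: about 160MB, sharded from 256MB blocks of the source jsonl files
Citation information
- Goodson2023huggyflan: describes the conversion from seqio to Parquet
- Longpre2023flan: describes the design and methods of the FLAN Collection
- Wei2022finetuned: describes research on finetuned language models as zero-shot learners
- Sanh2022multitask: describes research on multitask prompted training for zero-shot task generalization
- Wang2022supernaturalinstructions: describes research on generalization via declarative instructions on 1600+ NLP tasks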



