SKT27182/Preprocessed_OpenOrca
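Dataset Description
Dataset Summary
This dataset contains data for text classification and conversational systems. It is released under the MIT license.
Languages
The primary language of the dataset is English.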
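Dataset Structure
Data Fields
The dataset contains the following fields:
- id: a unique numbered identifier that includes one of niv, t0, cot, or flan, indicating which subset of the FLAN Collection the question came from.
- system_prompt: the system prompt presented to the GPT-3.5 or GPT-4 API.
- question: a question entry as provided by the FLAN Collection.
- response: the response to the question, obtained by querying GPT-3.5 or GPT-4.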
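Because each id embeds the name of its FLAN Collection subset, the source of a record can be recovered from the identifier alone. The following is a minimal sketch of that idea. The helper name `source_subset` and the sample ids (such as `flan.564327`) are illustrative assumptions, not values taken from this card, and the exact id layout may differ in the actual data.

```python
import re
from typing import Optional

# The four FLAN Collection subset tags named in the field description.
_SUBSET_PATTERN = re.compile(r"niv|t0|cot|flan")


def source_subset(record_id: str) -> Optional[str]:
    """Return the first subset tag found in an id, or None if there is none."""
    match = _SUBSET_PATTERN.search(record_id)
    return match.group(0) if match else None


# Hypothetical ids, used only to exercise the helper.
for example_id in ("flan.564327", "cot.86217", "unknown.1"):
    print(example_id, "->", source_subset(example_id))
```

A substring search is used here because the card only says the id "includes" one of the tags. If the tag turns out to sit at a fixed position, a stricter match would be safer.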
Data Splits
The dataset is divided into the following splits:
- train: 2872771 examples, 3671168412.416216 bytes.
- test: 359097 examples, 458896850.2513517 bytes.
- validation: 359096 examples, 458895572.3324322 bytes.
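The example counts above correspond to roughly an 80/10/10 partition. The short sketch below recomputes the total and the share of each split from the listed counts; the variable names are illustrative only.

```python
# Example counts as listed for each split.
SPLIT_SAMPLES = {"train": 2872771, "test": 359097, "validation": 359096}

total_samples = sum(SPLIT_SAMPLES.values())
split_fractions = {name: n / total_samples for name, n in SPLIT_SAMPLES.items()}

print(f"total examples: {total_samples}")
for name, fraction in split_fractions.items():
    # Share of the whole dataset held by this split, as a percentage.
    print(f"{name}: {fraction:.1%}")
```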
Dataset Size
- Download size: 2553683923 bytes
- Dataset size: 4588960835.0 bytes
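As a consistency check, the per-split byte sizes listed under the data splits should add up to the overall dataset size. The sketch below verifies this with a small tolerance and converts both figures to GiB. The helper `to_gib` and the constant names are assumptions made for illustration.

```python
# Per-split sizes in bytes, as listed under the data splits.
SPLIT_BYTES = {
    "train": 3671168412.416216,
    "test": 458896850.2513517,
    "validation": 458895572.3324322,
}
DOWNLOAD_BYTES = 2553683923
DATASET_BYTES = 4588960835.0

GIB = 1024 ** 3  # bytes per gibibyte


def to_gib(num_bytes: float) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / GIB


split_total = sum(SPLIT_BYTES.values())

# The fractional byte counts should sum to the dataset size up to rounding.
print("splits match dataset size:", abs(split_total - DATASET_BYTES) < 1.0)
print(f"download size: {to_gib(DOWNLOAD_BYTES):.2f} GiB")
print(f"dataset size:  {to_gib(DATASET_BYTES):.2f} GiB")
```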
Configurations
The dataset has a single default configuration with the following data files:
- train: data/train-*
- test: data/test-*
- validation: data/validation-*
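The split-to-path mapping above can be sketched in code as follows. This is an illustration under stated assumptions: the shard file name `data/train-00000-of-00008.parquet` is hypothetical, the helpers `split_for_path` and `load_split` are not part of the dataset, and `load_split` relies on the Hugging Face `datasets` library plus network access, so it is defined but not called here.

```python
from fnmatch import fnmatch
from typing import Optional

REPO_ID = "SKT27182/Preprocessed_OpenOrca"

# Split name -> glob pattern, as listed for the default configuration.
DATA_FILES = {
    "train": "data/train-*",
    "test": "data/test-*",
    "validation": "data/validation-*",
}


def split_for_path(path: str) -> Optional[str]:
    """Return the split whose pattern matches a repository path, if any."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_split(split: str = "train", streaming: bool = True):
    """Load one split; needs `pip install datasets` and network access."""
    # Imported lazily so the rest of this sketch runs offline.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


# Hypothetical shard name, used only to show how the patterns match.
print(split_for_path("data/train-00000-of-00008.parquet"))
```

Streaming is chosen as the default in this sketch because the download is several gigabytes; passing `streaming=False` would fetch the full split instead.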
Source Data
Initial Data Collection and Normalization
The dataset was collected from Open-Orca/OpenOrca on HuggingFace.
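Additional Information
Dataset Curators
This dataset is derived from Open-Orca/OpenOrca. Its prompts were modified so that their overall length is under 512, allowing most models with a maximum input length of 512 to process them.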
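A length check of this kind might look like the sketch below. Several points are assumptions rather than facts from this card: that the "prompt" is the system_prompt joined with the question, that length is counted in tokens, and that whitespace splitting is an acceptable stand-in for a real tokenizer. The names `prompt_length` and `fits_limit` are hypothetical. In practice the tokenizer of the target model would be passed in as `tokenize`.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def prompt_length(record: Dict[str, str], tokenize: Tokenizer = str.split) -> int:
    """Count tokens in the combined prompt (assumed: system_prompt + question)."""
    text = f"{record['system_prompt']} {record['question']}"
    return len(tokenize(text))


def fits_limit(
    record: Dict[str, str], limit: int = 512, tokenize: Tokenizer = str.split
) -> bool:
    """Return True when the combined prompt is strictly shorter than the limit."""
    return prompt_length(record, tokenize) < limit


# Hypothetical record, used only to exercise the helpers.
example = {
    "system_prompt": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
}
print(prompt_length(example), fits_limit(example))
```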
Citation
@misc{OpenOrca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}},
}

@misc{mukherjee2023orca,
  title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4},
  author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
  year={2023},
  eprint={2306.02707},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{longpre2023flan,
  title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
  author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts},
  year={2023},
  eprint={2301.13688},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

@software{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}



