
kyujinpy/OpenOrca-ko-v2

Source: Hugging Face. Updated: 2023-10-28. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kyujinpy/OpenOrca-ko-v2
Description:
The OpenOrca dataset is an augmented version of the FLAN Collection. It currently contains approximately 1 million GPT-4 completions and about 3.2 million GPT-3.5 completions. The data is tabularized to follow the distribution described in the Orca paper and is used primarily for training and evaluating natural language processing models. It was created to provide augmented text data that supports language understanding and model training.
Provided by:
kyujinpy
Summary of original information

Dataset overview

Dataset information

Features

  • id: string.
  • input: string.
  • output: string.
  • instruction: string.
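
The four string features above can be written down as a typed record with a small validity check. This is only a sketch: the names `OrcaKoRecord`, `EXPECTED_FIELDS`, and `is_valid_record` are hypothetical and do not come from the dataset card.

```python
from typing import TypedDict


class OrcaKoRecord(TypedDict):
    """One row, using the four feature names listed on the card."""
    id: str
    input: str
    output: str
    instruction: str


# Hypothetical constant: the feature names exactly as listed above.
EXPECTED_FIELDS = ("id", "input", "output", "instruction")


def is_valid_record(row: dict) -> bool:
    """Return True if the row has exactly the four listed fields, all strings."""
    if set(row) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(row[name], str) for name in EXPECTED_FIELDS)
```

A check like this may be useful before training, since a row with a missing or non-string field would otherwise fail later in a less obvious place.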

Data splits

  • train: 41,592,589 bytes, 19,468 examples.

Data size

  • Download size: 21,611,641 bytes.
  • Dataset size: 41,592,589 bytes.
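
As a quick sanity check on the figures above, the average example size and the download-to-dataset ratio can be computed directly. The numbers are copied from the card; the variable names are illustrative only.

```python
# Figures copied from the dataset card.
TRAIN_BYTES = 41_592_589      # dataset size of the train split
TRAIN_EXAMPLES = 19_468       # number of examples in the train split
DOWNLOAD_BYTES = 21_611_641   # size of the downloadable files

# Average uncompressed size of one example, in bytes.
avg_bytes_per_example = TRAIN_BYTES / TRAIN_EXAMPLES

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / TRAIN_BYTES

print(f"average example size: {avg_bytes_per_example:.0f} bytes")
print(f"download / dataset size: {download_ratio:.1%}")
```

This works out to roughly 2.1 KB per example, with the download about half the size of the loaded dataset.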

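Dataset structure

Data fields

  1. id: A unique numbered identifier that includes one of niv, t0, cot, or flan, indicating which FLAN Collection subset the entry came from.
  2. system_prompt: The system prompt given to the GPT-3.5 or GPT-4 API.
  3. question: A question entry drawn from the FLAN Collection.
  4. response: The response to that question, obtained by querying GPT-3.5 or GPT-4.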
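
Because the id field carries the source subset, the subset can be recovered with a small helper. The sketch below assumes the tag appears as a standalone token in the id (for example before a separator such as a dot); the exact id layout is not stated on the card, and `flan_subset` is a hypothetical name.

```python
import re
from typing import Optional

# The four subset tags named in the field description.
FLAN_SUBSETS = ("niv", "t0", "cot", "flan")

# Assumption: the tag is a standalone token, not part of a longer word or number.
_SUBSET_RE = re.compile(r"(?<![A-Za-z0-9])(niv|t0|cot|flan)(?![A-Za-z0-9])")


def flan_subset(record_id: str) -> Optional[str]:
    """Return the FLAN Collection subset tag found in the id, or None."""
    match = _SUBSET_RE.search(record_id)
    return match.group(1) if match else None
```

Grouping records by this tag may help when checking how the subsets are balanced in a sample.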

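Dataset creation

Data sources

  • Data generation follows the distribution outlined in the Orca paper and uses the pre-generated FLAN Collection datasets hosted on HuggingFace.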

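Dataset usage

Use cases

  • Suitable for language understanding, natural language processing, machine learning model training, and model performance evaluation.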
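
For model training, each row typically has to be flattened into a single text. The sketch below uses the instruction, input, and output features listed on the card. The prompt template and the roles given to each field are assumptions inferred from the field names, not something the card specifies. `load_train_split` relies on the Hugging Face `datasets` package and network access, so it is defined but not called here.

```python
def format_example(row: dict) -> str:
    """Join one row into a single training text (hypothetical template)."""
    parts = [f"### Instruction:\n{row['instruction']}"]
    # Assumption: an empty input means there is no extra context to show.
    if row.get("input"):
        parts.append(f"### Input:\n{row['input']}")
    parts.append(f"### Response:\n{row['output']}")
    return "\n\n".join(parts)


def load_train_split():
    """Load the train split from the Hugging Face Hub (needs `datasets` installed)."""
    from datasets import load_dataset  # imported lazily so the sketch runs without it

    return load_dataset("kyujinpy/OpenOrca-ko-v2", split="train")
```

A typical use would be `texts = [format_example(row) for row in load_train_split()]`, with the template adjusted to whatever format the target model expects.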

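Usage notes

  • Because the dataset is still a work in progress, check periodically for updates and improvements.
  • Follow the guidelines and recommendations outlined in the Orca paper when using the data.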

Citation

@misc{OpenOrca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}},
}

@misc{mukherjee2023orca,
  title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4},
  author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
  year={2023},
  eprint={2306.02707},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{longpre2023flan,
  title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
  author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts},
  year={2023},
  eprint={2301.13688},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

@misc{touvron2023llama2,
  title={Llama 2: Open Foundation and Fine-Tuned Chat Models},
  author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
  year={2023},
  eprint={2307.09288},
  archivePrefix={arXiv}
}

@software{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}
