five

OpenOrca

收藏
魔搭社区2026-05-21 更新2024-05-15 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/OpenOrca
下载链接
链接失效反馈
官方服务:
资源简介:
## Table of Contents - [Dataset Summary](#dataset-summary) - [Dataset Attribution](#dataset-attribution) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Dataset Use](#dataset-use) - [Use Cases](#use-cases) - [Usage Caveats](#usage-caveats) - [Getting Started](#getting-started) <p><h1>🐋 The OpenOrca Dataset! 🐋</h1></p> ![OpenOrca Logo](https://huggingface.co/datasets/Open-Orca/OpenOrca/resolve/main/OpenOrcaLogo.png "OpenOrca Logo") <a name="dataset-announcement"></a> We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707). It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers! # Official Models ## 示例代码 ```python from modelscope import MsDataset from modelscope.utils.constant import DownloadMode ds = MsDataset.load('AI-ModelScope/OpenOrca',subset_name='default', split='train', download_mode=DownloadMode.FORCE_REDOWNLOAD) print(next(iter(ds))) ``` ## Mistral-7B-OpenOrca Our [latest model](https://huggingface.co/spaces/Open-Orca/Mistral-7B-OpenOrca), the first 7B to score better overall than all previous models below 30B. 98% of Llama2-70b-chat's performance, in a completely open 7B! ## OpenOrca-Platypus2-13B Our [third model](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B), the first 13B model to score higher than LLaMA1-65B on the HuggingFace Leaderboard! Released in partnership with Platypus. ## LlongOrca 7B & 13B * Our [first 7B release](https://huggingface.co/Open-Orca/LlongOrca-7B-16k), trained on top of LLongMA2 to achieve 16,000 tokens context. #1 long context 7B model at release time, with >99% of the overall #1 model's performance. * [LlongOrca-13B-16k](https://huggingface.co/Open-Orca/LlongOrca-13B-16k), trained on top of LLongMA2. #1 long context 13B model at release time, with >97% of the overall #1 model's performance. ## OpenOrcaxOpenChat-Preview2-13B Our [second model](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B), highlighting that we've surpassed the performance reported in the Orca paper. Was #1 at release time, now surpassed by our own OpenOrca-Platypus2-13B. Released in partnership with OpenChat. ## OpenOrca-Preview1-13B [OpenOrca-Preview1-13B](https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B) This model was trained in less than a day, for <$200, with <10% of our data. At release, it beat the current state of the art models on BigBench-Hard and AGIEval. Achieves ~60% of the improvements reported in the Orca paper. <a name="dataset-summary"></a> # Dataset Summary The OpenOrca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688). Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope. The data is primarily used for training and evaluation in the field of natural language processing. <a name="dataset-attribution"></a> # Dataset Attribution We would like to give special recognition to the following contributors for their significant efforts and dedication: Teknium WingLian/Caseus Eric Hartford NanoBit Pankaj Winddude Rohan http://AlignmentLab.ai: Autometa Entropi AtlasUnified NeverendingToast NanoBit WingLian/Caseus Also of course, as always, TheBloke, for being the backbone of the whole community. Many thanks to NanoBit and Caseus, makers of [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), for lending us their expertise on the platform that developed and trained manticore, minotaur, and many others! We are welcoming sponsors or collaborators to help us build these models to the scale they deserve. Please reach out via our socials: http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx Want to visualize our full dataset? Check out our [Nomic Atlas Map](https://atlas.nomic.ai/map/c1b88b47-2d9b-47e0-9002-b80766792582/2560fd25-52fe-42f1-a58f-ff5eccc890d2). [<img src="https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B/resolve/main/OpenOrca%20Nomic%20Atlas.png" alt="Atlas Nomic Dataset Map" width="400" height="400" />](https://atlas.nomic.ai/map/c1b88b47-2d9b-47e0-9002-b80766792582/2560fd25-52fe-42f1-a58f-ff5eccc890d2) <a name="supported-tasks-and-leaderboards"></a> # Supported Tasks and Leaderboards This dataset supports a range of tasks including language modeling, text generation, and text augmentation. It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing. Further information on leaderboards will be updated as they become available. <a name="languages"></a> # Languages The language of the data is primarily English. <a name="dataset-structure"></a> # Dataset Structure <a name="data-instances"></a> ## Data Instances A data instance in this dataset represents entries from the FLAN collection which have been augmented by submitting the listed question to either GPT-4 or GPT-3.5. The response is then entered into the response field. <a name="data-fields"></a> ## Data Fields The fields are: 1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to represent which source FLAN Collection submix the 'question' is sourced from. 2) 'system_prompt', representing the System Prompt presented to the GPT-3.5 or GPT-4 API for the datapoint 3) 'question', representing a question entry as provided by the FLAN Collection 4) 'response', a response to that question received from a query to either GPT-3.5 or GPT-4. <a name="data-splits"></a> ## Data Splits The data is unsplit. <a name="dataset-creation"></a> # Dataset Creation <a name="curation-rationale"></a> ## Curation Rationale The dataset was created to provide a source of augmented text data for researchers and developers. The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step by step reasoning capabilities of GPT-3.5 and GPT-4. This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained with this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks which all models below 100B parameters had previously performed dramatically worse on. <a name="source-data"></a> ## Source Data The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below: 1) There is not enough CoT data in the FLAN Collection to generate 150K zero-shot entries, as the paper purports to use. We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available. 2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original). These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source. However, these are a subset of the full FLAN Collection data, and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively. Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work. <a name="dataset-use"></a> # Dataset Use <a name="use-cases"></a> ## Use Cases The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation. <a name="usage-caveats"></a> ## Usage Caveats Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements. Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper. <a name="getting-started"></a> ## Getting Started This dataset is organized such that it can be naively loaded via Hugging Face datasets library. We recommend using streaming due to the large size of the files. Regular updates and data generation progress can be monitored through the OpenOrca repository on Hugging Face. # Citation ```bibtex @misc{OpenOrca, title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces}, author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {\url{https://https://huggingface.co/Open-Orca/OpenOrca}}, } ``` ```bibtex @misc{mukherjee2023orca, title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4}, author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah}, year={2023}, eprint={2306.02707}, archivePrefix={arXiv}, primaryClass={cs.CL} } ``` ```bibtex @misc{longpre2023flan, title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning}, author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts}, year={2023}, eprint={2301.13688}, archivePrefix={arXiv}, primaryClass={cs.AI} } ``` ```bibtex @misc{touvron2023llama, title={Llama 2: Open Foundation and Fine-Tuned Chat Models}, author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom}, year={2023}, eprint= arXiv 2307.09288 } @software{touvron2023llama, title={LLaMA: Open and Efficient Foundation Language Models}, author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume}, journal={arXiv preprint arXiv:2302.13971}, year={2023} } ```

## 目录 - [数据集概览](#dataset-summary) - [数据集归属](#dataset-attribution) - [支持任务与基准排行榜](#supported-tasks-and-leaderboards) - [语言](#languages) - [数据集结构](#dataset-structure) - [数据实例](#data-instances) - [数据字段](#data-fields) - [数据划分](#data-splits) - [数据集构建](#dataset-creation) - [遴选依据](#curation-rationale) - [源数据](#source-data) - [数据集使用](#dataset-use) - [使用场景](#use-cases) - [使用注意事项](#usage-caveats) - [快速入门](#getting-started) <p><h1>🐋 OpenOrca 数据集!🐋</h1></p> ![OpenOrca 标志](https://huggingface.co/datasets/Open-Orca/OpenOrca/resolve/main/OpenOrcaLogo.png "OpenOrca 标志") <a name="dataset-announcement"></a> 我们很高兴地宣布OpenOrca数据集正式发布!该数据集包含丰富的增强型FLAN合集(FLAN Collection)数据,尽可能贴合Orca论文[Orca论文](https://arxiv.org/abs/2306.02707)中所述的数据分布。它已助力生成性能优异的模型检查点(model checkpoint),是所有自然语言处理(NLP)研究者与开发者的宝贵资源! # 官方模型 ## 示例代码 python from modelscope import MsDataset from modelscope.utils.constant import DownloadMode ds = MsDataset.load('AI-ModelScope/OpenOrca',subset_name='default', split='train', download_mode=DownloadMode.FORCE_REDOWNLOAD) print(next(iter(ds))) ## Mistral-7B-OpenOrca 我们的[最新模型](https://huggingface.co/spaces/Open-Orca/Mistral-7B-OpenOrca),是首款综合性能优于所有此前发布的30B参数以下模型的7B参数模型。这款完全开源的7B参数模型实现了Llama2-70B-chat 98%的性能表现。 ## OpenOrca-Platypus2-13B 我们的[第三款模型](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B),是首款在HuggingFace基准排行榜上得分超越LLaMA1-65B的13B参数模型,与Platypus合作发布。 ## LlongOrca 7B & 13B * 我们的[首款7B参数模型](https://huggingface.co/Open-Orca/LlongOrca-7B-16k)基于LLongMA2训练,支持16,000个Token(Token)的上下文长度,发布时为同类7B参数长上下文模型中的性能榜首,整体性能达到当时最优模型的99%以上。 * [LlongOrca-13B-16k](https://huggingface.co/Open-Orca/LlongOrca-13B-16k)同样基于LLongMA2训练,发布时为同类13B参数长上下文模型中的性能榜首,整体性能达到当时最优模型的97%以上。 ## OpenOrcaxOpenChat-Preview2-13B 我们的[第二款模型](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B),证明我们的性能已超越Orca论文中报道的水平。该模型发布时位居性能榜首,后续被我们自研的OpenOrca-Platypus2-13B超越,与OpenChat合作发布。 ## OpenOrca-Preview1-13B [OpenOrca-Preview1-13B](https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B) 这款模型的训练耗时不足一天,成本低于200美元,仅使用了我们数据集的10%。发布时,它在BigBench-Hard与AGIEval基准上击败了当前的最优模型,实现了Orca论文中报道的性能提升约60%。 <a name="dataset-summary"></a> # 数据集概览 OpenOrca数据集是增强型FLAN合集(FLAN Collection)的集合,当前包含约100万条GPT-4生成结果与约320万条GPT-3.5生成结果。该数据集按照Orca论文中提出的数据分布进行了结构化整理,目前仅完成了完整预期数据集的一部分,我们仍在持续生成数据以扩展其规模。该数据主要用于自然语言处理领域的模型训练与评估。 <a name="dataset-attribution"></a> # 数据集归属 我们谨向以下作出卓越贡献与辛勤付出的贡献者致以特别感谢: Teknium WingLian/Caseus Eric Hartford NanoBit Pankaj Winddude Rohan http://AlignmentLab.ai团队: Autometa Entropi AtlasUnified NeverendingToast NanoBit WingLian/Caseus 当然,还要特别感谢TheBloke,他是整个社区的中坚力量。 特别感谢NanoBit与Caseus——Axolotl工具[Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)的开发者,他们为我们开发、训练manticore、minotaur等模型的平台提供了专业技术支持! 我们正寻求赞助商或合作伙伴,以助力我们将这些模型推至应有的规模。请通过以下社交渠道联系我们: http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx 想要可视化完整数据集?请访问我们的[Nomic Atlas地图](https://atlas.nomic.ai/map/c1b88b47-2d9b-47e0-9002-b80766792582/2560fd25-52fe-42f1-a58f-ff5eccc890d2)。 <a href="https://atlas.nomic.ai/map/c1b88b47-2d9b-47e0-9002-b80766792582/2560fd25-52fe-42f1-a58f-ff5eccc890d2"><img src="https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B/resolve/main/OpenOrca%20Nomic%20Atlas.png" alt="Atlas Nomic 数据集地图" width="400" height="400" /></a> <a name="supported-tasks-and-leaderboards"></a> # 支持任务与基准排行榜 该数据集支持语言建模、文本生成、文本增强等多种任务。它已助力生成多款高性能模型检查点,在我们的单元测试中展现出优异的性能。有关基准排行榜的更多信息将在可用时更新。 <a name="languages"></a> # 语言 该数据集的主要语言为英语。 <a name="dataset-structure"></a> # 数据集结构 <a name="data-instances"></a> ## 数据实例 本数据集中的每个实例均来自FLAN合集,我们通过将其中的问题提交至GPT-4或GPT-3.5 API进行增强,并将生成的回复存入响应字段。 <a name="data-fields"></a> ## 数据字段 数据字段包括: 1) "id":唯一的数字标识符,字段中包含'niv'、't0'、'cot'或'flan'之一,用于标识该'question'字段所属的FLAN合集子混合数据集来源。 2) "system_prompt":向GPT-3.5或GPT-4 API提交该数据点时使用的系统提示词 3) "question":FLAN合集中提供的问题条目 4) "response":通过调用GPT-3.5或GPT-4 API获得的该问题的响应结果。 <a name="data-splits"></a> ## 数据划分 该数据集未进行划分。 <a name="dataset-creation"></a> # 数据集构建 <a name="curation-rationale"></a> ## 遴选依据 创建该数据集的目的是为研究者与开发者提供增强型文本数据资源。该数据点主要用于增强核心FLAN合集数据,借助GPT-3.5与GPT-4的详细分步推理能力生成“推理轨迹”。这种“推理轨迹”增强方式已展现出卓越效果:使用该数据训练的LLaMA-13B模型,可在各类复杂推理任务上媲美甚至超越GPT-3.5,而此前所有参数低于100B的模型在这些任务上的性能均大幅落后。 <a name="source-data"></a> ## 源数据 本数据集采用与Orca论文中所述分布一致的方法生成,但存在以下例外: 1) FLAN合集中的思维链(CoT)数据量不足,无法生成论文所述的15万条零样本样本。我们推测该部分数据要么未公开,要么存在表述偏差,因此我们仅使用了现有的约7.5万条数据。 2) 我们使用了HuggingFace上由conceptofmind托管的预生成FLAN合集数据集,例如[conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original)。谷歌研究团队的官方FLAN合集仓库[官方FLAN合集仓库](https://github.com/google-research/FLAN/tree/main/flan/v2)将这些数据集列为首选数据源。但这些数据集仅是完整FLAN合集的子集,分别比flan2021与t0子混合数据集所需的数据量少约125万条与20万条。 综上,我们最终获得的数据点比Orca原论文中的少约150万条。完成完整数据集的构建仍是一项持续进行的工作。 <a name="dataset-use"></a> # 数据集使用 <a name="use-cases"></a> ## 使用场景 该数据集可用于语言理解、自然语言处理、机器学习模型训练以及模型性能评估等相关任务。 <a name="usage-caveats"></a> ## 使用注意事项 由于本数据集仍处于开发阶段,建议定期检查更新与优化版本。此外,使用该数据集时应遵循Orca论文中所述的指南与建议。 <a name="getting-started"></a> ## 快速入门 该数据集的组织方式使其可直接通过Hugging Face数据集库加载。由于数据集文件体积较大,我们建议使用流式加载方式。您可通过Hugging Face上的OpenOrca仓库跟踪数据集的定期更新与生成进度。 # 引用 bibtex @misc{OpenOrca, title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces}, author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {url{https://https://huggingface.co/Open-Orca/OpenOrca}}, } bibtex @misc{mukherjee2023orca, title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4}, author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah}, year={2023}, eprint={2306.02707}, archivePrefix={arXiv}, primaryClass={cs.CL} } bibtex @misc{longpre2023flan, title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning}, author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts}, year={2023}, eprint={2301.13688}, archivePrefix={arXiv}, primaryClass={cs.AI} } bibtex @misc{touvron2023llama, title={Llama 2: Open Foundation and Fine-Tuned Chat Models}, author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom}, year={2023}, eprint= arXiv 2307.09288 } @software{touvron2023llama, title={LLaMA: Open and Efficient Foundation Language Models}, author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timothée and Rozière, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume}, journal={arXiv preprint arXiv:2302.13971}, year={2023} }
提供机构:
maas
创建时间:
2023-12-04
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作