
ChuGyouk/OpenOrca_Solar_filtered

Source: Hugging Face. Updated 2024-03-03; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ChuGyouk/OpenOrca_Solar_filtered
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: system_prompt
    dtype: string
  - name: question
    dtype: string
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 7053129247
    num_examples: 4120862
  download_size: 3997215736
  dataset_size: 7053129247
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
language:
- en
size_categories:
- 1M<n<10M
---

# TODO

For consistency, the column names should be changed to ['instruction', 'input', 'output'], the same as in [alpaca-gpt4](https://huggingface.co/datasets/c-s-ale/alpaca-gpt4-data).

# Dataset Summary

This is a filtered version of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset, based on the Solar 10.7B paper.

In this version, about 113k of the 4.2M OpenOrca examples are removed.

In the more conservative version [here](https://huggingface.co/datasets/ChuGyouk/OpenOrca_Solar_filtered_conservative), about 117k of the 4.2M OpenOrca examples are removed.

## Step 1

[FLAN data link broken](https://github.com/google-research/FLAN/issues/102)

This step is based on DataProvenanceInitiative/flan2021_submix_original, DataProvenanceInitiative/cot_submix_original, DataProvenanceInitiative/t0_submix_original, and DataProvenanceInitiative/niv2_submix_original.

Data was extracted using the same list of task names as in Table 8 of the Solar paper. The extracted data, especially its 'inputs' column, is the reference for deciding which examples to delete from OpenOrca in Step 2.

1. flan - Total data: 5362361, extracted data: 326213, task name: 'ai2_arcARCChallenge:1.0.0', 'ai2_arcARCEasy:1.0.0', 'hellaswag:1.1.0', 'drop:2.0.0', 'winogrande:1.1.0'
2. cot - Total data: 183848, extracted data: 18266, task name: 'cot_gsm8k', 'cot_gsm8k_ii'
3. t0 - Total data: 1650308, extracted data: 0, task name: (none)
4. niv - Total data: 10066896, extracted data: 28573, task name: 'task228_arc_answer_generation_easy', 'task229_arc_answer_generation_hard', 'task1389_hellaswag_completion'

### Caution

According to the official FLAN implementation on [github](https://github.com/google-research/FLAN/blob/main/flan/v2/flan_collection_info.csv), the drop and winogrande datasets are also included in niv. To be more conservative, the extracted niv data can therefore be increased to **129673** by also including the following task names: "task026_drop_question_generation", "task027_drop_answer_type_generation", "task028_drop_answer_generation", "task029_winogrande_full_object", "task030_winogrande_full_person", "task031_winogrande_question_generation_object", "task032_winogrande_question_generation_person", "task033_winogrande_answer_generation", "task034_winogrande_question_modification_object", "task035_winogrande_question_modification_person", "task1391_winogrande_easy_answer_generation".

## Step 2

The OpenOrca dataset is filtered using the result of Step 1. If the 'question' column of an OpenOrca example is identical to the 'inputs' column of a Step 1 example, that OpenOrca example is removed.

Of the 4233923 OpenOrca examples:

1. There are 1649259 flan examples. After filtering (based on the 326213 examples from Step 1), 1551217 are left, so 98042 are deleted.
2. There are 141695 cot examples. After filtering (based on the 18266 examples from Step 1), 127540 are left, so 14155 are deleted.
3. There are 2149573 t0 examples. No filter is applied.
4. There are 293396 niv examples. After filtering (based on the 28573 examples from Step 1), 292532 are left, so 864 are deleted.

### Caution

4+. To be more conservative, filtering the niv examples based on the 129673 examples from Step 1 leaves 288698, so 4698 are deleted. This version is available [here](https://huggingface.co/datasets/ChuGyouk/OpenOrca_Solar_filtered_conservative).
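The two steps can be sketched on toy data as follows. This is only an illustration of the exact-match rule described above, not the script used to build the dataset: the rows are made up, and the column name `task_name` for the submix task label is an assumption (the card names only the 'inputs' and 'question' columns).

```python
# Task names for the cot submix, as listed in Step 1.
COT_TASKS = {"cot_gsm8k", "cot_gsm8k_ii"}

# Made-up rows standing in for a DataProvenanceInitiative submix.
# ASSUMPTION: the task label is stored in a column called 'task_name'.
submix_rows = [
    {"task_name": "cot_gsm8k", "inputs": "Q: 2 + 3 = ?"},
    {"task_name": "cot_gsm8k_ii", "inputs": "Q: 10 - 4 = ?"},
    {"task_name": "some_other_task", "inputs": "Premise: it rains."},
]

# Made-up rows standing in for OpenOrca.
openorca_rows = [
    {"id": "cot.1", "question": "Q: 2 + 3 = ?", "response": "5"},
    {"id": "cot.2", "question": "Premise: it rains.", "response": "..."},
    {"id": "cot.3", "question": "Q: 2 + 3 = ? ", "response": "5"},
]


def extract_benchmark_inputs(rows, task_names):
    """Step 1: collect the 'inputs' text of every row whose task is listed."""
    return {row["inputs"] for row in rows if row["task_name"] in task_names}


def filter_openorca(rows, banned_inputs):
    """Step 2: drop rows whose 'question' exactly equals a collected input."""
    return [row for row in rows if row["question"] not in banned_inputs]


banned = extract_benchmark_inputs(submix_rows, COT_TASKS)
kept = filter_openorca(openorca_rows, banned)
print([row["id"] for row in kept])
```

Row `cot.3` differs from a benchmark input only by a trailing space and survives the filter, which shows that exact string equality does not catch near-duplicates.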
# Citation

```bibtex
@misc{OpenOrca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}},
}
```

```bibtex
@misc{kim2023solar,
  title={SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling},
  author={Dahyun Kim and Chanjun Park and Sanghoon Kim and Wonsung Lee and Wonho Song and Yunsu Kim and Hyeonwoo Kim and Yungi Kim and Hyeonju Lee and Jihoo Kim and Changbae Ahn and Seonghoon Yang and Sukyung Lee and Hyunbyung Park and Gyoungjin Gim and Mikyoung Cha and Hwalsuk Lee and Sunghun Kim},
  year={2023},
  eprint={2312.15166},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
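Provider:
ChuGyouk
Summary of original information

Dataset overview

Dataset information

  • Features:
    • id: string
    • system_prompt: string
    • question: string
    • response: string
  • Splits:
    • train: 4,120,862 examples, 7,053,129,247 bytes in total
  • Download size: 3,997,215,736 bytes
  • Dataset size: 7,053,129,247 bytes
  • Configs:
    • default: data files at data/train-*
  • License: MIT
  • Language: English
  • Size category: 1M<n<10M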
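The card's TODO proposes renaming the columns to ['instruction', 'input', 'output']. A minimal sketch of such a rename on a single record is shown below. The card does not say which existing column maps to which new name, so the mapping in `COLUMN_MAP` is an assumption made for illustration, and the sample record is made up.

```python
# ASSUMPTION: this source-to-target pairing is a guess; the card only lists
# the target names ['instruction', 'input', 'output'].
COLUMN_MAP = {
    "system_prompt": "instruction",
    "question": "input",
    "response": "output",
}


def to_alpaca_style(record):
    """Return a copy of the record with mapped keys renamed; other keys kept."""
    return {COLUMN_MAP.get(key, key): value for key, value in record.items()}


# Made-up record with the four features listed above.
sample = {
    "id": "flan.0",
    "system_prompt": "You are a helpful assistant.",
    "question": "What is the capital of France?",
    "response": "Paris.",
}

renamed = to_alpaca_style(sample)
print(sorted(renamed))
```

With the Hugging Face `datasets` library, the same mapping could be passed to `Dataset.rename_columns` instead of renaming records one by one.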

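Dataset versions

Data filtering steps

  1. Step 1:

    • Data is extracted from several DataProvenanceInitiative submixes, using the list of task names from the Solar paper.
    • Extraction counts:
      • flan: 5,362,361 total, 326,213 extracted
      • cot: 183,848 total, 18,266 extracted
      • t0: 1,650,308 total, 0 extracted
      • niv: 10,066,896 total, 28,573 extracted
  2. Step 2:

    • The OpenOrca dataset is filtered based on the Step 1 result.
    • A row is removed when the question column of OpenOrca is identical to the inputs column of the Step 1 result.
    • Filtering results:
      • flan: 1,649,259 total, 1,551,217 remaining, 98,042 removed
      • cot: 141,695 total, 127,540 remaining, 14,155 removed
      • t0: 2,149,573 total, no filtering
      • niv: 293,396 total, 292,532 remaining, 864 removed
    • More conservative filtering result:
      • niv: 288,698 remaining, 4,698 removed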
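The per-subset counts above can be cross-checked against the headline figures (4,233,923 OpenOrca examples, 4,120,862 in the train split, roughly 113k and 117k removed). The short script below is only an arithmetic check using the numbers already stated on this page.

```python
# (total in OpenOrca, removed) per subset, copied from the Step 2 results.
subsets = {
    "flan": (1_649_259, 98_042),
    "cot": (141_695, 14_155),
    "t0": (2_149_573, 0),
    "niv": (293_396, 864),
}

total = sum(count for count, _ in subsets.values())
removed = sum(dropped for _, dropped in subsets.values())
remaining = total - removed

# Conservative variant: niv removes 4,698 instead of 864.
conservative_removed = removed - subsets["niv"][1] + 4_698

print(total, removed, remaining, conservative_removed)
# -> 4233923 113061 4120862 116895
```

The remaining count matches the 4,120,862 examples reported for the train split, and the two removal totals round to the 113k and 117k quoted in the dataset summary.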

