
shahules786/orca-best

Hugging Face · Updated 2023-08-25 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/shahules786/orca-best
Resource description:
---
dataset_info:
  features:
  - name: cluster
    struct:
    - name: samples
      list:
      - name: input
        dtype: string
      - name: output
        dtype: string
  - name: source
    dtype: string
  - name: instruction
    dtype: string
  - name: num_samples
    dtype: int64
  splits:
  - name: train
    num_bytes: 900092818
    num_examples: 328906
  download_size: 462629849
  dataset_size: 900092818
---

## Best of Orca

This is a filtered version of the Orca GPT4 1M instructions. From repeated experiments and analysis, I concluded that the original dataset contains many low-quality instructions that contribute only to poor generalization. My solution was to filter the dataset and remove the unwanted samples. I applied two levels of filtering:

1. Removed instructions with fewer than 100 tokens in the response.
2. Deduplicated the data, grouped by instruction type, using GTE embeddings and cosine similarity (threshold > 0.95).

After these two steps, the number of samples was reduced to one third of the original count. For selecting a sample from each cluster, I tried different methods, including random selection from a cluster.

We used this dataset to train multiple Open-Assistant models to confirm my hypothesis that data quality matters more than quantity. This dataset was used in some of our best models, including https://huggingface.co/OpenAssistant/llama2-70b-oasst-sft-v10

⭐️ All models perform much better than models trained on the full ORCA samples.

## Credits

* This wouldn't be possible without the amazing work of Eric in recreating the ORCA dataset. Check it out: https://huggingface.co/datasets/ehartford/dolphin
* This dataset was created in association with the Open-Assistant team @jordanclive and @andreaskoepf

## Citations

```
@misc{Orca-best,
  title = {Orca-best: A filtered version of orca gpt4 dataset.},
  author = {Shahul Es},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/shahules786/orca-best/}},
}
```
Provider:
shahules786
Summary of original information

Dataset overview

Dataset information

  • Features:

    • cluster:
      • samples:
        • input: data type string
        • output: data type string
    • source: data type string
    • instruction: data type string
    • num_samples: data type int64
  • Splits:

    • train:
      • Bytes: 900092818
      • Examples: 328906
  • Download size: 462629849 bytes

  • Dataset size: 900092818 bytes
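
To put the size figures above in proportion, the following minimal sketch derives two ratios from them: the average number of bytes per example in the train split, and the download size as a fraction of the dataset size. The constants are copied from the listing; the function names are illustrative and not part of any library.

```python
# Figures copied from the dataset information above.
NUM_BYTES = 900_092_818       # train split size (equal to the dataset size)
NUM_EXAMPLES = 328_906        # examples in the train split
DOWNLOAD_SIZE = 462_629_849   # size of the files that are downloaded


def avg_bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average size of one example in bytes."""
    return num_bytes / num_examples


def download_ratio(download_size: int, dataset_size: int) -> float:
    """Download size as a fraction of the dataset size."""
    return download_size / dataset_size


if __name__ == "__main__":
    avg = avg_bytes_per_example(NUM_BYTES, NUM_EXAMPLES)
    ratio = download_ratio(DOWNLOAD_SIZE, NUM_BYTES)
    print(f"{avg:.0f} bytes per example on average")
    print(f"download is {ratio:.0%} of the dataset size")
```

Each example averages roughly 2.7 KB, and the download is about half the dataset size.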

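Dataset processing

  • The dataset was filtered in two steps:
    1. Remove instructions whose response has fewer than 100 tokens.
    2. Deduplicate the data, grouped by instruction type, using GTE embeddings and cosine similarity (threshold > 0.95).
  • After these steps, the number of samples was reduced to 1/3 of the original count.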
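
The two filtering steps can be sketched as below. This is a minimal illustration under stated assumptions, not the author's actual pipeline:

- Whitespace splitting stands in for the unspecified tokenizer.
- `toy_embed` is a hypothetical stand-in for a GTE embedding model.
- "Instruction type" is taken to mean identical `instruction` strings.
- The first sample seen in each near-duplicate cluster is kept. The source says several selection methods were tried, including random selection.

```python
import math
from collections import defaultdict

MIN_RESPONSE_TOKENS = 100  # step 1 threshold, from the description
SIM_THRESHOLD = 0.95       # step 2 threshold, from the description


def count_tokens(text):
    # Assumption: whitespace tokens approximate the real tokenizer.
    return len(text.split())


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_embed(text):
    # Hypothetical stand-in for a GTE model: counts of the letters a, b, c.
    return [text.count(ch) for ch in "abc"]


def filter_short(samples, min_tokens=MIN_RESPONSE_TOKENS):
    """Step 1: drop samples whose response has fewer than min_tokens tokens."""
    return [s for s in samples if count_tokens(s["output"]) >= min_tokens]


def dedup_by_instruction(samples, embed=toy_embed, threshold=SIM_THRESHOLD):
    """Step 2: within each instruction group, drop near-duplicate inputs."""
    groups = defaultdict(list)
    for s in samples:
        groups[s["instruction"]].append(s)

    kept = []
    for group in groups.values():
        reps = []  # (embedding, sample) pairs kept so far in this group
        for s in group:
            vec = embed(s["input"])
            # Keep the sample only if no kept one is more similar than the threshold.
            if all(cosine(vec, rep_vec) <= threshold for rep_vec, _ in reps):
                reps.append((vec, s))
        kept.extend(s for _, s in reps)
    return kept
```

Calling `dedup_by_instruction(filter_short(samples))` applies both steps in order. The greedy comparison is quadratic within each group, which is fine for a sketch but would need an approximate nearest-neighbour index at the scale of the original dataset.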

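Dataset uses

  • Used to train multiple Open-Assistant models to test the hypothesis that data quality matters more than quantity.
  • The dataset was used in some of the team's best models, including https://huggingface.co/OpenAssistant/llama2-70b-oasst-sft-v10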

Citation

@misc{Orca-best,
  title = {Orca-best: A filtered version of orca gpt4 dataset.},
  author = {Shahul Es},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/shahules786/orca-best/}},
}
