shahules786/orca-best

Name: shahules786/orca-best
Creator: shahules786
Published: 2023-08-25 14:48:40
License: No description available

Hugging Face · Updated 2023-08-25 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/shahules786/orca-best

Resource description:

---
dataset_info:
  features:
  - name: cluster
    struct:
    - name: samples
      list:
      - name: input
        dtype: string
      - name: output
        dtype: string
  - name: source
    dtype: string
  - name: instruction
    dtype: string
  - name: num_samples
    dtype: int64
  splits:
  - name: train
    num_bytes: 900092818
    num_examples: 328906
  download_size: 462629849
  dataset_size: 900092818
---

## Best of Orca

This is a filtered version of the Orca GPT4 1M instructions. From repeated experiments and analysis, I concluded that the original dataset contains many low-quality instructions that contribute only to poor generalization. My solution was to filter the dataset and remove the unwanted samples. I applied two levels of filters:

1. Removed instructions with fewer than 100 tokens in the response.
2. Deduplicated the data, grouped by instruction type, using GTE embeddings and cosine similarity (threshold > 0.95).

After these two steps, the number of samples was reduced to one third of the original count. For selecting a sample from each cluster, I tried different methods, including random selection from a cluster.

We used this dataset to train multiple Open-Assistant models to confirm my hypothesis that data quality matters more than quantity. This dataset was used in some of our best models, including https://huggingface.co/OpenAssistant/llama2-70b-oasst-sft-v10 ⭐️

All models perform much better than models trained on the full ORCA samples.

## Credits

* This wouldn't be possible without the amazing work of Eric in recreating the ORCA dataset. Check it out: https://huggingface.co/datasets/ehartford/dolphin
* This dataset was created in association with the Open-Assistant team @jordanclive and @andreaskoepf

## Citations

```
@misc{Orca-best,
  title = {Orca-best: A filtered version of orca gpt4 dataset.},
  author = {Shahul Es},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/shahules786/orca-best/}},
}
```

Provided by:

shahules786

Summary of original information

Dataset overview

Dataset information

Features:
- cluster:
  - samples:
    - input: dtype string
    - output: dtype string
- source: dtype string
- instruction: dtype string
- num_samples: dtype int64
Splits:
- train:
  - Bytes: 900092818
  - Examples: 328906
Download size: 462629849 bytes
Dataset size: 900092818 bytes
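
The raw byte counts are easier to judge in larger units. The short sketch below only does arithmetic on the three figures listed above (dataset size, download size, number of examples); the variable names are illustrative and not part of the dataset card.

```python
# Figures copied from the dataset information above.
NUM_BYTES = 900_092_818       # dataset size (train split) in bytes
DOWNLOAD_BYTES = 462_629_849  # download size in bytes
NUM_EXAMPLES = 328_906        # examples in the train split

MIB = 1024 ** 2  # bytes per mebibyte

dataset_mib = NUM_BYTES / MIB
download_mib = DOWNLOAD_BYTES / MIB
bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

print(f"dataset size:      {dataset_mib:.1f} MiB")
print(f"download size:     {download_mib:.1f} MiB")
print(f"bytes per example: {bytes_per_example:.0f}")
print(f"download / dataset: {download_ratio:.2f}")
```

In round numbers this is about 858 MiB in memory, about 441 MiB to download, and roughly 2.7 KB per example on average.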

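Dataset processing

The dataset was filtered in two steps:
1. Removed instructions with fewer than 100 tokens in the response.
2. Deduplicated the data, grouped by instruction type, using GTE embeddings and cosine similarity (threshold > 0.95).
After these steps, the number of samples was reduced to one third of the original count.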
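
The two filtering steps can be sketched roughly as follows. This is a minimal illustration and not the author's actual pipeline: the whitespace token count stands in for whatever tokenizer was really used, the embeddings are passed in as plain vectors instead of being computed with a GTE model, and the greedy "keep the first, drop near-duplicates" rule is an assumption (the card says several selection methods were tried). All function names are hypothetical.

```python
import numpy as np

MIN_RESPONSE_TOKENS = 100    # threshold stated in the dataset card
SIMILARITY_THRESHOLD = 0.95  # threshold stated in the dataset card


def count_tokens(text):
    # Assumption: whitespace-separated words approximate the real tokenizer.
    return len(text.split())


def filter_short(samples):
    """Step 1: drop samples whose response has fewer than 100 tokens."""
    return [s for s in samples if count_tokens(s["output"]) >= MIN_RESPONSE_TOKENS]


def dedup_group(embeddings, threshold=SIMILARITY_THRESHOLD):
    """Step 2 for one group: indices of samples to keep.

    Greedy pass: a sample is kept only if its cosine similarity to every
    already-kept sample is at most `threshold`. Vectors must be non-zero.
    """
    if len(embeddings) == 0:
        return []
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    kept = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


def dedup_by_instruction(samples, embeddings):
    """Step 2 overall: deduplicate separately within each instruction group."""
    groups = {}
    for idx, sample in enumerate(samples):
        groups.setdefault(sample["instruction"], []).append(idx)
    kept = []
    for idxs in groups.values():
        local = dedup_group([embeddings[i] for i in idxs])
        kept.extend(idxs[k] for k in local)
    return sorted(kept)


# Toy data: made-up samples and 2-D vectors in place of real GTE embeddings.
toy_samples = [
    {"instruction": "explain", "output": "word " * 120},
    {"instruction": "explain", "output": "word " * 130},
    {"instruction": "summarize", "output": "word " * 110},
    {"instruction": "summarize", "output": "too short"},
]
long_enough = filter_short(toy_samples)
toy_embeddings = [[1.0, 0.0], [0.999, 0.02], [1.0, 0.0]]
kept_indices = dedup_by_instruction(long_enough, toy_embeddings)
print(len(long_enough), kept_indices)
```

Because similarity is only compared inside a group, the third toy sample survives even though its vector matches the first one: the two belong to different instruction types.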

Dataset usage

Used to train multiple Open-Assistant models to test the hypothesis that data quality matters more than quantity.
The dataset was used in some of the best models, including https://huggingface.co/OpenAssistant/llama2-70b-oasst-sft-v10.

Citation

@misc{Orca-best,
  title = {Orca-best: A filtered version of orca gpt4 dataset.},
  author = {Shahul Es},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/shahules786/orca-best/}},
}
