shahules786/orca-best
Hugging Face | Updated 2023-08-25 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/shahules786/orca-best
Resource description:
---
dataset_info:
  features:
  - name: cluster
    struct:
    - name: samples
      list:
      - name: input
        dtype: string
      - name: output
        dtype: string
      - name: source
        dtype: string
    - name: instruction
      dtype: string
    - name: num_samples
      dtype: int64
  splits:
  - name: train
    num_bytes: 900092818
    num_examples: 328906
  download_size: 462629849
  dataset_size: 900092818
---
## Best of Orca
This is a filtered version of the Orca GPT4 1M instructions dataset. From repeated experiments and analysis, I concluded that the original dataset
contains many low-quality instructions, which contribute only to poor generalization.
The solution I came up with is to filter the dataset and remove the unwanted samples. I applied two levels of filters:
1. Removed instructions with fewer than 100 tokens in the response.
2. Deduplicated the data, grouped by instruction type, using GTE embeddings and cosine similarity (threshold > 0.95).
After these two steps, the number of samples was reduced to one third of the original count.
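
The two filters described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: whitespace splitting stands in for whatever tokenizer was actually used, `embed` is a hypothetical caller-supplied function standing in for a GTE embedding model, and the greedy "keep the first sample, drop later near-duplicates" strategy is an assumption. The field names `instruction`, `input`, and `output` come from the dataset metadata.

```python
import math
from collections import defaultdict

MIN_RESPONSE_TOKENS = 100    # filter 1: minimum response length
SIMILARITY_THRESHOLD = 0.95  # filter 2: cosine similarity above this counts as a duplicate


def count_tokens(text):
    # Assumption: whitespace tokens approximate the real tokenizer's count.
    return len(text.split())


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_short(samples):
    """Filter 1: drop samples whose response has fewer than 100 tokens."""
    return [s for s in samples if count_tokens(s["output"]) >= MIN_RESPONSE_TOKENS]


def deduplicate(samples, embed):
    """Filter 2: within each instruction group, drop near-duplicate inputs.

    `embed` maps a string to a vector (a stand-in for a GTE model).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["instruction"]].append(sample)

    kept = []
    for group in groups.values():
        representatives = []  # (embedding, sample) pairs kept so far in this group
        for sample in group:
            vector = embed(sample["input"])
            is_duplicate = any(
                cosine_similarity(vector, rep_vector) > SIMILARITY_THRESHOLD
                for rep_vector, _ in representatives
            )
            if not is_duplicate:
                representatives.append((vector, sample))
        kept.extend(sample for _, sample in representatives)
    return kept
```

Comparing each sample against every kept representative is quadratic within a group; at the scale of the original dataset an approximate nearest-neighbour index would likely be needed instead.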
For selecting a sample from each cluster, I tried different methods, including random selection from a cluster.
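
As an illustration of the random-selection option, the sketch below picks one sample per cluster from records shaped like the `dataset_info` metadata. It assumes that `instruction` sits at the cluster level next to `samples`; the function name `pick_random_samples` and the fixed seed are hypothetical choices for this example, not part of the dataset's tooling.

```python
import random


def pick_random_samples(records, seed=0):
    """Pick one sample uniformly at random from each cluster record."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the selection reproducible
    rows = []
    for record in records:
        cluster = record["cluster"]
        if not cluster["samples"]:
            continue  # skip empty clusters
        chosen = rng.choice(cluster["samples"])
        rows.append(
            {
                # Assumption: the instruction is stored once per cluster.
                "instruction": cluster.get("instruction", ""),
                "input": chosen["input"],
                "output": chosen["output"],
            }
        )
    return rows
```

The resulting rows are flat instruction/input/output dictionaries, which is the shape most fine-tuning pipelines expect.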
We used this dataset to train multiple Open-Assistant models to confirm my hypothesis that data quality matters more than quantity.
This dataset was used in some of our best models, including https://huggingface.co/OpenAssistant/llama2-70b-oasst-sft-v10
⭐️ All models perform much better than models trained on full ORCA samples.
## Credits
* This wouldn't be possible without the amazing work of Eric in recreating the ORCA dataset. Check it out:
https://huggingface.co/datasets/ehartford/dolphin
* This dataset was created in association with the Open-Assistant team @jordanclive and @andreaskoepf
## Citations
```
@misc{Orca-best,
title = {Orca-best: A filtered version of orca gpt4 dataset.},
author = {Shahul Es},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://huggingface.co/datasets/shahules786/orca-best/}},
}
```
Provider:
shahules786
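Summary of original information
Dataset overview
Dataset information
- Features:
  - cluster (struct)
    - samples (list)
      - input: string
      - output: string
      - source: string
    - instruction: string
    - num_samples: int64
- Splits:
  - train: 900092818 bytes, 328906 examples
- Download size: 462629849 bytes
- Dataset size: 900092818 bytes
Dataset processing
- The dataset was filtered in two steps:
  - Removed instructions with fewer than 100 tokens in the response.
  - Deduplicated by instruction type using GTE embeddings and cosine similarity (threshold > 0.95).
- After these steps, the number of samples was reduced to 1/3 of the original count.
Dataset usage
- Used to train multiple Open-Assistant models to test the hypothesis that data quality matters more than quantity.
- Used in some of the best models, including https://huggingface.co/OpenAssistant/llama2-70b-oasst-sft-v10.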



