StudentLLM/Open-Wyvern-74k

Name: StudentLLM/Open-Wyvern-74k
Creator: StudentLLM
Published: 2023-09-06 00:24:42
License: No description available

Source: Hugging Face. Updated 2023-09-06; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/StudentLLM/Open-Wyvern-74k

Resource description:

---
task_categories:
- text-classification
- question-answering
- summarization
- conversational
- text-generation
language:
- en
size_categories:
- 10K<n<100K
---

<p align="center"><img src="https://cdn-uploads.huggingface.co/production/uploads/63e087b6a98d931aa90c1b9c/jm4fCY9DMGDxDRyhIeDZh.jpeg"></p>

# The Wyvern 🐉 Dataset

We introduce the **Wyvern 🐉** dataset, a new combination of four datasets: [Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca), [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), [airoboros](https://huggingface.co/datasets/jondurbin/airoboros-2.1), and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k).

We integrated high-quality datasets on the premise that quality matters more than quantity. We also removed duplicates across the source datasets to improve quality, because each of them contains some data contamination. See below for more details.

# Dataset Details

The **Wyvern 🐉** dataset is a mixture of several datasets (Open-Orca, Open-Platypus, airoboros, Dolly), as mentioned above. Its composition is as follows; the GPT-4-answered portion of Open-Orca was sampled using stratified sampling.

- **Open-Platypus(100%) + airoboros(100%) + Open-Orca(GPT-4)(5%)(stratified sampled) + Dolly-15k(100%)**

|Dataset Name|Sampled Size (ratio)|Deduped Size|License Type|
|---|---|---|---|
|[Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus)|24.9k (100%)|16.8k|None|
|[airoboros](https://huggingface.co/datasets/jondurbin/airoboros-2.1)|36.3k (100%)|11k|apache-2.0|
|[Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca)|999.9k → 49.7k (5%)|35.6k|MIT|
|[Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)|15k (100%)|11k|cc-by-sa-3.0|

After deduplication, the size of the combined dataset drops from 125k to 74k (125k → 74k).

# Data Deduplication

We referred to Open-Platypus's [data similarity check code](https://github.com/arielnlee/Platypus/blob/main/data_pipeline/data_similarity.ipynb) to remove duplicated data. The specific deduplication code will be uploaded soon!

# Citations

```
@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arxiv:2308.07317},
  year={2023}
}
```

```
@misc{OpenOrca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}},
}
```

```
@online{DatabricksBlog2023DollyV2,
  author = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
  title = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
  year = {2023},
  url = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
  urldate = {2023-06-30}
}
```

Provided by:

StudentLLM

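Summary of original information

Dataset overview

Dataset name

Wyvern 🐉 dataset

Task categories

Text classification
Question answering
Summarization
Conversational
Text generation

Language

English

Dataset size

10K<n<100K

Dataset composition

The Wyvern 🐉 dataset is a mixture of several datasets, specifically:

Open-Platypus(100%) + airoboros(100%) + Open-Orca(GPT-4)(5%)(stratified sampled) + Dolly-15k(100%)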
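
The card does not say which field the Open-Orca sample was stratified on or which tool was used. As a minimal sketch of what a 5% stratified sample could look like, the function below groups rows by a key and draws the same fraction from each group. The `source` field, the `stratified_sample` helper, and the toy rows are all hypothetical and only illustrate the idea.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw roughly `fraction` of the rows from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for label in sorted(strata):
        group = strata[label]
        # Keep at least one row per stratum so small groups are not dropped.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: 200 "cot" rows and 800 "flan" rows (hypothetical stratum labels).
rows = [{"id": i, "source": "cot" if i % 5 == 0 else "flan"} for i in range(1000)]
subset = stratified_sample(rows, key=lambda r: r["source"], fraction=0.05)
```

Sampling per stratum keeps the proportions of the groups in the subset close to those of the full data, which a plain uniform sample only does on average.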

Dataset details

Dataset name	Sampled size (ratio)	Deduped size	License type
Open-Platypus	24.9k(100%)	16.8k	None
airoboros	36.3k(100%)	11k	apache-2.0
Open-Orca	999.9k → 49.7k(5%)	35.6k	MIT
Dolly-15k	15k(100%)	11k	cc-by-sa-3.0
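
As a quick arithmetic check on the table above, the per-source sizes can be summed and compared with the reported totals. The numbers are copied from the table; the variable names are only illustrative.

```python
# Sizes in thousands of rows, copied from the table: (sampled, deduped).
sizes_k = {
    "Open-Platypus": (24.9, 16.8),
    "airoboros": (36.3, 11.0),
    "Open-Orca": (49.7, 35.6),
    "Dolly-15k": (15.0, 11.0),
}

sampled_total = sum(sampled for sampled, _ in sizes_k.values())
deduped_total = sum(deduped for _, deduped in sizes_k.values())
print(f"sampled: {sampled_total:.1f}k, deduped: {deduped_total:.1f}k")
# prints "sampled: 125.9k, deduped: 74.4k"

# Share of each source that survived deduplication.
for name, (sampled, deduped) in sizes_k.items():
    print(f"{name}: kept {deduped / sampled:.0%}")
```

The sums agree with the rounded 125k → 74k figures quoted for the combined dataset.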

Data deduplication

After deduplication, the total size of the dataset decreased from 125k to 74k.
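
The authors' deduplication code is not published in this card, so the following is only a rough illustration of similarity-based deduplication. It swaps in a simple bag-of-words cosine similarity from the standard library, which is a stand-in and not the method of the referenced Platypus notebook. The `dedupe` helper and the 0.8 threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def dedupe(texts, threshold=0.8):
    """Keep the first occurrence of each text; drop later near-duplicates."""
    kept, kept_vecs = [], []
    for text in texts:
        vec = bow(text)
        # Compare against everything kept so far (quadratic, fine for a sketch).
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept


questions = [
    "What is the capital of France?",
    "what is the capital of france",
    "Explain how binary search works.",
]
unique = dedupe(questions)
```

Here the second question differs from the first only in case and punctuation, so it is dropped, while the unrelated third question is kept. A pipeline at the scale of 125k rows would need a faster similarity search than this pairwise loop.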

