
StudentLLM/Open-Wyvern-74k

Source: Hugging Face · Updated: 2023-09-06 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/StudentLLM/Open-Wyvern-74k
Resource description:
---
task_categories:
- text-classification
- question-answering
- summarization
- conversational
- text-generation
language:
- en
size_categories:
- 10K<n<100K
---

<p align="center"><img src="https://cdn-uploads.huggingface.co/production/uploads/63e087b6a98d931aa90c1b9c/jm4fCY9DMGDxDRyhIeDZh.jpeg"></p>

# The Wyvern 🐉 Dataset

Let's introduce the **Wyvern 🐉** dataset, a new combination of existing datasets ([Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca), [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), [airoboros](https://huggingface.co/datasets/jondurbin/airoboros-2.1), [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k))! We integrated high-quality datasets on the premise that quality matters more than quantity. In addition, because each dataset has some data contamination, we deduplicated the combined data to improve its quality. Please see below for more details about the dataset!

# Dataset Details

The **Wyvern 🐉** dataset is a mixture of several datasets (Open-Orca, Open-Platypus, airoboros, Dolly), as mentioned above. The specific configuration is as follows. (The GPT-4-answered portion of Open-Orca was subsampled using stratified sampling.)

- **Open-Platypus (100%) + airoboros (100%) + Open-Orca (GPT-4) (5%, stratified sampling) + Dolly-15k (100%)**

|Dataset Name|Sampled Size (ratio)|Deduped Size|License Type|
|---|---|---|---|
|[Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus)|24.9k (100%)|16.8k|None|
|[airoboros](https://huggingface.co/datasets/jondurbin/airoboros-2.1)|36.3k (100%)|11k|apache-2.0|
|[Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca)|999.9k → 49.7k (5%)|35.6k|MIT|
|[Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)|15k (100%)|11k|cc-by-sa-3.0|

After deduplication, the size of the combined dataset drops from 125k to 74k!

# Data Deduplication

We followed Open-Platypus's [data similarity check code](https://github.com/arielnlee/Platypus/blob/main/data_pipeline/data_similarity.ipynb) to remove duplicated data. The specific deduplication code will be uploaded soon!

# Citations

```
@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arxiv:2308.07317},
  year={2023}
}
```

```
@misc{OpenOrca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}},
}
```

```
@online{DatabricksBlog2023DollyV2,
  author = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
  title = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
  year = {2023},
  url = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
  urldate = {2023-06-30}
}
```
Provider:
StudentLLM
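Summary of original information

Dataset overview

Dataset name

Wyvern 🐉 dataset

Task categories

  • Text classification
  • Question answering
  • Summarization
  • Conversational
  • Text generation

Language

  • English

Dataset size

  • 10K<n<100K

Dataset composition

The Wyvern 🐉 dataset is a mixture of several datasets, specifically:

  • Open-Platypus (100%) + airoboros (100%) + Open-Orca (GPT-4) (5%, stratified sampling) + Dolly-15k (100%)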
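
The card does not say which field the Open-Orca subset was stratified on, so the following is only a minimal sketch of stratified sampling in general. The `source` field, the toy rows, and the `stratified_sample` helper are hypothetical and not taken from the authors' pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw about `fraction` of the rows from each stratum defined by `key`."""
    rng = random.Random(seed)

    # Group rows by stratum label.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for label in sorted(strata):
        group = strata[label]
        # Keep at least one row per stratum so small strata are not lost.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: 1,000 rows spread over three hypothetical strata.
labels = ["a"] * 600 + ["b"] * 300 + ["c"] * 100
rows = [{"id": i, "source": s} for i, s in enumerate(labels)]

subset = stratified_sample(rows, key=lambda r: r["source"], fraction=0.05)
print(len(subset))  # 50
```

Sampling within each stratum keeps the subset's proportions close to those of the full data (here 30, 15, and 5 rows), which a plain random 5% draw would only approximate.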

Dataset details

|Dataset Name|Sampled Size (ratio)|Deduped Size|License Type|
|---|---|---|---|
|Open-Platypus|24.9k (100%)|16.8k|None|
|airoboros|36.3k (100%)|11k|apache-2.0|
|Open-Orca|999.9k → 49.7k (5%)|35.6k|MIT|
|Dolly-15k|15k (100%)|11k|cc-by-sa-3.0|
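
As a quick cross-check, the per-dataset sizes in the table above can be summed to see how they relate to the overall 125k → 74k figure. The numbers below are copied from the table (in thousands); the variable names are illustrative only.

```python
# (sampled size, deduplicated size) in thousands, copied from the table.
sizes_k = {
    "Open-Platypus": (24.9, 16.8),
    "airoboros": (36.3, 11.0),
    "Open-Orca": (49.7, 35.6),
    "Dolly-15k": (15.0, 11.0),
}

sampled_total = sum(sampled for sampled, _ in sizes_k.values())
deduped_total = sum(deduped for _, deduped in sizes_k.values())

# Fraction of each source that survives deduplication.
retention = {name: deduped / sampled for name, (sampled, deduped) in sizes_k.items()}

print(f"sampled: {sampled_total:.1f}k, deduped: {deduped_total:.1f}k")
for name, ratio in retention.items():
    print(f"{name}: {ratio:.1%} retained")
```

The totals come to 125.9k and 74.4k, in line with the quoted 125k and 74k. airoboros loses the most to deduplication (about 30% retained), while the other three sources keep roughly 67% to 73%.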

Data deduplication

After deduplication, the total size of the dataset dropped from 125k to 74k.
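
The authors' deduplication code has not been published, so the sketch below is only an illustration of similarity-based deduplication, not their method. It substitutes word-set Jaccard similarity from the standard library for whatever similarity measure the referenced Open-Platypus notebook uses; the `deduplicate` helper, the 0.8 threshold, and the sample instructions are all assumptions.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(instructions, threshold=0.8):
    """Keep an instruction only if it is not too similar to one already kept."""
    kept, kept_tokens = [], []
    for text in instructions:
        tok = tokens(text)
        if any(jaccard(tok, seen) >= threshold for seen in kept_tokens):
            continue  # near-duplicate of an earlier instruction
        kept.append(text)
        kept_tokens.append(tok)
    return kept


samples = [
    "What is the capital of France?",
    "what is the capital of France",  # same words, different casing and punctuation
    "Summarize the following article in two sentences.",
    "What is the capital of Germany?",  # similar, but below the threshold
]

unique = deduplicate(samples)
print(len(unique))  # 3
```

This pairwise comparison is quadratic in the number of kept items, so it suits small examples only; at the scale of 125k rows, an embedding index or MinHash-style approach would be the more practical choice.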

