
shahules786/orca-chat

Source: Hugging Face. Updated 2023-07-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/shahules786/orca-chat
Resource description:
---
license: apache-2.0
---

## ORCA-Chat

A high-quality explanation-style chat dataset.

The ORCA dataset is cool, but it cannot be used directly to fine-tune chat models with context lengths above 4k, because it has trivial samples with tokens above 4k. It also has a large number of redundant instructions, which degrade its quality and increase the compute time when fine-tuning models on it.

Enter ORCA-Chat! This is a cleaned, pruned, and clustered version of ORCA that forms a conversation-style dataset. The process involves removing samples with very high similarity and grouping instructions to form conversations.

![](https://github.com/explodinggradients/ragas/assets/25312635/fcea532d-e0a6-4030-a14b-42d65df86a10)

## What next?

I will release 16/32k versions of this soon!

## Credits

* This wouldn't be possible without the amazing work of Eric in recreating the ORCA dataset. Check it out: https://huggingface.co/datasets/ehartford/dolphin
* This dataset was created in association with the Open-Assistant team, @jordanclive and @andreaskoepf.

## Citations

```
@misc{Orca-Chat,
  title = {Orca-chat: A high-quality explanation-style chat dataset.},
  author = {Shahul Es},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/shahules786/orca-chat/}},
}
```
Provided by:
shahules786
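Summary of the original information

ORCA-Chat dataset overview

Dataset description

  • Name: ORCA-Chat
  • Type: high-quality explanation-style chat dataset
  • Features: cleaned, pruned, and clustered to form a conversation-style dataset. Highly similar samples were removed, and instructions were grouped to build conversations.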
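
The dataset card does not say which similarity measure or grouping procedure was used, so the following is only a minimal sketch of the two steps it names: dropping near-duplicate instructions, then grouping related ones into conversations. Word-set Jaccard overlap stands in for whatever similarity the author actually used, and the function names, the `instruction`/`response` field names, and both thresholds are assumptions made for illustration.

```python
from typing import Dict, List

Sample = Dict[str, str]


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two strings, from 0.0 to 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(samples: List[Sample], threshold: float = 0.8) -> List[Sample]:
    """Keep a sample only if its instruction is below `threshold`
    similarity to every instruction already kept (greedy, order-dependent)."""
    kept: List[Sample] = []
    for s in samples:
        if all(jaccard(s["instruction"], k["instruction"]) < threshold for k in kept):
            kept.append(s)
    return kept


def group_into_conversations(samples: List[Sample], threshold: float = 0.5) -> List[List[Sample]]:
    """Greedy grouping: a sample joins the first conversation whose first
    turn is at least `threshold` similar to it, else it starts a new one."""
    conversations: List[List[Sample]] = []
    for s in samples:
        for conv in conversations:
            if jaccard(s["instruction"], conv[0]["instruction"]) >= threshold:
                conv.append(s)
                break
        else:
            conversations.append([s])
    return conversations


# Hypothetical toy data, not taken from the dataset.
samples = [
    {"instruction": "Explain how photosynthesis works in plants", "response": "A"},
    {"instruction": "Explain how photosynthesis works in plants", "response": "B"},
    {"instruction": "Explain how respiration works in plants", "response": "C"},
    {"instruction": "Translate this sentence into French", "response": "D"},
]

kept = drop_near_duplicates(samples)
conversations = group_into_conversations(kept)
```

In this toy run the exact duplicate is dropped, the two plant questions end up in one conversation, and the translation request forms its own. A real pipeline would more plausibly use sentence embeddings and a proper clustering algorithm; the greedy loops here are quadratic and only meant to show the shape of the process.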

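Dataset usage

  • Limitation: the original ORCA dataset contains trivial samples above 4k tokens and a large number of redundant instructions, so it is not suitable for directly fine-tuning chat models with context lengths above 4k.
  • Improvement: ORCA-Chat removes these problematic samples and instructions, which improves dataset quality and reduces compute time during fine-tuning.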
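
The card does not describe how conversation length relates to the context window. As a rough sketch, assuming the aim is to fill a fixed token budget with consecutive turns, the packing step could look like the following. Whitespace splitting is a stand-in for a real tokenizer, and `pack_turns`, `count_tokens`, and the field names are hypothetical.

```python
from typing import Dict, List

Sample = Dict[str, str]


def count_tokens(text: str) -> int:
    """Crude token count; a real pipeline would use the model's tokenizer."""
    return len(text.split())


def sample_tokens(sample: Sample) -> int:
    """Tokens in one turn: instruction plus response."""
    return count_tokens(sample["instruction"]) + count_tokens(sample["response"])


def pack_turns(samples: List[Sample], max_tokens: int = 4096) -> List[List[Sample]]:
    """Append turns to the current conversation until the next turn would
    exceed `max_tokens`, then start a new conversation. A single turn that
    is larger than the budget ends up alone in its own conversation."""
    conversations: List[List[Sample]] = []
    current: List[Sample] = []
    used = 0
    for s in samples:
        n = sample_tokens(s)
        if current and used + n > max_tokens:
            conversations.append(current)
            current, used = [], 0
        current.append(s)
        used += n
    if current:
        conversations.append(current)
    return conversations


# Hypothetical toy turns with 3, 4, 10, and 2 tokens, packed into a budget of 8.
turns = [
    {"instruction": "a b", "response": "c"},
    {"instruction": "a b", "response": "c d"},
    {"instruction": " ".join(["w"] * 9), "response": "z"},
    {"instruction": "q", "response": "r"},
]
packed = pack_turns(turns, max_tokens=8)
```

Raising `max_tokens` to 16k or 32k is the only change this sketch would need for longer-context variants, though whether the planned releases work this way is not stated in the card.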

License

  • License: Apache-2.0

Future plans

  • Update: 16/32k versions of the dataset are planned.

Contributors

  • Main contributor: Eric, whose work was essential to recreating the ORCA dataset.
  • Collaborating team: the Open-Assistant team, including @jordanclive and @andreaskoepf.

Citation

@misc{Orca-Chat,
  title = {Orca-chat: A high-quality explanation-style chat dataset.},
  author = {Shahul Es},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/shahules786/orca-chat/}},
}
