five

UltraChat-Matic

收藏
魔搭社区2025-05-26 更新2024-06-22 收录
下载链接:
https://modelscope.cn/datasets/thomas/UltraChat-Matic
下载链接
链接失效反馈
官方服务:
资源简介:
# ChatMatic ## with Over 80,000 multi-turn examples. UltraChat-Matic Dataset is built with mix of 4 other dataset and which carefully chosing best one from each one of them with using `GPT-4` and contains System messages Dialogs and conv_depth more than 5 with higher sequence lengths Used datasets are: 1. "oasst2" 2. "ise-uiuc/Magicoder-Evol-Instruct-110K" 3. "vicgalle/alpaca-gpt4" 4. "LDJnr/Capybara" ### From Capybara * Most tokens contained in this dataset are newly synthesized and did not exist prior online. * This leverages the Amplify-Instruct method(paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations. * Average context length per conversation is over 1,000 tokens and 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn) * Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as delving deep into obscure and advanced topics. * Aggresively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model" and "September 2021" and "I don't have personal beliefs" * ### More than 60000 Datas generated or selected by GPT4

# ChatMatic ## 超8万条多轮对话示例。 UltraChat-Matic 数据集由4个现有数据集混合构建而成,我们借助GPT-4(GPT-4)从每个数据源中精心筛选最优样本,该数据集包含系统消息对话、轮次深度超过5的对话,且序列长度更长。 所用数据集如下: 1. "oasst2" 2. "ise-uiuc/Magicoder-Evol-Instruct-110K" 3. "vicgalle/alpaca-gpt4" 4. "LDJnr/Capybara" ### 源自Capybara数据集 * 该数据集内绝大多数Token(Token)均为全新合成,此前未在互联网上出现过。 * 本数据集采用Amplify-Instruct方法(论文即将发表),将数千条高质量单轮种子样本扩展为兼具进阶性与深度的多轮对话。 * 单条对话的平均上下文长度超过1000个Token,且每个样本包含至少3轮对话——当前Hugging Face(HF)平台上多数用于微调的指令/对话数据集仅支持单轮对话。 * 每条对话均经过优化,以强化模型原生的知识理解能力,同时深入探讨冷门前沿议题。 * 经过严格过滤,移除了所有可能存在的显性说教/对齐样本,以及“作为AI语言模型”“2021年9月”“我没有个人观点”等常见不良表述。 ### 由GPT-4生成或筛选的样本超6万条
提供机构:
maas
创建时间:
2024-06-06
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作