UltraChat-Matic

Name: UltraChat-Matic
Creator: maas
Published: 2025-05-26 00:57:32
License: 暂无描述

魔搭社区2025-05-26 更新2024-06-22 收录

下载链接：

https://modelscope.cn/datasets/thomas/UltraChat-Matic

下载链接

链接失效反馈

官方服务：

资源简介：

# ChatMatic ## with Over 80,000 multi-turn examples. UltraChat-Matic Dataset is built with mix of 4 other dataset and which carefully chosing best one from each one of them with using `GPT-4` and contains System messages Dialogs and conv_depth more than 5 with higher sequence lengths Used datasets are: 1. "oasst2" 2. "ise-uiuc/Magicoder-Evol-Instruct-110K" 3. "vicgalle/alpaca-gpt4" 4. "LDJnr/Capybara" ### From Capybara * Most tokens contained in this dataset are newly synthesized and did not exist prior online. * This leverages the Amplify-Instruct method(paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations. * Average context length per conversation is over 1,000 tokens and 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn) * Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as delving deep into obscure and advanced topics. * Aggresively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model" and "September 2021" and "I don't have personal beliefs" * ### More than 60000 Datas generated or selected by GPT4

# ChatMatic ## 超8万条多轮对话示例。 UltraChat-Matic 数据集由4个现有数据集混合构建而成，我们借助GPT-4（GPT-4）从每个数据源中精心筛选最优样本，该数据集包含系统消息对话、轮次深度超过5的对话，且序列长度更长。所用数据集如下： 1. "oasst2" 2. "ise-uiuc/Magicoder-Evol-Instruct-110K" 3. "vicgalle/alpaca-gpt4" 4. "LDJnr/Capybara" ### 源自Capybara数据集 * 该数据集内绝大多数Token（Token）均为全新合成，此前未在互联网上出现过。 * 本数据集采用Amplify-Instruct方法（论文即将发表），将数千条高质量单轮种子样本扩展为兼具进阶性与深度的多轮对话。 * 单条对话的平均上下文长度超过1000个Token，且每个样本包含至少3轮对话——当前Hugging Face（HF）平台上多数用于微调的指令/对话数据集仅支持单轮对话。 * 每条对话均经过优化，以强化模型原生的知识理解能力，同时深入探讨冷门前沿议题。 * 经过严格过滤，移除了所有可能存在的显性说教/对齐样本，以及“作为AI语言模型”“2021年9月”“我没有个人观点”等常见不良表述。 ### 由GPT-4生成或筛选的样本超6万条

提供机构：

maas

创建时间：

2024-06-06

5,000+

优质数据集

54 个

任务类型

进入经典数据集