UltraChat-Matic
收藏魔搭社区2025-05-26 更新2024-06-22 收录
下载链接:
https://modelscope.cn/datasets/thomas/UltraChat-Matic
下载链接
链接失效反馈官方服务:
资源简介:
# ChatMatic
## with Over 80,000 multi-turn examples.
UltraChat-Matic Dataset is built with mix of 4 other dataset and which carefully chosing best one from each one of them with using `GPT-4`
and contains
System messages Dialogs and conv_depth more than 5 with higher sequence lengths
Used datasets are:
1. "oasst2"
2. "ise-uiuc/Magicoder-Evol-Instruct-110K"
3. "vicgalle/alpaca-gpt4"
4. "LDJnr/Capybara"
### From Capybara
* Most tokens contained in this dataset are newly synthesized and did not exist prior online.
* This leverages the Amplify-Instruct method(paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations.
* Average context length per conversation is over 1,000 tokens and 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn)
* Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as delving deep into obscure and advanced topics.
* Aggresively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model" and "September 2021" and "I don't have personal beliefs"
* ### More than 60000 Datas generated or selected by GPT4
# ChatMatic
## 超8万条多轮对话示例。
UltraChat-Matic 数据集由4个现有数据集混合构建而成,我们借助GPT-4(GPT-4)从每个数据源中精心筛选最优样本,该数据集包含系统消息对话、轮次深度超过5的对话,且序列长度更长。
所用数据集如下:
1. "oasst2"
2. "ise-uiuc/Magicoder-Evol-Instruct-110K"
3. "vicgalle/alpaca-gpt4"
4. "LDJnr/Capybara"
### 源自Capybara数据集
* 该数据集内绝大多数Token(Token)均为全新合成,此前未在互联网上出现过。
* 本数据集采用Amplify-Instruct方法(论文即将发表),将数千条高质量单轮种子样本扩展为兼具进阶性与深度的多轮对话。
* 单条对话的平均上下文长度超过1000个Token,且每个样本包含至少3轮对话——当前Hugging Face(HF)平台上多数用于微调的指令/对话数据集仅支持单轮对话。
* 每条对话均经过优化,以强化模型原生的知识理解能力,同时深入探讨冷门前沿议题。
* 经过严格过滤,移除了所有可能存在的显性说教/对齐样本,以及“作为AI语言模型”“2021年9月”“我没有个人观点”等常见不良表述。
### 由GPT-4生成或筛选的样本超6万条
提供机构:
maas
创建时间:
2024-06-06



