erfanzar/UltraChat-Matic
收藏Hugging Face2024-01-06 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/erfanzar/UltraChat-Matic
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: system
dtype: string
- name: user
sequence: string
- name: assistant
sequence: string
- name: dialogs
sequence: string
- name: conv_depth
dtype: int64
splits:
- name: train
num_bytes: 447216231
num_examples: 109765
download_size: 242424003
dataset_size: 447216231
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
language:
- en
- es
- ru
- de
- pl
- th
- vi
- sv
- bn
- da
- he
- it
- fa
- sk
- id
- nb
- el
- nl
- hu
- eu
- zh
- eo
- ja
- ca
- cs
- bg
- fi
- pt
- tr
- ro
- ar
- uk
- gl
- fr
- ko
tags:
- code
- biology
- medical
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- text-classification
- conversational
---
# ChatMatic
## with Over 80,000 multi-turn examples.
UltraChat-Matic Dataset is built with mix of 4 other dataset and which carefully chosing best one from each one of them with using `GPT-4`
and contains
System messages Dialogs and conv_depth more than 5 with higher sequence lengths
Used datasets are:
1. "oasst2"
2. "ise-uiuc/Magicoder-Evol-Instruct-110K"
3. "vicgalle/alpaca-gpt4"
4. "LDJnr/Capybara"
### From Capybara
* Most tokens contained in this dataset are newly synthesized and did not exist prior online.
* This leverages the Amplify-Instruct method(paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations.
* Average context length per conversation is over 1,000 tokens and 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn)
* Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as delving deep into obscure and advanced topics.
* Aggresively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model" and "September 2021" and "I don't have personal beliefs"
* ### More than 60000 Datas generated or selected by GPT4
提供机构:
erfanzar
原始信息汇总
数据集概述
数据集信息
- 特征:
system: 字符串类型user: 字符串序列assistant: 字符串序列dialogs: 字符串序列conv_depth: 64位整数类型
- 分割:
train: 包含447,216,231字节,109,765个样本
- 下载大小: 242,424,003字节
- 数据集大小: 447,216,231字节
- 配置:
default: 数据文件路径为data/train-*
- 语言: 包含多种语言,如英语、西班牙语、俄语等
- 标签: 代码、生物学、医学
- 大小类别: 1M<n<10M
- 任务类别: 文本生成、文本分类、对话
数据集详情
- 构建方式: 结合了4个其他数据集,使用
GPT-4精心挑选最佳数据 - 包含内容: 系统消息、对话和超过5的对话深度,序列长度较长
- 使用的数据集:
- "oasst2"
- "ise-uiuc/Magicoder-Evol-Instruct-110K"
- "vicgalle/alpaca-gpt4"
- "LDJnr/Capybara"
- Capybara数据集特点:
- 大部分令牌是新合成的,之前未在线存在
- 利用Amplify-Instruct方法(即将发表的论文)将数千个高质量单轮种子扩展为高级深入的多轮对话
- 平均每个对话的上下文长度超过1,000个令牌,每个示例包含3轮或更多
- 每个对话都优化以放大模型的自然原始知识能力,深入探讨晦涩和高级主题
- 积极过滤,去除任何可能的道德化/对齐示例和常见的不良行为
- 数据生成: 超过60,000个数据由GPT-4生成或选择



