erfanzar/UltraChat-Matic

Name: erfanzar/UltraChat-Matic
Creator: erfanzar
Published: 2024-01-06 10:23:35
License: 暂无描述

Hugging Face2024-01-06 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/erfanzar/UltraChat-Matic

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: system dtype: string - name: user sequence: string - name: assistant sequence: string - name: dialogs sequence: string - name: conv_depth dtype: int64 splits: - name: train num_bytes: 447216231 num_examples: 109765 download_size: 242424003 dataset_size: 447216231 configs: - config_name: default data_files: - split: train path: data/train-* language: - en - es - ru - de - pl - th - vi - sv - bn - da - he - it - fa - sk - id - nb - el - nl - hu - eu - zh - eo - ja - ca - cs - bg - fi - pt - tr - ro - ar - uk - gl - fr - ko tags: - code - biology - medical size_categories: - 1M<n<10M task_categories: - text-generation - text-classification - conversational --- # ChatMatic ## with Over 80,000 multi-turn examples. UltraChat-Matic Dataset is built with mix of 4 other dataset and which carefully chosing best one from each one of them with using `GPT-4` and contains System messages Dialogs and conv_depth more than 5 with higher sequence lengths Used datasets are: 1. "oasst2" 2. "ise-uiuc/Magicoder-Evol-Instruct-110K" 3. "vicgalle/alpaca-gpt4" 4. "LDJnr/Capybara" ### From Capybara * Most tokens contained in this dataset are newly synthesized and did not exist prior online. * This leverages the Amplify-Instruct method(paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations. * Average context length per conversation is over 1,000 tokens and 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn) * Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as delving deep into obscure and advanced topics. * Aggresively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model" and "September 2021" and "I don't have personal beliefs" * ### More than 60000 Datas generated or selected by GPT4

提供机构：

erfanzar

原始信息汇总

数据集概述

数据集信息

特征:
- system: 字符串类型
- user: 字符串序列
- assistant: 字符串序列
- dialogs: 字符串序列
- conv_depth: 64位整数类型
分割:
- train: 包含447,216,231字节，109,765个样本
下载大小: 242,424,003字节
数据集大小: 447,216,231字节
配置:
- default: 数据文件路径为data/train-*
语言: 包含多种语言，如英语、西班牙语、俄语等
标签: 代码、生物学、医学
大小类别: 1M<n<10M
任务类别: 文本生成、文本分类、对话

数据集详情

构建方式: 结合了4个其他数据集，使用GPT-4精心挑选最佳数据
包含内容: 系统消息、对话和超过5的对话深度，序列长度较长
使用的数据集:
1. "oasst2"
2. "ise-uiuc/Magicoder-Evol-Instruct-110K"
3. "vicgalle/alpaca-gpt4"
4. "LDJnr/Capybara"
Capybara数据集特点:
- 大部分令牌是新合成的，之前未在线存在
- 利用Amplify-Instruct方法（即将发表的论文）将数千个高质量单轮种子扩展为高级深入的多轮对话
- 平均每个对话的上下文长度超过1,000个令牌，每个示例包含3轮或更多
- 每个对话都优化以放大模型的自然原始知识能力，深入探讨晦涩和高级主题
- 积极过滤，去除任何可能的道德化/对齐示例和常见的不良行为
数据生成: 超过60,000个数据由GPT-4生成或选择

5,000+

优质数据集

54 个

任务类型

进入经典数据集