five

erfanzar/UltraChat-Matic

收藏
Hugging Face2024-01-06 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/erfanzar/UltraChat-Matic
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: system dtype: string - name: user sequence: string - name: assistant sequence: string - name: dialogs sequence: string - name: conv_depth dtype: int64 splits: - name: train num_bytes: 447216231 num_examples: 109765 download_size: 242424003 dataset_size: 447216231 configs: - config_name: default data_files: - split: train path: data/train-* language: - en - es - ru - de - pl - th - vi - sv - bn - da - he - it - fa - sk - id - nb - el - nl - hu - eu - zh - eo - ja - ca - cs - bg - fi - pt - tr - ro - ar - uk - gl - fr - ko tags: - code - biology - medical size_categories: - 1M<n<10M task_categories: - text-generation - text-classification - conversational --- # ChatMatic ## with Over 80,000 multi-turn examples. UltraChat-Matic Dataset is built with mix of 4 other dataset and which carefully chosing best one from each one of them with using `GPT-4` and contains System messages Dialogs and conv_depth more than 5 with higher sequence lengths Used datasets are: 1. "oasst2" 2. "ise-uiuc/Magicoder-Evol-Instruct-110K" 3. "vicgalle/alpaca-gpt4" 4. "LDJnr/Capybara" ### From Capybara * Most tokens contained in this dataset are newly synthesized and did not exist prior online. * This leverages the Amplify-Instruct method(paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations. * Average context length per conversation is over 1,000 tokens and 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn) * Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as delving deep into obscure and advanced topics. * Aggresively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model" and "September 2021" and "I don't have personal beliefs" * ### More than 60000 Datas generated or selected by GPT4
提供机构:
erfanzar
原始信息汇总

数据集概述

数据集信息

  • 特征:
    • system: 字符串类型
    • user: 字符串序列
    • assistant: 字符串序列
    • dialogs: 字符串序列
    • conv_depth: 64位整数类型
  • 分割:
    • train: 包含447,216,231字节,109,765个样本
  • 下载大小: 242,424,003字节
  • 数据集大小: 447,216,231字节
  • 配置:
    • default: 数据文件路径为data/train-*
  • 语言: 包含多种语言,如英语、西班牙语、俄语等
  • 标签: 代码、生物学、医学
  • 大小类别: 1M<n<10M
  • 任务类别: 文本生成、文本分类、对话

数据集详情

  • 构建方式: 结合了4个其他数据集,使用GPT-4精心挑选最佳数据
  • 包含内容: 系统消息、对话和超过5的对话深度,序列长度较长
  • 使用的数据集:
    1. "oasst2"
    2. "ise-uiuc/Magicoder-Evol-Instruct-110K"
    3. "vicgalle/alpaca-gpt4"
    4. "LDJnr/Capybara"
  • Capybara数据集特点:
    • 大部分令牌是新合成的,之前未在线存在
    • 利用Amplify-Instruct方法(即将发表的论文)将数千个高质量单轮种子扩展为高级深入的多轮对话
    • 平均每个对话的上下文长度超过1,000个令牌,每个示例包含3轮或更多
    • 每个对话都优化以放大模型的自然原始知识能力,深入探讨晦涩和高级主题
    • 积极过滤,去除任何可能的道德化/对齐示例和常见的不良行为
  • 数据生成: 超过60,000个数据由GPT-4生成或选择
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作