Magpie-Align/Magpie-Qwen2-Pro-200K-English

Name: Magpie-Align/Magpie-Qwen2-Pro-200K-English
Creator: Magpie-Align
Published: 2024-07-03 04:39:27
License: No description available

Source: Hugging Face. Updated 2024-07-03; indexed 2024-07-06.

Download link:

https://hf-mirror.com/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-English

Resource summary:

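This dataset was generated with the Magpie method using the Qwen/Qwen2-72B-Instruct model and contains instruction-response conversation data. Its features include input length, output length, task category, input quality, input difficulty, minimum neighbor distance, safety, reward, and language. The filter settings cover input quality, instruction reward, and language, plus removal of duplicate and incomplete instructions. Available versions include 1M raw conversations, 300K high-quality conversations, 200K high-quality Chinese conversations, and 200K high-quality English conversations.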

This dataset was generated by Qwen/Qwen2-72B-Instruct using the Magpie method and contains 200,000 training examples. Its fields include UUID, model name, generation input configuration, instruction, response, conversations, task category, difficulty, intent, knowledge, and others. The dataset has a training split of approximately 1 GB. The dataset description documents the meaning and data type of each field, along with the generation method and filtering conditions.

Provider:

Magpie-Align

Original information summary

Dataset overview

Dataset information

Features:
- uuid: string
- model: string
- gen_input_configs: struct
  - temperature: float
  - top_p: float
  - input_generator: string
  - seed: null
  - extract_input: string
- instruction: string
- response: string
- conversations: list
  - from: string
  - value: string
- task_category: string
- other_task_category: sequence of strings
- task_category_generator: string
- difficulty: string
- intent: string
- knowledge: string
- difficulty_generator: string
- input_quality: string
- quality_explanation: string
- quality_generator: string
- llama_guard_2: string
- reward_model: string
- instruct_reward: float
- min_neighbor_distance: float
- repeat_count: integer
- min_similar_uuid: string
- instruction_length: integer
- response_length: integer
- language: string
Splits:
- train:
  - num_bytes: 1007184254.428362
  - num_examples: 200000
Download size: 599475522 bytes
Dataset size: 1007184254.428362 bytes
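
The byte counts above are easier to read once converted. The following is a minimal sketch that derives gigabyte figures, the average size per example, and the download-to-dataset size ratio. The helper name `summarize_sizes` is hypothetical, and the fractional byte count is used exactly as listed.

```python
# Figures copied from the split metadata in this listing.
NUM_BYTES = 1007184254.428362   # dataset size (train split), in bytes
DOWNLOAD_BYTES = 599475522      # download size, in bytes
NUM_EXAMPLES = 200000           # number of training examples


def summarize_sizes(num_bytes, download_bytes, num_examples):
    """Return a few derived size figures as a dict (hypothetical helper)."""
    return {
        "dataset_gb": num_bytes / 1e9,             # decimal gigabytes
        "dataset_gib": num_bytes / 2**30,          # binary gibibytes
        "download_gb": download_bytes / 1e9,
        "bytes_per_example": num_bytes / num_examples,
        "download_ratio": download_bytes / num_bytes,
    }


summary = summarize_sizes(NUM_BYTES, DOWNLOAD_BYTES, NUM_EXAMPLES)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these figures the train split is just over 1 GB in decimal units, which matches the "approximately 1 GB" stated in the summary.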

Configuration

Default configuration:
- data_files:
  - split: train
  - path: data/train-*
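
As a rough illustration of how this configuration is used, the sketch below checks which file paths fall under the `data/train-*` pattern and loads the train split through the Hugging Face `datasets` library. The helper names `is_train_file` and `load_train` and the shard file names are hypothetical; loading assumes the `datasets` package is installed and the Hugging Face Hub is reachable.

```python
import fnmatch

REPO_ID = "Magpie-Align/Magpie-Qwen2-Pro-200K-English"
TRAIN_PATTERN = "data/train-*"  # path pattern from the default configuration


def is_train_file(path):
    """Return True if a repository file path matches the train split pattern."""
    return fnmatch.fnmatch(path, TRAIN_PATTERN)


def load_train(streaming=True):
    """Load the train split (requires the `datasets` package and network access).

    With streaming=True, records are read lazily instead of downloading
    the whole split up front.
    """
    from datasets import load_dataset  # lazy import keeps this sketch importable
    return load_dataset(REPO_ID, split="train", streaming=streaming)


# Hypothetical shard names, used only to illustrate the pattern match.
print(is_train_file("data/train-00000-of-00004.parquet"))
print(is_train_file("data/test-00000-of-00001.parquet"))
```

Streaming is chosen as the default here only because the download is roughly 600 MB; passing `streaming=False` downloads and caches the full split instead.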

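Available labels

Input Length: total number of characters in the instruction
Output Length: total number of characters in the response
Task Category: the specific category of the instruction
Input Quality: clarity, specificity, and coherence of the instruction, rated very poor, poor, average, good, or excellent
Input Difficulty: level of knowledge required to complete the instruction, rated very easy, easy, medium, hard, or very hard
Minimum Neighbor Distance: embedding distance to the nearest neighbor in the dataset, used to filter out duplicate or similar instances
Safety: safety label assigned by meta-llama/Meta-Llama-Guard-2-8B
Reward: output of the reward model for the given instruction-response pair
Language: language of the instruction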
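
Two of these labels are simple enough to sketch directly: the length labels are character counts, and the minimum neighbor distance is the distance from each instruction's embedding to its closest other embedding. The example below is a minimal illustration on toy vectors. The Euclidean metric, the brute-force search, and the helper names `char_lengths` and `min_neighbor_distances` are assumptions, not details taken from the dataset description.

```python
import math


def char_lengths(instruction, response):
    """Input Length and Output Length as character counts."""
    return {
        "instruction_length": len(instruction),
        "response_length": len(response),
    }


def min_neighbor_distances(embeddings):
    """For each vector, return the distance to its nearest other vector.

    Brute force, O(n^2). Euclidean distance is an assumption made for
    this sketch; the source does not name the metric.
    """
    result = []
    for i, a in enumerate(embeddings):
        nearest = math.inf
        for j, b in enumerate(embeddings):
            if i != j:
                nearest = min(nearest, math.dist(a, b))
        result.append(nearest)
    return result


# Toy 2-D "embeddings"; real instruction embeddings would come from an encoder.
toy = [(0.0, 0.0), (0.0, 0.1), (3.0, 4.0)]
distances = min_neighbor_distances(toy)
print(char_lengths("abc", "de"))
print(distances)
```

The first two toy vectors sit close together, so both receive a small distance and would be candidates for near-duplicate filtering, while the third is isolated.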

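Filter settings

Input Quality: >= good
Instruction Reward: >= -10
Language: English
Remove duplicate and incomplete instructions (for example, instructions ending with ":")
Select the 200K examples with the longest responses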
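
The filter settings above can be sketched as a small pipeline over records shaped like the feature list. This is an illustration under stated assumptions, not the creators' actual code: the language value `"EN"` is an assumed encoding of English, duplicates are detected by exact instruction text as a stand-in for whatever deduplication was really used, and the helper names and toy rows are hypothetical.

```python
GOOD_QUALITY = {"good", "excellent"}  # ratings at or above "good"


def passes_filters(row, seen_instructions):
    """Apply the per-row filters to one record (a plain dict)."""
    instruction = row["instruction"].strip()
    if row["input_quality"] not in GOOD_QUALITY:
        return False
    if row["instruct_reward"] < -10:
        return False
    if row["language"] != "EN":            # assumed code for English
        return False
    if instruction.endswith(":"):          # incomplete instruction
        return False
    if instruction in seen_instructions:   # exact-match duplicate (assumption)
        return False
    seen_instructions.add(instruction)
    return True


def select_longest(rows, k):
    """Keep the k rows with the longest responses among those that pass."""
    seen = set()
    kept = [row for row in rows if passes_filters(row, seen)]
    kept.sort(key=lambda row: row["response_length"], reverse=True)
    return kept[:k]


def make_row(instruction, quality, reward, response_length, language="EN"):
    """Build a toy record with only the fields the filters read."""
    return {
        "instruction": instruction,
        "input_quality": quality,
        "instruct_reward": reward,
        "language": language,
        "response_length": response_length,
    }


rows = [
    make_row("Explain recursion.", "good", 1.5, 900),
    make_row("Explain recursion.", "good", 1.0, 1200),        # duplicate
    make_row("Complete the following:", "excellent", 2.0, 1500),  # incomplete
    make_row("Summarize this text", "average", 0.5, 800),     # quality too low
    make_row("Write a haiku about rain.", "excellent", -3.0, 300),
    make_row("Translate to French: hi", "good", -12.0, 700),  # reward too low
]
selected = select_longest(rows, 2)
print([row["instruction"] for row in selected])
```

For the real dataset, `k` would be 200,000. Sorting after filtering means the length cutoff only competes among rows that already meet the quality, reward, and language thresholds.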
