qywu/ruozhiba_en

Name: qywu/ruozhiba_en
Creator: qywu
Published: 2024-04-12 19:34:39
License: 暂无描述

Hugging Face2024-04-12 更新2024-06-11 收录

下载链接：

https://hf-mirror.com/datasets/qywu/ruozhiba_en

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: source dtype: string - name: instruction dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: followup_question dtype: string - name: model dtype: string splits: - name: train_sft num_bytes: 954797 num_examples: 238 download_size: 548182 dataset_size: 954797 configs: - config_name: default data_files: - split: train_sft path: data/train_sft-* size_categories: - n<1K --- # Ruozhiba English Data Based on the findings from [COIG-CQIA](https://arxiv.org/html/2403.18058v1), Ruozhiba data is a high-quality instruction tuning dataset that can greatly improve supervised fine-tuning models' performance. We translated the 240 instructions in Ruozhiba from Chinese to English. We filtered out and modified some instructions are language/cultural related. Some Chinese instructions are kept to maintain their original meaning. Finally, we re-generate the response using `gpt-4-turbo` and add one additional turn to improve robustness. ## MT-Bench We use GPT-4-0125-preview as Judge. On MT-Bench, [ruozhiba_en](https://huggingface.co/datasets/qywu/ruozhiba_en) data has achieved comparable performance compared to [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. | Model | Total | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing | |--------------------------------------------|-------|--------|------------|------------|------|-----------|----------|------|---------| | alignment-handbook/zephyr-7b-sft-full | 5.6 | 3.95 | 6.75 | 7.5 | 3.1 | 4.05 | 6.15 | 6.1 | 7.2 | | zephyr-7b-sft-ruozhiba | 5.88 | 3.75 | 6.45 | 8.11 | 2.7 | 4.2 | 7.4 | 7.4 | 7.15 |

提供机构：

qywu

原始信息汇总

数据集概述

数据集名称

Ruoziba English Data

数据集描述

Ruoziba English Data是一个高质量的指令调优数据集，用于提升监督微调模型的性能。该数据集包含240条从中文翻译至英文的指令，并对语言/文化相关的指令进行了过滤和修改。部分中文指令被保留以保持其原始含义，并通过gpt-4-turbo重新生成响应，增加了一个额外的回合以增强鲁棒性。

数据集特征

source: 字符串类型
instruction: 字符串类型
messages: 列表类型，包含以下子特征：
- content: 字符串类型
- role: 字符串类型
followup_question: 字符串类型
model: 字符串类型

数据集分割

train_sft: 包含238个示例，数据大小为954797字节

数据集大小

下载大小: 548182字节
数据集大小: 954797字节

数据集配置

config_name: default
data_files:
- split: train_sft
- path: data/train_sft-*

数据集大小分类

n<1K

5,000+

优质数据集

54 个

任务类型

进入经典数据集