qywu/ruozhiba_en
收藏Hugging Face2024-04-12 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/qywu/ruozhiba_en
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: source
dtype: string
- name: instruction
dtype: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: followup_question
dtype: string
- name: model
dtype: string
splits:
- name: train_sft
num_bytes: 954797
num_examples: 238
download_size: 548182
dataset_size: 954797
configs:
- config_name: default
data_files:
- split: train_sft
path: data/train_sft-*
size_categories:
- n<1K
---
# Ruozhiba English Data
Based on the findings from [COIG-CQIA](https://arxiv.org/html/2403.18058v1), Ruozhiba data is a high-quality instruction tuning dataset that can greatly improve supervised fine-tuning models' performance.
We translated the 240 instructions in Ruozhiba from Chinese to English.
We filtered out and modified some instructions are language/cultural related.
Some Chinese instructions are kept to maintain their original meaning.
Finally, we re-generate the response using `gpt-4-turbo` and add one additional turn to improve robustness.
## MT-Bench
We use GPT-4-0125-preview as Judge. On MT-Bench, [ruozhiba_en](https://huggingface.co/datasets/qywu/ruozhiba_en) data has achieved comparable performance compared to [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
| Model | Total | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing |
|--------------------------------------------|-------|--------|------------|------------|------|-----------|----------|------|---------|
| alignment-handbook/zephyr-7b-sft-full | 5.6 | 3.95 | 6.75 | 7.5 | 3.1 | 4.05 | 6.15 | 6.1 | 7.2 |
| zephyr-7b-sft-ruozhiba | 5.88 | 3.75 | 6.45 | 8.11 | 2.7 | 4.2 | 7.4 | 7.4 | 7.15 |
提供机构:
qywu
原始信息汇总
数据集概述
数据集名称
Ruoziba English Data
数据集描述
Ruoziba English Data是一个高质量的指令调优数据集,用于提升监督微调模型的性能。该数据集包含240条从中文翻译至英文的指令,并对语言/文化相关的指令进行了过滤和修改。部分中文指令被保留以保持其原始含义,并通过gpt-4-turbo重新生成响应,增加了一个额外的回合以增强鲁棒性。
数据集特征
- source: 字符串类型
- instruction: 字符串类型
- messages: 列表类型,包含以下子特征:
- content: 字符串类型
- role: 字符串类型
- followup_question: 字符串类型
- model: 字符串类型
数据集分割
- train_sft: 包含238个示例,数据大小为954797字节
数据集大小
- 下载大小: 548182字节
- 数据集大小: 954797字节
数据集配置
- config_name: default
- data_files:
- split: train_sft
- path: data/train_sft-*
数据集大小分类
- n<1K



