five

qywu/ruozhiba_en

收藏
Hugging Face2024-04-12 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/qywu/ruozhiba_en
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: source dtype: string - name: instruction dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: followup_question dtype: string - name: model dtype: string splits: - name: train_sft num_bytes: 954797 num_examples: 238 download_size: 548182 dataset_size: 954797 configs: - config_name: default data_files: - split: train_sft path: data/train_sft-* size_categories: - n<1K --- # Ruozhiba English Data Based on the findings from [COIG-CQIA](https://arxiv.org/html/2403.18058v1), Ruozhiba data is a high-quality instruction tuning dataset that can greatly improve supervised fine-tuning models' performance. We translated the 240 instructions in Ruozhiba from Chinese to English. We filtered out and modified some instructions are language/cultural related. Some Chinese instructions are kept to maintain their original meaning. Finally, we re-generate the response using `gpt-4-turbo` and add one additional turn to improve robustness. ## MT-Bench We use GPT-4-0125-preview as Judge. On MT-Bench, [ruozhiba_en](https://huggingface.co/datasets/qywu/ruozhiba_en) data has achieved comparable performance compared to [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. | Model | Total | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing | |--------------------------------------------|-------|--------|------------|------------|------|-----------|----------|------|---------| | alignment-handbook/zephyr-7b-sft-full | 5.6 | 3.95 | 6.75 | 7.5 | 3.1 | 4.05 | 6.15 | 6.1 | 7.2 | | zephyr-7b-sft-ruozhiba | 5.88 | 3.75 | 6.45 | 8.11 | 2.7 | 4.2 | 7.4 | 7.4 | 7.15 |
提供机构:
qywu
原始信息汇总

数据集概述

数据集名称

Ruoziba English Data

数据集描述

Ruoziba English Data是一个高质量的指令调优数据集,用于提升监督微调模型的性能。该数据集包含240条从中文翻译至英文的指令,并对语言/文化相关的指令进行了过滤和修改。部分中文指令被保留以保持其原始含义,并通过gpt-4-turbo重新生成响应,增加了一个额外的回合以增强鲁棒性。

数据集特征

  • source: 字符串类型
  • instruction: 字符串类型
  • messages: 列表类型,包含以下子特征:
    • content: 字符串类型
    • role: 字符串类型
  • followup_question: 字符串类型
  • model: 字符串类型

数据集分割

  • train_sft: 包含238个示例,数据大小为954797字节

数据集大小

  • 下载大小: 548182字节
  • 数据集大小: 954797字节

数据集配置

  • config_name: default
  • data_files:
    • split: train_sft
    • path: data/train_sft-*

数据集大小分类

  • n<1K
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作