argilla/databricks-dolly-15k-curated-multilingual
Source: Hugging Face · Updated: 2023-06-14 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/argilla/databricks-dolly-15k-curated-multilingual
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: response
    dtype: string
  - name: category
    dtype: string
  - name: instruction_original_en
    dtype: string
  - name: context_original_en
    dtype: string
  - name: response_original_en
    dtype: string
  - name: id
    dtype: int64
  splits:
  - name: de
    num_bytes: 25985140
    num_examples: 15015
  - name: en
    num_bytes: 24125109
    num_examples: 15015
  - name: es
    num_bytes: 25902709
    num_examples: 15015
  - name: fr
    num_bytes: 26704314
    num_examples: 15015
  download_size: 65586669
  dataset_size: 102717272
license: cc-by-sa-3.0
task_categories:
- text-generation
- text2text-generation
language:
- es
- de
- fr
tags:
- machine-translated
- instruction-following
pretty_name: Databrick Dolly Instructions Multilingual
size_categories:
- 10K<n<100K
---
# Dataset Card for "databricks-dolly-15k-curated-multilingual"
A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original `en` dataset. See below.
**STATUS**:
Currently, the original Dolly v2 English version has been curated by combining automatic processing with collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary of the number of edited fields.

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage: https://huggingface.co/datasets/argilla/databricks-dolly-15k-multilingual/**
- **Repository: https://huggingface.co/datasets/argilla/databricks-dolly-15k-multilingual/**
- **Paper:**
- **Leaderboard:**
- **Point of Contact: contact@argilla.io, https://github.com/argilla-io/argilla**
### Dataset Summary
This dataset collection is a curated and machine-translated version of the `databricks-dolly-15k` [dataset](https://github.com/databrickslabs/dolly/tree/master/data) originally created by Databricks, Inc. in 2023.
The goal is to give practitioners a starting point for training open-source instruction-following models with better-quality English data and translated data beyond English. However, as the translation quality will not be perfect, we highly recommend dedicating time to curate and fix translation issues. Below we explain how to load the datasets into [Argilla for data curation and fixing](https://github.com/argilla-io/argilla). Additionally, we'll be improving the datasets made available here, with the help of different communities.
Currently, the original English version has been curated by combining automatic processing with collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary of the number of edited fields.
The main issues (many others likely remain) are the following:
1. Some labelers misunderstood the purpose of the `context` field. This field is used as part of the prompt for instruction tuning; other works call it `input` (e.g., Alpaca). The name `context` likely led some labelers to use it for the full passage from which they extracted the response. This is problematic for some task types (summarization, closed-qa or information-extraction) because the context is sometimes shorter than or unrelated to the summary, or the requested information cannot be extracted from it (closed-qa, information-extraction).
2. Some labelers misunderstood how to phrase instructions for summarization or closed-qa. For example, they ask "Who is Thomas Jefferson?" and then provide a very long context and an equally long response.
We programmatically identified records with these potential issues and ran a campaign to fix them; as a result, more than 400 records have been adapted. See below for statistics:
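The exact filters used for the campaign are not given here. As a rough sketch of what such a programmatic check could look like, the hypothetical `flag_record` function below marks summarization, closed-QA and information-extraction records whose `context` is empty or shorter than the `response`. The category names and the whitespace word count are assumptions made for illustration, not the curators' actual rules.

```python
# Hypothetical heuristic, not the curators' actual filter.
# Category names are assumed to follow the original Dolly labels.
CONTEXT_DEPENDENT = {"summarization", "closed_qa", "information_extraction"}


def flag_record(record: dict) -> bool:
    """Return True when a context-dependent record looks suspicious."""
    if record.get("category") not in CONTEXT_DEPENDENT:
        return False
    # Crude length measure: whitespace-separated words.
    context_words = len((record.get("context") or "").split())
    response_words = len((record.get("response") or "").split())
    # A context that is missing or shorter than the answer cannot support it.
    return context_words == 0 or context_words < response_words
```

Records flagged this way would still need human review, since a short context can be legitimate for some instructions.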

As a result of this curation process the content of the fields has been reduced, counted in number of tokens, especially for the responses:

If you want to browse and curate your dataset with Argilla, you can:
1. [Duplicate this Space](https://huggingface.co/spaces/argilla/dolly-multilingual-curation/settings?duplicate=true). IMPORTANT: The Space's visibility needs to be Public, but you can set up your own password and API keys [following this guide](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html#setting-up-secret-environment-variables).
2. Set up two secrets: `HF_TOKEN` and `LANG`, which indicates the language split.
3. Log in with `admin`/`12345678`.
4. Start browsing and labeling. Every 5 minutes the validations will be stored in a Hub dataset in your personal HF space.
5. Please get in touch to contribute fixes and improvements to the source datasets.
There's one split per language:
```python
from datasets import load_dataset
# loads all splits
load_dataset("argilla/databricks-dolly-15k-curated-multilingual")
# loads the Spanish split
load_dataset("argilla/databricks-dolly-15k-curated-multilingual", split="es")
```
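Once a split is loaded, each record can be turned into a training prompt. The sketch below shows one way to do that, treating `context` as the optional input part of the prompt, as described above. The `build_prompt` helper and the Alpaca-style section headers are illustrative assumptions; the dataset itself does not prescribe a template.

```python
def build_prompt(record: dict) -> str:
    """Format one record as an instruction prompt (illustrative template)."""
    instruction = record["instruction"].strip()
    context = (record.get("context") or "").strip()
    if context:
        # The context is included as an extra input section of the prompt.
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    # Records without context get an instruction-only prompt.
    return f"### Instruction:\n{instruction}\n\n### Response:\n"
```

The `response` field would then be appended as the training target after the final header.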
### Supported Tasks and Leaderboards
As described in the README of the original dataset, this dataset can be used for:
* Training LLMs
* Synthetic Data Generation
* Data Augmentation
### Languages
Currently: `es`, `fr`, `de`, `en`
Join the Argilla [Slack community](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g) if you want to help us include other languages.
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
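The metadata header of this card lists eight features per record. As a minimal sketch, that schema can be written as a Python `TypedDict`. The `DollyRecord` and `changed_fields` names are hypothetical; the helper assumes the `*_original_en` fields hold the English source text for the corresponding field and simply lists which text fields differ from it.

```python
from typing import Mapping, TypedDict


class DollyRecord(TypedDict):
    """Record layout taken from the features listed in the card metadata."""
    instruction: str
    context: str
    response: str
    category: str
    instruction_original_en: str
    context_original_en: str
    response_original_en: str
    id: int


def changed_fields(record: Mapping[str, object]) -> list[str]:
    """List the text fields that differ from their English originals."""
    return [
        name
        for name in ("instruction", "context", "response")
        if record[name] != record[f"{name}_original_en"]
    ]
```

For a translated split every non-empty field would normally differ from its `_original_en` counterpart, so the comparison is mostly useful on the `en` split.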
### Data Splits
There's one split per language:
```python
from datasets import load_dataset
# loads all splits
load_dataset("argilla/databricks-dolly-15k-curated-multilingual")
# loads the Spanish split
load_dataset("argilla/databricks-dolly-15k-curated-multilingual", split="es")
```
## Dataset Creation
These datasets were translated from the original English dataset using the DeepL API between the 13th and 14th of April.
### Curation Logbook
* 28/04/23: Removed reference markers left over from Wikipedia copy-pastes in 8113 rows. Applied to the `context` and `response` fields with the following regex: `r'\[[\w]+\]'`
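The logbook entry above can be illustrated with a short sketch that applies the quoted regex to a text field. The `strip_references` helper name is an assumption for illustration; only the pattern itself comes from the logbook.

```python
import re

# Pattern quoted in the curation logbook.
REFERENCE_PATTERN = re.compile(r'\[[\w]+\]')


def strip_references(text: str) -> str:
    """Remove bracketed markers made only of word characters, e.g. [1] or [note]."""
    return REFERENCE_PATTERN.sub("", text)
```

Because `\w` does not match spaces, markers such as `[citation needed]` are left untouched, while any bracketed single word (for example `[sic]`) is removed along with numeric references.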
### Source Data
#### Initial Data Collection and Normalization
Refer to the [original dataset](https://github.com/databrickslabs/dolly/tree/master/data) for more information.
#### Who are the source language producers?
[More Information Needed]
### Annotations
Annotations are planned but not performed yet.
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).
**Original dataset Owner: Databricks, Inc.**
### Citation Information
[More Information Needed]
Provider:
argilla
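Summary of original information
Dataset card "databricks-dolly-15k-curated-multilingual"
Dataset description
Dataset summary
A curated and machine-translated multilingual version of the databricks-dolly-15k dataset. It includes a programmatically and manually corrected version of the original en dataset.
Supported tasks and leaderboards
The dataset can be used for the following tasks:
- Training large language models (LLMs)
- Synthetic data generation
- Data augmentation
Languages
Currently supported languages: es, fr, de, en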
Dataset structure
Data instances
[More information needed]
Data fields
- instruction: string
- context: string
- response: string
- category: string
- instruction_original_en: string
- context_original_en: string
- response_original_en: string
- id: integer
Data splits
There is one split per language:
    from datasets import load_dataset
    # load all splits
    load_dataset("argilla/databricks-dolly-15k-curated-multilingual")
    # load the Spanish split
    load_dataset("argilla/databricks-dolly-15k-curated-multilingual", split="es")
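Dataset creation
These datasets were translated from the original English dataset using the DeepL API between April 13 and 14.
Curation logbook
- 28/04/23: Removed Wikipedia copy-paste references from 8113 rows. Applied to the context and response fields using the following regular expression:
r'\[[\w]+\]'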
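Source data
Initial data collection and normalization
Refer to the original dataset for more information.
Source language producers
[More information needed]
Annotations
Annotations are planned but not performed yet.
Annotation process
[More information needed]
Annotators
[More information needed]
Personal and sensitive information
[More information needed]
Considerations for using the data
Social impact of the dataset
[More information needed]
Discussion of biases
[More information needed]
Other known limitations
[More information needed]
Additional information
Dataset curators
[More information needed]
Licensing information
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Original dataset owner: Databricks, Inc.
Citation information
[More information needed]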
Collected summary
Dataset introduction

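Construction
Within the field of instruction-tuning dataset construction, argilla/databricks-dolly-15k-curated-multilingual reflects a careful approach to cross-lingual data engineering. It takes the original English Databricks Dolly 15k instruction set as its basis and first machine-translates it with the DeepL API, producing Spanish, French and German versions. More importantly, construction combined automated processing with collaborative human annotation: the original English data was systematically reviewed and corrected on the Argilla platform, targeting semantic inconsistencies and misplaced content in the instruction, context and response fields. About 400 records were manually edited, improving the data's logical consistency and its fit to each task.
Features
The dataset's core feature is its curated, multilingual instruction-following structure. Records have a clear layout with instruction, context, response and category fields, and retain the original English versions for comparison. It covers four languages (English, Spanish, French and German), each provided as a separate split with 15,015 examples. Notably, the combination of manual and automatic corrections mitigates problems in the original data caused by labelers' misunderstandings, such as misused context fields and overly long responses, making the data more reliable and better suited to training instruction-following models.
Usage
For research and applications in natural language generation, the dataset offers a convenient entry point for multilingual instruction tuning. It can be loaded directly with the Hugging Face `datasets` library: `load_dataset("argilla/databricks-dolly-15k-curated-multilingual")` fetches all languages, and passing a `split` argument (e.g. `"es"`) loads a single language subset. The data can be used to train large language models, generate synthetic data, or perform data augmentation. To further improve data quality, users can duplicate the associated Argilla Space, set the required secrets, and then browse and re-annotate the translations, so the data can be iterated on and refined over time.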
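Background and challenges
Background overview
As instruction tuning for large language models has developed rapidly, multilingual instruction datasets have become a key resource for improving model generalization. Databricks released the Dolly 15k dataset in 2023 to give the open-source community high-quality instruction-following data. The Argilla team then curated it and extended it to multiple languages, releasing the "databricks-dolly-15k-curated-multilingual" dataset. Besides correcting labeling errors in the original English data through a mix of programmatic and manual work, the team used machine translation to produce German, Spanish and French versions. The result is a practical example of how to build high-quality, cross-lingual instruction-tuning data, and it improves the prospects for aligning models with human instructions in multilingual settings.
Current challenges
The dataset aims to address the scarcity and uneven quality of training data for instruction-following models in multilingual settings. Its construction faces two difficulties. First, at the task level, each instruction-context-response triple must be semantically aligned; in tasks such as summarization and closed question answering, labelers may misunderstand the purpose of the context field, leaving the context disconnected from the response or padded with redundant information. Second, the multilingual extension relies on machine translation, whose output is not perfect and risks semantic distortion or a poor fit to the cultural context. This calls for continued manual review and correction to keep the dataset reliable and useful.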
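Common scenarios
Typical use cases
In natural language processing, instruction-following datasets are central to the development of large language models. With its multilingual coverage and human-corrected instruction-response pairs, argilla/databricks-dolly-15k-curated-multilingual is a useful resource for training open-source instruction-following models. Its structured instructions, contexts and reference responses provide supervision for understanding human intent and generating coherent text, and it is particularly suited to the fine-tuning stage, where it can improve a model's ability to carry out multiple tasks.
Practical applications
In practice, the dataset provides a direct data foundation for multilingual dialogue systems, customer-service assistants and content-generation tools. Organizations can use it to train customized language models that handle complex queries from users in different languages, for example generating reports or answering questions in finance, education or healthcare. Its multilingual coverage is especially helpful for building AI applications that serve users globally, lowering the language barrier to deployment.
Derived and related work
Work derived from this dataset centers on building multilingual instruction-tuned models and on evaluation frameworks. Researchers have used it as a basis for multilingual Alpaca-style adapted models and to explore the effectiveness of cross-lingual knowledge transfer. This work supports the value of high-quality multilingual instruction data for model performance, and has prompted methodological research on data cleaning, translation quality improvement and bias detection.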
The content above was collected and summarized by 遇见数据集.



