Quora Chat Dutch
收藏NIAID Data Ecosystem2026-05-01 收录
下载链接:
https://zenodo.org/records/8052391
下载链接
链接失效反馈官方服务:
资源简介:
This dataset contains 54,444 conversations between een AI assistant and a (fake) "Human" (generated) in Dutch. They are translations of Baize's machine-generated answers to the Quora dataset.
Data Instances
{
'id': 10,
'input': 'Het gesprek tussen de mens en de AI-assistent.\n'
'[|Human|] Ik vroeg me af wanneer de Moto G3 de Android 7.0-update '
'zal krijgen?\n'
'[|AI|] De Moto G3 (Moto G 3rd gen 2015) komt in aanmerking voor de '
'Android 7.0 (Nougat) update.\n'
'[|Human|] Geweldig! Weet je wanneer de update beschikbaar zal '
'zijn?\n'
'[|AI|] De update werd uitgebracht in december 2016 voor de Moto G4 '
'en G4 Plus. Er is echter nog geen woord over wanneer de Moto G3 de '
'update zal ontvangen.\n'
'[|Human|] Is er nieuws of het binnenkort de update zal ontvangen?\n'
'[|AI|] Het spijt me, ik heb momenteel geen informatie daarover. '
'Het is het beste om uit te kijken naar updates van Motorola.\n'
'[|Human|] ',
'topic': 'Wanneer krijgt Moto G3 (Moto G 3e gen 2015) de Android 7.0 '
'(Nougat) update?'
}
Data Fields
id: the ID of the item. The following IDs are not included because they could not be translated: [1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent.\n and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
Dataset Creation
Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbo. max_tokens=1024, temperature=0 as parameters.
The prompt template to translate the input is:
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as `[|AI|]`) and the human (`[|Human|]`) talking in turns and responding to each other;
2. do not translate the speaker identifiers `[|AI|]` and `[|Human|]` but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.\n\n"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct," \
" summarizing manner. Translate accurately and formally. Do not provide any explanation" \
" about the translation and do not include the original title.\n\n"
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
Note that 12 items (0.02%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740].
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for quora is listed as unknown. Baize is listed as GPLv3. So this dataset is also listed as such. A specific restriction is in place: as per OpenAI's terms of use, this dataset cannot be used to build a commercial system that competes with OpenAI's services.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub which is the canonical repository for this dataset. See the link for more info.
本数据集包含54444轮荷兰语的AI助手 (AI Assistant)与(虚构的)“人类”(由模型生成)的对话。这些对话均为将Baize针对Quora数据集生成的机器回答翻译为荷兰语的结果。
### 数据实例
json
{
'id': 10,
'input': 'Het gesprek tussen de mens en de AI-assistent.
[|Human|] Ik vroeg me af wanneer de Moto G3 de Android 7.0-update zal krijgen?
[|AI|] De Moto G3 (Moto G 3rd gen 2015) komt in aanmerking voor de Android 7.0 (Nougat) update.
[|Human|] Geweldig! Weet je wanneer de update beschikbaar zal zijn?
[|AI|] De update werd uitgebracht in december 2016 voor de Moto G4 en G4 Plus. Er is echter nog geen woord over wanneer de Moto G3 de update zal ontvangen.
[|Human|] Is er nieuws of het binnenkort de update zal ontvangen?
[|AI|] Het spijt me, ik heb momenteel geen informatie daarover. Het is het beste om uit te kijken naar updates van Motorola.
[|Human|] ',
'topic': 'Wanneer krijgt Moto G3 (Moto G 3e gen 2015) de Android 7.0 (Nougat) update?'
}
### 数据字段
- id:数据项的唯一标识。以下ID因无法完成翻译而未被纳入本数据集:[1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740]
- input:AI与“人类”的机器生成对话。始终以`Het gesprek tussen de mens en de AI-assistent.
`开头,且至少包含一次`[|AI|]`和`[|Human|]`发言标记。
- topic:对话主题的描述性文本
### 数据集构建
所有对话与主题的翻译均通过OpenAI的gpt-3.5-turbo API完成,参数设置为`max_tokens=1024`、`temperature=0`。
#### 对话翻译提示模板
python
CONVERSATION_TRANSLATION_PROMPT = """你需要将一段AI助手与人类的对话从{src_lang}翻译为{tgt_lang}。
请遵循以下要求:
1. 保留对话格式:对话由依次轮流发言、互相回应的AI(标记为`[|AI|]`)和人类(`[|Human|]`)组成;
2. 不得翻译发言者标识符`[|AI|]`与`[|Human|]`,需将其原样复制到译文的对应位置;
3. 确保翻译准确,保证对话内容的正确性;
4. 译文需流畅自然,无语法错误,使用标准{tgt_lang}且无地域偏见;
5. 人类发言需采用非正式但规范的语言风格;
6. 避免各类偏见(如性别偏见、语法偏见、社会偏见);
7. 若人类要求修正语法或拼写错误,则需在{tgt_lang}中生成类似的错误文本,并同时为AI生成修正后的输出版本;
8. 若人类要求将文本从一种语言翻译为另一种语言,则仅将人类的问题翻译为{tgt_lang},保留AI提供的、符合人类请求的目标语言译文;
9. 不得翻译代码片段,需原样复制。若代码片段中包含英文示例、变量名或定义,请保留其英文形式。
请按照上述要求翻译以下对话,无需提供解释或添加任何额外内容。
"""
#### 主题翻译提示模板
python
TOPIC_TRANSLATION_PROMPT = "请将一段对话的标题从{src_lang}简洁、概括地翻译为{tgt_lang}。翻译需准确且正式,无需提供翻译相关解释,也不得保留原文标题。
"
> 系统提示词为:
> 你是一名乐于助人的助手,需按照给定要求将文本从英语翻译为荷兰语。
### 注意事项
注:共有12条数据(占比0.02%)未能成功完成翻译,缺失了AI标识符`[|AI|]`和/或人类标识符`[|Human|]`。缺失标识符的数据项ID为[1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740]。
本数据集的翻译质量未经过验证,使用前请自行承担风险!
### 许可信息
Quora数据集的许可信息未知。Baize数据集采用GPLv3许可,因此本数据集亦采用该许可。同时存在一项特殊限制:根据OpenAI的使用条款,本数据集不得用于构建与OpenAI服务形成商业竞争的系统。
本文本部分或全部由OpenAI的大规模语言生成模型GPT-3(gpt-3.5-turbo)生成。在生成初稿后,作者已对文本进行审阅、编辑与修改,并对出版物的最终内容承担全部责任。
若使用本数据集,你还需遵守“共享”与“使用”相关政策。
正如其服务条款第2c.iii条明确规定的:"[你不得]使用本服务的输出开发与OpenAI形成竞争的模型"。这意味着你不得使用本数据集构建旨在与OpenAI形成商业竞争的模型。据我所知,该条款为本许可的附加特殊限制。
本数据集同时可在Hugging Face Hub上获取,该平台为本数据集的官方托管仓库,更多详情请访问对应链接。
创建时间:
2023-06-20



