BramVanroy/stackoverflow-chat-dutch

Name: BramVanroy/stackoverflow-chat-dutch
Creator: BramVanroy
Published: 2024-01-22 10:44:13
License: 暂无描述

Hugging Face2024-01-22 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/BramVanroy/stackoverflow-chat-dutch

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集包含56,964个荷兰语的对话，涉及编程领域的问答，这些对话是由AI助手与（模拟的）人类生成的，并且是基于Baize项目中的机器生成答案翻译而来。数据集的创建使用了OpenAI的API进行翻译，并且有82个条目未能成功翻译。数据集的使用需要注意潜在的偏见和翻译质量问题，且必须遵守OpenAI的使用政策。

提供机构：

BramVanroy

原始信息汇总

数据集卡片 for Stack Overflow Chat Dutch

数据集描述

数据集摘要

该数据集包含56,964个荷兰语的AI助手与（假）“人类”（生成）之间的对话，特别针对编程领域（Stack Overflow）。这些对话是Baize对Stack Overflow数据集的机器生成答案的翻译。

语言

荷兰语

数据集结构

数据实例

python {id: 9, input: Het gesprek tussen de mens en de AI-assistent.

       [|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen 
       van kolommen.

       [|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp 
       bij nodig?

       [|Human|] Ik wil de output van een UNION-query ordenen op een 
       kolom die niet aanwezig is in de SELECT-instructie. Is dat 
       mogelijk?

       [|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale 
       positie van de kolom in de ORDER BY-clausule gebruiken in plaats 
       van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde 
       kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken.

       [|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de 
       ORDER BY-clausule?

       [|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De 
       alias moet worden gedefinieerd in de SELECT-instructie. Als je 
       bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen 
       met ORDER BY col3.

       [|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias 
       gebruiken voor kolommen uit beide tabellen in de UNION?

       [|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen 
       uit beide tabellen in de UNION. Elke tabel in de UNION moet een 
       unieke alias hebben voor de kolommen.

       [|Human|] ,

topic: Spark SQL UNION - ORDER BY kolom niet in SELECT},

数据字段

id: 项目的ID。以下82个ID未包含，因为它们无法翻译：[1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: AI与“人类”之间的机器生成对话。总是以Het gesprek tussen de mens en de AI-assistent. 开头，并且至少有一次[|AI|]和[|Human|]的出现。
topic: 主题描述

数据集创建

翻译和主题都是使用OpenAI的API gpt-3.5-turbo进行翻译的，参数为max_tokens=1024, temperature=0。

翻译输入的提示模板如下（其中src_lang为英语，tgt_lang为荷兰语）：

python CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

Here are the requirements that you should adhere to:

maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
ensure accurate translation and keep the correctness of the conversation;
make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
translate the humans text using informal, but standard, language;
make sure to avoid biases (such as gender bias, grammatical bias, social bias);
if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
if the human asks to translate text from one to another language, then you only translate the humans question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.

"""

翻译主题的提示如下：

python TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct," " summarizing manner. Translate accurately and formally. Do not provide any explanation" " about the translation and do not include the original title.

系统消息为：

You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

注意，有82个项目（0.1%）未成功翻译。翻译缺失了AI标识符[|AI|]和/或人类标识符[|Human|]。缺失项目的ID为[1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]。

源数据

初始数据收集和规范化

初始数据由Baize生成，该数据集依赖于Stack Overflow Questions。

源语言生产者

Quora用户和OpenAI的GPT模型（通过Baize）。

使用数据集的注意事项

注意，该新数据集中的翻译以及原始Baize数据集中的机器生成对话均未经过人工验证。

偏见讨论

与任何机器生成文本一样，用户应注意该数据集中可能包含的潜在偏见。尽管提示中特别包含确保避免偏见（如性别偏见、语法偏见、社会偏见），但这种命令的影响是未知的。很可能数据集中仍然存在偏见，因此请谨慎使用。

其他已知限制

翻译质量未经验证。请自行承担风险使用！

许可信息

Stack Overflow Questions的许可信息为Apache 2.0。如果您使用当前数据集，您还应遵守原始许可。

这段文本是使用GPT-3（gpt-3.5-turbo），OpenAI的大规模语言生成模型生成的。在生成草稿语言后，作者进行了审查、编辑和修订，以符合自己的喜好，并最终对该出版物的内容负责。

如果您使用此数据集，您还必须遵守共享和使用政策。

正如他们在使用条款中明确指出的，特别是2c.iii，“[您不得]使用服务的输出来开发与OpenAI竞争的模型”。这意味着您不能使用此数据集来构建旨在与OpenAI商业竞争的模型。据我所知，这是一个特定的限制，应作为当前许可的附加条款。

贡献

感谢Baize提供的初始机器生成对话，基于Stack Overflow Questions。

搜集汇总

数据集介绍

构建方式

在自然语言处理领域，高质量的多语言对话数据对于训练和评估大型语言模型至关重要。Stack Overflow Chat Dutch数据集的构建过程体现了跨语言资源转化的精密设计。该数据集源自Baize项目基于Stack Overflow Questions生成的英文对话，通过OpenAI的gpt-3.5-turbo模型进行机器翻译转化为荷兰语。翻译过程采用严格的提示模板，要求保持原始对话格式、准确翻译技术内容、保留代码片段原貌，并避免各类语言偏见。尽管翻译过程自动化程度较高，但仍有82条数据因格式问题被排除，确保了最终数据集的规范性。

特点

作为荷兰语编程对话领域的专项资源，该数据集展现出鲜明的领域特异性与语言纯粹性。数据集包含56,964条人工与AI助手之间的模拟对话，全部围绕Stack Overflow编程问题展开，形成了高度聚焦的技术对话语料。每条数据均包含完整的对话轮次、明确的说话人标识以及概括性的主题描述，结构清晰统一。其独特价值在于将英语技术对话高质量转化为荷兰语，填补了低资源语言在编程问答任务中的数据空白，为跨语言模型训练提供了重要支撑。

使用方法

该数据集主要服务于荷兰语大型语言模型的监督微调任务，特别是在编程问答与文本生成领域具有直接应用价值。研究人员可将数据集加载至标准机器学习框架，利用其训练集与测试集划分进行模型训练与评估。使用时应充分认识其机器生成与机器翻译的双重特性，建议结合人工验证以控制潜在偏见与错误传播。鉴于数据集的Apache 2.0许可及OpenAI使用限制，需严格遵守相关商业使用条款，避免用于开发与OpenAI竞争的产品。

背景与挑战

背景概述

在自然语言处理领域，针对低资源语言的高质量对话数据集长期匮乏，制约了相关语言模型的研发进展。BramVanroy/stackoverflow-chat-dutch数据集由研究者Bram Vanroy于2023年创建，旨在填补荷兰语编程对话数据的空白。该数据集基于Baize项目生成的英文编程对话，通过OpenAI API系统性地翻译为荷兰语，共包含56,964轮人机对话，聚焦于Stack Overflow社区的编程问题解答场景。其核心研究问题在于为荷兰语大型语言模型提供监督微调所需的指令遵循数据，推动低资源语言在代码生成与问答任务上的性能提升，相关成果已发表于预印本平台arXiv，为多语言NLP社区贡献了关键资源。

当前挑战

该数据集致力于解决荷兰语编程对话生成与问答任务的领域挑战，其核心在于克服低资源语言中技术术语准确翻译与对话逻辑一致性的双重难题。构建过程中的挑战尤为显著：首先，机器翻译需在保留原始编程代码片段的同时，确保荷兰语技术表述的精确性与自然性，82条数据因翻译丢失对话标识符而被迫剔除；其次，尽管提示词明确要求避免性别、语法等社会偏见，但机器生成内容仍可能隐含难以察觉的偏差，且翻译质量未经人工验证，存在潜在错误传播风险。此外，数据衍生自多层自动生成流程（Stack Overflow→Baize→翻译），原始噪声的累积效应可能影响下游模型训练的稳定性。

常用场景

经典使用场景

在自然语言处理领域，针对低资源语言的对话系统开发常面临数据稀缺的挑战。BramVanroy/stackoverflow-chat-dutch数据集通过将英文编程问答对话机器翻译为荷兰语，为荷兰语大型语言模型的指令微调提供了关键资源。该数据集以Stack Overflow编程问题为背景，模拟了人类与AI助手之间的多轮技术对话，其经典应用场景在于训练和评估面向荷兰语用户的代码辅助聊天机器人，尤其在跨语言迁移学习中，它能够有效提升模型对荷兰语技术术语和对话结构的理解能力。

衍生相关工作

该数据集衍生的经典工作主要围绕荷兰语语言模型的优化与评估展开。例如，在Bram Vanroy的相关研究中，该数据集被用于构建荷兰语指令微调基准，以评测模型如ChatGPT对低资源语言的适应能力。同时，它也为后续研究如荷兰语对话模型的偏见分析、机器翻译在技术领域的适用性检验，以及多语言代码生成模型的性能对比提供了数据基础。这些工作共同推动了针对特定语言和垂直领域的对话人工智能技术的精细化发展。

数据集最近研究