BramVanroy/stackoverflow-chat-dutch
收藏数据集卡片 for Stack Overflow Chat Dutch
数据集描述
数据集摘要
该数据集包含56,964个荷兰语的AI助手与(假)“人类”(生成)之间的对话,特别针对编程领域(Stack Overflow)。这些对话是Baize对Stack Overflow数据集的机器生成答案的翻译。
语言
- 荷兰语
数据集结构
数据实例
python {id: 9, input: Het gesprek tussen de mens en de AI-assistent.
[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen
van kolommen.
[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp
bij nodig?
[|Human|] Ik wil de output van een UNION-query ordenen op een
kolom die niet aanwezig is in de SELECT-instructie. Is dat
mogelijk?
[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale
positie van de kolom in de ORDER BY-clausule gebruiken in plaats
van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde
kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken.
[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de
ORDER BY-clausule?
[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De
alias moet worden gedefinieerd in de SELECT-instructie. Als je
bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen
met ORDER BY col3.
[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias
gebruiken voor kolommen uit beide tabellen in de UNION?
[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen
uit beide tabellen in de UNION. Elke tabel in de UNION moet een
unieke alias hebben voor de kolommen.
[|Human|] ,
topic: Spark SQL UNION - ORDER BY kolom niet in SELECT},
数据字段
- id: 项目的ID。以下82个ID未包含,因为它们无法翻译:
[1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992] - input: AI与“人类”之间的机器生成对话。总是以
Het gesprek tussen de mens en de AI-assistent.开头,并且至少有一次[|AI|]和[|Human|]的出现。 - topic: 主题描述
数据集创建
翻译和主题都是使用OpenAI的API gpt-3.5-turbo进行翻译的,参数为max_tokens=1024, temperature=0。
翻译输入的提示模板如下(其中src_lang为英语,tgt_lang为荷兰语):
python CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
- maintain the format: the conversation consists of the AI (marked as
[|AI|]) and the human ([|Human|]) talking in turns and responding to each other; - do not translate the speaker identifiers
[|AI|]and[|Human|]but always copy them into the translation in appropriate places; - ensure accurate translation and keep the correctness of the conversation;
- make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
- translate the humans text using informal, but standard, language;
- make sure to avoid biases (such as gender bias, grammatical bias, social bias);
- if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
- if the human asks to translate text from one to another language, then you only translate the humans question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
- do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
翻译主题的提示如下:
python TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct," " summarizing manner. Translate accurately and formally. Do not provide any explanation" " about the translation and do not include the original title.
"
系统消息为:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
注意,有82个项目(0.1%)未成功翻译。翻译缺失了AI标识符[|AI|]和/或人类标识符[|Human|]。缺失项目的ID为[1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]。
源数据
初始数据收集和规范化
初始数据由Baize生成,该数据集依赖于Stack Overflow Questions。
源语言生产者
Quora用户和OpenAI的GPT模型(通过Baize)。
使用数据集的注意事项
注意,该新数据集中的翻译以及原始Baize数据集中的机器生成对话均未经过人工验证。
偏见讨论
与任何机器生成文本一样,用户应注意该数据集中可能包含的潜在偏见。尽管提示中特别包含确保避免偏见(如性别偏见、语法偏见、社会偏见),但这种命令的影响是未知的。很可能数据集中仍然存在偏见,因此请谨慎使用。
其他已知限制
翻译质量未经验证。请自行承担风险使用!
许可信息
Stack Overflow Questions的许可信息为Apache 2.0。如果您使用当前数据集,您还应遵守原始许可。
这段文本是使用GPT-3(gpt-3.5-turbo),OpenAI的大规模语言生成模型生成的。在生成草稿语言后,作者进行了审查、编辑和修订,以符合自己的喜好,并最终对该出版物的内容负责。
正如他们在使用条款中明确指出的,特别是2c.iii,“[您不得]使用服务的输出来开发与OpenAI竞争的模型”。这意味着您不能使用此数据集来构建旨在与OpenAI商业竞争的模型。据我所知,这是一个特定的限制,应作为当前许可的附加条款。
贡献
感谢Baize提供的初始机器生成对话,基于Stack Overflow Questions。




