BramVanroy/dolly-15k-dutch
收藏数据集卡片 for Dolly 15k Dutch
数据集描述
数据集摘要
该数据集包含14,934条指令、上下文和响应,涵盖多种自然语言类别,如分类、封闭式问答、生成等。原始数据集由@databricks创建,通过其员工众包数据创建。当前数据集是通过ChatGPT(gpt-3.5-turbo)翻译的。
语言
- 荷兰语
数据集结构
数据实例
python { "id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", "context": "", "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.", "category": "brainstorming" }
数据字段
- id: 项目的ID。以下77个ID未包含,因为它们无法翻译(或太长):
[1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966] - instruction: 指令(问题)
- context: 额外的上下文,AI可以用来回答问题
- response: AI的预期响应
- category: 此类问题的类别(参见Dolly了解更多信息)
数据集创建
翻译和主题均使用OpenAI的API进行,参数为max_tokens=1024, temperature=0。
翻译输入的提示模板如下(其中src_lang为英语,tgt_lang为荷兰语):
python CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a tasks instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.
Here are the requirements that you should adhere to:
- maintain the format: the task consists of a task instruction (marked
instruction:), optional context to the task (markedcontext:) and response for the task marked withresponse:; - do not translate the identifiers
instruction:,context:, andresponse:but instead copy them to your output; - make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
- translate the instruction and context text using informal, but standard, language;
- make sure to avoid biases (such as gender bias, grammatical bias, social bias);
- if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
- if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
- do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
系统消息为:
You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
注意,有77个项目(0.5%)未成功翻译。这可能意味着提示太长(max_tokens=1024)或生成的翻译无法解析为instruction、context和response字段。缺失的ID为[1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]。
源数据
初始数据收集和规范化
初始数据由databricks收集。有关此数据集的更多信息,请参阅其仓库。
使用数据的注意事项
注意,该新数据集中的翻译尚未经过人工验证!请自行承担使用风险,包括质量和偏见风险。
偏见讨论
与任何机器生成的文本一样,用户应注意该数据集中可能包含的潜在偏见。尽管提示中特别提到“确保避免偏见(如性别偏见、语法偏见、社会偏见)”,但此类命令的影响未知。很可能数据集中仍存在偏见,因此请谨慎使用。
其他已知限制
翻译质量尚未经过验证。请自行承担使用风险!
许可信息
该仓库遵循原始databricks许可证,即CC BY-SA 3.0,但请参阅以下特定限制。
该文本(部分或全部)使用GPT-3(gpt-3.5-turbo),OpenAI的大规模语言生成模型生成。生成草稿语言后,作者进行了审查、编辑和修订,以符合自己的喜好,并最终对出版物内容负责。
根据其服务条款,特别是2c.iii,“[您不得]使用服务输出开发与OpenAI竞争的模型”。这意味着您不能使用此数据集构建旨在与OpenAI商业竞争的模型。据我所知,这是一个应作为当前许可证附加条款的特定限制。
引用信息
如果您使用此数据集,请引用:
Vanroy, B. (2023). Dolly 15k Dutch [Data set]. Hugging Face. https://doi.org/10.57967/hf/0785
bibtex @misc {https://doi.org/10.57967/hf/0785, author = {Vanroy, Bram }, title = { {D}olly 15k {D}utch }, year = 2023, url = { https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch }, doi = { 10.57967/hf/0785 }, publisher = { Hugging Face } }
贡献
感谢databricks提供初始的高质量数据集。



