five

BramVanroy/dolly-15k-dutch

收藏
Hugging Face2024-01-22 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/BramVanroy/dolly-15k-dutch
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - nl license: cc-by-nc-sa-3.0 size_categories: - 10K<n<100K task_categories: - question-answering - text-generation pretty_name: Dolly 15k Dutch tags: - dolly - instruct - instruction dataset_info: features: - name: prompt dtype: string - name: prompt_id dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string splits: - name: train_sft num_bytes: 12428974 num_examples: 12911 - name: test_sft num_bytes: 1338966 num_examples: 1428 download_size: 8279222 dataset_size: 13767940 configs: - config_name: default data_files: - split: train_sft path: data/train_sft-* - split: test_sft path: data/test_sft-* --- # Dataset Card for Dolly 15k Dutch ## Dataset Description - **Homepage:** N/A - **Repository:** N/A - **Paper:** N/A - **Leaderboard:** N/A - **Point of Contact:** Bram Vanroy ### Dataset Summary This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The English [original dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (`gpt-3.5-turbo`). ☕ [**Want to help me out?**](https://www.buymeacoffee.com/bramvanroy) Translating the data with the OpenAI API, and prompt testing, cost me 💸$19.38💸. If you like this dataset, please consider [buying me a coffee](https://www.buymeacoffee.com/bramvanroy) to offset a portion of this cost, I appreciate it a lot! ☕ If you use this dataset or refer to it, please use the following citation: Vanroy, B. (2023). *Language Resources for Dutch Large Language Modelling*. [https://arxiv.org/abs/2312.12852](https://arxiv.org/abs/2312.12852) ```bibtext @article{vanroy2023language, title={Language Resources for {Dutch} Large Language Modelling}, author={Vanroy, Bram}, journal={arXiv preprint arXiv:2312.12852}, year={2023} } ``` ### Languages - Dutch ## Dataset Structure ### Data Instances ```python { "id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", "context": "", "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.", "category": "brainstorming" } ``` ### Data Fields - **id**: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): `[1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 1 4966]` - **instruction**: the instruction (question) - **context**: additional context that the AI can use to answer the question - **response**: the AI's expected response - **category**: the category of this type of question (see [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k#annotator-guidelines) for more info) ## Dataset Creation Both the translations and the topics were translated with OpenAI's API for `gpt-3.5-turbo`. `max_tokens=1024, temperature=0` as parameters. The prompt template to translate the input is (where `src_lang` was English and `tgt_lang` Dutch): ```python CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}. Here are the requirements that you should adhere to: 1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional context to the task (marked `context: `) and response for the task marked with `response: `; 2. do not translate the identifiers `instruction: `, `context: `, and `response: ` but instead copy them to your output; 3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 4. translate the instruction and context text using informal, but standard, language; 5. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang}; 7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is); 8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English. Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.\n\n""" ``` The system message was: ``` You are a helpful assistant that translates English to Dutch according to the requirements that are given to you. ``` Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (`max_tokens=1024`) or that the generated translation could not be parsed into `instruction`, `context` and `response` fields. The missing IDs are `[1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 1 4966]`. ### Source Data #### Initial Data Collection and Normalization Initial data collection by [databricks](https://huggingface.co/datasets/databricks/databricks-dolly-15k). See their repository for more information about this dataset. ## Considerations for Using the Data Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases. ### Discussion of Biases As with any machine-generated texts, users should be aware of potential biases that are included in this dataset. Although the prompt specifically includes `make sure to avoid biases (such as gender bias, grammatical bias, social bias)`, of course the impact of such command is not known. It is likely that biases remain in the dataset so use with caution. ### Other Known Limitations The translation quality has not been verified. Use at your own risk! ### Licensing Information This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction. This text was generated (either in part or in full) with GPT-3 (`gpt-3.5-turbo`), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication. If you use this dataset, you must also follow the [Sharing](https://openai.com/policies/sharing-publication-policy) and [Usage](https://openai.com/policies/usage-policies) policies. As clearly stated in their [Terms of Use](https://openai.com/policies/terms-of-use), specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. [As far as I am aware](https://law.stackexchange.com/questions/93308/licensing-material-generated-with-chatgpt), that is a specific restriction that should serve as an addendum to the current license. ### Citation Information If you use this data set, please cite : Vanroy, B. (2023). Dolly 15k Dutch [Data set]. Hugging Face. https://doi.org/10.57967/hf/0785 ```bibtex @misc {https://doi.org/10.57967/hf/0785, author = {Vanroy, Bram }, title = { {D}olly 15k {D}utch }, year = 2023, url = { https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch }, doi = { 10.57967/hf/0785 }, publisher = { Hugging Face } } ``` ### Contributions Thanks to [databricks](https://huggingface.co/datasets/databricks/databricks-dolly-15k) for the initial, high-quality dataset.
提供机构:
BramVanroy
原始信息汇总

数据集卡片 for Dolly 15k Dutch

数据集描述

数据集摘要

该数据集包含14,934条指令、上下文和响应,涵盖多种自然语言类别,如分类、封闭式问答、生成等。原始数据集由@databricks创建,通过其员工众包数据创建。当前数据集是通过ChatGPT(gpt-3.5-turbo)翻译的。

语言

  • 荷兰语

数据集结构

数据实例

python { "id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", "context": "", "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.", "category": "brainstorming" }

数据字段

  • id: 项目的ID。以下77个ID未包含,因为它们无法翻译(或太长):[1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]
  • instruction: 指令(问题)
  • context: 额外的上下文,AI可以用来回答问题
  • response: AI的预期响应
  • category: 此类问题的类别(参见Dolly了解更多信息)

数据集创建

翻译和主题均使用OpenAI的API进行,参数为max_tokens=1024, temperature=0

翻译输入的提示模板如下(其中src_lang为英语,tgt_lang为荷兰语):

python CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a tasks instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.

Here are the requirements that you should adhere to:

  1. maintain the format: the task consists of a task instruction (marked instruction: ), optional context to the task (marked context: ) and response for the task marked with response: ;
  2. do not translate the identifiers instruction: , context: , and response: but instead copy them to your output;
  3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
  4. translate the instruction and context text using informal, but standard, language;
  5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
  6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
  7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
  8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.

"""

系统消息为:

You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.

注意,有77个项目(0.5%)未成功翻译。这可能意味着提示太长(max_tokens=1024)或生成的翻译无法解析为instructioncontextresponse字段。缺失的ID为[1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]

源数据

初始数据收集和规范化

初始数据由databricks收集。有关此数据集的更多信息,请参阅其仓库。

使用数据的注意事项

注意,该新数据集中的翻译尚未经过人工验证!请自行承担使用风险,包括质量和偏见风险。

偏见讨论

与任何机器生成的文本一样,用户应注意该数据集中可能包含的潜在偏见。尽管提示中特别提到“确保避免偏见(如性别偏见、语法偏见、社会偏见)”,但此类命令的影响未知。很可能数据集中仍存在偏见,因此请谨慎使用。

其他已知限制

翻译质量尚未经过验证。请自行承担使用风险!

许可信息

该仓库遵循原始databricks许可证,即CC BY-SA 3.0,但请参阅以下特定限制。

该文本(部分或全部)使用GPT-3(gpt-3.5-turbo),OpenAI的大规模语言生成模型生成。生成草稿语言后,作者进行了审查、编辑和修订,以符合自己的喜好,并最终对出版物内容负责。

如果您使用此数据集,还必须遵守共享使用政策。

根据其服务条款,特别是2c.iii,“[您不得]使用服务输出开发与OpenAI竞争的模型”。这意味着您不能使用此数据集构建旨在与OpenAI商业竞争的模型。据我所知,这是一个应作为当前许可证附加条款的特定限制。

引用信息

如果您使用此数据集,请引用:

Vanroy, B. (2023). Dolly 15k Dutch [Data set]. Hugging Face. https://doi.org/10.57967/hf/0785

bibtex @misc {https://doi.org/10.57967/hf/0785, author = {Vanroy, Bram }, title = { {D}olly 15k {D}utch }, year = 2023, url = { https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch }, doi = { 10.57967/hf/0785 }, publisher = { Hugging Face } }

贡献

感谢databricks提供初始的高质量数据集。

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作