BramVanroy/chatgpt-dutch-simplification

Name: BramVanroy/chatgpt-dutch-simplification
Creator: BramVanroy
Published: 2023-06-19 13:39:34
License: 暂无描述

Hugging Face2023-06-19 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/BramVanroy/chatgpt-dutch-simplification

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-nc-sa-4.0 task_categories: - text2text-generation task_ids: - text-simplification language: - nl multilinguality: - monolingual size_categories: - 1K<n<10K dataset_info: features: - name: source dtype: string - name: target dtype: string splits: - name: train num_examples: 1013 - name: validation num_examples: 126 - name: test num_examples: 128 dataset_size: 1267 train-eval-index: - config: default task: text2text-generation task_id: text-simplification splits: train_split: train eval_split: validation test_split: test metrics: - type: sari name: Test SARI - type: rouge name: Test ROUGE pretty_name: ChatGPT Dutch Simplification --- # Dataset Card for ChatGPT Dutch Simplification ## Dataset Description - **Point of Contact:** [Bram Vanroy](https://twitter.com/BramVanroy) ### Dataset Summary Created in light of a master thesis by Charlotte Van de Velde as part of the Master of Science in Artificial Intelligence at KU Leuven. Charlotte is supervised by Vincent Vandeghinste and Bram Vanroy. The dataset contains Dutch source sentences and aligned simplified sentences, generated with ChatGPT. All splits combined, the dataset consists of 1267 entries. Charlotte used gpt-3.5-turbo with the following prompt: > Schrijf een moeilijke zin, en daarna een simpele versie ervan. De simpele versie moet makkelijker zijn om te lezen en te begrijpen. Schrijf "Moeilijke zin: " aan het begin van de moeilijke zin, en "Simpele versie: " aan het begin van de simpele versie. Parameters: - temperature=0.9 - max tokens=1000 - top p=1 - frequency penalty=0.1 - presence penalty=0 Bram Vanroy was not involved in the data collection but only generated the data splits and provides the dataset as-is on this online platform. Splits were generated with [the following script](https://github.com/BramVanroy/mai-simplification-nl-2023#1-split-the-data). ### Supported Tasks and Leaderboards Intended for text2text generation, specifically text simplification. ### Languages Dutch ## Dataset Structure ### Data Instances ```python { "source": "Het fenomeen van acquisitie van taalkennis vindt plaats door middel van het opdoen van ervaringen met de taal in diverse contexten.", "target": "Je leert een taal door de taal te gebruiken in verschillende situaties." } ``` ### Data Fields - source: the "more difficult" Dutch sentence - target: the simplified Dutch sentence ### Data Splits - train: 1013 - validation: 126 - test: 128 ## Disclaimer about data usage This text was generated (either in part or in full) with GPT-3 (`gpt-3.5-turbo`), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication. If you use this dataset, you must also follow the [Sharing](https://openai.com/policies/sharing-publication-policy) and [Usage](https://openai.com/policies/usage-policies) policies. As clearly stated in their [Terms of Use](https://openai.com/policies/terms-of-use), specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. [As far as I am aware](https://law.stackexchange.com/questions/93308/licensing-material-generated-with-chatgpt), that is a specific restriction that should serve as an addendum to the current license.

提供机构：

BramVanroy

原始信息汇总

数据集概述

数据集名称

ChatGPT Dutch Simplification

数据集描述

本数据集由Charlotte Van de Velde在KU Leuven的硕士论文研究中创建，包含荷兰语源句及其对应的简化句，共1267条记录。数据通过ChatGPT生成，用于文本简化任务。

数据集特征

source: 荷兰语的“较难”句子，数据类型为字符串。
target: 对应的简化荷兰语句子，数据类型为字符串。

数据集结构

训练集: 1013条记录
验证集: 126条记录
测试集: 128条记录

数据集大小

总计1267条记录

语言

荷兰语

任务类型

任务类别: 文本到文本生成
任务ID: 文本简化

评估指标

SARI
ROUGE

许可协议

CC-BY-NC-SA-4.0

5,000+

优质数据集

54 个

任务类型

进入经典数据集