BramVanroy/chatgpt-dutch-simplification
Source: Hugging Face. Updated 2023-06-19. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/BramVanroy/chatgpt-dutch-simplification
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text2text-generation
task_ids:
- text-simplification
language:
- nl
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: source
    dtype: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1013
  - name: validation
    num_examples: 126
  - name: test
    num_examples: 128
  dataset_size: 1267
train-eval-index:
- config: default
  task: text2text-generation
  task_id: text-simplification
  splits:
    train_split: train
    eval_split: validation
    test_split: test
  metrics:
  - type: sari
    name: Test SARI
  - type: rouge
    name: Test ROUGE
pretty_name: ChatGPT Dutch Simplification
---
# Dataset Card for ChatGPT Dutch Simplification
## Dataset Description
- **Point of Contact:** [Bram Vanroy](https://twitter.com/BramVanroy)
### Dataset Summary
Created for the master's thesis of Charlotte Van de Velde, as part of the Master of Science in Artificial Intelligence at KU Leuven.
Charlotte is supervised by Vincent Vandeghinste and Bram Vanroy.
The dataset contains Dutch source sentences and aligned simplified sentences, both generated with ChatGPT. Across all splits, the dataset consists of 1267 entries.
Charlotte used gpt-3.5-turbo with the following prompt:
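> Schrijf een moeilijke zin, en daarna een simpele versie ervan. De simpele versie moet makkelijker zijn om te lezen en te begrijpen. Schrijf "Moeilijke zin: " aan het begin van de moeilijke zin, en "Simpele versie: " aan het begin van de simpele versie.

In English: Write a difficult sentence, and then a simple version of it. The simple version must be easier to read and understand. Write "Moeilijke zin: " at the start of the difficult sentence, and "Simpele versie: " at the start of the simple version.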
Parameters:
- `temperature=0.9`
- `max_tokens=1000`
- `top_p=1`
- `frequency_penalty=0.1`
- `presence_penalty=0`
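Because the prompt asks the model to label both sentences, each response can be split into a `source`/`target` pair with a small parser. The sketch below is only an illustration: the card does not describe how responses were post-processed, so `PAIR_PATTERN` and `parse_pair` are hypothetical names, and `GENERATION_PARAMS` simply restates the parameters listed above using the underscore spellings of the OpenAI chat completions API.

```python
import re

# Sampling parameters as listed in this card, written with the
# underscore names used by the OpenAI chat completions API.
GENERATION_PARAMS = {
    "model": "gpt-3.5-turbo",
    "temperature": 0.9,
    "max_tokens": 1000,
    "top_p": 1,
    "frequency_penalty": 0.1,
    "presence_penalty": 0,
}

# Hypothetical parser: an assumption about post-processing, not the
# method used to build this dataset.
PAIR_PATTERN = re.compile(
    r"Moeilijke zin:\s*(?P<source>.+?)\s*Simpele versie:\s*(?P<target>.+)",
    re.DOTALL,
)


def parse_pair(response_text):
    """Split one model response into a source/target record.

    Returns None when the two labels requested in the prompt are missing.
    """
    match = PAIR_PATTERN.search(response_text)
    if match is None:
        return None
    return {
        "source": match.group("source").strip(),
        "target": match.group("target").strip(),
    }
```

Returning `None` for unlabeled responses lets a caller drop or retry generations that ignored the requested format.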
Bram Vanroy was not involved in the data collection. He only generated the data splits and provides the dataset as-is on this platform. The splits were generated with [the following script](https://github.com/BramVanroy/mai-simplification-nl-2023#1-split-the-data).
### Supported Tasks and Leaderboards
Intended for text2text generation, specifically text simplification.
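The card metadata names SARI and ROUGE as evaluation metrics. As a rough sketch of how predictions on this dataset could be scored with SARI, the helper below arranges rows and model outputs into the `sources`, `predictions`, and `references` arguments taken by the SARI metric of the Hugging Face `evaluate` package. The function names `to_sari_inputs` and `compute_sari` are illustrative, and using `evaluate` at all is an assumption; the card does not prescribe an evaluation setup.

```python
def to_sari_inputs(examples, predictions):
    """Arrange dataset rows and model outputs in the shape SARI expects.

    `examples` is an iterable of dicts with "source" and "target" keys;
    `predictions` is a list with one model output per example, in the
    same order. Each example has a single reference, so the target is
    wrapped in a one-element list.
    """
    examples = list(examples)
    predictions = list(predictions)
    if len(examples) != len(predictions):
        raise ValueError("need exactly one prediction per example")
    return {
        "sources": [ex["source"] for ex in examples],
        "predictions": predictions,
        "references": [[ex["target"]] for ex in examples],
    }


def compute_sari(examples, predictions):
    """Score predictions with the SARI metric from the `evaluate` package.

    The import is local because `evaluate` is an optional dependency and
    loading the metric may need network access.
    """
    import evaluate

    sari = evaluate.load("sari")
    return sari.compute(**to_sari_inputs(examples, predictions))
```

Keeping the reshaping step separate from the metric call makes it possible to check the input format without installing or downloading anything.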
### Languages
Dutch
## Dataset Structure
### Data Instances
```python
{
    "source": "Het fenomeen van acquisitie van taalkennis vindt plaats door middel van het opdoen van ervaringen met de taal in diverse contexten.",
    "target": "Je leert een taal door de taal te gebruiken in verschillende situaties."
}
```
### Data Fields
- `source`: the "more difficult" Dutch sentence
- `target`: the simplified Dutch sentence
### Data Splits
- train: 1013
- validation: 126
- test: 128
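The split sizes above add up to the 1267 entries stated in the summary and correspond to roughly an 80/10/10 division. A minimal sketch for checking this, and for verifying a downloaded copy against the card, is shown below. It assumes the Hugging Face `datasets` package for loading; `EXPECTED_SPLIT_SIZES`, `split_fractions`, and `load_and_check` are illustrative names, not part of the dataset.

```python
# Split sizes as reported in this dataset card.
EXPECTED_SPLIT_SIZES = {"train": 1013, "validation": 126, "test": 128}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_and_check(repo_id="BramVanroy/chatgpt-dutch-simplification"):
    """Download the dataset and compare split sizes with the card.

    Needs the `datasets` package and network access, so the import is
    kept local to this function.
    """
    from datasets import load_dataset

    dataset = load_dataset(repo_id)
    for name, expected in EXPECTED_SPLIT_SIZES.items():
        actual = dataset[name].num_rows
        if actual != expected:
            raise ValueError(f"{name}: expected {expected}, got {actual}")
    return dataset


if __name__ == "__main__":
    for name, fraction in split_fractions(EXPECTED_SPLIT_SIZES).items():
        print(f"{name}: {fraction:.1%}")
```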
## Disclaimer about data usage
This text was generated (either in part or in full) with GPT-3 (`gpt-3.5-turbo`), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow OpenAI's [Sharing](https://openai.com/policies/sharing-publication-policy) and [Usage](https://openai.com/policies/usage-policies) policies.
OpenAI's [Terms of Use](https://openai.com/policies/terms-of-use), specifically section 2c.iii, state that "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. [As far as I am aware](https://law.stackexchange.com/questions/93308/licensing-material-generated-with-chatgpt), that is a specific restriction that should serve as an addendum to the current license.
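Provider:
BramVanroy
Summary of original information
Dataset overview
Dataset name
ChatGPT Dutch Simplification
Dataset description
This dataset was created by Charlotte Van de Velde as part of master's thesis research at KU Leuven. It contains Dutch source sentences and their corresponding simplified sentences, 1267 records in total. The data was generated with ChatGPT and is intended for text simplification.
Dataset features
- source: the "more difficult" Dutch sentence; data type string.
- target: the corresponding simplified Dutch sentence; data type string.
Dataset structure
- Training set: 1013 records
- Validation set: 126 records
- Test set: 128 records
Dataset size
1267 records in total
Language
Dutch
Task type
- Task category: text-to-text generation
- Task ID: text simplification
Evaluation metrics
- SARI
- ROUGE
License
CC-BY-NC-SA-4.0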



