benjleite/FairytaleQA-translated-ptBR
Source: Hugging Face · Updated 2024-06-11 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/benjleite/FairytaleQA-translated-ptBR
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- pt
tags:
- question-answering
- question-generation
- education
- children education
size_categories:
- 10K<n<100K
---
# Dataset Card for FairytaleQA-translated-ptBR
## Dataset Description
- **Homepage:**
- **Repository:** https://github.com/bernardoleite/fairytaleqa-translated
- **Paper:** https://arxiv.org/abs/2406.04233v1
- **Leaderboard:** https://paperswithcode.com/sota/question-generation-on-fairytaleqa
- **Point of Contact:** Bernardo Leite (benjleite.com)
### Dataset Summary
This repository contains the **Brazilian Portuguese (pt-BR)** machine-translated version of the original English FairytaleQA dataset (https://huggingface.co/datasets/WorkInTheDark/FairytaleQA). FairytaleQA is an open-source dataset designed to enhance comprehension of narratives, aimed at students from kindergarten to eighth grade. The dataset is meticulously annotated by education experts following an evidence-based theoretical framework. It comprises 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.
This translation was performed using DeepL as part of our research: **FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages**.
You can load the dataset via:
```
import datasets
# Downloads the dataset from the Hugging Face Hub and returns all available splits.
data = datasets.load_dataset('benjleite/FairytaleQA-translated-ptBR')
```
### Supported Tasks and Leaderboards
Question-Answering, Question-Generation, Question-Answer Pair Generation
### Languages
Brazilian Portuguese (pt-BR)
### Example
An example from the "train" split looks as follows:
```
{
'story_name': 'the-toad-woman-story',
'story_section': 'Certa vez, uma jovem que vivia sozinha na floresta...',
'question': 'Quem a mulher viu deslizando pela floresta?',
'answer': 'Um homem jovem e bonito.',
'local-or-sum': 'local',
'attribute': 'character',
'ex-or-im': 'explicit',
'ex-or-im2': '',
}
```
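Because each record is a flat dictionary, records can be grouped by any of the label fields with plain Python. The sketch below groups questions by `attribute`; the helper name `group_by_attribute` is hypothetical, the first record reuses fields from the example above, and the second record is a placeholder invented purely for illustration.

```python
from collections import defaultdict


def group_by_attribute(records):
    """Group question strings by the value of each record's 'attribute' field."""
    groups = defaultdict(list)
    for record in records:
        groups[record["attribute"]].append(record["question"])
    return dict(groups)


records = [
    # Fields copied from the example record above.
    {"question": "Quem a mulher viu deslizando pela floresta?", "attribute": "character"},
    # Placeholder record, invented for illustration only (not from the dataset).
    {"question": "<pergunta de exemplo>", "attribute": "setting"},
]

grouped = group_by_attribute(records)
print(sorted(grouped))
```

The same function should work on any iterable of records that carry `question` and `attribute` keys, such as a loaded split.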
### Dataset Structure
- `story_name`*: a string with the name of the story to which the story section belongs.
- `story_section`: a string with the content of the story section(s) related to the expert-labeled QA pair. Used as input for both the Question Generation and Question Answering tasks.
- `question`: a string with the question. Used as input for the Question Answering task and as output for the Question Generation task.
- `answer`: a string with the answer, available for all splits. Used as input for the Question Generation task and as output for the Question Answering task.
- `local_or_sum`*: a string, either local or summary, indicating whether the QA pair relates to one story section or to multiple sections.
- `attribute`*: a string, one of character, causal relationship, action, setting, feeling, prediction, or outcome resolution. This is the classification of the QA pair by education expert annotators into one of 7 narrative elements from an established framework.
- `ex_or_im1`*: a string, either explicit or implicit, indicating whether or not the answer can be found directly in the story content.
- `ex_or_im2`*: similar to `ex_or_im1`, but annotated by another annotator (only available for test/val splits).
(*) Field has not been translated. Use it at your own convenience.
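The input/output roles described above can be sketched as two small helpers that turn one record into a (source, target) pair for each task. This is only an illustration: the function names and the `question: ... context: ...` prompt templates are assumptions, not the format used by the dataset authors.

```python
def to_qa_pair(record):
    """Question Answering: question + story section as input, answer as output."""
    # Hypothetical prompt template, chosen only for illustration.
    source = f"question: {record['question']} context: {record['story_section']}"
    return source, record["answer"]


def to_qg_pair(record):
    """Question Generation: answer + story section as input, question as output."""
    # Hypothetical prompt template, chosen only for illustration.
    source = f"answer: {record['answer']} context: {record['story_section']}"
    return source, record["question"]


# Fields copied from the example record shown earlier in this card.
record = {
    "story_section": "Certa vez, uma jovem que vivia sozinha na floresta...",
    "question": "Quem a mulher viu deslizando pela floresta?",
    "answer": "Um homem jovem e bonito.",
}

qa_source, qa_target = to_qa_pair(record)
qg_source, qg_target = to_qg_pair(record)
print(qa_target)
print(qg_target)
```

Keeping the two directions as separate functions makes it easy to build either task from the same records without duplicating the data.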
### Data Splits
The split sizes are as follows:
| | Train | Validation | Test |
| ----- | ----- | ----- | ----- |
| # Books | 232 | 23 | 23 |
| # QA-Pairs | 8548 | 1025 | 1007 |
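As a quick sanity check, the split sizes in the table can be summed and compared with the totals given in the dataset summary (278 stories and 10,580 questions). The snippet below is a small sketch that hard-codes the numbers from the table; the variable names are arbitrary.

```python
# Split sizes copied from the table above.
splits = {
    "train": {"books": 232, "qa_pairs": 8548},
    "validation": {"books": 23, "qa_pairs": 1025},
    "test": {"books": 23, "qa_pairs": 1007},
}

total_books = sum(s["books"] for s in splits.values())
total_qa_pairs = sum(s["qa_pairs"] for s in splits.values())

# Fraction of all QA pairs that falls into each split.
qa_share = {name: s["qa_pairs"] / total_qa_pairs for name, s in splits.items()}

print(total_books, total_qa_pairs)  # 278 10580
for name, share in qa_share.items():
    print(f"{name}: {share:.1%}")
```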
## Additional Information
### Licensing Information
This dataset version is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0), the same license as the original dataset.
### Citation Information
Our paper (preprint - accepted for publication at ECTEL 2024):
```
@article{leite_fairytaleqa_translated_2024,
title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages},
author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
year={2024},
eprint={2406.04233},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Original FairytaleQA paper:
```
@inproceedings{xu-etal-2022-fantastic,
title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
author = "Xu, Ying and
Wang, Dakuo and
Yu, Mo and
Ritchie, Daniel and
Yao, Bingsheng and
Wu, Tongshuang and
Zhang, Zheng and
Li, Toby and
Bradford, Nora and
Sun, Branda and
Hoang, Tran and
Sang, Yisi and
Hou, Yufang and
Ma, Xiaojuan and
Yang, Diyi and
Peng, Nanyun and
Yu, Zhou and
Warschauer, Mark",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.34",
doi = "10.18653/v1/2022.acl-long.34",
pages = "447--460",
abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}
```
### Contact
Bernardo Leite (bernardo.leite@fe.up.pt)
Provider:
benjleite



