benjleite/FairytaleQA-translated-ptBR

Name: benjleite/FairytaleQA-translated-ptBR
Creator: benjleite
Published: 2024-06-11 17:55:50
License: Apache-2.0

Source: Hugging Face. Updated: 2024-06-11. Indexed: 2024-06-29.

Download link:

https://hf-mirror.com/datasets/benjleite/FairytaleQA-translated-ptBR

Description:

---
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- pt
tags:
- question-answering
- question-generation
- education
- children education
size_categories:
- 10K<n<100K
---

# Dataset Card for FairytaleQA-translated-ptBR

## Dataset Description

- **Homepage:**
- **Repository:** https://github.com/bernardoleite/fairytaleqa-translated
- **Paper:** https://arxiv.org/abs/2406.04233v1
- **Leaderboard:** https://paperswithcode.com/sota/question-generation-on-fairytaleqa
- **Point of Contact:** Bernardo Leite (benjleite.com)

### Dataset Summary

This repository contains the **Brazilian Portuguese (pt-BR)** machine-translated version of the original English FairytaleQA dataset (https://huggingface.co/datasets/WorkInTheDark/FairytaleQA). FairytaleQA is an open-source dataset designed to enhance narrative comprehension, aimed at students from kindergarten to eighth grade. The dataset was annotated by education experts following an evidence-based theoretical framework. It comprises 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.

The translation was performed using DeepL as part of our research: **FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages**.

You can load the dataset via:

```
import datasets

data = datasets.load_dataset('benjleite/FairytaleQA-translated-ptBR')
```

### Supported Tasks and Leaderboards

Question-Answering, Question-Generation, Question-Answer Pair Generation

### Languages

Brazilian Portuguese (pt-BR)

### Example

An example of "train" looks as follows:

```
{
    'story_name': 'the-toad-woman-story',
    'story_section': 'Certa vez, uma jovem que vivia sozinha na floresta...',
    'question': 'Quem a mulher viu deslizando pela floresta?',
    'answer': 'Um homem jovem e bonito.',
    'local-or-sum': 'local',
    'attribute': 'character',
    'ex-or-im': 'explicit',
    'ex-or-im2': '',
}
```

### Dataset Structure

- `story_name`*: a string with the name of the story to which the story section belongs.
- `story_section`: a string with the content of the story section(s) related to the experts' labeled QA pair. Used as input for both the Question Generation and Question Answering tasks.
- `question`: a string with the question. Used as input for the Question Answering task and as output for the Question Generation task.
- `answer`: a string with the answer, available for all splits. Used as input for the Question Generation task and as output for the Question Answering task.
- `local_or_sum`*: a string, either local or summary, indicating whether the QA pair relates to one story section or to multiple sections.
- `attribute`*: a string, one of character, causal relationship, action, setting, feeling, prediction, or outcome resolution. This is the classification of the QA pair by education expert annotators into 7 narrative elements from an established framework.
- `ex_or_im1`*: a string, either explicit or implicit, indicating whether the answer can be found directly in the story content or not.
- `ex_or_im2`*: similar to `ex_or_im1`, but annotated by another annotator (only available for the test/val splits).

(*) Field has not been translated. Use it at your own convenience.

### Data Splits

The split sizes are as follows:

| | Train | Validation | Test |
| ----- | ----- | ----- | ----- |
| # Books | 232 | 23 | 23 |
| # QA-Pairs | 8548 | 1025 | 1007 |

## Additional Information

### Licensing Information

This dataset version is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0) (as the original dataset).

### Citation Information

Our paper (preprint, accepted for publication at ECTEL 2024):

```
@article{leite_fairytaleqa_translated_2024,
  title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages},
  author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
  year={2024},
  eprint={2406.04233},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Original FairytaleQA paper:

```
@inproceedings{xu-etal-2022-fantastic,
  title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
  author = "Xu, Ying and Wang, Dakuo and Yu, Mo and Ritchie, Daniel and Yao, Bingsheng and Wu, Tongshuang and Zhang, Zheng and Li, Toby and Bradford, Nora and Sun, Branda and Hoang, Tran and Sang, Yisi and Hou, Yufang and Ma, Xiaojuan and Yang, Diyi and Peng, Nanyun and Yu, Zhou and Warschauer, Mark",
  editor = "Muresan, Smaranda and Nakov, Preslav and Villavicencio, Aline",
  booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = may,
  year = "2022",
  address = "Dublin, Ireland",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.acl-long.34",
  doi = "10.18653/v1/2022.acl-long.34",
  pages = "447--460",
  abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}
```

### Contact

Bernardo Leite (bernardo.leite@fe.up.pt)

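Provider:

benjleite

Summary of original information

Dataset overview

Dataset description

Dataset summary

Name: FairytaleQA-translated-ptBR
Language: Brazilian Portuguese (pt-BR)
Source: Machine-translated from the original English FairytaleQA dataset.
Goal: Enhance narrative comprehension, aimed mainly at students from kindergarten to eighth grade.
Content: 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.
Translation: Performed with DeepL as part of the research project "FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages".

Supported tasks and leaderboards

Tasks: question answering, question generation, question-answer pair generation

Languages

Language: Brazilian Portuguese (pt-BR)

Example

    {
      "story_name": "the-toad-woman-story",
      "story_section": "Certa vez, uma jovem que vivia sozinha na floresta...",
      "question": "Quem a mulher viu deslizando pela floresta?",
      "answer": "Um homem jovem e bonito.",
      "local-or-sum": "local",
      "attribute": "character",
      "ex-or-im": "explicit",
      "ex-or-im2": ""
    }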
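A small sanity check can make the record layout in the example concrete. The sketch below is illustrative only: `find_problems` and `ALLOWED_VALUES` are hypothetical names, the key spellings are copied from the example record (hyphenated), and the allowed values are taken from the dataset card's field descriptions. The card's field list spells some keys with underscores, so the spelling should be confirmed against the loaded dataset before relying on this.

```python
from typing import Dict, List

# The seven narrative elements listed in the dataset card.
ATTRIBUTES = {
    "character", "causal relationship", "action", "setting",
    "feeling", "prediction", "outcome resolution",
}

# Assumption: key names follow the card's example record (hyphenated).
ALLOWED_VALUES = {
    "local-or-sum": {"local", "summary"},
    "attribute": ATTRIBUTES,
    "ex-or-im": {"explicit", "implicit"},
}

TEXT_FIELDS = ("story_name", "story_section", "question", "answer")


def find_problems(record: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Text fields must be present and non-empty.
    for key in TEXT_FIELDS:
        if not record.get(key):
            problems.append(f"missing or empty field: {key}")
    # Label fields must hold one of the values described in the card.
    for key, allowed in ALLOWED_VALUES.items():
        value = record.get(key)
        if value not in allowed:
            problems.append(f"unexpected value for {key}: {value!r}")
    return problems


# The example record from the dataset card (story text is truncated there).
example = {
    "story_name": "the-toad-woman-story",
    "story_section": "Certa vez, uma jovem que vivia sozinha na floresta...",
    "question": "Quem a mulher viu deslizando pela floresta?",
    "answer": "Um homem jovem e bonito.",
    "local-or-sum": "local",
    "attribute": "character",
    "ex-or-im": "explicit",
    "ex-or-im2": "",
}

print(find_problems(example))  # prints []
```

The second annotator's label (`ex-or-im2`) is deliberately not checked, because the card states it is only available for the test and validation splits.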

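Dataset structure

story_name: name of the story
story_section: content of the story section
question: content of the question
answer: content of the answer
local_or_sum: indicates whether the QA pair relates to one story section or to multiple sections
attribute: one of seven narrative elements, labeled by education experts
ex_or_im1: indicates whether the answer can be found directly in the story content
ex_or_im2: similar to ex_or_im1, but labeled by another annotator (only available for the test/validation splits)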
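The dataset card describes which fields serve as input and output for each task: the story section is always an input, while the question and answer swap roles between Question Answering and Question Generation. A minimal sketch of that mapping follows. The function names and the `question:` / `answer:` / `context:` prompt labels are hypothetical choices for illustration, not a format prescribed by the dataset or the paper.

```python
from typing import Dict, Tuple


def to_qa_example(record: Dict[str, str]) -> Tuple[str, str]:
    """Question Answering: question + story section in, answer out."""
    source = f"question: {record['question']}\ncontext: {record['story_section']}"
    return source, record["answer"]


def to_qg_example(record: Dict[str, str]) -> Tuple[str, str]:
    """Question Generation: answer + story section in, question out."""
    source = f"answer: {record['answer']}\ncontext: {record['story_section']}"
    return source, record["question"]


# Sample record taken from the dataset card's example (story text truncated there).
record = {
    "story_section": "Certa vez, uma jovem que vivia sozinha na floresta...",
    "question": "Quem a mulher viu deslizando pela floresta?",
    "answer": "Um homem jovem e bonito.",
}

qa_source, qa_target = to_qa_example(record)
qg_source, qg_target = to_qg_example(record)
print(qa_target)  # prints: Um homem jovem e bonito.
print(qg_target)  # prints: Quem a mulher viu deslizando pela floresta?
```

Returning plain `(source, target)` string pairs keeps the sketch independent of any particular model or tokenizer.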

Data splits

	Train	Validation	Test
# Books	232	23	23
# QA-Pairs	8548	1025	1007
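As a quick consistency check, the split sizes in the table can be summed and compared with the totals quoted in the summary (278 stories and 10,580 questions). The snippet below only does that arithmetic on the numbers from the table; the `splits` dictionary layout is an illustrative choice.

```python
# Split sizes copied from the table above.
splits = {
    "train": {"books": 232, "qa_pairs": 8548},
    "validation": {"books": 23, "qa_pairs": 1025},
    "test": {"books": 23, "qa_pairs": 1007},
}

total_books = sum(s["books"] for s in splits.values())
total_qa_pairs = sum(s["qa_pairs"] for s in splits.values())

print(total_books)     # prints 278
print(total_qa_pairs)  # prints 10580

# Share of QA pairs that falls into each split.
for name, sizes in splits.items():
    share = sizes["qa_pairs"] / total_qa_pairs
    print(f"{name}: {share:.1%} of QA pairs")
```

The totals match the figures in the summary, so the three splits together cover the full dataset.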

Additional information

Licensing information

License: Apache-2.0 License

Citation information

    @article{leite_fairytaleqa_translated_2024,
      title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages},
      author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
      year={2024},
      eprint={2406.04233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

Contact

Contact: Bernardo Leite (bernardo.leite@fe.up.pt)
