five

SkelterLabsInc/JaQuAD

收藏
Hugging Face2022-10-25 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/SkelterLabsInc/JaQuAD
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - crowdsourced language_creators: - crowdsourced - found language: - ja license: - cc-by-sa-3.0 multilinguality: - monolingual paperswithcode_id: null pretty_name: "JaQuAD: Japanese Question Answering Dataset" size_categories: - 10K<n<100K source_datasets: - original task_categories: - question-answering task_ids: - extractive-qa --- # Dataset Card for JaQuAD ## Table of Contents - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splitting](#data-splitting) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Acknowledgements](#acknowledgements) ## Dataset Description - **Repository:** https://github.com/SkelterLabsInc/JaQuAD - **Paper:** [JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension]() - **Point of Contact:** [jaquad@skelterlabs.com](jaquad@skelterlabs.com) - **Size of dataset files:** 24.6 MB - **Size of the generated dataset:** 48.6 MB - **Total amount of disk used:** 73.2 MB ### Dataset Summary Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset created for Japanese Machine Reading Comprehension. JaQuAD is developed to provide a SQuAD-like QA dataset in Japanese. JaQuAD contains 39,696 question-answer pairs. Questions and answers are manually curated by human annotators. Contexts are collected from Japanese Wikipedia articles. Fine-tuning [BERT-Japanese](https://huggingface.co/cl-tohoku/bert-base-japanese) on JaQuAD achieves 78.92% for an F1 score and 63.38% for an exact match. ### Supported Tasks - `extractive-qa`: This dataset is intended to be used for `extractive-qa`. ### Languages Japanese (`ja`) ## Dataset Structure ### Data Instances - **Size of dataset files:** 24.6 MB - **Size of the generated dataset:** 48.6 MB - **Total amount of disk used:** 73.2 MB An example of 'validation': ```python { "id": "de-001-00-000", "title": "イタセンパラ", "context": "イタセンパラ(板鮮腹、Acheilognathuslongipinnis)は、コイ科のタナゴ亜科タナゴ属に分類される淡水>魚の一種。\n別名はビワタナゴ(琵琶鱮、琵琶鰱)。", "question": "ビワタナゴの正式名称は何?", "question_type": "Multiple sentence reasoning", "answers": { "text": "イタセンパラ", "answer_start": 0, "answer_type": "Object", }, }, ``` ### Data Fields - `id`: a `string` feature. - `title`: a `string` feature. - `context`: a `string` feature. - `question`: a `string` feature. - `question_type`: a `string` feature. - `answers`: a dictionary feature containing: - `text`: a `string` feature. - `answer_start`: a `int32` feature. - `answer_type`: a `string` feature. ### Data Splitting JaQuAD consists of three sets, `train`, `validation`, and `test`. They were created from disjoint sets of Wikipedia articles. The `test` set is not publicly released yet. The following table shows statistics for each set. Set | Number of Articles | Number of Contexts | Number of Questions --------------|--------------------|--------------------|-------------------- Train | 691 | 9713 | 31748 Validation | 101 | 1431 | 3939 Test | 109 | 1479 | 4009 ## Dataset Creation ### Curation Rationale The JaQuAD dataset was created by [Skelter Labs](https://skelterlabs.com/) to provide a SQuAD-like QA dataset in Japanese. Questions are original and based on Japanese Wikipedia articles. ### Source Data The articles used for the contexts are from [Japanese Wikipedia](https://ja.wikipedia.org/). 88.7% of articles are from the curated list of Japanese high-quality Wikipedia articles, e.g., [featured articles](https://ja.wikipedia.org/wiki/Wikipedia:%E8%89%AF%E8%B3%AA%E3%81%AA%E8%A8%98%E4%BA%8B) and [good articles](https://ja.wikipedia.org/wiki/Wikipedia:%E7%A7%80%E9%80%B8%E3%81%AA%E8%A8%98%E4%BA%8B). ### Annotations Wikipedia articles were scrapped and divided into one more multiple paragraphs as contexts. Annotations (questions and answer spans) are written by fluent Japanese speakers, including natives and non-natives. Annotators were given a context and asked to generate non-trivial questions about information in the context. ### Personal and Sensitive Information No personal or sensitive information is included in this dataset. Dataset annotators has been manually verified it. ## Considerations for Using the Data Users should consider that the articles are sampled from Wikipedia articles but not representative of all Wikipedia articles. ### Social Impact of Dataset The social biases of this dataset have not yet been investigated. ### Discussion of Biases The social biases of this dataset have not yet been investigated. Articles and questions have been selected for quality and diversity. ### Other Known Limitations The JaQuAD dataset has limitations as follows: - Most of them are short answers. - Assume that a question is answerable using the corresponding context. This dataset is incomplete yet. If you find any errors in JaQuAD, please contact us. ## Additional Information ### Dataset Curators Skelter Labs: [https://skelterlabs.com/](https://skelterlabs.com/) ### Licensing Information The JaQuAD dataset is licensed under the [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license. ### Citation Information ```bibtex @misc{so2022jaquad, title={{JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension}}, author={ByungHoon So and Kyuhong Byun and Kyungwon Kang and Seongjin Cho}, year={2022}, eprint={2202.01764}, archivePrefix={arXiv}, primaryClass={cs.CL} } ``` ### Acknowledgements This work was supported by [TPU Research Cloud (TRC) program](https://sites.research.google/trc/). For training models, we used cloud TPUs provided by TRC. We also thank annotators who generated JaQuAD.
提供机构:
SkelterLabsInc
原始信息汇总

数据集概述

数据集名称

  • 名称: JaQuAD: Japanese Question Answering Dataset
  • 别名: 日本語質問応答データセット

数据集基本信息

  • 语言: 日语 (ja)
  • 许可证: CC BY-SA 3.0
  • 多语言性: 单语种
  • 数据集大小: 10K<n<100K
  • 源数据: 原始数据
  • 任务类别: 问答
  • 任务ID: 抽取式问答 (extractive-qa)

数据集内容

  • 数据集摘要: JaQuAD是一个人工标注的日语机器阅读理解数据集,包含39,696个问题-答案对。问题和答案由人工标注者手动创建,上下文来自日语维基百科文章。
  • 支持的任务: 主要用于抽取式问答任务。
  • 数据结构:
    • 数据实例: 包含ID、标题、上下文、问题、问题类型和答案等字段。
    • 数据字段: 包括字符串类型的ID、标题、上下文、问题和问题类型,以及答案字段中的字符串、整数和字符串类型。
    • 数据分割: 数据集分为训练集、验证集和测试集,分别来自不同的维基百科文章。

数据集创建

  • 创建理由: 由Skelter Labs创建,旨在提供一个类似于SQuAD的日语QA数据集。
  • 源数据: 上下文来自日语维基百科文章,其中88.7%来自高质量的维基百科文章。
  • 标注: 由流利的日语使用者进行标注,包括母语和非母语者。
  • 个人信息和敏感信息: 数据集中不包含个人信息或敏感信息。

使用数据集的考虑

  • 社会影响: 数据集的社会偏见尚未被调查。
  • 偏见讨论: 数据集的偏见尚未被调查,文章和问题已根据质量和多样性进行选择。
  • 其他已知限制: 数据集主要包含简短答案,且假设每个问题都可以在相应的上下文中找到答案。

附加信息

  • 数据集创建者: Skelter Labs
  • 许可证信息: 数据集根据CC BY-SA 3.0许可证发布。
  • 引用信息: 引用时请使用提供的BibTeX格式。
  • 致谢: 感谢TPU Research Cloud (TRC)项目的支持及参与标注的标注者。
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作