AmazonScience/tydi-as2

Name: AmazonScience/tydi-as2
Creator: AmazonScience
Published: 2023-07-24 17:33:28
License: No description available

Source: Hugging Face. Updated 2023-07-24. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/AmazonScience/tydi-as2


Resource description:

---
annotations_creators:
- machine-generated
language:
- bn
- en
- fi
- id
- ja
- ko
- ru
- sw
language_creators:
- found
license_details: https://huggingface.co/datasets/AmazonScience/tydi-as2/blob/main/LICENSE.md
multilinguality:
- multilingual
- translation
pretty_name: tydi-as2
size_categories:
- 10M<n<100M
source_datasets:
- extended|tydiqa
tags:
- as2
- answer sentence selection
- text retrieval
- question answering
task_categories:
- question-answering
- text-retrieval
task_ids:
- open-domain-qa
license: cdla-permissive-2.0
---

# TyDi-AS2

## Table of Contents

- [Dataset Card Creation Guide](#dataset-card-creation-guide)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Languages](#languages)
      - [TyDi-AS2](#tydi-as2)
      - [Xtr-TyDi-AS2](#xtr-tydi-as2)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Source Data](#source-data)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Amazon Science](https://www.amazon.science/publications/cross-lingual-knowledge-distillation-for-answer-sentence-selection-in-low-resource-languages)
- **Paper:** [Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages](https://aclanthology.org/2023.findings-acl.885/)
- **Point of Contact:** [Yoshitomo Matsubara](mailto:yomtsub@amazon.com)

### Dataset Summary

***TyDi-AS2*** and ***Xtr-TyDi-AS2*** are multilingual Answer Sentence Selection (AS2) datasets comprising 8 diverse languages, proposed in our paper accepted at
ACL 2023 (Findings): [**Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages**](https://aclanthology.org/2023.findings-acl.885/). Both datasets were created from [TyDi-QA](https://ai.google.com/research/tydiqa), a multilingual question-answering dataset. TyDi-AS2 was created by converting the QA instances in TyDi-QA to AS2 instances (see [Dataset Creation](#dataset-creation) for details). Xtr-TyDi-AS2 was created by translating the non-English TyDi-AS2 instances to English and vice versa. For translations, we used [Amazon Translate](https://aws.amazon.com/translate/).

### Languages

#### TyDi-AS2 (original)

- `bn`: Bengali
- `en`: English
- `fi`: Finnish
- `id`: Indonesian
- `ja`: Japanese
- `ko`: Korean
- `ru`: Russian
- `sw`: Swahili

File location: [`jsonl/original/`](https://huggingface.co/datasets/AmazonScience/tydi-as2/tree/main/jsonl/original/)

For non-English sets, we also have English-translated samples used for the cross-lingual knowledge distillation (CLKD) experiments in our paper.

File location: [`jsonl/x-to-en/`](https://huggingface.co/datasets/AmazonScience/tydi-as2/tree/main/jsonl/x-to-en/)

#### Xtr-TyDi-AS2 (translationese)

The Xtr-TyDi-AS2 (X-translated TyDi-AS2) dataset consists of non-English AS2 instances translated from the English set of TyDi-AS2.

- `bn`: Bengali
- `fi`: Finnish
- `id`: Indonesian
- `ja`: Japanese
- `ko`: Korean
- `ru`: Russian
- `sw`: Swahili

File location: [`jsonl/en-to-x/`](https://huggingface.co/datasets/AmazonScience/tydi-as2/tree/main/jsonl/en-to-x/)

## Dataset Structure

### Data Instances

This is an example instance from the English training split of the TyDi-AS2 dataset.
```
{
  "Question": "When was the Argentine Basketball Federation formed?",
  "Title": "History of the Argentina national basketball team",
  "Sentence": "The Argentina national basketball team represents Argentina in basketball international competitions, and is controlled by the Argentine Basketball Federation.",
  "Label": 0
}
```

For the English-translated TyDi-AS2 dataset and the Xtr-TyDi-AS2 dataset, the translated instances in the JSONL files are listed in the same order as the original (native) instances in the original TyDi-AS2 dataset.

For example, the 2nd instance in [`jsonl/x-to-en/en_from_bn-train.jsonl`](jsonl/x-to-en/en_from_bn-train.jsonl) (English-translated from Bengali) corresponds to the 2nd instance in [`jsonl/original/bn-train.jsonl`](jsonl/original/bn-train.jsonl) (Bengali).

Similarly, the 2nd instance in [`jsonl/en-to-x/bn_from_en-train.jsonl`](jsonl/en-to-x/bn_from_en-train.jsonl) (Bengali-translated from English) corresponds to the 2nd instance in [`jsonl/original/en-train.jsonl`](jsonl/original/en-train.jsonl) (English).
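
Because the correspondence between original and translated files is purely positional, pairing them amounts to reading both files line by line and zipping the results. The following is a minimal sketch, assuming the JSONL files have already been downloaded locally with the repository's directory layout; `read_jsonl` and `pair_by_position` are hypothetical helper names introduced here for illustration and are not part of the dataset release.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def read_jsonl(path: Path) -> List[Dict]:
    """Read one JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def pair_by_position(
    original: List[Dict], translated: List[Dict]
) -> List[Tuple[Dict, Dict]]:
    """Pair the i-th original instance with the i-th translated instance.

    The alignment relies only on file order, as described in the card,
    so a length mismatch is treated as an error rather than truncated.
    """
    if len(original) != len(translated):
        raise ValueError(
            f"length mismatch: {len(original)} original vs "
            f"{len(translated)} translated instances"
        )
    return list(zip(original, translated))


# Example usage (paths assume a local copy of the repository):
#   bn = read_jsonl(Path("jsonl/original/bn-train.jsonl"))
#   en_from_bn = read_jsonl(Path("jsonl/x-to-en/en_from_bn-train.jsonl"))
#   pairs = pair_by_position(bn, en_from_bn)
#   native, english = pairs[1]  # the 2nd instance in both files
```

Raising on a length mismatch, instead of letting `zip` silently truncate, makes an incomplete download visible early. Comparing the `Label` values of each pair may serve as an additional sanity check, on the assumption (not stated in the card) that labels are carried over unchanged by translation.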

### Data Fields

Each instance (a QA pair) consists of the following fields:

- `Question`: Question to be answered (str)
- `Title`: Document title (str)
- `Sentence`: Answer sentence in the document (str)
- `Label`: Label indicating whether the answer sentence correctly answers the question (int, 1: correct, 0: incorrect)

### Data Splits

|                     |           | **#Questions** |          |   |           | **#Sentences** |          |
|---------------------|----------:|---------------:|---------:|---|----------:|---------------:|---------:|
|                     | **train** |        **dev** | **test** |   | **train** |        **dev** | **test** |
| **Bengali (bn)**    |     7,978 |          2,056 |      316 |   | 1,376,432 |        351,186 |   37,465 |
| **English (en)**    |     6,730 |          1,686 |      918 |   | 1,643,702 |        420,899 |  249,513 |
| **Finnish (fi)**    |    10,859 |          2,731 |    1,870 |   | 1,567,695 |        408,205 |  298,093 |
| **Indonesian (id)** |     9,310 |          2,339 |    1,355 |   |   960,270 |        236,076 |   97,057 |
| **Japanese (ja)**   |    11,848 |          2,981 |    1,504 |   | 3,183,037 |        822,654 |  444,106 |
| **Korean (ko)**     |     7,354 |          1,943 |    1,389 |   | 1,558,191 |        392,361 |  199,043 |
| **Russian (ru)**    |     9,187 |          2,294 |    1,395 |   | 3,190,650 |        820,668 |  367,595 |
| **Swahili (sw)**    |     8,350 |          2,850 |    1,896 |   | 1,048,303 |        269,894 |   74,775 |

See [our paper](#citation-information) for more details about the statistics of the datasets.

## Dataset Creation

### Source Data

The source of the TyDi-AS2 dataset is [TyDi QA](https://ai.google.com/research/tydiqa), which is a question answering dataset.

### Annotations

#### Annotation process

TyDi QA is a QA dataset spanning questions from 11 typologically diverse languages. Each instance comprises a human-generated question, a single Wikipedia document as context, and one or more spans from the document containing the answer. To convert each instance into AS2 instances, we split the context document into sentences and heuristically identify the correct answer sentences using the annotated answer spans.
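
The card does not spell out the labeling heuristic. As one plausible reading, offered only as an illustration and not as the authors' actual method, a sentence could be labeled correct when it overlaps any annotated answer span. The sketch below assumes answer spans are given as half-open character offsets into the document (an assumption made for illustration; check the TyDi QA release for its actual offset convention) and that the sentences come from an external tokenizer. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets (assumed)


def locate_sentences(document: str, sentences: List[str]) -> List[Span]:
    """Find each sentence's offsets in the document, scanning left to right."""
    spans: List[Span] = []
    cursor = 0
    for sentence in sentences:
        start = document.find(sentence, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in document: {sentence!r}")
        end = start + len(sentence)
        spans.append((start, end))
        cursor = end
    return spans


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def to_as2_instances(
    question: str,
    title: str,
    document: str,
    sentences: List[str],
    answer_spans: List[Span],
) -> List[Dict]:
    """Emit one AS2 instance per sentence.

    Label is 1 if the sentence overlaps any answer span, else 0.
    This overlap rule is an assumed stand-in for the unspecified heuristic.
    """
    sentence_spans = locate_sentences(document, sentences)
    return [
        {
            "Question": question,
            "Title": title,
            "Sentence": sentence,
            "Label": int(any(overlaps(span, ans) for ans in answer_spans)),
        }
        for sentence, span in zip(sentences, sentence_spans)
    ]
```

The output dictionaries use the same four field names as the released instances, so a sketch like this can be compared directly against rows read from the JSONL files.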

To split documents, we use multiple different sentence tokenizers for the diverse languages and omit languages for which we could not find a suitable sentence tokenizer:

1. [bltk](https://github.com/saimoncse19/bltk) for Bengali
2. [blingfire](https://github.com/microsoft/BlingFire) for Swahili, Indonesian, and Korean
3. [pySBD](https://github.com/nipunsadvilkar/pySBD) for English and Russian
4. [nltk](https://www.nltk.org/) for Finnish
5. [Konoha](https://github.com/himkt/konoha) for Japanese

#### Who are the annotators?

[Shivanshu Gupta](https://huggingface.co/shivanshu) converted TyDi QA to TyDi-AS2.
[Yoshitomo Matsubara](https://huggingface.co/yoshitomo-matsubara) translated non-English samples to English and vice versa for the Xtr-TyDi-AS2 dataset.

Since sentence tokenization and identifying answer sentences can introduce errors, we conducted a manual validation of the AS2 datasets. For each language, we randomly selected 50 instances and verified the accuracy of the answer sentences through manual inspection. Our findings revealed that the answer sentences were accurate in 98% of the cases.

## Additional Information

### Dataset Curators

Shivanshu Gupta (@shivanshu)

### Licensing Information

[CDLA-Permissive-2.0](LICENSE.md)

### Citation Information

```bibtex
@inproceedings{gupta2023cross-lingual,
  title={{Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages}},
  author={Gupta, Shivanshu and Matsubara, Yoshitomo and Chadha, Ankit and Moschitti, Alessandro},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2023},
  pages={14078--14092},
  year={2023}
}
```

### Contributions

- [Shivanshu Gupta](https://huggingface.co/shivanshu)
- [Yoshitomo Matsubara](https://huggingface.co/yoshitomo-matsubara)
- Ankit Chadha
- Alessandro Moschitti

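Provider:

AmazonScience

Summary of original information

Dataset overview

Dataset names

TyDi-AS2
Xtr-TyDi-AS2

Dataset type

Multilingual Answer Sentence Selection (AS2) datasets

Languages

TyDi-AS2 covers the following languages: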
- bn: Bengali
- en: English
- fi: Finnish
- id: Indonesian
- ja: Japanese
- ko: Korean
- ru: Russian
- sw: Swahili
Xtr-TyDi-AS2 covers the following languages:
- bn: Bengali
- fi: Finnish
- id: Indonesian
- ja: Japanese
- ko: Korean
- ru: Russian
- sw: Swahili

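Dataset size

10M<n<100M

Dataset source

Extended from TyDiQA

Task categories

Question answering
Text retrieval

License

CDLA-Permissive-2.0

Dataset structure

Data instances: each instance contains the following fields:
- Question: the question to be answered (string)
- Title: the document title (string)
- Sentence: an answer sentence from the document (string)
- Label: indicates whether the answer sentence correctly answers the question (integer, 1: correct, 0: incorrect)
Data splits: the dataset is divided into training, development, and test sets; see the table in the original card for the exact counts.

Dataset creation

Source data: derived from TyDi QA
Annotation process: documents were split with language-specific sentence tokenizers, and the accuracy of the answer sentences was verified manually.
Annotators:
- Shivanshu Gupta converted TyDi QA to TyDi-AS2.
- Yoshitomo Matsubara translated the non-English samples to English and vice versa.

Additional information

Dataset curator: Shivanshu Gupta
Licensing information: CDLA-Permissive-2.0
Citation information: see the citation entry in the original card.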
