parinzee/claq-qa-thai-dataset

Name: parinzee/claq-qa-thai-dataset
Creator: parinzee
Published: 2024-01-06 03:31:33
License: No description available

Source: Hugging Face. Updated 2024-01-06. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/parinzee/claq-qa-thai-dataset


Resource description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: answers
    dtype: string
  - name: source
    dtype: string
  - name: th_aug
    dtype: string
  - name: th_fasttext_aug
    dtype: string
  - name: th_llm_gec_aug
    dtype: string
  - name: th_llm_paraphrase_aug
    dtype: string
  - name: th_ltw2v_aug
    dtype: string
  - name: th_qcpg_0.2_aug
    dtype: string
  - name: th_qcpg_0.2_llm_gec_aug
    dtype: string
  - name: th_qcpg_0.5_aug
    dtype: string
  - name: th_qcpg_0.5_llm_gec_aug
    dtype: string
  - name: th_qcpg_0.8_aug
    dtype: string
  - name: th_qcpg_0.8_llm_gec_aug
    dtype: string
  - name: th_thai2fit_aug
    dtype: string
  - name: th_thai2trans_aug
    dtype: string
  - name: th_wordnet_aug
    dtype: string
  - name: en_aug
    dtype: string
  - name: en_llm_gec_aug
    dtype: string
  - name: en_llm_paraphrase_aug
    dtype: string
  - name: en_qcpg_0.2_aug
    dtype: string
  - name: en_qcpg_0.2_llm_gec_aug
    dtype: string
  - name: en_qcpg_0.5_aug
    dtype: string
  - name: en_qcpg_0.5_llm_gec_aug
    dtype: string
  - name: en_qcpg_0.8_aug
    dtype: string
  - name: en_qcpg_0.8_llm_gec_aug
    dtype: string
  - name: dis_aug
    dtype: float64
  - name: dis_fasttext_aug
    dtype: float64
  - name: dis_llm_gec_aug
    dtype: float64
  - name: dis_llm_paraphrase_aug
    dtype: float64
  - name: dis_ltw2v_aug
    dtype: float64
  - name: dis_qcpg_0.2_aug
    dtype: float64
  - name: dis_qcpg_0.2_llm_gec_aug
    dtype: float64
  - name: dis_qcpg_0.5_aug
    dtype: float64
  - name: dis_qcpg_0.5_llm_gec_aug
    dtype: float64
  - name: dis_qcpg_0.8_aug
    dtype: float64
  - name: dis_qcpg_0.8_llm_gec_aug
    dtype: float64
  - name: dis_thai2fit_aug
    dtype: float64
  - name: dis_thai2trans_aug
    dtype: float64
  - name: dis_wordnet_aug
    dtype: float64
  splits:
  - name: train
    num_bytes: 117313078
    num_examples: 16980
  download_size: 35147642
  dataset_size: 117313078
---

# Dataset Card for "Cross-Lingual Data Augmentation For Thai QA"

## Table of Contents

- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Acknowledgements](#acknowledgements)
- [Authors](#authors)
- [Additional Information](#additional-information)

## Dataset Description

### Abstract

This dataset accompanies the paper titled "Cross-Lingual Data Augmentation For Thai Question Answering" by Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, and Peerat Limkonchotiwat, to be presented at GenBench in EMNLP 2023. The paper introduces an innovative framework for data augmentation with quality control measures, aimed at enhancing the robustness of Thai QA models. This dataset is designed to improve model performance in low-resource language settings like Thai, by increasing linguistic diversity through monolingual and cross-lingual data augmentation techniques.

### Links

- ACL Link: [PDF](https://aclanthology.org/2023.genbench-1.16/)
- ResearchGate Link: [PDF](https://www.researchgate.net/publication/374977605_Cross-Lingual_Data_Augmentation_For_Thai_Question-Answering#fullTextFileContent)

## Dataset Structure

### Dataset Info

The dataset, available at [Hugging Face Datasets](https://huggingface.co/datasets/parinzee/claq-qa-thai-dataset), is structured with the following features:

- `id`: string
- `question`: string
- `context`: string
- `answers`: string
- `source`: string
- Augmentation columns for Thai (e.g., `th_aug`, `th_fasttext_aug`, `th_llm_gec_aug`, etc.)
- Augmentation columns for English (e.g., `en_aug`, `en_llm_gec_aug`, `en_llm_paraphrase_aug`, etc.)
- Semantic distance columns for various augmentations (e.g., `dis_aug`, `dis_fasttext_aug`, `dis_llm_gec_aug`, etc.)

### Splits (No Designated Train/Test Splits)

- Train:
  - Number of rows: **16,980**
  - Number of augmentation sets: **10**
  - Total Number of Examples = 16,980 * 11 = **186,780**
  - Size: 117,313,078 bytes

### Download Size

- 35,147,642 bytes

### Total Dataset Size

- 117,313,078 bytes

## Acknowledgements

![](https://raw.githubusercontent.com/ai-builders/.github/main/profile/logo-image.png)

## Authors

- Parinthapat Pengpun
- Can Udomcharoenchaikit
- Weerayut Buaphet
- Peerat Limkonchotiwat

## Additional Information

- The dataset is intended for research purposes, especially in the field of machine learning and natural language processing.
- This work is a significant contribution to enhancing the capabilities of QA models in Thai, a low-resource language, by addressing the challenges of limited and varied quality training data.
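The column naming in the schema above (`th_*_aug`, `en_*_aug`, `dis_*_aug`) lends itself to grouping by prefix. The following is a minimal sketch, not an official loader: `group_columns` and `load_train` are illustrative helper names, the column list is a subset copied from the schema, and the optional loader assumes the Hugging Face `datasets` package is installed and the Hub is reachable.

```python
from collections import defaultdict

# A subset of the column names listed in the dataset_info schema (for brevity).
COLUMNS = [
    "id", "question", "context", "answers", "source",
    "th_aug", "th_fasttext_aug", "th_llm_gec_aug",
    "en_aug", "en_llm_gec_aug",
    "dis_aug", "dis_fasttext_aug", "dis_llm_gec_aug",
]


def group_columns(columns):
    """Group column names by th/en/dis prefix; all other columns go to 'base'."""
    groups = defaultdict(list)
    for name in columns:
        prefix = name.split("_", 1)[0]
        is_aug = prefix in {"th", "en", "dis"} and name.endswith("_aug")
        groups[prefix if is_aug else "base"].append(name)
    return dict(groups)


def load_train():
    """Fetch the train split from the Hub (needs network access).

    Not called here; assumes `pip install datasets`.
    """
    from datasets import load_dataset

    return load_dataset("parinzee/claq-qa-thai-dataset", split="train")


groups = group_columns(COLUMNS)
```

With network access, `group_columns(load_train().column_names)` should apply the same grouping to the full schema.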

Provider:

parinzee

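Summary of Original Information

Dataset Card "Cross-Lingual Data Augmentation For Thai QA"

Dataset Description

Abstract

This dataset accompanies the paper "Cross-Lingual Data Augmentation For Thai Question Answering" by Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, and Peerat Limkonchotiwat, to be presented at GenBench in EMNLP 2023. The paper introduces a data augmentation framework with quality control measures, aimed at improving the robustness of Thai QA models. The dataset is designed to improve model performance in low-resource language settings such as Thai by increasing linguistic diversity through monolingual and cross-lingual data augmentation techniques.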

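Dataset Structure

Dataset Info

The dataset is available on Hugging Face Datasets and has the following features:

id: string
question: string
context: string
answers: string
source: string
Augmentation columns for Thai (e.g., th_aug, th_fasttext_aug, th_llm_gec_aug, etc.)
Augmentation columns for English (e.g., en_aug, en_llm_gec_aug, en_llm_paraphrase_aug, etc.)
Semantic distance columns for the various augmentations (e.g., dis_aug, dis_fasttext_aug, dis_llm_gec_aug, etc.)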
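One possible use of the distance columns is to keep only augmentations that stay close to the original question. The sketch below rests on assumptions the source does not state: that each `th_<suffix>` column pairs with the `dis_<suffix>` column of the same suffix, and that a smaller distance means a closer paraphrase. The helper name `keep_close_augmentations`, the threshold, and the toy row values are all made up for illustration.

```python
def keep_close_augmentations(row, max_distance):
    """Return Thai augmentation columns whose paired distance is <= max_distance.

    Assumption: `th_<suffix>` is scored by `dis_<suffix>` (same suffix).
    """
    prefix = "th_"
    kept = {}
    for column, text in row.items():
        if not (column.startswith(prefix) and column.endswith("_aug")):
            continue
        distance = row.get("dis_" + column[len(prefix):])
        # Skip empty augmentations and rows with no recorded distance.
        if text and distance is not None and distance <= max_distance:
            kept[column] = text
    return kept


# Toy row with invented values; real rows carry many more columns.
row = {
    "id": "demo-1",
    "question": "original question",
    "th_aug": "augmented question A",
    "th_fasttext_aug": "augmented question B",
    "th_wordnet_aug": None,
    "dis_aug": 0.12,
    "dis_fasttext_aug": 0.47,
    "dis_wordnet_aug": None,
}

kept = keep_close_augmentations(row, max_distance=0.3)
```

Because the function only needs a mapping from column name to value, it should also work on rows yielded when iterating a Hugging Face `Dataset`, which are plain dictionaries.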

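Splits (No Designated Train/Test Splits)

Train:
- Number of rows: 16,980
- Number of augmentation sets: 10
- Total number of examples = 16,980 * 11 = 186,780
- Size: 117,313,078 bytes

Download Size

35,147,642 bytes

Total Dataset Size

117,313,078 bytes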
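The figures above can be checked with a few lines of arithmetic. This is a sketch only: reading the factor 11 as "10 augmentation sets plus the original questions" is an assumption, since the card gives the multiplication without explaining it, and the variable names are illustrative.

```python
rows = 16_980
augmentation_sets = 10

# Assumption: the factor 11 is the 10 augmentation sets plus the original rows.
total_examples = rows * (augmentation_sets + 1)

dataset_bytes = 117_313_078   # size of the train split once loaded
download_bytes = 35_147_642   # size of the files fetched from the Hub

dataset_mib = dataset_bytes / 2**20
download_mib = download_bytes / 2**20
size_ratio = dataset_bytes / download_bytes

print(f"total examples: {total_examples:,}")
print(f"dataset: {dataset_mib:.1f} MiB, download: {download_mib:.1f} MiB")
print(f"dataset/download ratio: {size_ratio:.2f}")
```

The download is roughly a third of the loaded size, which is consistent with the files being stored in a compressed format.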
