parinzee/claq-qa-thai-dataset
Source: Hugging Face · Updated: 2024-01-06 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/parinzee/claq-qa-thai-dataset
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: answers
    dtype: string
  - name: source
    dtype: string
  - name: th_aug
    dtype: string
  - name: th_fasttext_aug
    dtype: string
  - name: th_llm_gec_aug
    dtype: string
  - name: th_llm_paraphrase_aug
    dtype: string
  - name: th_ltw2v_aug
    dtype: string
  - name: th_qcpg_0.2_aug
    dtype: string
  - name: th_qcpg_0.2_llm_gec_aug
    dtype: string
  - name: th_qcpg_0.5_aug
    dtype: string
  - name: th_qcpg_0.5_llm_gec_aug
    dtype: string
  - name: th_qcpg_0.8_aug
    dtype: string
  - name: th_qcpg_0.8_llm_gec_aug
    dtype: string
  - name: th_thai2fit_aug
    dtype: string
  - name: th_thai2trans_aug
    dtype: string
  - name: th_wordnet_aug
    dtype: string
  - name: en_aug
    dtype: string
  - name: en_llm_gec_aug
    dtype: string
  - name: en_llm_paraphrase_aug
    dtype: string
  - name: en_qcpg_0.2_aug
    dtype: string
  - name: en_qcpg_0.2_llm_gec_aug
    dtype: string
  - name: en_qcpg_0.5_aug
    dtype: string
  - name: en_qcpg_0.5_llm_gec_aug
    dtype: string
  - name: en_qcpg_0.8_aug
    dtype: string
  - name: en_qcpg_0.8_llm_gec_aug
    dtype: string
  - name: dis_aug
    dtype: float64
  - name: dis_fasttext_aug
    dtype: float64
  - name: dis_llm_gec_aug
    dtype: float64
  - name: dis_llm_paraphrase_aug
    dtype: float64
  - name: dis_ltw2v_aug
    dtype: float64
  - name: dis_qcpg_0.2_aug
    dtype: float64
  - name: dis_qcpg_0.2_llm_gec_aug
    dtype: float64
  - name: dis_qcpg_0.5_aug
    dtype: float64
  - name: dis_qcpg_0.5_llm_gec_aug
    dtype: float64
  - name: dis_qcpg_0.8_aug
    dtype: float64
  - name: dis_qcpg_0.8_llm_gec_aug
    dtype: float64
  - name: dis_thai2fit_aug
    dtype: float64
  - name: dis_thai2trans_aug
    dtype: float64
  - name: dis_wordnet_aug
    dtype: float64
  splits:
  - name: train
    num_bytes: 117313078
    num_examples: 16980
  download_size: 35147642
  dataset_size: 117313078
---
# Dataset Card for "Cross-Lingual Data Augmentation For Thai QA"
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Acknowledgements](#acknowledgements)
- [Authors](#authors)
- [Additional Information](#additional-information)
## Dataset Description
### Abstract
This dataset accompanies the paper "Cross-Lingual Data Augmentation For Thai Question Answering" by Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, and Peerat Limkonchotiwat, to be presented at the GenBench workshop at EMNLP 2023. The paper introduces a data augmentation framework with quality control measures, aimed at making Thai QA models more robust. The dataset is designed to improve model performance in low-resource language settings such as Thai by increasing linguistic diversity through monolingual and cross-lingual data augmentation.
### Links
- ACL Anthology Link: [Paper page](https://aclanthology.org/2023.genbench-1.16/)
- ResearchGate Link: [PDF](https://www.researchgate.net/publication/374977605_Cross-Lingual_Data_Augmentation_For_Thai_Question-Answering#fullTextFileContent)
## Dataset Structure
### Dataset Info
The dataset, available at [Hugging Face Datasets](https://huggingface.co/datasets/parinzee/claq-qa-thai-dataset), is structured with the following features:
- `id`: string
- `question`: string
- `context`: string
- `answers`: string
- `source`: string
- Augmentation columns for Thai (e.g., `th_aug`, `th_fasttext_aug`, `th_llm_gec_aug`, etc.)
- Augmentation columns for English (e.g., `en_aug`, `en_llm_gec_aug`, `en_llm_paraphrase_aug`, etc.)
- Semantic distance columns for various augmentations (e.g., `dis_aug`, `dis_fasttext_aug`, `dis_llm_gec_aug`, etc.)
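The naming convention above lends itself to grouping columns by prefix. The following is a minimal sketch, assuming that each `dis_*` column holds the semantic distance for the `th_*` column with the same suffix; the card does not state this pairing explicitly, it is inferred from the matching names. The helper names `group_columns` and `pair_with_distance` and the abbreviated column list are illustrative and not part of the dataset's own tooling.

```python
from collections import defaultdict

# A subset of the column names listed in the card's metadata.
COLUMNS = [
    "id", "question", "context", "answers", "source",
    "th_aug", "th_fasttext_aug", "th_llm_gec_aug",
    "en_aug", "en_llm_gec_aug", "en_llm_paraphrase_aug",
    "dis_aug", "dis_fasttext_aug", "dis_llm_gec_aug",
]

PREFIXES = ("th_", "en_", "dis_")


def group_columns(columns):
    """Group column names by prefix; unprefixed columns go under 'base'."""
    groups = defaultdict(list)
    for name in columns:
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix.rstrip("_")].append(name)
                break
        else:
            # No known prefix matched: treat it as an original QA field.
            groups["base"].append(name)
    return dict(groups)


def pair_with_distance(columns):
    """Map each th_* column to the dis_* column sharing its suffix.

    Assumption: matching suffixes mean the distance belongs to that
    augmentation. Columns without a counterpart are skipped.
    """
    available = set(columns)
    pairs = {}
    for name in columns:
        if name.startswith("th_"):
            candidate = "dis_" + name[len("th_"):]
            if candidate in available:
                pairs[name] = candidate
    return pairs


groups = group_columns(COLUMNS)
pairs = pair_with_distance(COLUMNS)
print(groups["base"])
print(pairs)
```

With the `datasets` library installed, the `column_names` attribute of the loaded `train` split could be passed in place of the hard-coded `COLUMNS` list.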
### Splits (No Designated Train/Test Splits)
- Train:
  - Number of rows: **16,980**
  - Number of augmentation sets: **10**
  - Total number of examples: 16,980 × 11 = **186,780** (the factor of 11 presumably counts the 10 augmentation sets plus the original questions)
  - Size: 117,313,078 bytes
### Download Size
- 35,147,642 bytes
### Total Dataset Size
- 117,313,078 bytes
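The figures in this section can be cross-checked with a few lines of arithmetic. This sketch assumes that the factor of 11 is the 10 augmentation sets plus the original questions, which the card implies but does not state. The variable names are illustrative.

```python
# Figures copied from the card.
ROWS = 16_980
AUGMENTATION_SETS = 10
DATASET_BYTES = 117_313_078
DOWNLOAD_BYTES = 35_147_642

# Assumption: each row contributes its original question plus one
# example per augmentation set.
total_examples = ROWS * (AUGMENTATION_SETS + 1)

# Ratio of the full dataset size to the download size.
compression_ratio = DATASET_BYTES / DOWNLOAD_BYTES

# Average footprint of one row across all of its columns.
avg_bytes_per_row = DATASET_BYTES / ROWS

print(f"total examples: {total_examples:,}")
print(f"dataset size / download size: {compression_ratio:.2f}")
print(f"average bytes per row: {avg_bytes_per_row:,.0f}")
```

The total matches the 186,780 reported by the card. The size ratio and per-row average are derived values and do not appear in the card itself.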
## Acknowledgements

## Authors
- Parinthapat Pengpun
- Can Udomcharoenchaikit
- Weerayut Buaphet
- Peerat Limkonchotiwat
## Additional Information
- The dataset is intended for research purposes, especially in the field of machine learning and natural language processing.
- The work aims to improve QA models for Thai, a low-resource language, by addressing the limited quantity and uneven quality of available training data.
Provider:
parinzee



