chompk/tydiqa-goldp-th
Source: Hugging Face | Updated: 2023-11-18 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/chompk/tydiqa-goldp-th
Resource description:
---
pretty_name: TyDiQA-GoldP-Th
language:
- th
task_categories:
- question-answering
task_ids:
- extractive-qa
configs:
- config_name: default
  data_files:
  - split: train
    path: tydiqa.goldp.th.train.json
  - split: dev
    path: tydiqa.goldp.th.dev.json
---
# TyDiQA-GoldP-Th
This dataset contains the Thai subset of TyDiQA, obtained from [Khalidalt's TyDiQA Dataset](https://huggingface.co/datasets/khalidalt/tydiqa-goldp).
This version applies the following additional preprocessing:
1. Convert byte-level answer indices into character-level indices
2. Fix any mismatch between the answer span and the actual text in the context
3. Re-split the train/development sets so that no context passage leaks across splits
4. Deduplicate questions that share the same context passage
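
Step 1 matters for Thai because each Thai character occupies three bytes in UTF-8, so a byte offset and a character offset diverge quickly. The following is a minimal sketch of such a conversion, not the author's actual preprocessing script; the function name `byte_to_char_index` and the sample string are hypothetical.

```python
def byte_to_char_index(text: str, byte_index: int) -> int:
    """Map a UTF-8 byte offset into `text` to the matching character offset."""
    encoded = text.encode("utf-8")
    if not 0 <= byte_index <= len(encoded):
        raise ValueError(f"byte offset {byte_index} is outside the text")
    # Decoding the prefix raises UnicodeDecodeError if the offset
    # falls in the middle of a multi-byte character.
    return len(encoded[:byte_index].decode("utf-8"))


# Illustrative string: seven Thai characters (3 bytes each), a space, then ASCII.
context = "ภาษาไทย abc"
byte_start = context.encode("utf-8").index(b"abc")
char_start = byte_to_char_index(context, byte_start)
print(byte_start, char_start)
```

Slicing `context[char_start:char_start + 3]` then returns `"abc"`, whereas slicing with the byte offset would run past the end of the string.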
## Dataset Format
The dataset is formatted to be compatible with the [XTREME benchmark](https://github.com/google-research/xtreme) format. Each file follows this pattern:
```json
{
  "version": "TyDiQA-GoldP-1.1-for-SQuAD-1.1",
  "data": [
    {
      "paragraphs": [{
        "context": [PASSAGE CONTEXT HERE],
        "qas": [{
          "answers": [{
            "answer_start": [CONTEXT START CHAR INDEX OF ANSWER],
            "text": [TEXT SPAN FROM CONTEXT]
          }],
          "question": [QUESTION],
          "id": [ID]
        }]
      }]
    },
    ...
  ]
}
```
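
As an illustration of how this nested layout can be consumed, the sketch below flattens a SQuAD-style dictionary into one record per answer and checks that each `answer_start` really points at `text` inside `context`. The helper names `iter_examples` and `span_matches` and the sample record are hypothetical, not part of the dataset or of XTREME.

```python
from typing import Iterator


def iter_examples(dataset: dict) -> Iterator[dict]:
    """Yield one flat record per answer in a SQuAD-style dictionary."""
    for article in dataset["data"]:
        # Standard SQuAD spells this key "paragraphs"; the fallback tolerates
        # the misspelling "paragrahs" in case a file uses it (an assumption).
        paragraphs = article.get("paragraphs") or article.get("paragrahs", [])
        for paragraph in paragraphs:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield {
                        "id": qa["id"],
                        "question": qa["question"],
                        "context": context,
                        "answer_text": answer["text"],
                        "answer_start": answer["answer_start"],
                    }


def span_matches(example: dict) -> bool:
    """Check that the character-level span reproduces the answer text."""
    start = example["answer_start"]
    text = example["answer_text"]
    return example["context"][start:start + len(text)] == text


# Made-up record for demonstration only; it is not taken from the dataset.
passage = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย"
sample = {
    "version": "TyDiQA-GoldP-1.1-for-SQuAD-1.1",
    "data": [{
        "paragraphs": [{
            "context": passage,
            "qas": [{
                "answers": [{
                    "answer_start": passage.index("ประเทศไทย"),
                    "text": "ประเทศไทย",
                }],
                "question": "กรุงเทพมหานครเป็นเมืองหลวงของประเทศใด",
                "id": "demo-0",
            }],
        }],
    }],
}

examples = list(iter_examples(sample))
print(len(examples), all(span_matches(e) for e in examples))
```

To run the same check on the real data, load a file such as `tydiqa.goldp.th.train.json` with `json.load` (opened with `encoding="utf-8"`) and pass the resulting dictionary to `iter_examples`.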
## Author
Chompakorn Chaksangchaichot
TyDiQA-GoldP-Th is the Thai subset extracted from Khalidalt's TyDiQA Dataset. This version adds preprocessing: it converts byte-level indices to character-level indices, fixes mismatches between answer spans and the actual text, re-splits the train/development sets to prevent context leakage, and deduplicates questions that share a context passage. The files follow the XTREME benchmark format, with a version string, paragraph contexts, questions, and answers given as a start index plus the answer text.
Provider:
chompk



