Gholamreza/pquad
收藏Hugging Face2023-02-18 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/Gholamreza/pquad
下载链接
链接失效反馈官方服务:
资源简介:
---
pretty_name: PQuAD
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- fa
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- open-domain-qa
- extractive-qa
paperswithcode_id: squad
train-eval-index:
- config: pquad
task: question-answering
task_id: extractive_question_answering
splits:
train_split: train
eval_split: validation
col_mapping:
question: question
context: context
answers:
text: text
answer_start: answer_start
metrics:
- type: pquad
name: PQuAD
dataset_info:
features:
- name: id
dtype: int32
- name: title
dtype: string
- name: context
dtype: string
- name: question
dtype: string
- name: answers
sequence:
- name: text
dtype: string
- name: answer_start
dtype: int32
config_name: pquad
splits:
- name: train
num_bytes: ...
num_examples: 63994
- name: validation
num_bytes: ...
num_examples: 7976
- name: test
num_bytes: ...
num_examples: 8002
download_size: ...
dataset_size: ...
---
# Dataset Card for "pquad"
## PQuAD Description
**THIS IS A NON-OFFICIAL VERSION OF THE DATASET UPLOADED TO HUGGINGFACE BY [Gholamreza Dar](https://huggingface.co/Gholamreza)**
*The original repository for the dataset is https://github.com/AUT-NLP/PQuAD*
PQuAD is a crowd- sourced reading comprehension dataset on Persian Language. It includes 80,000
questions along with their answers, with 25% of the questions being unanswerable. As a reading
comprehension dataset, it requires a system to read a passage and then answer the given questions
from the passage. PQuAD's questions are based on Persian Wikipedia articles and cover a wide
variety of subjects. Articles used for question generation are quality checked and include few
number of non-Persian words.
## Dataset Splits
The dataset is divided into three categories including train, validation, and test sets and the
statistics of these sets are as follows:
```
+----------------------------+-------+------------+------+-------+
| | Train | Validation | Test | Total |
+----------------------------+-------+------------+------+-------+
| Total Questions | 63994 | 7976 | 8002 | 79972 |
| Unanswerable Questions | 15721 | 1981 | 1914 | 19616 |
| Mean # of paragraph tokens | 125 | 121 | 124 | 125 |
| Mean # of question tokens | 10 | 11 | 11 | 10 |
| Mean # of answer tokens | 5 | 6 | 5 | 5 |
+----------------------------+-------+------------+------+-------+
```
Workers were encouraged to use paraphrased sentences in their questions and avoid choosing the
answers comprising non-Persian words. Another group of crowdworkers validated the questions and
answers in the test and validation set to ensure their quality. They also provided additional
answers to the questions in test and validation sets if possible. This helps to consider all
possible types of answers and have a better evaluation of models.
PQuAD is stored in the JSON format and consists of passages where each passage is linked to a
set of questions. Answer(s) of the questions is specified with answer's span (start and end
point of answer in paragraph). Also, the unanswerable questions are marked as unanswerable.
## Results
The estimated human performance on the test set is 88.3% for F1 and 80.3% for EM. We have
evaluated PQuAD using two pre-trained transformer-based language models, namely ParsBERT
(Farahani et al., 2021) and XLM-RoBERTa (Conneau et al., 2020), as well as BiDAF (Levy et
al., 2017) which is an attention-based model proposed for MRC.
```
+-------------+------+------+-----------+-----------+-------------+
| Model | EM | F1 | HasAns_EM | HasAns_F1 | NoAns_EM/F1 |
+-------------+------+------+-----------+-----------+-------------+
| BNA | 54.4 | 71.4 | 43.9 | 66.4 | 87.6 |
| ParsBERT | 68.1 | 82.0 | 61.5 | 79.8 | 89.0 |
| XLM-RoBERTa | 74.8 | 87.6 | 69.1 | 86.0 | 92.7 |
| Human | 80.3 | 88.3 | 74.9 | 85.6 | 96.8 |
+-------------+------+------+-----------+-----------+-------------+
```
## LICENSE
PQuAD is developed by Mabna Intelligent Computing at Amirkabir Science and Technology Park with
collaboration of the NLP lab of the Amirkabir University of Technology and is supported by the
Vice Presidency for Scientific and Technology. By releasing this dataset, we aim to ease research
on Persian reading comprehension and the development of Persian question answering systems.
This work is licensed under a
[Creative Commons Attribution-ShareAlike 4.0 International License][cc-by-sa].
[![CC BY-SA 4.0][cc-by-sa-image]][cc-by-sa]
[cc-by-sa]: http://creativecommons.org/licenses/by-sa/4.0/
[cc-by-sa-image]: https://licensebuttons.net/l/by-sa/4.0/88x31.png
[cc-by-sa-shield]: https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg
# Dataset Card for "pquad"
提供机构:
Gholamreza
原始信息汇总
数据集概述
基本信息
- 名称: PQuAD
- 语言: 波斯语 (fa)
- 许可证: CC-BY-SA-4.0
- 多语言性: 单语种
- 大小: 10K<n<100K
- 来源: 原始数据集
- 任务类别: 问答
- 任务ID:
- open-domain-qa
- extractive-qa
数据集结构
- 特征:
- id: int32
- title: string
- context: string
- question: string
- answers:
- text: string
- answer_start: int32
- 配置名称: pquad
- 分割:
- train: 63994个样本
- validation: 7976个样本
- test: 8002个样本
数据集内容
- 描述: PQuAD是一个关于波斯语的阅读理解数据集,包含80,000个问题及其答案,其中25%的问题无法回答。数据集基于波斯语维基百科文章,涵盖广泛的主题。
- 分割统计:
- 总问题数: 79972
- 无法回答的问题数: 19616
- 平均段落令牌数: 125
- 平均问题令牌数: 10
- 平均答案令牌数: 5
评估
- 人类性能:
- F1: 88.3%
- EM: 80.3%
- 模型评估:
- ParsBERT: EM 68.1%, F1 82.0%
- XLM-RoBERTa: EM 74.8%, F1 87.6%
- BiDAF: EM 54.4%, F1 71.4%
许可证
- 类型: Creative Commons Attribution-ShareAlike 4.0 International License



