hsseinmz/arcd
Dataset Overview
Name: ARCD (Arabic Reading Comprehension Dataset)
Language: Arabic (ar-SA)
License: MIT
Multilinguality: monolingual
Size: 1K<n<10K
Source data: original
Task category: question answering (extractive-qa)
Dataset Structure
Data Instances
- Fields:
  - id: string
  - title: string
  - context: string
  - question: string
  - answers: dictionary containing
    - text: string
    - answer_start: integer
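
The field list above follows the usual extractive-QA layout, in which each answer is a span of the context. A minimal sketch of a consistency check for one record is given below. It assumes that `answer_start` is a character offset into `context`, and it accepts either scalar or list-valued `text` / `answer_start`, since the exact container type is not stated here. The helper names and the sample record are hypothetical and were invented for illustration; the record is not taken from ARCD.

```python
from typing import Any, Dict, List


def as_list(value: Any) -> List[Any]:
    """Wrap a scalar in a list so scalar and list-valued fields are handled alike."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def misaligned_answers(record: Dict[str, Any]) -> List[str]:
    """Return the answer texts that do not occur in `context` at `answer_start`.

    Assumption: `answer_start` counts characters (Python str indices).
    """
    context = record["context"]
    texts = as_list(record["answers"]["text"])
    starts = as_list(record["answers"]["answer_start"])
    bad = []
    for text, start in zip(texts, starts):
        if context[start:start + len(text)] != text:
            bad.append(text)
    return bad


# Hypothetical record, invented for illustration (not taken from ARCD).
example = {
    "id": "example-0",
    "title": "القاهرة",
    "context": "القاهرة هي عاصمة مصر.",
    "question": "ما هي عاصمة مصر؟",
    "answers": {"text": ["القاهرة"], "answer_start": [0]},
}

print(misaligned_answers(example))
```

Records obtained elsewhere, for example through the Hugging Face `datasets` library, could be passed to `misaligned_answers` in the same way, provided they expose the fields listed above.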
Data Splits
| Name | Train | Validation |
|---|---|---|
| plain_text | 693 | 702 |
Dataset Creation
Annotators
- Annotation creators: crowdsourced
- Language creators: crowdsourced
Licensing Information
- License: MIT
Citation Information
@inproceedings{mozannar-etal-2019-neural,
    title = "Neural {A}rabic Question Answering",
    author = "Mozannar, Hussein and Maamary, Elie and El Hajal, Karl and Hajj, Hazem",
    booktitle = "Proceedings of the Fourth Arabic Natural Language Processing Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4612",
    doi = "10.18653/v1/W19-4612",
    pages = "108--118",
    abstract = "This paper tackles the problem of open domain factual Arabic question answering (QA) using Wikipedia as our knowledge source. This constrains the answer of any question to be a span of text in Wikipedia. Open domain QA for Arabic entails three challenges: annotated QA datasets in Arabic, large scale efficient information retrieval and machine reading comprehension. To deal with the lack of Arabic QA datasets we present the Arabic Reading Comprehension Dataset (ARCD) composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD). Our system for open domain question answering in Arabic (SOQAL) is based on two components: (1) a document retriever using a hierarchical TF-IDF approach and (2) a neural reading comprehension model using the pre-trained bi-directional transformer BERT. Our experiments on ARCD indicate the effectiveness of our approach with our BERT-based reader achieving a 61.3 F1 score, and our open domain system SOQAL achieving a 27.6 F1 score.",
}



