clarin-pl/PUGG_MRC
Source: Hugging Face. Updated: 2024-08-12. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/clarin-pl/PUGG_MRC
---
annotations_creators:
- expert-generated
language_creators: []
language:
- pl
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
pretty_name: 'PUGG: MRC dataset for Polish'
tags:
- wikipedia
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
  - split: test
    path: test.jsonl
---
# PUGG: KBQA, MRC, IR Dataset for Polish
## Description
This repository contains the PUGG dataset designed for three NLP tasks in the Polish language:
- KBQA (Knowledge Base Question Answering)
- MRC (Machine Reading Comprehension)
- IR (Information Retrieval)
## Paper
For more detailed information, please refer to our research paper titled:
**"Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction"**
Authored by:
* Albert Sawczyn
* Katsiaryna Viarenich
* Konrad Wojtasik
* Aleksandra Domogała
* Marcin Oleksy
* Maciej Piasecki
* Tomasz Kajdanowicz
**The paper was accepted to Findings of ACL 2024.**
## Repositories
The dataset is available in the following repositories:
* [General](https://huggingface.co/datasets/clarin-pl/PUGG) - contains all tasks (KBQA, MRC, IR*)
For more straightforward usage, the tasks are also available in separate repositories:
* [KBQA](https://huggingface.co/datasets/clarin-pl/PUGG_KBQA)
* [MRC](https://huggingface.co/datasets/clarin-pl/PUGG_MRC) **(this repository)**
* [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR)
The knowledge graph for the KBQA task is available in the following repository:
* [Knowledge Graph](https://huggingface.co/datasets/clarin-pl/PUGG_KG)
Note: To use the IR task in the BEIR format (`qrels` in `.tsv` format), download the [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR) repository.
## Links
* Code:
  * [GitHub](https://github.com/CLARIN-PL/PUGG)
* Paper:
  * ACL - TBA
  * [arXiv](https://arxiv.org/abs/2408.02337)
## Citation
```bibtex
@misc{sawczyn2024developingpuggpolishmodern,
      title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction},
      author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
      year={2024},
      eprint={2408.02337},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.02337},
}
```
## Contact
albert.sawczyn@pwr.edu.pl
## Usage
```python
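from datasets import load_dataset

# Download the MRC dataset from the Hugging Face Hub (train and test splits)
dataset = load_dataset("clarin-pl/PUGG_MRC")

# Show the available splits with their columns and row counts
print(dataset)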
```
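The card's `configs` block maps the `train` and `test` splits to `train.jsonl` and `test.jsonl`. If you download those files directly instead of using `load_dataset`, they can be read with the standard library alone. The following is a minimal sketch, assuming each file holds one JSON object per line. The helper names `read_jsonl` and `column_names` are introduced here for illustration, and the field names in the sample records are hypothetical placeholders, not the dataset's actual schema.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def column_names(records):
    """Return the sorted union of keys found across all records."""
    keys = set()
    for record in records:
        keys.update(record)
    return sorted(keys)


# Hypothetical stand-in for train.jsonl; these field names are assumptions
# made for the example and may not match the real files.
sample = [
    {"id": "0", "question": "Kto napisał „Lalkę”?", "answer": "Bolesław Prus"},
    {"id": "1", "question": "Jaka jest stolica Polski?", "answer": "Warszawa"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    lines = [json.dumps(record, ensure_ascii=False) for record in sample]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    records = read_jsonl(path)

# Inspect the schema before relying on any particular field
print(len(records), column_names(records))
```

To apply the same check to the real data, point `read_jsonl` at a downloaded `train.jsonl` or `test.jsonl` and inspect `column_names(records)` before writing any field-specific code.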
Provider: clarin-pl



