clarin-pl/PUGG

Name: clarin-pl/PUGG
Creator: clarin-pl
Published: 2024-08-12 07:53:43
License: cc-by-sa-4.0

Hugging Face, updated 2024-08-12, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/clarin-pl/PUGG


Resource overview:

---
annotations_creators:
- expert-generated
language_creators: []
language:
- pl
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
- 10K<n<100K
source_datasets:
- original
task_categories:
- question-answering
- text-retrieval
task_ids:
- extractive-qa
- document-retrieval
pretty_name: 'PUGG: KBQA, MRC, IR dataset for Polish'
tags:
- knowledge graph
- KBQA
- wikipedia
- wikidata
configs:
- config_name: kbqa_all
  data_files:
  - split: train
    path: kbqa/*/train.jsonl
  - split: test
    path: kbqa/*/test.jsonl
- config_name: kbqa_natural
  data_files:
  - split: train
    path: kbqa/natural/train.jsonl
  - split: test
    path: kbqa/natural/test.jsonl
- config_name: kbqa_template-based
  data_files:
  - split: train
    path: kbqa/template-based/train.jsonl
  - split: test
    path: kbqa/template-based/test.jsonl
- config_name: mrc
  data_files:
  - split: train
    path: mrc/train.jsonl
  - split: test
    path: mrc/test.jsonl
- config_name: ir_corpus
  data_files:
  - split: test
    path: ir/corpus.jsonl
- config_name: ir_queries
  data_files:
  - split: test
    path: ir/queries.jsonl
- config_name: ir_qrels
  data_files:
  - split: test
    path: ir/qrels/test.jsonl
---

# PUGG: KBQA, MRC, IR Dataset for Polish

## Description

This repository contains the PUGG dataset designed for three NLP tasks in the Polish language:

- KBQA (Knowledge Base Question Answering)
- MRC (Machine Reading Comprehension)
- IR (Information Retrieval)

## Paper

For more detailed information, please refer to our research paper titled:

**"Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction"**

Authored by:

* Albert Sawczyn
* Katsiaryna Viarenich
* Konrad Wojtasik
* Aleksandra Domogała
* Marcin Oleksy
* Maciej Piasecki
* Tomasz Kajdanowicz

**The paper was accepted for ACL 2024 (findings).**

## Repositories

The dataset is available in the following repositories:

* [General](https://huggingface.co/datasets/clarin-pl/PUGG) **(this repository)** - contains all tasks (KBQA, MRC, IR*)

For more straightforward usage, the tasks are also available in separate repositories:

* [KBQA](https://huggingface.co/datasets/clarin-pl/PUGG_KBQA)
* [MRC](https://huggingface.co/datasets/clarin-pl/PUGG_MRC)
* [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR)

The knowledge graph for KBQA task is available in the following repository:

* [Knowledge Graph](https://huggingface.co/datasets/clarin-pl/PUGG_KG)

Note: If you want to utilize the IR task in the BEIR format (`qrels` in `.tsv` format), please download the [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR) repository.

## Links

* Code:
  * [Github](https://github.com/CLARIN-PL/PUGG)
* Paper:
  * ACL - TBA
  * [Arxiv](https://arxiv.org/abs/2408.02337)

## Citation

```bibtex
@misc{sawczyn2024developingpuggpolishmodern,
      title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction},
      author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
      year={2024},
      eprint={2408.02337},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.02337},
}
```

## Contact

albert.sawczyn@pwr.edu.pl

## Usage

```python
from datasets import load_dataset

# loading KBQA (all)
dataset = load_dataset("clarin-pl/PUGG", "kbqa_all")
print(dataset)

# loading KBQA (natural)
dataset = load_dataset("clarin-pl/PUGG", "kbqa_natural")
print(dataset)

# loading KBQA (template-based)
dataset = load_dataset("clarin-pl/PUGG", "kbqa_template-based")
print(dataset)

# loading MRC
dataset = load_dataset("clarin-pl/PUGG", "mrc")
print(dataset)

# loading IR
## corpus
dataset = load_dataset("clarin-pl/PUGG", "ir_corpus")
print(dataset)

## queries
dataset = load_dataset("clarin-pl/PUGG", "ir_queries")
print(dataset)

## qrels
dataset = load_dataset("clarin-pl/PUGG", "ir_qrels")
print(dataset)
```
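The `ir_qrels` config in this repository is stored as JSONL, not in the BEIR `.tsv` layout. The following is a minimal sketch, not part of the original dataset card, of how the loaded relevance judgments could be reshaped into the nested `{query_id: {doc_id: score}}` mapping that BEIR-style evaluators typically expect. The column names `query-id`, `corpus-id`, and `score` follow the usual BEIR convention and are assumptions here; check `dataset["test"].column_names` and adjust them if the actual fields differ. The helper `qrels_to_dict` is hypothetical and works on any iterable of row dictionaries, so it is shown on toy rows instead of a live download.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def qrels_to_dict(
    rows: Iterable[Mapping[str, object]],
    query_key: str = "query-id",   # assumed column name (BEIR convention)
    doc_key: str = "corpus-id",    # assumed column name (BEIR convention)
    score_key: str = "score",      # assumed column name (BEIR convention)
) -> Dict[str, Dict[str, int]]:
    """Group relevance rows into {query_id: {doc_id: score}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in rows:
        query_id = str(row[query_key])
        doc_id = str(row[doc_key])
        qrels[query_id][doc_id] = int(row[score_key])
    return dict(qrels)


# Toy rows standing in for load_dataset("clarin-pl/PUGG", "ir_qrels")["test"].
example_rows = [
    {"query-id": "q1", "corpus-id": "d3", "score": 1},
    {"query-id": "q1", "corpus-id": "d7", "score": 1},
    {"query-id": "q2", "corpus-id": "d3", "score": 1},
]
example_qrels = qrels_to_dict(example_rows)
print(example_qrels)
```

With the real data, the same call would be `qrels_to_dict(load_dataset("clarin-pl/PUGG", "ir_qrels")["test"])`, provided the assumed column names match.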

Provided by:

clarin-pl

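Summary of the original information

PUGG: KBQA, MRC, IR dataset overview

Dataset description

The PUGG dataset was created as part of the research paper "Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction", written by the following authors: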

Albert Sawczyn
Katsiaryna Viarenich
Konrad Wojtasik
Aleksandra Domogała
Marcin Oleksy
Maciej Piasecki
Tomasz Kajdanowicz

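The paper was accepted to ACL 2024 (Findings).

Dataset purpose

The PUGG dataset is intended to support research on KBQA (Knowledge Base Question Answering), MRC (Machine Reading Comprehension), and IR (Information Retrieval) for the Polish language.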
