
clarin-pl/PUGG

Source: Hugging Face. Updated: 2024-08-12. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/clarin-pl/PUGG
Resource description:
---
annotations_creators:
- expert-generated
language_creators: []
language:
- pl
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
- 10K<n<100K
source_datasets:
- original
task_categories:
- question-answering
- text-retrieval
task_ids:
- extractive-qa
- document-retrieval
pretty_name: 'PUGG: KBQA, MRC, IR dataset for Polish'
tags:
- knowledge graph
- KBQA
- wikipedia
- wikidata
configs:
- config_name: kbqa_all
  data_files:
  - split: train
    path: kbqa/*/train.jsonl
  - split: test
    path: kbqa/*/test.jsonl
- config_name: kbqa_natural
  data_files:
  - split: train
    path: kbqa/natural/train.jsonl
  - split: test
    path: kbqa/natural/test.jsonl
- config_name: kbqa_template-based
  data_files:
  - split: train
    path: kbqa/template-based/train.jsonl
  - split: test
    path: kbqa/template-based/test.jsonl
- config_name: mrc
  data_files:
  - split: train
    path: mrc/train.jsonl
  - split: test
    path: mrc/test.jsonl
- config_name: ir_corpus
  data_files:
  - split: test
    path: ir/corpus.jsonl
- config_name: ir_queries
  data_files:
  - split: test
    path: ir/queries.jsonl
- config_name: ir_qrels
  data_files:
  - split: test
    path: ir/qrels/test.jsonl
---

# PUGG: KBQA, MRC, IR Dataset for Polish

## Description

This repository contains the PUGG dataset, designed for three NLP tasks in the Polish language:

- KBQA (Knowledge Base Question Answering)
- MRC (Machine Reading Comprehension)
- IR (Information Retrieval)

## Paper

For more detailed information, please refer to our research paper titled:

**"Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction"**

Authored by:

* Albert Sawczyn
* Katsiaryna Viarenich
* Konrad Wojtasik
* Aleksandra Domogała
* Marcin Oleksy
* Maciej Piasecki
* Tomasz Kajdanowicz

**The paper was accepted for ACL 2024 (Findings).**

## Repositories

The dataset is available in the following repositories:

* [General](https://huggingface.co/datasets/clarin-pl/PUGG) **(this repository)** - contains all tasks (KBQA, MRC, IR*)

For more straightforward usage, the tasks are also available in separate repositories:

* [KBQA](https://huggingface.co/datasets/clarin-pl/PUGG_KBQA)
* [MRC](https://huggingface.co/datasets/clarin-pl/PUGG_MRC)
* [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR)

The knowledge graph for the KBQA task is available in the following repository:

* [Knowledge Graph](https://huggingface.co/datasets/clarin-pl/PUGG_KG)

Note: If you want to use the IR task in the BEIR format (`qrels` in `.tsv` format), please download the [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR) repository.

## Links

* Code:
  * [Github](https://github.com/CLARIN-PL/PUGG)
* Paper:
  * ACL - TBA
  * [Arxiv](https://arxiv.org/abs/2408.02337)

## Citation

```bibtex
@misc{sawczyn2024developingpuggpolishmodern,
      title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction},
      author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
      year={2024},
      eprint={2408.02337},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.02337},
}
```

## Contact

albert.sawczyn@pwr.edu.pl

## Usage

```python
from datasets import load_dataset

# loading KBQA (all)
dataset = load_dataset("clarin-pl/PUGG", "kbqa_all")
print(dataset)

# loading KBQA (natural)
dataset = load_dataset("clarin-pl/PUGG", "kbqa_natural")
print(dataset)

# loading KBQA (template-based)
dataset = load_dataset("clarin-pl/PUGG", "kbqa_template-based")
print(dataset)

# loading MRC
dataset = load_dataset("clarin-pl/PUGG", "mrc")
print(dataset)

# loading IR
## corpus
dataset = load_dataset("clarin-pl/PUGG", "ir_corpus")
print(dataset)

## queries
dataset = load_dataset("clarin-pl/PUGG", "ir_queries")
print(dataset)

## qrels
dataset = load_dataset("clarin-pl/PUGG", "ir_qrels")
print(dataset)
```
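
The `ir_qrels` config is served as JSONL, while BEIR-style tooling expects `qrels` as a `.tsv` file. As a rough illustration only, the sketch below groups relevance rows by query and writes them as a tab-separated file with a `query-id`, `corpus-id`, `score` header. The column names, the helper names `build_qrels` and `write_qrels_tsv`, and the sample rows are all assumptions made for this example; they are not taken from the dataset, so check the real column names before relying on them.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Mapping, TextIO


def build_qrels(
    rows: Iterable[Mapping[str, object]],
    query_key: str = "query-id",   # assumed column name, not confirmed by the dataset card
    doc_key: str = "corpus-id",    # assumed column name
    score_key: str = "score",      # assumed column name
) -> Dict[str, Dict[str, int]]:
    """Group relevance judgments as {query_id: {doc_id: score}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in rows:
        qrels[str(row[query_key])][str(row[doc_key])] = int(row[score_key])
    return dict(qrels)


def write_qrels_tsv(qrels: Mapping[str, Mapping[str, int]], out: TextIO) -> None:
    """Write the judgments as tab-separated rows under a three-column header."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["query-id", "corpus-id", "score"])
    for query_id, docs in qrels.items():
        for doc_id, score in docs.items():
            writer.writerow([query_id, doc_id, score])


# Hypothetical rows standing in for the real `ir_qrels` test split.
sample_rows = [
    {"query-id": "q1", "corpus-id": "d3", "score": 1},
    {"query-id": "q1", "corpus-id": "d7", "score": 1},
    {"query-id": "q2", "corpus-id": "d3", "score": 1},
]

qrels = build_qrels(sample_rows)
buffer = io.StringIO()
write_qrels_tsv(qrels, buffer)
print(buffer.getvalue())
```

With the real data, the rows could come from `load_dataset("clarin-pl/PUGG", "ir_qrels")["test"]`, which iterates as dictionaries; inspect its `column_names` first and pass the matching keys if they differ from the assumed ones. The separate IR repository mentioned above already ships `qrels` in `.tsv` form, so this conversion is only useful when working from the general repository.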
Provider:
clarin-pl
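Summary of original information

PUGG: KBQA, MRC, IR dataset overview

Dataset description

This dataset, named PUGG, was created in connection with the research paper "Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction", written by the following authors: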

  • Albert Sawczyn
  • Katsiaryna Viarenich
  • Konrad Wojtasik
  • Aleksandra Domogała
  • Marcin Oleksy
  • Maciej Piasecki
  • Tomasz Kajdanowicz

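The paper was accepted for ACL 2024 (Findings).

Dataset purpose

The PUGG dataset is intended to support research on KBQA (Knowledge Base Question Answering), MRC (Machine Reading Comprehension), and IR (Information Retrieval) for the Polish language.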
