clarin-pl/PUGG_MRC

Name: clarin-pl/PUGG_MRC
Creator: clarin-pl
Published: 2024-08-12 07:51:48
License: CC BY-SA 4.0

Source: Hugging Face. Updated 2024-08-12. Indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/clarin-pl/PUGG_MRC


Description:

---
annotations_creators:
- expert-generated
language_creators: []
language:
- pl
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
pretty_name: 'PUGG: MRC dataset for Polish'
tags:
- wikipedia
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
  - split: test
    path: test.jsonl
---

# PUGG: KBQA, MRC, IR Dataset for Polish

## Description

This repository contains the PUGG dataset designed for three NLP tasks in the Polish language:

- KBQA (Knowledge Base Question Answering)
- MRC (Machine Reading Comprehension)
- IR (Information Retrieval)

## Paper

For more detailed information, please refer to our research paper titled:

**"Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction"**

Authored by:

* Albert Sawczyn
* Katsiaryna Viarenich
* Konrad Wojtasik
* Aleksandra Domogała
* Marcin Oleksy
* Maciej Piasecki
* Tomasz Kajdanowicz

**The paper was accepted for ACL 2024 (findings).**

## Repositories

The dataset is available in the following repositories:

* [General](https://huggingface.co/datasets/clarin-pl/PUGG) - contains all tasks (KBQA, MRC, IR*)

For more straightforward usage, the tasks are also available in separate repositories:

* [KBQA](https://huggingface.co/datasets/clarin-pl/PUGG_KBQA)
* [MRC](https://huggingface.co/datasets/clarin-pl/PUGG_MRC) **(this repository)**
* [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR)

The knowledge graph for the KBQA task is available in the following repository:

* [Knowledge Graph](https://huggingface.co/datasets/clarin-pl/PUGG_KG)

Note: If you want to utilize the IR task in the BEIR format (`qrels` in `.tsv` format), please download the [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR) repository.

## Links

* Code:
  * [Github](https://github.com/CLARIN-PL/PUGG)
* Paper:
  * ACL - TBA
  * [Arxiv](https://arxiv.org/abs/2408.02337)

## Citation

```bibtex
@misc{sawczyn2024developingpuggpolishmodern,
  title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction},
  author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
  year={2024},
  eprint={2408.02337},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2408.02337},
}
```

## Contact

albert.sawczyn@pwr.edu.pl

## Usage

```python
from datasets import load_dataset

# Downloads the default config, which provides the "train" and "test" splits
dataset = load_dataset("clarin-pl/PUGG_MRC")
print(dataset)
```
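The metadata above lists the splits as plain JSON Lines files (`train.jsonl`, `test.jsonl`), so a downloaded copy can also be inspected without the `datasets` library. The following is a minimal sketch under that assumption; the helper names `read_jsonl` and `summarize` are hypothetical and not part of the PUGG code, and the record fields are discovered from the file rather than assumed, since this page does not document the schema.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def summarize(records: list) -> dict:
    """Return the record count and the sorted union of keys seen."""
    keys = set()
    for record in records:
        keys.update(record)
    return {"count": len(records), "keys": sorted(keys)}


if __name__ == "__main__":
    # Assumption: train.jsonl was downloaded into the working directory.
    local_file = Path("train.jsonl")
    if local_file.exists():
        print(summarize(list(read_jsonl(local_file))))
```

Listing the keys first is a cheap way to learn the actual field names before writing any task-specific processing.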


Provider:

clarin-pl
