
clarin-pl/PUGG_MRC

Source: Hugging Face. Updated 2024-08-12; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/clarin-pl/PUGG_MRC
Resource description:
---
annotations_creators:
- expert-generated
language_creators: []
language:
- pl
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
pretty_name: 'PUGG: MRC dataset for Polish'
tags:
- wikipedia
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
  - split: test
    path: test.jsonl
---

# PUGG: KBQA, MRC, IR Dataset for Polish

## Description

This repository contains the PUGG dataset designed for three NLP tasks in the Polish language:

- KBQA (Knowledge Base Question Answering)
- MRC (Machine Reading Comprehension)
- IR (Information Retrieval)

## Paper

For more detailed information, please refer to our research paper titled:

**"Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction"**

Authored by:

* Albert Sawczyn
* Katsiaryna Viarenich
* Konrad Wojtasik
* Aleksandra Domogała
* Marcin Oleksy
* Maciej Piasecki
* Tomasz Kajdanowicz

**The paper was accepted for ACL 2024 (findings).**

## Repositories

The dataset is available in the following repositories:

* [General](https://huggingface.co/datasets/clarin-pl/PUGG) - contains all tasks (KBQA, MRC, IR*)

For more straightforward usage, the tasks are also available in separate repositories:

* [KBQA](https://huggingface.co/datasets/clarin-pl/PUGG_KBQA)
* [MRC](https://huggingface.co/datasets/clarin-pl/PUGG_MRC) **(this repository)**
* [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR)

The knowledge graph for KBQA task is available in the following repository:

* [Knowledge Graph](https://huggingface.co/datasets/clarin-pl/PUGG_KG)

Note: If you want to utilize the IR task in the BEIR format (`qrels` in `.tsv` format), please download the [IR](https://huggingface.co/datasets/clarin-pl/PUGG_IR) repository.
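For readers unfamiliar with the BEIR format mentioned in the note, a BEIR-style `qrels` file is tab-separated with a header row of `query-id`, `corpus-id`, and `score`. The sketch below parses text in that shape using only the standard library. The sample rows and the helper name `parse_qrels` are made up for illustration, and the exact file layout in the IR repository should be checked against the repository itself.

```python
import csv
import io


def parse_qrels(text):
    """Parse BEIR-style qrels TSV text into {query_id: {corpus_id: score}}."""
    qrels = {}
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        # Group relevance judgments by query; scores are stored as integers.
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Made-up sample rows, not taken from the PUGG data.
sample = "query-id\tcorpus-id\tscore\nq1\td10\t1\nq1\td11\t1\nq2\td20\t1\n"
print(parse_qrels(sample))
```

The nested-dictionary shape is a common way to hold relevance judgments in memory, since it allows a direct lookup of every relevant document for a given query.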
## Links

* Code:
  * [Github](https://github.com/CLARIN-PL/PUGG)
* Paper:
  * ACL - TBA
  * [Arxiv](https://arxiv.org/abs/2408.02337)

## Citation

```bibtex
@misc{sawczyn2024developingpuggpolishmodern,
      title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction},
      author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
      year={2024},
      eprint={2408.02337},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.02337},
}
```

## Contact

albert.sawczyn@pwr.edu.pl

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("clarin-pl/PUGG_MRC")
print(dataset)
```
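The card's config lists `train.jsonl` and `test.jsonl` as the data files, so a downloaded split can also be inspected without the `datasets` library. The following is a minimal sketch under that assumption: it expects a local copy of the file, makes no assumption about the field names inside each record, and uses the hypothetical helper names `read_jsonl` and `summarize`.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the record count and the sorted set of keys seen across records."""
    keys = sorted({key for record in records for key in record})
    return len(records), keys


if __name__ == "__main__":
    # Assumes train.jsonl was downloaded from the repository beforehand.
    local_file = Path("train.jsonl")
    if local_file.exists():
        count, keys = summarize(read_jsonl(local_file))
        print(f"{count} records; fields: {keys}")
```

Listing the keys first is a cheap way to discover the record schema before writing any task-specific code.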

Provider:
clarin-pl