DiscoResearch/germanrag

Name: DiscoResearch/germanrag
Creator: DiscoResearch
Published: 2024-02-04 17:50:10
License: No description available

Source: Hugging Face, updated 2024-02-04, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/DiscoResearch/germanrag

Description:

---
pretty_name: GermanRAG
configs:
- config_name: default
  data_files:
  - split: train
    path: germanrag.jsonl
license: cc-by-4.0
language:
- de
multilinguality:
- monolingual
source_datasets:
- deepset/germandpr
task_categories:
- question-answering
- text-retrieval
- conversational
task_ids:
- open-domain-qa
- document-retrieval
- document-question-answering
tags:
- RAG
- retrieval-augmented-generation
size_categories:
- 1K<n<10K
---

# GermanRAG 🇩🇪📜🦜

This dataset is derived from the [GermanDPR dataset](https://huggingface.co/datasets/deepset/germandpr) and enhances it by providing fully formulated answers instead of answer spans. It can be used to fine-tune models for retrieval-augmented generation (RAG) tasks in German.

We deduplicated the original contexts, resulting in 2243 unique contexts, and repeated the hard negatives of half of them, such that the last third of the total dataset contains only unanswerable examples. In contrast to the original dataset, the number of contexts per QA pair varies to mimic retrieval results in real-world scenarios, resulting in the following distribution of positive and hard negative contexts:

| # positive contexts | # hard negative contexts | # examples |
|---|---|---|
| 1 | 0 | 562 |
| 1 | 1 | 562 |
| 1 | 2 | 561 |
| 1 | 3 | 558 |
| 0 | 1 | 375 |
| 0 | 2 | 373 |
| 0 | 3 | 371 |

The passages in the `contexts` list are shuffled, and `positive_ctx_idx` marks the index of the positive context. `-1` indicates examples without a positive context, which are paired with `"Mit den gegebenen Informationen ist diese Frage nicht zu beantworten."` as the answer.

Code used to create this dataset can be found [here](https://github.com/rasdani/germanrag).

## Known issues

In rare cases hard negatives still provide sufficient information to answer the question. For the last third, we therefore paired hard negatives with random questions, sampled without replacement.

## Acknowledgements

Full credit for the original dataset goes to the [authors](https://arxiv.org/abs/2104.12741) of [GermanDPR](https://www.deepset.ai/germanquad). The original dataset is licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) and this derived work therefore inherits the same license.

Citation for the original dataset:

```
@misc{möller2021germanquad,
      title={GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval},
      author={Timo Möller and Julian Risch and Malte Pietsch},
      year={2021},
      eprint={2104.12741},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

The derived dataset was created for [DiscoResearch](https://huggingface.co/DiscoResearch) by [Daniel Auras](https://huggingface.co/rasdani) with support from [JP Harries](https://huggingface.co/jphme) and [Björn Pluster](https://huggingface.co/bjoernp).

Provider:

DiscoResearch

Summary of original information

Dataset overview

Dataset name

GermanRAG

Configuration

Default configuration
- Data file path: germanrag.jsonl
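
A minimal sketch of reading the data file, assuming `germanrag.jsonl` has been downloaded locally and holds one JSON object per line (the usual JSON Lines layout). The helper name `read_jsonl` is hypothetical and not part of the dataset's tooling.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl("germanrag.jsonl")` would then return the examples as plain dictionaries. With the Hugging Face `datasets` library installed, `load_dataset("DiscoResearch/germanrag", split="train")` should yield the same records, assuming the repository is reachable.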

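License

CC-BY-4.0

Language

German (de)

Multilinguality

Monolingual

Source dataset

deepset/germandpr

Task categories

Question answering
Text retrieval
Conversational

Task IDs

Open-domain QA
Document retrieval
Document question answering

Size category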

1K<n<10K

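Dataset characteristics

Derived from the GermanDPR dataset; provides fully formulated answers instead of answer spans.
Intended for fine-tuning on retrieval-augmented generation (RAG) tasks in German.
Contains 2243 unique contexts; the hard negatives of half of them are repeated.
The last third of the dataset contains only unanswerable examples.
The number of contexts per QA pair varies to mimic real-world retrieval results.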
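
The mix of answerable and unanswerable examples can be tallied per record. The sketch below assumes each record is a dict with a `contexts` list and a `positive_ctx_idx` integer, as described in the dataset card. The function name `context_distribution` and the sample records are made up for illustration.

```python
from collections import Counter


def context_distribution(records):
    """Count examples by (number of positive contexts, number of hard negatives)."""
    counts = Counter()
    for record in records:
        # A record has exactly one positive context unless the index is -1.
        n_positive = 0 if record["positive_ctx_idx"] == -1 else 1
        n_negative = len(record["contexts"]) - n_positive
        counts[(n_positive, n_negative)] += 1
    return counts


# Illustrative records only, not taken from the dataset.
sample = [
    {"contexts": ["a"], "positive_ctx_idx": 0},
    {"contexts": ["a", "b", "c"], "positive_ctx_idx": 2},
    {"contexts": ["x", "y"], "positive_ctx_idx": -1},
]
distribution = context_distribution(sample)
unanswerable = sum(n for (pos, _), n in distribution.items() if pos == 0)
```

Run over the full dataset, the counts should match the distribution table in the dataset card, which sums to 3362 examples, 1119 of them (about a third) without a positive context.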

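Dataset structure

The passages in the contexts list are shuffled.
positive_ctx_idx marks the index of the positive context.
-1 indicates an example without a positive context; its answer is "Mit den gegebenen Informationen ist diese Frage nicht zu beantworten.".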
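
A sketch of how a record might be turned into a prompt and checked against the refusal convention. The field names `question` and `answer` are assumptions (only `contexts` and `positive_ctx_idx` are named in this description), and the prompt layout and helper names are hypothetical.

```python
REFUSAL = "Mit den gegebenen Informationen ist diese Frage nicht zu beantworten."


def positive_context(record):
    """Return the positive passage, or None when positive_ctx_idx is -1."""
    idx = record["positive_ctx_idx"]
    if idx == -1:
        return None
    return record["contexts"][idx]


def build_prompt(record):
    """Join the shuffled passages and the question into one prompt string."""
    passages = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(record["contexts"])
    )
    return f"Kontext:\n{passages}\n\nFrage: {record['question']}\nAntwort:"


# Illustrative record only; "question" and "answer" are assumed field names.
example = {
    "contexts": [
        "Berlin ist die Hauptstadt von Deutschland.",
        "Paris liegt an der Seine.",
    ],
    "question": "Wie hoch ist die Zugspitze?",
    "answer": REFUSAL,
    "positive_ctx_idx": -1,
}
# An example should carry the refusal answer exactly when it has no positive context.
is_consistent = (positive_context(example) is None) == (example["answer"] == REFUSAL)
```

Checking `positive_ctx_idx` against `-1` explicitly matters here, because Python would otherwise treat `-1` as a valid index and silently return the last passage.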
