amitbcp/nomir

Name: amitbcp/nomir
Creator: amitbcp
Published: 2024-05-15 05:43:21
License: no description provided

Hugging Face, updated 2024-05-15, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/amitbcp/nomir

Overview:

NoMIRACL is a human-annotated dataset for evaluating the robustness of large language models (LLMs) in retrieval-augmented generation (RAG) across 18 languages. It has two subsets. The non-relevant subset contains queries for which every document in the given list was manually judged non-relevant or noisy; the relevant subset contains queries with at least one document judged relevant (these are the queries from the MIRACL development and test splits). Robustness is measured with two metrics: hallucination rate and error rate. All topics were written by native speakers of each language, who also annotated the relevance between each topic and its list of documents.
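The listing names the two metrics but does not define them. The sketch below assumes the definitions usually attributed to the NoMIRACL paper: hallucination rate is the share of non-relevant-subset queries where the model still claims the documents contain an answer, and error rate is the share of relevant-subset queries where the model fails to recognize that an answer is present. The `Prediction` class, its field names, and the sample values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One model decision for one query (hypothetical container)."""
    subset: str            # "relevant" or "non-relevant"
    says_answerable: bool  # True if the model claims the documents contain an answer


def hallucination_rate(preds):
    """Fraction of non-relevant-subset queries the model wrongly calls answerable."""
    non_rel = [p for p in preds if p.subset == "non-relevant"]
    if not non_rel:
        return 0.0
    return sum(p.says_answerable for p in non_rel) / len(non_rel)


def error_rate(preds):
    """Fraction of relevant-subset queries the model wrongly calls unanswerable."""
    rel = [p for p in preds if p.subset == "relevant"]
    if not rel:
        return 0.0
    return sum(not p.says_answerable for p in rel) / len(rel)


# Made-up predictions: 4 non-relevant queries, 2 relevant queries.
preds = [
    Prediction("non-relevant", True),
    Prediction("non-relevant", False),
    Prediction("non-relevant", False),
    Prediction("non-relevant", False),
    Prediction("relevant", True),
    Prediction("relevant", False),
]

print(hallucination_rate(preds))  # 1 of 4 non-relevant queries -> 0.25
print(error_rate(preds))          # 1 of 2 relevant queries -> 0.5
```

Both rates are computed per subset, so a model that always refuses scores a hallucination rate of 0 but an error rate of 1; the two numbers are meant to be read together.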

Provider:

amitbcp

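Summary of original information

Dataset overview

Dataset name

Name: NoMIRACL

Language diversity

Languages: 18 languages, including Arabic (ar), Bengali (bn), English (en), Spanish (es), Persian (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th), Chinese (zh), and Portuguese (pt)
Multilinguality: multilingual

Dataset size

Size: 10K<n<100K

Dataset source

Source: derived from the MIRACL/miracl dataset

Task type

Task: text classification

License

License: Apache-2.0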

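Dataset structure

File formats:
- Documents: .jsonl.gz format; each line contains docid, title, text
- Topics: .tsv format; each line contains qid and query
- qrels: standard TREC format; each line contains qid, Q0, docid, relevance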
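The three file formats above can be read with the Python standard library alone. The following is a minimal sketch that parses small in-memory samples instead of real files; the function names, the sample identifiers (`q1`, `d1#0`, `d2#3`), and the sample text are made up for illustration, and the real files may differ in details such as header rows.

```python
import csv
import gzip
import io
import json


def read_documents(fileobj):
    """Yield one dict (docid, title, text) per line of a gzip-compressed JSONL stream."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def read_topics(stream):
    """Return {qid: query} from tab-separated lines of the form qid<TAB>query."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def read_qrels(stream):
    """Return {qid: {docid: relevance}} from TREC-style lines: qid Q0 docid relevance."""
    qrels = {}
    for line in stream:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _q0, docid, relevance = parts
        qrels.setdefault(qid, {})[docid] = int(relevance)
    return qrels


# Made-up sample data standing in for the real files.
docs_gz = gzip.compress(
    b'{"docid": "d1#0", "title": "Sample title", "text": "Sample passage."}\n'
)
documents = list(read_documents(io.BytesIO(docs_gz)))
topics = read_topics(io.StringIO("q1\tWhat is a sample query?\n"))
qrels = read_qrels(io.StringIO("q1 Q0 d1#0 1\nq1 Q0 d2#3 0\n"))

print(documents[0]["title"])
print(topics["q1"])
print(qrels["q1"])
```

To read real files, pass an opened binary file to `read_documents` and opened text files (UTF-8) to the other two functions.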

Dataset subsets

Subsets: two subsets, non-relevant and relevant
- non-relevant: every document was manually judged non-relevant or noisy
- relevant: at least one document was judged relevant
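The subset rule above can be expressed as a small function over relevance judgments. This is a sketch under the assumption that judgments are stored as integers with any value above 0 meaning relevant; the function name and the sample judgments are hypothetical.

```python
def assign_subset(judgments):
    """Return the subset name for one query given its {docid: relevance} judgments.

    Assumes relevance > 0 means "judged relevant" (an assumption, not stated
    in the listing).
    """
    if any(rel > 0 for rel in judgments.values()):
        return "relevant"
    return "non-relevant"


# Made-up judgments for two queries.
sample_qrels = {
    "q1": {"d1": 1, "d2": 0},  # one relevant document
    "q2": {"d3": 0, "d4": 0},  # every document judged non-relevant
}
subsets = {qid: assign_subset(j) for qid, j in sample_qrels.items()}
print(subsets)
```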

Dataset usage

Usage: load the dataset with the Hugging Face datasets library, selecting one of the 18 languages and one of the two subsets
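A loading call might look like the sketch below. The listing does not show the exact call, so several things here are assumptions: that the language is passed as the configuration name, that splits are named `<split>.<subset>` (for example `test.relevant`), and that the repository id from this page, `amitbcp/nomir`, exposes the same layout as the upstream dataset. The helper names `split_name` and `load_nomiracl` are hypothetical. Check the dataset card for the actual configuration and split names before relying on this.

```python
def split_name(split, subset):
    """Build a split string such as "test.relevant" (naming scheme is an assumption)."""
    return f"{split}.{subset}"


def load_nomiracl(language, subset, split="test", repo_id="amitbcp/nomir"):
    """Load one language and subset with the Hugging Face datasets library.

    `language` is passed as the config name; which names are valid
    depends on the repository and is not confirmed by the listing.
    """
    # Imported here so the helper above works without the library installed.
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(repo_id, language, split=split_name(split, subset))


print(split_name("test", "relevant"))
```

Calling `load_nomiracl(...)` downloads data from the Hugging Face Hub, so it needs network access and the `datasets` package.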

Dataset statistics

Statistics: see the associated publication for detailed statistics

Citation

Citation format:

@article{thakur2023nomiracl,
  title={NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation},
  author={Nandan Thakur and Luiz Bonifacio and Xinyu Zhang and Odunayo Ogundepo and Ehsan Kamalloo and David Alfonso-Hermelo and Xiaoguang Li and Qun Liu and Boxing Chen and Mehdi Rezagholizadeh and Jimmy Lin},
  journal={ArXiv},
  year={2023},
  volume={abs/2312.11361}
}
