Multi-EuP

Name: Multi-EuP
Creator: School of Computing and Information Systems, The University of Melbourne
Published: 2023-11-03 20:29:11
License: Not specified

Source: arXiv. Updated 2023-11-03; indexed 2024-06-21.

Download link:

https://github.com/jrnlp/Multi-EuP

Description:

Multi-EuP is a multilingual benchmark dataset of 22,000 documents collected from the European Parliament, covering 24 languages. It is designed for studying fairness in multilingual information retrieval (IR), in particular language bias and demographic bias in ranking. The corpus is authentically multilingual: topics are translated into all 24 languages, and cross-lingual relevance judgments are provided. The documents also carry rich demographic information, which supports research on demographic bias. The dataset is suitable for benchmarking both monolingual and multilingual IR, and can be used to explore language bias induced by the choice of tokenization strategy.
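As a rough illustration of what analyzing language bias in a ranking can look like, the sketch below computes the share of each language among the top-k results of a ranked list. The function name `language_share_at_k` and the sample ranking are hypothetical and are not part of the Multi-EuP release; this is a minimal sketch, not the metric used by the dataset authors.

```python
from collections import Counter


def language_share_at_k(ranked_doc_langs, k):
    """Return the fraction of each language among the top-k ranked documents.

    ranked_doc_langs: list of language codes, ordered from best to worst rank.
    """
    top_k = ranked_doc_langs[:k]
    if not top_k:
        return {}
    counts = Counter(top_k)
    return {lang: n / len(top_k) for lang, n in counts.items()}


# Hypothetical ranking: language codes of the documents a ranker returned,
# best first. Real data would come from a retrieval run over the corpus.
ranking = ["en", "en", "fr", "en", "de", "pl", "en", "fr"]
shares = language_share_at_k(ranking, k=5)
print(shares)
```

Comparing these shares with each language's share of the relevant documents gives a first indication of whether a ranker over- or under-represents a language.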

Provided by:

School of Computing and Information Systems, The University of Melbourne

Created:

2023-11-03
