Multi-EuP
Source: arXiv. Updated: 2023-11-03. Indexed: 2024-06-21.
Download link:
https://github.com/jrnlp/Multi-EuP
Description:
Multi-EuP is a multilingual benchmark dataset of 22,000 documents collected from the European Parliament, covering 24 languages. It is designed for studying fairness in multilingual information retrieval (IR), in particular language bias and demographic bias in ranking. The corpus is authentically multilingual, the topics are translated into all 24 languages, and cross-lingual relevance judgments are provided. The dataset also includes rich demographic information associated with the documents, which supports research on demographic bias. It is suitable for benchmarking both monolingual and multilingual IR, and can be used to explore language bias induced by the choice of tokenization strategy.
Provided by:
School of Computing and Information Systems, The University of Melbourne
Created:
2023-11-03



