Data for paper: "Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval"

Name: Data for paper: "Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval"
Creator: Mannheim University Library
Published: 2024-06-21 20:13:42
License: No description available

DataCite Commons (updated 2024-06-21, indexed 2024-07-13)

Download link:

https://madata.bib.uni-mannheim.de/361

Resource description:

Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, it remains unclear to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR (a setup with no relevance judgments for IR-specific fine-tuning), pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not reached by using the general-purpose multilingual text encoders 'off-the-shelf', but rather by relying on variants that have been further specialized for sentence understanding tasks.

Provider:

Mannheim University Library

Created:

2021-01-25
