Multilingual Culture-Independent Word Analogy Datasets

Name: Multilingual Culture-Independent Word Analogy Datasets
Creator: Faculty of Computer and Information Science, University of Ljubljana
Published: 2020-03-27 23:32:16
License: Not specified

Source: arXiv; updated 2020-03-27; indexed 2024-08-06

Download link:

http://arxiv.org/abs/1911.10038v2

Description:

The Multilingual Culture-Independent Word Analogy Datasets were created by the Faculty of Computer and Information Science at the University of Ljubljana and other institutions. They provide word analogy tasks in 9 languages for evaluating the quality of text embeddings. To stay culture-neutral, the datasets avoid examples specific to any single culture or country. They cover 15 categories: 5 semantic and 10 grammatical/morphological. The datasets were built by translating from Slovenian into the other languages, and a preliminary evaluation with fastText embeddings was performed. They are suitable for evaluating both monolingual and cross-lingual embeddings, and they help address the problem of representing lexical relations in multilingual settings.
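For readers unfamiliar with the word analogy task, the following is a minimal sketch of how such an evaluation is commonly scored. It assumes the widely used vector-offset method (often called 3CosAdd) and is not necessarily the procedure the dataset authors used. The toy vectors, the quadruple list, and the function names `solve_analogy` and `analogy_accuracy` are hypothetical and exist only for illustration; a real evaluation would load pretrained embeddings and the dataset's analogy entries instead.

```python
import numpy as np


def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes a : b :: c : d (vector-offset method)."""
    words = list(embeddings)
    matrix = np.array([embeddings[w] for w in words], dtype=float)
    # Normalize every word vector to unit length.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}
    # Offset vector: b - a + c.
    target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
    # Rows are unit length, so the dot product ranks words by cosine similarity.
    scores = matrix @ target
    # The three query words are excluded from the candidates.
    for w in (a, b, c):
        scores[index[w]] = -np.inf
    return words[int(np.argmax(scores))]


def analogy_accuracy(embeddings, quadruples):
    """Fraction of (a, b, c, d) quadruples whose predicted d matches the expected d."""
    hits = sum(solve_analogy(embeddings, a, b, c) == d for a, b, c, d in quadruples)
    return hits / len(quadruples)


# Hypothetical toy embeddings (assumption: not taken from the dataset or fastText).
toy_embeddings = {
    "man":   [1.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0, 0.0],
    "king":  [1.0, 0.0, 1.0, 0.0],
    "queen": [0.0, 1.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

# Hypothetical analogy entries in the form a : b :: c : d.
toy_quadruples = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

if __name__ == "__main__":
    print(solve_analogy(toy_embeddings, "man", "woman", "king"))
    print(analogy_accuracy(toy_embeddings, toy_quadruples))
```

Accuracy computed per category (for example, separately for the semantic and the grammatical/morphological categories) gives a finer picture than a single overall score.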

Provided by:

Faculty of Computer and Information Science, University of Ljubljana

Created:

2019-11-22
