IndicSentEval

Name: IndicSentEval
Creator: ILMT initiative
License: Not specified

Indexed from arXiv on 2025-09-30

Download link:

https://tinyurl.com/IndicSentEval


Description:

IndicSentEval is a multilingual benchmark dataset of approximately 47,000 sentences, designed to evaluate how well multilingual Transformer models encode a range of linguistic properties, and how robust those encodings are, across six Indic languages. It provides annotated data for eight probing tasks and is organized in the Shakti Standard Format (SSF), which carries rich linguistic annotation.

Provided by:

ILMT initiative
