RUSSE (Russian Words in Context (based on RUSSE))
Source: OpenDataLab | Updated: 2026-05-31 | Indexed: 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/RUSSE
Description:
WiC: Word-in-Context Dataset, a reliable benchmark for evaluating context-aware word embeddings.
Depending on its context, an ambiguous word can refer to multiple, potentially unrelated meanings. Mainstream static word embeddings such as Word2vec and GloVe cannot reflect this dynamic semantic property. Contextualized word embeddings attempt to address this limitation by computing dynamic word representations that adapt to the surrounding context.
The Russian SuperGLUE task borrows raw data from the Russe project, specifically the 2018 Word Sense Induction and Disambiguation Shared Task.
Task Type: Reading Comprehension. Binary Classification: True/False.
Example:
{
"idx": 8,
"word": "дорожка",
"sentence1": "Бурые ковровые дорожки заглушали шаги",
"sentence2": "Приятели решили выпить на дорожку в местном баре",
"start1": 15,
"end1": 23,
"start2": 26,
"end2": 34,
"label": false,
"gold_sense1": 1,
"gold_sense2": 2
}
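To make the record layout concrete, here is a minimal Python sketch that reads the example above and pulls out the marked word from each sentence. The helper names `target_span` and `same_sense` are hypothetical and not part of any RUSSE tooling. Two assumptions are made: the `start`/`end` fields are character offsets into the sentence, and `label` is true exactly when the two gold senses match. Both are consistent with this single record but are not confirmed by the dataset description. In this record the end offset reaches one character past the word, so the sketch strips surrounding whitespace.

```python
import json

# The example record from the dataset description, copied verbatim.
RECORD_JSON = """
{
  "idx": 8,
  "word": "дорожка",
  "sentence1": "Бурые ковровые дорожки заглушали шаги",
  "sentence2": "Приятели решили выпить на дорожку в местном баре",
  "start1": 15,
  "end1": 23,
  "start2": 26,
  "end2": 34,
  "label": false,
  "gold_sense1": 1,
  "gold_sense2": 2
}
"""


def target_span(record: dict, which: int) -> str:
    """Return the marked word form in sentence `which` (1 or 2)."""
    sentence = record[f"sentence{which}"]
    start = record[f"start{which}"]
    end = record[f"end{which}"]
    # Assumption: offsets index characters, not bytes. The slice for this
    # record includes a trailing space, so whitespace is stripped.
    return sentence[start:end].strip()


def same_sense(record: dict) -> bool:
    """Assumed labeling rule: true when both gold senses are equal."""
    return record["gold_sense1"] == record["gold_sense2"]


record = json.loads(RECORD_JSON)
print(target_span(record, 1))
print(target_span(record, 2))
print(same_sense(record) == record["label"])
```

The two extracted forms are inflections of the lemma given in `word`, which is why a classifier for this task has to compare usages in context rather than match surface strings.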
How did we collect the data?
All text examples are sourced from the original Russe dataset, which was compiled for Russian semantic evaluation campaigns organized by ACL SIGSLAV. Human annotation was conducted via Yandex.Toloka. In Version 2, we manually curated a test set in the same format.
Provider:
OpenDataLab
Created:
2022-06-28
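Background and challenges
Background overview
RUSSE is a Russian word-in-context dataset for evaluating context-sensitive word embeddings. The task is framed as reading comprehension with binary (true/false) classification. The dataset is built from the original data of the RUSSE project, was collected with human evaluation, and includes a manually curated test set.
The summary above was collected and generated by 遇见数据集.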



