Multilingual Idiomaticity Detection and Sentence Embedding dataset

Name: Multilingual Idiomaticity Detection and Sentence Embedding dataset
Creator: H. Tayyar Madabushi et al.
License: Not specified

Source: arXiv (indexed 2025-09-30)

Download link:

https://github.com/H-TayyarMadabushi/SemEval_2022_Task2-idiomaticity

Description:

This dataset was collected from natural-language sources by a team of 12 annotators. Each instance is a sequence of three consecutive sentences in which the middle sentence contains a multi-word expression that may be used either literally or idiomatically. Because these are Potential Idiomatic Expressions (PIEs), the middle sentence may contain either an idiomatic or a non-idiomatic usage. In the test split, idiomatic and non-idiomatic instances were randomly undersampled to keep the two classes balanced. The associated task is to generate a contextually relevant continuation of a narrative context that contains either an idiomatic or a literal expression.
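
The random undersampling used to balance the test split can be illustrated with a minimal sketch. This is not the authors' code: the record fields (previous, target, next, label), the label values, and the function name undersample_balanced are hypothetical, chosen only to mirror the three-sentence structure described above.

```python
import random
from collections import defaultdict


def undersample_balanced(instances, label_key="label", seed=0):
    """Randomly drop instances from the larger classes so that every
    class ends up with as many instances as the smallest class."""
    rng = random.Random(seed)

    # Group instances by their class label.
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst[label_key]].append(inst)

    # The smallest class determines the size every class is cut down to.
    target = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        # Sample without replacement from each class.
        balanced.extend(rng.sample(group, target))

    rng.shuffle(balanced)
    return balanced


# Toy data with hypothetical field names: 5 idiomatic and 2 literal instances.
toy = [
    {"previous": f"prev {i}", "target": f"target {i}", "next": f"next {i}",
     "label": "idiomatic" if i < 5 else "literal"}
    for i in range(7)
]

balanced = undersample_balanced(toy, seed=1)
```

With this toy input, the literal class is the smaller one, so the result holds the same number of idiomatic and literal instances. Fixing the seed keeps the selection reproducible.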

Provided by:

H. Tayyar Madabushi et al.
