SemEval-2023 Task 9 Dataset
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/isegura/hulat_intimacy
Description:
This dataset contains multilingual tweets for intimacy analysis, split into training, validation, and test sets. During preprocessing, the text was cleaned, user mentions and URLs were removed, and the maximum token length was capped at 50. The splits were constructed so that the distribution of languages and of intimacy scores is consistent across all three. The training set contains 6,643 texts, the validation set 940, and the test set 1,908 (9,491 in total). The task is to predict the intimacy of multilingual tweets.
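The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the HULAT Team's actual code: the function name `clean_tweet`, the regular expressions, and the use of whitespace tokenization are all assumptions. The description does not say which tokenizer defines the 50-token cap, so a model-specific subword tokenizer may well have been used instead.

```python
import re

# Assumed patterns: the listing does not specify how mentions and URLs were matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text: str, max_tokens: int = 50) -> str:
    """Remove URLs and user mentions, then keep at most max_tokens tokens.

    Tokens here are whitespace-separated words (an assumption for illustration).
    """
    text = URL_RE.sub(" ", text)      # drop URLs first so their contents are not re-matched
    text = MENTION_RE.sub(" ", text)  # drop @user mentions
    tokens = text.split()             # also collapses repeated whitespace
    return " ".join(tokens[:max_tokens])


if __name__ == "__main__":
    print(clean_tweet("@user hello https://t.co/abc world"))
```

With whitespace tokenization the cap counts words; swapping in a subword tokenizer would change which texts get truncated.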
Provided by:
HULAT Team



