TH-WordSim-353, TH-SimLex-999, TH-SemEval-500
Source: arXiv | Updated: 2019-04-09 | Indexed: 2024-06-21
Download link:
https://github.com/gwohlgen/thai_word_similarity
Description:
This study created three word similarity datasets for Thai: TH-WordSim-353, TH-SimLex-999, and TH-SemEval-500, which together contain 1,852 word pairs. The datasets were produced by translating and re-rating the original English datasets WordSim-353, SimLex-999, and SemEval-2017-Task-2. They differ in difficulty, domain coverage, and the notion of similarity they capture (relatedness versus similarity), and are intended to provide a comprehensive evaluation of Thai word embedding models. To ensure data quality, the word pairs were translated by experts and rated by native Thai speakers. The datasets are suited to Thai natural language processing, in particular to evaluating and improving word embedding models so that they better handle problems specific to Thai.
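Word similarity datasets of this kind are commonly used by comparing a model's cosine similarity for each word pair against the human rating and reporting the Spearman rank correlation. The following is a minimal sketch of that general procedure, not the authors' evaluation script. The names `evaluate`, `toy_vectors`, and `toy_pairs` are hypothetical, the vectors and scores are made-up placeholders rather than entries from the datasets, and the on-disk file format in the repository is not assumed here. Skipping pairs with out-of-vocabulary words is one possible policy among several.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, vectors):
    """Return (Spearman rho, number of skipped pairs).

    pairs   -- iterable of (word1, word2, human_score)
    vectors -- dict mapping word -> list of floats
    Pairs containing a word missing from `vectors` are skipped (assumed policy).
    """
    human, model, skipped = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vectors or w2 not in vectors:
            skipped += 1
            continue
        human.append(score)
        model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model), skipped


# Placeholder data for illustration only; not taken from the datasets.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.5, 0.5],
}
toy_pairs = [
    ("a", "b", 9.0),
    ("a", "d", 5.0),
    ("a", "c", 1.0),
    ("a", "zzz", 3.0),  # "zzz" has no vector, so this pair is skipped
]

rho, skipped = evaluate(toy_pairs, toy_vectors)
print(f"Spearman rho = {rho:.3f}, skipped pairs = {skipped}")
```

When reporting results on real data, it helps to state the number of skipped pairs alongside the correlation, since out-of-vocabulary handling can change the score noticeably.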
Provided by:
King Mongkut's Institute of Technology Ladkrabang (KMITL)
Created:
2019-04-09



