Computational Semantic Similarity Dataset

Name: Computational Semantic Similarity Dataset
Creator: Basque Government
Published: 2023-04-20 16:23:21
License: No description available

arXiv | Updated: 2023-04-20 | Indexed: 2024-08-06

Download link:

http://arxiv.org/abs/2304.09616v2

Resource description:

This study introduces a computationally derived semantic similarity dataset that draws on two widely used natural language processing resources, text corpora and knowledge bases, to quantify semantic similarity for Basque and Spanish. The dataset was built in three steps: (1) computing four key psycholinguistic features for each noun; (2) pairing nouns according to these four variables; and (3) assigning each noun pair three word-similarity measures, based on text-based, WordNet-based, and hybrid embeddings respectively. It currently covers noun pairs in Basque and European Spanish, with plans to extend to more languages. By providing these similarity measures, the dataset is intended to help researchers build psycholinguistic experiments while controlling variables that affect lexical processing, such as concreteness, frequency, and semantic and phonological neighborhood density, which broadens the dataset's usability and supports the interpretation of results.
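The pairing and scoring steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the feature values, the toy embedding vectors, the tolerance-based matching rule, and the helper names (`matched`, `cosine`, `build_pairs`) are hypothetical and are not taken from the dataset or its paper. Cosine similarity stands in for whichever similarity measures the authors actually used.

```python
from itertools import combinations
from math import sqrt

# Hypothetical, made-up feature values per noun, in the order:
# (concreteness, frequency, semantic density, phonological density).
FEATURES = {
    "etxe": (0.9, 0.8, 0.5, 0.4),
    "mendi": (0.9, 0.7, 0.5, 0.5),
    "amets": (0.2, 0.6, 0.3, 0.4),
}

# Hypothetical toy embeddings; real ones would come from a trained model.
EMBEDDINGS = {
    "etxe": (1.0, 0.0, 1.0),
    "mendi": (1.0, 1.0, 0.0),
    "amets": (0.0, 1.0, 1.0),
}


def matched(a, b, tol=0.15):
    """Return True if all four features of nouns a and b differ by at most tol."""
    return all(abs(x - y) <= tol for x, y in zip(FEATURES[a], FEATURES[b]))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def build_pairs(tol=0.15):
    """Pair nouns matched on all features and attach one similarity score."""
    pairs = []
    for a, b in combinations(sorted(FEATURES), 2):
        if matched(a, b, tol):
            pairs.append((a, b, cosine(EMBEDDINGS[a], EMBEDDINGS[b])))
    return pairs


if __name__ == "__main__":
    for a, b, sim in build_pairs():
        print(f"{a} - {b}: similarity {sim:.3f}")
```

In this sketch only nouns that agree on all four features within the tolerance are paired, so a difference in their similarity score is less likely to be confounded by those variables. A fuller version would attach three scores per pair, one for each embedding type named in the description.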

Provider:

Basque Government

Created:

2023-04-19