Computationally grounded semantic similarity datasets for Basque and Spanish

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/6xr2rp8gvh

Description:

These word similarity datasets address a gap in psycholinguistic research by providing a comprehensive set of noun pairs with quantified semantic similarity. The datasets are built from two well-known kinds of Natural Language Processing resources: text corpora and knowledge bases. They aim to facilitate research in lexical processing by incorporating variables that play a significant role in semantic analysis.

The dataset in this repository covers noun pairs in Basque and European Spanish. It offers a rich collection of noun pairs with associated psycholinguistic features and word similarity measurements, which researchers can use to explore semantic similarity and lexical processing across languages.

Each noun is associated with four linguistic features: concreteness (CNC), word frequency (FRQ), semantic neighborhood density (SND), and phonemic neighborhood density (PND). Each feature comes with its corresponding high- and low-valued clusters.

The matching of noun pairs is entirely computational. For each noun, we search for every noun that matches its clusters across all four features. For each matching noun pair, we compute three types of word similarity: text embeddings similarity (SIM_TXT), WordNet-based embeddings similarity (SIM_WN), and hybrid embeddings similarity (SIM_HYB).

Each line of the dataset contains the following columns:
- noun1
- noun2
- SIM_TXT
- SIM_WN
- SIM_HYB
- CNC value of noun1
- Cluster identifier for the CNC of noun1
- CNC value of noun2
- Cluster identifier for the CNC of noun2
- FRQ of noun1
- Cluster identifier for the FRQ of noun1
- FRQ of noun2
- Cluster identifier for the FRQ of noun2
- PND of noun1
- Cluster identifier for the PND of noun1
- PND of noun2
- Cluster identifier for the PND of noun2
- SND of noun1
- Cluster identifier for the SND of noun1
- SND of noun2
- Cluster identifier for the SND of noun2
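
The matching procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: it assumes that two nouns "match" when they share the same cluster identifier in every one of the four features, that similarity is the cosine between two embedding vectors (the description does not name the measure), and that each unordered pair is emitted once. The function names, the cluster labels "high" and "low", and the toy vectors are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math

# Feature abbreviations as given in the dataset description.
FEATURES = ("CNC", "FRQ", "SND", "PND")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_pairs(clusters, embeddings):
    """Pair nouns whose cluster identifiers agree on all four features.

    clusters:   noun -> {feature name: cluster identifier}
    embeddings: noun -> embedding vector (list of floats)
    Returns a list of (noun1, noun2, similarity) tuples.
    """
    # Group nouns by their full tuple of cluster identifiers.
    groups = defaultdict(list)
    for noun, feats in clusters.items():
        key = tuple(feats[f] for f in FEATURES)
        groups[key].append(noun)

    # Every pair inside a group matches on all four features.
    pairs = []
    for nouns in groups.values():
        for n1, n2 in combinations(sorted(nouns), 2):
            pairs.append((n1, n2, cosine(embeddings[n1], embeddings[n2])))
    return pairs


# Toy input: invented cluster labels and vectors, for illustration only.
clusters = {
    "etxe": {"CNC": "high", "FRQ": "high", "SND": "low", "PND": "low"},
    "mendi": {"CNC": "high", "FRQ": "high", "SND": "low", "PND": "low"},
    "amets": {"CNC": "low", "FRQ": "high", "SND": "low", "PND": "low"},
}
embeddings = {"etxe": [1.0, 0.0], "mendi": [1.0, 1.0], "amets": [0.0, 1.0]}

pairs = match_pairs(clusters, embeddings)
```

In this toy input only "etxe" and "mendi" share all four clusters, so they form the single resulting pair. Running the same routine once per embedding space (text, WordNet-based, hybrid) would give one similarity score per space for each pair.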
The feature dictionaries used to create the word similarity datasets are included in the repository. Each line of a dictionary consists of the noun, its corresponding feature cluster, the normalized value of the feature measurement, and the raw value of the feature measurement.
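
A minimal sketch of reading one line of a feature dictionary is shown here. The field order follows the description above (noun, cluster, normalized value, raw value), but the tab delimiter, the class and function names, and the sample line are assumptions made for illustration; the actual files should be checked before relying on this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureEntry:
    noun: str
    cluster: str       # cluster identifier, kept as text
    normalized: float  # normalized feature value
    raw: float         # raw feature value


def parse_entry(line, sep="\t"):
    """Parse one dictionary line; the separator is an assumption."""
    noun, cluster, normalized, raw = line.rstrip("\n").split(sep)
    return FeatureEntry(noun, cluster, float(normalized), float(raw))


# Invented sample line, not taken from the repository.
entry = parse_entry("etxe\thigh\t0.82\t4.1\n")
```

A line with a different number of fields raises a `ValueError` at the unpacking step, which makes a wrong delimiter guess easy to spot.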

Created:

2024-10-14
