Computationally grounded semantic similarity datasets for Basque and Spanish
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/6xr2rp8gvh
Description:
These word similarity datasets address a gap in psycholinguistic research by providing a comprehensive
set of noun pairs with quantified semantic similarity. The datasets are built from two well-known Natural Language Processing
resources: text corpora and knowledge bases. They aim to facilitate research in lexical processing by incorporating variables that play
a significant role in semantic analysis.
The dataset included in this repository provides noun-pair information for Basque and European Spanish. It offers a rich collection
of noun pairs with associated psycholinguistic features and word similarity measurements. Researchers can use this dataset to
explore semantic similarity and lexical processing across languages.
In the dataset, each noun is associated with four linguistic features: concreteness (CNC), word frequency (FRQ), semantic neighborhood density (SND), and phonemic neighborhood density (PND). The values of each feature are grouped into high- and low-valued clusters. The matching of noun pairs is entirely computational: for each noun, we search for every noun that matches its clusters across all four features. For each matching noun pair, we compute three types of word similarity: text embedding similarity (SIM_TXT), WordNet-based embedding similarity (SIM_WN), and hybrid embedding similarity (SIM_HYB).
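The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that two nouns "match" when they fall in the same cluster for all four features, and it uses cosine similarity as a stand-in for the embedding similarity measure, which the description does not specify. The function names, the cluster labels "high" and "low", and the toy input are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math

FEATURES = ("CNC", "FRQ", "SND", "PND")


def match_pairs(clusters):
    """Return noun pairs whose cluster labels agree on every feature.

    `clusters` maps noun -> {feature name: cluster label}.
    """
    # Group nouns by their four-feature cluster signature.
    groups = defaultdict(list)
    for noun, labels in clusters.items():
        signature = tuple(labels[f] for f in FEATURES)
        groups[signature].append(noun)
    # Every unordered pair inside a group is a match.
    pairs = []
    for nouns in groups.values():
        pairs.extend(combinations(sorted(nouns), 2))
    return sorted(pairs)


def cosine(u, v):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy input with made-up cluster labels (not taken from the dataset).
toy_clusters = {
    "etxe": {"CNC": "high", "FRQ": "high", "SND": "low", "PND": "low"},
    "mendi": {"CNC": "high", "FRQ": "high", "SND": "low", "PND": "low"},
    "amets": {"CNC": "low", "FRQ": "high", "SND": "low", "PND": "low"},
}

print(match_pairs(toy_clusters))
```

Grouping by signature avoids comparing every noun against every other noun: only nouns that already share all four cluster labels are paired.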
The dataset includes the following columns in each line:
- noun1
- noun2
- SIM_TXT
- SIM_WN
- SIM_HYB
- CNC value of noun1.
- Cluster identifier for the CNC of noun1.
- CNC value of noun2.
- Cluster identifier for the CNC of noun2.
- FRQ of noun1.
- Cluster identifier for the FRQ of noun1.
- FRQ of noun2.
- Cluster identifier for the FRQ of noun2.
- PND of noun1.
- Cluster identifier for the PND of noun1.
- PND of noun2.
- Cluster identifier for the PND of noun2.
- SND of noun1.
- Cluster identifier for the SND of noun1.
- SND of noun2.
- Cluster identifier for the SND of noun2.
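A reader for this 21-column layout might look like the sketch below. The description does not state the field delimiter, whether a header row is present, or the exact column names, so the tab delimiter, the absence of a header, and names such as `CNC_cluster_noun1` are assumptions made for illustration. The sample row contains made-up values.

```python
import csv
import io

# Feature order follows the column list above: CNC, FRQ, PND, SND.
FEATURES = ("CNC", "FRQ", "PND", "SND")

# Hypothetical column names, built in the documented order.
COLUMNS = ["noun1", "noun2", "SIM_TXT", "SIM_WN", "SIM_HYB"]
for feature in FEATURES:
    for noun in ("noun1", "noun2"):
        COLUMNS.append(f"{feature}_{noun}")          # feature value
        COLUMNS.append(f"{feature}_cluster_{noun}")  # cluster identifier


def read_pairs(handle, delimiter="\t"):
    """Yield one dict per line, with the similarity scores parsed as floats."""
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        record = dict(zip(COLUMNS, row))
        for key in ("SIM_TXT", "SIM_WN", "SIM_HYB"):
            record[key] = float(record[key])
        yield record


# One made-up row: two nouns, three similarities, then 8 (value, cluster) pairs.
sample_fields = ["etxe", "mendi", "0.41", "0.35", "0.38"] + ["0.5", "high"] * 8
sample = io.StringIO("\t".join(sample_fields) + "\n")
rows = list(read_pairs(sample))
print(rows[0]["noun1"], rows[0]["SIM_HYB"], rows[0]["CNC_cluster_noun2"])
```

Feature values and cluster identifiers are left as strings here because their types are not documented; convert them once the actual file format is known.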
The feature dictionaries used to create the word similarity datasets are included in the repository. Each line of a dictionary comprises the noun, its corresponding feature cluster, the normalized value of the feature measurement, and the raw value of the feature measurement.
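As a rough sketch, one feature dictionary could be loaded into a noun-keyed lookup as shown below. It assumes tab-separated fields in the order just listed and numeric values that parse as floats. Neither is confirmed by the description, and `FeatureEntry`, `parse_entry`, `load_dictionary`, and the toy lines are hypothetical.

```python
from typing import NamedTuple


class FeatureEntry(NamedTuple):
    noun: str
    cluster: str       # feature cluster of the noun
    normalized: float  # normalized feature value
    raw: float         # raw feature value


def parse_entry(line, delimiter="\t"):
    """Parse one dictionary line into a FeatureEntry (assumed field order)."""
    noun, cluster, normalized, raw = line.rstrip("\n").split(delimiter)
    return FeatureEntry(noun, cluster, float(normalized), float(raw))


def load_dictionary(lines, delimiter="\t"):
    """Map each noun to its entry, skipping blank lines."""
    entries = (parse_entry(line, delimiter) for line in lines if line.strip())
    return {entry.noun: entry for entry in entries}


# Made-up lines for illustration only.
toy_lines = ["etxe\thigh\t0.82\t4.10\n", "amets\tlow\t0.21\t1.05\n"]
frq = load_dictionary(toy_lines)
print(frq["etxe"].cluster, frq["amets"].raw)
```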
Created:
2024-10-14



