Compositionality of Nominal Compounds

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/8296688

Resource description:

Originally published at https://pageperso.lis-lab.fr/carlos.ramisch/?page=downloads/compounds

=> comp-datasets-release-v2.tar.gz

This package contains numerical judgements by native speakers on the compositionality of 190 nominal compounds in English (EN), 180 nominal compounds in French (FR), and 180 nominal compounds in Brazilian Portuguese (PT). The English data is split into two parts. The original 90 English compounds were annotated to complement the 90 compounds in the Reddy dataset (see below). The "extra" 100 English compounds were annotated to perform generalisation experiments in the Computational Linguistics paper (Section 6.3).

Judgements were obtained using Amazon Mechanical Turk (EN and FR) and a web interface for volunteers (PT). Every compound has 3 scores: compositionality of the head word, compositionality of the modifier word, and compositionality of the whole compound. Scores range from 1 (fully idiomatic) to 5 (fully compositional) and are averaged over several annotators (around 10 to 20, depending on the language). All compounds also have synonyms and similar expressions given by annotators.

The datasets are described in detail and used in the experiments of the papers below. Please cite one of them if you use this material in your research.

- Unsupervised Compositionality Prediction of Nominal Compounds
- How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality
- Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time
- Filtering and Measuring the Intrinsic Quality of Human Compositionality Judgments

Our methodology is inspired by Reddy, McCarthy and Manandhar (2011). We used their set of 90 compounds and judgments for the analyses and experiments on English in our papers above. However, their dataset is not included in this package. Please also download their data and cite their paper to obtain an English dataset fully comparable to the one used in our experiments.
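As a rough illustration of how the three per-compound scores described above could be derived, the sketch below averages individual 1-to-5 judgements for the head, the modifier, and the whole compound. The function name, the dictionary layout, and the sample ratings are assumptions made for this example; they do not reflect the actual file format of the package or the authors' own scripts.

```python
from statistics import mean

def average_scores(per_annotator):
    """Average 1-5 judgements per aspect (head, modifier, whole).

    `per_annotator` is a hypothetical mapping from aspect name to the
    list of ratings given by individual annotators.
    """
    averaged = {}
    for aspect, ratings in per_annotator.items():
        # The scale runs from 1 (fully idiomatic) to 5 (fully compositional).
        if not ratings or any(r < 1 or r > 5 for r in ratings):
            raise ValueError(f"ratings for {aspect!r} must be non-empty and in 1..5")
        averaged[aspect] = float(mean(ratings))
    return averaged

# Made-up ratings for a single compound, for illustration only.
example = {"head": [5, 4, 5], "modifier": [2, 1, 3], "whole": [2, 2, 2]}
print(average_scores(example))
```

The released scores are already averaged, so a step like this would only matter when working from individual judgements.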
=> lexsub-nc.tar.gz

This package is an extension of the original compositionality datasets and includes more detailed annotation of the Portuguese lexical substitution candidates in the original dataset. It contains the same 180 nominal compounds in Portuguese as the compositionality dataset. It additionally contains frequency and PMI from a large Brazilian Portuguese corpus (around 1.2 billion words), as well as lexical substitutes annotated according to the following categories:

- Invalid: the substitution candidate is not fit for substitution, either because it is too specific to a given context or because it is simply not valid for the target MWE.
- Syn-SW: the substitution candidate is a single-word exact synonym of the target MWE.
- NearSyn-SW: the substitution candidate is a single-word near-synonym of the target MWE.
- Syn-MWE: the substitution candidate is a multiword exact synonym of the target MWE.
- NearSyn-MWE: the substitution candidate is a multiword near-synonym of the target MWE.
- Paraphrase: the substitution candidate is a paraphrase of the target MWE.
- Definition: the substitution candidate is a definition of the target MWE.
- Head
- Modifier

The lexical substitutes were provided by volunteer native-speaker annotators, who were asked to suggest substitution candidates for the compounds in context. The suggestions from all annotators were pooled and sorted by frequency. This pool was then manually categorized by a linguist, who assigned a category to each distinct substitution candidate.
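The pooling step and the PMI figure mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the whitespace and case normalisation is an assumption, and the PMI shown is the standard two-word formulation with a base-2 logarithm, which may differ from the variant used to produce the values in the package.

```python
import math
from collections import Counter

def pool_suggestions(per_annotator):
    """Pool substitution candidates from all annotators, most frequent first.

    `per_annotator` is a hypothetical list with one list of suggestions
    per annotator. Normalising case and whitespace is an assumption.
    """
    counts = Counter()
    for suggestions in per_annotator:
        counts.update(s.strip().lower() for s in suggestions)
    return counts.most_common()

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information of a two-word compound.

    Computes log2( P(w1 w2) / (P(w1) * P(w2)) ) from raw corpus counts,
    with every probability estimated as count / total.
    """
    return math.log2(pair_count * total / (w1_count * w2_count))

# Made-up suggestions and counts, for illustration only.
print(pool_suggestions([["a", "b"], ["A "], ["b", "a"]]))
print(pmi(pair_count=8, w1_count=16, w2_count=16, total=64))
```

The manual categorisation by a linguist is a human step and is not modelled here.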

Created:

2023-08-29
