
Dutch Adverb-Adjective Phrase Dataset for Compositionality Tests (nld-adv-adj)

DataCite Commons | Updated 2024-08-08 | Indexed 2025-04-15
Download link:
https://fdat.uni-tuebingen.de/records/k84a2-rpj39
Resource description:
If you use this dataset for research purposes, please cite the following sources:

- Gertjan Van Noord, Gosse Bouma, Frank Van Eynde, Daniël De Kok, Jelmer Van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, pages 147–164. Springer.
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

This dataset contains 4,540 Dutch adverb-adjective phrases (3,183 train, 907 test, 450 dev) extracted from the Lassy Large treebank (Van Noord et al., 2013), which consists of written texts (Wikipedia, newspapers) and texts from the medical domain. The phrases were collected using the dependency annotations of the treebank: head-dependent pairs were extracted if they met the following requirements:

- the head is an attributive or predicative adjective and governs the dependent via the adverb relation
- the dependent immediately precedes the head

The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.

The train, test, and dev files share the same format, with the fields separated by tabs:

adverb adjective adv-adj_phrase (e.g. zeer moeizaam zeer_adv_adj_moeizaam)

For results of different composition models on this dataset, see Dima et al. (2019).
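The tab-separated format described above could be read along the following lines. This is a minimal sketch, not code shipped with the dataset: the names `AdvAdjPhrase`, `parse_lines`, and `read_split` are hypothetical, and UTF-8 encoding is an assumption, since the description does not state the file encoding.

```python
from typing import Iterable, List, NamedTuple

# Separator used in the third column, as given in the dataset description.
SEPARATOR = "_adv_adj_"

# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 3183, "test": 907, "dev": 450}


class AdvAdjPhrase(NamedTuple):
    """One row of a train/test/dev file (hypothetical container type)."""
    adverb: str
    adjective: str
    phrase: str


def parse_lines(lines: Iterable[str]) -> List[AdvAdjPhrase]:
    """Parse tab-separated rows of the form: adverb, adjective, phrase."""
    records = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, got {len(fields)}"
            )
        records.append(AdvAdjPhrase(*fields))
    return records


def read_split(path: str) -> List[AdvAdjPhrase]:
    """Read one split file; UTF-8 is assumed, not documented."""
    with open(path, encoding="utf-8") as handle:
        return parse_lines(handle)
```

Comparing `len(read_split(path))` against the matching entry in `SPLIT_SIZES` is a quick way to confirm that a download is complete; the actual file names are not given in the description, so `path` has to be filled in from the archive.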
Word embeddings for all adverbs, adjectives and phrases are stored in the binary word2vec format in lassy-adv-adj.bin, which can be loaded by several packages (e.g. the gensim package of Řehůřek and Sojka, 2010).

The word embeddings were trained on the lemmatized Lassy Large treebank with the word2vec package. Representations for the adjectives, adverbs and phrases were trained jointly. For the phrase representations, the adverb and the adjective were concatenated into a single unit using the separator _adv_adj_.

The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013). The embedding size is 200, the context is a symmetric window of 10, and training used 25 negative samples and a subsampling threshold of 0.0001. Representations were trained only for words and phrases that occur at least 30 times. The total vocabulary size is 290,704.
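As an illustration of how the embeddings might be used, the sketch below loads lassy-adv-adj.bin with gensim and looks up a phrase vector by joining adverb and adjective with the _adv_adj_ separator. The helper names `phrase_key`, `load_embeddings`, and `lookup_phrase` are hypothetical, and the code assumes gensim is installed and the file has been downloaded locally.

```python
SEPARATOR = "_adv_adj_"  # separator stated in the dataset description


def phrase_key(adverb: str, adjective: str) -> str:
    """Build the vocabulary key under which a phrase vector is stored."""
    return f"{adverb}{SEPARATOR}{adjective}"


def load_embeddings(path: str = "lassy-adv-adj.bin"):
    """Load the binary word2vec file with gensim (imported lazily)."""
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)


def lookup_phrase(vectors, adverb: str, adjective: str):
    """Return the phrase vector, or None if the phrase is not in the vocabulary.

    Phrases seen fewer than 30 times have no trained representation,
    so a missing key is an expected case rather than an error.
    """
    key = phrase_key(adverb, adjective)
    return vectors[key] if key in vectors else None
```

With the file in the working directory, `lookup_phrase(load_embeddings(), "zeer", "moeizaam")` would be expected to return a 200-dimensional vector, provided that phrase is part of the stored vocabulary.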
Provider:
University of Tübingen
Created:
2024-08-08