German Nominal Compounds Dataset for Compositionality Tests (deu-nn)
Source: DataCite Commons. Updated: 2024-08-08. Indexed: 2025-04-15.
Download link:
https://fdat.uni-tuebingen.de/records/54vmb-80e89
Description:
If you want to use this dataset for research purposes, please refer to the following sources:
- Verena Henrich, Erhard Hinrichs. 2011. Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. http://www.aclweb.org/anthology/R11-1058
- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.
The dataset is distributed under the Creative Commons Attribution-NonCommercial (CC-BY-NC) license.
The nominal compounds in this dataset were extracted from the list of 54,759 German compounds provided by the lexical database GermaNet, version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml.
As specified on the GermaNet page, "The list of compound data is free for academic research as defined in GermaNet's academic research licence agreement (http://www.sfs.uni-tuebingen.de/lsd/licenses.shtml). For any other intended purposes, please contact the GermaNet team."
Henrich and Hinrichs (2011) describe the automatic compound splitting that is performed before the manual post-correction.
The initial compound list was filtered to keep only those compounds and constituents with a minimum frequency of 50 in the TüBa-D/DP treebank, resulting in a list of 32,246 compounds, which were divided into train, test and dev splits (with 22,591, 6,442 and 3,213 compounds respectively). Each line of the train/test/dev files contains three space-separated fields in the order modifier, head, compound (e.g. Apfel Baum Apfelbaum).
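The three-field line format can be read with a few lines of Python. The following is a minimal sketch and not part of the dataset distribution: the names `Compound` and `parse_split` are illustrative, and the input is assumed to be UTF-8 text with one compound per line.

```python
from typing import Iterable, Iterator, NamedTuple


class Compound(NamedTuple):
    modifier: str
    head: str
    compound: str


def parse_split(lines: Iterable[str]) -> Iterator[Compound]:
    """Yield one Compound per non-empty 'modifier head compound' line."""
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        yield Compound(*fields)


# Example with the line given in the description.
examples = list(parse_split(["Apfel Baum Apfelbaum\n"]))
print(examples[0].modifier, examples[0].head, examples[0].compound)
```

Because `parse_split` accepts any iterable of strings, an open file handle (opened with `encoding="utf-8"`) can be passed to it directly. The names of the split files are not stated in this description, so the path has to be supplied by the user.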
For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.
The word embeddings for all constituents and compounds in this dataset are stored in the binary word2vec format in the file twe-lemmas.bin.
This format can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
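To illustrate what the binary word2vec layout looks like, the sketch below reads it with the Python standard library only. It assumes the layout written by the original word2vec tool: a text header line with vocabulary size and dimension, then for each entry the word, a space, and the vector as 32-bit floats. Little-endian byte order and UTF-8 encoded words are assumptions, and the function name `read_word2vec_binary` is illustrative.

```python
import io
import struct
from typing import BinaryIO, Dict, List


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, List[float]]:
    """Read word vectors from a stream in binary word2vec layout.

    Assumes little-endian float32 values and UTF-8 encoded words.
    """
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors: Dict[str, List[float]] = {}
    for _ in range(vocab_size):
        # A word is every byte up to the next space; the newline that
        # may terminate the previous record is skipped.
        chars = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("file ended inside a word")
            if byte == b" ":
                break
            if byte != b"\n":
                chars.extend(byte)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("file ended inside a vector")
        vectors[chars.decode("utf-8")] = list(struct.unpack(f"<{dim}f", raw))
    return vectors


# Round trip on a tiny in-memory file with two 2-dimensional vectors.
demo = io.BytesIO(
    b"2 2\n"
    + b"Apfel " + struct.pack("<2f", 1.0, 0.5) + b"\n"
    + b"Baum " + struct.pack("<2f", -2.0, 0.25) + b"\n"
)
print(read_word2vec_binary(demo)["Baum"])
```

For the full file, a library loader is the more practical route. With gensim, for instance, `KeyedVectors.load_word2vec_format("twe-lemmas.bin", binary=True)` reads this format and also provides similarity queries.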
The embeddings for the constituents and compounds were trained jointly on the lemmatized version of the TüBa-D/DP treebank, using the word2vec package (Mikolov et al. 2013).
The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018 and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012), and comprises 64.9M sentences and 1.3B tokens. The word embeddings were trained using the skip-gram model with negative sampling, with an embedding dimension of 200, a symmetric window of 10, 25 negative samples per positive training instance and a sample probability threshold of 0.0001. The minimum frequency cut-off was set to 50 for all words. The total vocabulary size amounts to 403,030.
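The description gives the hyperparameters but not the exact invocation. As a rough sketch, the stated values map onto flags of the original word2vec command-line tool as follows. The corpus file name is hypothetical, and the `-hs 0` and `-binary 1` settings are assumptions inferred from the use of negative sampling and the binary output file. The command is only assembled here, not executed.

```python
from typing import Dict, List

# Values stated in the dataset description, keyed by word2vec flag.
HYPERPARAMETERS: Dict[str, str] = {
    "-cbow": "0",        # 0 selects skip-gram rather than CBOW
    "-size": "200",      # embedding dimension
    "-window": "10",     # context words considered on each side
    "-negative": "25",   # negative samples per positive instance
    "-sample": "1e-4",   # sample probability threshold
    "-min-count": "50",  # minimum frequency cut-off
}


def build_word2vec_command(corpus: str, output: str) -> List[str]:
    """Assemble an argument list for the word2vec command-line tool."""
    command = ["word2vec", "-train", corpus, "-output", output]
    for flag, value in HYPERPARAMETERS.items():
        command += [flag, value]
    # Assumptions not stated in the description: hierarchical softmax
    # disabled, output written in binary format.
    command += ["-hs", "0", "-binary", "1"]
    return command


# "corpus.lemmas.txt" is a hypothetical name for the lemmatized corpus.
print(" ".join(build_word2vec_command("corpus.lemmas.txt", "twe-lemmas.bin")))
```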
Provider:
University of Tübingen
Created:
2024-08-07



