German Compounds Dataset for Compositionality Tests
Source: DataCite Commons. Updated: 2024-07-13. Indexed: 2025-04-15.
Download link:
https://fdat.uni-tuebingen.de/records/tyza5-9kj67
Description:
The compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at
http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml.
The following paper describes the automatic compound splitting that is performed before the manual post-correction. If you want to use the split compounds in the context of scientific or research work, please refer to the paper:
Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. [Download paper: http://www.aclweb.org/anthology/R11-1058]
The initial compound list was filtered to only those compounds that had a minimum frequency of 500 in the DECOW14AX corpus (https://webcorpora.org/), resulting in a list of 34497 compounds, which were split into the train, test and dev splits (with 24147, 6901 and 3449 compounds respectively).
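As a quick sanity check on the figures above, the following sketch confirms that the three split sizes add up to the filtered total and shows the share of each split. The variable names are illustrative only and are not part of the dataset.

```python
# Split sizes as stated in the dataset description.
sizes = {"train": 24147, "test": 6901, "dev": 3449}

# The three splits should account for all 34497 filtered compounds.
total = sum(sizes.values())

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}

print(total)
print(shares)
```

The shares come out at roughly 70/20/10 for train/test/dev.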
The dataset includes a dictionary file, cmh_dict.txt, containing 41732 unique words, 8580 of which are modifiers and/or heads of a compound. The modifiers and heads appear first in the dictionary, and the compounds appear last.
The train/test/dev files have the following format:
index_modifier index_head index_compound
where index_modifier, index_head and index_compound are the 1-based indices of the modifier, head and compound in the dictionary file (cmh_dict.txt).
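The index format above can be resolved back to words with a few lines of Python. This is a minimal sketch under assumptions the description does not confirm: that cmh_dict.txt holds one word per line, that the three indices on each line are separated by whitespace, and that the files are UTF-8 encoded. The function names, the file name train.txt and the toy data are hypothetical and are not taken from the dataset.

```python
import os
import tempfile


def load_dictionary(path):
    """Read the dictionary into a list (assumption: one word per line)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]


def load_triples(path, words):
    """Yield (modifier, head, compound) strings for each line of a split file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            i_mod, i_head, i_comp = (int(x) for x in line.split())
            # Indices in the split files are 1-based; Python lists are 0-based.
            yield words[i_mod - 1], words[i_head - 1], words[i_comp - 1]


# Toy demo with made-up data that mimics the described layout:
# modifiers and heads first, compounds last.
with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "cmh_dict.txt")
    split_path = os.path.join(tmp, "train.txt")  # hypothetical file name
    with open(dict_path, "w", encoding="utf-8") as f:
        f.write("Apfel\nBaum\nApfelbaum\n")
    with open(split_path, "w", encoding="utf-8") as f:
        f.write("1 2 3\n")
    words = load_dictionary(dict_path)
    triples = list(load_triples(split_path, words))

print(triples)
```

Forgetting the 1-based offset would silently shift every lookup by one entry, so it is worth spot-checking a few resolved triples against the real files before training on them.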
Provider:
University of Tübingen
Created:
2024-07-13



