German Compounds Dataset for Compositionality Tests

DataCite Commons. Updated 2024-07-13; indexed 2025-04-15.
Download link:
https://fdat.uni-tuebingen.de/records/tyza5-9kj67
Description:
The compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml. The following paper describes the automatic compound splitting that is performed before the manual post-correction. If you use the split compounds in scientific or research work, please cite the paper: Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. [Download paper: http://www.aclweb.org/anthology/R11-1058]

The initial compound list was filtered to keep only those compounds with a minimum frequency of 500 in the DECOW14AX corpus (https://webcorpora.org/). This resulted in a list of 34497 compounds, which were divided into train, test and dev splits (with 24147, 6901 and 3449 compounds respectively).

The dataset consists of a dictionary file, cmh_dict.txt, containing 41732 unique words, 8580 of which are modifiers and/or heads of a compound. The modifiers and heads appear first in the dictionary, and the compounds appear last. The train/test/dev files have the following format: index_modifier index_head index_compound, where index_modifier, index_head and index_compound are the 1-based indices of the modifier, head and compound in the dictionary file (cmh_dict.txt).
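The index format described above can be resolved back to words with a short script. The following is a minimal sketch, not part of the dataset's own tooling. It assumes that cmh_dict.txt holds one word per line in UTF-8, that the three indices on each split line are separated by whitespace, and that the function names `load_dictionary` and `load_split` as well as the toy file name `toy_split.txt` are purely illustrative. The record does not state the actual names of the train/test/dev files, so none are assumed here.

```python
import tempfile
from pathlib import Path


def load_dictionary(path):
    """Return the dictionary as a list of words.

    Assumption: one word per line, UTF-8 encoded. Lines are kept in
    file order so that list position + 1 equals the 1-based index.
    """
    return Path(path).read_text(encoding="utf-8").splitlines()


def load_split(path, words):
    """Return (modifier, head, compound) word triples for one split file.

    Assumption: each line holds three whitespace-separated 1-based
    indices in the order index_modifier index_head index_compound.
    """
    triples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        m, h, c = (int(p) for p in parts)
        # Convert 1-based dataset indices to 0-based list positions.
        triples.append((words[m - 1], words[h - 1], words[c - 1]))
    return triples


if __name__ == "__main__":
    # Toy data only: a three-word dictionary and a one-line split file.
    with tempfile.TemporaryDirectory() as tmp:
        dict_path = Path(tmp) / "cmh_dict.txt"
        split_path = Path(tmp) / "toy_split.txt"  # hypothetical file name
        dict_path.write_text("Haus\nTür\nHaustür\n", encoding="utf-8")
        split_path.write_text("1 2 3\n", encoding="utf-8")
        words = load_dictionary(dict_path)
        print(load_split(split_path, words))
```

Subtracting 1 from each index is the one step that is easy to get wrong, since Python lists are 0-based while the dataset indices are 1-based.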
Provider:
University of Tübingen
Created:
2024-07-13