German Compounds Dataset for Compositionality Tests

Name: German Compounds Dataset for Compositionality Tests
Creator: University of Tübingen
Published: 2024-07-13 11:31:18
License: No description available

Source: DataCite Commons (updated 2024-07-13, indexed 2025-04-15)

Download link:

https://fdat.uni-tuebingen.de/records/tyza5-9kj67

Description:

The compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml. The automatic compound splitting that is performed before the manual post-correction is described in the following paper. If you use the split compounds in scientific or research work, please cite it: Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426. [Download paper: http://www.aclweb.org/anthology/R11-1058]

The initial compound list was filtered to keep only compounds with a frequency of at least 500 in the DECOW14AX corpus (https://webcorpora.org/). This left 34497 compounds, which were divided into train, test and dev splits of 24147, 6901 and 3449 compounds respectively.

The dataset includes a dictionary file, cmh_dict.txt, containing 41732 unique words, 8580 of which are modifiers and/or heads of a compound. The modifiers and heads appear first in the dictionary, and the compounds appear last.

Each line of the train/test/dev files has the following format:

index_modifier index_head index_compound

where index_modifier, index_head and index_compound are the 1-based indices of the modifier, head and compound in the dictionary file (cmh_dict.txt).
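The index format described above can be resolved back to words with a few lines of Python. The following is a minimal sketch, not part of the dataset: it assumes that cmh_dict.txt holds exactly one word per line (so that line i is the word with index i) and that the three indices in a split file are separated by whitespace. The function names and the toy vocabulary are hypothetical and serve only as illustration.

```python
from typing import Iterable, List, Tuple


def load_dictionary(path: str) -> List[str]:
    """Read the dictionary file into a list of words.

    Assumption (not confirmed by the dataset description): one word per
    line, so the word with 1-based index i is stored at list position i - 1.
    """
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def resolve_triples(
    lines: Iterable[str], words: List[str]
) -> List[Tuple[str, str, str]]:
    """Map 'index_modifier index_head index_compound' lines to word triples."""
    triples = []
    for line in lines:
        if not line.strip():
            continue  # tolerate empty lines, e.g. at the end of a file
        i_mod, i_head, i_comp = (int(x) for x in line.split())
        # The indices are 1-based; Python lists are 0-based.
        triples.append((words[i_mod - 1], words[i_head - 1], words[i_comp - 1]))
    return triples


# Toy data invented for illustration; it is NOT taken from the dataset.
# Modifiers and heads come first, compounds last, as in cmh_dict.txt.
toy_words = ["Haus", "Tür", "Garten", "Haustür", "Gartentür"]
toy_lines = ["1 2 4", "3 2 5"]

toy_triples = resolve_triples(toy_lines, toy_words)
print(toy_triples)
# [('Haus', 'Tür', 'Haustür'), ('Garten', 'Tür', 'Gartentür')]
```

With the real files, the same function could be applied as `resolve_triples(open(split_path, encoding="utf-8"), load_dictionary("cmh_dict.txt"))`, where `split_path` stands for whichever train, test or dev file is being read; the names of those files are not given in the description above.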

Provider:

University of Tübingen

Created:

2024-07-13
