
German Nominal Compounds Dataset for Compositionality Tests

Source: DataCite Commons. Updated: 2024-05-19. Indexed: 2024-07-13.
Download link:
https://fdat.uni-tuebingen.de/records/6wszk-fkg48
Description:
The nominal compounds in this dataset were extracted from the list of 54759 German compounds in GermaNet version 9.0, available at http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml. As specified on the GermaNet page, the list of compound data is free for academic research as defined in GermaNet's academic research licence agreement (http://www.sfs.uni-tuebingen.de/lsd/licenses.shtml). For any other intended purposes, please contact the GermaNet team.

The following paper describes the automatic compound splitting that is performed before the manual post-correction. If you want to use the split compounds in the context of scientific or research work, please refer to the paper:

    Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, September 2011, pp. 420-426.
    [Download paper: http://www.aclweb.org/anthology/R11-1058]

The initial compound list was filtered to only those compounds that had a minimum frequency of 500 in the DECOW14AX corpus (https://webcorpora.org/), resulting in a list of 34497 compounds. These were divided into train, test and dev splits (with 24147, 6901 and 3449 compounds respectively). The words in the dataset are lowercased, but the original casing can be recovered if needed by consulting the GermaNet database.

The train/test/dev files have the following format:

    modifier head compound (e.g. abfall tonne abfalltonne)

For more details about the filtered dataset, as well as some results of compositionality models evaluated on this dataset, see Dima (2015). The dataset is also used under the name "mixed" dataset in Dima (2019).

Keywords: German nominal compounds, semantic composition, compositional distributional representations
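The file format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset distribution: it assumes the three fields are separated by whitespace and that the files are UTF-8 encoded, neither of which the description states explicitly. The names CompoundEntry, parse_entries and load_split are hypothetical, and the sketch does not assume any particular file names for the splits.

```python
from pathlib import Path
from typing import Iterable, List, NamedTuple


class CompoundEntry(NamedTuple):
    """One line of a train/test/dev file (hypothetical helper type)."""
    modifier: str
    head: str
    compound: str


def parse_entries(lines: Iterable[str]) -> List[CompoundEntry]:
    """Parse lines of the form 'modifier head compound'."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: fields are whitespace-separated
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        entries.append(CompoundEntry(*fields))
    return entries


def load_split(path: str) -> List[CompoundEntry]:
    """Read one split file; the UTF-8 encoding is an assumption."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_entries(handle)


# Split sizes as reported in the dataset description.
SPLIT_SIZES = {"train": 24147, "test": 6901, "dev": 3449}
TOTAL = sum(SPLIT_SIZES.values())

# Demonstration on the example line from the description.
sample = parse_entries(["abfall tonne abfalltonne\n"])
print(sample[0])
for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} ({size / TOTAL:.1%})")
```

The reported split sizes sum to the 34497 compounds that remain after filtering, which corresponds to roughly a 70/20/10 train/test/dev division. The parser deliberately does not check that the compound equals the modifier and head concatenated, since German compounds can contain linking elements between their constituents.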
Provider:
University of Tübingen
Created:
2024-05-19