Dataset of German lexicalized and transparent compounds

Name: Dataset of German lexicalized and transparent compounds
Creator: University of Tübingen
Published: 2024-05-19 15:42:12
License: No description available

DataCite Commons: updated 2024-05-19, indexed 2024-07-13

Download link:

https://fdat.uni-tuebingen.de/records/a5q53-yvk06

Description:

Dataset extracted from the de-nncom-sem annotated dataset (8005 compounds). It contains 648 compounds that are annotated as lexicalized in some way (lex_M, lex_H, lex_R, lex_HS, lex_MS). To balance the data, an additional 648 compounds that were not marked as lexicalized were randomly extracted from the 8005-compound dataset and added to this dataset.

The data was then filtered for the compounds (and their modifiers and heads) that occur with a minimum frequency of 101 in the word embeddings, which left 1053 compounds. Medizinfrau, Modepuppe and Abendland were removed to reach a neat 1050 compounds in the dataset (their frequencies were above 100).

Example entries:

    Hefekranz;Hefe;Kranz;lex_HS;1
    Bruchwand;Bruch;Wand;not_lexicalized;0

The first 3 columns contain the compound, modifier and head. The fourth column contains the lexicalization labels annotated by Dr. Heike Telljohann. The lexicalized examples are coded with 1 in the fifth column, the non-lexicalized ones with 0. Columns 5-7 list the frequencies of the compound, modifier and head respectively in the decow14ax full vocabulary.

The file de-ulex_dataset_freq.txt contains the original dataset with 1296 entries, while de-ulex_dataset_freq_gt100_shuf.txt contains the 1050 entries filtered for frequency > 100, which were used in chapter 6 of Dima (2019).
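As a rough illustration of the entry format, the following Python sketch parses the two example entries quoted in the description. It assumes that fields are separated by semicolons, as in those examples, and reads only the first five fields (compound, modifier, head, label, binary flag). Any further fields, such as the frequency columns, are ignored because their exact layout is not shown in the examples. The names `CompoundEntry` and `parse_entry` are hypothetical and not part of the dataset.

```python
from dataclasses import dataclass


@dataclass
class CompoundEntry:
    """One dataset row (hypothetical container, not defined by the dataset)."""
    compound: str
    modifier: str
    head: str
    label: str            # e.g. "lex_HS" or "not_lexicalized"
    is_lexicalized: bool  # True if the binary flag is "1"


def parse_entry(line: str) -> CompoundEntry:
    """Parse one semicolon-separated line; assumes at least five fields."""
    fields = line.strip().split(";")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {line!r}")
    # Only the first five fields are used; extra fields are ignored here.
    compound, modifier, head, label, flag = fields[:5]
    return CompoundEntry(compound, modifier, head, label, flag == "1")


# The two example entries quoted in the description.
examples = [
    "Hefekranz;Hefe;Kranz;lex_HS;1",
    "Bruchwand;Bruch;Wand;not_lexicalized;0",
]
entries = [parse_entry(line) for line in examples]

for entry in entries:
    print(entry.compound, entry.modifier, entry.head, entry.label, entry.is_lexicalized)
```

To read one of the actual files, the same function could be applied to each non-empty line of the file, after checking that the file really uses this separator and column order.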

Provider:

University of Tübingen

Created:

2024-05-19
