Dataset of German lexicalized and transparent compounds
DataCite Commons. Updated 2024-05-19; indexed 2024-07-13.
Download link:
https://fdat.uni-tuebingen.de/records/a5q53-yvk06
Description:
Dataset extracted from the de-nncom-sem annotated dataset (8005 compounds). It contains 648 compounds annotated as lexicalized in some way (lex_M, lex_H, lex_R, lex_HS, lex_MS). A further 648 compounds that were not marked as lexicalized were randomly drawn from the same 8005 compounds and added to balance the data, giving 1296 entries.
The data was then filtered to keep only entries whose compound, modifier and head each occur with a minimum frequency of 101 in the word embeddings, which leaves 1053 compounds.
Medizinfrau, Modepuppe and Abendland were removed to reach a round 1050 compounds in the dataset (all three were above the frequency threshold of 100).
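The frequency filter described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the `Entry` type, the `passes_filter` helper and the frequency counts are all hypothetical placeholders, and the only thing taken from the description is the threshold of at least 101 occurrences for compound, modifier and head.

```python
from typing import NamedTuple

# "min freq. 101", i.e. a frequency strictly greater than 100
MIN_FREQ = 101


class Entry(NamedTuple):
    """Hypothetical record type: one compound with its three frequencies."""
    compound: str
    modifier: str
    head: str
    freqs: tuple  # (compound, modifier, head) frequencies


def passes_filter(entry, min_freq=MIN_FREQ):
    """Keep an entry only if compound, modifier and head all reach min_freq."""
    return all(freq >= min_freq for freq in entry.freqs)


# The counts below are placeholders for illustration,
# NOT the real decow14ax frequencies of these words.
entries = [
    Entry("Hefekranz", "Hefe", "Kranz", (150, 9000, 12000)),
    Entry("Bruchwand", "Bruch", "Wand", (100, 30000, 80000)),
]

kept = [entry for entry in entries if passes_filter(entry)]
```

With these placeholder counts only the first entry is kept, because the second has a compound frequency of exactly 100, which is below the minimum of 101.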
Example entries:
Hefekranz;Hefe;Kranz;lex_HS;1
Bruchwand;Bruch;Wand;not_lexicalized;0
The first three columns contain the compound, modifier and head. The fourth column contains the lexicalization labels annotated by Dr. Heike Telljohann. The fifth column holds a binary code: 1 for lexicalized examples, 0 for non-lexicalized ones.
Columns 6-8 list the frequencies of the compound, modifier and head, respectively, in the full decow14ax vocabulary.
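As a rough illustration of the format, the semicolon-separated entries could be read along the following lines. This is only a sketch under assumptions: the `CompoundEntry` class and `parse_entries` function are hypothetical names, the sample text is the two example entries shown above, and any fields after the fifth (such as the frequency columns) are kept as raw strings because their exact layout in the files is not shown here.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class CompoundEntry:
    """Hypothetical container for one line of the dataset."""
    compound: str
    modifier: str
    head: str
    label: str            # e.g. lex_HS or not_lexicalized
    is_lexicalized: bool  # True if the binary column is "1"
    extra: list = field(default_factory=list)  # any further fields, unparsed


def parse_entries(text):
    """Parse semicolon-separated lines into CompoundEntry objects."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    entries = []
    for row in reader:
        if len(row) < 5:
            continue  # skip blank or incomplete lines
        compound, modifier, head, label, flag = row[:5]
        entries.append(
            CompoundEntry(compound, modifier, head, label, flag == "1", row[5:])
        )
    return entries


sample = (
    "Hefekranz;Hefe;Kranz;lex_HS;1\n"
    "Bruchwand;Bruch;Wand;not_lexicalized;0\n"
)
entries = parse_entries(sample)
lexicalized = [entry.compound for entry in entries if entry.is_lexicalized]
```

To read one of the actual files, the same function could be applied to the file contents, for example `parse_entries(open(path, encoding="utf-8").read())`, assuming the files are UTF-8 encoded and have no header line.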
The file de-ulex_dataset_freq.txt contains the original dataset with 1296 entries, while de-ulex_dataset_freq_gt100_shuf.txt contains the 1050 entries filtered for frequency > 100, which were used in chapter 6 of Dima (2019).
Provided by:
University of Tübingen
Created:
2024-05-19



