Dataset of German lexicalized and transparent compounds

DataCite Commons. Updated: 2024-05-19. Indexed: 2024-07-13.
Download link:
https://fdat.uni-tuebingen.de/records/a5q53-yvk06
Description:
Dataset extracted from the de-nncom-sem annotated dataset (8005 compounds). It contains 648 compounds that are annotated as lexicalized in some way (lex_M, lex_H, lex_R, lex_HS, lex_MS). To balance the data, an additional 648 compounds that were not marked as lexicalized were randomly drawn from the 8005-compound dataset and added, giving 1296 entries. Filtering for compounds whose compound, modifier and head each occur with a minimum frequency of 101 in the word embeddings left 1053 compounds. Medizinfrau, Modepuppe and Abendland were then removed to reach a neat 1050 compounds, even though they were above 100.

Example entries:

    Hefekranz;Hefe;Kranz;lex_HS;1
    Bruchwand;Bruch;Wand;not_lexicalized;0

The first three columns contain the compound, modifier and head. The fourth column contains the lexicalization label annotated by Dr. Heike Telljohann. The fifth column codes lexicalized examples as 1 and non-lexicalized examples as 0. The frequencies of the compound, modifier and head in the decow14ax full vocabulary are listed in three further columns (given as columns 5-7 in the original description, although the fifth column holds the binary code).

The file de-ulex_dataset_freq.txt contains the original dataset with 1296 entries, while de-ulex_dataset_freq_gt100_shuf.txt contains the 1050 entries filtered for frequency > 100, which were used in chapter 6 of Dima (2019).
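The entry format described above can be read with a few lines of Python. This is only a minimal sketch based on the two example entries: it assumes semicolon-separated fields with no header line, and it keeps any fields after the fifth (such as the frequencies) as raw strings because their exact layout is not confirmed here. The names Entry and parse_entry are hypothetical and not part of the dataset.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    compound: str
    modifier: str
    head: str
    label: str         # annotation label, e.g. lex_HS or not_lexicalized
    lexicalized: bool  # True if the fifth field is 1, False if it is 0
    extra: List[str]   # any further fields, kept unparsed (assumption)


def parse_entry(line: str) -> Entry:
    """Parse one semicolon-separated entry (hypothetical helper)."""
    fields = [f.strip() for f in line.strip().split(";")]
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {line!r}")
    compound, modifier, head, label, flag = fields[:5]
    if flag not in ("0", "1"):
        raise ValueError(f"fifth field must be 0 or 1, got {flag!r}")
    return Entry(compound, modifier, head, label, flag == "1", fields[5:])


# The two example entries quoted in the description.
sample = """Hefekranz;Hefe;Kranz;lex_HS;1
Bruchwand;Bruch;Wand;not_lexicalized;0"""

entries = [parse_entry(line) for line in sample.splitlines() if line.strip()]
for e in entries:
    print(e.compound, e.modifier, e.head, e.label, e.lexicalized)
```

To read one of the actual files, the same function could be applied to each non-empty line of the opened file, after checking that the separator and column order match this assumption.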
Provider:
University of Tübingen
Created:
2024-05-19