German Adverb-Adjective Phrase Dataset for Compositionality Tests (deu-adv-adj)

Name: German Adverb-Adjective Phrase Dataset for Compositionality Tests (deu-adv-adj)
Creator: University of Tübingen
Published: 2024-08-08 08:26:35
License: Creative Commons Attribution NonCommercial (CC-BY-NC)

Source: DataCite Commons. Updated: 2024-08-08. Indexed: 2025-04-15.

Download link:

https://fdat.uni-tuebingen.de/records/7casr-x0p36

Description:

If you want to use this dataset for research purposes, please refer to the following sources:

- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island: a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The 23,488 German adverb-adjective phrases (split into 16,441 train, 4,701 test, and 2,346 dev instances) were extracted from the TüBa-D/DP treebank. The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018, and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012). It has a size of 64.9M sentences and 1.3B tokens.

The dataset was constructed using the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled the following requirements:

- the head is an attributive or predicative adjective and governs the dependent with the adverb relation
- the dependent immediately precedes the head

The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.

Each line of the train/test/dev files has three parts separated by a space:

adverb adjective phrase

In the phrase, the adverb and the adjective are joined by the string _adv_adj_ (e.g. immer leer immer_adv_adj_leer).

For results of different composition models on this dataset, see Dima et al. (2019).

The word representations were trained on the lemmatized TüBa-D/DP treebank with the word2vec package. The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013). The embedding size is 200, the context is a symmetric window of 10 words, 25 negative samples were used, and the sample probability is 0.0001. Representations were only trained for words and phrases with a minimum frequency of 30 occurrences. The final vocabulary contains 615,908 words. The resulting embeddings are stored in the binary word2vec format in twe-adv-adj.bin, which can be loaded by several packages (e.g. the gensim package of Řehůřek and Sojka (2010)).
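The file format and the embedding file described above could be read along the following lines. This is a minimal sketch, not code shipped with the dataset: the names PhraseInstance, parse_line, read_instances, and load_embeddings are hypothetical, and the consistency check assumes that the third field is always the adverb and the adjective joined by _adv_adj_, as in the single example given. The embedding loader assumes gensim is installed and uses its KeyedVectors.load_word2vec_format function.

```python
from typing import Iterable, Iterator, NamedTuple

# Separator string named in the dataset description.
PHRASE_SEPARATOR = "_adv_adj_"


class PhraseInstance(NamedTuple):
    """One line of a train/test/dev file (hypothetical helper type)."""
    adverb: str
    adjective: str
    phrase: str


def parse_line(line: str) -> PhraseInstance:
    """Parse one 'adverb adjective phrase' line.

    Assumption: the phrase field equals adverb + '_adv_adj_' + adjective,
    as in the example 'immer leer immer_adv_adj_leer'.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 space-separated fields: {line!r}")
    adverb, adjective, phrase = fields
    if phrase != f"{adverb}{PHRASE_SEPARATOR}{adjective}":
        raise ValueError(f"phrase field does not match its parts: {line!r}")
    return PhraseInstance(adverb, adjective, phrase)


def read_instances(lines: Iterable[str]) -> Iterator[PhraseInstance]:
    """Yield parsed instances from an iterable of lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


def load_embeddings(path: str = "twe-adv-adj.bin"):
    """Load the binary word2vec file with gensim (imported only when called)."""
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)


if __name__ == "__main__":
    print(parse_line("immer leer immer_adv_adj_leer"))
```

With a split file on disk, read_instances(open(path, encoding="utf-8")) would iterate over its instances; the file names of the splits are not given in the description, so path is left to the reader. If the embeddings contain phrase vectors under the joined form, they would be looked up with the phrase field as the key.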

Provider:

University of Tübingen

Created:

2024-08-07
