German Adverb-Adjective Phrase Dataset for Compositionality Tests (deu-adv-adj)
Source: DataCite Commons. Updated: 2024-08-08. Indexed: 2025-04-15.
Download link:
https://fdat.uni-tuebingen.de/records/7casr-x0p36
Description:
If you want to use this dataset for research purposes, please refer to the following sources:
- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.
The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.
The 23,488 German adverb-adjective phrases (split into 16,441 train, 4,701 test, and 2,346 dev instances) were extracted from the TüBa-D/DP treebank. The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018, and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012); it contains 64.9M sentences and 1.3B tokens.
The dataset was constructed with the help of the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled the following requirements:
- the head is an attributive or predicative adjective and governs the dependent with the adverb relation
- the dependent immediately precedes the head
The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.
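The two requirements above can be illustrated with a minimal sketch. It is not the authors' extraction code: the `Token` structure, the tag set `ADJ_TAGS`, and the relation label `ADV_RELATION` are assumptions made for illustration, and the actual labels should be taken from the TüBa-D/DP stylebook.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical token record; not the treebank's actual data structure."""
    index: int   # 1-based position in the sentence
    lemma: str
    pos: str     # part-of-speech tag
    head: int    # index of the governing token, 0 for the root
    deprel: str  # dependency relation to the head


# Assumed labels, for illustration only.
ADJ_TAGS = {"ADJA", "ADJD"}  # attributive / predicative adjective
ADV_RELATION = "ADV"         # adverb relation


def extract_adv_adj_pairs(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (dependent, head) lemma pairs that meet both requirements."""
    by_index = {tok.index: tok for tok in sentence}
    pairs = []
    for dep in sentence:
        head = by_index.get(dep.head)
        if head is None:  # the root has no head token
            continue
        head_is_adjective = head.pos in ADJ_TAGS
        is_adverb_relation = dep.deprel == ADV_RELATION
        immediately_precedes = dep.index + 1 == head.index
        if head_is_adjective and is_adverb_relation and immediately_precedes:
            pairs.append((dep.lemma, head.lemma))
    return pairs
```

Because the check is on the dependency relation rather than on the part of speech of the dependent, an adjective that functions as an adverb is collected in the same way as a true adverb.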
Each entry in the train/test/dev files has the following format, with the parts separated by a space:
adverb adjective phrase
In the phrase, the adverb and the adjective are joined by the string _adv_adj_ (e.g. immer leer immer_adv_adj_leer).
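A minimal sketch of reading this format in Python follows. It assumes one instance per line and UTF-8 encoding, neither of which is stated in the description; the function names are illustrative.

```python
from typing import List, Tuple

SEPARATOR = "_adv_adj_"  # joins adverb and adjective in the phrase field


def parse_line(line: str) -> Tuple[str, str, str]:
    """Split one instance into (adverb, adjective, phrase)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 space-separated fields, got {len(fields)}")
    adverb, adjective, phrase = fields
    return adverb, adjective, phrase


def read_split(path: str) -> List[Tuple[str, str, str]]:
    """Read a train/test/dev file, assuming one instance per line (UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]


adverb, adjective, phrase = parse_line("immer leer immer_adv_adj_leer")
# In the documented example, the phrase is the two words joined by SEPARATOR.
assert phrase == adverb + SEPARATOR + adjective
```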
For results of different composition models on this dataset see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.
The word representations were trained on the lemmatized TüBa-D/DP treebank with the word2vec package. The embeddings were constructed using the skip-gram model with negative sampling (Mikolov et al., 2013).
The embedding size is 200, the context is a symmetric window of 10 words, 25 negative samples were used, and the sample probability was 0.0001.
Representations were only trained for words and phrases with a minimum frequency of 30 occurrences. The final vocabulary contains 615,908 words.
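For readers who want a comparable setup, the reported hyperparameters could be written as follows for gensim's `Word2Vec` class. This is a sketch under assumptions, not the original training configuration: the embeddings were trained with the word2vec package, and mapping the reported "sample probability" to gensim's `sample` parameter is an assumption. The function name `train_embeddings` is illustrative.

```python
# Reported hyperparameters, expressed with gensim 4.x parameter names.
W2V_PARAMS = {
    "vector_size": 200,  # embedding size
    "window": 10,        # context words on each side of the target
    "sg": 1,             # skip-gram model
    "negative": 25,      # negative samples
    "sample": 1e-4,      # assumed to correspond to the reported sample probability
    "min_count": 30,     # minimum frequency for words and phrases
}


def train_embeddings(sentences):
    """Train embeddings on an iterable of token lists (requires gensim)."""
    from gensim.models import Word2Vec  # imported lazily; optional dependency
    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

To obtain phrase representations in such a setup, each extracted phrase would presumably have to appear in the training corpus as a single token (e.g. immer_adv_adj_leer); the description does not spell out this preprocessing step.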
The resulting embeddings are stored in the binary word2vec format in twe-adv-adj.bin, which can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
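A minimal loading sketch with gensim is given below. It assumes that twe-adv-adj.bin is in the working directory and that phrase vectors are stored under keys of the form adverb_adv_adj_adjective, which the file format suggests but the description does not state explicitly. The helper names are illustrative.

```python
def phrase_key(adverb: str, adjective: str) -> str:
    """Build the assumed vocabulary key for an adverb-adjective phrase."""
    return f"{adverb}_adv_adj_{adjective}"


def load_embeddings(path: str = "twe-adv-adj.bin"):
    """Load the binary word2vec file (requires gensim)."""
    from gensim.models import KeyedVectors  # imported lazily; optional dependency
    return KeyedVectors.load_word2vec_format(path, binary=True)


def lookup(vectors, adverb: str, adjective: str):
    """Return (adverb, adjective, phrase) vectors, or None if any key is missing."""
    keys = (adverb, adjective, phrase_key(adverb, adjective))
    if all(key in vectors for key in keys):
        return tuple(vectors[key] for key in keys)
    return None
```

A typical call would be `lookup(load_embeddings(), "immer", "leer")`. Since representations were only trained for items with at least 30 occurrences, the `None` case should be handled for arbitrary word pairs.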
Provider:
University of Tübingen
Created:
2024-08-07



