Vector representations of English words and compounds

Name: Vector representations of English words and compounds
Creator: University of Tübingen
Published: 2024-05-19 15:31:36
License: No description available

Source: DataCite Commons. Updated: 2024-05-19. Indexed: 2024-07-13.

Download link:

https://fdat.uni-tuebingen.de/records/9jj4v-6dj64

Description:

Word representations used in Dima (2019). The vectors were trained on the concatenation of encow14ax (https://corporafromtheweb.org/) and the Müller and Schütze (2015) version of English Wikipedia, about 9 billion words of text in total. The corpus was pre-processed for compounds: each compound from the en-comcom dataset was joined with an underscore and treated as a single word, e.g. 'police car' was rewritten as 'police_car'. The embeddings were trained with a minimum word frequency of 100, which yields a vocabulary of 424,014 words. The vocabulary words and their corpus frequencies are listed in the file 'glove_encow14ax_enwiki_9B.400k_min100.vocab'. Vectors are provided in four dimensionalities: 50, 100, 200 and 300. The embeddings were trained with GloVe for 15 iterations, using a symmetric 10-word window (10 words on each side, i.e. 20 words surrounding a given word). Training parameters: MAX_ITER=15 WINDOW_SIZE=10 BINARY=0 NUM_THREADS=8 X_MAX=100
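Since the vectors were written with BINARY=0, they are presumably stored in GloVe's plain-text format, with one token per line followed by its space-separated vector components. The sketch below shows one way such a file could be read and how a compound could be looked up under the underscore convention described above. The helper names (`parse_glove_text`, `compound_key`, `cosine`) and the toy three-dimensional vectors are hypothetical and exist only for illustration; they are not part of the release, and the assumed file layout should be checked against the actual files.

```python
import math
from typing import Dict, Iterable, List


def parse_glove_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe text-format lines: a token followed by space-separated floats.

    Assumption: one entry per line and no header line.
    """
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def compound_key(phrase: str) -> str:
    """Join the words of a compound with underscores ('police car' -> 'police_car')."""
    return "_".join(phrase.split())


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional vectors for illustration only; NOT data from the release.
sample = [
    "police_car 0.1 0.2 0.3",
    "car 0.1 0.2 0.25",
]
vectors = parse_glove_text(sample)
similarity = cosine(vectors[compound_key("police car")], vectors["car"])
```

To read a real file, an open file handle can be passed in place of `sample`, for example `with open(path, encoding="utf-8") as f: vectors = parse_glove_text(f)`, where `path` points to one of the downloaded vector files. Keeping all 424,014 vectors in plain Python lists is memory-hungry at 300 dimensions, so an array library may be a better fit for the larger files.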

Provider:

University of Tübingen

Created:

2024-02-29
