
Bilingual English-German word embedding models for scientific text

NIAID Data Ecosystem, indexed 2026-03-12
Download link:
https://zenodo.org/record/4467632
Description:
This data set contains three word embedding models, constructed from the same training corpus of English and German parallel scientific texts (abstracts and research project descriptions). All text was pre-processed by language-specific stemming with the Porter stemming algorithm, removing numbers, and lower-casing.

The first model is a 1000-dimensional Latent Semantic Analysis (LSA) model, constructed by concatenating the English and German texts. The input data was an m×n (297,852×923,864) document-term matrix of tf-idf weights, which was processed with truncated SVD. There are two files: the word vectors in lsa_1000_Vmat.csv (the V* term-by-latent-factor matrix of right singular vectors) and the dimension weights in lsa_1000_d_weights.csv (the 1000 values on the diagonal of the \(\Sigma\) matrix). lsa_1000_Vmat.csv has two fields, the term and its vector representation in LSA space, separated by a "|" character. The structure looks like this:

tarifplural|{5.00599733151825e-08,-1.43071379136936e-08,8.32862290483082e-08,-6.08010721687266e-08,1.15831140150142e-07,-2.46470313387358e-08,3.43215595753282e-07,6.24301666802575e-07,-2.62907158945831e-07,-1.04120313981517e-07,4.5864574355164e-07,-2.31799632277312e-07,8.37354377858843e-07,8.22507467711628e-07,4.07585381069368e-07,-4.26358988941922e-08,-8.38652991154651e-07,1.98091851171759e-07,-3.94768548759816e-08,-4.28802181962385e-07, ...}

The other two models are a basic Random Indexing (RI) model and a Reflective Random Indexing (RRI) model, contained in the same file, RI_training.csv. Both models have 1000 dimensions. The file has 1,034,860 rows and the following fields:

language: either "en" (English) or "de" (German), the language of the term
term: the term as a character string
term_collection_count: integer, number of times the term occurred in the training data
c_vector: vector of 1000 reals, RI context vector of the term, formatted like this: "{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.12309149,0,0,-0.12309149,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ...}"
n_docs: integer, number of different documents which contained the term
c_vector_o2: vector of 1000 reals, RRI context vector of the term, formatted like c_vector above

All files are aggressively compressed with GNU gzip and will require much more disk space when uncompressed. Note the special formatting of the numeric vector fields, which differs between the models.
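The field layouts described above could be read along the following lines. This is a minimal sketch, not the authors' loading code. It assumes that RI_training.csv is comma-delimited with double-quoted vector fields, that its columns appear in the order listed in the description, and that it has no header row; none of this is confirmed by the description. The function names and the shortened sample lines are hypothetical and only illustrate the brace-delimited vector format.

```python
import csv
import io

def parse_vector(text):
    """Parse a brace-delimited vector such as '{0,0.5,-0.5}' into a list of floats."""
    inner = text.strip().strip('"').strip()
    if not (inner.startswith("{") and inner.endswith("}")):
        raise ValueError("expected a vector wrapped in { }")
    body = inner[1:-1].strip()
    return [float(x) for x in body.split(",")] if body else []

def parse_lsa_line(line):
    """Split one lsa_1000_Vmat.csv line into (term, vector) at the last '|'."""
    term, _, vec = line.rstrip("\n").rpartition("|")
    return term, parse_vector(vec)

# ASSUMPTION: column order taken from the order of the field descriptions.
RI_FIELDS = ["language", "term", "term_collection_count",
             "c_vector", "n_docs", "c_vector_o2"]

def iter_ri_rows(handle, has_header=False):
    """Yield RI_training.csv rows as dicts with typed values.

    ASSUMPTION: comma-delimited CSV, vectors wrapped in double quotes.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    for row in reader:
        rec = dict(zip(RI_FIELDS, row))
        rec["term_collection_count"] = int(rec["term_collection_count"])
        rec["n_docs"] = int(rec["n_docs"])
        rec["c_vector"] = parse_vector(rec["c_vector"])        # RI vector
        rec["c_vector_o2"] = parse_vector(rec["c_vector_o2"])  # RRI vector
        yield rec

# Shortened, illustrative lines (real vectors have 1000 components).
lsa_term, lsa_vec = parse_lsa_line("tarifplural|{5.0e-08,-1.4e-08,8.3e-08}")
sample = io.StringIO('en,example,12,"{0,0.12309149,-0.12309149}",3,"{0,1,0}"\n')
ri_rows = list(iter_ri_rows(sample))
print(lsa_term, len(lsa_vec), ri_rows[0]["term"], len(ri_rows[0]["c_vector"]))
```

For the real files, the same functions can be fed a text handle from gzip.open(path, "rt", encoding="utf-8"), which streams the data line by line without decompressing the whole file to disk first. The UTF-8 encoding is also an assumption.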
Created:
2021-01-26