Wikipedia category embeddings - Node2Vec, Poincare, Elmo
Source: DataCite Commons. Updated 2020-07-29. Indexed 2025-04-16.
Download link:
https://databank.illinois.edu/datasets/IDB-4551278
Description:
Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (https://archive.org/download/enwiki-20170920), created using the following algorithms:
* Node2vec
* Poincare embedding
* Elmo model on the category title

The following files are present:
* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (spaces replaced with "_") followed by a 300-dimensional, space-separated embedding.
* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings in word2vec format; can be loaded using Gensim Word2VecKeyedVectors.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVectors format of Elmo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model with a node2vec embedding for each category, identified by its position (starting from 0) in category.txt
* poincare.tar.gz (1.8G) - Gensim Poincare embedding model with a Poincare embedding for each category, identified by its position (starting from 0) in category.txt
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by the node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified by its position (starting from 0) in category.txt
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as the category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format: from_category to_category
* category_edges_ids.txt - Category edges based on category IDs, each category identified by its position (starting from 1) in category.txt
* wiki_cats-G.json - NetworkX format of the category graph, each category identified by its position (starting from 1) in category.txt

Software used:
* https://github.com/napsternxg/WikiUtils - Processing SQL dumps
* https://github.com/napsternxg/node2vec - Generating random walks for node2vec
* https://github.com/RaRe-Technologies/gensim (version 3.4.0) - Generating node2vec embeddings from the random walks produced by the node2vec algorithm
* https://github.com/allenai/allennlp (version 0.8.2) - Generating Elmo embeddings for each category title

Code used:
* wiki_cat_node2vec_commands.sh - Commands used to
* wiki_cat_generate_elmo_embeddings.py - Generate Elmo embeddings
* wiki_cat_poincare_embedding.py - Generate Poincare embeddings
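The file formats described above can be read with the Python standard library alone. The following is a minimal sketch, not the dataset authors' code. It assumes that categories.txt is UTF-8 with exactly one name per line, and that the plain-text embedding file holds one record per line with no header row (the word2vec-format file starts with a count/dimension header and is better loaded with Gensim). The function names load_categories and iter_text_embeddings are hypothetical helpers introduced here for illustration.

```python
import gzip
from typing import Iterator, List, Tuple


def load_categories(path: str) -> List[str]:
    """Read a categories.txt-style file: one category name per line.

    The list index is the 0-based category ID described in the listing.
    Assumption: the file is UTF-8 encoded.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def iter_text_embeddings(path: str) -> Iterator[Tuple[str, List[float]]]:
    """Yield (category_name, vector) pairs from a gzipped text embedding file.

    Assumption: each line is the category name (spaces already replaced
    with "_") followed by space-separated floats, with no header line.
    Streaming line by line avoids loading a multi-gigabyte file into memory.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            name, values = parts[0], parts[1:]
            yield name, [float(value) for value in values]
```

To look up the name behind a 0-based numeric ID, index the list returned by load_categories. To compare an embedding key with a name from categories.txt, replace the spaces in that name with "_" first. Files documented as 1-based (category_edges_ids.txt, wiki_cats-G.json) may need an offset of one, which is worth checking against a few known edges before relying on it.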
Provider:
University of Illinois at Urbana-Champaign
Created:
2019-07-08



