
Topic Model for English Wikipedia's Biographies with list of all 1.8M articles linked to Wikidata

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/5747335
Resource description:
A Gensim LDA topic model of English Wikipedia biographical articles, with a list of all 1.8M articles and some associated Wikidata information. The model has 150 topics.

This model was developed in the process of isolating a set of visual arts biographical articles, as described in "Clowns in the Visual Artists: Topic Modeling Wikipedia and Wikidata" in the Spring 2022 issue of Art Documentation (https://doi.org/10.1086/719999).

Because names, nationalities, and birthdays are so prominent in biographies, the stopword list removed 170,000 names, surnames, city names, place names, countries, days, months, and other time-related words (https://github.com/mandiberg/Names-Surnames-and-Countries-for-Stopwords). We also directly removed each article subject's given name and surname, which were almost always the most frequently occurring words in any given article. Without these removals, the model just produced topics based on nationality and on common names and surnames.

Files:

all_enwiki_bios_from_wikidata.csv
The list of all Wikidata items for humans with an enwiki page (i.e., a biographical article), extracted from the Wikidata JSON dump. The list includes gender, occupation, and nationality. It was joined with the converted plaintext from an English Wikipedia dump. The data was downloaded in March 2021.

Wikipedia Biographies LDA Topic Model human readable summary.csv
A human-readable file with the 150 topics ranked by count of articles per topic from the 1.8M corpus. The most popular topics have categorical descriptions of the occupations of each cluster. Some are marked as not an occupation cluster.

BoW_corpus.mm*
model_lda_full_Sep2_150Tv2*
These six files comprise the topic model. The code to load them is in the Python files.

dict_full_Aug-28-2021
processed_docs_full_Aug-28-2021.txt
processed_docs_1000_Aug-18-2021.txt
These are the dictionary and processed corpora required to build and apply the model using this code. The corpus with the first 1000 items is meant for testing, as the full one is quite large and takes a long time to process.

topic-model-wikipedia-sept2021.zip
The code and settings used for creating and implementing this model are included in this zip and are also available at https://github.com/mandiberg/topic-model-wikipedia

All-Wikipedia-Biographies-with-topic1.csv
All-Wikipedia-Biographies-with-topic1and2.csv
These are the lists of the 1.8M biographies matched to topics. The "topic1" file includes only the first topic, so it is the slightly larger list. The "topic1and2" file is slightly smaller because about 2% of articles do not match a second topic.

Analysis-for-Clowns-Visual-Arts.zip
These are the raw data and final data produced for "Clowns in the Visual Artists." Please see the article for context.
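
The name-removal step described above (a shared stopword list plus each article subject's own given name and surname) could be sketched roughly as follows. This is an illustrative sketch only, not the code from the repository: the function name `preprocess`, the regex tokenizer, and the tiny stopword set are all assumptions made for the example, and the real list has about 170,000 entries.

```python
import re

# Assumed tokenizer for this sketch: lowercase ASCII letter runs only.
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(text, subject_name, stopwords):
    """Tokenize one biography, dropping shared stopwords and the subject's names.

    `stopwords` stands in for the large names/places/dates list;
    `subject_name` is the article subject's full name (hypothetical input).
    """
    # Every part of the subject's name becomes a per-article stopword.
    name_parts = set(TOKEN_RE.findall(subject_name.lower()))
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stopwords and t not in name_parts]


# Toy stopword set for illustration; not taken from the published list.
toy_stopwords = {"was", "a", "in", "mexican", "mexico"}
bio = "Frida Kahlo was a Mexican painter. Kahlo painted self-portraits in Mexico."
tokens = preprocess(bio, "Frida Kahlo", toy_stopwords)
print(tokens)
```

With the names and nationality words filtered out, what remains are the occupation-related terms that the topics are meant to capture.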
Created:
2023-01-28