Methodology data of "Twenty years of research in Digital Humanities: a topic modeling study"

Indexed by NIAID Data Ecosystem on 2026-03-12
Download link:
https://zenodo.org/record/4552435
Description:
This document contains the datasets created in the thesis "Twenty years of research in Digital Humanities: a topic modeling study". The methodological approach of the work is based on two datasets built by web scraping DH journals' official web pages and by API requests to popular academic databases (Crossref, Datacite). The datasets constitute a corpus of DH research and include research paper abstracts and abstract papers from DH journals and international DH conferences published between 2000 and 2020. Probabilistic topic modeling with latent Dirichlet allocation is then performed on both datasets to identify relevant research subfields.

Data

The "data/" folder contains four folders, which relate to two datasets:

The first dataset, referred to as the journals dataset, contains original research papers published in journals exclusively devoted to digital humanities scholarship [1] and is composed of 2,464 articles from 26 journals.

The second dataset, the conference dataset, contains abstract papers available in ADHO conference archives and is composed of 2,160 articles from 15 years of ADHO conferences and 4 conferences promoted by journals.

Both datasets are provided with: URL (if available); identifier and related scheme (if available); abstract or abstract paper; title; authors' given name and family name; author's affiliation name, found within the document metadata or text; normalized affiliation name, country of the affiliation, and identifiers of the affiliation provided by the Research Organization Registry Community (ROR, https://ror.org); publisher (if available); publishing date (complete date when provided, otherwise only the year); keywords (if available); journal title; volume and issue (if available); electronic and/or print ISSN (if available).

The two folders "data/no_abstracts..." are licensed under a Creative Commons public domain dedication (CC0), while the others keep their original license (the one provided by their publisher) because they contain full abstracts of the papers. These latter datasets are provided in order to favor the reproducibility of the results obtained in our work.

Topic modeling

The "topic_modeling/" directory contains input and output data used within MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [2]. The topic modeling results are divided into two folders, one for each of the datasets.

Note: It is necessary to unzip the file to get access to all the files and directories listed above.

References

[1] Spinaci, G., Colavizza, G., Peroni, S., Preliminary Results on Mapping Digital Humanities Research, in: Atti del IX Convegno Annuale AIUCD. La svolta inevitabile: sfide e prospettive per l'Informatica Umanistica, Milan, Università Cattolica del Sacro Cuore, 2020, pp. 246-252 (atti di: IX Convegno Annuale AIUCD. La svolta inevitabile: sfide e prospettive per l'Informatica Umanistica, Milano, Italy, 15-17 gennaio 2020).

[2] Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3
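
The description states that the publishing date field holds a complete date when one was provided and only the year otherwise. A consumer of the data therefore has to handle both shapes. The following is a minimal sketch of one way to do that, not the authors' code: the string formats ("YYYY-MM-DD" and "YYYY"), the class name PublishingDate, and the function names are all assumptions made for illustration, since the record does not specify how the field is serialized.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PublishingDate:
    """Hypothetical container: the year is always known, the full date only sometimes."""
    year: int
    full_date: Optional[date] = None  # set only when a complete date was provided


def parse_publishing_date(raw: str) -> PublishingDate:
    """Parse a value assumed to be either 'YYYY-MM-DD' or 'YYYY'."""
    text = raw.strip()
    if len(text) == 4 and text.isdigit():
        # Year-only record: keep the year and leave the full date unset.
        return PublishingDate(year=int(text))
    parsed = date.fromisoformat(text)  # raises ValueError on any other format
    return PublishingDate(year=parsed.year, full_date=parsed)


def in_corpus_range(pub: PublishingDate, start: int = 2000, end: int = 2020) -> bool:
    """Check that a record falls inside the 2000-2020 window stated in the description."""
    return start <= pub.year <= end
```

Keeping the year separate from the optional full date lets year-level filtering work uniformly across records, while code that needs day-level precision can check full_date explicitly.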

Created:
2021-02-20