Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people
Source: IEEE; indexed 2026-04-17
Download link:
https://ieee-dataport.org/documents/age-dataset-structured-general-purpose-dataset-life-work-and-death-122-million
Description:
Several fields of study can benefit from a large, structured, and accurate dataset of historical figures. Because no such dataset exists, in this paper we use machine learning and text mining models to collect, predict, and cleanse online data, with a focus on age and gender. We developed a five-step method and inferred birth and death years, binary gender, and occupation from community-submitted data to all language versions of the Wikipedia project. The dataset is the largest on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 124 non-binary people, and 90k researchers, who are spread across more than 300 contemporary or historical regions. The final product provides new insights into the demographics of mortality in relation to gender and profession in history. The technical method demonstrates that the latest text mining approaches can accurately clean historical data and reduce missing values.
Provided by:
Annamoradnejad, Rahimberdi; Annamoradnejad, Issa



