Research data supporting "Machine learning in the processing of historical census data"

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/p4zptr98dh

Description:

This collection contains ground-truth (gold standard) datasets for the employment status reconstruction problem in historical census data. Different machine learning methods can be tested and compared on these datasets, as described in the paper "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C., an outcome of the ESRC project ES/M010953: Drivers of Entrepreneurship and Small Businesses, led by PI Prof. Robert J. Bennett.

The material consists of three raw text files (files 1 and 2 are random samples). The census identifier for individuals (RecID) is not included, so the datasets are fully anonymised and individuals cannot be tracked within or across the files. The files are described below:

1. "1891 1000 Ent". 1891 Census of England and Wales, economically active individuals: 1,000 labelled Entrepreneurs (500 labelled Employers and 500 labelled Own account business proprietors) and 1,000 labelled workers. Labelling derives from the known employment status reported on the night of the Census, which is available for the later 1891-1911 censuses, using the crosses recorded in the columns of the 1891 Census Enumerators' Books (CEBs).

2. "1851 1000 Ent". 1851 Census of England and Wales, economically active individuals: 1,000 labelled Entrepreneurs (500 labelled Employers and 500 labelled Own accounts) and 1,000 labelled workers. Labelling was done by clerical control of the occupation strings for the extracted Groups of business proprietors in the 1851 Census.

3. "1851 MAX(Extracted)". 1851 Census of England and Wales, economically active individuals: 70,872 labelled Entrepreneurs (35,436 labelled Employers and 35,436 labelled Own accounts) and 70,872 labelled workers. This is the largest possible balanced dataset, built from all the employers and own-account proprietors identified by the extracted Groups (Group 1 for Employers; Groups 3 and 5 for Own account). Labelling was done by clerical control of the occupation strings for the extracted Groups of the 1851 Census. The key variable OccString, containing the full occupation strings, is also included.

A detailed explanation of how these datasets were obtained, and of how to use them for machine learning reconstruction of employment status in historical census data, can be found in the paper "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C. (2020), Information Processing & Management.

This dataset should be cited as: Montebruno, Piero; Bennett, Robert J.; Smith, Harry J.; van Lieshout, Carry (2020), “Research data supporting "Machine learning in the processing of historical census data" ”, Mendeley Data, http://dx.doi.org/10.17632/p4zptr98dh.1
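Each file is described as balanced: Employers and Own accounts are equal in number, and together they match the number of workers (500 + 500 = 1,000 in the samples; 35,436 + 35,436 = 70,872 in the full extract). The sketch below shows one way to verify that property after loading a file. It is only an illustration under stated assumptions: the description does not give the delimiter, the header layout, or any column name other than OccString, so the tab-delimited layout, the "Label" column, its values, and the sample rows are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for one of the raw text files. The real layout is not
# documented in the description: the tab delimiter, the "Label" column name,
# its values, and these example rows are assumptions made for illustration.
SAMPLE = (
    "OccString\tLabel\n"
    "FARMER EMPLOYING 3 MEN\tEmployer\n"
    "DRESSMAKER\tOwn account\n"
    "AGRICULTURAL LABOURER\tWorker\n"
    "COTTON WEAVER\tWorker\n"
)


def read_rows(handle, delimiter="\t"):
    """Read a delimited text file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def class_counts(rows, label_column="Label"):
    """Count how many rows carry each employment-status label."""
    return Counter(row[label_column] for row in rows)


def is_balanced(counts, employer="Employer", own_account="Own account",
                worker="Worker"):
    """Check the balance described for the datasets: equal numbers of
    employers and own accounts, and as many workers as entrepreneurs."""
    entrepreneurs = counts[employer] + counts[own_account]
    return (counts[employer] == counts[own_account]
            and entrepreneurs == counts[worker])


rows = read_rows(io.StringIO(SAMPLE))
counts = class_counts(rows)
print(counts)
print(is_balanced(counts))
```

To run the check on a downloaded file, replace the in-memory sample with an open file handle, for example `read_rows(open(path, newline="", encoding="utf-8"))`, and adjust the delimiter and column names to whatever the file actually uses.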


Created:

2020-01-23
