Synthetic Household Populations and Evaluation Data: Global, Newcastle upon Tyne and Dar es Salaam Experiments

Name: Synthetic Household Populations and Evaluation Data: Global, Newcastle upon Tyne and Dar es Salaam Experiments
Creator: Newcastle University
Published: 2026-04-17 11:40:52
License: No description available

DataCite Commons. Updated 2026-04-17; indexed 2026-05-04.

Download link:

https://data.ncl.ac.uk/articles/dataset/Synthetic_Household_Populations_and_Evaluation_Data_Global_Newcastle_upon_Tyne_and_Dar_es_Salaam_Experiments/31830205/1

Description:

This dataset contains the data artefacts supporting the paper "A large language model framework for sample-free population synthesis". It provides all inputs, outputs, prompts, configurations, and evaluation results needed to reproduce the experiments reported in the paper.

Overview

Synthetic populations provide the demographic foundations for agent-based models in transport, public health, disaster management, and related sectors. Many established synthesis methods rely on census microdata, which is infrequently collected, privacy-restricted, and typically available only as small public-use samples at coarse geographic scales. This record supports a sample-free framework that uses large language models (LLMs) to generate complete, household-structured synthetic populations directly from aggregate demographic data, without requiring microdata, cross-tabulations, or model fine-tuning.

Methodology

Population synthesis is formulated as an iterative constrained reasoning task. At each generation step, an LLM is supplied with a prompt containing: (i) task instructions, (ii) a household schema defining the required attributes and structural rules, (iii) the target marginal distributions for each demographic variable, and (iv) the current empirical distributions computed from the population generated so far. The model generates a batch of candidate households, which are validated against the schema and structural rules. Invalid households trigger a correction request to the model; the generation loop continues until the target population size is reached.

The framework is LLM-agnostic and requires only individual marginal distributions as inputs (e.g. age, gender, household size, household composition). No joint distributions, cross-tabulations, or sample microdata are required.
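
The generation loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repositories: the function names, the prompt wording, the household JSON layout (`members`, `role`, `age`), the adult-age threshold, and the stub standing in for a real LLM call are all assumptions made for this example.

```python
import json
import random
from collections import Counter

ADULT_AGE = 18  # assumed threshold for "no child as household head"


def build_prompt(schema, targets, current, batch_size):
    """Combine the four prompt components into a single string."""
    return "\n\n".join([
        f"Generate {batch_size} households as a JSON list.",  # (i) instructions
        "Schema: " + json.dumps(schema),                       # (ii) schema and rules
        "Targets: " + json.dumps(targets),                     # (iii) target marginals
        "Current: " + json.dumps(current),                     # (iv) running distributions
    ])


def is_valid(household):
    """Check a household against three illustrative structural rules."""
    if not isinstance(household, dict):
        return False
    members = household.get("members", [])
    heads = [m for m in members if m.get("role") == "head"]
    if len(heads) != 1:
        return False  # exactly one household head
    head_age = heads[0]["age"]
    if head_age < ADULT_AGE:
        return False  # a child cannot head a household
    children = [m for m in members if m.get("role") == "child"]
    return all(c["age"] < head_age for c in children)  # head older than children


def current_distributions(population):
    """Empirical household-size shares (percent) of the population so far."""
    if not population:
        return {}
    sizes = Counter(str(len(h["members"])) for h in population)
    shares = {k: 100 * v / len(population) for k, v in sorted(sizes.items())}
    return {"household_size": shares}


def parse_batch(raw):
    """Return the parsed list of households, or [] on a JSON parsing failure."""
    try:
        batch = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return batch if isinstance(batch, list) else []


def generate_population(call_llm, schema, targets, target_size,
                        batch_size=5, max_calls=200):
    """Generate, validate, and request corrections until target_size is reached."""
    population, calls = [], 0
    while len(population) < target_size and calls < max_calls:
        prompt = build_prompt(schema, targets,
                              current_distributions(population), batch_size)
        batch = parse_batch(call_llm(prompt))
        calls += 1
        valid = [h for h in batch if is_valid(h)]
        invalid = [h for h in batch if not is_valid(h)]
        if invalid and calls < max_calls:
            correction = (prompt + "\n\nCorrect these invalid households: "
                          + json.dumps(invalid))
            valid += [h for h in parse_batch(call_llm(correction)) if is_valid(h)]
            calls += 1
        population.extend(valid)
    return population[:target_size]


def make_stub_llm(seed=0):
    """Hypothetical stand-in for an LLM: ignores the prompt, returns random JSON."""
    rng = random.Random(seed)

    def call_llm(prompt):
        households = []
        for _ in range(5):
            members = [{"role": "head", "age": rng.randint(14, 80)}]
            for _ in range(rng.randint(0, 3)):
                members.append({"role": "child", "age": rng.randint(0, 40)})
            households.append({"members": members})
        return json.dumps(households)

    return call_llm


SCHEMA = {"members": [{"role": "head|child", "age": "int"}]}
TARGETS = {"household_size": {"1": 30, "2": 35, "3": 20, "4": 15}}
population = generate_population(make_stub_llm(), SCHEMA, TARGETS, target_size=20)
```

The `max_calls` cap is a design choice of this sketch, added so that a model which keeps returning invalid output cannot loop forever; the source does not say how the framework bounds retries.
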
All demographic inputs are supplied in a standardised two-column CSV format specifying category labels and percentage shares.

LLMs were accessed via two routes: commercial models (GPT-4o, GPT-4o-mini, o3-mini, Grok-3, Grok-3-mini) through Microsoft Azure AI Foundry, and open-source models (Llama variants, Phi-3-mini) via Ollama running locally. No fine-tuning was performed on any model.

Experiments

The record is organised by experiment, corresponding to the sections of the paper:

- LLM benchmark: Comparison of multiple LLMs on a standardised test case, evaluating distributional fit, structural validity, error rates, token usage, and cost.
- Global evaluation: Application of the framework across 109 countries using publicly available demographic data, assessing generalisability across diverse demographic contexts.
- Newcastle upon Tyne case study: Detailed evaluation against ONS 2021 Census data for a UK city, including repeated runs and assessment of joint demographic patterns.
- Dar es Salaam case study: Evaluation in a data-scarce context using fragmented and partially inconsistent demographic inputs, demonstrating performance under low data availability.
- Sensitivity analysis (batch size): Examination of how the number of households generated per LLM prompt affects distributional fit, structural validity, and computational efficiency.
- IPU comparison: Direct comparison with Iterative Proportional Updating (IPU) on a Newcastle LSOA, using 2021 ONS Census data as the evaluation target.
- Additional experiments: Further sensitivity and ablation experiments as described in the paper.

Each experiment folder contains the input marginals, LLM prompt template, model configuration (model name, version snapshot, temperature, top-p, batch size), and output CSV files with synthetic population records or evaluation metrics.

Evaluation

Synthetic populations are evaluated on two dimensions.
Distributional fit is measured using Standardised Root Mean Square Error (SRMSE) and Wasserstein distance, computed per variable against the input target distributions. Structural feasibility is assessed by checking all generated households against a set of domain rules (e.g. one household head per household; parents older than children; no child living alone as household head). Error rates are reported for JSON parsing failures, schema violations, and API errors.

Data Sources

Global evaluation (109 countries)

Age and gender distributions were drawn from the UN World Population Prospects dataset, using the most recent 2023 estimates. Household size and composition data were taken from the UN Database on Household Size and Composition, which collates and harmonises data from heterogeneous sources including Multiple Indicator Cluster Surveys, Demographic and Health Surveys, and the Integrated Public Use Microdata Series. To limit data quality issues, records from before 2010 or with more than 0.1% unknown values were excluded. After filtering, the sample covered 108 UN member states. Additional household data were sourced from the 2021 UK Census, bringing the total to 109 countries. For each country, a population of 500 households was generated using identical prompts.

International migrant stock data used in regression analyses were drawn from the International Migrant Stock 2024 dataset (United Nations Department of Economic and Social Affairs, Population Division).

Newcastle upon Tyne case study

All demographic inputs were drawn from the 2021 ONS Census, released under the Open Government Licence.
Newcastle served as a benchmark case under ideal data conditions, with complete and internally consistent inputs from a single source.

Dar es Salaam case study

Data availability for Dar es Salaam was limited and fragmented across multiple sources collected at different times and geographic aggregation levels, as summarised below:

Variable | Dataset | Data Type | Geographic Level
Age by gender | Tanzania Population Estimates and Projections, 2015–2030 | Estimated (for 2025 from 2022 Census) | Dar es Salaam
Average household size | Tanzania Demographic and Health Survey and Malaria Indicator Survey 2022 Final Report | Measured | Dar es Salaam
Household size | Tanzania Demographic and Health Survey and Malaria Indicator Survey 2022 Final Report | Measured | Urban Tanzania Mainland
Household composition | UN Database on Household Size and Composition (from 2015 DHS) | Estimated | Tanzania

IPU comparison

The IPU comparison used ONS 2021 Census 10% safeguarded microdata (North East region). This microdata is not included in this dataset due to licence restrictions. It is available via the UK Data Service safeguarded access route (https://ukdataservice.ac.uk). Derived evaluation metrics are included in the experiment outputs.

Ethical and Legal Compliance

This study does not involve the collection, processing, or storage of personal data. All demographic inputs are aggregate statistics drawn from publicly available sources published by national statistical agencies and international organisations. No individuals are represented in the input data. Prompts supplied to LLMs contain only aggregate statistical targets and no personal information.

The synthetic records produced by the framework include ages and relationship roles but do not contain quasi-identifiers such as names, addresses, dates of birth, or sub-city geographic resolution.
The outputs cannot be linked to real individuals and do not constitute personal data under applicable data protection legislation.

Commercial LLMs were accessed via Azure AI Foundry, which operates under published usage policies, content filtering, and safety guardrails. Open-source models were run locally via Ollama, ensuring no external transmission of input data. No model fine-tuning was performed.

Code

The generation framework and evaluation scripts are maintained in two public GitHub repositories.

- Population generator library: https://github.com/MJones235/LLM-Population-Generator/releases/tag/v1.0.0
- Data collection and processing: https://github.com/MJones235/Synthetic-Population-Experiments/releases/tag/v1.0.0
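
As an illustration of the evaluation inputs and metrics described above, the sketch below reads a two-column marginal CSV and computes SRMSE and a Wasserstein distance for one variable. It rests on several assumptions that the record does not confirm: the column names `category` and `percentage` are hypothetical, SRMSE is taken as the RMSE of category shares divided by the mean target share, and the Wasserstein distance treats categories as ordered with unit spacing. The normalisation used in the paper may differ, and the example data are invented for demonstration.

```python
import csv
import io
import math
from collections import Counter
from itertools import accumulate


def load_marginal(csv_text):
    """Parse a two-column CSV (label, percentage) into {label: share in 0..1}."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row (column names are hypothetical)
    shares = {row[0].strip(): float(row[1]) / 100.0 for row in reader if row}
    total = sum(shares.values())
    return {label: share / total for label, share in shares.items()}  # renormalise


def empirical_shares(values, categories):
    """Share of each category among the observed values."""
    counts = Counter(values)
    return {c: counts.get(c, 0) / len(values) for c in categories}


def srmse(target, observed):
    """RMSE of category shares divided by the mean target share (assumed form)."""
    cats = list(target)
    rmse = math.sqrt(sum((observed[c] - target[c]) ** 2 for c in cats) / len(cats))
    return rmse / (sum(target.values()) / len(cats))


def wasserstein_ordinal(target, observed):
    """1-D Wasserstein-1 distance for ordered categories with unit spacing."""
    cats = list(target)  # dict insertion order is taken as the category order
    cdf_target = accumulate(target[c] for c in cats)
    cdf_observed = accumulate(observed[c] for c in cats)
    return sum(abs(a - b) for a, b in zip(cdf_target, cdf_observed))


# Invented example: target household-size marginal and ten synthetic households.
TARGET_CSV = "category,percentage\n1,30\n2,35\n3,20\n4+,15\n"
target = load_marginal(TARGET_CSV)
synthetic_sizes = ["1"] * 3 + ["2"] * 4 + ["3"] * 2 + ["4+"] * 1
observed = empirical_shares(synthetic_sizes, target)

print(f"SRMSE: {srmse(target, observed):.4f}")
print(f"Wasserstein: {wasserstein_ordinal(target, observed):.4f}")
```

Both functions return 0 when the synthetic shares match the targets exactly. The cumulative-difference form of the Wasserstein distance is only meaningful for ordered variables such as age band or household size; an unordered variable such as household composition would need a different ground distance.
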

Provider:

Newcastle University

Created:

2026-04-17