Generation of Synthetic Data in Health Surveys Using Large Language Models

Source: Figshare. Updated: 2026-01-24. Indexed: 2026-04-28.

Download link:

https://figshare.com/articles/dataset/Generation_of_Synthetic_Data_in_Health_Surveys_Using_Large_Language_Models/31143748

Description:

Background: Generating synthetic data using artificial intelligence, such as large language models (LLMs), is a useful strategy in public health because it can reduce time and costs, expand access to data, and facilitate information sharing without compromising confidentiality.

Objective: To evaluate the consistency and psychometric plausibility of synthetic data generated by an LLM to simulate the responses of survey participants (user personas) in a national health survey in Peru.

Methods: We conducted a cross-sectional study based on the National Health Satisfaction Survey (ENSUSALUD 2016) of ambulatory health service users. We used the GPT-OSS-20B model to generate synthetic responses in Spanish, conditioned on narrative profiles derived from sociodemographic and clinical variables. We evaluated consistency between responses and profile characteristics (sex, age, and comorbidities) using performance metrics (accuracy, precision, recall, F1 score, and AUC). We compared distributions between real and synthetic data using t-tests and chi-square tests. For latent variables, we conducted confirmatory factor analyses of the PHQ-9, PHQ-8, and GAD-7 (WLSMV; polychoric matrices) and estimated internal consistency (α and ω). We examined normality (Jarque–Bera test) and stability through correlations between real measures (PHQ-2 and EQ-5D) and synthetic measures (PHQ-2, PHQ-8, PHQ-9, GAD-2, and GAD-7).

Results: The model showed strong concordance with the profile for sex, age, and chronic disease status, with metrics close to 1 for most variables; overall consistency was high in the vast majority of cases. The synthetic PHQ-9, PHQ-8, and GAD-7 instruments showed optimal factor fit and high internal consistency. Synthetic measures were positively and significantly correlated with the real PHQ-2 and negatively correlated with EQ-5D, with moderate to high correlations, particularly for PHQ-8/PHQ-9 and GAD-7.

Conclusions: An LLM can generate plausible synthetic data for health surveys when its output is conditioned on user personas, preserving high coherence with demographic and clinical characteristics and maintaining adequate psychometric properties in depression and anxiety scales. However, relevant deviations were identified (e.g., overestimation of obesity, unexpected distributions in some variables, and missing values in a sensitive item), which supports the need for rigorous validation and bias control before using these data for inferential purposes or public policy.
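Two of the quantities named in the Methods, the profile-consistency metrics (accuracy, precision, recall, F1) and Cronbach's α, can be illustrated with a short sketch. This is not the authors' code: the function names and the toy inputs below are hypothetical, and the study's other analyses (AUC, ω, confirmatory factor analysis with WLSMV) are not shown.

```python
from statistics import variance


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive).

    y_true: attribute stated in the persona profile (hypothetical encoding).
    y_pred: same attribute as reported in the synthetic response.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items score matrix.

    rows: list of respondents, each a list of k item scores
    (for example, the 9 items of a PHQ-9 response).
    """
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total (summed) score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    profile_sex = [1, 1, 0, 0, 1]
    response_sex = [1, 0, 0, 0, 1]
    print(classification_metrics(profile_sex, response_sex))

    toy_items = [[0, 1, 0], [1, 1, 2], [2, 3, 2], [3, 3, 3]]
    print(round(cronbach_alpha(toy_items), 3))
```

Values near 1 for the first function correspond to the "metrics close to 1" reported in the Results; α is only one of the two internal-consistency coefficients the study reports.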

Created: 2026-01-24
