Exploring Synthetic Data, 2021-2022
Source: DataCite Commons. Updated: 2026-02-16. Indexed: 2026-05-06.
Download link:
http://reshare.ukdataservice.ac.uk/id/eprint/858361
Description:
In the UK, administrative data have helped us build a picture of public service users and their needs. Administrative data also contain personal and sensitive information, which must be protected so that individuals can never be identified. However, the approvals and governance processes associated with accessing administrative data are extremely time-consuming, which can threaten the timeliness of research. The other time-consuming part of any research study using administrative data is understanding the structure of the data and developing data cleaning and analysis plans; the time taken to conduct the final analyses is often comparatively short. Research timelines could be substantially reduced if there were a way to carry out these preliminary tasks in parallel with applying for access to the real data on which the final analyses would be conducted. This is where synthetic (or artificial) data could help. Synthetic data are artificially generated data designed to mimic real datasets without containing personally identifiable information.
Synthetic data also have considerable potential for capacity building. It can be difficult to recruit researchers with experience of using administrative data, in part because it is often impossible to grant MSc students access to these datasets. If we could train students using synthetic data, we could substantially enhance the training we provide to the next generation of data scientists.
Synthetic data can:
• Facilitate easier access to data for those who are generating hypotheses and developing tools
• Prepare and train researchers for the practical challenges of working with national clinical datasets
• Be used as pilot data (instead of real data) to strengthen research applications
Being able to explore the datasets, understand what is available, and test code on the data can help streamline the research process, and enable researchers to make informed decisions and plan their research thoroughly, in a low-risk setting.
Although the idea of synthetic data was introduced around 30 years ago, synthetic data are still not widely used, and synthetic versions of administrative datasets are not routinely available. One of the main barriers to wider use of synthetic administrative data is uncertainty about the level of fidelity required in the data. In addition, the terminology used to describe synthetic data varies, which makes it difficult to communicate within the research field and with data providers and members of the public.
In an article published in the International Journal of Population Data Science (IJPDS), we provide a comprehensive overview of the main synthetic data generation methods in the context of UK administrative data research. We discuss the benefits and challenges, and propose simplified terms that would help data holders and data users familiarise themselves with the concepts of synthetic data. Using a consistent terminology should promote collaboration and engagement and allow effective communication of the benefits of synthetic data, to help build further acceptance and trust.
Our workshop highlighted the clear value of synthetic data for a range of purposes. It also showed that the first step towards demonstrating that value would be to facilitate the rollout of a number of low-fidelity datasets for training, and to consolidate and validate methods for generating synthetic data. Demonstrating value with low-fidelity datasets, which are not resource-intensive to generate, will pave the way for high-fidelity datasets. High-fidelity generation still needs further work: for example, more research is needed to understand how best to replicate the longitudinal and high-dimensional nature of administrative datasets.
Provider:
UK Data Service
Created:
2026-02-16



