SynC Data Sets

Figshare | Updated 2019-04-02 | Indexed 2026-04-29
Download link:
https://figshare.com/articles/dataset/SynC_Data_Sets/7938644
Description:
Generating synthetic population data from multiple raw data sources is a fundamental step for many data science tasks with a wide range of applications. However, despite the presence of a number of approaches such as iterative proportional fitting (IPF) and combinatorial optimization (CO), an efficient and standard framework for handling this type of problem is absent. In this study, we propose a multi-stage framework called SynC (short for Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture the dependencies and marginal distributions of sampled survey data. Finally, SynC leverages neural networks to merge datasets into one. Our key contributions include: 1) proposing a novel framework for generating individual-level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) designing a metric for validating the accuracy of generated data when the ground truth is hard to obtain, 3) releasing an easy-to-use framework implementation for reproducibility and demonstrating its effectiveness with the Canada National Census data, and 4) presenting two real-world use cases where datasets of this nature can be leveraged by businesses.
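The Gaussian copula step mentioned in the description can be illustrated with a small sketch. This is not the SynC implementation: it covers only two variables, skips the outlier-removal and neural-network merging stages, and assumes empirical marginals. All function names (`normal_scores`, `sample_gaussian_copula`, and so on) and the toy age/income data are hypothetical, chosen for illustration only.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def normal_scores(values):
    """Map a sample to normal scores via its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the pseudo-observation strictly inside (0, 1)
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at probability level u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(x, y, size, rng):
    """Fit a bivariate Gaussian copula to (x, y) and draw `size` synthetic pairs."""
    # Dependence is estimated on the normal scores, not on the raw values
    rho = correlation(normal_scores(x), normal_scores(y))
    xs, ys = sorted(x), sorted(y)
    pairs = []
    for _ in range(size):
        z1 = rng.gauss(0.0, 1.0)
        # Second standard normal, constructed to have correlation rho with z1
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        # Map uniform levels back through each variable's empirical marginal
        pairs.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return rho, pairs


# Toy data (made up): income loosely increasing with age
rng = random.Random(0)
age = [rng.uniform(18, 80) for _ in range(500)]
income = [1000 * a + rng.gauss(0, 8000) for a in age]
rho, synthetic = sample_gaussian_copula(age, income, 1000, rng)
```

Separating the marginals (handled by ranks and empirical quantiles) from the dependence (a single correlation in normal-score space) is the point of a copula model: each synthetic pair reuses observed values of each variable while the joint behaviour follows the fitted correlation.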
Created:
2019-04-02