TWINS dataset used for the experiments in the paper "How to select predictive models for causal inference?"
Source: NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/14674617
Description:
Dataset obtained from the shalit-lab GitHub repository and used in our experiments. Raw URL for the dataset: "https://raw.githubusercontent.com/shalit-lab/Benchmarks/master/Twins/Final_data_twins.csv".
Explanation of the dataset:
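As a minimal sketch of how the file could be read, assuming pandas is available, the snippet below wraps `pandas.read_csv` around the raw URL quoted above. The names `TWINS_URL` and `load_twins` are hypothetical, and nothing is assumed here about the column names in the file.

```python
import io

import pandas as pd

# Raw URL quoted in the dataset description.
TWINS_URL = (
    "https://raw.githubusercontent.com/shalit-lab/Benchmarks/"
    "master/Twins/Final_data_twins.csv"
)


def load_twins(source=TWINS_URL):
    """Read the Twins CSV from a URL, a local path, or a file-like object.

    `load_twins` is an illustrative helper, not part of the shalit-lab code.
    """
    return pd.read_csv(source)


# Offline illustration with a tiny in-memory CSV (made-up columns).
toy = load_twins(io.StringIO("a,b\n1,2\n3,4\n"))
print(toy.shape)
```

Calling `load_twins()` with no argument fetches the file from GitHub and therefore needs network access.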
Louizos et al. (2017) introduced the Twins dataset as an augmentation of real data on twin births and twin mortality rates in the USA from 1989-1991 (Almond et al., 2005). The treatment is "born the heavier twin". Because the outcome is recorded for both twins in a pair, both potential outcomes are, in one sense, observed.

Louizos et al. (2017) turn this into an observational dataset by hiding one twin of each pair. To ensure there is some confounding, they simulate the treatment assignment (which twin is heavier) as a function of the GESTAT10 covariate, the number of gestation weeks prior to birth. GESTAT10 is highly correlated with the outcome and is plausibly a cause of it, so conditioning the assignment on it should induce confounding.

The treatment is drawn from a sigmoid model based on GESTAT10, denoted $\mathbf{z}$, and on $\mathbf{x}$, the 45 other covariates:

$$\mathbf{t}_{i} \mid \mathbf{x}_{i}, \mathbf{z}_{i} \sim \operatorname{Bern}\left(\sigma\left(w_{o}^{\top} \mathbf{x}_{i}+w_{h}\left(\mathbf{z}_{i} / 10-0.1\right)\right)\right) \quad \text{with} \; w_{o} \sim \mathcal{N}(0, 0.1 \cdot I), \; w_{h} \sim \mathcal{N}(5, 0.1)$$

Furthermore, to make sure the twins are very similar, they limit the data to same-sex twin pairs. To focus on data with higher mortality rates, they further limit the dataset to twins born weighing less than 2 kg.
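The treatment simulation described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code of Louizos et al. (2017): the function name `simulate_treatment`, the seed handling, and the reading of the second argument of $\mathcal{N}(\cdot,\cdot)$ as a variance (so the standard deviation is $\sqrt{0.1}$) are all assumptions.

```python
import numpy as np


def simulate_treatment(x, z, seed=0):
    """Draw a binary treatment from the sigmoid model described in the text.

    x : (n, d) array of the other covariates (d = 45 in the description).
    z : (n,) array of GESTAT10 values.
    Returns (t, p): sampled treatments and their Bernoulli probabilities.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    d = x.shape[1]

    # Assumption: 0.1 is a variance, so the standard deviation is sqrt(0.1).
    w_o = rng.normal(0.0, np.sqrt(0.1), size=d)  # w_o ~ N(0, 0.1 * I)
    w_h = rng.normal(5.0, np.sqrt(0.1))          # w_h ~ N(5, 0.1)

    logits = x @ w_o + w_h * (z / 10.0 - 0.1)
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid
    t = rng.binomial(1, p)                       # t_i ~ Bern(p_i)
    return t, p


# Toy usage with synthetic inputs (not the real Twins covariates).
toy_rng = np.random.default_rng(1)
x_toy = toy_rng.normal(size=(200, 45))
z_toy = toy_rng.integers(1, 11, size=200)
t_toy, p_toy = simulate_treatment(x_toy, z_toy)
print(t_toy.mean())
```

Because $w_h$ is centred at 5, larger values of `z` push the treatment probability up, which is what ties the assignment to gestation length and produces the confounding.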
References:
Almond, D., Chay, K. Y., & Lee, D. S. (2005). The costs of low birth weight. The Quarterly Journal of Economics, 120(3), 1031-1083.
Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., & Welling, M. (2017). Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems (pp. 6446-6456).
Neal, B., Huang, C.-W., & Raghupathi, S. (2021). RealCause: Realistic causal inference benchmarking. arXiv:2011.15007 [cs, stat].
Created:
2025-01-16



