theaidevlab/WorldDev.2026.1073278
Hugging Face. Updated 2026-04-21; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/theaidevlab/WorldDev.2026.1073278
Resource description:
# ImageDeconfoundAid Replication Bundle
This folder is the upload-ready replication package for the article:
**Chinese vs. World Bank Development Projects: Insights from Earth Observation and Computer Vision on Wealth Gains in Africa, 2002-2013**
Published in *World Development*:
<https://www.sciencedirect.com/science/article/pii/S0305750X26000173>
The package is organized so that the same directory can be uploaded unchanged to both Hugging Face and Dataverse.
## What this bundle is for
This bundle supports two replication paths:
1. **Default paper reproduction** from bundled processed data and bundled model outputs.
2. **Optional heavy rerun of the image-based models** after users download the excluded satellite images locally and build TFRecords.
The default path is the intended starting point. It is substantially lighter and reproduces the consolidated results, descriptive figures, and manuscript-facing tables from the shipped replication assets.
## Why the entry points are split
The original project pipeline is documented in `Analysis/AidDeconfound_Master.R`. In the source project, the active runner `call_CI_Conf_5k_3yr.R` also serves additional within-unit robustness workflows (`unitFE` and `did`) and is not a clean public entry point by itself.
For replication packaging, this bundle therefore exposes a small set of stable numbered scripts in `code/`:
1. `code/00_verify_bundle.R`
2. `code/01_rebuild_paper_outputs.R`
3. `code/02_run_main_ate_optional.R`
4. `code/03_run_within_unit_robustness.R`
5. `code/04_consolidate_results.R`
6. `code/05_build_figures_tables.R`
7. `code/06_list_run_grid.R`
These wrappers keep all paths relative to this replication folder and avoid the machine-specific assumptions in the original project tree.
## Directory layout
`code/`
: replication entry points, shared helpers, and copies of the analysis scripts needed for reproduction.
`data/interim/`
: processed analysis inputs used by the model, figure, and table scripts.
`data/country_regions/`
: bundled country boundary support files used by the project.
`results/per_run_csv/Epoch5EarlyStopNewTreatDefLabelS_Run2/`
: per-run CSV outputs needed to rebuild consolidated results and robustness summaries.
`results/processed/`
: bundled consolidated outputs used to verify the shipped package and to inspect final replication products without rerunning the consolidation step.
`figures/manuscript/`
: manuscript-facing figure assets copied from the LaTeX archive.
`figures/`
: descriptive figures regenerated by the replication scripts.
`tables/`
: descriptive CSV tables and copied LaTeX macro/table files.
`env/`
: package manifests and install helpers.
`manifests/`
: checksum manifest and notes on excluded external requirements.
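
As a quick manual look at the layout above, a minimal shell sketch such as the following reports any listed directory that is absent. The function name `check_layout` is hypothetical and not part of the bundle; `code/00_verify_bundle.R` remains the bundle's own check.

```bash
# Hypothetical helper (not shipped in the bundle): report which of the
# directories named in the layout above are missing under a given root.
check_layout() {
  local root="${1:-.}"
  local missing=0
  local dir
  for dir in \
    code \
    data/interim \
    data/country_regions \
    results/per_run_csv/Epoch5EarlyStopNewTreatDefLabelS_Run2 \
    results/processed \
    figures/manuscript \
    figures \
    tables \
    env \
    manifests
  do
    if [ ! -d "$root/$dir" ]; then
      echo "missing: $dir"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every listed directory was found.
  [ "$missing" -eq 0 ]
}
```

From the bundle root, `check_layout .` prints nothing and returns success when all ten directories are present.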
## Included material
This bundle includes only files that are directly useful for reproducing the findings:
1. Processed DHS confounder, treatment, country, and sector inputs.
2. Per-run CSV outputs for the main image-confounding specifications.
3. Per-run CSV outputs for the bundled within-unit robustness analyses.
4. Consolidated result files already generated from the original project.
5. Manuscript figure assets.
6. Replication-facing scripts and environment manifests.
## Excluded material
The following are intentionally **not** included:
1. Extremely large raw satellite image tiles.
2. TFRecord files derived from those tiles.
3. Large preparation-stage artifacts that are not needed for the default replication path.
4. Miscellaneous legacy or exploratory outputs not required to reproduce the paper.
See `manifests/external_requirements.md` for the external items needed only for the optional image-heavy rerun.
Additional raw data related to the broader project is available separately at <https://huggingface.co/datasets/cjerzak/AfricaAidDeconfoundingAnalysis_ConlinThesis>. It is not required for the default replication path described here.
## Quick start
Open an R session or use `Rscript` from this folder.
### 1. Verify the shipped bundle
```bash
Rscript code/00_verify_bundle.R
```
This checks the expected directory structure, key required files, per-run CSV counts, and the bundled checksum manifest.
If you have already rerun scripts and intentionally changed bundled outputs, you can skip hash checking:
```bash
IMAGEDECONFOUND_SKIP_HASH=true Rscript code/00_verify_bundle.R
```
### 2. Rebuild the main paper-facing outputs from bundled assets
```bash
Rscript code/01_rebuild_paper_outputs.R
```
This runs:
1. `code/04_consolidate_results.R`
2. `code/05_build_figures_tables.R`
Expected locations after the rebuild:
1. `results/processed/` for consolidated model outputs and robustness summaries.
2. `figures/` for regenerated descriptive figures.
3. `tables/` for descriptive tables and copied LaTeX macro/table files.
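
To confirm that the rebuild actually wrote something to those locations, a small sketch like the one below counts the files in each. The helper name `summarize_outputs` is an illustrative assumption, not a script in the bundle, and the counts it prints depend on your local state.

```bash
# Hypothetical helper: count files under each expected output location.
# Counts are recursive, so figures/ also includes figures/manuscript/.
summarize_outputs() {
  local root="${1:-.}"
  local dir n
  for dir in results/processed figures tables; do
    if [ -d "$root/$dir" ]; then
      n="$(find "$root/$dir" -type f | wc -l | tr -d ' ')"
    else
      n=0
    fi
    echo "$dir: $n file(s)"
  done
}
```

Running `summarize_outputs .` before and after `code/01_rebuild_paper_outputs.R` gives a rough indication of whether new files appeared.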
## Optional: rerun the within-unit robustness analyses
These paths use the bundled processed tabular inputs and do **not** require the excluded satellite image tiles.
```bash
Rscript code/03_run_within_unit_robustness.R
```
This reruns the `unitFE` and `did` branches of the analysis driver across the bundled parameter grid and writes outputs under:
`results/per_run_csv/Epoch5EarlyStopNewTreatDefLabelS_Run2/`
If you want to inspect the row-index grid before running subsets:
```bash
Rscript code/06_list_run_grid.R within_unit
```
You can also pass specific row indices directly to the runner:
```bash
Rscript code/03_run_within_unit_robustness.R 1 2 3
```
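
When running many rows, it can help to launch them one at a time and keep a log per row, so that one failure does not hide the others. The sketch below is one way to do that. It assumes the runner accepts a single row index as well as several; the function `run_rows`, the `LOG_DIR` and `RSCRIPT` variables, and the `logs/` directory are hypothetical and not part of the bundle.

```bash
# Hypothetical wrapper: run selected rows one at a time, one log file per row.
# ASSUMPTION: the runner accepts a single row index as its only argument.
run_rows() {
  local runner="$1"
  shift
  local log_dir="${LOG_DIR:-logs}"      # hypothetical log location
  local rscript="${RSCRIPT:-Rscript}"   # set RSCRIPT=echo for a dry run
  local failed=""
  local row
  mkdir -p "$log_dir"
  for row in "$@"; do
    if "$rscript" "$runner" "$row" > "$log_dir/row_${row}.log" 2>&1; then
      echo "row $row: ok"
    else
      echo "row $row: FAILED (see $log_dir/row_${row}.log)"
      failed="$failed $row"
    fi
  done
  # Succeed only when no row failed.
  [ -z "$failed" ]
}
```

For example, `run_rows code/03_run_within_unit_robustness.R 1 2 3` would run rows 1 to 3 in sequence and write `logs/row_1.log` through `logs/row_3.log`.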
Note: the original source code uses the same driver for both `unitFE` and `did`, and some output file names retain the historical `unitFE` naming convention even inside the DiD results directory. The replication package preserves that behavior but makes the directories explicit.
## Optional: rerun the main image-confounding models
This path is intentionally separate because it requires the excluded image assets.
### Step A. Obtain local image data
Use the optional Google Earth Engine helpers in:
`code/lib/optional/`
The bundle includes:
1. `code/lib/optional/runP_GrabData_EE.sh`
2. `code/lib/optional/get_images/GetImageRun_3y.py`
3. `code/lib/optional/get_images/GetImageRun_annual.py`
The default expected locations are:
1. `external_artifacts/images/dhs_tifs_5k_3yr/`
2. `external_artifacts/tfrecords/`
You can override those with environment variables:
1. `IMAGEDECONFOUND_IMAGE_ROOT`
2. `IMAGEDECONFOUND_TFRECORD_HOME`
3. `IMAGEDECONFOUND_CONDA_ENV`
### Step B. Inspect the row-index grid
```bash
Rscript code/06_list_run_grid.R main
```
The main image runner uses the same row-index convention as the original driver. Pick the row IDs you want to run.
### Step C. Build TFRecords if needed
If TFRecords are not already present, first generate them after placing the excluded image tiles locally:
```bash
IMAGEDECONFOUND_RESAVE_TFRECORDS=true Rscript code/02_run_main_ate_optional.R 1
```
### Step D. Run the main image-based analysis
```bash
Rscript code/02_run_main_ate_optional.R 1
```
The wrapper stops early if no TFRecords are available, which prevents accidental launches that cannot complete.
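
Before launching a long run, you may want to check for TFRecords yourself. The sketch below counts files in the TFRecord directory, using the documented default location unless `IMAGEDECONFOUND_TFRECORD_HOME` is set. The helper `count_tfrecords` is hypothetical, and it assumes that any regular file under that directory counts, since the pipeline's file naming is not described here.

```bash
# Hypothetical pre-flight check: count files in the TFRecord directory.
# ASSUMPTION: every regular file under the directory is a TFRecord.
count_tfrecords() {
  local dir="${IMAGEDECONFOUND_TFRECORD_HOME:-external_artifacts/tfrecords}"
  if [ ! -d "$dir" ]; then
    echo 0
    return 0
  fi
  find "$dir" -type f | wc -l | tr -d ' '
}

# Print a reminder when nothing is found; otherwise stay silent.
if [ "$(count_tfrecords)" -eq 0 ]; then
  echo "No TFRecords found; see Step C."
fi
```

This does not replace the wrapper's own early stop; it only makes the same condition visible before you start.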
## Environment setup
R package names are listed in `env/R_packages.txt`.
If you want a helper installer:
```bash
Rscript env/install_R_packages.R
```
The optional Python image-download path uses:
`env/requirements-image-download.txt`
## Notes on verification and manifests
`manifests/file_manifest.csv` records the shipped bundle state using relative paths, file sizes, SHA-256 hashes, and high-level file roles.
The manifest is most useful **before** rerunning scripts. Once you regenerate outputs locally, hash mismatches simply mean your local files no longer match the exact shipped upload state.
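
To see which files differ from the shipped state, you could compare hashes by hand. The sketch below is an illustration only. It assumes a header row, the relative path in column 1, the SHA-256 hash in column 3, and no commas inside fields; the actual column order of `manifests/file_manifest.csv` is not stated here, so check it first with `head -n 1 manifests/file_manifest.csv` and pass the right column numbers. It also assumes GNU `sha256sum` is installed (on macOS, `shasum -a 256` is the usual substitute). The function name `verify_manifest` is hypothetical.

```bash
# Hypothetical helper: compare files under ROOT against a CSV manifest.
# ASSUMPTIONS: header row; simple CSV without embedded commas;
# path and hash column numbers default to 1 and 3 (adjust as needed).
verify_manifest() {
  local root="$1" manifest="$2"
  local path_col="${3:-1}" hash_col="${4:-3}"
  # Skip the header, drop carriage returns and double quotes, then check rows.
  tail -n +2 "$manifest" | tr -d '\r"' | {
    bad=0
    while IFS= read -r line || [ -n "$line" ]; do
      [ -n "$line" ] || continue
      path="$(printf '%s\n' "$line" | cut -d, -f"$path_col")"
      want="$(printf '%s\n' "$line" | cut -d, -f"$hash_col" | tr 'A-F' 'a-f')"
      if [ ! -f "$root/$path" ]; then
        echo "absent: $path"
        bad=$((bad + 1))
        continue
      fi
      got="$(sha256sum "$root/$path" | cut -d' ' -f1)"
      if [ "$got" != "$want" ]; then
        echo "mismatch: $path"
        bad=$((bad + 1))
      fi
    done
    # Succeed only when every listed file exists and matches.
    [ "$bad" -eq 0 ]
  }
}
```

A call such as `verify_manifest . manifests/file_manifest.csv 1 3` prints one line per absent or changed file and returns a non-zero status if there are any.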
## Minimal reproduction checklist
For most users, the shortest replication path is:
1. `Rscript code/00_verify_bundle.R`
2. `Rscript code/01_rebuild_paper_outputs.R`
3. Inspect `results/processed/`, `figures/`, `tables/`, and `figures/manuscript/`
## Relationship to the original pipeline
This package is derived from the project workflow documented in `Analysis/AidDeconfound_Master.R`, but only includes the subset of code and data needed for direct replication. Large image files remain external by design.
Provider:
theaidevlab



