electricsheepafrica/african-voter-registration-quality
Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/electricsheepafrica/african-voter-registration-quality
---
license: cc-by-4.0
task_categories:
- tabular-classification
- tabular-regression
language:
- en
tags:
- governance
- elections
- voter-registration
- biometric
- sub-saharan-africa
- synthetic
- election-integrity
- public-administration
- lmic
pretty_name: African Voter Registration Quality
size_categories:
- 10K<n<100K
configs:
- config_name: baseline
  data_files: data/baseline.csv
  default: true
- config_name: high_integrity
  data_files: data/high_integrity.csv
- config_name: low_integrity
  data_files: data/low_integrity.csv
---
# African Voter Registration Quality
## Abstract
A synthetic dataset modeling voter registration quality indicators across 14 sub-Saharan African countries (2010–2025), parameterized from election commission reports, biometric registration studies, and international observer missions. The dataset contains 10,000 records for each of three electoral integrity scenarios (baseline, high_integrity, low_integrity), with 20 variables covering registration rates, biometric coverage, ghost voter rates, duplicate registrations, deceased registrations, verification errors, and composite quality scores. It is designed for ML classification, regression, and election integrity research in the governance domain.
## 1. Introduction
Voter registration quality is fundamental to electoral integrity across sub-Saharan Africa. Biometric voter registration (BVR) has been adopted by many countries since Ghana pioneered it in 2012, with Kenya enrolling 14.3 million voters in 2013 and 22.1 million by 2022. However, significant challenges persist: opposition claims in Zambia pointed to 1.78 million potential "ghost voters" in 2026, Ghana identified about 1 million deceased persons on its electoral roll, and Uganda experienced widespread biometric verification failures during elections.
The adoption of BVR varies widely: Ghana and Kenya achieve biometric coverage of 95% or more, while DRC and Uganda have lower coverage (roughly 60–75%). Ghost voter rates (duplicate plus deceased registrations) range from 2–3% in high-integrity systems to 15–20% in weak institutional environments. No equivalent ML-ready dataset exists on Hugging Face for these indicators, which leaves a gap for election monitoring organizations, development finance institutions (DFIs), governance researchers, and political risk analysts.
## 2. Methodology
### 2.1 Target Population
Election-cycle records for 14 sub-Saharan African countries spanning 2010–2025, across four region types (urban, peri-urban, rural, remote rural).
**Countries included:** Nigeria, DRC, Kenya, Ghana, Tanzania, Uganda, Rwanda, Botswana, Mauritius, South Africa, Senegal, Namibia, Cameroon, Zambia.
### 2.2 Variable Selection
Variables were selected based on availability in election commission reports and international observer missions, following the EISA (Electoral Institute for Sustainable Democracy in Africa) framework adapted for SSA contexts.
### 2.3 Parameterization
Parameters are grounded in official reports and published literature, supplemented by media reports. Sources are used in the following order of priority:
| Priority | Source Type | Examples Used |
|----------|-----------|---------------|
| 1 | National election commission reports | Ghana EC, Nigeria INEC, Kenya IEBC, Tanzania INEC |
| 2 | International observer missions | EISA, AU, EU Election Observation Missions |
| 3 | Academic studies | Biometric Digital-ID in Africa (2025), EISA Comparative Analysis |
| 4 | Media reports | Biometric Update, Mo Ibrahim Foundation |
#### Parameterization Evidence Table
| Parameter | Value Used | Source | DOI/URL | Year | Note |
|-----------|-----------|--------|---------|------|------|
| Ghana biometric coverage | 98% | Ghana Electoral Commission | biometricupdate.com | 2024 | 18.6M biometric voters |
| Kenya biometric coverage | 95% | Kenya IEBC | biometricupdate.com | 2022 | 22.1M registered voters |
| Nigeria biometric coverage | 85% | Nigeria INEC | inececnigeria.org | 2023 | 93.5M registered voters |
| Zambia ghost voter discrepancy | 1.78M (est. 15%) | ECZ allegations | facebook.com | 2026 | Opposition claims |
| Ghana deceased on roll | 1M (~5%) | Ghana Parliament | biometricupdate.com | 2020 | 20M total register |
| Duplicate rate baseline | 5% | EISA analysis | eisa.org | 2010 | Across SSA |
| Deceased rate baseline | 3% | Nigeria INEC report | inececnigeria.org | 2024 | Post-2023 election |
| Biometric adoption year | 2012 (Ghana) | Literature review | bioqube.ai | 2024 | First in SSA |
| BVR enrollment Kenya | 14.3M (2013) → 22.1M (2022) | Kenya IEBC | bioqube.ai | 2024 | Growth over decade |
### 2.4 Scenario Design
| Scenario | Description | Registration Mult | Biometric Mult | Ghost Rate (mean) |
|----------|-------------|-------------------|----------------|-------------------|
| **baseline** | Current SSA voter registration landscape (2010–2025) | 1.0× | 1.0× | ~0.04 |
| **high_integrity** | Countries with strong electoral commissions and technology | 1.15× | 1.2× | ~0.02 |
| **low_integrity** | Countries with weak institutions, conflict, or limited technology | 0.85× | 0.7× | ~0.10 |
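The multipliers in the table can be written down as a small configuration. The sketch below is only an illustration: the card does not state how the generator applies the multipliers, so the function assumes they scale a baseline rate and that the result is clipped to the ranges listed in the schema (Section 3.1). The names `SCENARIOS` and `apply_scenario` are hypothetical and do not come from `generate_dataset.py`.

```python
# Scenario multipliers copied from the scenario design table.
SCENARIOS = {
    "baseline": {"registration_mult": 1.0, "biometric_mult": 1.0},
    "high_integrity": {"registration_mult": 1.15, "biometric_mult": 1.2},
    "low_integrity": {"registration_mult": 0.85, "biometric_mult": 0.7},
}


def clip(value, low, high):
    """Keep a value inside the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_scenario(registration_rate, biometric_coverage_rate, scenario):
    """Scale baseline rates by the scenario multipliers.

    Assumption: multiply first, then clip to the schema ranges
    (0.30-0.99 for registration_rate, 0.10-1.00 for biometric coverage).
    """
    mult = SCENARIOS[scenario]
    reg = clip(registration_rate * mult["registration_mult"], 0.30, 0.99)
    bio = clip(biometric_coverage_rate * mult["biometric_mult"], 0.10, 1.00)
    return reg, bio


# Baseline means from the summary statistics, pushed through each scenario.
for name in SCENARIOS:
    print(name, apply_scenario(0.760, 0.845, name))
```

Under this assumption, a biometric multiplier of 1.2 applied to a baseline coverage of 0.845 saturates at the schema maximum of 1.00.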
### 2.5 Generation Process
The generator follows a directed acyclic graph (DAG) with topological sampling order:
1. **Root nodes** (sampled independently): country (weighted by population), year (uniform 2010–2025), region_type
2. **Intermediate nodes** (sampled conditionally): population, eligible_voters, biometric_coverage, registration_rate, registered_voters, biometric_registered, duplicate_rate, duplicate_registrations, deceased_rate, deceased_registrations, verification_errors, error_rate
3. **Leaf nodes** (derived): ghost_voters, ghost_rate, registration_completeness_score, registration_quality classification
Key technique: Duplicate and deceased rates are sampled from a bivariate normal distribution with correlation r ≈ 0.85, reflecting that both are components of ghost voters and share common institutional causes.
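The correlated sampling step can be sketched with NumPy as follows. This is a minimal illustration, not the code in `generate_dataset.py`: the means and standard deviations are placeholder assumptions chosen only so that the summed ghost rate lands near 0.04, and clipping to the schema ranges is also an assumption. Only the correlation of 0.85 and the definition of ghost voters as duplicates plus deceased come from this card.

```python
import numpy as np


def sample_ghost_components(n, rho=0.85, seed=42):
    """Draw correlated duplicate and deceased rates, then derive ghost_rate."""
    rng = np.random.default_rng(seed)

    # Hypothetical means and SDs (assumptions, not the generator's values).
    mean = np.array([0.025, 0.016])  # duplicate_rate, deceased_rate
    sd = np.array([0.008, 0.005])

    # Covariance matrix for a bivariate normal with correlation rho.
    off_diag = rho * sd[0] * sd[1]
    cov = np.array([[sd[0] ** 2, off_diag],
                    [off_diag, sd[1] ** 2]])

    draws = rng.multivariate_normal(mean, cov, size=n)

    # Assumed step: clip each rate to the range given in the schema.
    duplicate_rate = np.clip(draws[:, 0], 0.005, 0.25)
    deceased_rate = np.clip(draws[:, 1], 0.002, 0.15)

    # Both rates share the same denominator (registered voters), so the
    # ghost rate is their sum, capped at the schema maximum of 0.30.
    ghost_rate = np.minimum(duplicate_rate + deceased_rate, 0.30)
    return duplicate_rate, deceased_rate, ghost_rate


dup, dec, ghost = sample_ghost_components(10_000)
print("observed r:", round(float(np.corrcoef(dup, dec)[0, 1]), 3))
print("mean ghost_rate:", round(float(ghost.mean()), 3))
```

Clipping pulls the observed correlation slightly below the nominal value, which is one reason an observed r can differ from its target.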
## 3. Dataset Description
### 3.1 Schema
| Column | Type | Units | Range | Description |
|--------|------|-------|-------|-------------|
| record_id | int | — | 1–10,000 | Unique record identifier |
| country | categorical | — | 14 countries | Sub-Saharan African country |
| year | int | year | 2010–2025 | Election cycle year |
| region_type | categorical | — | 4 types | urban, peri_urban, rural, remote_rural |
| population_millions | float | millions | varies | Estimated national population for that year |
| eligible_voters | int | persons | varies | Estimated eligible voting-age population |
| registered_voters | int | persons | varies | Registered voters |
| registration_rate | float | ratio | 0.30–0.99 | Registered voters / eligible voters |
| biometric_registered | int | persons | varies | Voters registered with biometric data |
| biometric_coverage_rate | float | ratio | 0.10–1.00 | Biometric registrations / total registrations |
| duplicate_registrations | int | persons | varies | Estimated duplicate registrations |
| duplicate_rate | float | ratio | 0.005–0.25 | Duplicate registrations / registered voters |
| deceased_registrations | int | persons | varies | Estimated deceased persons on register |
| deceased_rate | float | ratio | 0.002–0.15 | Deceased registrations / registered voters |
| ghost_voters | int | persons | varies | Total ghost voters (duplicates + deceased) |
| ghost_rate | float | ratio | 0.00–0.30 | Ghost voters / registered voters |
| verification_errors | int | persons | varies | Estimated verification errors |
| error_rate | float | ratio | 0.01–0.20 | Verification errors / registered voters |
| registration_completeness_score | float | score | 0.0–1.0 | Composite quality score |
| registration_quality | categorical | — | 4 levels | high (≥0.75), moderate (0.55–0.75), low (0.35–0.55), very_low (<0.35) |
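The descriptions in the schema imply a few identities between columns, which can serve as a quick sanity check after loading a scenario. The sketch below assumes the published rates are rounded, so it compares them within a tolerance; the tolerance value and the function name `check_derived_columns` are illustrative assumptions, and the card does not state whether the released CSV files satisfy these identities exactly.

```python
import pandas as pd


def check_derived_columns(df, tol=0.005):
    """Flag rows that break the identities stated in the schema.

    `tol` is an assumed allowance for rounding in the published rates.
    """
    registered = df["registered_voters"]
    return pd.DataFrame({
        # ghost_voters = duplicates + deceased
        "ghost_voters_is_sum": df["ghost_voters"]
        == df["duplicate_registrations"] + df["deceased_registrations"],
        # ghost_rate = ghost_voters / registered_voters
        "ghost_rate_matches": (
            df["ghost_rate"] - df["ghost_voters"] / registered
        ).abs() <= tol,
        # registration_rate = registered_voters / eligible_voters
        "registration_rate_matches": (
            df["registration_rate"] - registered / df["eligible_voters"]
        ).abs() <= tol,
    })


# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "eligible_voters": [1250, 2000],
    "registered_voters": [1000, 1500],
    "registration_rate": [0.80, 0.75],
    "duplicate_registrations": [30, 40],
    "deceased_registrations": [20, 20],
    "ghost_voters": [50, 70],  # second row is deliberately inconsistent
    "ghost_rate": [0.05, 0.04],
})
print(check_derived_columns(toy).all(axis=1))
```

For real data, replace `toy` with `pd.read_csv("data/baseline.csv")` and inspect any rows where a check is false.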
### 3.2 Classification Criteria
| Class | Criteria | Source |
|-------|----------|--------|
| **high** quality | completeness_score ≥ 0.75 | Ghana/Kenya-level systems |
| **moderate** quality | 0.55 ≤ completeness_score < 0.75 | Nigeria/Tanzania-level systems |
| **low** quality | 0.35 ≤ completeness_score < 0.55 | DRC/Uganda-level systems |
| **very_low** quality | completeness_score < 0.35 | Conflict-affected/failed states |
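The thresholds in the table translate directly into left-closed bins. The following sketch uses `pandas.cut` to recompute `registration_quality` from `registration_completeness_score`; the function name `classify_quality` is hypothetical, and the example scores are made up to exercise the boundaries.

```python
import numpy as np
import pandas as pd


def classify_quality(scores):
    """Map completeness scores to the four quality classes.

    Bins are closed on the left, so 0.75 is "high", 0.55 is "moderate"
    and 0.35 is "low", matching the criteria table.
    """
    bins = [-np.inf, 0.35, 0.55, 0.75, np.inf]
    labels = ["very_low", "low", "moderate", "high"]
    return pd.cut(scores, bins=bins, labels=labels, right=False)


# Illustrative scores chosen to sit on and around the class boundaries.
scores = pd.Series([0.20, 0.35, 0.55, 0.74, 0.75, 1.00])
print(classify_quality(scores).tolist())
```

Comparing this output with the stored `registration_quality` column is a simple way to confirm that a loaded file follows the documented thresholds.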
### 3.3 Summary Statistics (baseline scenario)
| Variable | Mean | SD | Min | Max |
|----------|------|-----|-----|-----|
| registration_rate | 0.760 | 0.158 | 0.300 | 0.990 |
| biometric_coverage_rate | 0.845 | 0.150 | 0.100 | 1.000 |
| ghost_rate | 0.041 | 0.023 | 0.000 | 0.300 |
| error_rate | 0.024 | 0.010 | 0.010 | 0.200 |
| registration_completeness_score | 0.765 | 0.145 | 0.000 | 1.000 |
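A table of this shape can be rebuilt from any scenario file with a few lines of pandas. This is a sketch under one assumption worth flagging: pandas computes the sample standard deviation (`ddof=1`), and the card does not say which convention was used for the SD column. The helper name `summarize` is illustrative.

```python
import pandas as pd

# Columns reported in the summary statistics table.
SUMMARY_COLUMNS = [
    "registration_rate",
    "biometric_coverage_rate",
    "ghost_rate",
    "error_rate",
    "registration_completeness_score",
]


def summarize(df, columns=SUMMARY_COLUMNS):
    """Return mean, SD, min and max per variable, rounded to 3 decimals."""
    present = [c for c in columns if c in df.columns]
    return df[present].agg(["mean", "std", "min", "max"]).T.round(3)


# Toy frame for illustration; use pd.read_csv("data/baseline.csv") for real data.
toy = pd.DataFrame({
    "registration_rate": [0.5, 0.7, 0.9],
    "ghost_rate": [0.02, 0.04, 0.06],
})
print(summarize(toy))
```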
## 4. Validation
### 4.1 Prevalence Fidelity
| Outcome | Target Range | Observed (baseline) | Status |
|---------|-------------|-------------------|--------|
| Registration quality: high | 15–35% | 58.1% | FAIL |
| Registration quality: moderate | 30–50% | 32.6% | PASS |
| Registration quality: low | 15–35% | 8.8% | FAIL |
| Registration quality: very_low | 5–15% | 0.4% | FAIL |
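Note: Prevalence targets were derived from expert estimates, while the observed distribution follows from the parameter ranges used in generation. Three of the four classes fall outside their target ranges: the baseline scenario is skewed toward the high class (58.1%), and the low and very_low classes are underrepresented.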
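The prevalence check can be reproduced by comparing class shares with the target ranges. The sketch below is not taken from `validate_dataset.py`; it assumes the target bounds are inclusive, and the name `prevalence_check` and the toy labels are illustrative.

```python
import pandas as pd

# Target ranges copied from the prevalence table, as fractions.
TARGETS = {
    "high": (0.15, 0.35),
    "moderate": (0.30, 0.50),
    "low": (0.15, 0.35),
    "very_low": (0.05, 0.15),
}


def prevalence_check(labels):
    """Compare observed class shares with the target ranges."""
    shares = labels.value_counts(normalize=True)
    rows = []
    for cls, (low, high) in TARGETS.items():
        share = float(shares.get(cls, 0.0))
        # Assumption: bounds are inclusive on both sides.
        status = "PASS" if low <= share <= high else "FAIL"
        rows.append({"class": cls, "observed": round(share, 3), "status": status})
    return pd.DataFrame(rows)


# Toy labels with a skew toward "high"; not the real class counts.
toy_labels = pd.Series(["high"] * 58 + ["moderate"] * 33 + ["low"] * 9)
print(prevalence_check(toy_labels))
```

Because the high class dominates, classifiers trained on the baseline scenario may benefit from stratified splits or class weighting.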
### 4.2 Distribution Quality
All continuous variables pass moment checks against literature benchmarks across all three scenarios, except error_rate, which falls slightly below its target range.
### 4.3 Correlation Structure
| Pair | Target r | Observed r | Status |
|------|----------|-----------|--------|
| registration_rate ↔ biometric_coverage_rate | 0.45 | 0.619 | PASS |
| biometric_coverage_rate ↔ ghost_rate | −0.55 | −0.286 | FAIL |
| registration_rate ↔ ghost_rate | −0.25 | −0.168 | PASS |
| duplicate_rate ↔ deceased_rate | 0.85 | 0.857 | PASS |
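The observed correlations can be recomputed with pandas. The card does not state the tolerance behind the PASS/FAIL column, so this sketch reports only the Pearson correlation and its absolute deviation from the target, leaving the threshold to the reader. The name `correlation_report` is hypothetical.

```python
import pandas as pd

# Target correlations copied from the correlation structure table.
TARGET_R = {
    ("registration_rate", "biometric_coverage_rate"): 0.45,
    ("biometric_coverage_rate", "ghost_rate"): -0.55,
    ("registration_rate", "ghost_rate"): -0.25,
    ("duplicate_rate", "deceased_rate"): 0.85,
}


def correlation_report(df):
    """Return observed Pearson r and its distance from each target."""
    rows = []
    for (a, b), target in TARGET_R.items():
        observed = float(df[a].corr(df[b]))  # Pearson by default
        rows.append({
            "pair": f"{a} ~ {b}",
            "target_r": target,
            "observed_r": round(observed, 3),
            "abs_deviation": round(abs(observed - target), 3),
        })
    return pd.DataFrame(rows)


# Usage, assuming the CSV is available locally:
# report = correlation_report(pd.read_csv("data/baseline.csv"))
# print(report)
```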
### 4.4 Cross-Scenario Monotonicity
| Metric | High Integrity | Baseline | Low Integrity | Monotonic? |
|--------|---------------|----------|---------------|-----------|
| ghost_rate (mean) | 0.023 | 0.041 | 0.101 | Yes |
| completeness_score (mean) | 0.852 | 0.765 | 0.584 | Yes |
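The monotonicity check amounts to confirming that each metric moves in one direction as integrity decreases. A minimal sketch using the means from the table, with the hypothetical helper `is_monotonic`:

```python
def is_monotonic(values, increasing=True):
    """Return True if values are strictly increasing (or strictly decreasing)."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a < b for a, b in pairs)
    return all(a > b for a, b in pairs)


# Means from the table, ordered high_integrity, baseline, low_integrity.
ghost_rate_means = [0.023, 0.041, 0.101]
completeness_means = [0.852, 0.765, 0.584]

# Ghost rate should rise and completeness should fall as integrity drops.
print(is_monotonic(ghost_rate_means, increasing=True))     # True
print(is_monotonic(completeness_means, increasing=False))  # True
```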
### 4.5 Diagnostic Plots

## 5. Usage
### 5.1 Loading with Hugging Face `datasets`
```python
from datasets import load_dataset

# Load the baseline scenario (the default config)
ds = load_dataset("electricsheepafrica/african-voter-registration-quality")

# Load a specific scenario by config name
ds_low = load_dataset("electricsheepafrica/african-voter-registration-quality", "low_integrity")
```
### 5.2 Loading directly from CSV
```python
import pandas as pd
df = pd.read_csv("data/baseline.csv")
print(df.shape)
print(df.describe())
```
### 5.3 Regenerating with custom parameters
```bash
# Install dependencies
pip install numpy pandas scipy matplotlib
# Generate baseline (10K records)
python generate_dataset.py --scenario baseline --n 10000 --seed 42
# Generate all scenarios
for scenario in baseline high_integrity low_integrity; do
  python generate_dataset.py --scenario "$scenario" --n 10000 --seed 42
done
# Run validation
python validate_dataset.py
```
## 6. Limitations & Ethical Considerations
1. **Synthetic data**: This dataset is synthetically generated and must not be used as a substitute for real electoral statistics in policy decisions, election monitoring, or official reporting.
2. **Data gaps**: Several countries (DRC, Cameroon, Zambia) lack comprehensive published voter registration quality statistics. Parameters for these countries are estimated from World Justice Project (WJP) Rule of Law Index scores, population ratios, and peer-country benchmarks rather than direct administrative data.
3. **Definition inconsistency**: "Ghost voters" is defined differently across jurisdictions (duplicates, deceased, ineligible). The dataset uses a unified ghost_rate metric that may not match any single country's definition.
4. **Biometric coverage methodology**: Countries report biometric coverage differently (registration-time capture vs. verification-time use). These methodological differences are smoothed in the synthetic data.
5. **Informal registration excluded**: Some countries have informal or community-based voter registration processes not captured in official statistics.
6. **Temporal simplification**: The model does not capture specific election-cycle dynamics, registration drives, or technology deployment timelines.
7. **No individual-level data**: Records represent country-region aggregates, not individual voters. No personally identifiable information is modeled.
## 7. References
1. EISA, *Voter Registration in Africa: A Comparative Analysis*, 2010.
2. Ghana Electoral Commission, *Biometric Voter Registration Report 2024*.
3. Nigeria INEC, *Report of the 2023 General Election*, 2024.
4. Kenya IEBC, *Voter Registration Statistics 2022*.
5. South Africa IEC, *Voter Registration Statistics 2024*.
6. Tanzania INEC, *Biometric Registration Announcement 2024*.
7. Cameroon ELECAM, *Biometric Registration Drive 2024*.
8. Senegal, *Targeted Biometric ID Registration 2025*.
9. Uganda, *Biometric Verification Challenges 2026*.
10. Zambia, *Ghost Voter Allegations 2026*.
11. Paradigm HQ, *Biometric Digital-ID in Africa: Progress and Challenges*, 2025.
12. Mo Ibrahim Foundation, *2024 Elections in Africa Brief*.
13. World Justice Project, *Rule of Law Index 2025*.
14. Afrobarometer, *Governance and Elections Surveys 2017–2024*.
## Citation
```bibtex
@dataset{esa_voter_registration_2026,
  title={African Voter Registration Quality},
  author={{Electric Sheep Africa}},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/electricsheepafrica/african-voter-registration-quality},
  license={CC-BY-4.0}
}
```
## License
CC-BY-4.0
Provider: electricsheepafrica



