electricsheepafrica/africa-demographics-zimbabwe
Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-demographics-zimbabwe
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- tabular-classification
- other
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- demographics
- health
- zwe
pretty_name: "Zimbabwe - Subnational Demographic and Health Data"
dataset_info:
  splits:
  - name: train
    num_examples: 1075
  - name: test
    num_examples: 268
---
# Zimbabwe - Subnational Demographic and Health Data
**Publisher:** The DHS Program · **Source:** [HDX](https://data.humdata.org/dataset/dhs-subnational-data-for-zimbabwe) · **License:** `hdx-other` · **Updated:** 2026-02-24
---
## Abstract
Contains data from the [DHS data portal](https://api.dhsprogram.com/). There is also a dataset containing [Zimbabwe - National Demographic and Health Data](https://data.humdata.org/dataset/dhs-data-for-zimbabwe) on HDX.
The DHS Program Application Programming Interface (API) provides software developers access to aggregated indicator data from The Demographic and Health Surveys (DHS) Program. The API can be used to create various applications to help analyze, visualize, explore and disseminate data on population, health, HIV, and nutrition from more than 90 countries.
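As a rough illustration of how such an API request might be assembled, the sketch below builds a query URL for subnational indicator data. The endpoint path and the parameter names (`countryIds`, `indicatorIds`, `breakdown`, `f`) are assumptions based on the public DHS API and should be checked against its documentation; `build_dhs_query` is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the DHS API documentation before use.
DHS_DATA_ENDPOINT = "https://api.dhsprogram.com/rest/dhs/data"


def build_dhs_query(country_code, indicator_ids, breakdown="subnational"):
    """Return a query URL for aggregated indicator data (hypothetical helper)."""
    params = {
        "countryIds": country_code,               # assumed parameter name
        "indicatorIds": ",".join(indicator_ids),  # assumed parameter name
        "breakdown": breakdown,                   # assumed parameter name
        "f": "json",                              # assumed response-format flag
    }
    # Keep commas unescaped so the indicator list stays readable.
    return f"{DHS_DATA_ENDPOINT}?{urlencode(params, safe=',')}"


# Indicator IDs taken from the sample values listed in this card.
url = build_dhs_query("ZW", ["FE_FRTR_W_TFR", "RH_DELP_C_DHF"])
print(url)
```

The URL can then be passed to any HTTP client; this dataset already contains the result of such queries, so the sketch is only useful for refreshing or extending the data.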
Each row in this dataset represents one indicator value for a first-level administrative unit in a given survey. The data were last updated on HDX on 2026-02-24. Geographic scope: **ZWE**.
*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*
---
## Dataset Characteristics
| | |
|---|---|
| **Domain** | Public health |
| **Unit of observation** | First-level administrative unit observations |
| **Rows (total)** | 1,344 |
| **Columns** | 30 (14 numeric, 16 categorical, 0 datetime) |
| **Train split** | 1,075 rows |
| **Test split** | 268 rows |
| **Geographic scope** | ZWE |
| **Publisher** | The DHS Program |
| **HDX last updated** | 2026-02-24 |
---
## Variables
**Geographic**: `iso3` (ZWE), `location` (Manicaland, Mashonaland Central, Mashonaland East), `dhs_countrycode` (ZW), `countryname` (Zimbabwe), `surveyyear` (range 1988–2015) and 8 others.
**Outcome / Measurement**: `value` (range 0.4–126.0), `istotal` (constant 0).
**Identifier / Metadata**: `dataid` (range 84203–7980940), `indicatorid` (RH_DELP_C_DHF, CH_DIAT_C_ORT, FE_FRTR_W_TFR), `characteristicid` (range 457001–457010), `characteristiclabel` (Manicaland, Mashonaland Central, Mashonaland East), `ispreferred` (range 0–1) and 3 others.
**Other**: `indicator` (Place of delivery: Health facility, Treatment of diarrhea: Either ORS or RHF, Total fertility rate 15-49), `precision` (range 0–1), `indicatororder` (range 11763080–260321010), `characteristicorder` (range 1457001–1457010), `denominatorweighted` (range 12.0–2865.0) and 2 others.
---
## Quick Start
```python
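from datasets import load_dataset

# Download both splits from the Hugging Face Hub
ds = load_dataset("electricsheepafrica/africa-demographics-zimbabwe")

# Convert each split to a pandas DataFrame
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()

print(train.shape)   # (rows, columns) of the train split
print(train.head())  # first five rows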
```
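Because the data is in long format (one row per indicator, location and survey), a common next step is to pivot a single indicator into a location-by-year table. The sketch below shows one way to do that on a tiny made-up frame that reuses the card's column names; the values are illustrative only, `indicator_table` is a hypothetical helper, and filtering on `ispreferred == 1` assumes that flag marks the recommended estimate.

```python
import pandas as pd

# Tiny made-up frame using this card's column names; values are NOT real data.
sample = pd.DataFrame({
    "location": ["Manicaland", "Manicaland", "Mashonaland East",
                 "Mashonaland East", "Manicaland"],
    "indicatorid": ["FE_FRTR_W_TFR"] * 4 + ["RH_DELP_C_DHF"],
    "surveyyear": [2010, 2015, 2010, 2015, 2015],
    "value": [4.8, 4.9, 4.5, 4.6, 80.0],
    "ispreferred": [1, 1, 1, 1, 1],
})


def indicator_table(df, indicator_id):
    """Pivot one indicator into a location x survey-year table (hypothetical helper)."""
    # Keep only the requested indicator and rows flagged as preferred (assumed meaning).
    subset = df[(df["indicatorid"] == indicator_id) & (df["ispreferred"] == 1)]
    # Average in case several rows share the same location and year.
    return subset.pivot_table(
        index="location", columns="surveyyear", values="value", aggfunc="mean"
    )


tfr = indicator_table(sample, "FE_FRTR_W_TFR")
print(tfr)
```

With the real data, `sample` would be replaced by the `train` DataFrame from the Quick Start snippet.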
---
## Schema
| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `iso3` | object | 0.0% | ZWE |
| `location` | object | 0.0% | Manicaland, Mashonaland Central, Mashonaland East |
| `dataid` | int64 | 0.0% | 84203.0 – 7980940.0 (mean 4177727.2031) |
| `indicator` | object | 0.0% | Place of delivery: Health facility, Treatment of diarrhea: Either ORS or RHF, Total fertility rate 15-49 |
| `value` | float64 | 0.0% | 0.4 – 126.0 (mean 40.7255) |
| `precision` | int64 | 0.0% | 0.0 – 1.0 (mean 0.9115) |
| `dhs_countrycode` | object | 0.0% | ZW |
| `countryname` | object | 0.0% | Zimbabwe |
| `surveyyear` | int64 | 0.0% | 1988.0 – 2015.0 (mean 2004.7582) |
| `surveyid` | object | 0.0% | ZW2010DHS, ZW2015DHS, ZW2005DHS |
| `indicatorid` | object | 0.0% | RH_DELP_C_DHF, CH_DIAT_C_ORT, FE_FRTR_W_TFR |
| `indicatororder` | int64 | 0.0% | 11763080.0 – 260321010.0 (mean 96204564.7693) |
| `indicatortype` | object | 0.0% | I |
| `characteristicid` | int64 | 0.0% | 457001.0 – 457010.0 (mean 457005.4844) |
| `characteristicorder` | int64 | 0.0% | 1457001.0 – 1457010.0 (mean 1457005.4844) |
| `characteristiccategory` | object | 0.0% | Region |
| `characteristiclabel` | object | 0.0% | Manicaland, Mashonaland Central, Mashonaland East |
| `byvariableid` | int64 | 0.0% | 0.0 – 631001.0 (mean 17668.5751) |
| `byvariablelabel` | object | 72.2% | |
| `istotal` | int64 | 0.0% | 0.0 – 0.0 (mean 0.0) |
| `ispreferred` | int64 | 0.0% | 0.0 – 1.0 (mean 0.8757) |
| `sdrid` | object | 0.0% | |
| `regionid` | object | 0.0% | |
| `surveyyearlabel` | float64 | 43.0% | 1988.0 – 2015.0 (mean 2002.6828) |
| `surveytype` | object | 0.0% | |
| `denominatorweighted` | float64 | 26.0% | 12.0 – 2865.0 (mean 497.3427) |
| `denominatorunweighted` | float64 | 26.0% | 25.0 – 1912.0 (mean 496.7367) |
| `levelrank` | float64 | 5.9% | 1.0 – 1.0 (mean 1.0) |
| `esa_source` | object | 0.0% | |
| `esa_processed` | object | 0.0% | |
---
## Numeric Summary
| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `dataid` | 84203.0 | 7980940.0 | 4177727.2031 | 4324618.0 |
| `value` | 0.4 | 126.0 | 40.7255 | 36.65 |
| `precision` | 0.0 | 1.0 | 0.9115 | 1.0 |
| `surveyyear` | 1988.0 | 2015.0 | 2004.7582 | 2005.0 |
| `indicatororder` | 11763080.0 | 260321010.0 | 96204564.7693 | 93906230.0 |
| `characteristicid` | 457001.0 | 457010.0 | 457005.4844 | 457005.0 |
| `characteristicorder` | 1457001.0 | 1457010.0 | 1457005.4844 | 1457005.0 |
| `byvariableid` | 0.0 | 631001.0 | 17668.5751 | 0.0 |
| `istotal` | 0.0 | 0.0 | 0.0 | 0.0 |
| `ispreferred` | 0.0 | 1.0 | 0.8757 | 1.0 |
| `surveyyearlabel` | 1988.0 | 2015.0 | 2002.6828 | 1999.0 |
| `denominatorweighted` | 12.0 | 2865.0 | 497.3427 | 409.0 |
| `denominatorunweighted` | 25.0 | 1912.0 | 496.7367 | 445.0 |
| `levelrank` | 1.0 | 1.0 | 1.0 | 1.0 |
---
## Curation
Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. Two columns with more than 80% missing values were removed: `cilow` and `cihigh`. One column was cast from string to a numeric or datetime type based on its parse-success rate (threshold: more than 85%). The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.
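The cleaning steps described above can be approximated with a few lines of pandas. This is a minimal sketch, not the actual ESA pipeline: the function name `curate`, the exact-match handling of missing-value markers, and the use of `DataFrame.sample` for the split are all assumptions, and the type-casting step is left out for brevity.

```python
import numpy as np
import pandas as pd

# Marker strings listed in the card; exact (case-sensitive) matching is an assumption.
MISSING_MARKERS = ["N/A", "null", "none", "-", "unknown", "no data", "#N/A"]


def curate(raw, max_missing=0.80, test_fraction=0.20, seed=42):
    """Approximate the described cleaning and 80/20 split (hypothetical helper)."""
    df = raw.copy()
    # Lowercase column names and replace spaces with underscores.
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    # Unify the missing-value markers to NaN.
    df = df.replace(MISSING_MARKERS, np.nan)
    # Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isna().mean()
    df = df.loc[:, missing_share <= max_missing]
    # Seeded random split; the card does not say which routine was actually used.
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Made-up input for illustration only.
raw = pd.DataFrame({
    "Value": ["1.5", "N/A", "2.0", "3.1", "4.2", "0.9", "5.5", "6.1", "7.3", "8.8"],
    "CILow": ["N/A"] * 9 + ["0.4"],
    "Location": ["Manicaland"] * 10,
})
train, test = curate(raw)
print(train.shape, test.shape)
```

Writing the result with `train.to_parquet("train.parquet", compression="snappy")` would match the storage format named above, provided a Parquet engine such as `pyarrow` is installed.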
---
## Limitations
- Data originates from The DHS Program and has not been independently validated by ESA.
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- The following columns have >20% missing values and should be treated with caution in modelling: `byvariablelabel`, `surveyyearlabel`, `denominatorweighted`, `denominatorunweighted`.
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/dhs-subnational-data-for-zimbabwe) for the publisher's own methodology notes and caveats.
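To see which columns cross the 20% missing-value mark in a given split, a small check such as the following can help. It is a sketch on made-up data; `high_missing_columns` is a hypothetical helper, and the sample values are placeholders rather than real records.

```python
import pandas as pd


def high_missing_columns(df, threshold=0.20):
    """Return the share of missing values for columns above the threshold."""
    share = df.isna().mean()
    return share[share > threshold].sort_values(ascending=False)


# Made-up frame reusing three of the card's column names.
sample = pd.DataFrame({
    "value": [40.1, 36.6, 12.0, 55.3],
    "denominatorweighted": [409.0, None, None, 512.0],
    "byvariablelabel": [None, None, None, "example label"],
})
flagged = high_missing_columns(sample)
print(flagged)
```

Running the same function on the `train` DataFrame from the Quick Start snippet gives a quick way to confirm the columns listed above before modelling.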
---
## Citation
```bibtex
@dataset{hdx_africa_demographics_zimbabwe,
title = {Zimbabwe - Subnational Demographic and Health Data},
author = {The DHS Program},
year = {2026},
url = {https://data.humdata.org/dataset/dhs-subnational-data-for-zimbabwe},
note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```
---
*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica) — Africa's ML dataset infrastructure. Lagos, Nigeria.*
Provider:
electricsheepafrica



