
electricsheepafrica/africa-demographics-zambia

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-demographics-zambia
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- tabular-classification
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- demographics
- health
- zmb
pretty_name: "Zambia - National Demographic and Health Data"
dataset_info:
  splits:
  - name: train
    num_examples: 164
  - name: test
    num_examples: 41
---

# Zambia - National Demographic and Health Data

**Publisher:** The DHS Program · **Source:** [HDX](https://data.humdata.org/dataset/dhs-data-for-zambia) · **License:** `hdx-other` · **Updated:** 2026-04-20

---

## Abstract

Contains data from the [DHS data portal](https://api.dhsprogram.com/). There is also a dataset containing [Zambia - Subnational Demographic and Health Data](https://data.humdata.org/dataset/dhs-subnational-data-for-zambia) on HDX.

The DHS Program Application Programming Interface (API) provides software developers access to aggregated indicator data from The Demographic and Health Surveys (DHS) Program. The API can be used to create various applications to help analyze, visualize, explore and disseminate data on population, health, HIV, and nutrition from more than 90 countries.

Each row in this dataset represents country-level aggregates. Data was last updated on HDX on 2026-04-20. Geographic scope: **ZMB**.

*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*

---

## Dataset Characteristics

| | |
|---|---|
| **Domain** | Public health |
| **Unit of observation** | Country-level aggregates |
| **Rows (total)** | 206 |
| **Columns** | 29 (14 numeric, 15 categorical, 0 datetime) |
| **Train split** | 164 rows |
| **Test split** | 41 rows |
| **Geographic scope** | ZMB |
| **Publisher** | The DHS Program |
| **HDX last updated** | 2026-04-20 |

---

## Variables

**Geographic:** `iso3` (ZMB), `dhs_countrycode` (ZM), `countryname` (Zambia), `surveyyear` (range 1992.0–2024.0), `surveyid` (ZM2018DHS, ZM2024DHS, ZM2007DHS) and 6 others.

**Outcome / Measurement:** `value` (range 0.4–729.0), `istotal` (range 1.0–1.0).

**Identifier / Metadata:** `dataid` (range 41515.0–834693.0), `indicatorid` (RH_DELP_C_DHF, CH_DIAT_C_ORT, CM_ECMR_C_IMR), `characteristicid` (range 1000.0–10000.0), `characteristiclabel` (Total, Total 15-49), `ispreferred` (range 0.0–1.0) and 3 others.

**Other:** `indicator` (Place of delivery: Health facility, Treatment of diarrhea: Either ORS or RHF, Infant mortality rate), `precision` (range 0.0–1.0), `indicatororder` (range 11763080.0–260321010.0), `characteristicorder` (range 0.0–10000.0), `denominatorweighted` (range 745.0–27859.0) and 3 others.
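
The variables listed above suggest a common access pattern: select one `indicatorid`, keep a single estimate per survey, and order the result by `surveyyear`. The following is a minimal sketch of that pattern, not part of the original card. It assumes that `ispreferred == 1` marks the estimate to use when a survey reports several variants of the same indicator; the helper name `indicator_series` is hypothetical, and the rows are invented for illustration rather than taken from the dataset.

```python
import pandas as pd

# Toy rows that reuse the card's column names.
# The numeric values are made up for illustration only.
rows = pd.DataFrame(
    {
        "indicatorid": [
            "CM_ECMR_C_IMR",
            "CM_ECMR_C_IMR",
            "CM_ECMR_C_IMR",
            "RH_DELP_C_DHF",
        ],
        "surveyyear": [2018, 2007, 2018, 2018],
        "ispreferred": [1, 1, 0, 1],
        "value": [42.0, 70.0, 45.0, 84.0],
    }
)


def indicator_series(df: pd.DataFrame, indicator_id: str) -> pd.Series:
    """Return the preferred values of one indicator, indexed by survey year.

    Assumption: rows with ispreferred == 1 are the ones to keep.
    """
    mask = (df["indicatorid"] == indicator_id) & (df["ispreferred"] == 1)
    selected = df.loc[mask].sort_values("surveyyear")
    return selected.set_index("surveyyear")["value"]


imr = indicator_series(rows, "CM_ECMR_C_IMR")
print(imr)
```

The same function can be applied to the `train` or `test` frames loaded from the Hub, since they share these column names.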

---

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("electricsheepafrica/africa-demographics-zambia")
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()
print(train.shape)
train.head()
```

---

## Schema

| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `iso3` | object | 0.0% | ZMB |
| `dataid` | int64 | 0.0% | 41515.0 – 834693.0 (mean 483738.1456) |
| `indicator` | object | 0.0% | Place of delivery: Health facility, Treatment of diarrhea: Either ORS or RHF, Infant mortality rate |
| `value` | float64 | 0.0% | 0.4 – 729.0 (mean 58.3112) |
| `precision` | int64 | 0.0% | 0.0 – 1.0 (mean 0.8252) |
| `dhs_countrycode` | object | 0.0% | ZM |
| `countryname` | object | 0.0% | Zambia |
| `surveyyear` | int64 | 0.0% | 1992.0 – 2024.0 (mean 2008.7573) |
| `surveyid` | object | 0.0% | ZM2018DHS, ZM2024DHS, ZM2007DHS |
| `indicatorid` | object | 0.0% | RH_DELP_C_DHF, CH_DIAT_C_ORT, CM_ECMR_C_IMR |
| `indicatororder` | int64 | 0.0% | 11763080.0 – 260321010.0 (mean 96782154.7087) |
| `indicatortype` | object | 0.0% | I |
| `characteristicid` | int64 | 0.0% | 1000.0 – 10000.0 (mean 2747.5728) |
| `characteristicorder` | int64 | 0.0% | 0.0 – 10000.0 (mean 1941.7476) |
| `characteristiccategory` | object | 0.0% | Total, Total 15-49 |
| `characteristiclabel` | object | 0.0% | Total, Total 15-49 |
| `byvariableid` | int64 | 0.0% | 0.0 – 631002.0 (mean 19529.5583) |
| `byvariablelabel` | object | 67.5% | Five years preceding the survey, Ten years preceding the survey, Three years preceding the survey |
| `istotal` | int64 | 0.0% | 1.0 – 1.0 (mean 1.0) |
| `ispreferred` | int64 | 0.0% | 0.0 – 1.0 (mean 0.8155) |
| `sdrid` | object | 0.0% | |
| `surveyyearlabel` | object | 0.0% | |
| `surveytype` | object | 0.0% | |
| `denominatorweighted` | float64 | 31.1% | 745.0 – 27859.0 (mean 7003.7183) |
| `denominatorunweighted` | float64 | 31.1% | 750.0 – 27883.0 (mean 7038.1408) |
| `cilow` | float64 | 75.2% | 5.3 – 586.0 (mean 100.5471) |
| `cihigh` | float64 | 75.2% | 6.6 – 872.0 (mean 141.9471) |
| `esa_source` | object | 0.0% | |
| `esa_processed` | object | 0.0% | |

---

## Numeric Summary

| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `dataid` | 41515.0 | 834693.0 | 483738.1456 | 546494.0 |
| `value` | 0.4 | 729.0 | 58.3112 | 42.0 |
| `precision` | 0.0 | 1.0 | 0.8252 | 1.0 |
| `surveyyear` | 1992.0 | 2024.0 | 2008.7573 | 2007.0 |
| `indicatororder` | 11763080.0 | 260321010.0 | 96782154.7087 | 83566070.0 |
| `characteristicid` | 1000.0 | 10000.0 | 2747.5728 | 1000.0 |
| `characteristicorder` | 0.0 | 10000.0 | 1941.7476 | 0.0 |
| `byvariableid` | 0.0 | 631002.0 | 19529.5583 | 0.0 |
| `istotal` | 1.0 | 1.0 | 1.0 | 1.0 |
| `ispreferred` | 0.0 | 1.0 | 0.8155 | 1.0 |
| `denominatorweighted` | 745.0 | 27859.0 | 7003.7183 | 5771.0 |
| `denominatorunweighted` | 750.0 | 27883.0 | 7038.1408 | 5894.0 |
| `cilow` | 5.3 | 586.0 | 100.5471 | 63.0 |
| `cihigh` | 6.6 | 872.0 | 141.9471 | 78.0 |

---

## Curation

Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. 2 column(s) with >80% missing values were removed: `regionid`, `levelrank`. The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.

---

## Limitations

- Data originates from The DHS Program and has not been independently validated by ESA.
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- The following columns have >20% missing values and should be treated with caution in modelling: `byvariablelabel`, `denominatorweighted`, `denominatorunweighted`, `cilow`, `cihigh`.
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/dhs-data-for-zambia) for the publisher's own methodology notes and caveats.

---

## Citation

```bibtex
@dataset{hdx_africa_demographics_zambia,
  title = {Zambia - National Demographic and Health Data},
  author = {The DHS Program},
  year = {2026},
  url = {https://data.humdata.org/dataset/dhs-data-for-zambia},
  note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```

---

*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica): Africa's ML dataset infrastructure. Lagos, Nigeria.*
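
The steps described under Curation (normalising column names, unifying missing-value markers, dropping mostly empty columns, and an 80/20 split with seed 42) can be sketched roughly as below. This is an illustration under stated assumptions, not the publisher's actual pipeline: the function name `curate`, the exact name normalisation, the case-sensitive marker matching, and the use of `DataFrame.sample` for the split are all guesses, and the input rows are invented.

```python
import pandas as pd

# Missing-value markers listed in the Curation section.
MISSING_MARKERS = ["N/A", "null", "none", "-", "unknown", "no data", "#N/A"]


def curate(raw, max_missing=0.80, test_frac=0.20, seed=42):
    """Hypothetical re-creation of the described cleaning and split."""
    df = raw.copy()
    # Assumption: lowercase the names and replace spaces with underscores.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Turn exact marker strings into NaN (case-sensitive match is an assumption).
    df = df.mask(df.isin(MISSING_MARKERS))
    # Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isna().mean()
    df = df.loc[:, missing_share <= max_missing]
    # Assumption: a seeded random sample forms the test partition.
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Invented input with one column that is 90% missing markers.
raw = pd.DataFrame(
    {
        "Value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "ByVariableLabel": ["N/A", "a", "-", "b", "c", "unknown", "d", "e", "f", "g"],
        "RegionId": ["N/A"] * 9 + ["x"],
    }
)

train, test = curate(raw)
print(list(train.columns), len(train), len(test))
```

Writing each partition with `DataFrame.to_parquet(path, compression="snappy")` would match the storage format the card describes, provided a Parquet engine such as `pyarrow` is installed.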

Provider:
electricsheepafrica