electricsheepafrica/africa-demographics-congo-dem-rep
Source: Hugging Face · Updated: 2026-04-20 · Indexed: 2026-04-26
Download link: https://hf-mirror.com/datasets/electricsheepafrica/africa-demographics-congo-dem-rep
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- tabular-classification
- other
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- demographics
- health
- cod
pretty_name: "Democratic Republic of the Congo - Subnational Demographic and Health Data"
dataset_info:
splits:
- name: train
num_examples: 1592
- name: test
num_examples: 398
---
# Democratic Republic of the Congo - Subnational Demographic and Health Data
**Publisher:** The DHS Program · **Source:** [HDX](https://data.humdata.org/dataset/dhs-subnational-data-for-democratic-republic-of-the-congo) · **License:** `hdx-other` · **Updated:** 2026-02-24
---
## Abstract
This dataset contains data from the [DHS data portal](https://api.dhsprogram.com/). HDX also hosts a national-level dataset, [Democratic Republic of the Congo - National Demographic and Health Data](https://data.humdata.org/dataset/dhs-data-for-democratic-republic-of-the-congo).
The DHS Program Application Programming Interface (API) provides software developers access to aggregated indicator data from The Demographic and Health Surveys (DHS) Program. The API can be used to create various applications to help analyze, visualize, explore and disseminate data on population, health, HIV, and nutrition from more than 90 countries.
Each row is a single observation for one first-level administrative unit. The data was last updated on HDX on 2026-02-24. Geographic scope: **COD**.
*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*
---
## Dataset Characteristics
| | |
|---|---|
| **Domain** | Public health |
| **Unit of observation** | First-level administrative unit observations |
| **Rows (total)** | 1,991 (the train and test splits listed below sum to 1,990) |
| **Columns** | 30 (14 numeric, 16 categorical, 0 datetime) |
| **Train split** | 1,592 rows |
| **Test split** | 398 rows |
| **Geographic scope** | COD |
| **Publisher** | The DHS Program |
| **HDX last updated** | 2026-02-24 |
---
## Variables
**Geographic** — `iso3` (COD), `location` (Kinshasa, Maniema, Kasaï Oriental), `dhs_countrycode` (CD), `countryname` (Congo Democratic Republic), `surveyyear` (range 2007.0–2023.0) and 8 others.
**Outcome / Measurement** — `value` (range 0.0–219.0), `istotal` (range 0.0–0.0).
**Identifier / Metadata** — `dataid` (range 61328.0–7975738.0), `indicatorid` (RH_DELP_C_DHF, CH_DIAT_C_ORT, DV_SPVL_W_POS), `characteristicid` (range 503010.0–503095.0), `characteristiclabel` (Kinshasa, Maniema, Kasaï Oriental), `ispreferred` (range 0.0–1.0) and 3 others.
**Other** — `indicator` (Place of delivery: Health facility, Treatment of diarrhea: Either ORS or RHF, Physical or sexual violence committed by husband/partner), `precision` (range 0.0–1.0), `indicatororder` (range 11763080.0–260321010.0), `characteristicorder` (range 1503010.0–1503095.0), `denominatorweighted` (range 13.0–4431.0) and 2 others.
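The columns above suggest a long layout, with one `value` per location, survey, and indicator. A common first step with such a layout is to pick one `indicatorid` and pivot to a location-by-year grid. The sketch below uses a small hand-made frame with the same column names; its values are invented for illustration and are not taken from the dataset. The helper name `indicator_grid` and the choice to keep only rows with `ispreferred == 1` are assumptions, not something this card prescribes.

```python
import pandas as pd

# Toy rows that mimic the dataset's column names; the values are invented.
toy = pd.DataFrame(
    {
        "location": ["Kinshasa", "Kinshasa", "Maniema", "Maniema", "Kinshasa"],
        "surveyyear": [2007, 2013, 2007, 2013, 2013],
        "indicatorid": ["RH_DELP_C_DHF"] * 4 + ["CH_DIAT_C_ORT"],
        "value": [90.0, 95.0, 60.0, 70.0, 40.0],
        "ispreferred": [1, 1, 1, 1, 1],
    }
)


def indicator_grid(df: pd.DataFrame, indicator_id: str) -> pd.DataFrame:
    """Return a location x surveyyear grid of `value` for one indicator."""
    # Assumption: keep only rows flagged as preferred estimates.
    subset = df[(df["indicatorid"] == indicator_id) & (df["ispreferred"] == 1)]
    # pivot_table averages duplicate cells instead of raising an error.
    return subset.pivot_table(
        index="location", columns="surveyyear", values="value", aggfunc="mean"
    )


grid = indicator_grid(toy, "RH_DELP_C_DHF")
print(grid)
```

With the real data, the same helper could be applied to `train` from the Quick Start section, after checking whether other columns such as `byvariableid` also need filtering.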
---
## Quick Start
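```python
from datasets import load_dataset

# Load both Parquet splits from the Hugging Face Hub
ds = load_dataset("electricsheepafrica/africa-demographics-congo-dem-rep")

# Convert each split to a pandas DataFrame
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()

print(train.shape)   # (rows, columns)
print(train.head())  # first five rows
```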
---
## Schema
| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `iso3` | object | 0.0% | COD |
| `location` | object | 0.0% | Kinshasa, Maniema, Kasaï Oriental |
| `dataid` | int64 | 0.0% | 61328.0 – 7975738.0 (mean 4156655.7403) |
| `indicator` | object | 0.0% | Place of delivery: Health facility, Treatment of diarrhea: Either ORS or RHF, Physical or sexual violence committed by husband/partner |
| `value` | float64 | 0.0% | 0.0 – 219.0 (mean 32.0316) |
| `precision` | int64 | 0.0% | 0.0 – 1.0 (mean 0.9458) |
| `dhs_countrycode` | object | 0.0% | CD |
| `countryname` | object | 0.0% | Congo Democratic Republic |
| `surveyyear` | int64 | 0.0% | 2007.0 – 2023.0 (mean 2016.6926) |
| `surveyid` | object | 0.0% | CD2023DHS, CD2013DHS, CD2007DHS |
| `indicatorid` | object | 0.0% | RH_DELP_C_DHF, CH_DIAT_C_ORT, DV_SPVL_W_POS |
| `indicatororder` | int64 | 0.0% | 11763080.0 – 260321010.0 (mean 106680092.883) |
| `indicatortype` | object | 0.0% | I |
| `characteristicid` | int64 | 0.0% | 503010.0 – 503095.0 (mean 503056.2707) |
| `characteristicorder` | int64 | 0.0% | 1503010.0 – 1503095.0 (mean 1503056.0432) |
| `characteristiccategory` | object | 0.0% | Region |
| `characteristiclabel` | object | 0.0% | Kinshasa, Maniema, Kasaï Oriental |
| `byvariableid` | int64 | 0.0% | 0.0 – 631002.0 (mean 37075.7062) |
| `byvariablelabel` | object | 72.0% | |
| `istotal` | int64 | 0.0% | 0.0 – 0.0 (mean 0.0) |
| `ispreferred` | int64 | 0.0% | 0.0 – 1.0 (mean 0.8659) |
| `sdrid` | object | 0.0% | |
| `regionid` | object | 0.0% | |
| `surveyyearlabel` | float64 | 85.0% | 2007.0 – 2007.0 (mean 2007.0) |
| `surveytype` | object | 0.0% | |
| `denominatorweighted` | float64 | 19.4% | 13.0 – 4431.0 (mean 653.9589) |
| `denominatorunweighted` | float64 | 19.4% | 27.0 – 5127.0 (mean 705.534) |
| `levelrank` | float64 | 32.4% | 1.0 – 1.0 (mean 1.0) |
| `esa_source` | object | 0.0% | |
| `esa_processed` | object | 0.0% | |
---
## Numeric Summary
| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `dataid` | 61328.0 | 7975738.0 | 4156655.7403 | 4277063.0 |
| `value` | 0.0 | 219.0 | 32.0316 | 22.7 |
| `precision` | 0.0 | 1.0 | 0.9458 | 1.0 |
| `surveyyear` | 2007.0 | 2023.0 | 2016.6926 | 2013.0 |
| `indicatororder` | 11763080.0 | 260321010.0 | 106680092.883 | 94096040.0 |
| `characteristicid` | 503010.0 | 503095.0 | 503056.2707 | 503054.0 |
| `characteristicorder` | 1503010.0 | 1503095.0 | 1503056.0432 | 1503054.0 |
| `byvariableid` | 0.0 | 631002.0 | 37075.7062 | 0.0 |
| `istotal` | 0.0 | 0.0 | 0.0 | 0.0 |
| `ispreferred` | 0.0 | 1.0 | 0.8659 | 1.0 |
| `surveyyearlabel` | 2007.0 | 2007.0 | 2007.0 | 2007.0 |
| `denominatorweighted` | 13.0 | 4431.0 | 653.9589 | 433.0 |
| `denominatorunweighted` | 27.0 | 5127.0 | 705.534 | 521.0 |
| `levelrank` | 1.0 | 1.0 | 1.0 | 1.0 |
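
A table of this shape can be recomputed with pandas. The following is a minimal sketch on a toy frame, not the script that produced the numbers above; the toy values are invented apart from reusing a few column names.

```python
import pandas as pd

# Toy frame with two numeric columns and one text column; values are invented.
toy = pd.DataFrame(
    {
        "value": [0.0, 22.7, 219.0],
        "istotal": [0, 0, 0],
        "location": ["Kinshasa", "Maniema", "Kasaï Oriental"],
    }
)

# Keep numeric columns only, then compute the four statistics per column.
summary = (
    toy.select_dtypes(include="number")
    .agg(["min", "max", "mean", "median"])
    .T  # one row per column, one column per statistic
)
print(summary)
```

Missing values are skipped by default in these aggregations, so columns with many nulls are summarised over their non-missing rows only.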
---
## Curation
Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. Two columns with more than 80% missing values were removed: `cilow` and `cihigh`. One column was cast from string to a numeric or datetime type because more than 85% of its values parsed successfully. The dataset was then split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.
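
The steps described above can be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the curation code that was actually run: the function name `curate`, the exact-match handling of missing markers, the simple space-to-underscore renaming, the numeric-only cast, and the use of `DataFrame.sample` for the split are all hypothetical. Only the marker list, the 80% and 85% thresholds, the 80/20 ratio, and the seed come from the text.

```python
import numpy as np
import pandas as pd

MISSING_MARKERS = ["N/A", "null", "none", "-", "unknown", "no data", "#N/A"]


def curate(raw: pd.DataFrame, drop_above: float = 0.80,
           cast_above: float = 0.85) -> pd.DataFrame:
    """Hypothetical re-creation of the cleaning steps (assumed details)."""
    df = raw.copy()
    # 1. Lowercase column names and replace spaces with underscores.
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    # 2. Map missing-value markers to NaN (exact, case-sensitive match).
    df = df.replace(MISSING_MARKERS, np.nan)
    # 3. Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= drop_above]
    # 4. Cast non-numeric columns to numeric when enough values parse.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        parsed = pd.to_numeric(df[col], errors="coerce")
        non_missing = df[col].notna().sum()
        if non_missing and parsed.notna().sum() / non_missing > cast_above:
            df[col] = parsed
    return df


# Toy input; the values are invented for illustration.
raw = pd.DataFrame(
    {
        "Survey Year": ["2007", "2013", "2023", "2013", "2007",
                        "2023", "2013", "2007", "2023", "2013"],
        "Location": ["Kinshasa", "Maniema", "N/A", "Kinshasa", "Maniema",
                     "Kinshasa", "Maniema", "Kinshasa", "Maniema", "Kinshasa"],
        "CILow": ["-"] * 9 + ["1.2"],
    }
)

clean = curate(raw)
# 80/20 split with a fixed seed; the remaining rows form the test set.
train = clean.sample(frac=0.8, random_state=42)
test = clean.drop(train.index)
print(clean.dtypes)
print(len(train), len(test))
```

Saving each split with `DataFrame.to_parquet(path, compression="snappy")` would match the storage format described above, provided a Parquet engine such as `pyarrow` is installed.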
---
## Limitations
- Data originates from The DHS Program and has not been independently validated by Electric Sheep Africa (ESA).
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- The following columns have >20% missing values and should be treated with caution in modelling: `byvariablelabel`, `surveyyearlabel`, `levelrank`.
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/dhs-subnational-data-for-democratic-republic-of-the-congo) for the publisher's own methodology notes and caveats.
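
The missing-value caveat above can be checked before modelling. The sketch below flags columns whose share of missing values exceeds a threshold; the helper name `high_missing_columns` and the toy values are assumptions for illustration, and only the 20% threshold comes from the list above.

```python
import numpy as np
import pandas as pd


def high_missing_columns(df: pd.DataFrame, threshold: float = 0.20) -> pd.Series:
    """Return the missing share of each column above `threshold`, largest first."""
    share = df.isna().mean()
    return share[share > threshold].sort_values(ascending=False)


# Toy frame; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "levelrank": [1.0, np.nan, 1.0, np.nan, 1.0, np.nan],               # 50% missing
        "denominatorweighted": [13.0, 20.0, 30.0, 40.0, 50.0, np.nan],      # about 17% missing
    }
)

flagged = high_missing_columns(toy)
print(flagged)
```

Whether to drop, impute, or simply ignore a flagged column depends on the task; the threshold is a parameter so it can be tightened or relaxed.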
---
## Citation
```bibtex
@dataset{hdx_africa_demographics_congo_dem_rep,
title = {Democratic Republic of the Congo - Subnational Demographic and Health Data},
author = {The DHS Program},
year = {2026},
url = {https://data.humdata.org/dataset/dhs-subnational-data-for-democratic-republic-of-the-congo},
note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```
---
*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica) — Africa's ML dataset infrastructure. Lagos, Nigeria.*