electricsheepafrica/africa-idmc-event-data-for-som
Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-idmc-event-data-for-som
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- tabular-classification
- tabular-regression
- other
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- conflict-violence
- displacement
- drought
- internally-displaced-persons-idp
- som
pretty_name: "Somalia - Internal Displacements Updates (IDU) (event data)"
dataset_info:
  splits:
  - name: train
    num_examples: 895
  - name: test
    num_examples: 223
---
# Somalia - Internal Displacements Updates (IDU) (event data)
**Publisher:** Internal Displacement Monitoring Centre (IDMC) · **Source:** [HDX](https://data.humdata.org/dataset/idmc-event-data-for-som) · **License:** `cc-by-igo` · **Updated:** 2026-04-09
---
## Abstract
Conflict and disaster population movement (flows) data for Somalia.
The **IDU (Internal Displacement Updates) dataset**, provided by the [Internal Displacement Monitoring Centre (IDMC)](https://www.internal-displacement.org/), offers timely event data and provisional information on new internal displacements caused by conflicts and disasters. Representing the most recent available information over a 180-day time period, the IDU is updated daily and focuses on "flows" (new displacements).
Internally displaced persons (IDPs) are defined according to the [1998 Guiding Principles](https://www.internal-displacement.org/internal-displacement/guiding-principles-on-internal-displacement/) as people or groups of people who have been forced or obliged to flee or to leave their homes or places of habitual residence, in particular as a result of armed conflict, or to avoid the effects of armed conflict, situations of generalized violence, violations of human rights, or natural or human-made disasters and who have not crossed an international border. The IDMC's event data, sourced from the IDU, provides initial assessments of these internal displacements, reflecting continually updated provisional information from various sources.
While the IDU offers early insights, the more thoroughly validated and curated "stock" (total number of people living in internal displacement) and "flow" (population movements) estimates are available in the annual [Global Internal Displacement Database (GIDD)](http://www.internal-displacement.org/database/displacement-data). Both datasets are accessible via API. IDMC provides guidance on data access, structure, and limitations, including preprocessing steps for the IDU that help ensure accurate analysis and avoid double-counting. For detailed information and complete API specifications, consult the official documentation at https://www.internal-displacement.org/database/api-documentation/.
The IDMC's Event data, sourced from the Internal Displacement Updates (IDU), offers initial assessments of internal displacements reported within the last 180 days. This dataset provides provisional information that is continually updated on a daily basis, reflecting the availability of data on new displacements arising from conflicts and disasters. The finalized, carefully curated, and validated estimates are then made accessible through [the Global Internal Displacement Database (GIDD)](https://www.internal-displacement.org/database/displacement-data). The IDU dataset comprises preliminary estimates aggregated from various publishers or sources.
Each row in this dataset represents a discrete event or incident. Temporal coverage is indicated by the `displacement_date` and `displacement_start_date` columns. Geographic scope: **SOM**.
*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*
---
## Dataset Characteristics
| | |
|---|---|
| **Domain** | Conflict and security |
| **Unit of observation** | Discrete events or incidents |
| **Rows (total)** | 1,119 |
| **Columns** | 33 (6 numeric, 21 categorical, 6 datetime) |
| **Train split** | 895 rows |
| **Test split** | 223 rows |
| **Geographic scope** | SOM |
| **Publisher** | Internal Displacement Monitoring Centre (IDMC) |
| **HDX last updated** | 2026-04-09 |
---
## Variables
**Geographic**: `country` (Somalia), `iso3` (SOM), `latitude` (range -1.0332–11.472), `longitude` (range 41.4413–50.0967), `displacement_type` (Disaster, Conflict) and 14 others.
**Temporal**: `event_start_date`, `event_end_date`.
**Identifier / Metadata**: `id` (range 229103–240684), `centroid` ([-0.14424499999999998, 42.645165000000006], [9.50209, 49.50765], [3.81645, 43.43285]), `event_id` (range 36962–40347), `event_name` (Somalia: Drought - Nugaal - 01/11/2025, Somalia: Drought - Bay - 01/01/2026, Somalia: Unclear/Unknown - Middle Juba - 01/10/2025), `sources` and 2 others.
**Other**: `role` (Recommended figure), `qualifier` (total), `figure` (range 1–39000), `created_at`, `description`.
---
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("electricsheepafrica/africa-idmc-event-data-for-som")
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()
print(train.shape)
train.head()
```
---
## Schema
| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `id` | int64 | 0.0% | 229103 – 240684 (mean 232617.8186) |
| `country` | object | 0.0% | Somalia |
| `iso3` | object | 0.0% | SOM |
| `latitude` | float64 | 0.0% | -1.0332 – 11.472 (mean 4.8164) |
| `longitude` | float64 | 0.0% | 41.4413 – 50.0967 (mean 45.0274) |
| `centroid` | object | 0.0% | [-0.14424499999999998, 42.645165000000006], [9.50209, 49.50765], [3.81645, 43.43285] |
| `role` | object | 0.0% | Recommended figure |
| `displacement_type` | object | 0.0% | Disaster, Conflict |
| `qualifier` | object | 0.0% | total |
| `figure` | int64 | 0.0% | 1 – 39000 (mean 344.7855) |
| `displacement_date` | datetime64[ns] | 0.0% | |
| `displacement_start_date` | datetime64[ns] | 0.0% | |
| `displacement_end_date` | datetime64[ns] | 0.0% | |
| `year` | int64 | 0.0% | 2025 – 2026 (mean 2025.2779) |
| `event_id` | int64 | 0.0% | 36962 – 40347 (mean 39171.3548) |
| `event_name` | object | 0.0% | Somalia: Drought - Nugaal - 01/11/2025, Somalia: Drought - Bay - 01/01/2026, Somalia: Unclear/Unknown - Middle Juba - 01/10/2025 |
| `event_start_date` | datetime64[ns] | 0.0% | |
| `event_end_date` | datetime64[ns] | 0.0% | |
| `category` | object | 23.1% | Weather related |
| `subcategory` | object | 23.1% | Climatological |
| `type` | object | 23.1% | Drought |
| `subtype` | object | 23.1% | |
| `sources` | object | 0.0% | |
| `locations_name` | object | 0.0% | |
| `locations_coordinates` | object | 0.0% | |
| `locations_accuracy` | object | 0.0% | |
| `locations_type` | object | 0.0% | |
| `displacement_occurred` | object | 0.0% | |
| `created_at` | datetime64[ns, UTC] | 0.0% | |
| `description` | object | 0.0% | |
| `combined_type` | object | 0.0% | |
| `esa_source` | object | 0.0% | |
| `esa_processed` | object | 0.0% | |
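The `centroid` column is typed `object`, and the sample values look like `[latitude, longitude]` pairs. The sketch below splits such values into two numeric columns. Two things are assumptions, not confirmed by the card: that the order is latitude first (inferred from the sample values and the `latitude`/`longitude` ranges), and that a value may arrive either as a JSON-like string or as a list. `split_centroid` and the new column names are hypothetical.

```python
import json

import pandas as pd


def split_centroid(value):
    """Return (lat, lon) from a centroid stored as a JSON-like string or a sequence."""
    if isinstance(value, str):
        value = json.loads(value)  # e.g. "[9.50209, 49.50765]" -> [9.50209, 49.50765]
    lat, lon = value  # assumed order: latitude, then longitude
    return float(lat), float(lon)


# Toy rows covering both storage forms; coordinates copied from the sample values above.
demo = pd.DataFrame({"centroid": ["[9.50209, 49.50765]", [3.81645, 43.43285]]})

pairs = demo["centroid"].apply(split_centroid)
coords = pd.DataFrame(pairs.tolist(), columns=["centroid_lat", "centroid_lon"], index=demo.index)
demo = demo.join(coords)
print(demo[["centroid_lat", "centroid_lon"]])
```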
---
## Numeric Summary
| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `id` | 229103.0 | 240684.0 | 232617.8186 | 229729.0 |
| `latitude` | -1.0332 | 11.472 | 4.8164 | 3.8098 |
| `longitude` | 41.4413 | 50.0967 | 45.0274 | 44.0069 |
| `figure` | 1.0 | 39000.0 | 344.7855 | 18.0 |
| `year` | 2025.0 | 2026.0 | 2025.2779 | 2025.0 |
| `event_id` | 36962.0 | 40347.0 | 39171.3548 | 39582.0 |
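The `figure` column has a mean (344.7855) far above its median (18.0), which indicates a strongly right-skewed distribution. If `figure` is used as a regression target, one common option is a `log1p` transform. This is a modelling suggestion rather than part of the curation described in this card, and the three values below are simply the minimum, median, and maximum from the table.

```python
import numpy as np
import pandas as pd

# Minimum, median, and maximum of `figure` as reported in the summary table.
demo = pd.DataFrame({"figure": [1, 18, 39000]})

# log1p compresses the long right tail while remaining defined at zero.
demo["log_figure"] = np.log1p(demo["figure"])

# expm1 inverts the transform, e.g. to report predictions in people displaced.
recovered = np.expm1(demo["log_figure"])
print(demo)
```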
---
## Curation
Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. Five columns with more than 80% missing values were removed: `event_codes`, `event_code_types`, `old_id`, `source_url`, `link`. Six columns were cast from string to numeric or datetime where the parse-success rate exceeded 85%. The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.
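The steps above can be approximated in pandas as follows. This is a minimal sketch of the described procedure, not Electric Sheep Africa's actual pipeline: the function names, the exact snake_case rule, the case-insensitive matching of missing-value markers, and the use of `DataFrame.sample` for the split are all assumptions. The input frame is invented for illustration.

```python
import re
import warnings

import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

MISSING_MARKERS = {"n/a", "null", "none", "-", "unknown", "no data", "#n/a"}


def to_snake_case(name: str) -> str:
    """Lowercase a column name and join its words with underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def unify_missing(value):
    """Map textual missing-value markers to NaN; leave other values unchanged."""
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return np.nan
    return value


def curate(raw: pd.DataFrame, drop_above: float = 0.80, parse_above: float = 0.85) -> pd.DataFrame:
    df = raw.rename(columns=to_snake_case)
    df = df.apply(lambda col: col.map(unify_missing))
    # Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= drop_above].copy()
    for col in df.columns:
        if is_numeric_dtype(df[col]) or is_datetime64_any_dtype(df[col]):
            continue
        non_null = df[col].dropna()
        if non_null.empty:
            continue
        # Cast only when enough of the non-missing values parse successfully.
        if pd.to_numeric(non_null, errors="coerce").notna().mean() > parse_above:
            df[col] = pd.to_numeric(df[col], errors="coerce")
            continue
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # silence date-format inference warnings
            if pd.to_datetime(non_null, errors="coerce").notna().mean() > parse_above:
                df[col] = pd.to_datetime(df[col], errors="coerce")
    return df


def split_80_20(df: pd.DataFrame, seed: int = 42):
    """Random 80/20 split with a fixed seed; assumes a unique index."""
    train = df.sample(frac=0.8, random_state=seed)
    return train, df.drop(index=train.index)


# Invented raw frame: string-typed columns, one marker value, one sparse column.
raw = pd.DataFrame(
    {
        "Figure": ["12", "300", "18", "unknown", "5", "39000", "1", "44", "7", "90"],
        "Displacement Date": ["2025-11-01"] * 5 + ["2026-01-15"] * 5,
        "Displacement Type": ["Disaster", "Conflict"] * 5,
        "Old ID": ["N/A"] * 9 + ["77"],
    }
)

clean = curate(raw)
train_part, test_part = split_80_20(clean)
print(clean.dtypes)
print(len(train_part), len(test_part))
```

In this toy run, `Old ID` is dropped because 90% of its values are missing after marker unification, `Figure` becomes numeric, and `Displacement Date` becomes a datetime column.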
---
## Limitations
- Data originates from the Internal Displacement Monitoring Centre (IDMC) and has not been independently validated by Electric Sheep Africa (ESA).
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- The following columns have >20% missing values and should be treated with caution in modelling: `category`, `subcategory`, `type`, `subtype`.
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/idmc-event-data-for-som) for the publisher's own methodology notes and caveats.
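Before modelling with the four sparse columns, it may help to check whether their gaps are concentrated in one `displacement_type`, and to make the gaps explicit instead of leaving `NaN`. The sketch below shows one way to do both. The helper names and the `"Not recorded"` label are hypothetical, and the toy frame is constructed so that the gaps fall on the Conflict row; that pattern is an illustration, not a finding about the real data.

```python
import pandas as pd

HAZARD_COLS = ["category", "subcategory", "type", "subtype"]


def missing_share_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Share of missing values in each sparse column, per displacement type."""
    return df[HAZARD_COLS].isna().groupby(df["displacement_type"]).mean()


def fill_hazard_gaps(df: pd.DataFrame, label: str = "Not recorded") -> pd.DataFrame:
    """Return a copy in which gaps in the sparse columns carry an explicit label."""
    out = df.copy()
    out[HAZARD_COLS] = out[HAZARD_COLS].fillna(label)
    return out


# Toy rows (invented): gaps are placed on the Conflict row purely for illustration.
demo = pd.DataFrame(
    {
        "displacement_type": ["Disaster", "Disaster", "Conflict"],
        "category": ["Weather related", "Weather related", None],
        "subcategory": ["Climatological", "Climatological", None],
        "type": ["Drought", "Drought", None],
        "subtype": ["Drought", "Drought", None],
    }
)

share = missing_share_by_type(demo)
filled = fill_hazard_gaps(demo)
print(share)
print(filled)
```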
---
## Citation
```bibtex
@dataset{hdx_africa_idmc_event_data_for_som,
title = {Somalia - Internal Displacements Updates (IDU) (event data)},
author = {Internal Displacement Monitoring Centre (IDMC)},
year = {2026},
url = {https://data.humdata.org/dataset/idmc-event-data-for-som},
note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```
---
*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica) — Africa's ML dataset infrastructure. Lagos, Nigeria.*
Provider: electricsheepafrica



