electricsheepafrica/africa-ucdp-data-for-south-sudan

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-ucdp-data-for-south-sudan
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- tabular-classification
- other
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- conflict-violence
- hxl
- ssd
pretty_name: "South Sudan - Data on Conflict Events"
dataset_info:
  splits:
  - name: train
    num_examples: 800
  - name: test
    num_examples: 200
---

# South Sudan - Data on Conflict Events

**Publisher:** HDX · **Source:** [HDX](https://data.humdata.org/dataset/ucdp-data-for-south-sudan) · **License:** `cc-by-igo` · **Updated:** 2026-04-03

---

## Abstract

This dataset is UCDP's most disaggregated dataset, covering individual events of organized violence (phenomena of lethal violence occurring at a given time and place). These events are sufficiently fine-grained to be geo-coded down to the level of individual villages, with temporal durations disaggregated to single, individual days.

Sundberg, Ralph, and Erik Melander, 2013, “Introducing the UCDP Georeferenced Event Dataset”, Journal of Peace Research, vol.50, no.4, 523-532

Högbladh Stina, 2019, “UCDP GED Codebook version 19.1”, Department of Peace and Conflict Research, Uppsala University

Each row in this dataset represents first-level administrative unit observations. Temporal coverage is indicated by the `date_start`, `date_end` column(s). Geographic scope: **SSD**.

*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*

---

## Dataset Characteristics

| | |
|---|---|
| **Domain** | Conflict and security |
| **Unit of observation** | First-level administrative unit observations |
| **Rows (total)** | 1,001 |
| **Columns** | 51 (27 numeric, 21 categorical, 2 datetime) |
| **Train split** | 800 rows |
| **Test split** | 200 rows |
| **Geographic scope** | SSD |
| **Publisher** | HDX |
| **HDX last updated** | 2026-04-03 |

---

## Variables

**Geographic:** `year` (range 2011.0–2024.0), `active_year`, `type_of_violence` (range 1.0–3.0), `dyad_dset_id` (range 112.0–18302.0), `dyad_new_id` (range 688.0–18302.0) and 9 others.

**Temporal:** `source_date` (2014-08-08, 2015-07-22, 2014-09-29), `date_prec` (range 1.0–5.0), `date_start`, `date_end`.

**Outcome / Measurement:** `number_of_sources` (range -1.0–20.0), `deaths_a` (range 0.0–120.0), `deaths_b`, `deaths_civilians`, `deaths_unknown`.

**Identifier / Metadata:** `id` (range 28726.0–565951.0), `relid` (UGA-2011-1-151-27, SSD-2018-1-12413-4, SSD-2017-1-12413-44), `code_status` (Clear), `conflict_dset_id` (range 112.0–18302.0), `conflict_new_id` (range 309.0–16472.0) and 14 others.

**Other:** `where_prec` (range 1.0–6.0), `where_description`, `adm_1`, `adm_2`, `geom_wkt` and 4 others.

---

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("electricsheepafrica/africa-ucdp-data-for-south-sudan")
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()
print(train.shape)
train.head()
```

---

## Schema

| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `id` | int64 | 0.0% | 28726.0 – 565951.0 (mean 234324.3037) |
| `relid` | object | 0.0% | UGA-2011-1-151-27, SSD-2018-1-12413-4, SSD-2017-1-12413-44 |
| `year` | int64 | 0.0% | 2011.0 – 2024.0 (mean 2016.5235) |
| `active_year` | bool | 0.0% | |
| `code_status` | object | 0.0% | Clear |
| `type_of_violence` | int64 | 0.0% | 1.0 – 3.0 (mean 1.978) |
| `conflict_dset_id` | int64 | 0.0% | 112.0 – 18302.0 (mean 7360.2657) |
| `conflict_new_id` | int64 | 0.0% | 309.0 – 16472.0 (mean 7793.1748) |
| `conflict_name` | object | 0.0% | South Sudan: Government, Government of South Sudan - Civilians, SPLM/A - IO - Civilians |
| `dyad_dset_id` | int64 | 0.0% | 112.0 – 18302.0 (mean 7992.2847) |
| `dyad_new_id` | int64 | 0.0% | 688.0 – 18302.0 (mean 8845.5954) |
| `dyad_name` | object | 0.0% | Government of South Sudan - Civilians, Government of South Sudan - SPLM/A - IO, SPLM/A - IO - Civilians |
| `side_a_dset_id` | int64 | 0.0% | 90.0 – 8500.0 (mean 1020.983) |
| `side_a_new_id` | int64 | 0.0% | 90.0 – 8500.0 (mean 1020.983) |
| `side_a` | object | 0.0% | Government of South Sudan, SPLM/A - IO, Lou Nuer |
| `side_b_dset_id` | int64 | 0.0% | 112.0 – 9999.0 (mean 6150.4036) |
| `side_b_new_id` | int64 | 0.0% | 1.0 – 9369.0 (mean 2434.8631) |
| `side_b` | object | 0.0% | Civilians, SPLM/A - IO, Murle |
| `number_of_sources` | int64 | 0.0% | -1.0 – 20.0 (mean 1.4216) |
| `source_article` | object | 0.0% | "All Africa,2014-08-08,South Sudan's New War - Abuses By Government and Opposition Forces [document]", "Human Rights Watch,2015-07-22,They Burned it All", "UN Security Council,2018-09-11,Report of the Secretary-General on South Sudan (covering the period from 4 June to 1 September 2018) " |
| `source_office` | object | 7.6% | All Africa, UN Security Council, Human Rights Watch |
| `source_date` | object | 7.6% | 2014-08-08, 2015-07-22, 2014-09-29 |
| `source_headline` | object | 9.9% | South Sudan's New War - Abuses By Government and Opposition Forces [document], They Burned it All, Report of the Secretary-General on South Sudan (covering the period from 4 June to 1 September 2018) |
| `source_original` | object | 17.9% | |
| `where_prec` | int64 | 0.0% | 1.0 – 6.0 (mean 2.2288) |
| `where_coordinates` | object | 0.0% | |
| `where_description` | object | 0.4% | |
| `adm_1` | object | 1.7% | |
| `adm_2` | object | 13.6% | |
| `latitude` | float64 | 0.0% | 3.55 – 12.0375 (mean 7.1149) |
| `longitude` | float64 | 0.0% | 24.8158 – 34.3908 (mean 30.6781) |
| `geom_wkt` | object | 0.0% | |
| `priogrid_gid` | int64 | 0.0% | 135061.0 – 147307.0 (mean 139939.3556) |
| `country` | object | 0.0% | |
| `iso3` | object | 0.0% | |
| `country_id` | int64 | 0.0% | 626.0 – 626.0 (mean 626.0) |
| `region` | object | 0.0% | |
| `event_clarity` | int64 | 0.0% | 1.0 – 2.0 (mean 1.2338) |
| `date_prec` | int64 | 0.0% | 1.0 – 5.0 (mean 1.8981) |
| `date_start` | datetime64[ns] | 0.0% | |
| `date_end` | datetime64[ns] | 0.0% | |
| `deaths_a` | int64 | 0.0% | 0.0 – 120.0 (mean 1.7882) |
| `deaths_b` | int64 | 0.0% | |
| `deaths_civilians` | int64 | 0.0% | |
| `deaths_unknown` | int64 | 0.0% | |
| `best` | int64 | 0.0% | |
| `high` | int64 | 0.0% | |
| `low` | int64 | 0.0% | |
| `gwnoa` | float64 | 30.7% | |
| `esa_source` | object | 0.0% | |
| `esa_processed` | object | 0.0% | |

---

## Numeric Summary

| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `id` | 28726.0 | 565951.0 | 234324.3037 | 241026.0 |
| `year` | 2011.0 | 2024.0 | 2016.5235 | 2017.0 |
| `type_of_violence` | 1.0 | 3.0 | 1.978 | 2.0 |
| `conflict_dset_id` | 112.0 | 18302.0 | 7360.2657 | 11345.0 |
| `conflict_new_id` | 309.0 | 16472.0 | 7793.1748 | 11345.0 |
| `dyad_dset_id` | 112.0 | 18302.0 | 7992.2847 | 11988.0 |
| `dyad_new_id` | 688.0 | 18302.0 | 8845.5954 | 12413.0 |
| `side_a_dset_id` | 90.0 | 8500.0 | 1020.983 | 113.0 |
| `side_a_new_id` | 90.0 | 8500.0 | 1020.983 | 113.0 |
| `side_b_dset_id` | 112.0 | 9999.0 | 6150.4036 | 6341.0 |
| `side_b_new_id` | 1.0 | 9369.0 | 2434.8631 | 1005.0 |
| `number_of_sources` | -1.0 | 20.0 | 1.4216 | 1.0 |
| `where_prec` | 1.0 | 6.0 | 2.2288 | 2.0 |
| `latitude` | 3.55 | 12.0375 | 7.1149 | 7.5822 |
| `longitude` | 24.8158 | 34.3908 | 30.6781 | 30.7557 |

---

## Curation

Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. 1 column(s) with >80% missing values were removed: `gwnob`. 2 column(s) were cast from string to numeric or datetime based on parse-success rate (>85% threshold). The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.

---

## Limitations

- Data originates from HDX and has not been independently validated by ESA.
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- The following columns have >20% missing values and should be treated with caution in modelling: `gwnoa`.
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/ucdp-data-for-south-sudan) for the publisher's own methodology notes and caveats.
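
The cleaning steps listed under Curation can be approximated in pandas. The sketch below is not ESA's actual pipeline: the function names (`curate`, `to_snake_case`, `is_text`), the case-insensitive marker matching, the ISO date format, and the toy input frame are all assumptions made for illustration. Only the thresholds (80%, 85%), the 80/20 ratio, and the seed (42) come from the card.

```python
import re

import numpy as np
import pandas as pd

# Markers named in the Curation section; case-insensitive matching is an assumption.
MISSING_MARKERS = {"n/a", "null", "none", "-", "unknown", "no data", "#n/a"}
DATE_FORMAT = "%Y-%m-%d"  # assumed layout of date strings


def to_snake_case(name):
    """Lowercase a column name and turn non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", str(name).strip().lower()).strip("_")


def is_text(series):
    """True for columns that are neither numeric/bool nor datetime."""
    return not (
        pd.api.types.is_numeric_dtype(series)
        or pd.api.types.is_datetime64_any_dtype(series)
    )


def curate(raw, drop_above=0.80, cast_above=0.85, seed=42):
    df = raw.rename(columns=to_snake_case).reset_index(drop=True)

    # 1. Unify missing-value markers to NaN in text columns.
    for col in [c for c in df.columns if is_text(df[c])]:
        marker = df[col].astype(str).str.strip().str.lower().isin(MISSING_MARKERS)
        df[col] = df[col].mask(marker, np.nan)

    # 2. Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= drop_above].copy()

    # 3. Cast text columns when enough non-null values parse cleanly.
    for col in [c for c in df.columns if is_text(df[c])]:
        values = df[col].dropna()
        if values.empty:
            continue
        num_ok = pd.to_numeric(values, errors="coerce").notna().mean()
        date_ok = pd.to_datetime(values, format=DATE_FORMAT, errors="coerce").notna().mean()
        if num_ok > cast_above:
            df[col] = pd.to_numeric(df[col], errors="coerce")
        elif date_ok > cast_above:
            df[col] = pd.to_datetime(df[col], format=DATE_FORMAT, errors="coerce")

    # 4. 80/20 split with a fixed seed.
    test = df.sample(frac=0.20, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Toy input with made-up values, shaped to exercise every step.
raw = pd.DataFrame({
    "Event ID": range(10),
    "Date Start": ["2014-08-08"] * 9 + ["N/A"],
    "Deaths A": ["0", "3", "unknown", "1", "2", "0", "0", "5", "1", "0"],
    "GWNOB": ["-"] * 9 + ["626"],
    "Adm 1": ["Jonglei"] * 10,
})
train, test = curate(raw)
print(train.dtypes)
```

In the toy frame, `gwnob` is 90% missing after marker unification and is dropped, while `date_start` and `deaths_a` pass the parse-success check and are cast.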

---

## Citation

```bibtex
@dataset{hdx_africa_ucdp_data_for_south_sudan,
  title = {South Sudan - Data on Conflict Events},
  author = {HDX},
  year = {2026},
  url = {https://data.humdata.org/dataset/ucdp-data-for-south-sudan},
  note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```

---

*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica): Africa's ML dataset infrastructure. Lagos, Nigeria.*
Provider:
electricsheepafrica
Collected summary
Dataset introduction
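Construction method
The dataset derives from the Georeferenced Event Dataset of the Uppsala Conflict Data Program (UCDP), published on HDX and curated by the Electric Sheep Africa team into ML-ready Parquet format. The raw data was downloaded via the CKAN API and passed through a systematic preprocessing pipeline: column names were standardised to lowercase snake_case; common missing-value markers (such as N/A and null) were unified to NaN; columns with more than 80% missing values were removed; and string columns were cast to numeric or datetime types based on parse-success rate. Finally, the data was split 80/20 into train and test sets with a fixed random seed (42) and saved as Snappy-compressed Parquet, for efficient access and reproducibility.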
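Usage
Users can load the dataset with the Hugging Face Datasets library: `load_dataset` returns the train and test splits directly, and each split can be converted to a pandas DataFrame for further analysis. Columns such as `type_of_violence` and `deaths_civilians` are suggested as target variables, combined with temporal and geographic features such as `year` and `adm_1`, to build classification or regression models. Note that the `gwnoa` column has more than 20% missing values and should be used with caution; impute or drop it before modelling. When citing, credit both the original HDX publisher and ESA's repackaging.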
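
As a rough illustration of the modelling setup suggested above, the following sketch builds a feature matrix from `year` and `adm_1` with `type_of_violence` as the target. The helper `make_xy`, the choice to drop `gwnoa` rather than impute it, the `"missing"` label for null `adm_1` values, and the toy rows (including the state names) are assumptions for illustration, not part of the dataset card.

```python
import pandas as pd


def make_xy(df, target="type_of_violence", features=("year", "adm_1")):
    """Return a one-hot encoded feature frame and the target column."""
    # gwnoa is >20% missing per the card; this sketch drops it instead of imputing.
    df = df.drop(columns=["gwnoa"], errors="ignore")
    X = df[list(features)].copy()
    if "adm_1" in X.columns:
        # Keep rows with a null adm_1 under an explicit label, then one-hot encode.
        X["adm_1"] = X["adm_1"].fillna("missing")
        X = pd.get_dummies(X, columns=["adm_1"], dtype=int)
    y = df[target]
    return X, y


# Toy rows with made-up values; real data would come from ds["train"].to_pandas().
toy = pd.DataFrame({
    "year": [2014, 2015, 2017, 2018],
    "adm_1": ["Jonglei state", None, "Unity state", "Jonglei state"],
    "type_of_violence": [1, 3, 2, 1],
    "gwnoa": [626.0, None, None, 626.0],
})
X, y = make_xy(toy)
print(X.columns.tolist())
```

The resulting `X` and `y` can be passed to any classifier; for a regression on `deaths_civilians`, call `make_xy(df, target="deaths_civilians")` instead.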
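Background and challenges
Background overview
Since gaining independence in 2011, South Sudan has been mired in communal conflict and political turmoil, with frequent armed violence that seriously threatens civilian safety and humanitarian relief. To record and analyse organized violence in the country systematically, this dataset builds on the UCDP Georeferenced Event Dataset (GED) from the Department of Peace and Conflict Research at Uppsala University, distributed through HDX (the Humanitarian Data Exchange) and last updated there in April 2026. The repackaged version is maintained by Electric Sheep Africa and covers roughly one thousand conflict events in South Sudan between 2011 and 2024 across 51 columns, including type of violence, warring parties, geographic coordinates, and fatality counts. With its high spatio-temporal resolution at event level, it is intended as a quantitative basis for conflict-dynamics modelling, civilian-protection strategy, and peace research.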
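Current challenges
The domain challenge the dataset addresses is that traditional conflict surveys tend to rely on aggregate statistics and struggle to capture how violence evolves in space and time at village level. Its construction also faces several challenges. The raw data comes from news reports and institutional reports across many channels, and this diversity of sources leads to inconsistent and missing information, with some fields more than 20% missing. The classification and definition of violent events (for example, the boundaries between state forces, non-state armed groups, and civilians) blur in practice and can introduce coding bias. In addition, trade-offs in spatio-temporal precision (some dates are precise only to the month, and location precision is coded in six levels) mean that high-resolution modelling must handle measurement error and non-random missingness carefully to keep results robust.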
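
One way to handle the precision trade-off described above is to restrict an analysis to events with tightly coded dates and locations. The sketch below assumes, following the UCDP GED codebook convention, that lower `date_prec` and `where_prec` codes mean higher precision. The cut-offs of 2, the helper name `high_precision`, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def high_precision(df, max_date_prec=2, max_where_prec=2):
    """Keep events whose date and location precision codes are at or below the cut-offs."""
    keep = (df["date_prec"] <= max_date_prec) & (df["where_prec"] <= max_where_prec)
    return df.loc[keep]


# Toy rows with made-up values that use the dataset's column names.
toy = pd.DataFrame({
    "date_prec": [1, 4, 2, 5],
    "where_prec": [1, 2, 6, 3],
    "date_start": pd.to_datetime(["2014-01-05", "2014-02-01", "2015-03-10", "2016-01-01"]),
    "date_end": pd.to_datetime(["2014-01-05", "2014-02-28", "2015-03-12", "2016-12-31"]),
})

precise = high_precision(toy)

# Length of each event's reporting window in days, inclusive of both endpoints.
window_days = (toy["date_end"] - toy["date_start"]).dt.days + 1
print(len(precise), window_days.tolist())
```

Filtering this way trades sample size for measurement quality, so it is worth comparing results with and without the filter before drawing conclusions.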
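Common scenarios
Classic use cases
The South Sudan conflict events dataset (africa-ucdp-data-for-south-sudan) derives from the Georeferenced Event Dataset of the Uppsala Conflict Data Program (UCDP), a widely used fine-grained resource in conflict and security research. It takes the individual violent event as its unit of observation and covers key variables such as time, place, participants, and fatalities, with spatial positioning from village level up to first-level administrative units and day-level temporal resolution. Researchers commonly use it for spatio-temporal pattern mining, classification of violence types, and predictive modelling of conflict dynamics, for example predicting the probability of future conflict in a given area from historical event features, or identifying the drivers of different types of violence (state-based conflict, non-state conflict, and one-sided violence). Its fine-grained structure supports micro-level analysis, including causal inference, and it serves as a common reference dataset in the conflict research community.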
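
For the violence-type analyses mentioned above, the integer codes in `type_of_violence` can be mapped to readable labels. The mapping below (1 = state-based, 2 = non-state, 3 = one-sided) follows the UCDP GED codebook as far as I know and should be checked against the codebook version in use. The helper `count_by_type` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed mapping, based on the UCDP GED codebook; verify against your codebook version.
VIOLENCE_LABELS = {1: "state-based", 2: "non-state", 3: "one-sided"}


def count_by_type(df):
    """Count events per labelled violence type."""
    labels = df["type_of_violence"].map(VIOLENCE_LABELS)
    return labels.value_counts().to_dict()


# Toy rows with made-up values.
toy = pd.DataFrame({"type_of_violence": [1, 3, 2, 1, 3, 3]})
counts = count_by_type(toy)
print(counts)
```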
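Academic problems addressed
The dataset addresses two long-standing problems in conflict research: the scarcity and incompleteness of event-level data, and the difficulty of standardising data when integrating different sources. Traditional conflict data mostly relies on country-level aggregate statistics, which struggle to capture the local variation and dynamics of conflict. With fine geographic coding and date stamps, the dataset bridges the gap between macro narratives and micro events, letting researchers examine local diffusion mechanisms, escalation paths, and the spatio-temporal distribution of humanitarian consequences. Being openly available lowers the barrier to data access and supports reproducible research, helping conflict prevention, peacebuilding, and humanitarian aid strategy move from judgment based on experience toward data-driven decisions. It is used across political science, economics, and geography as a basis for testing conflict theories and evaluating interventions.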
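Practical applications
In practice, the dataset offers decision support to humanitarian organisations, international development agencies, and policymakers. Agencies such as the UN Office for the Coordination of Humanitarian Affairs (OCHA) can use it to identify high-risk areas and to plan the pre-positioning of relief supplies and staff security. NGOs can use the spatio-temporal patterns of past conflict events to design more targeted community-resilience projects and to adjust interventions as short-term fluctuations in violence are observed. The dataset can also feed conflict early-warning systems, combined with satellite imagery and social-media signals, to develop real-time or near-real-time risk assessment models. News media and investigative journalists reporting on security in South Sudan can cite statistics from it to make their reporting more objective. Its standardised format is also easy to integrate into geographic information systems (GIS) for visual analysis, presenting complex conflict situations intuitively to non-specialist audiences.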
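Recent research on the dataset
Latest research directions
The dataset centres on fine-grained spatio-temporal modelling of armed conflict events and analysis of violence patterns in South Sudan. By incorporating geocoded event records from the Uppsala Conflict Data Program (UCDP), it provides high-resolution data for understanding inter-communal violence, intra-state conflict dynamics, and the evolution of humanitarian crises in the wider Horn of Africa region. Research directions include using machine learning to predict the spread of conflict hotspots, assessing the spatio-temporal distribution of civilian casualties, and validating the credibility of conflict events against satellite imagery and news-source data. Given the fragility of South Sudan's political transition, the dataset is relevant to monitoring ceasefire implementation, identifying behaviour patterns of non-state armed groups, and supporting international humanitarian intervention decisions. Its fine-grained administrative-unit observations and standardised cleaning pipeline help reproducibility and comparability across datasets in conflict research.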
The content above was collected and summarized by 遇见数据集.