electricsheepafrica/africa-world-bank-science-and-technology-indicators-for-south-sudan
Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-world-bank-science-and-technology-indicators-for-south-sudan
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: cc-by-4.0
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- tabular-classification
- tabular-regression
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- economics
- indicators
- ssd
pretty_name: "South Sudan - Science and Technology"
dataset_info:
  splits:
  - name: train
    num_examples: 34
  - name: test
    num_examples: 8
---
# South Sudan - Science and Technology
**Publisher:** World Bank Group · **Source:** [HDX](https://data.humdata.org/dataset/world-bank-science-and-technology-indicators-for-south-sudan) · **License:** `cc-by` · **Updated:** 2026-03-27
---
## Abstract
This dataset contains data from the World Bank's [data portal](http://data.worldbank.org/). There is also a [consolidated country dataset](https://data.humdata.org/dataset/world-bank-combined-indicators-for-south-sudan) on HDX.
Technological innovation, often fueled by governments, drives industrial growth and helps raise living standards. The data here aims to shed light on countries' technology base: research and development, scientific and technical journal articles, high-technology exports, royalty and license fees, and patents and trademarks. Sources include the UNESCO Institute for Statistics, the U.S. National Science Board, the UN Statistics Division, the International Monetary Fund, and the World Intellectual Property Organization.
Each row in this dataset represents a country-level aggregate. Data was last updated on HDX on 2026-03-27. Geographic scope: **SSD**.
*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*
---
## Dataset Characteristics
| | |
|---|---|
| **Domain** | Humanitarian and development data |
| **Unit of observation** | Country-level aggregates |
| **Rows (total)** | 43 |
| **Columns** | 8 (2 numeric, 6 categorical, 0 datetime) |
| **Train split** | 34 rows |
| **Test split** | 8 rows |
| **Geographic scope** | SSD |
| **Publisher** | World Bank Group |
| **HDX last updated** | 2026-03-27 |
---
## Variables
**Geographic / Temporal:** `country_name` (South Sudan), `country_iso3` (SSD), `year` (range 1996–2023).
**Outcome / Measurement:** `value` (range 0.0–25800000.0).
**Identifier / Metadata:** `indicator_name` (Scientific and technical journal articles; Charges for the use of intellectual property, payments (BoP, current US$); Charges for the use of intellectual property, receipts (BoP, current US$)), `indicator_code` (IP.JRN.ARTC.SC; BM.GSR.ROYL.CD; BX.GSR.ROYL.CD), `esa_source` (HDX), `esa_processed` (2026-04-10).
---
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("electricsheepafrica/africa-world-bank-science-and-technology-indicators-for-south-sudan")
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()
print(train.shape)
train.head()
```
---
## Schema
| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `country_name` | object | 0.0% | South Sudan |
| `country_iso3` | object | 0.0% | SSD |
| `year` | int64 | 0.0% | 1996 – 2023 (mean 2012.7442) |
| `indicator_name` | object | 0.0% | Scientific and technical journal articles; Charges for the use of intellectual property, payments (BoP, current US$); Charges for the use of intellectual property, receipts (BoP, current US$) |
| `indicator_code` | object | 0.0% | IP.JRN.ARTC.SC, BM.GSR.ROYL.CD, BX.GSR.ROYL.CD |
| `value` | float64 | 0.0% | 0.0 – 25800000.0 (mean 3195748.7047) |
| `esa_source` | object | 0.0% | HDX |
| `esa_processed` | object | 0.0% | 2026-04-10 |
---
## Numeric Summary
| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `year` | 1996 | 2023 | 2012.7442 | 2015 |
| `value` | 0.0 | 25800000.0 | 3195748.7047 | 8.67 |
---
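The `value` column pools three indicators with different units (article counts and current US$), which is likely why its overall mean (about 3.2 million) and median (8.67) are so far apart. A per-indicator summary is usually easier to interpret. The sketch below is dependency-free and accepts any iterable of row dictionaries, for example `ds["train"]` from the Quick Start. The function name `summarise_by_indicator` and the sample rows are hypothetical and made up for illustration; they are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean, median


def summarise_by_indicator(rows):
    """Group `value` by `indicator_code` and return count, min, max, mean, median."""
    groups = defaultdict(list)
    for row in rows:
        value = row["value"]
        # Skip missing values (None or NaN) so they do not distort the statistics.
        if value is None or value != value:
            continue
        groups[row["indicator_code"]].append(float(value))
    return {
        code: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
        }
        for code, values in groups.items()
    }


# Hypothetical rows for illustration only; real rows come from the dataset.
sample_rows = [
    {"indicator_code": "IP.JRN.ARTC.SC", "value": 4.0},
    {"indicator_code": "IP.JRN.ARTC.SC", "value": 10.0},
    {"indicator_code": "BM.GSR.ROYL.CD", "value": 1_000_000.0},
    {"indicator_code": "BM.GSR.ROYL.CD", "value": 3_000_000.0},
]

summary = summarise_by_indicator(sample_rows)
for code, stats in summary.items():
    print(code, stats)
```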
## Curation
Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.
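
The cleaning steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline Electric Sheep Africa actually ran: the helper names (`to_snake_case`, `normalise_missing`, `split_80_20`), the case-insensitive marker matching, the raw header strings, and the shuffle-then-cut split are all assumptions. Only the marker list, the 80/20 ratio, and the seed (42) come from the card. Parquet writing is left out; with pandas it would typically be `DataFrame.to_parquet(path, compression="snappy")`. Note that cutting 43 rows at 80% this way gives 34 training rows and 9 test rows, whereas the card lists 8 test rows, so the published split sizes are worth verifying after loading.

```python
import math
import random
import re

# Missing-value markers listed in the card; matching case-insensitively is an assumption.
MISSING_MARKERS = {"n/a", "null", "none", "-", "unknown", "no data", "#n/a"}


def to_snake_case(name):
    """Lowercase a column name and join its alphanumeric runs with underscores."""
    return "_".join(re.findall(r"[a-z0-9]+", name.lower()))


def normalise_missing(value):
    """Map common missing-value markers to NaN; leave everything else unchanged."""
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return math.nan
    return value


def split_80_20(rows, seed=42):
    """Shuffle a copy of the rows with a fixed seed and cut at 80%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical raw headers; the real HDX column names are not shown in the card.
raw_columns = ["Country Name", "Country ISO3", "Indicator Code"]
print([to_snake_case(c) for c in raw_columns])

# Split 43 placeholder row indices the same way a 43-row table would be split.
train_rows, test_rows = split_80_20(range(43))
print(len(train_rows), len(test_rows))
```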
---
## Limitations
- Data originates from the World Bank Group and has not been independently validated by ESA.
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/world-bank-science-and-technology-indicators-for-south-sudan) for the publisher's own methodology notes and caveats.
---
## Citation
```bibtex
@dataset{hdx_africa_world_bank_science_and_technology_indicators_for_south_sudan,
title = {South Sudan - Science and Technology},
author = {World Bank Group},
year = {2026},
url = {https://data.humdata.org/dataset/world-bank-science-and-technology-indicators-for-south-sudan},
note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```
---
*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica) — Africa's ML dataset infrastructure. Lagos, Nigeria.*
Provider:
electricsheepafrica
Collected summary
Dataset introduction

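Construction
Against the broad backdrop of technology and innovation driving global development, this dataset is published by the World Bank Group and curated by the Electric Sheep Africa team. The raw data was retrieved through the CKAN API of the HDX platform and covers science and technology indicators for South Sudan from 1996 to 2023. During construction the team applied standardised cleaning: column names were converted to snake_case, the various missing-value markers were normalised to NaN, and the data was split 80/20 into training and test sets using a fixed random seed. The result is stored as Snappy-compressed Parquet, which makes it ready for machine learning use.
Features
As a representative dataset on science and technology development in Africa, its core feature is highly structured, country-level aggregate data. The dataset contains 43 observations across 8 variables, including key fields such as year, indicator name, and value. The indicators cover scientific and technical journal articles and payments and receipts of charges for the use of intellectual property. Values span a wide range, from 0 to 25.8 million (in current US$ for the monetary indicators), reflecting the trajectory of South Sudan's science and technology development. The data is entirely in English and comes with an explicit train/test split, giving a clear basis for regression or classification modelling.
Usage
In empirical research and machine learning applications, the dataset offers a quantitative basis for analysing science and technology trends in South Sudan. Users can load it directly with the Hugging Face `datasets` library and convert it to a Pandas DataFrame for exploratory analysis. It suits tasks such as time-series forecasting, studies of relationships between indicators, and economic impact assessment. Given its small size and predefined split, it is advisable to consider the time span and the heterogeneity of the indicators before modelling, and to consult the World Bank's original methodology notes for deeper interpretation.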
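
For time-series work, one point to keep in mind is that the published split is random, so the years of each indicator are scattered across `train` and `test`. A possible preparation step is to merge both splits and rebuild one year-ordered series per indicator. The sketch below assumes rows are dictionaries with the card's column names; the function name `series_by_indicator` and the sample rows are hypothetical and not taken from the dataset.

```python
from collections import defaultdict
from itertools import chain


def series_by_indicator(*splits):
    """Merge row iterables and return {indicator_code: [(year, value), ...]} sorted by year."""
    series = defaultdict(list)
    for row in chain(*splits):
        series[row["indicator_code"]].append((int(row["year"]), row["value"]))
    return {code: sorted(points) for code, points in series.items()}


# Hypothetical rows standing in for ds["train"] and ds["test"]; not real data.
train_rows = [
    {"indicator_code": "IP.JRN.ARTC.SC", "year": 2015, "value": 5.0},
    {"indicator_code": "IP.JRN.ARTC.SC", "year": 2012, "value": 2.0},
]
test_rows = [
    {"indicator_code": "IP.JRN.ARTC.SC", "year": 2013, "value": 3.0},
]

series = series_by_indicator(train_rows, test_rows)
print(series["IP.JRN.ARTC.SC"])
```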
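Background and challenges
Background overview
Driven by globalisation and digitalisation, technological innovation is regarded as a core engine of economic growth and improved living standards. The World Bank Group, a major institution in international development, has long worked to build a global system of science, technology, and innovation indicators for monitoring countries' technology base. This dataset focuses on South Sudan and contains country-level aggregate data from 1996 to 2023 drawn from the World Bank's science and technology indicators, a topic that spans research and development, scientific and technical journal articles, high-technology exports, charges for the use of intellectual property, and patents and trademarks. The Electric Sheep Africa team reformatted it for machine learning in 2026, aiming to provide a structured empirical basis for development economics, regional studies, and policy analysis, and to support deeper study of technology transition paths in fragile states.
Current challenges
The dataset addresses the challenge in development economics of quantifying science and technology innovation capacity, in particular how to capture accurately the trajectory of technology accumulation and knowledge diffusion in a data-scarce, post-conflict country. The construction challenges stem mainly from the heterogeneity and limitations of the raw data: indicator definitions may vary over time or across sources (such as UNESCO and the World Intellectual Property Organization), which introduces inconsistencies in comparisons across years or institutions. In addition, South Sudan is a young state whose statistical infrastructure is still developing, so missing data, reporting delays, and measurement error may affect the reliability of time-series analysis. Finally, turning macro indicators into features for machine learning tasks (such as tabular classification or regression) requires care with the information loss inherent in highly aggregated data and with the risk of overfitting.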
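Common scenarios
Classic use cases
In science and technology policy and economic development research, this dataset, as an authoritative compilation of country-level science and technology indicators for South Sudan, is often used to build time-series models that trace how the country's research output, intellectual property transactions, and high-technology exports have evolved. By combining key indicators such as the number of scientific journal articles and payments and receipts for the use of intellectual property, researchers can systematically assess the structural features of South Sudan's innovation capacity and its relationship to macroeconomic variables, providing an empirical basis for understanding how latecomer countries catch up technologically.
Practical applications
In practice, international organisations, policy think tanks, and humanitarian agencies use the dataset to design science and technology assistance programmes and capacity-building plans for South Sudan. By analysing intellectual property payments and receipts and high-technology export data, decision makers can identify bottlenecks in the country's technology transfer and design targeted support policies. Research output indicators also provide a reference for evaluating the effectiveness of investment in higher education, which helps optimise resource allocation and ultimately serves the goal of strengthening South Sudan's sustainable development and economic resilience.
Derived and related work
Classic work derived from this dataset falls into two main groups. The first is cross-country comparative research on science and technology policy based on panel data models, in which scholars often merge the dataset with the same indicators for other African countries to study the heterogeneous effects of the institutional environment on innovation performance. The second is machine-learning-driven forecasting of science and technology indicators, which uses time-series features to build regression or classification models that attempt to predict South Sudan's future science and technology trends. This work has broadened the use of development data in causal inference and predictive analysis and has given rise to several academic publications on innovation systems in fragile states.
The content above was collected and summarised by 遇见数据集.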



