electricsheepafrica/africa-drc-languages
Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-drc-languages
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- other
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- hxl
- languages
- cod
pretty_name: "DRC: Languages"
dataset_info:
  splits:
  - name: train
    num_examples: 152
  - name: test
    num_examples: 38
---
# DRC: Languages
**Publisher:** CLEAR Global (previously Translators without Borders) · **Source:** [HDX](https://data.humdata.org/dataset/drc-languages) · **License:** `cc-by-sa` · **Updated:** 2025-04-09
---
## Abstract
Language data drawn from the 2016 CAID "Rapport annuel de l’Administration du territoire." Includes languages spoken by territory. Available at the admin 2 level only.
Each row describes one admin 2 unit (territory), identified by `admin_2` and `adm_2_pcode`. The data were last updated on HDX on 2025-04-09. Geographic scope: **COD**.
*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*
---
## Dataset Characteristics
| | |
|---|---|
| **Domain** | Demographics and population |
| **Unit of observation** | Admin 2 unit (territory) |
| **Rows (total)** | 190 |
| **Columns** | 193 (181 numeric, 12 categorical, 0 datetime) |
| **Train split** | 152 rows |
| **Test split** | 38 rows |
| **Geographic scope** | COD |
| **Publisher** | CLEAR Global (previously Translators without Borders) |
| **HDX last updated** | 2025-04-09 |
---
## Variables
**Geographic** — `admin_2` (#adm2+name, Beni, Kibombo), `admin_1` (Kongo-Central, Sud-Kivu, Maniema), `admin_0` (Democratic Republic of Congo, #country+name), `admin0_pcode` (CD, #country+code), `primary_language` (Lingala, Swahili, Tshiluba) and 25 others.
**Demographic** — `ngwaka_minagende`, `other_languages`, `language_data` (Y, N, #meta+data+bool).
**Identifier / Metadata** — `adm_2_pcode` (#adm2+code, CD6109, CD6313), `adm_1_pcode` (CD20, CD62, CD63), `data_confidence` (Low, #meta+confidence), `esa_source`, `esa_processed`.
**Other** — `alur` (range 0.0–0.93), `bakutshu` (range 0.0–0.03), `balika` (range 0.0–0.2), `balobo` (range 0.0–0.25), `bandaka` (range 0.0–0.076) and 150 others.
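Several sample values above (`#adm2+name`, `#country+code`, `#meta+data+bool`) are HXL hashtags rather than data, which suggests that the HXL tag row of the HDX source file was kept as an ordinary row. The following is a minimal sketch of separating that row from the data. It assumes the tag row can be recognised by an `admin_2` value starting with `#`; the helper name `split_hxl_row` and the toy frame are illustrative and not part of the dataset.

```python
import pandas as pd


def split_hxl_row(df: pd.DataFrame, key: str = "admin_2"):
    """Return (data_rows, hxl_tags), where hxl_tags maps column name -> HXL tag.

    Assumption: a tag row is any row whose `key` value starts with '#'.
    """
    is_tag = df[key].astype(str).str.startswith("#")
    tags = {}
    for _, row in df[is_tag].iterrows():
        for col, val in row.items():
            # Numeric columns hold NaN on the tag row, so only strings are kept.
            if isinstance(val, str) and val.startswith("#"):
                tags[col] = val
    return df[~is_tag].reset_index(drop=True), tags


# Toy frame built from the card's sample values; the pairing of names,
# P-codes and shares is a placeholder, not real dataset rows.
toy = pd.DataFrame(
    {
        "admin_2": ["#adm2+name", "Beni", "Kibombo"],
        "adm_2_pcode": ["#adm2+code", "CD6109", "CD6313"],
        "alur": [None, 0.0, 0.0],
    }
)
data, hxl_tags = split_hxl_row(toy)
print(hxl_tags)
print(data)
```

Keeping the tags in a separate mapping preserves the HXL semantics of each column while leaving only territory rows in the frame used for analysis.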
---
## Quick Start
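```python
from datasets import load_dataset

# Download both splits from the Hugging Face Hub (requires network access).
ds = load_dataset("electricsheepafrica/africa-drc-languages")

# Convert each split to a pandas DataFrame for analysis.
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()

print(train.shape)   # (rows, columns)
print(train.head())  # first five rows
```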
---
## Schema
| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `admin_2` | object | 0.0% | #adm2+name, Beni, Kibombo |
| `adm_2_pcode` | object | 0.0% | #adm2+code, CD6109, CD6313 |
| `admin_1` | object | 0.0% | Kongo-Central, Sud-Kivu, Maniema |
| `adm_1_pcode` | object | 0.0% | CD20, CD62, CD63 |
| `admin_0` | object | 0.0% | Democratic Republic of Congo, #country+name |
| `admin0_pcode` | object | 0.0% | CD, #country+code |
| `primary_language` | object | 11.6% | Lingala, Swahili, Tshiluba |
| `primary_language_percentage_share` | float64 | 28.9% | 0.3 – 1.0 (mean 0.8377) |
| `alur` | float64 | 30.0% | 0.0 – 0.93 (mean 0.007) |
| `bakutshu` | float64 | 30.0% | 0.0 – 0.03 (mean 0.0002) |
| `balika` | float64 | 30.0% | 0.0 – 0.2 (mean 0.0015) |
| `balobo` | float64 | 30.0% | 0.0 – 0.25 (mean 0.0019) |
| `bandaka` | float64 | 30.0% | 0.0 – 0.076 (mean 0.0006) |
| `banunu` | float64 | 30.0% | 0.0 – 0.3 (mean 0.0023) |
| `bangala` | float64 | 30.0% | 0.0 – 0.95 (mean 0.0144) |
| `bango` | float64 | 30.0% | 0.0 – 0.68 (mean 0.0051) |
| `batabwa` | float64 | 30.0% | 0.0 – 0.3 (mean 0.0023) |
| `bati` | float64 | 30.0% | 0.0 – 0.28 (mean 0.0021) |
| `baloi` | float64 | 30.0% | 0.0 – 0.1 (mean 0.0008) |
| `bemba` | float64 | 30.0% | 0.0 – 0.7 (mean 0.006) |
| `bembe` | float64 | 30.0% | 0.0 – 0.7 (mean 0.0105) |
| `benge` | float64 | 30.0% | 0.0 – 0.05 (mean 0.0004) |
| `benza` | float64 | 30.0% | 0.0 – 0.22 (mean 0.0017) |
| `bila` | float64 | 30.0% | 0.0 – 0.204 (mean 0.0027) |
| `boa` | float64 | 30.0% | 0.0 – 0.56 (mean 0.0087) |
| `boba` | float64 | 30.0% | 0.0 – 0.01 (mean 0.0001) |
| `bobayi` | float64 | 30.0% | 0.0 – 0.04 (mean 0.0006) |
| `boma` | float64 | 30.0% | |
| `bombo` | float64 | 30.0% | |
| `budja` | float64 | 30.0% | |
| `budu` | float64 | 30.0% | |
| `bunda` | float64 | 30.0% | |
| `dinga` | float64 | 30.0% | |
| `djonga` | float64 | 30.0% | |
| `ekonda` | float64 | 30.0% | |
| `eleku` | float64 | 30.0% | |
| `fuliru` | float64 | 30.0% | |
| `geyna` | float64 | 30.0% | |
| `gilima` | float64 | 30.0% | |
| `gobu` | float64 | 30.0% | |
| `havu` | float64 | 30.0% | |
| `hema` | float64 | 30.0% | |
| `hemba` | float64 | 30.0% | |
| `hesoo` | float64 | 30.0% | |
| `humbu` | float64 | 30.0% | |
| `hunde` | float64 | 30.0% | |
| `kalanga` | float64 | 30.0% | |
| `kango` | float64 | 30.0% | |
| `kanyoka` | float64 | 30.0% | |
| `kaonde` | float64 | 30.0% | |
| `kebu_tuu` | float64 | 30.0% | |
| `kere` | float64 | 30.0% | |
| `kilese` | float64 | 30.0% | |
| `kiongo` | float64 | 30.0% | |
| `kitwa` | float64 | 30.0% | |
| `koka` | float64 | 30.0% | |
| `kongo` | float64 | 29.5% | |
| `kuba` | float64 | 30.0% | |
| `kumu` | float64 | 30.0% | |
| `kusu` | float64 | 30.0% | |
| `kwala` | float64 | 30.0% | |
| `kwange` | float64 | 30.0% | |
| `kwese` | float64 | 30.0% | |
| `lalia` | float64 | 30.0% | |
| `lande` | float64 | 30.0% | |
| `langa` | float64 | 30.0% | |
| `langbasi` | float64 | 30.0% | |
| `leboale` | float64 | 30.0% | |
| `lega` | float64 | 30.0% | |
| `lemfu` | float64 | 30.0% | |
| `lendu` | float64 | 30.0% | |
| `libinza` | float64 | 30.0% | |
| `liboko` | float64 | 30.0% | |
| `likumbe` | float64 | 30.0% | |
| `lingala` | float64 | 30.0% | |
| `lobala` | float64 | 30.0% | |
| `logo` | float64 | 30.0% | |
| `lokele` | float64 | 30.0% | |
| `lokonda` | float64 | 30.0% | |
| `lonkundo` | float64 | 30.0% | |
| `londengese` | float64 | 30.0% | |
| `longando` | float64 | 30.0% | |
| `lontomba` | float64 | 30.0% | |
| `lothsua` | float64 | 30.0% | |
| `lotwa` | float64 | 30.0% | |
| `loyembe` | float64 | 30.0% | |
| `luba` | float64 | 30.0% | |
| `luba_lubangule` | float64 | 30.0% | |
| `lulua` | float64 | 30.0% | |
| `lunda` | float64 | 30.0% | |
| `lungu` | float64 | 30.0% | |
| `mabale` | float64 | 30.0% | |
| `makutu` | float64 | 30.0% | |
| `manianga` | float64 | 30.0% | |
| `manga` | float64 | 30.0% | |
| `manyanga` | float64 | 30.0% | |
| `mashi` | float64 | 30.0% | |
| `mbala` | float64 | 30.0% | |
| `mbanza` | float64 | 30.0% | |
| `mbanza_fula` | float64 | 30.0% | |
| `mbati` | float64 | 30.0% | |
| `mbole` | float64 | 30.0% | |
| `mboma` | float64 | 30.0% | |
| `mbuba` | float64 | 30.0% | |
| `mituku` | float64 | 30.0% | |
| `moko` | float64 | 30.0% | |
| `mongo` | float64 | 30.0% | |
| `mongando` | float64 | 30.0% | |
| `mongbandi` | float64 | 30.0% | |
| `mono` | float64 | 30.0% | |
| `mpama` | float64 | 30.0% | |
| `mpee` | float64 | 30.0% | |
| `nalengwe` | float64 | 30.0% | |
| `nande` | float64 | 30.0% | |
| `nanyembo` | float64 | 30.0% | |
| `ndembo` | float64 | 30.0% | |
| `ndibu` | float64 | 30.0% | |
| `ndo_tuki` | float64 | 30.0% | |
| `ngando` | float64 | 30.0% | |
| `ngbaka` | float64 | 30.0% | |
| `ngbaka_mabo` | float64 | 30.0% | |
| `ngbandi` | float64 | 30.0% | |
| `ngbundu` | float64 | 30.0% | |
| `ngbungbu` | float64 | 30.0% | |
| `ngelema` | float64 | 30.0% | |
| `ngengele` | float64 | 30.0% | |
| `ngoli` | float64 | 30.0% | |
| `ngombe` | float64 | 30.0% | |
| `ngwaka_minagende` | float64 | 30.0% | |
| `ngwandi` | float64 | 30.0% | |
| `nkundo` | float64 | 30.0% | |
| `nkutshu` | float64 | 30.0% | |
| `ntandu` | float64 | 30.0% | |
| `nunu` | float64 | 30.0% | |
| `nyali` | float64 | 30.0% | |
| `nyabwisha` | float64 | 30.0% | |
| `nyanga` | float64 | 30.0% | |
| `nyarwanda` | float64 | 30.0% | |
| `nyindu` | float64 | 30.0% | |
| `nzakara` | float64 | 30.0% | |
| `ohendo` | float64 | 30.0% | |
| `okela` | float64 | 30.0% | |
| `okutsu` | float64 | 30.0% | |
| `pakombe` | float64 | 30.0% | |
| `pazande` | float64 | 30.0% | |
| `pende` | float64 | 30.0% | |
| `piri` | float64 | 30.0% | |
| `popoyi` | float64 | 30.0% | |
| `portuguese` | float64 | 30.0% | |
| `probe` | float64 | 30.0% | |
| `pygmy` | float64 | 30.0% | |
| `sakata` | float64 | 30.0% | |
| `sanga` | float64 | 30.0% | |
| `sango` | float64 | 30.0% | |
| `songe` | float64 | 30.0% | |
| `songola` | float64 | 30.0% | |
| `suku` | float64 | 30.0% | |
| `swahili` | float64 | 30.0% | |
| `tabwa` | float64 | 30.0% | |
| `teke` | float64 | 30.0% | |
| `tembo` | float64 | 30.0% | |
| `tete_south` | float64 | 30.0% | |
| `tetela` | float64 | 30.0% | |
| `togba` | float64 | 30.0% | |
| `topoke` | float64 | 30.0% | |
| `tshibindi` | float64 | 30.0% | |
| `tshikete` | float64 | 30.0% | |
| `tshikwamputu` | float64 | 30.0% | |
| `tsihilualua` | float64 | 30.0% | |
| `tshiluba` | float64 | 30.0% | |
| `tshokwe` | float64 | 30.0% | |
| `tua` | float64 | 30.0% | |
| `tungwa` | float64 | 30.0% | |
| `urund` | float64 | 30.0% | |
| `vungu` | float64 | 30.0% | |
| `waria` | float64 | 30.0% | |
| `yaelima` | float64 | 30.0% | |
| `yaka` | float64 | 30.0% | |
| `yazi` | float64 | 30.0% | |
| `yanzi` | float64 | 30.0% | |
| `yogo` | float64 | 30.0% | |
| `yombe` | float64 | 30.0% | |
| `zande` | float64 | 30.0% | |
| `zela` | float64 | 30.0% | |
| `zimba` | float64 | 30.0% | |
| `zola` | float64 | 30.0% | |
| `other_languages` | float64 | 30.0% | |
| `population_total` | float64 | 0.5% | |
| `language_data` | object | 0.0% | Y, N, #meta+data+bool |
| `data_confidence` | object | 0.0% | Low, #meta+confidence |
| `notes` | object | 61.6% | "Population is reflected in CD3206, language shares inherited from same."; "No statistical language data available."; "Population is reflected in CD8306, language shares inherited from same." |
| `esa_source` | object | 0.0% | |
| `esa_processed` | object | 0.0% | |
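The schema is wide, with one share column per language. For ranking or plotting, a long layout with one row per territory and language is often easier to work with. The sketch below assumes that every numeric column other than `primary_language_percentage_share` and `population_total` is a language share (so `other_languages` is treated as a language), and that shares are fractions between 0 and 1 as the ranges above suggest. The function name `to_long` and the toy values are made up for illustration.

```python
import pandas as pd

# Numeric columns that are not per-language shares (assumption based on the schema).
NON_LANGUAGE_NUMERIC = {"primary_language_percentage_share", "population_total"}


def to_long(df: pd.DataFrame) -> pd.DataFrame:
    """Melt wide share columns into (adm_2_pcode, language, share) rows."""
    lang_cols = [
        c for c in df.select_dtypes("number").columns
        if c not in NON_LANGUAGE_NUMERIC
    ]
    long = df.melt(
        id_vars=["adm_2_pcode"],
        value_vars=lang_cols,
        var_name="language",
        value_name="share",
    )
    # Keep only languages with a positive reported share; NaN counts as absent.
    long = long[long["share"].fillna(0) > 0]
    return long.sort_values(
        ["adm_2_pcode", "share"], ascending=[True, False]
    ).reset_index(drop=True)


# Toy rows with invented P-codes and shares, not real dataset values.
toy = pd.DataFrame(
    {
        "adm_2_pcode": ["X1", "X2"],
        "swahili": [0.9, 0.0],
        "lingala": [0.1, 0.75],
        "other_languages": [0.0, 0.25],
        "population_total": [1000.0, 2000.0],
    }
)
long = to_long(toy)
print(long)
```

In the long frame the first row of each `adm_2_pcode` group is the language with the largest share, which can be compared against `primary_language` as a consistency check.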
---
## Numeric Summary
| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `primary_language_percentage_share` | 0.3 | 1.0 | 0.8377 | 0.9 |
| `alur` | 0.0 | 0.93 | 0.007 | 0.0 |
| `bakutshu` | 0.0 | 0.03 | 0.0002 | 0.0 |
| `balika` | 0.0 | 0.2 | 0.0015 | 0.0 |
| `balobo` | 0.0 | 0.25 | 0.0019 | 0.0 |
| `bandaka` | 0.0 | 0.076 | 0.0006 | 0.0 |
| `banunu` | 0.0 | 0.3 | 0.0023 | 0.0 |
| `bangala` | 0.0 | 0.95 | 0.0144 | 0.0 |
| `bango` | 0.0 | 0.68 | 0.0051 | 0.0 |
| `batabwa` | 0.0 | 0.3 | 0.0023 | 0.0 |
| `bati` | 0.0 | 0.28 | 0.0021 | 0.0 |
| `baloi` | 0.0 | 0.1 | 0.0008 | 0.0 |
| `bemba` | 0.0 | 0.7 | 0.006 | 0.0 |
| `bembe` | 0.0 | 0.7 | 0.0105 | 0.0 |
| `benge` | 0.0 | 0.05 | 0.0004 | 0.0 |
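A summary of this shape can be recomputed with pandas. The card does not state whether its figures were computed on the full table or on one split, so the sketch below simply summarises whatever frame it is given and skips missing values, which is the pandas default. The function name `numeric_summary` and the toy values are illustrative.

```python
import pandas as pd


def numeric_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Min, max, mean and median per numeric column, ignoring NaN."""
    num = df.select_dtypes("number")
    out = num.agg(["min", "max", "mean", "median"]).T
    out.columns = ["Min", "Max", "Mean", "Median"]
    return out.round(4)


# Toy values for illustration only, not real dataset rows.
toy = pd.DataFrame(
    {
        "admin_2": ["a", "b", "c", "d"],
        "alur": [0.0, 0.0, 0.93, None],
        "bakutshu": [0.0, 0.03, 0.0, None],
    }
)
summary = numeric_summary(toy)
print(summary)
```

A median of 0.0 alongside a non-zero maximum, as in the table above, indicates that a language is reported in only a minority of rows.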
---
## Curation
Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. 181 columns were cast from string to numeric (none parsed as datetime), each because more than 85% of its values parsed successfully. The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.
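The curation steps can be illustrated with a small standard-library sketch. It is not the publisher's pipeline: the function names are hypothetical, markers are assumed to be matched case-insensitively, the parse rate is assumed to be measured over non-missing values, and the shuffle shown here will not reproduce the actual train/test row assignment.

```python
import math
import random
import re

# Missing-value markers listed in the card, compared case-insensitively (assumption).
MISSING_MARKERS = {"n/a", "null", "none", "-", "unknown", "no data", "#n/a"}


def snake_case(name):
    """Lowercase a header and collapse runs of other characters into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def unify_missing(value):
    """Replace a recognised missing-value marker with NaN."""
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return math.nan
    return value


def parses_as_float(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def cast_if_numeric(values, threshold=0.85):
    """Cast to float when more than `threshold` of non-missing values parse."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return values
    rate = sum(parses_as_float(v) for v in present) / len(present)
    if rate <= threshold:
        return values
    # Values that fail to parse (for example a stray tag string) become NaN.
    return [
        float(v) if not is_missing(v) and parses_as_float(v) else math.nan
        for v in values
    ]


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and cut an 80/20 train/test split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy column: "#hxl+tag" is a placeholder string, the numbers are invented.
raw = ["#hxl+tag", "0.93", "0", "N/A", "0.2", "0.0", "0.1", "0.3", "0.05", "0.0", "0.7"]
column = cast_if_numeric([unify_missing(v) for v in raw])
train_rows, test_rows = split_rows(range(190))
print(snake_case("Primary Language"), len(train_rows), len(test_rows))
```

With 190 rows and a 0.2 test fraction the split sizes match the card (152 and 38). A cast of this kind also turns any non-numeric string left in a mostly numeric column into `NaN`, which is one way a tag row could contribute to the null percentages reported in the schema.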
---
## Limitations
- Data originates from CLEAR Global (previously Translators without Borders) and has not been independently validated by ESA.
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- The following columns have >20% missing values and should be treated with caution in modelling: `primary_language_percentage_share` (28.9%), the per-language share columns from `alur` through `other_languages` (about 30% each), and `notes` (61.6%).
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/drc-languages) for the publisher's own methodology notes and caveats.
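The missing-value caveat above can be checked directly on a loaded frame. The sketch below flags columns above the 20% threshold and, as an alternative to dropping them, keeps only rows where `language_data` is `Y`. That the missing shares are concentrated in rows marked `N` is an assumption suggested by the `notes` values, not something the card states. The function name `high_missing_columns` and the toy rows are illustrative.

```python
import pandas as pd


def high_missing_columns(df: pd.DataFrame, threshold: float = 0.20) -> pd.Series:
    """Share of missing values per column, limited to columns above `threshold`."""
    null_share = df.isna().mean()
    flagged = null_share[null_share > threshold]
    return flagged.sort_values(ascending=False)


# Toy rows for illustration only, not real dataset values.
toy = pd.DataFrame(
    {
        "adm_2_pcode": ["A", "B", "C", "D", "E"],
        "language_data": ["Y", "N", "N", "Y", "Y"],
        "alur": [0.0, None, None, 0.1, 0.0],  # 40% missing
        "population_total": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

flagged = high_missing_columns(toy)
# Option 1: drop the flagged columns.
usable = toy.drop(columns=flagged.index)
# Option 2: keep the columns but drop rows without language data.
rows_with_data = toy[toy["language_data"] == "Y"]
print(flagged)
print(rows_with_data)
```

If the second option removes most of the missing values on the real data, the gaps are structural (territories without language statistics) rather than scattered, and imputing them would be hard to justify.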
---
## Citation
```bibtex
@dataset{hdx_africa_drc_languages,
title = {DRC: Languages},
author = {CLEAR Global (previously Translators without Borders)},
year = {2025},
url = {https://data.humdata.org/dataset/drc-languages},
note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```
---
*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica) — Africa's ML dataset infrastructure. Lagos, Nigeria.*
Provider: electricsheepafrica



