electricsheepafrica/africa-drc-languages

Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/africa-drc-languages
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- other
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- hxl
- languages
- cod
pretty_name: "DRC: Languages"
dataset_info:
  splits:
  - name: train
    num_examples: 152
  - name: test
    num_examples: 38
---

# DRC: Languages

**Publisher:** CLEAR Global (previously Translators without Borders) · **Source:** [HDX](https://data.humdata.org/dataset/drc-languages) · **License:** `cc-by-sa` · **Updated:** 2025-04-09

---

## Abstract

Language data drawn from the 2016 CAID "Rapport annuel de l’Administration du territoire." Includes languages spoken by territory. Available at the admin 2 level only. Each row in this dataset represents geolocated point observations. Data was last updated on HDX on 2025-04-09. Geographic scope: **COD**.

*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*

---

## Dataset Characteristics

| | |
|---|---|
| **Domain** | Demographics and population |
| **Unit of observation** | Geolocated point observations |
| **Rows (total)** | 190 |
| **Columns** | 193 (181 numeric, 12 categorical, 0 datetime) |
| **Train split** | 152 rows |
| **Test split** | 38 rows |
| **Geographic scope** | COD |
| **Publisher** | CLEAR Global (previously Translators without Borders) |
| **HDX last updated** | 2025-04-09 |

---

## Variables

**Geographic:** `admin_2` (#adm2+name, Beni, Kibombo), `admin_1` (Kongo-Central, Sud-Kivu, Maniema), `admin_0` (Democratic Republic of Congo, #country+name), `admin0_pcode` (CD, #country+code), `primary_language` (Lingala, Swahili, Tshiluba) and 25 others.

**Demographic:** `ngwaka_minagende`, `other_languages`, `language_data` (Y, N, #meta+data+bool).

**Identifier / Metadata:** `adm_2_pcode` (#adm2+code, CD6109, CD6313), `adm_1_pcode` (CD20, CD62, CD63), `data_confidence` (Low, #meta+confidence), `esa_source`, `esa_processed`.

**Other:** `alur` (range 0.0–0.93), `bakutshu` (range 0.0–0.03), `balika` (range 0.0–0.2), `balobo` (range 0.0–0.25), `bandaka` (range 0.0–0.076) and 150 others.

---

## Quick Start

```python
from datasets import load_dataset

# Download both splits from the Hugging Face Hub
ds = load_dataset("electricsheepafrica/africa-drc-languages")

# Convert each split to a pandas DataFrame
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()

print(train.shape)
train.head()
```

---

## Schema

| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `admin_2` | object | 0.0% | #adm2+name, Beni, Kibombo |
| `adm_2_pcode` | object | 0.0% | #adm2+code, CD6109, CD6313 |
| `admin_1` | object | 0.0% | Kongo-Central, Sud-Kivu, Maniema |
| `adm_1_pcode` | object | 0.0% | CD20, CD62, CD63 |
| `admin_0` | object | 0.0% | Democratic Republic of Congo, #country+name |
| `admin0_pcode` | object | 0.0% | CD, #country+code |
| `primary_language` | object | 11.6% | Lingala, Swahili, Tshiluba |
| `primary_language_percentage_share` | float64 | 28.9% | 0.3 – 1.0 (mean 0.8377) |
| `alur` | float64 | 30.0% | 0.0 – 0.93 (mean 0.007) |
| `bakutshu` | float64 | 30.0% | 0.0 – 0.03 (mean 0.0002) |
| `balika` | float64 | 30.0% | 0.0 – 0.2 (mean 0.0015) |
| `balobo` | float64 | 30.0% | 0.0 – 0.25 (mean 0.0019) |
| `bandaka` | float64 | 30.0% | 0.0 – 0.076 (mean 0.0006) |
| `banunu` | float64 | 30.0% | 0.0 – 0.3 (mean 0.0023) |
| `bangala` | float64 | 30.0% | 0.0 – 0.95 (mean 0.0144) |
| `bango` | float64 | 30.0% | 0.0 – 0.68 (mean 0.0051) |
| `batabwa` | float64 | 30.0% | 0.0 – 0.3 (mean 0.0023) |
| `bati` | float64 | 30.0% | 0.0 – 0.28 (mean 0.0021) |
| `baloi` | float64 | 30.0% | 0.0 – 0.1 (mean 0.0008) |
| `bemba` | float64 | 30.0% | 0.0 – 0.7 (mean 0.006) |
| `bembe` | float64 | 30.0% | 0.0 – 0.7 (mean 0.0105) |
| `benge` | float64 | 30.0% | 0.0 – 0.05 (mean 0.0004) |
| `benza` | float64 | 30.0% | 0.0 – 0.22 (mean 0.0017) |
| `bila` | float64 | 30.0% | 0.0 – 0.204 (mean 0.0027) |
| `boa` | float64 | 30.0% | 0.0 – 0.56 (mean 0.0087) |
| `boba` | float64 | 30.0% | 0.0 – 0.01 (mean 0.0001) |
| `bobayi` | float64 | 30.0% | 0.0 – 0.04 (mean 0.0006) |
| `boma` | float64 | 30.0% | |
| `bombo` | float64 | 30.0% | |
| `budja` | float64 | 30.0% | |
| `budu` | float64 | 30.0% | |
| `bunda` | float64 | 30.0% | |
| `dinga` | float64 | 30.0% | |
| `djonga` | float64 | 30.0% | |
| `ekonda` | float64 | 30.0% | |
| `eleku` | float64 | 30.0% | |
| `fuliru` | float64 | 30.0% | |
| `geyna` | float64 | 30.0% | |
| `gilima` | float64 | 30.0% | |
| `gobu` | float64 | 30.0% | |
| `havu` | float64 | 30.0% | |
| `hema` | float64 | 30.0% | |
| `hemba` | float64 | 30.0% | |
| `hesoo` | float64 | 30.0% | |
| `humbu` | float64 | 30.0% | |
| `hunde` | float64 | 30.0% | |
| `kalanga` | float64 | 30.0% | |
| `kango` | float64 | 30.0% | |
| `kanyoka` | float64 | 30.0% | |
| `kaonde` | float64 | 30.0% | |
| `kebu_tuu` | float64 | 30.0% | |
| `kere` | float64 | 30.0% | |
| `kilese` | float64 | 30.0% | |
| `kiongo` | float64 | 30.0% | |
| `kitwa` | float64 | 30.0% | |
| `koka` | float64 | 30.0% | |
| `kongo` | float64 | 29.5% | |
| `kuba` | float64 | 30.0% | |
| `kumu` | float64 | 30.0% | |
| `kusu` | float64 | 30.0% | |
| `kwala` | float64 | 30.0% | |
| `kwange` | float64 | 30.0% | |
| `kwese` | float64 | 30.0% | |
| `lalia` | float64 | 30.0% | |
| `lande` | float64 | 30.0% | |
| `langa` | float64 | 30.0% | |
| `langbasi` | float64 | 30.0% | |
| `leboale` | float64 | 30.0% | |
| `lega` | float64 | 30.0% | |
| `lemfu` | float64 | 30.0% | |
| `lendu` | float64 | 30.0% | |
| `libinza` | float64 | 30.0% | |
| `liboko` | float64 | 30.0% | |
| `likumbe` | float64 | 30.0% | |
| `lingala` | float64 | 30.0% | |
| `lobala` | float64 | 30.0% | |
| `logo` | float64 | 30.0% | |
| `lokele` | float64 | 30.0% | |
| `lokonda` | float64 | 30.0% | |
| `lonkundo` | float64 | 30.0% | |
| `londengese` | float64 | 30.0% | |
| `longando` | float64 | 30.0% | |
| `lontomba` | float64 | 30.0% | |
| `lothsua` | float64 | 30.0% | |
| `lotwa` | float64 | 30.0% | |
| `loyembe` | float64 | 30.0% | |
| `luba` | float64 | 30.0% | |
| `luba_lubangule` | float64 | 30.0% | |
| `lulua` | float64 | 30.0% | |
| `lunda` | float64 | 30.0% | |
| `lungu` | float64 | 30.0% | |
| `mabale` | float64 | 30.0% | |
| `makutu` | float64 | 30.0% | |
| `manianga` | float64 | 30.0% | |
| `manga` | float64 | 30.0% | |
| `manyanga` | float64 | 30.0% | |
| `mashi` | float64 | 30.0% | |
| `mbala` | float64 | 30.0% | |
| `mbanza` | float64 | 30.0% | |
| `mbanza_fula` | float64 | 30.0% | |
| `mbati` | float64 | 30.0% | |
| `mbole` | float64 | 30.0% | |
| `mboma` | float64 | 30.0% | |
| `mbuba` | float64 | 30.0% | |
| `mituku` | float64 | 30.0% | |
| `moko` | float64 | 30.0% | |
| `mongo` | float64 | 30.0% | |
| `mongando` | float64 | 30.0% | |
| `mongbandi` | float64 | 30.0% | |
| `mono` | float64 | 30.0% | |
| `mpama` | float64 | 30.0% | |
| `mpee` | float64 | 30.0% | |
| `nalengwe` | float64 | 30.0% | |
| `nande` | float64 | 30.0% | |
| `nanyembo` | float64 | 30.0% | |
| `ndembo` | float64 | 30.0% | |
| `ndibu` | float64 | 30.0% | |
| `ndo_tuki` | float64 | 30.0% | |
| `ngando` | float64 | 30.0% | |
| `ngbaka` | float64 | 30.0% | |
| `ngbaka_mabo` | float64 | 30.0% | |
| `ngbandi` | float64 | 30.0% | |
| `ngbundu` | float64 | 30.0% | |
| `ngbungbu` | float64 | 30.0% | |
| `ngelema` | float64 | 30.0% | |
| `ngengele` | float64 | 30.0% | |
| `ngoli` | float64 | 30.0% | |
| `ngombe` | float64 | 30.0% | |
| `ngwaka_minagende` | float64 | 30.0% | |
| `ngwandi` | float64 | 30.0% | |
| `nkundo` | float64 | 30.0% | |
| `nkutshu` | float64 | 30.0% | |
| `ntandu` | float64 | 30.0% | |
| `nunu` | float64 | 30.0% | |
| `nyali` | float64 | 30.0% | |
| `nyabwisha` | float64 | 30.0% | |
| `nyanga` | float64 | 30.0% | |
| `nyarwanda` | float64 | 30.0% | |
| `nyindu` | float64 | 30.0% | |
| `nzakara` | float64 | 30.0% | |
| `ohendo` | float64 | 30.0% | |
| `okela` | float64 | 30.0% | |
| `okutsu` | float64 | 30.0% | |
| `pakombe` | float64 | 30.0% | |
| `pazande` | float64 | 30.0% | |
| `pende` | float64 | 30.0% | |
| `piri` | float64 | 30.0% | |
| `popoyi` | float64 | 30.0% | |
| `portuguese` | float64 | 30.0% | |
| `probe` | float64 | 30.0% | |
| `pygmy` | float64 | 30.0% | |
| `sakata` | float64 | 30.0% | |
| `sanga` | float64 | 30.0% | |
| `sango` | float64 | 30.0% | |
| `songe` | float64 | 30.0% | |
| `songola` | float64 | 30.0% | |
| `suku` | float64 | 30.0% | |
| `swahili` | float64 | 30.0% | |
| `tabwa` | float64 | 30.0% | |
| `teke` | float64 | 30.0% | |
| `tembo` | float64 | 30.0% | |
| `tete_south` | float64 | 30.0% | |
| `tetela` | float64 | 30.0% | |
| `togba` | float64 | 30.0% | |
| `topoke` | float64 | 30.0% | |
| `tshibindi` | float64 | 30.0% | |
| `tshikete` | float64 | 30.0% | |
| `tshikwamputu` | float64 | 30.0% | |
| `tsihilualua` | float64 | 30.0% | |
| `tshiluba` | float64 | 30.0% | |
| `tshokwe` | float64 | 30.0% | |
| `tua` | float64 | 30.0% | |
| `tungwa` | float64 | 30.0% | |
| `urund` | float64 | 30.0% | |
| `vungu` | float64 | 30.0% | |
| `waria` | float64 | 30.0% | |
| `yaelima` | float64 | 30.0% | |
| `yaka` | float64 | 30.0% | |
| `yazi` | float64 | 30.0% | |
| `yanzi` | float64 | 30.0% | |
| `yogo` | float64 | 30.0% | |
| `yombe` | float64 | 30.0% | |
| `zande` | float64 | 30.0% | |
| `zela` | float64 | 30.0% | |
| `zimba` | float64 | 30.0% | |
| `zola` | float64 | 30.0% | |
| `other_languages` | float64 | 30.0% | |
| `population_total` | float64 | 0.5% | |
| `language_data` | object | 0.0% | Y, N, #meta+data+bool |
| `data_confidence` | object | 0.0% | Low, #meta+confidence |
| `notes` | object | 61.6% | Population is reflected in CD3206, language shares inherited from same., No statistical language data available., Population is reflected in CD8306, language shares inherited from same. |
| `esa_source` | object | 0.0% | |
| `esa_processed` | object | 0.0% | |

---

## Numeric Summary

| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `primary_language_percentage_share` | 0.3 | 1.0 | 0.8377 | 0.9 |
| `alur` | 0.0 | 0.93 | 0.007 | 0.0 |
| `bakutshu` | 0.0 | 0.03 | 0.0002 | 0.0 |
| `balika` | 0.0 | 0.2 | 0.0015 | 0.0 |
| `balobo` | 0.0 | 0.25 | 0.0019 | 0.0 |
| `bandaka` | 0.0 | 0.076 | 0.0006 | 0.0 |
| `banunu` | 0.0 | 0.3 | 0.0023 | 0.0 |
| `bangala` | 0.0 | 0.95 | 0.0144 | 0.0 |
| `bango` | 0.0 | 0.68 | 0.0051 | 0.0 |
| `batabwa` | 0.0 | 0.3 | 0.0023 | 0.0 |
| `bati` | 0.0 | 0.28 | 0.0021 | 0.0 |
| `baloi` | 0.0 | 0.1 | 0.0008 | 0.0 |
| `bemba` | 0.0 | 0.7 | 0.006 | 0.0 |
| `bembe` | 0.0 | 0.7 | 0.0105 | 0.0 |
| `benge` | 0.0 | 0.05 | 0.0004 | 0.0 |

---

## Curation

Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. 181 columns were cast from string to numeric or datetime based on parse-success rate (>85% threshold). The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.

---

## Limitations

- Data originates from CLEAR Global (previously Translators without Borders) and has not been independently validated by ESA.
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- The following columns have >20% missing values and should be treated with caution in modelling: `primary_language_percentage_share`, `alur`, `bakutshu`, `balika`, `balobo`, `bandaka`, `banunu`, `bangala`...
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/drc-languages) for the publisher's own methodology notes and caveats.

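
Two consequences of the limitations above can be checked before modelling. The sample values in the schema (`#adm2+name`, `#country+code`, `#meta+confidence`) suggest that the HXL hashtag row from the source file may survive as an ordinary data row, and many columns exceed the 20% missing-value mark. The following is a minimal sketch, not part of the original card, that works on plain row dictionaries (for example from `train.to_dict("records")`). The helper names and the tiny hand-made sample are hypothetical, and the assumption that a leading `#` in `admin_2` identifies a tag row should be verified against the real data.

```python
import math


def is_missing(value):
    """True for None or float NaN, the two forms a missing cell can take."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_hxl_tag_rows(rows, key="admin_2"):
    """Drop rows whose `key` cell looks like an HXL hashtag (assumed: starts with '#')."""
    return [r for r in rows if not str(r.get(key, "")).startswith("#")]


def missing_share(rows, column):
    """Fraction of rows in which `column` is missing."""
    if not rows:
        return 0.0
    return sum(is_missing(r.get(column)) for r in rows) / len(rows)


def sparse_columns(rows, threshold=0.20):
    """Columns whose missing share exceeds `threshold` (the card flags >20%)."""
    columns = rows[0].keys() if rows else []
    return [c for c in columns if missing_share(rows, c) > threshold]


# Hand-made illustrative rows, NOT real records from the dataset.
sample = [
    {"admin_2": "#adm2+name", "primary_language": "#tag", "alur": None},
    {"admin_2": "Territory A", "primary_language": "Swahili", "alur": 0.0},
    {"admin_2": "Territory B", "primary_language": "Lingala", "alur": float("nan")},
    {"admin_2": "Territory C", "primary_language": None, "alur": 0.1},
]

clean = drop_hxl_tag_rows(sample)
print(len(clean))             # 3
print(sparse_columns(clean))  # ['primary_language', 'alur']
```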
---

## Citation

```bibtex
@dataset{hdx_africa_drc_languages,
  title = {DRC: Languages},
  author = {CLEAR Global (previously Translators without Borders)},
  year = {2025},
  url = {https://data.humdata.org/dataset/drc-languages},
  note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```

---

*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica): Africa's ML dataset infrastructure. Lagos, Nigeria.*
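
The curation steps described in the Curation section (unifying missing-value markers, casting columns above an 85% parse-success rate, and a seeded 80/20 split) can be illustrated with the standard library alone. This is a hedged sketch rather than Electric Sheep Africa's actual pipeline: the function names are hypothetical, and it assumes case-insensitive marker matching, a success rate computed over non-missing values only, and a simple shuffle-based split, none of which the card specifies.

```python
import random

# Markers listed in the card, lowercased (case-insensitive matching is an assumption).
MISSING_MARKERS = {"n/a", "null", "none", "-", "unknown", "no data", "#n/a"}


def normalise_missing(value):
    """Map any listed missing-value marker to None; leave other values untouched."""
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return None
    return value


def parse_success_rate(values):
    """Share of non-missing values that parse as float."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    parsed = 0
    for v in present:
        try:
            float(v)
            parsed += 1
        except (TypeError, ValueError):
            pass
    return parsed / len(present)


def cast_if_numeric(values, threshold=0.85):
    """Cast a column to float when its parse rate exceeds `threshold`.

    Values that fail to parse (including None) become None.
    """
    if parse_success_rate(values) <= threshold:
        return list(values)
    result = []
    for v in values:
        try:
            result.append(float(v))
        except (TypeError, ValueError):
            result.append(None)
    return result


def split_80_20(rows, seed=42):
    """Shuffle with a fixed seed and cut at 80% of the rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * 0.8))
    return shuffled[:cut], shuffled[cut:]


raw = ["0.5", "N/A", "0.25", "-", "1"]
column = [normalise_missing(v) for v in raw]
print(cast_if_numeric(column))  # [0.5, None, 0.25, None, 1.0]

train, test = split_80_20(range(190))
print(len(train), len(test))    # 152 38
```

With 190 rows, an 80/20 cut yields 152 and 38 rows, which matches the split sizes reported in the card. The row assignment itself will differ from the published split, because the original shuffling method is not documented.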
Provider:
electricsheepafrica