electricsheepafrica/africa-guinea-languages
Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/electricsheepafrica/africa-guinea-languages

---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- other
task_ids: []
tags:
- africa
- humanitarian
- hdx
- electric-sheep-africa
- languages
- gin
pretty_name: "Guinea: Languages"
dataset_info:
splits:
- name: train
num_examples: 17
- name: test
num_examples: 4
---
# Guinea: Languages
**Publisher:** CLEAR Global (previously Translators without Borders) · **Source:** [HDX](https://data.humdata.org/dataset/guinea-languages) · **License:** `cc-by-sa` · **Updated:** 2026-04-06
---
## Abstract
Data on languages spoken in Guinea, showing the main language spoken in the household by proportion of the population. Data is drawn from IPUMS International. For more resources on the languages of Guinea and language use in humanitarian contexts please visit: https://clearglobal.org/language-maps-and-data/
Each row in this dataset represents a time-series observation. Temporal coverage is indicated by the `datetime_published` and `date_creation` columns. Geographic scope: **GIN** (Guinea).
*Curated into ML-ready Parquet format by [Electric Sheep Africa](https://huggingface.co/electricsheepafrica).*
---
## Dataset Characteristics
| | |
|---|---|
| **Domain** | Demographics and population |
| **Unit of observation** | Time-series observations |
| **Rows (total)** | 22 |
| **Columns** | 16 (4 numeric, 10 categorical, 2 datetime) |
| **Train split** | 17 rows |
| **Test split** | 4 rows |
| **Geographic scope** | GIN |
| **Publisher** | CLEAR Global (previously Translators without Borders) |
| **HDX last updated** | 2026-04-06 |
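
The figures in this table can be cross-checked with simple arithmetic. The sketch below uses only numbers reported on this card; the variable names are illustrative and nothing is loaded from the dataset itself.

```python
# Figures as reported on this card.
total_rows = 22
split_rows = {"train": 17, "test": 4}
total_columns = 16
columns_by_kind = {"numeric": 4, "categorical": 10, "datetime": 2}

# Rows accounted for by the two splits, and the gap to the stated total.
rows_in_splits = sum(split_rows.values())
missing_rows = total_rows - rows_in_splits

# Do the per-kind column counts add up to the stated column total?
columns_add_up = sum(columns_by_kind.values()) == total_columns

print(f"rows in splits: {rows_in_splits} of {total_rows} (difference {missing_rows})")
print(f"column kinds add up to {total_columns}: {columns_add_up}")
```

The column counts agree, but the splits account for 21 rows against a stated total of 22, so the actual sizes are worth confirming with `len(ds["train"])` and `len(ds["test"])` after loading.
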
---
## Variables
**Geographic** — `location_code` (GIN), `location_name` (Guinea), `location_level` (range 0.0–0.0), `reliability_score` (range 0.695–0.695), `representivity_rating` (very_high).
**Temporal** — `datetime_published`, `date_creation`.
**Demographic** — `language_code` (land1256, wame1240, temn1245), `language_name` (Landoma, Wamey, Northern Mel), `language_rank` (range 1.0–22.0).
**Outcome / Measurement** — `proportion_value` (range 0.0005–0.3129).
**Identifier / Metadata** — `dataset_name` (Guinea Census 2014 (IPUMS extract)), `source` (IPUMS International), `esa_source` (HDX), `esa_processed` (2026-04-07).
**Other** — `url` (https://api.ipums.org/downloads/ipumsi/api/v1/extracts/2404368/ipumsi_00292.sav.gz).
---
## Quick Start
```python
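from datasets import load_dataset

# Download both splits from the Hugging Face Hub.
ds = load_dataset("electricsheepafrica/africa-guinea-languages")

# Convert each split to a pandas DataFrame.
train = ds["train"].to_pandas()
test = ds["test"].to_pandas()

print(train.shape)   # (rows, columns)
print(train.head())  # first five rows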
```
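
Because the train/test split divides a single national ranking across two files, it can help to recombine the splits before reading the ranking. The following is a minimal sketch, assuming each record is a plain dict carrying the `language_name`, `language_rank`, and `proportion_value` columns, and assuming rank 1 denotes the most widely spoken language. `top_languages` is a hypothetical helper, and the sample rows are invented placeholders, not values from the dataset.

```python
from typing import Any


def top_languages(rows: list[dict[str, Any]], n: int = 5) -> list[tuple[str, float]]:
    """Return the n best-ranked languages as (name, proportion) pairs."""
    ranked = sorted(rows, key=lambda row: row["language_rank"])
    return [(row["language_name"], row["proportion_value"]) for row in ranked[:n]]


# Placeholder rows for illustration only (not real dataset values).
sample_rows = [
    {"language_name": "Language B", "language_rank": 2, "proportion_value": 0.20},
    {"language_name": "Language C", "language_rank": 3, "proportion_value": 0.10},
    {"language_name": "Language A", "language_rank": 1, "proportion_value": 0.30},
]

print(top_languages(sample_rows, n=2))
```

With the real data, the rows could be built as `list(ds["train"]) + list(ds["test"])`, since iterating over a `datasets.Dataset` yields one dict per row.
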
---
## Schema
| Column | Type | Null % | Range / Sample Values |
|---|---|---|---|
| `location_code` | object | 0.0% | GIN |
| `location_name` | object | 0.0% | Guinea |
| `location_level` | int64 | 0.0% | 0.0 – 0.0 (mean 0.0) |
| `language_code` | object | 0.0% | land1256, wame1240, temn1245 |
| `language_name` | object | 0.0% | Landoma, Wamey, Northern Mel |
| `language_rank` | int64 | 0.0% | 1.0 – 22.0 (mean 11.5) |
| `proportion_value` | float64 | 0.0% | 0.0005 – 0.3129 (mean 0.0455) |
| `reliability_score` | float64 | 0.0% | 0.695 – 0.695 (mean 0.695) |
| `dataset_name` | object | 0.0% | Guinea Census 2014 (IPUMS extract) |
| `url` | object | 0.0% | https://api.ipums.org/downloads/ipumsi/api/v1/extracts/2404368/ipumsi_00292.sav.gz |
| `source` | object | 0.0% | IPUMS International |
| `datetime_published` | datetime64[ns] | 0.0% | |
| `date_creation` | datetime64[ns] | 0.0% | |
| `representivity_rating` | object | 0.0% | very_high |
| `esa_source` | object | 0.0% | HDX |
| `esa_processed` | object | 0.0% | 2026-04-07 |
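
For code that consumes individual rows, the core columns of the schema can be mirrored in a small typed record. This is a sketch rather than part of the dataset's tooling: `LanguageRow` and `validate` are hypothetical names, only a subset of the 16 columns is modelled, and the example rank and proportion are placeholders chosen inside the ranges listed above. Treating `proportion_value` as a fraction between 0 and 1 is an assumption based on the reported range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRow:
    """Subset of the schema: one language and its share of the population."""
    location_code: str       # e.g. "GIN"
    language_code: str       # e.g. "land1256"
    language_name: str       # e.g. "Landoma"
    language_rank: int
    proportion_value: float  # assumed to be a fraction, not a percentage


def validate(row: LanguageRow) -> list[str]:
    """Return a list of problems; an empty list means the row looks sane."""
    problems = []
    if row.language_rank < 1:
        problems.append("language_rank must be 1 or greater")
    if not 0.0 <= row.proportion_value <= 1.0:
        problems.append("proportion_value must lie in [0, 1]")
    if len(row.location_code) != 3:
        problems.append("location_code should be a three-letter code")
    return problems


# Placeholder values: the rank and proportion are illustrative, not taken from the data.
example = LanguageRow("GIN", "land1256", "Landoma", 10, 0.01)
print(validate(example))
```
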
---
## Numeric Summary
| Column | Min | Max | Mean | Median |
|---|---|---|---|---|
| `location_level` | 0.0 | 0.0 | 0.0 | 0.0 |
| `language_rank` | 1.0 | 22.0 | 11.5 | 11.5 |
| `proportion_value` | 0.0005 | 0.3129 | 0.0455 | 0.008 |
| `reliability_score` | 0.695 | 0.695 | 0.695 | 0.695 |
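
Two of the summary figures can be reproduced from first principles, which gives a quick consistency check. The sketch assumes that `language_rank` takes each integer from 1 to 22 exactly once and that the proportions sum to roughly one. Neither is stated on the card, though both are consistent with the reported statistics.

```python
import statistics

# Assumption: ranks are the integers 1..22, one per language.
ranks = list(range(1, 23))
rank_mean = statistics.mean(ranks)
rank_median = statistics.median(ranks)

# If 22 proportions sum to 1, their mean is 1/22.
implied_mean_proportion = round(1 / len(ranks), 4)

print(rank_mean, rank_median)   # compare with the reported 11.5 and 11.5
print(implied_mean_proportion)  # compare with the reported mean of 0.0455
```

The median proportion (0.008) sits well below the mean (0.0455), which points to a skewed distribution in which a few languages account for most households.
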
---
## Curation
Raw data was downloaded from HDX via the CKAN API and converted to Parquet. Column names were lowercased and standardised to snake_case. Common missing-value markers (`N/A`, `null`, `none`, `-`, `unknown`, `no data`, `#N/A`) were unified to `NaN`. Two columns were cast from string to numeric or datetime types because more than 85% of their values parsed successfully. The dataset was split 80/20 into train and test partitions using a fixed random seed (42) and saved as Snappy-compressed Parquet.
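
The cleaning steps described above can be sketched in plain Python. This illustrates the described behaviour and is not ESA's actual pipeline: the function names, the regular expressions, the case-insensitive marker matching, the use of `float` as the only parser, and the floor rounding in the split are all assumptions.

```python
import math
import random
import re

MISSING_MARKERS = {"n/a", "null", "none", "-", "unknown", "no data", "#n/a"}


def to_snake_case(name: str) -> str:
    """Lowercase a column name and join its words with underscores."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def normalise_missing(value):
    """Map common missing-value markers to NaN; leave other values alone."""
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return math.nan
    return value


def _parses_as_float(value) -> bool:
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def cast_if_numeric(values: list, threshold: float = 0.85) -> list:
    """Cast a string column to float only if more than `threshold` of it parses."""
    if not values:
        return values
    ok = sum(_parses_as_float(v) for v in values)
    if ok / len(values) <= threshold:
        return values  # too many failures: keep the column as strings
    return [float(v) if _parses_as_float(v) else math.nan for v in values]


def split_80_20(rows: list, seed: int = 42) -> tuple[list, list]:
    """Shuffle a copy with a fixed seed, then cut at 80% (rounded down)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


print(to_snake_case("Language Name"), to_snake_case("proportionValue"))
print(cast_if_numeric(["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "x"]))
train_rows, test_rows = split_80_20(list(range(22)))
print(len(train_rows), len(test_rows))
```

Under these assumptions a 22-row table splits into 17 and 5 rows, so the 4 test rows listed on this card suggest that the real pipeline rounds or filters differently.
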
---
## Limitations
- Data originates from CLEAR Global (previously Translators without Borders) and has not been independently validated by Electric Sheep Africa (ESA).
- Automated cleaning cannot correct for misreported values, definitional inconsistencies, or sampling bias in the original collection.
- Refer to the [original HDX dataset page](https://data.humdata.org/dataset/guinea-languages) for the publisher's own methodology notes and caveats.
---
## Citation
```bibtex
@dataset{hdx_africa_guinea_languages,
title = {Guinea: Languages},
author = {CLEAR Global (previously Translators without Borders)},
year = {2026},
url = {https://data.humdata.org/dataset/guinea-languages},
note = {Repackaged for machine learning by Electric Sheep Africa (https://huggingface.co/electricsheepafrica)}
}
```
---
*[Electric Sheep Africa](https://huggingface.co/electricsheepafrica) — Africa's ML dataset infrastructure. Lagos, Nigeria.*



