eliezermga/ruwund-french
Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/eliezermga/ruwund-french
Resource description:
---
license: cc-by-sa-4.0
size_categories:
- 1K<n<10K
---
# Ruwund-French Parallel Dataset
## Overview
This dataset is a parallel corpus of the Ruwund (Luwund) language aligned with French translations. It is intended for research and development in natural language processing (NLP), especially for low-resource languages.
Ruwund is a Bantu language spoken mainly in the Democratic Republic of the Congo and Angola. This dataset aims to contribute to the development of language technologies for under-resourced African languages.
---
## Objectives
- Provide a clean bilingual corpus (Ruwund <-> French)
- Support machine translation systems
- Contribute to linguistic preservation
- Enable research on low-resource NLP
---
## Dataset Structure
The dataset is stored in TSV format (Tab-Separated Values).
Each line contains:
- A sentence in Ruwund
- Its corresponding translation in French
### Format
```text
ruwund_sentence<TAB>french_sentence
```
### Example
```text
Mu musumb winou mukez kumekanap kand chimunyik cha mwend wa kasu. Mukez kwovakanap kand mazu ma angatan ap ma angachik. In kwisak ey ading antu ajim a pa mangand. Wayipumbula antu a michid yawonsu nich ulaj wey!	La lumière de la lampe ne brillera plus jamais chez toi; on n'y entendra plus la voix des jeunes mariés. Tes marchands étaient les plus importants du monde, et par tes pratiques de magie tu as égaré tous les peuples.»
Auleja musumb wa Babilon kadimu mulong atanamu mash mau aruu a Nzamb, ni mash mau in kwitiyij, ni mash mau antu awonsu a pa mangand ajipau kudi antu akwau.	C'est à Babylone qu'a coulé le sang des prophètes et du peuple de Dieu, le sang de tous ceux qui ont été massacrés sur la terre.
```
---
## Data Sources
The dataset is constructed from:
- Religious texts (for example, Bible excerpts)
- Written documents
- Manually aligned translations
---
## Usage
### Load with Hugging Face Datasets
```python
from datasets import load_dataset
dataset = load_dataset("eliezermga/ruwund-french")
print(dataset["train"][0])
```
### Load manually (TSV)
```python
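with open("data.tsv", "r", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only: Ruwund sentence, then French translation.
        ruwund, french = line.split("\t", 1)
        print(ruwund, french)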
```
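A slightly more defensive reader can help when some rows are blank or contain an unexpected number of tabs. The sketch below is one possible approach, not part of the dataset's tooling: the function name `read_pairs` is hypothetical, and the sample rows are placeholders rather than real corpus data. It uses the standard `csv` module with quoting disabled so that quotation marks inside sentences are kept as literal characters.

```python
import csv
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (ruwund, french) pairs, skipping rows without exactly two non-empty fields."""
    # QUOTE_NONE treats quotation marks as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # blank line or unexpected number of tabs
        ruwund, french = row[0].strip(), row[1].strip()
        if ruwund and french:
            yield ruwund, french


# Placeholder rows standing in for lines of the TSV file.
sample_lines = [
    "ruwund sentence 1\tphrase française 1\n",
    "\n",
    "row without a tab\n",
    'ruwund "quoted" 2\tphrase française 2\n',
]
for ruwund, french in read_pairs(sample_lines):
    print(ruwund, "|", french)
```

To read the actual file, pass an open file object instead of the list, for example `read_pairs(open("data.tsv", encoding="utf-8", newline=""))`.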
---
## Hugging Face
Dataset page:
https://huggingface.co/datasets/eliezermga/ruwund-french
---
## Use Cases
- Machine Translation (Ruwund -> French, French -> Ruwund)
- Fine-tuning multilingual models (mBART, M2M100, etc.)
- Linguistic analysis of Bantu languages
- Low-resource NLP benchmarks
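
For fine-tuning, sentence pairs are often converted to JSON Lines with one record per pair. The sketch below shows one way to do that. The nested `translation` layout and the key names `ruwund` and `french` are assumptions chosen for illustration, and the function name `to_jsonl_records` is hypothetical; check what your training script expects before relying on this layout.

```python
import json
from typing import Iterable, Iterator, Tuple


def to_jsonl_records(
    pairs: Iterable[Tuple[str, str]],
    src_key: str = "ruwund",   # assumed key name, not defined by the dataset
    tgt_key: str = "french",   # assumed key name, not defined by the dataset
) -> Iterator[str]:
    """Serialize each (ruwund, french) pair as one JSON line."""
    for ruwund, french in pairs:
        record = {"translation": {src_key: ruwund, tgt_key: french}}
        # ensure_ascii=False keeps accented characters readable in the output.
        yield json.dumps(record, ensure_ascii=False)


# Placeholder pairs, not real corpus data.
sample_pairs = [("ruwund sentence 1", "phrase française 1")]
for json_line in to_jsonl_records(sample_pairs):
    print(json_line)
```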
---
## Dataset Size
- Number of sentence pairs: to be specified (the card metadata declares the size category `1K<n<10K`)
- Format: TSV
- Languages: Ruwund, French
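
Until the exact figure is published, the pair count can be computed locally. The following is a minimal sketch under stated assumptions: `corpus_stats` and `in_declared_size_category` are hypothetical helper names, tokens are approximated by whitespace splitting, and the bounds of the declared `1K<n<10K` category are treated as exclusive.

```python
from typing import Dict, Iterable, Tuple


def corpus_stats(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count sentence pairs and whitespace-separated tokens on each side."""
    n_pairs = ruwund_tokens = french_tokens = 0
    for ruwund, french in pairs:
        n_pairs += 1
        ruwund_tokens += len(ruwund.split())
        french_tokens += len(french.split())
    return {
        "pairs": n_pairs,
        "ruwund_tokens": ruwund_tokens,
        "french_tokens": french_tokens,
    }


def in_declared_size_category(n_pairs: int) -> bool:
    """Check the count against the declared bucket 1K<n<10K (exclusive bounds assumed)."""
    return 1_000 < n_pairs < 10_000


# Tiny placeholder sample, not real corpus data.
sample_pairs = [("aa bb cc", "dd ee"), ("ff", "gg hh ii jj")]
stats = corpus_stats(sample_pairs)
print(stats)
print(in_declared_size_category(stats["pairs"]))
```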
---
## Limitations
- Limited dataset size
- Possible alignment or translation inconsistencies
- Domain bias (mainly religious texts)
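
Some alignment problems can be surfaced with a simple heuristic before manual review. The sketch below flags pairs where one side is empty, where both sides are identical, or where the character-length ratio between the two sides exceeds a threshold. The function name `flag_suspicious_pairs` and the default threshold of 3.0 are assumptions for illustration, not values taken from the dataset, and a flagged pair is only a candidate for inspection.

```python
from typing import Iterable, List, Tuple


def flag_suspicious_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 3.0,  # assumed threshold; tune it on real data
) -> List[Tuple[int, str]]:
    """Return (index, reason) for pairs that look misaligned."""
    flagged = []
    for index, (ruwund, french) in enumerate(pairs):
        ruwund, french = ruwund.strip(), french.strip()
        if not ruwund or not french:
            flagged.append((index, "empty side"))
            continue
        # Ratio of the longer side to the shorter side, in characters.
        ratio = max(len(ruwund), len(french)) / min(len(ruwund), len(french))
        if ratio > max_ratio:
            flagged.append((index, f"length ratio {ratio:.1f}"))
        elif ruwund == french:
            flagged.append((index, "identical sides"))
    return flagged


# Placeholder pairs, not real corpus data.
sample_pairs = [
    ("abcd efgh", "abcd efghij"),
    ("ab", "a much longer sentence on this side"),
    ("", "x"),
    ("same", "same"),
]
for index, reason in flag_suspicious_pairs(sample_pairs):
    print(index, reason)
```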
---
## Future Work
- Increase dataset size
- Add validation and test splits
- Improve data quality and alignment
- Integrate speech data (audio + transcription)
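
Until official validation and test splits exist, users can derive reproducible ones locally. The sketch below is one possible approach under stated assumptions: the function name `split_pairs`, the 10% fractions, and the seed are illustrative choices. Exact duplicate pairs are removed first so that the same pair cannot appear in more than one split.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairs(
    pairs: Sequence[Pair],
    valid_fraction: float = 0.1,  # assumed proportion
    test_fraction: float = 0.1,   # assumed proportion
    seed: int = 42,
) -> Dict[str, List[Pair]]:
    """Deduplicate, shuffle with a fixed seed, and cut into three splits."""
    unique = list(dict.fromkeys(pairs))  # drops exact duplicates, keeps order
    random.Random(seed).shuffle(unique)
    n_valid = int(len(unique) * valid_fraction)
    n_test = int(len(unique) * test_fraction)
    return {
        "validation": unique[:n_valid],
        "test": unique[n_valid:n_valid + n_test],
        "train": unique[n_valid + n_test:],
    }


# Placeholder pairs, not real corpus data.
sample_pairs = [(f"ruwund {i}", f"french {i}") for i in range(20)]
splits = split_pairs(sample_pairs)
print({name: len(rows) for name, rows in splits.items()})
```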
---
## Contribution
Contributions are welcome:
- Add new sentence pairs
- Correct translations
- Improve alignment
---
## License
This dataset is released under the CC BY-SA 4.0 license (`cc-by-sa-4.0`).
---
## Author
Eliezer Mununga
Student in Artificial Intelligence
Project: LugaYetu
https://github.com/Eliezermga/Lugayetu
email: eliezermunung@outlook.fr
---
## Citation
```bibtex
@dataset{ruwund_french_dataset,
author = {Mununga, Eliezer},
title = {Ruwund-French Parallel Dataset},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/eliezermga/ruwund-french}
}
```
Provider:
eliezermga



