
eliezermga/ruwund-french

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/eliezermga/ruwund-french
Description:
---
license: cc-by-sa-4.0
size_categories:
- 1K<n<10K
---

# Ruwund-French Parallel Dataset

## Overview

This dataset is a parallel corpus of Ruwund (Luwund) sentences aligned with their French translations. It is intended for research and development in natural language processing (NLP), especially for low-resource languages.

Ruwund is a Bantu language spoken mainly in the Democratic Republic of the Congo and Angola. This dataset aims to contribute to the development of language technologies for under-resourced African languages.

---

## Objectives

- Provide a clean bilingual corpus (Ruwund <-> French)
- Support machine translation systems
- Contribute to linguistic preservation
- Enable research on low-resource NLP

---

## Dataset Structure

The dataset is stored in TSV (tab-separated values) format. Each line contains:

- A sentence in Ruwund
- Its corresponding translation in French

### Format

```text
ruwund_sentence<TAB>french_sentence
```

### Example

```text
Mu musumb winou mukez kumekanap kand chimunyik cha mwend wa kasu. Mukez kwovakanap kand mazu ma angatan ap ma angachik. In kwisak ey ading antu ajim a pa mangand. Wayipumbula antu a michid yawonsu nich ulaj wey!	La lumière de la lampe ne brillera plus jamais chez toi; on n'y entendra plus la voix des jeunes mariés. Tes marchands étaient les plus importants du monde, et par tes pratiques de magie tu as égaré tous les peuples.»
Auleja musumb wa Babilon kadimu mulong atanamu mash mau aruu a Nzamb, ni mash mau in kwitiyij, ni mash mau antu awonsu a pa mangand ajipau kudi antu akwau.	C'est à Babylone qu'a coulé le sang des prophètes et du peuple de Dieu, le sang de tous ceux qui ont été massacrés sur la terre.
```

---

## Data Sources

The dataset is constructed from:

- Religious texts (for example, Bible excerpts)
- Written documents
- Manually aligned translations

---

## Usage

### Load with Hugging Face Datasets

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and print the first training example
dataset = load_dataset("eliezermga/ruwund-french")
print(dataset["train"][0])
```

### Load manually (TSV)

```python
# "data.tsv" is the path to a local copy of the TSV file
with open("data.tsv", "r", encoding="utf-8") as f:
    for line in f:
        # Strip only the line ending so the tab separator is preserved
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        ruwund, french = fields
        print(ruwund, french)
```

---

## Hugging Face

Dataset page: https://huggingface.co/datasets/eliezermga/ruwund-french

---

## Use Cases

- Machine translation (Ruwund -> French, French -> Ruwund)
- Fine-tuning multilingual models (mBART, M2M100, etc.)
- Linguistic analysis of Bantu languages
- Low-resource NLP benchmarks

---

## Dataset Size

- Number of sentence pairs: to be specified (the metadata size category is 1K<n<10K)
- Format: TSV
- Languages: Ruwund, French

---

## Limitations

- Limited dataset size
- Possible alignment or translation inconsistencies
- Domain bias (mainly religious texts)

---

## Future Work

- Increase dataset size
- Add validation and test splits
- Improve data quality and alignment
- Integrate speech data (audio + transcription)

---

## Contribution

Contributions are welcome:

- Add new sentence pairs
- Correct translations
- Improve alignment

---

## License

CC BY-SA 4.0 (`cc-by-sa-4.0`)

---

## Author

Eliezer Mununga
Student in Artificial Intelligence
Project: LugaYetu (https://github.com/Eliezermga/Lugayetu)
Email: eliezermunung@outlook.fr

---

## Citation

```bibtex
@dataset{ruwund_french_dataset,
  author    = {Mununga, Eliezer},
  title     = {Ruwund-French Parallel Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/eliezermga/ruwund-french}
}
```
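Because the Limitations section mentions possible alignment inconsistencies, it can help to check the two-column TSV format described under Dataset Structure before training on the data. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `parse_pairs` and the sample rows are hypothetical, and the sample text is made-up placeholder content rather than real dataset rows.

```python
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> Tuple[List[Tuple[str, str]], List[int]]:
    """Split TSV lines into (ruwund, french) pairs.

    Returns the parsed pairs and the 1-based numbers of the lines that were
    rejected (blank lines, wrong column count, or an empty side).
    """
    pairs: List[Tuple[str, str]] = []
    rejected: List[int] = []
    for number, line in enumerate(lines, start=1):
        # Remove only the line ending so tab separators stay intact.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2:
            rejected.append(number)
            continue
        ruwund, french = (field.strip() for field in fields)
        if not ruwund or not french:
            rejected.append(number)
            continue
        pairs.append((ruwund, french))
    return pairs, rejected


# Hypothetical in-memory sample (placeholder text, not real dataset rows).
sample = [
    "ruwund sentence one\tphrase française un\n",
    "\n",                     # blank line
    "missing translation\n",  # only one column
    "ruwund sentence two\tphrase française deux\n",
]
pairs, rejected = parse_pairs(sample)
print(len(pairs), "pairs kept; rejected lines:", rejected)
```

To run the same check on a local copy of the file, pass an open file handle instead of the sample list, for example `parse_pairs(open("data.tsv", encoding="utf-8"))`, where `data.tsv` is the placeholder path used in the Usage section. Collecting rejected line numbers rather than silently skipping them makes it easier to report or correct misaligned rows.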
Provider:
eliezermga