xri/RwilaNMT

Hugging Face | Updated 2025-02-27 | Indexed 2025-02-15
Download link:
https://hf-mirror.com/datasets/xri/RwilaNMT
Resource description:
RwilaNMT is a parallel dataset composed of 8,000 sentences in Swahili and Rwila. It is intended for fine-tuning Neural Machine Translation models and Large Language Models for the Rwila language. Rwila is a low-resource Bantu language spoken in Tanzania. The dataset was created and curated using a proprietary method developed by XRI Global to ensure coverage of a conceptual space during data collection. This method was designed to be the fastest and most affordable way to collect pristine in-domain data for low-resource languages, with the data optimized for fine-tuning language models. The contributors provided proper consent, were hired by a local agency, and were fairly compensated. Data collection utilized our mobile data collection app, Echonet, and our custom translation management system. The dataset is somewhat generalized: it is most effective for literary and narrative texts and less effective for other domains, such as technical, scientific, or colloquial text. If you are interested in creating ideal datasets for new languages, contact us at contact@xriglobal.ai.
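
As a rough illustration of how a parallel corpus of this size might be prepared for fine-tuning, the sketch below splits sentence pairs into training and development sets and formats each pair as a prompt/completion record. It is only a sketch under stated assumptions: the placeholder pairs, the Swahili-to-Rwila direction, the prompt wording, the 10% development split, and the helper names `split_pairs` and `to_jsonl_record` are all hypothetical and do not come from the dataset card. The dataset's actual column names and file layout are not stated here and should be checked on the dataset page.

```python
import json
import random


def split_pairs(pairs, dev_fraction=0.1, seed=0):
    """Shuffle parallel pairs deterministically and split into (train, dev)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def to_jsonl_record(source, target):
    """Format one sentence pair as a JSON line for fine-tuning.

    The prompt template is an assumption made for illustration only.
    """
    record = {
        "prompt": f"Translate Swahili to Rwila: {source}",
        "completion": target,
    }
    return json.dumps(record, ensure_ascii=False)


# Placeholder pairs standing in for the real 8,000 sentence pairs;
# these strings are NOT actual dataset content.
pairs = [(f"<Swahili sentence {i}>", f"<Rwila sentence {i}>") for i in range(8000)]

train, dev = split_pairs(pairs)
print(len(train), len(dev))
print(to_jsonl_record(*train[0]))
```

A fixed seed keeps the split reproducible, so the development set stays held out across repeated fine-tuning runs.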
Provider:
xri