
INMAD (Indonesia - Madurese) Sentences Dataset

Source: Mendeley Data, indexed 2026-04-09
Download link:
https://data.mendeley.com/datasets/dt7d8dmfgs/1
Description:
The INMAD (Indonesia-Madura) Sentence Dataset is a collection of parallel Indonesian and Madurese sentences intended to support natural language processing (NLP) for low-resource languages. It combines three sources from the indonlp website: Korpus Nusantara (1,100 sentences), NusaX MT (994 sentences), and Nusa Paragraf (9,449 sentences). Together these yield 11,543 parallel sentences, consolidated into a single CSV file named Dataset Raw.

Expert translators manually translated these 11,543 sentences into Madurese at the 'engghi-enten' level to ensure translation quality. To increase the variety and quantity of sentences, the Indonesian side was then augmented by back-translation with the MarianMT model in Python: each Indonesian sentence was translated into English and then back into Indonesian. The 11,543 augmented Indonesian sentences reuse the 'engghi-enten' Madurese translations of their originals, which doubles the dataset to 23,086 Indonesian-Madurese sentence pairs.

The dataset is distributed as a UTF-8 encoded CSV file with 23,086 lines and two columns labelled 'Indonesia' and 'Madura'. It is intended as a high-quality parallel dataset for linguistic and language technology applications, including machine translation training, language preservation, and digital dictionary development.
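The augmentation layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: `augment_pairs`, `write_csv`, and `toy_back_translate` are hypothetical names, the Madurese strings are placeholders rather than real translations, and the back-translation step is a pluggable function standing in for the Indonesian to English to Indonesian round trip that the description attributes to MarianMT.

```python
import csv
import io
from typing import Callable, Iterable, List, Tuple

# (Indonesian sentence, Madurese sentence)
Pair = Tuple[str, str]

# Source sizes as stated in the dataset description.
SOURCE_COUNTS = {"Korpus Nusantara": 1100, "NusaX MT": 994, "Nusa Paragraf": 9449}
RAW_TOTAL = sum(SOURCE_COUNTS.values())  # 11,543 sentence pairs
AUGMENTED_TOTAL = 2 * RAW_TOTAL          # 23,086 sentence pairs


def augment_pairs(pairs: Iterable[Pair],
                  back_translate: Callable[[str], str]) -> List[Pair]:
    """Return the original pairs followed by their back-translated copies.

    Each augmented row pairs a paraphrased Indonesian sentence with the
    same Madurese translation as its original, so the output is twice
    as long as the input.
    """
    originals = list(pairs)
    augmented = [(back_translate(indo), madura) for indo, madura in originals]
    return originals + augmented


def write_csv(rows: Iterable[Pair], handle) -> None:
    """Write rows under the column labels named in the description."""
    writer = csv.writer(handle)
    writer.writerow(["Indonesia", "Madura"])
    writer.writerows(rows)


def toy_back_translate(sentence: str) -> str:
    """Hypothetical stand-in for the real round trip; it only lowercases."""
    return sentence.lower()


# Placeholder data: the Madurese side is deliberately not real text.
sample = [
    ("Saya makan nasi.", "MADURESE_PLACEHOLDER_1"),
    ("Dia pergi ke pasar.", "MADURESE_PLACEHOLDER_2"),
]

rows = augment_pairs(sample, toy_back_translate)

buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

To write an actual file rather than an in-memory buffer, open it with `open(path, "w", encoding="utf-8", newline="")` so the output matches the UTF-8 encoding stated in the description. Swapping `toy_back_translate` for a real translation model is left out here because it depends on model checkpoints that the description does not name.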