SEACrowd/parallel_id_nyo

Name: SEACrowd/parallel_id_nyo
Creator: SEACrowd
Published: 2024-06-24 13:33:08
License: No description available

Source: Hugging Face. Updated 2024-06-24. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/SEACrowd/parallel_id_nyo


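Overview:

This dataset contains Indonesian and Lampung language pairs, intended mainly for machine translation. The original data has 3000 lines, but because of alignment issues the final dataset contains only 1727 pairs.

Provided by: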

SEACrowd

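Summary of original information

Dataset overview

Languages

Indonesian (ind)
Lampung (abl)

Task category

Machine translation

Dataset description

A dataset of Indonesian and Lampung language pairs.
The original data has 3000 lines, but because of alignment issues the final dataset contains only 1727 aligned pairs.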
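
As a quick check on the figures in the description, the short Python sketch below computes how many lines were lost to alignment issues and what share of the original data remains. The two constants come from the description; the variable names are only illustrative.

```python
# Figures taken from the dataset description.
ORIGINAL_LINES = 3000   # lines in the original data
ALIGNED_PAIRS = 1727    # aligned pairs kept in the final dataset

# Lines dropped because of alignment issues.
dropped = ORIGINAL_LINES - ALIGNED_PAIRS

# Share of the original lines that survived as aligned pairs.
retained_share = ALIGNED_PAIRS / ORIGINAL_LINES

print(f"dropped lines: {dropped}")
print(f"retained share: {retained_share:.1%}")
```

Roughly 58% of the original lines survive, which is worth keeping in mind when comparing results against work that reports using the full 3000-pair corpus.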

Dataset version

Source version: 1.0.0
SEACrowd version: 2024.06.20

Dataset license

Unknown

Citation

If you use this dataset, please cite the following works:

@article{Abidin_2021,
  doi = {10.1088/1742-6596/1751/1/012036},
  url = {https://dx.doi.org/10.1088/1742-6596/1751/1/012036},
  year = {2021},
  month = {jan},
  publisher = {IOP Publishing},
  volume = {1751},
  number = {1},
  pages = {012036},
  author = {Z Abidin and Permata and I Ahmad and Rusliyawati},
  title = {Effect of mono corpus quantity on statistical machine translation Indonesian - Lampung dialect of nyo},
  journal = {Journal of Physics: Conference Series},
  abstract = {Lampung Province is located on the island of Sumatera. For the immigrants in Lampung, they have difficulty in communicating with the indigenous people of Lampung. As an alternative, both immigrants and the indigenous people of Lampung speak Indonesian. This research aims to build a language model from Indonesian language and a translation model from the Lampung language dialect of nyo, both models will be combined in a Moses decoder. This research focuses on observing the effect of adding mono corpus to the experimental statistical machine translation of Indonesian - Lampung dialect of nyo. This research uses 3000 pair parallel corpus in Indonesia language and Lampung language dialect of nyo as source language and uses 3000 mono corpus sentences in Lampung language dialect of nyo as target language. The results showed that the accuracy value in bilingual evalution under-study score when using 1000 sentences, 2000 sentences, 3000 sentences mono corpus show the accuracy value of the bilingual evaluation under-study, respectively, namely 40.97 %, 41.80 % and 45.26 %.}
}

@article{lovenia2024seacrowd,
  title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
  author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
  year={2024},
  eprint={2406.10118},
  journal={arXiv preprint arXiv: 2406.10118}
}
