Low-Resource Corpus Indonesian Local Language

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/tjstx5rg6k

Description:

This study starts from the hypothesis that combining Neural Machine Translation (NMT) with the stemming algorithms ECS, Porter, and Porter Hybrid (ECS) can improve root-word extraction accuracy for Indonesian regional languages. In particular, the hybrid model, which integrates ECS's morphological sensitivity with Porter's efficiency, is expected to perform best at preserving semantic equivalence across word forms.

The dataset comprises nine text files, one for each combination of three regional languages (Javanese, Minangkabau, and Sundanese) and three stemming approaches (ECS, Porter, and Porter Hybrid (ECS)). Each file contains pairs of source sentences and outputs produced through the NMT → Stemming → NMT pipeline: regional-language text is translated into Indonesian, stemmed, and then projected back into the original regional language. The parallel corpora were aligned at the lexical level and curated to balance token counts, structural variation, and morphological diversity.

The experimental results show three main findings. ECS excels for languages with complex affixation because it captures local morphological patterns. Porter is computationally lighter but tends to under-stem the prefixes and infixes typical of regional languages. Porter Hybrid (ECS) consistently performs best across all three languages, yielding an average accuracy gain of approximately 4–7% over the single models.

Evaluation uses precision–recall (correctness and coverage of recovered lemmas), cosine similarity (semantic proximity to reference forms), and BLEU score (n-gram similarity to the reference). Taken together, the data indicate that a hybrid approach blending linguistic insight (ECS) with a general algorithm (Porter) within an NMT-based pipeline produces a more universal and adaptive stemming system for Indonesian regional languages.

These findings are relevant for developing cross-regional NLP applications (machine translation, sentiment analysis, lexical normalization), training models for low-resource languages, and improving text preprocessing in translation systems so that morphology is handled more accurately.
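
To make the evaluation metrics named above more concrete, here is a minimal Python sketch of token-level precision and recall and a bag-of-words cosine similarity between a stemmed output and a reference. The description does not say how the authors computed these scores, so the function names, the whitespace-tokenised inputs, and the count-vector representation are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def precision_recall(predicted, reference):
    """Token-level precision and recall using multiset overlap.

    Assumption: both inputs are lists of tokens (e.g. recovered lemmas).
    """
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())  # tokens matched in both lists
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def cosine_similarity(a, b):
    """Cosine similarity between two token lists as count vectors.

    Assumption: a simple bag-of-words stand-in for whatever vector
    representation the original study used.
    """
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical example tokens, not taken from the dataset.
predicted = ["makan", "nasi", "goreng"]
reference = ["makan", "nasi", "putih"]
p, r = precision_recall(predicted, reference)
sim = cosine_similarity(predicted, reference)
```

In this toy case two of the three tokens match, so precision, recall, and cosine similarity all come out at roughly two thirds. BLEU is omitted here because it is usually taken from an existing library rather than reimplemented.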

Created:

2025-10-27
