Indonesian Part-of-Speech Tagging Corpus
Source: DataCite Commons. Updated: 2025-12-09. Indexed: 2026-02-09.
Download link:
https://figshare.com/articles/dataset/_b_Indonesian_Part-of-Speech_Tagging_Corpus_b_/30830651
Description:
This dataset is a refined version of the Indonesian POS-tagged corpus by Fu et al. (2018). The corpus was filtered to include only complete single-clause sentences, identified through rule-based syntactic analysis using UPOS tags and dependency parsing.

Annotation inconsistencies were detected using word-level entropy. Of the 7,388 word types analyzed, 617 had nonzero entropy (H > 0), meaning they appeared with more than one tag, and were flagged as candidates for review by 15 linguistic experts. By majority vote, the experts corrected the labels of 550 inconsistently tagged words and confirmed the remaining words as valid ambiguities. All corrected labels were reintegrated into the corpus to produce the revised version.

An evaluation comparing the original filtered corpus (V1) with the corrected version (V2) shows that V2 improves model performance across all metrics, especially on ambiguous words (≈14% improvement in accuracy and ≈15% in macro F1). Average entropy also decreases from 0.07 to 0.04, indicating higher annotation consistency.

This dataset provides a cleaner, more homogeneous POS-tagged resource for Indonesian language processing, with improved model reliability on ambiguous and out-of-vocabulary (OOV) cases.
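The entropy-based candidate detection described above can be illustrated with a minimal sketch. It assumes that word-level entropy is the Shannon entropy of each word type's tag distribution; the description does not state the logarithm base or any case normalization, so base 2 and case-sensitive matching are assumptions here. The function name `word_tag_entropy` and the toy tokens are hypothetical and are not taken from the corpus.

```python
import math
from collections import Counter, defaultdict


def word_tag_entropy(tagged_tokens):
    """Return {word: H}, the Shannon entropy of each word's tag distribution.

    Assumption: base-2 logarithm and case-sensitive word types; the dataset
    description does not specify either choice.
    """
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        tag_counts[word][tag] += 1

    entropy = {}
    for word, counts in tag_counts.items():
        total = sum(counts.values())
        # H = sum over tags of p * log2(1 / p); zero when a word has one tag.
        entropy[word] = sum(
            (c / total) * math.log2(total / c) for c in counts.values()
        )
    return entropy


# Hypothetical toy data, not drawn from the corpus.
toy_tokens = [
    ("bisa", "AUX"),
    ("bisa", "NOUN"),
    ("makan", "VERB"),
    ("makan", "VERB"),
]

entropy = word_tag_entropy(toy_tokens)

# Words with H > 0 carry more than one tag and become review candidates.
candidates = sorted(word for word, h in entropy.items() if h > 0)
```

In this toy input, `bisa` is tagged two different ways and is flagged, while `makan` always receives the same tag and is not. A flagged word is not necessarily an annotation error: as the description notes, some candidates were confirmed by the experts as valid ambiguities.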
Provider:
figshare
Created:
2025-12-09



