Indonesian Part-of-Speech Tagging Corpus

Name: Indonesian Part-of-Speech Tagging Corpus
Creator: figshare
Published: 2025-12-09 09:25:27
License: No description provided

Source: DataCite Commons. Updated: 2025-12-09. Indexed: 2026-02-09.

Download link:

https://figshare.com/articles/dataset/_b_Indonesian_Part-of-Speech_Tagging_Corpus_b_/30830651

Description:

This dataset is a refined version of the Indonesian POS-tagged corpus by Fu et al. (2018). The corpus was filtered to include only complete single-clause sentences, identified through rule-based syntactic analysis using UPOS tags and dependency parsing.

Annotation inconsistencies were detected using word-level entropy. Of the 7,388 word types analyzed, 617 words with H > 0 were identified as candidates and reviewed by 15 linguistic experts. Using majority voting, the experts corrected 550 inconsistent word labels, while the remaining words were confirmed as valid ambiguities. All corrected labels were reintegrated into the corpus to produce the revised version.

An evaluation comparing the original filtered corpus (V1) with the corrected version (V2) shows that V2 improves model performance across all metrics, especially on ambiguous words (≈14% accuracy and ≈15% macro F1 improvement). Average entropy also decreases from 0.07 to 0.04, indicating higher annotation consistency.

This dataset provides a cleaner, more homogeneous POS-tagged resource for Indonesian language processing and improves model reliability on ambiguous and OOV cases.
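The entropy-based screening and majority-vote review described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code. The description does not give the exact formula, so the sketch assumes Shannon entropy in bits over each word type's tag distribution, case-insensitive word matching, and a tie rule that returns no label. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_tag_entropy(tagged_tokens):
    """Map each word type to the Shannon entropy (in bits) of its tag distribution.

    Assumption: words are compared case-insensitively; the dataset description
    does not say how word types were normalized.
    """
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        tag_counts[word.lower()][tag] += 1

    entropies = {}
    for word, counts in tag_counts.items():
        total = sum(counts.values())
        entropies[word] = sum(
            -(n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropies


def majority_vote(labels):
    """Return the most frequent label, or None when the top two labels tie.

    Assumption: the tie rule is illustrative; the description only says
    that majority voting was used.
    """
    ranked = Counter(labels).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Toy, hand-made tokens (not taken from the corpus).
tokens = [
    ("bisa", "AUX"),
    ("bisa", "NOUN"),
    ("makan", "VERB"),
    ("makan", "VERB"),
]
entropy = word_tag_entropy(tokens)

# Words with H > 0 carry more than one tag and become review candidates.
candidates = sorted(word for word, h in entropy.items() if h > 0)
```

A word that always receives the same tag has H = 0 and is skipped. A word tagged inconsistently has H > 0 and would be passed to reviewers, whose labels could then be combined with `majority_vote`.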

Provider:

figshare

Created:

2025-12-09
