A Gold-Standard Dataset for Benchmarking Balinese Extractive and Abstractive Text Summarization
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/m9yfr2mszw
Description:
1. Research Hypothesis and Data Scope
The central hypothesis guiding the creation of the BaliSummarization dataset is that developing robust Balinese text summarization models (both extractive and abstractive) requires a high-quality, large-scale, and genre-diverse gold-standard resource, which does not currently exist for this low-resource language. The dataset addresses this gap by providing human-generated reference summaries, enabling direct benchmarking of sequence-to-sequence (abstractive) and sequence-labeling (extractive) summarization tasks. The data is organized into six distinct genres spanning articles, speeches (also known as Pidarta), and various forms of Balinese narrative (also known as Satua Bali), ensuring broad coverage of linguistic variation across textual domains.
2. Data Overview and Interpretation
At the top level, the dataset covers three broad categories: articles or news, formal speeches, and traditional folklore. Each category provides document-summary pairs. Document length varies widely, from an average of 12.6 sentences/208.7 words (articlebats) to 73.8 sentences/1108.9 words (satuabaliweb), so summarization models must handle both very short and very long inputs. The quality of the human-generated abstractive reference summaries was validated with ROUGE-1, ROUGE-2, ROUGE-L, BLEU, and Cosine Similarity (FastText-based embeddings). The extractive summaries were validated with two agreement metrics, i.e., Fleiss' Kappa and Krippendorff's Alpha. Both indicated almost perfect agreement in all categories.
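To make the extractive agreement measure concrete, the following is a minimal sketch of Fleiss' Kappa for sentence-level selection labels. It assumes each sentence is labeled 1 (selected for the summary) or 0 (not selected) by the same number of annotators; the function name, the label encoding, and the toy ratings are illustrative assumptions and are not taken from the dataset or its annotation pipeline. Under the commonly used Landis and Koch reading, values above 0.80 are described as almost perfect agreement.

```python
from collections import Counter


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' Kappa for a list of items, each rated by the same number of raters.

    ratings: list of per-item label lists, e.g. [[1, 1, 0], [0, 0, 0]].
    categories: the label set (assumed here: 0 = not selected, 1 = selected).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")

    # counts[i][j] = number of raters who assigned item i to category j
    counts = []
    for item in ratings:
        if len(item) != n_raters:
            raise ValueError("every item needs the same number of ratings")
        tally = Counter(item)
        counts.append([tally.get(cat, 0) for cat in categories])

    # Observed agreement: mean over items of the per-item pairwise agreement.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Every rating falls in one category; kappa is undefined, report 1.0.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (hypothetical labels): three annotators, four sentences.
toy_ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
kappa = fleiss_kappa(toy_ratings)
```

Krippendorff's Alpha follows a similar observed-versus-expected logic but is computed from disagreement and tolerates missing ratings, which this sketch does not handle.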
3. Notable Findings for Abstractive Inter-Annotator Agreement (IAA) Scores
All categories achieved extremely high Cosine Similarity scores (above 0.91 across all annotation phases), well above the 0.5 threshold. This indicates that annotators maintained near-perfect consensus on the core meaning, topic, and semantic content of the summaries despite using different phrasing. The primary challenge observed was low lexical overlap in bigrams (ROUGE-2), particularly for the articlebats, articlemlft, and pidarta categories, which failed to meet the 0.2 threshold in the independent phase. This suggests high linguistic variability in sentence construction when abstracting short or formal texts. The long narrative texts (folklore) and the articlesuara category passed all five IAA thresholds in the independent annotation phase, confirming their high reliability and making them the most robust reference data in the collection.
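The bigram-overlap finding can be illustrated with a minimal ROUGE-2 sketch. It assumes whitespace tokenization, lowercasing, and the F1 variant of ROUGE-2; the description does not state which variant, tokenizer, or annotator pairing was actually used, so the function names and the sample strings below are hypothetical. Only the 0.2 threshold comes from the text above.

```python
from collections import Counter

ROUGE2_THRESHOLD = 0.2  # threshold quoted in the dataset description


def bigram_counts(text):
    """Count bigrams over lowercased, whitespace-separated tokens."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(summary_a, summary_b):
    """ROUGE-2 F1 between two summaries using clipped bigram overlap."""
    a, b = bigram_counts(summary_a), bigram_counts(summary_b)
    overlap = sum((a & b).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)


def passes_rouge2(summary_a, summary_b, threshold=ROUGE2_THRESHOLD):
    """True when the pair meets the bigram-overlap threshold."""
    return rouge2_f1(summary_a, summary_b) >= threshold


# Placeholder token strings standing in for two annotators' summaries.
score = rouge2_f1("a b c d", "a b c e")
```

Two summaries can share most of their vocabulary and still score low here if the words appear in a different order, which is consistent with high semantic similarity alongside low ROUGE-2.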
Created:
2025-10-07



