A Structured Dataset Containing 6,640 Indonesian Pantun Stanzas with Structural Annotations for Natural Language Generation

Indexed from Mendeley Data on 2026-07-04
Download link:
https://data.mendeley.com/datasets/xcnbn5rpzn
Resource description:
Computational Context and Summary

This dataset offers a curated, structurally validated, and annotated corpus of 6,640 Indonesian pantun stanzas for text generation and computational linguistics research. The primary research objective is to quantify how well a Large Language Model (LLM) or other text generation framework adheres to strict, multi-layered formal constraints (per-line meter, phonetic end rhyme, and macro-structural dualism) while maintaining cultural and semantic coherence in a low-resource regional language.

Key Features and What the Data Shows

The dataset is provided as a single, fully structured file, pantun_dataset.csv, containing 6,640 unique four-line stanzas described by 17 operational variables with no missing values. To make the corpus directly usable for statistical and machine learning work, each stanza is flattened and split into two structural components: 'line_sampiran' (lines 1-2) and 'line_content' (lines 3-4). Three feature groups are recorded for each line: the line-level word count ('number_of_words_line_1..4'), the syllable count, restricted to between 8 and 12 syllables ('suku_kata_line_1..4'), and a string representation of the extracted phonetic end rhyme ('rima_akhir_line_1..4').

The final corpus consists of 4,945 stanzas (74.47%) with a cross rhyme pattern (a-b-a-b) and 1,695 stanzas (25.53%) with a continuous rhyme pattern (a-a-a-a).

Version 3 updates version 2 in two ways. A more optimized filtering algorithm increased the dataset size to 6,640 stanzas. In addition, the Lexical Validity Ratio filter criterion (based on Indonesian root words) and the rima_akhir_line_1..4 variables were added, bringing the total to 17 operational variables.
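The per-line variables described above lend themselves to a simple consistency check. The following is a minimal sketch, not the authors' validation code: it assumes that the `1..4` notation expands to column names such as `rima_akhir_line_1` through `rima_akhir_line_4`, and the sample row values are invented for illustration only.

```python
def rhyme_scheme(end_rhymes):
    """Label end-rhyme strings with letters in order of first appearance."""
    labels = {}
    scheme = []
    for rhyme in end_rhymes:
        key = rhyme.strip().lower()
        if key not in labels:
            # Up to four distinct rhymes are possible in a four-line stanza.
            labels[key] = "abcd"[len(labels)]
        scheme.append(labels[key])
    return "".join(scheme)


def check_row(row):
    """Derive the rhyme scheme and test the 8-12 syllable constraint.

    Column names are an assumption based on the dataset description.
    """
    rhymes = [row[f"rima_akhir_line_{i}"] for i in range(1, 5)]
    syllables = [int(row[f"suku_kata_line_{i}"]) for i in range(1, 5)]
    return {
        "scheme": rhyme_scheme(rhymes),
        "syllables_ok": all(8 <= n <= 12 for n in syllables),
    }


# Hypothetical row, shaped like one record read with csv.DictReader.
sample_row = {
    "rima_akhir_line_1": "ang", "rima_akhir_line_2": "i",
    "rima_akhir_line_3": "ang", "rima_akhir_line_4": "i",
    "suku_kata_line_1": "9", "suku_kata_line_2": "10",
    "suku_kata_line_3": "9", "suku_kata_line_4": "10",
}
result = check_row(sample_row)
```

If the column names match, passing each record of pantun_dataset.csv through `check_row` and tallying the `scheme` values should give counts comparable to the reported split of 4,945 cross-rhyme and 1,695 continuous-rhyme stanzas.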
Collection Methodology and Procedure

This dataset was produced by a two-stage programmatic quality control and filtering workflow:
- Data acquisition and multilevel deduplication: A raw corpus of 11,795 multi-line text blocks was manually collected from primary digital databases, historical archives, and cultural portals. Deduplication combined token-based exact matching with sequence-aware fuzzy similarity alignment (Jaccard similarity and SequenceMatcher, with tuned thresholds). This removed duplicate records while preserving legitimate dialect variations from oral literature, leaving 8,765 unique stanzas.
- Algorithmic structure filtering and error isolation: The remaining stanzas were evaluated against traditional prosodic requirements by a deterministic, rule-based Python framework. A total of 2,125 stanzas were discarded because of a low lexical validity ratio, an incorrect rhyme scheme, or a syllable count outside the 8-12 range. This screening yielded a gold-standard corpus of 6,640 structurally correct pantun stanzas.
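The deduplication stage described above can be sketched as follows. This is an illustrative approximation rather than the authors' pipeline: the description does not publish the threshold values, so `jaccard_min` and `ratio_min` are hypothetical, and the way the two similarity measures are combined (both must pass) is also an assumption. The sample strings are invented.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase whitespace tokenization."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity of the two token sets."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(a, b, jaccard_min=0.8, ratio_min=0.9):
    """Exact token match, or both fuzzy scores above hypothetical thresholds."""
    if tokens(a) == tokens(b):
        return True
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return jaccard(a, b) >= jaccard_min and ratio >= ratio_min


def deduplicate(stanzas, **thresholds):
    """Keep the first occurrence of each stanza; quadratic in corpus size."""
    kept = []
    for stanza in stanzas:
        if not any(is_near_duplicate(stanza, k, **thresholds) for k in kept):
            kept.append(stanza)
    return kept


# Hypothetical inputs: the second differs from the first only in case and spacing.
raw = [
    "jalan jalan ke kota tua",
    "Jalan  jalan ke kota tua",
    "beli buah di pasar pagi",
]
unique = deduplicate(raw)
```

Requiring both scores to pass is a conservative choice that favors keeping dialect variants over merging them. The pairwise comparison is adequate for a corpus of about 12,000 blocks but would need blocking or indexing at larger scale. The reported counts imply that 11,795 - 8,765 = 3,030 blocks were removed as duplicates and 8,765 - 2,125 = 6,640 stanzas survived structural filtering.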
Created:
2026-06-11