
himath-nimpura/sl-parliamentary-hansard-17-26

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/himath-nimpura/sl-parliamentary-hansard-17-26
Resource description:
---
license: apache-2.0
language:
- si
- en
- ta
tags:
- parliamentary
- hansard
- sri-lanka
- multilingual
- politics
- government
- hdbscan
- topic-modeling
- sinhala
- tamil
pretty_name: Sri Lanka Parliamentary Hansard (2017–2026)
size_categories:
- 10K<n<100K
---

# Sri Lanka Parliamentary Hansard (2017–2026)

A trilingual corpus of **19,553 parliamentary speeches** from the Sri Lankan Parliament, spanning 2017 to 2026. Speeches are in Sinhala (සිංහල), Tamil (தமிழ்), and English, including code-mixed utterances. Each speech is associated with a speaker name, date, and unsupervised topic cluster label produced by the BiTopic hybrid topic modeling pipeline.

This dataset was constructed and used in the thesis: **"Multilingual Topic Modeling of Sri Lankan Parliamentary Discourse"** (Himath Nimpura, 2026).

---

## Dataset Summary

| Field | Value |
|---|---|
| Total speeches | 19,553 |
| Years covered | 2017–2026 |
| Languages | Sinhala (~76%), English (~7%), Tamil (~7%), code-mixed (~10%) |
| Median speech length | ~300 words |
| 95th percentile length | ~1,200 words |
| Topic clusters (micro) | 331–336 (HDBSCAN, noise label = -1) |
| Macro-topic groups | 15 (empirical dendrogram cut) |

---

## Corpus Construction

**Source**: Official Sri Lankan Parliament Hansard PDFs (publicly available parliamentary archive).

**Extraction**: A two-stage pipeline was used:

1. PDFs were scraped from the parliamentary archive website.
2. Text was extracted using **Google Gemini** (LLM-based extraction) rather than OCR, to handle the dual-column trilingual layout, non-standard Sinhala/Tamil font encodings, and structural noise inherent in Hansard PDFs.

**Preprocessing**:

- Speaker name normalization across Sinhala, Tamil, and English honorific variants.
- Removal of procedural headers, footers, and short non-substantive utterances (<50 characters).
- Language detection per speech using script-based heuristics and probabilistic language identification.
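The script-based part of the language detection step and the short-utterance filter can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0.9 dominance threshold, and the choice to treat all Latin-script letters as English are assumptions made for illustration, and the probabilistic language identification stage is not covered. Only the 50-character cutoff and the three languages come from the card. The script test uses the Unicode Sinhala block (U+0D80 to U+0DFF) and Tamil block (U+0B80 to U+0BFF).

```python
MIN_CHARS = 50  # cutoff stated in the card; utterances shorter than this are dropped


def script_shares(text):
    """Return the share of Sinhala, Tamil, and Latin letters among counted characters."""
    counts = {"si": 0, "ta": 0, "en": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0D80 <= cp <= 0x0DFF:      # Unicode Sinhala block
            counts["si"] += 1
        elif 0x0B80 <= cp <= 0x0BFF:    # Unicode Tamil block
            counts["ta"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1           # assumption: Latin script means English
    total = sum(counts.values())
    if total == 0:
        return {lang: 0.0 for lang in counts}
    return {lang: n / total for lang, n in counts.items()}


def guess_language(text, dominance=0.9):
    """Label a speech 'si', 'ta', 'en', 'mixed', or 'unknown' from script shares.

    `dominance` is a hypothetical threshold: a single script must account for
    at least this share of counted letters, otherwise the speech is 'mixed'.
    """
    shares = script_shares(text)
    lang, share = max(shares.items(), key=lambda kv: kv[1])
    if share == 0.0:
        return "unknown"
    return lang if share >= dominance else "mixed"


def is_substantive(text, min_chars=MIN_CHARS):
    """Keep only utterances with at least `min_chars` characters after trimming."""
    return len(text.strip()) >= min_chars
```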

---

## Dataset Fields

| Column | Type | Description |
|---|---|---|
| `speech_id` | string | Unique speech identifier (e.g., `SP_00000`) |
| `date` | string | Parliamentary session date (YYYY-MM-DD) |
| `year` | integer | Year extracted from date |
| `speaker` | string | Speaker name (normalized; may be in Sinhala, Tamil, or English) |
| `text` | string | Full speech text (original language, may be code-mixed) |
| `micro_topic_cluster` | integer | HDBSCAN cluster label; **-1 = noise** (procedural/unassigned speech) |
| `macro_topic` | string | Aggregated macro-topic label (e.g. `Macro-Topic 3`); `"Procedural Noise"` for noise speeches; `"Niche/Localized Debates"` for small unrescued micro-clusters |

---

## Topic Modeling

- **Semantic channel**: BGE-M3 dense embeddings (1024-dim, 8192-token context)
- **Lexical channel**: CountVectorizer sparse TF vectors
- **Clustering**: HDBSCAN (density-based; noise label = -1)
- **Dimensionality reduction**: UMAP (5D for clustering, 2D for visualization)

HDBSCAN outperformed KMeans and Agglomerative Clustering on all three internal metrics (higher is better for Silhouette and Calinski-Harabasz; lower is better for Davies-Bouldin):

| Algorithm | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|
| **HDBSCAN** | **0.5741** | **15,052** | **0.5148** |
| KMeans | 0.4076 | 13,098 | 0.8745 |
| Agglomerative | 0.3812 | 12,217 | 0.9088 |

The `micro_topic_cluster` column contains raw HDBSCAN micro-topic labels. Speeches with label **-1** are noise: typically short procedural utterances, interjections, or fragments that do not belong to any dense topic region.

---

## Political Events Captured

The corpus spans a politically turbulent period.
The unsupervised topic model recovers key historical events:

- **2019 Easter Sunday attacks**: security and defense discourse spike
- **2022 economic crisis & Aragalaya**: economy and fiscal policy topics surge dramatically
- **IMF restructuring period**: continued elevated economic debate volume

---

## Intended Uses

- Multilingual NLP research (Sinhala, Tamil, English)
- Parliamentary discourse analysis
- Topic modeling benchmarking on low-resource, code-mixed corpora
- Political science: agenda setting, temporal attention analysis, actor-level discourse profiling

---

## Limitations

- **Morphological noise**: Sinhala and Tamil agglutinative morphology means surface token forms may vary across speeches with the same semantic content.
- **Speaker metadata**: Speaker names are normalized heuristically; rare or ambiguous names may remain un-merged across sessions.
- **Noise label**: ~33–44% of speeches are labeled -1 (noise). These are real speeches but lack dense topic neighborhood membership under HDBSCAN.
- **Temporal coverage**: Parliamentary sitting frequency is uneven across years; raw topic volume counts are not directly comparable year-to-year without normalization.

---

## Citation

If you use this dataset, please cite:

```
@article{dhanapala2026multilingual,
  title={Multilingual Topic Modeling of Sri Lankan Parliamentary Debates: An NLP Pipeline for Trilingual Hansard Analysis},
  author={Dhanapala, D.H.N. and Daishika, K.S.H. and Nawagamuwa, N.A.K. and Kuruppu, H.A. and Seneviratne, N.S.},
  journal={Department of Computer Science & Engineering, University of Moratuwa},
  year={2026},
  publisher={University of Moratuwa, Sri Lanka}
}
```

---

## License

Apache 2.0. The underlying Hansard text is sourced from the Parliament of Sri Lanka's publicly available official records.
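As a usage note on the temporal-coverage limitation in this card, the following minimal sketch shows one way to turn raw per-year topic counts into per-year shares so that years with different sitting frequencies can be compared. It assumes rows are plain dictionaries carrying the card's `year` and `macro_topic` columns. The function name `yearly_topic_shares`, the `drop_noise` flag, and the sample rows are hypothetical and do not come from the dataset.

```python
from collections import Counter, defaultdict

NOISE_LABEL = "Procedural Noise"  # macro_topic value the card gives for noise speeches


def yearly_topic_shares(rows, drop_noise=True):
    """Return {year: {macro_topic: share of that year's speeches}}."""
    counts = defaultdict(Counter)
    for row in rows:
        if drop_noise and row["macro_topic"] == NOISE_LABEL:
            continue  # exclude noise speeches from both numerator and denominator
        counts[row["year"]][row["macro_topic"]] += 1

    shares = {}
    for year, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[year] = {topic: n / total for topic, n in topic_counts.items()}
    return shares


# Invented sample rows, for illustration only.
rows = [
    {"year": 2019, "macro_topic": "Macro-Topic 3"},
    {"year": 2019, "macro_topic": "Macro-Topic 5"},
    {"year": 2022, "macro_topic": "Macro-Topic 3"},
    {"year": 2022, "macro_topic": "Macro-Topic 3"},
    {"year": 2022, "macro_topic": "Macro-Topic 5"},
    {"year": 2022, "macro_topic": "Macro-Topic 5"},
    {"year": 2022, "macro_topic": "Procedural Noise"},
]

shares = yearly_topic_shares(rows)
```

In the invented sample, the raw count for `Macro-Topic 3` doubles between 2019 and 2022 while its share of non-noise speeches stays the same, which is the kind of distortion normalization is meant to remove.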
Provider:
himath-nimpura