himath-nimpura/sl-parliamentary-hansard-17-26
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/himath-nimpura/sl-parliamentary-hansard-17-26
---
license: apache-2.0
language:
- si
- en
- ta
tags:
- parliamentary
- hansard
- sri-lanka
- multilingual
- politics
- government
- hdbscan
- topic-modeling
- sinhala
- tamil
pretty_name: Sri Lanka Parliamentary Hansard (2017–2026)
size_categories:
- 10K<n<100K
---
# Sri Lanka Parliamentary Hansard (2017–2026)
A trilingual corpus of **19,553 parliamentary speeches** from the Sri Lankan Parliament, spanning 2017 to 2026. Speeches are in Sinhala (සිංහල), Tamil (தமிழ்), and English, including code-mixed utterances. Each speech carries a speaker name, a date, and an unsupervised topic cluster label produced by the BiTopic hybrid topic modeling pipeline.
This dataset was constructed and used in the thesis:
**"Multilingual Topic Modeling of Sri Lankan Parliamentary Discourse"** (Himath Nimpura, 2026).
---
## Dataset Summary
| Field | Value |
|---|---|
| Total speeches | 19,553 |
| Years covered | 2017–2026 |
| Languages | Sinhala (~76%), English (~7%), Tamil (~7%), code-mixed (~10%) |
| Median speech length | ~300 words |
| 95th percentile length | ~1,200 words |
| Topic clusters (micro) | 331–336 (HDBSCAN, noise label = -1) |
| Macro-topic groups | 15 (empirical dendrogram cut) |
---
## Corpus Construction
**Source**: Official Sri Lankan Parliament Hansard PDFs (publicly available parliamentary archive).
**Extraction**: A two-stage pipeline was used:
1. PDFs were scraped from the parliamentary archive website.
2. Text was extracted using **Google Gemini** (LLM-based extraction) rather than OCR, to handle the dual-column trilingual layout, non-standard Sinhala/Tamil font encodings, and structural noise inherent in Hansard PDFs.
**Preprocessing**:
- Speaker name normalization across Sinhala, Tamil, and English honorific variants.
- Removal of procedural headers, footers, and short non-substantive utterances (<50 characters).
- Language detection per speech using script-based heuristics and probabilistic language identification.
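
The length filter and the script-based part of language detection can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the Unicode block ranges for Sinhala (U+0D80 to U+0DFF) and Tamil (U+0B80 to U+0BFF) come from the Unicode Standard, but the function names, the `MIXED_THRESHOLD` value, and the shortcut of treating ASCII letters as English are assumptions made for the example. The probabilistic language-identification step is not shown.

```python
# Unicode block ranges for the two non-Latin scripts in the corpus.
SINHALA = (0x0D80, 0x0DFF)
TAMIL = (0x0B80, 0x0BFF)

MIN_CHARS = 50          # utterances shorter than this are dropped (per the card)
MIXED_THRESHOLD = 0.15  # hypothetical: second-script share that counts as code-mixed


def script_shares(text: str) -> dict[str, float]:
    """Share of script-bearing characters that fall in each script's range."""
    counts = {"si": 0, "ta": 0, "en": 0}
    for ch in text:
        cp = ord(ch)
        if SINHALA[0] <= cp <= SINHALA[1]:
            counts["si"] += 1
        elif TAMIL[0] <= cp <= TAMIL[1]:
            counts["ta"] += 1
        elif ch.isascii() and ch.isalpha():
            # Assumption: ASCII letters are treated as English.
            counts["en"] += 1
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


def guess_language(text: str) -> str:
    """Return 'si', 'ta', 'en', 'mixed', or 'unknown' from script shares alone."""
    ranked = sorted(script_shares(text).items(), key=lambda kv: kv[1], reverse=True)
    (top, top_share), (_, second_share) = ranked[0], ranked[1]
    if top_share == 0.0:
        return "unknown"
    if second_share >= MIXED_THRESHOLD:
        return "mixed"
    return top


def is_substantive(text: str) -> bool:
    """Keep only utterances with at least MIN_CHARS characters."""
    return len(text.strip()) >= MIN_CHARS
```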
---
## Dataset Fields
| Column | Type | Description |
|---|---|---|
| `speech_id` | string | Unique speech identifier (e.g., `SP_00000`) |
| `date` | string | Parliamentary session date (YYYY-MM-DD) |
| `year` | integer | Year extracted from date |
| `speaker` | string | Speaker name (normalized; may be in Sinhala, Tamil, or English) |
| `text` | string | Full speech text (original language, may be code-mixed) |
| `micro_topic_cluster` | integer | HDBSCAN cluster label; **-1 = noise** (procedural/unassigned speech) |
| `macro_topic` | string | Aggregated macro-topic label (e.g. `Macro-Topic 3`); `"Procedural Noise"` for noise speeches; `"Niche/Localized Debates"` for small unrescued micro-clusters |
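
The schema above can be expressed as a typed record, which makes the noise convention explicit when filtering. The sketch below uses only the standard library; the `Speech` type, the helper names, and the sample rows are illustrative and are not real corpus entries. Rows in this shape could presumably be obtained through the Hugging Face `datasets` library, but the split names and file layout are not stated in this card.

```python
from collections import Counter
from typing import Iterable, TypedDict

NOISE_LABEL = -1  # HDBSCAN noise label, as documented in the table above


class Speech(TypedDict):
    speech_id: str
    date: str                 # YYYY-MM-DD
    year: int
    speaker: str
    text: str
    micro_topic_cluster: int  # -1 means noise
    macro_topic: str


def drop_noise(rows: Iterable[Speech]) -> list[Speech]:
    """Keep only speeches that were assigned to a micro-topic cluster."""
    return [row for row in rows if row["micro_topic_cluster"] != NOISE_LABEL]


def macro_topic_counts(rows: Iterable[Speech]) -> Counter:
    """Count speeches per macro-topic label."""
    return Counter(row["macro_topic"] for row in rows)


# Made-up rows that only mirror the documented column layout.
SAMPLE: list[Speech] = [
    {"speech_id": "SP_00000", "date": "2019-05-07", "year": 2019,
     "speaker": "Speaker A", "text": "placeholder text",
     "micro_topic_cluster": 12, "macro_topic": "Macro-Topic 3"},
    {"speech_id": "SP_00001", "date": "2019-05-07", "year": 2019,
     "speaker": "Speaker B", "text": "placeholder text",
     "micro_topic_cluster": -1, "macro_topic": "Procedural Noise"},
    {"speech_id": "SP_00002", "date": "2022-07-20", "year": 2022,
     "speaker": "Speaker A", "text": "placeholder text",
     "micro_topic_cluster": 40, "macro_topic": "Macro-Topic 3"},
]

if __name__ == "__main__":
    print(len(drop_noise(SAMPLE)), macro_topic_counts(SAMPLE))
```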
---
## Topic Modeling
- **Semantic channel**: BGE-M3 dense embeddings (1024-dim, 8192-token context)
- **Lexical channel**: CountVectorizer sparse TF vectors
- **Clustering**: HDBSCAN (density-based; noise label = -1)
- **Dimensionality reduction**: UMAP (5D for clustering, 2D for visualization)
HDBSCAN outperformed KMeans and Agglomerative Clustering on all three internal metrics (higher is better for Silhouette and Calinski-Harabasz; lower is better for Davies-Bouldin):
| Algorithm | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|
| **HDBSCAN** | **0.5741** | **15,052** | **0.5148** |
| KMeans | 0.4076 | 13,098 | 0.8745 |
| Agglomerative | 0.3812 | 12,217 | 0.9088 |
The `micro_topic_cluster` column contains raw HDBSCAN micro-topic labels. Speeches with label **-1** are noise — typically short procedural utterances, interjections, or fragments that do not belong to any dense topic region.
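
To make the Silhouette column concrete, the sketch below computes the mean silhouette coefficient in pure Python. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. Excluding points labeled -1 is an assumption made here: the card does not say whether noise was excluded, or in which embedding space the reported metrics were computed. In practice a library implementation would be used; this version only illustrates the definition.

```python
import math
from collections import defaultdict

NOISE_LABEL = -1


def silhouette(points, labels, noise_label=NOISE_LABEL):
    """Mean silhouette coefficient over all non-noise points (Euclidean distance)."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        if label != noise_label:  # assumption: noise is left out of the score
            clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for label, members in clusters.items():
        for point in members:
            if len(members) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            # (the point's zero distance to itself does not affect the sum).
            a = sum(math.dist(point, q) for q in members) / (len(members) - 1)
            # b: mean distance to the nearest other cluster.
            b = min(
                sum(math.dist(point, q) for q in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 50)]
    labs = [0, 0, 1, 1, -1]  # the last point is noise and is ignored
    print(round(silhouette(pts, labs), 3))
```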
---
## Political Events Captured
The corpus spans a politically turbulent period. The unsupervised topic model recovers key historical events:
- **2019 Easter Sunday attacks** — security and defense discourse spike
- **2022 economic crisis & Aragalaya** — economy and fiscal policy topics surge dramatically
- **IMF restructuring period** — continued elevated economic debate volume
---
## Intended Uses
- Multilingual NLP research (Sinhala, Tamil, English)
- Parliamentary discourse analysis
- Topic modeling benchmarking on low-resource, code-mixed corpora
- Political science: agenda setting, temporal attention analysis, actor-level discourse profiling
---
## Limitations
- **Morphological noise**: Sinhala and Tamil agglutinative morphology means surface token forms may vary across speeches with the same semantic content.
- **Speaker metadata**: Speaker names are normalized heuristically; rare or ambiguous names may remain un-merged across sessions.
- **Noise label**: ~33–44% of speeches are labeled -1 (noise). These are real speeches that do not fall within any dense topic neighborhood under HDBSCAN.
- **Temporal coverage**: Parliamentary sitting frequency is uneven across years; raw topic volume counts are not directly comparable year-to-year without normalization.
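
One simple way to normalize for uneven sitting frequency is to convert raw counts into each topic's share of that year's speeches. The sketch below assumes rows shaped like the dataset fields (`year`, `macro_topic`); the function name and the choice of denominator (all speeches in the year, including noise) are assumptions for illustration, and other normalizations, such as per sitting day, may suit some analyses better.

```python
from collections import Counter, defaultdict


def yearly_topic_shares(rows):
    """Return {year: {macro_topic: share of that year's speeches}}."""
    totals = Counter()               # speeches per year
    by_year = defaultdict(Counter)   # speeches per (year, macro_topic)
    for row in rows:
        totals[row["year"]] += 1
        by_year[row["year"]][row["macro_topic"]] += 1
    return {
        year: {topic: count / totals[year] for topic, count in topics.items()}
        for year, topics in by_year.items()
    }


if __name__ == "__main__":
    # Made-up rows: 2022 has twice as many speeches as 2019.
    rows = (
        [{"year": 2019, "macro_topic": "Macro-Topic 1"}] * 2
        + [{"year": 2019, "macro_topic": "Macro-Topic 2"}] * 2
        + [{"year": 2022, "macro_topic": "Macro-Topic 1"}] * 2
        + [{"year": 2022, "macro_topic": "Macro-Topic 2"}] * 6
    )
    # Raw counts for Macro-Topic 1 are equal (2 and 2), but the shares differ.
    print(yearly_topic_shares(rows))
```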
---
## Citation
If you use this dataset, please cite:
```
@article{dhanapala2026multilingual,
  title={Multilingual Topic Modeling of Sri Lankan Parliamentary Debates: An NLP Pipeline for Trilingual Hansard Analysis},
  author={Dhanapala, D.H.N. and Daishika, K.S.H. and Nawagamuwa, N.A.K. and Kuruppu, H.A. and Seneviratne, N.S.},
  journal={Department of Computer Science \& Engineering, University of Moratuwa},
  year={2026},
  publisher={University of Moratuwa, Sri Lanka}
}
```
---
## License
Apache 2.0. The underlying Hansard text is sourced from the Parliament of Sri Lanka's publicly available official records.
Provider: himath-nimpura



