AmashaHw/sinhala_intent_commands_dataset
Source: Hugging Face. Updated 2026-04-26; indexed 2026-05-03.
Download link: https://hf-mirror.com/datasets/AmashaHw/sinhala_intent_commands_dataset
language:
- si
tags:
- nlp
- intent-classification
- sinhala
- voice-commands
- accessibility
license: apache-2.0
# Sinhala Intent Commands Dataset
## 📌 Overview
This dataset contains Sinhala natural language command variations designed for training an intent classification model in a voice-assisted smart reading system.
Models trained on it let users interact with documents through flexible Sinhala voice or text commands, supporting accessibility-focused applications such as assistive reading for visually impaired users.
---
## 🎯 Purpose
This dataset was created as part of the research project:
**"Intelligent Segmentation and RAG-based Summarization with Non-Linear Voice Navigation for Sinhala OCR Text"**
It supports:
* Sinhala voice command understanding
* Non-linear document navigation
* Context-aware content retrieval
* Accessibility-focused smart reading systems
---
## 📊 Dataset Structure
| Column | Description |
| ------ | -------------------------- |
| text | Sinhala command input |
| intent | Corresponding intent label |
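A quick sanity check on the two-column layout can catch a malformed export before training. The sketch below assumes the CSV exposes the `text` and `intent` columns listed in the table; the `validate_schema` helper and the sample rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd

EXPECTED_COLUMNS = ["text", "intent"]  # column names from the table above


def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy, or raise ValueError if a column is missing."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Keep only the expected columns and drop rows with missing values.
    out = df[EXPECTED_COLUMNS].dropna().copy()
    # Trim surrounding whitespace, then drop rows left with an empty field.
    out["text"] = out["text"].astype(str).str.strip()
    out["intent"] = out["intent"].astype(str).str.strip()
    out = out[(out["text"] != "") & (out["intent"] != "")]
    return out.reset_index(drop=True)


# Hypothetical rows for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {"text": [" ඊළඟ ", "කියවන්න", ""], "intent": ["NEXT", "READ", "READ"]}
)
clean = validate_schema(sample)
print(clean)
```

In practice the same helper could be applied to the frame returned by `pd.read_csv` before any training step.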
---
## 🧠 Intent Classes
The dataset includes multiple intent categories such as:
**Navigation**
* NEXT
* PREV
* READ
* REPEAT
**Structure-Based Navigation**
* SECTIONS
* JUMP_SECTION
* READ_SECTION
**Search**
* SEARCH
**Summarization**
* SUMMARIZE_SHORT
* SUMMARIZE_MEDIUM
* SUMMARIZE_LONG
* SUMMARIZE_DOC_SHORT
* SUMMARIZE_DOC_MEDIUM
* SUMMARIZE_DOC_LONG
**Question Answering**
* ANSWER_CHUNK_SHORT
* ANSWER_CHUNK_MEDIUM
* ANSWER_CHUNK_LONG
* ANSWER_DOC_SHORT
* ANSWER_DOC_MEDIUM
* ANSWER_DOC_LONG
Each intent contains multiple Sinhala paraphrases to improve model generalization.
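The listed labels appear to follow an `ACTION[_SCOPE][_LENGTH]` naming pattern (for example `ANSWER_DOC_LONG`). This is an observation from the label names, not a documented schema. Under that assumption, a downstream system could split a predicted label into its parts as sketched below; `ParsedIntent` and `parse_intent` are hypothetical names.

```python
from typing import NamedTuple, Optional

LENGTHS = {"SHORT", "MEDIUM", "LONG"}
SCOPES = {"CHUNK", "DOC"}


class ParsedIntent(NamedTuple):
    action: str
    scope: Optional[str]   # "CHUNK", "DOC", or None when the label names no scope
    length: Optional[str]  # "SHORT", "MEDIUM", "LONG", or None


def parse_intent(label: str) -> ParsedIntent:
    """Split a label such as SUMMARIZE_DOC_LONG into action, scope, and length."""
    parts = label.split("_")
    # A trailing SHORT/MEDIUM/LONG token is treated as the requested length.
    length = parts.pop() if len(parts) > 1 and parts[-1] in LENGTHS else None
    # A remaining trailing CHUNK/DOC token is treated as the scope.
    scope = parts.pop() if len(parts) > 1 and parts[-1] in SCOPES else None
    return ParsedIntent("_".join(parts), scope, length)


for label in ["NEXT", "JUMP_SECTION", "SUMMARIZE_SHORT", "ANSWER_CHUNK_MEDIUM"]:
    print(label, "->", parse_intent(label))
```

Labels without a length or scope token, such as `JUMP_SECTION`, pass through unchanged as the action.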
---
## 📈 Dataset Size
* Total samples: 1245
* Balanced dataset: approximately **50 samples per intent class**
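At roughly 50 samples per class, 1245 samples would correspond to about 25 classes (1245 / 50 ≈ 24.9), so the intent list above may not be exhaustive. The sketch below shows one way to make the actual per-class counts explicit; `class_balance` is a hypothetical helper and the label list is made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def class_balance(labels: Iterable[str]) -> Dict[str, float]:
    """Summarise how evenly samples are spread across intent labels."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    smallest, largest = min(counts.values()), max(counts.values())
    return {
        "classes": len(counts),
        "total": sum(counts.values()),
        "min_per_class": smallest,
        "max_per_class": largest,
        # 1.0 means perfectly balanced; larger values mean more skew.
        "imbalance_ratio": largest / smallest,
    }


# Made-up label counts for illustration; not the real distribution.
labels = ["NEXT"] * 50 + ["PREV"] * 50 + ["SEARCH"] * 45
report = class_balance(labels)
print(report)
```

With the dataset loaded into a pandas frame, `class_balance(df["intent"])` would report the real figures.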
---
## ⚙️ Usage
### Load dataset in Python
```python
import pandas as pd

# Load the CSV; it has two columns, `text` and `intent`.
df = pd.read_csv("sinhala_intent_commands_dataset.csv")
print(df.head())
```
---
## 🧪 Model Training
This dataset was used to train an intent classification model using:
* TF-IDF with character n-grams (3–5)
* Linear SVM (LinearSVC)
* Calibrated classifier for confidence estimation
This approach helps handle:
* Sinhala language variations
* ASR (Automatic Speech Recognition) noise
* Short command-based inputs
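The components listed above can be combined in scikit-learn roughly as follows. This is a minimal sketch, not the author's training script: the `analyzer="char"` setting, the `cv=3` calibration folds, the 0.5 confidence threshold, and the helper names are all assumptions, and the Sinhala phrases are invented placeholders rather than dataset rows.

```python
from typing import Optional

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_intent_classifier(cv: int = 3):
    """TF-IDF over character 3- to 5-grams, LinearSVC, sigmoid calibration."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
    # Calibration wraps the SVM so that predict_proba becomes available.
    classifier = CalibratedClassifierCV(LinearSVC(), cv=cv)
    return make_pipeline(vectorizer, classifier)


def predict_with_threshold(model, text: str, threshold: float = 0.5) -> Optional[str]:
    """Return the top intent, or None when its probability is below threshold."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    if probabilities[best] < threshold:
        return None
    return str(model[-1].classes_[best])


# Invented placeholder commands for illustration; not taken from the dataset.
texts = [
    "ඊළඟ", "ඊළඟ එක", "ඊළඟට යන්න", "ඊළඟ කොටස",
    "පෙර", "පෙර එක", "පෙර එකට යන්න", "ආපසු යන්න",
    "කියවන්න", "මෙය කියවන්න", "කියවන්න පටන් ගන්න", "දැන් කියවන්න",
]
labels = ["NEXT"] * 4 + ["PREV"] * 4 + ["READ"] * 4

model = build_intent_classifier()
model.fit(texts, labels)
print(predict_with_threshold(model, "ඊළඟ එකට යන්න", threshold=0.0))
```

Character n-grams tolerate small spelling and transcription differences because a misspelt command still shares most of its n-grams with the correct form. Returning `None` below the threshold gives the calling application a way to ask the user to repeat a command instead of acting on a low-confidence guess.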
---
## ⚠️ Limitations
* May not cover all Sinhala dialects and linguistic variations
* Contains ASR-like variations but may not fully represent real-world speech errors
* Some noise may exist due to OCR-based preprocessing
* Limited to command-style inputs rather than full conversational language
---
## 🔮 Future Improvements
* Expand dataset with real voice-transcribed Sinhala data
* Include more dialect variations
* Improve robustness to ASR errors
* Extend to conversational intent understanding
---
## 📜 License
This dataset is released under the **Apache 2.0 License**.
---
## 👤 Author
**Amasha Hewagama**
Sri Lanka Institute of Information Technology (SLIIT)
---
## 🔗 Citation
If you use this dataset, please cite:
* Author: Amasha Hewagama
* Title: Sinhala Intent Commands Dataset
* Year: 2026
---
## 🙌 Acknowledgements
This dataset was developed as part of an undergraduate research project in Data Science, focusing on accessibility and intelligent document interaction using Sinhala language technologies.
Provided by: AmashaHw



