AmashaHw/sinhala_intent_commands_dataset

Name: AmashaHw/sinhala_intent_commands_dataset
Creator: AmashaHw
Published: 2026-04-26 17:34:30
License: apache-2.0

Hugging Face | Updated 2026-04-26 | Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/AmashaHw/sinhala_intent_commands_dataset


Description:

---
language: si
tags:
- nlp
- intent-classification
- sinhala
- voice-commands
- accessibility
license: apache-2.0
---

# Sinhala Intent Commands Dataset

## 📌 Overview

This dataset contains Sinhala natural language command variations designed for training an intent classification model in a voice-assisted smart reading system.

The dataset enables users to interact with documents using flexible Sinhala voice or text commands, supporting accessibility-focused applications such as assistive reading for visually impaired users.

---

## 🎯 Purpose

This dataset was created as part of the research project:

**"Intelligent Segmentation and RAG-based Summarization with Non-Linear Voice Navigation for Sinhala OCR Text"**

It supports:

* Sinhala voice command understanding
* Non-linear document navigation
* Context-aware content retrieval
* Accessibility-focused smart reading systems

---

## 📊 Dataset Structure

| Column | Description                |
| ------ | -------------------------- |
| text   | Sinhala command input      |
| intent | Corresponding intent label |

---

## 🧠 Intent Classes

The dataset includes multiple intent categories such as:

**Navigation**

* NEXT
* PREV
* READ
* REPEAT

**Structure-Based Navigation**

* SECTIONS
* JUMP_SECTION
* READ_SECTION

**Search**

* SEARCH

**Summarization**

* SUMMARIZE_SHORT
* SUMMARIZE_MEDIUM
* SUMMARIZE_LONG
* SUMMARIZE_DOC_SHORT
* SUMMARIZE_DOC_MEDIUM
* SUMMARIZE_DOC_LONG

**Question Answering**

* ANSWER_CHUNK_SHORT
* ANSWER_CHUNK_MEDIUM
* ANSWER_CHUNK_LONG
* ANSWER_DOC_SHORT
* ANSWER_DOC_MEDIUM
* ANSWER_DOC_LONG

Each intent contains multiple Sinhala paraphrases to improve model generalization.

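
As a rough illustration of the two-column layout and the label list above, the following sketch groups the listed intents by category and checks a few rows against them. It uses only the standard library. The names `INTENT_GROUPS`, `INTENT_TO_GROUP`, and `check_rows` are hypothetical helpers, and the sample rows are placeholders, not data from the dataset. Because the card introduces the labels with "such as", labels outside this list may be legitimate rather than errors.

```python
import csv
import io
from collections import Counter

# Intent labels as listed in the dataset card, grouped by category.
# Assumption: the card says "such as", so the real label set may be larger.
INTENT_GROUPS = {
    "Navigation": ["NEXT", "PREV", "READ", "REPEAT"],
    "Structure-Based Navigation": ["SECTIONS", "JUMP_SECTION", "READ_SECTION"],
    "Search": ["SEARCH"],
    "Summarization": [
        "SUMMARIZE_SHORT", "SUMMARIZE_MEDIUM", "SUMMARIZE_LONG",
        "SUMMARIZE_DOC_SHORT", "SUMMARIZE_DOC_MEDIUM", "SUMMARIZE_DOC_LONG",
    ],
    "Question Answering": [
        "ANSWER_CHUNK_SHORT", "ANSWER_CHUNK_MEDIUM", "ANSWER_CHUNK_LONG",
        "ANSWER_DOC_SHORT", "ANSWER_DOC_MEDIUM", "ANSWER_DOC_LONG",
    ],
}

# Reverse lookup: intent label -> category name.
INTENT_TO_GROUP = {
    label: group for group, labels in INTENT_GROUPS.items() for label in labels
}


def check_rows(rows):
    """Count rows per intent and collect labels not in the card's list."""
    counts = Counter()
    unlisted = set()
    for row in rows:
        label = row["intent"].strip()
        counts[label] += 1
        if label not in INTENT_TO_GROUP:
            unlisted.add(label)
    return counts, unlisted


# Placeholder rows (NOT taken from the dataset); HELP is a made-up label
# used only to show how an unlisted intent is reported.
sample_csv = (
    "text,intent\n"
    "placeholder command 1,NEXT\n"
    "placeholder command 2,SUMMARIZE_DOC_SHORT\n"
    "placeholder command 3,NEXT\n"
    "placeholder command 4,HELP\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts, unlisted = check_rows(rows)
print(counts)
print(unlisted)
```

Running the same check over the full CSV would show whether the per-class counts are close to the stated balance and which labels, if any, go beyond the categories listed here.
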

---

## 📈 Dataset Size

* Total samples: 1,245
* Balanced dataset: approximately **50 samples per intent class**

---

## ⚙️ Usage

### Load dataset in Python

```python
import pandas as pd

# Load the CSV (columns: text, intent) and preview the first rows
df = pd.read_csv("sinhala_intent_commands_dataset.csv")
print(df.head())
```

---

## 🧪 Model Training

This dataset was used to train an intent classification model using:

* TF-IDF with character n-grams (3–5)
* Linear SVM (LinearSVC)
* Calibrated classifier for confidence estimation

This approach helps handle:

* Sinhala language variations
* ASR (Automatic Speech Recognition) noise
* Short command-based inputs

---

## ⚠️ Limitations

* May not cover all Sinhala dialects and linguistic variations
* Contains ASR-like variations but may not fully represent real-world speech errors
* Some noise may exist due to OCR-based preprocessing
* Limited to command-style inputs rather than full conversational language

---

## 🔮 Future Improvements

* Expand dataset with real voice-transcribed Sinhala data
* Include more dialect variations
* Improve robustness to ASR errors
* Extend to conversational intent understanding

---

## 📜 License

This dataset is released under the **Apache 2.0 License**.

---

## 👤 Author

**Amasha Hewagama**
Sri Lanka Institute of Information Technology (SLIIT)

---

## 🔗 Citation

If you use this dataset, please cite:

* Author: Amasha Hewagama
* Title: Sinhala Intent Commands Dataset
* Year: 2026

---

## 🙌 Acknowledgements

This dataset was developed as part of an undergraduate research project in Data Science, focusing on accessibility and intelligent document interaction using Sinhala language technologies.
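
As a supplement to the Model Training section, here is a minimal sketch of why character n-grams of length 3 to 5 can tolerate ASR-style noise better than whole-word features. It covers only n-gram extraction and a simple overlap score, not the TF-IDF weighting, the LinearSVC classifier, or the calibration step the card mentions. The helpers `char_ngrams` and `jaccard` are hypothetical, and the example strings are English stand-ins chosen for readability, not dataset samples.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """Return the set of character n-grams with lengths n_min..n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# English stand-ins (assumption: illustrative only, not from the dataset).
clean = "summarize document"
noisy = "sumarize documnt"  # hypothetical ASR-style misspelling

# Word-level features: neither misspelled word matches its clean form.
word_overlap = jaccard(set(clean.split()), set(noisy.split()))

# Character-level features: many 3- to 5-character fragments still match.
char_overlap = jaccard(char_ngrams(clean), char_ngrams(noisy))

print(f"word overlap: {word_overlap:.2f}")
print(f"char n-gram overlap: {char_overlap:.2f}")
```

Python strings iterate over Unicode code points, so on Sinhala text the same function would produce n-grams of code points (base letters and vowel signs counted separately), which may or may not match how the original vectorizer segmented the text.
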

Provided by:

AmashaHw
