
AmashaHw/sinhala_intent_commands_dataset

Source: Hugging Face · Updated: 2026-04-26 · Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/AmashaHw/sinhala_intent_commands_dataset
Resource description:
---
language: si
tags:
- nlp
- intent-classification
- sinhala
- voice-commands
- accessibility
license: apache-2.0
---

# Sinhala Intent Commands Dataset

## 📌 Overview

This dataset contains Sinhala natural language command variations designed for training an intent classification model in a voice-assisted smart reading system.

The dataset enables users to interact with documents using flexible Sinhala voice or text commands, supporting accessibility-focused applications such as assistive reading for visually impaired users.

---

## 🎯 Purpose

This dataset was created as part of the research project:

**"Intelligent Segmentation and RAG-based Summarization with Non-Linear Voice Navigation for Sinhala OCR Text"**

It supports:

* Sinhala voice command understanding
* Non-linear document navigation
* Context-aware content retrieval
* Accessibility-focused smart reading systems

---

## 📊 Dataset Structure

| Column | Description                |
| ------ | -------------------------- |
| text   | Sinhala command input      |
| intent | Corresponding intent label |

---

## 🧠 Intent Classes

The dataset includes multiple intent categories such as:

**Navigation**

* NEXT
* PREV
* READ
* REPEAT

**Structure-Based Navigation**

* SECTIONS
* JUMP_SECTION
* READ_SECTION

**Search**

* SEARCH

**Summarization**

* SUMMARIZE_SHORT
* SUMMARIZE_MEDIUM
* SUMMARIZE_LONG
* SUMMARIZE_DOC_SHORT
* SUMMARIZE_DOC_MEDIUM
* SUMMARIZE_DOC_LONG

**Question Answering**

* ANSWER_CHUNK_SHORT
* ANSWER_CHUNK_MEDIUM
* ANSWER_CHUNK_LONG
* ANSWER_DOC_SHORT
* ANSWER_DOC_MEDIUM
* ANSWER_DOC_LONG

Each intent contains multiple Sinhala paraphrases to improve model generalization.
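
The summarization and question-answering labels listed above follow a visible naming pattern: an action, an optional scope token (`DOC` or `CHUNK`), and an optional length token (`SHORT`, `MEDIUM`, `LONG`). The sketch below shows one way downstream code might split a predicted label into those parts. The helper `parse_intent` and the `ParsedIntent` type are hypothetical and not part of the dataset. The source does not say what a missing scope token means, so it is reported as `None` rather than interpreted.

```python
from typing import NamedTuple, Optional

# Tokens taken from the label names listed in the dataset card.
LENGTHS = {"SHORT", "MEDIUM", "LONG"}
SCOPES = {"DOC", "CHUNK"}


class ParsedIntent(NamedTuple):
    action: str            # e.g. "SUMMARIZE", "ANSWER", "JUMP_SECTION"
    scope: Optional[str]   # "DOC", "CHUNK", or None if the label has no scope token
    length: Optional[str]  # "SHORT", "MEDIUM", "LONG", or None


def parse_intent(label: str) -> ParsedIntent:
    """Split a label such as 'SUMMARIZE_DOC_SHORT' into action, scope and length."""
    parts = label.split("_")
    # Peel off the trailing length token, if any.
    length = parts.pop() if parts[-1] in LENGTHS else None
    # Then peel off a trailing scope token, if any.
    scope = parts.pop() if parts and parts[-1] in SCOPES else None
    # Whatever remains is the action (it may itself contain underscores).
    return ParsedIntent("_".join(parts), scope, length)
```

Labels without either token, such as `NEXT` or `JUMP_SECTION`, pass through unchanged as the action.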

---

## 📈 Dataset Size

* Total samples: 1245
* Balanced dataset: approximately **50 samples per intent class**

---

## ⚙️ Usage

### Load dataset in Python

```python
import pandas as pd

df = pd.read_csv("sinhala_intent_commands_dataset.csv")
print(df.head())
```

---

## 🧪 Model Training

This dataset was used to train an intent classification model using:

* TF-IDF with character n-grams (3–5)
* Linear SVM (LinearSVC)
* Calibrated classifier for confidence estimation

This approach helps handle:

* Sinhala language variations
* ASR (Automatic Speech Recognition) noise
* Short command-based inputs

---

## ⚠️ Limitations

* May not cover all Sinhala dialects and linguistic variations
* Contains ASR-like variations but may not fully represent real-world speech errors
* Some noise may exist due to OCR-based preprocessing
* Limited to command-style inputs rather than full conversational language

---

## 🔮 Future Improvements

* Expand dataset with real voice-transcribed Sinhala data
* Include more dialect variations
* Improve robustness to ASR errors
* Extend to conversational intent understanding

---

## 📜 License

This dataset is released under the **Apache 2.0 License**.

---

## 👤 Author

**Amasha Hewagama**
Sri Lanka Institute of Information Technology (SLIIT)

---

## 🔗 Citation

If you use this dataset, please cite:

* Author: Amasha Hewagama
* Title: Sinhala Intent Commands Dataset
* Year: 2026

---

## 🙌 Acknowledgements

This dataset was developed as part of an undergraduate research project in Data Science, focusing on accessibility and intelligent document interaction using Sinhala language technologies.
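
The setup listed under Model Training (TF-IDF over character 3–5-grams, LinearSVC, and a calibrated classifier) could be assembled with scikit-learn roughly as follows. This is a minimal sketch, not the author's training code. The `analyzer="char"` setting, the number of calibration folds, the default hyperparameters, and the helper names `build_intent_classifier` and `predict_with_confidence` are all assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_intent_classifier(cv: int = 5):
    """TF-IDF over character 3-5-grams feeding a calibrated linear SVM."""
    # analyzer="char" is an assumption; "char_wb" is a common alternative.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
    # LinearSVC has no predict_proba; the calibration wrapper adds one.
    calibrated_svm = CalibratedClassifierCV(LinearSVC(), cv=cv)
    return make_pipeline(vectorizer, calibrated_svm)


def predict_with_confidence(model, text: str):
    """Return (intent, confidence) for a single command string."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])
```

With the CSV loaded into `df` as in the Usage section, fitting would look like `model = build_intent_classifier().fit(df["text"], df["intent"])`. Character n-grams are a common choice for short, noisy inputs because a misspelled or misrecognized word still shares most of its substrings with the correct form.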
Provider:
AmashaHw