
Kurdish Social Media Opinions

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/495h8779p6
Description:
The corpus was compiled through systematic collection from authentic digital sources, including Kurdish-language social media, public forums, and news commentary sections. This strategy ensures the dataset reflects the dynamic, colloquial language that native speakers use in organic, real-world settings. To ensure thematic relevance and linguistic diversity, collection was stratified across topics of significant interest to Kurdish-speaking communities, such as politics, culture, social affairs, sports, and local current events. This domain coverage improves the practical applicability of models trained on the corpus.

A multi-stage preprocessing pipeline was applied to ensure corpus integrity. First, raw user comments were filtered to remove non-Kurdish text, unintelligible statements, and irrelevant entries. The remaining texts were then normalised to address orthographic variation and the informal writing conventions typical of computer-mediated communication.

At the core of the annotation scheme, each textual unit was manually classified into one of seven discrete categories: sadness, happiness, anger, disgust, fear, surprise, and sarcasm. The classification was performed by native Kurdish speakers trained to interpret linguistic cues, contextual nuances, and cultural subtleties. Annotation went beyond simple lexical polarity (e.g., the presence of positive words) to a more holistic assessment of the author's intent and overall opinion, adding a layer of pragmatic understanding to the dataset.

The resulting annotated corpus is a resource for advancing Kurdish language technology. It provides a ground-truth dataset for training and evaluating machine learning and deep learning models for sentiment analysis.
This work establishes a foundational benchmark for future research in Kurdish NLP and contributes to the broader effort of developing inclusive language technologies for under-resourced languages.
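
The filtering and normalisation steps described above are not published with the listing, so the following is only a minimal Python sketch of what such a pipeline might look like. It assumes Sorani Kurdish written in Arabic script (the description does not state the dialect or script), and the function names, the character mappings, the `min_ratio` threshold, and the exact label strings are all hypothetical choices made for illustration rather than the authors' method.

```python
import re
import unicodedata

# Assumed label strings, taken from the seven categories in the description;
# the actual dataset files may spell them differently.
LABELS = {"sadness", "happiness", "anger", "disgust", "fear", "surprise", "sarcasm"}

# Assumption: unify Arabic-keyboard variants with the forms common in Sorani text.
_CHAR_MAP = str.maketrans({
    "\u0643": "\u06a9",  # Arabic kaf -> keheh
    "\u064a": "\u06cc",  # Arabic yeh -> Farsi yeh
    "\u0640": None,      # tatweel (elongation mark) is dropped
})
_URL = re.compile(r"https?://\S+")
_REPEAT = re.compile(r"(.)\1{2,}")            # a character repeated 3+ times
_ARABIC_BLOCK = re.compile(r"[\u0600-\u06FF]")
# Letters that are distinctive for Sorani orthography (heuristic, not exhaustive).
_SORANI_HINTS = set("\u06d5\u06ce\u06c6\u0695\u06b5")


def normalise(text):
    """Apply light orthographic normalisation to one comment."""
    text = unicodedata.normalize("NFC", text)
    text = _URL.sub(" ", text)
    text = text.translate(_CHAR_MAP)
    text = _REPEAT.sub(r"\1\1", text)         # "sooo" -> "soo"
    return " ".join(text.split())             # collapse whitespace


def looks_kurdish(text, min_ratio=0.5):
    """Crude script check; a real pipeline would use language identification."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    in_script = sum(1 for c in letters if _ARABIC_BLOCK.match(c))
    has_hint = any(c in _SORANI_HINTS for c in letters)
    return has_hint and in_script / len(letters) >= min_ratio


def preprocess(rows):
    """Yield (clean_text, label) pairs, dropping unusable rows."""
    for text, label in rows:
        clean = normalise(text)
        if label in LABELS and looks_kurdish(clean):
            yield clean, label
```

A script-ratio heuristic like this cannot reliably separate Kurdish from Arabic or Persian, and it would reject Kurmanji written in Latin script, so it should be read as a placeholder for whatever filtering the dataset authors actually used.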
Created:
2026-02-19