
Kurdish Social Media Opinions

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/495h8779p6
Description:
The corpus was compiled through systematic data collection from authentic digital sources, including Kurdish-language social media, public forums, and news commentary sections. This strategy ensures the dataset reflects the dynamic, colloquial language used by native speakers in organic, real-world settings. To ensure thematic relevance and linguistic diversity, data collection was stratified across topics of significant interest to Kurdish-speaking communities, such as politics, culture, social affairs, sports, and local current events. This domain-specific approach enhances the practical applicability of models trained on the corpus.

A multi-stage preprocessing pipeline was implemented to ensure corpus integrity. First, raw user comments were filtered to remove non-Kurdish text, unintelligible statements, and irrelevant entries. The remaining texts were then normalised to address orthographic variation and the informal writing conventions typical of computer-mediated communication.

The core of the annotation scheme was the manual classification of each textual unit into one of seven discrete sentiment categories: sadness, happiness, anger, disgust, fear, surprise, and sarcasm. This categorisation was performed by native Kurdish speakers trained to interpret linguistic cues, contextual nuances, and cultural subtleties. The annotation went beyond simple lexical polarity (e.g., the presence of positive words) to assess the author's intent and overall opinion, adding a layer of pragmatic understanding to the dataset.

The resulting annotated corpus is a resource for advancing Kurdish language technology. It provides a ground-truth dataset for training and evaluating machine learning and deep learning models for sentiment analysis. This work establishes a foundational benchmark for future research in Kurdish NLP and contributes to the broader effort of developing inclusive language technologies for under-resourced languages.
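The filter-then-normalise steps and the seven-way label set described above could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the label strings, the function names (`is_usable`, `normalise`, `prepare`), the `min_letters` threshold, and the specific normalisation rules are all assumptions. In particular, the filter below is a crude placeholder that only drops entries with too few letters; the description does not say how non-Kurdish text was detected, and a real pipeline would need a proper language identifier.

```python
import re
import unicodedata

# The seven categories named in the description; the exact label strings are assumed.
LABELS = ("sadness", "happiness", "anger", "disgust", "fear", "surprise", "sarcasm")

_URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
_REPEATS = re.compile(r"(.)\1{2,}")  # three or more of the same character in a row
_WHITESPACE = re.compile(r"\s+")


def is_usable(text: str, min_letters: int = 3) -> bool:
    """Placeholder filter (assumption): keep entries with enough alphabetic characters.

    This does NOT detect language; it only removes near-empty or symbol-only entries.
    """
    return sum(ch.isalpha() for ch in text) >= min_letters


def normalise(text: str) -> str:
    """Hypothetical normalisation for informal online text."""
    text = unicodedata.normalize("NFKC", text)   # unify compatibility forms
    text = _URL_OR_MENTION.sub(" ", text)        # drop links and @mentions
    text = _REPEATS.sub(r"\1\1", text)           # squeeze elongations to two characters
    return _WHITESPACE.sub(" ", text).strip()    # collapse runs of whitespace


def prepare(rows):
    """Filter first, then normalise, mirroring the order given in the description."""
    cleaned = []
    for text, label in rows:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        if not is_usable(text):
            continue
        norm = normalise(text)
        if norm:  # normalisation may leave nothing behind (e.g. a link-only comment)
            cleaned.append((norm, label))
    return cleaned
```

Calling `prepare` on a list of `(text, label)` pairs returns only the entries that pass the filter, with normalised text. Validating labels against a fixed tuple keeps annotation typos from silently creating an eighth class.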

Created:
2026-02-19