KuSarcasm: Automated Kurdish Sorani Sarcasm Dataset (KSSD) Using Deep Learning Techniques
Source: DataCite Commons | Updated: 2025-05-06 | Indexed: 2025-05-17
Download link:
https://data.mendeley.com/datasets/3kscrg5y4y/1
Description:
KuSarcasm: Automated Kurdish Sorani Sarcasm Dataset (KSSD) Using Deep Learning Techniques is a curated corpus designed to support sarcasm detection tasks in the Kurdish Sorani dialect. As a low-resource language with limited NLP tools, Kurdish presents unique challenges for researchers in sentiment analysis and sarcasm identification. Developed as part of a master's research project, this dataset aims to bridge the resource gap and provide a solid foundation for future Kurdish NLP research.
The dataset contains 16,833 annotated text entries selected from an initial pool of 25,697 raw samples, sourced from culturally rich materials such as Kurdish proverbs, poems, and idiomatic texts gathered from Sekhurma Magazine, digital publications, and online repositories. Following a rigorous data cleaning and preprocessing phase, only high-quality entries with clear semantic relevance were retained. Each entry is labeled with a binary tag: 1 for sarcastic and 0 for non-sarcastic. The dataset is provided in .csv format with two main columns: text (the sentence) and label (sarcasm classification).
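As a minimal sketch of how the two-column layout described above could be read and sanity-checked, the following uses only the Python standard library. The function names `load_kssd` and `label_counts` and the sample rows are hypothetical and not part of the dataset; the real file name and any extra columns are not specified in this description.

```python
import csv
import io
from collections import Counter


def load_kssd(source):
    """Read (text, label) pairs from an open text stream in CSV format.

    Assumes a header row containing at least 'text' and 'label',
    as stated in the dataset description. Extra columns are ignored.
    """
    reader = csv.DictReader(source)
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected column(s): {sorted(missing)}")

    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        label = (row["label"] or "").strip()
        if label not in {"0", "1"}:
            raise ValueError(f"line {line_no}: label must be 0 or 1, got {label!r}")
        rows.append((row["text"], int(label)))
    return rows


def label_counts(rows):
    """Count how many entries carry each label (1 = sarcastic, 0 = non-sarcastic)."""
    return Counter(label for _, label in rows)


# Placeholder rows for illustration only; these are not real dataset entries.
SAMPLE = "text,label\nexample sentence A,1\nexample sentence B,0\n"
sample_rows = load_kssd(io.StringIO(SAMPLE))
print(label_counts(sample_rows))
```

For the downloaded file, the same function can be given a handle opened with `open(path, encoding="utf-8-sig", newline="")`, where `path` is wherever the CSV was saved; the `utf-8-sig` codec tolerates a byte-order mark if one is present.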
The annotation process was automated using a two-stage NLP pipeline tailored to Kurdish Sorani. In the first stage, multilingual BERT (mBERT) determined sentiment polarity. The second stage employed Sentence-BERT (SBERT) for rule-based semantic similarity matching, detecting sarcasm based on linguistic patterns. This method ensures efficient, scalable annotation while preserving semantic relevance, avoiding the limitations of manual annotation, and supporting reproducibility for future research.
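The two-stage flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the description does not give the rule that combines the two stages, so the decision rule here (positive surface polarity plus a close match to a reference sarcastic pattern), the threshold value, and the names `label_sarcasm`, `polarity_fn`, `embed_fn`, and `pattern_vectors` are all hypothetical. The two models are passed in as plain callables so the sketch runs without downloading mBERT or SBERT.

```python
import math
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_sarcasm(
    text: str,
    polarity_fn: Callable[[str], str],    # stage 1 stand-in for an mBERT sentiment model
    embed_fn: Callable[[str], Vector],    # stage 2 stand-in for an SBERT encoder
    pattern_vectors: Iterable[Vector],    # embeddings of reference sarcastic patterns
    threshold: float = 0.7,               # hypothetical cut-off, not from the source
) -> int:
    """Return 1 (sarcastic) or 0 (non-sarcastic) under the assumed rule."""
    # Stage 1: sentiment polarity of the sentence.
    polarity = polarity_fn(text)

    # Stage 2: highest semantic similarity to any reference pattern.
    vector = embed_fn(text)
    best = max((cosine(vector, p) for p in pattern_vectors), default=0.0)

    # Assumed rule: a positive surface reading that closely matches a known
    # sarcastic pattern is tagged as sarcastic.
    return 1 if polarity == "positive" and best >= threshold else 0
```

In practice `polarity_fn` and `embed_fn` would wrap the actual sentiment classifier and sentence encoder; keeping them as parameters makes the rule itself easy to test and to swap for whatever criteria the original pipeline used.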
A unique characteristic of this dataset is its exclusive focus on Kurdish Sorani, a dialect for which few machine learning resources exist. It is especially suitable for transformer-based models such as mBERT and SBERT. The dataset has already been used in a sarcasm detection pipeline combining mBERT for sentiment analysis and SBERT for semantic matching. However, current models are still limited in their ability to fully support Kurdish-specific syntactic features and handle sentence-level ambiguity.
While the dataset captures sarcasm in written text, it does not include prosodic or paralinguistic features such as tone or emphasis, which are often crucial in detecting spoken sarcasm. Nonetheless, the linguistic richness and cultural depth of the dataset offer significant value for computational studies in irony, sentiment, and Kurdish language processing.
KSSD is intended for academic and research purposes, particularly in sentiment analysis, sarcasm detection, low-resource NLP, and Kurdish linguistics. It contributes to the broader goals of multilingual NLP research and supports the development of AI systems that reflect the linguistic and cultural nuances of Kurdish. Researchers are encouraged to cite this dataset and contribute feedback to enhance its ongoing development and collaborative use.
Provider:
Mendeley Data
Created:
2025-05-06



