
Kurdish Dataset for Stance Detection

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/ckkxx8mdcg
Description:
This dataset comprises 2,174 Sorani Kurdish news articles scraped from the Rudaw website between March 2024 and February 2025. The articles span two primary domains, economy and politics, and include metadata such as title, content, publication date, link, accessed date, and category label. To support stance detection research in a low-resource language, we implemented a multi-stage annotation pipeline:

Target Identification: Articles were first processed using a Python-based system that used a curated lexicon of 2,456 domain-specific terms to detect the main topic (e.g., "currency", "election").

Stance Labeling: We used a stance lexicon of 4,243 verbs and adjectives to automatically assign a stance label (support, oppose, or neutral) to each article. For ambiguous cases, we applied semantic similarity (via Sentence Transformers) and zero-shot classification as fallback strategies.

Manual Validation: All articles were manually reviewed and refined by four native Sorani Kurdish experts, two specializing in economics and two in political journalism. The whole dataset was checked through group discussions, and inter-annotator agreement (Cohen’s kappa) was 0.90 for economy articles and 0.93 for politics, indicating that the final annotations are reliable.

The final dataset includes fields for article title, content, publication metadata, identified target topic, and stance label. We provide all related code in a separate folder to ensure reproducibility.
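The lexicon-based steps of the pipeline described above can be illustrated with a minimal sketch. This is not the authors' code: the tiny English lexicons, the function names, the counting rule, and the choice to return `None` for ambiguous cases (where the description says a semantic-similarity or zero-shot fallback would take over) are all assumptions made for illustration. The real lexicons contain 2,456 domain terms and 4,243 Sorani Kurdish stance words.

```python
import re
from collections import Counter

# Hypothetical toy lexicons (English stand-ins for the Sorani Kurdish ones).
TOPIC_LEXICON = {
    "currency": {"currency", "dinar", "exchange"},
    "election": {"election", "vote", "ballot"},
}
STANCE_LEXICON = {
    "support": {"approve", "welcome", "beneficial"},
    "oppose": {"reject", "criticize", "harmful"},
}


def tokenize(text):
    """Lowercase and split into word tokens (\\w is Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text.lower())


def identify_target(tokens):
    """Return the topic whose lexicon terms occur most often, or None if no term matches."""
    counts = Counter()
    for topic, terms in TOPIC_LEXICON.items():
        counts[topic] = sum(1 for tok in tokens if tok in terms)
    topic, hits = counts.most_common(1)[0]
    return topic if hits > 0 else None


def assign_stance(tokens):
    """Count support/oppose words; return a label, or None when the counts tie.

    Assumption: no stance words at all means "neutral", and a non-zero tie is
    treated as ambiguous, to be resolved by a separate fallback step.
    """
    support = sum(1 for tok in tokens if tok in STANCE_LEXICON["support"])
    oppose = sum(1 for tok in tokens if tok in STANCE_LEXICON["oppose"])
    if support == 0 and oppose == 0:
        return "neutral"
    if support == oppose:
        return None  # ambiguous: hand over to the fallback strategy
    return "support" if support > oppose else "oppose"


article = "Officials welcome and approve the new exchange rate for the dinar"
tokens = tokenize(article)
target, stance = identify_target(tokens), assign_stance(tokens)
```

Returning `None` rather than guessing keeps the ambiguous cases explicit, so a later stage (model-based fallback or human review) can pick them up.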
Created:
2025-05-27