DictaBERT results.
Source: Figshare | Updated: 2025-11-06 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/DictaBERT_results_/30557361
Description:
Clickbait headlines, designed to entice readers with sensationalized or misleading content, pose significant challenges in the digital landscape. They exploit curiosity to generate traffic and revenue, often at the cost of spreading misinformation and undermining the credibility of online content. Identifying clickbait is essential for improving the quality of information consumed, fostering trust in digital media, and enabling users to make informed decisions. This study advances Hebrew clickbait detection through deep learning approaches and comprehensive data augmentation strategies, targeting the unique challenges of processing a low-resource language. Building on prior research that achieved an accuracy of 87% using traditional machine learning methods, this work explores the potential of BERT-based models and diverse augmentation techniques to further enhance performance. Our experiments incorporated a variety of augmentation methods, including weak supervision, substitution-based methods, generative techniques, and language-based methods, applied to state-of-the-art Hebrew language models. The results show that targeted augmentation strategies, particularly those focusing on word-level replacements and contextual enhancements, consistently improved model performance. Our top-performing configuration achieved an accuracy of 92%, surpassing the traditional machine learning benchmark. The results of this study can be applied in real-world systems to automatically detect and reduce clickbait in Hebrew digital media, supporting news websites and social platforms in improving content quality and user trust. Furthermore, the study provides a replicable framework for tackling similar challenges in other underrepresented languages, highlighting the potential of combining advanced deep learning methods with tailored data augmentation strategies.
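The description mentions substitution-based augmentation with word-level replacements but does not say how it was implemented. The following is only a minimal sketch of the general idea, not the study's method: the synonym table, the function names `augment_headline` and `augment_dataset`, and the `replace_prob` parameter are all hypothetical, and English words stand in for Hebrew purely for readability.

```python
import random

# Hypothetical synonym table (an assumption for illustration). A real system
# might draw candidates from a lexicon or a masked language model instead.
SYNONYMS = {
    "shocking": ["stunning", "startling"],
    "secret": ["hidden", "undisclosed"],
}


def augment_headline(headline, synonyms, replace_prob=0.5, seed=None):
    """Return a copy of `headline` with some known words swapped for synonyms."""
    rng = random.Random(seed)
    out = []
    for word in headline.split():
        candidates = synonyms.get(word.lower())
        # Replace only words that have candidates, each with probability replace_prob.
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


def augment_dataset(rows, synonyms, copies=2, seed=0):
    """Expand (headline, label) pairs with augmented variants; labels are unchanged."""
    augmented = list(rows)
    for i, (headline, label) in enumerate(rows):
        for j in range(copies):
            variant = augment_headline(headline, synonyms, seed=seed + i * copies + j)
            # Skip variants identical to the original to avoid exact duplicates.
            if variant != headline:
                augmented.append((variant, label))
    return augmented
```

Calling `augment_dataset(rows, SYNONYMS)` on a list of `(headline, label)` pairs returns the original rows followed by any variants that differ from their source headline. Each variant inherits the source label, which assumes that a synonym swap does not change whether the headline is clickbait.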
Created:
2025-11-06



