BAN-PL
Source: arXiv. Updated 2024-03-26. Indexed 2024-06-21.
Download link:
https://github.com/ZILiAT-NASK/BAN-PL
Description:
BAN-PL is an open dataset of offensive social media content in Polish, created by the Polish National Research Institute in collaboration with Wykop.pl. It contains 691,662 posts and comments from 2019 to 2023 that were reported by users and removed by professional content moderators, divided into two classes: harmful and non-harmful. The dataset was created to address the shortage of Polish-language resources for automated online content moderation and to give insight into a real-world moderation process. Its documentation also describes the full anonymization procedure and discusses biases common to similar datasets. BAN-PL is intended mainly for detecting and classifying offensive language on social media, with the aim of improving the accuracy and efficiency of automated content moderation.
Provided by:
National Research Institute
Created:
2023-08-21



