Offensive Hebrew Corpus
Source: arXiv · Updated 2023-09-06 · Indexed 2024-06-21
Download link:
https://github.com/SinaLab/OffensiveHebrew
Resource description:
Offensive Hebrew Corpus is a Hebrew-language dataset of 15,881 tweets, created at Birzeit University to identify and classify offensive language on social media. The tweets were collected through the Twitter API, and each was labeled by Arabic-Hebrew bilingual annotators with one of five categories: abusive, hate, violence, pornographic, or non-offensive. Annotators were required to be familiar with Israeli culture, politics, and practices so that they could interpret each tweet in context. The dataset is mainly used to train and evaluate Hebrew BERT models for offensive language detection, with applications in social media monitoring and content moderation.
Provider:
Birzeit University
Created:
2023-09-06
Summary
Dataset introduction

Background and challenges
Background overview
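Offensive Hebrew Corpus is a dataset of 15,881 Hebrew tweets, manually annotated with five fine-grained categories (abusive, hate, violence, pornographic, or non-offensive). To address class imbalance, the labels were remapped to a binary scheme (offensive / non-offensive), and a balanced set of 2,500 tweets is provided for model training and evaluation.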
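The remapping from five fine-grained labels to a binary scheme, followed by balanced sampling, can be sketched roughly as below. This is a minimal illustration, not the corpus authors' procedure: the label strings, the `(text, label)` row layout, the function names `to_binary` and `balanced_sample`, and the equal-per-class sampling strategy are all assumptions made for the example. The summary does not say how the balanced 2,500-tweet set was actually drawn.

```python
import random
from collections import Counter

# Assumed label strings; the corpus files may spell these differently.
FINE_LABELS = {"abusive", "hate", "violence", "pornographic", "non-offensive"}


def to_binary(label: str) -> str:
    """Collapse a fine-grained label into 'offensive' or 'non-offensive'."""
    if label not in FINE_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return "non-offensive" if label == "non-offensive" else "offensive"


def balanced_sample(rows, per_class, seed=0):
    """Draw up to `per_class` rows from each binary class, without replacement.

    `rows` is an iterable of (text, fine_label) pairs (a hypothetical layout).
    Returns a shuffled list of (text, binary_label) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = {"offensive": [], "non-offensive": []}
    for text, label in rows:
        binary = to_binary(label)
        by_class[binary].append((text, binary))

    sample = []
    for items in by_class.values():
        sample.extend(rng.sample(items, min(per_class, len(items))))
    rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    # Toy data standing in for real tweets.
    toy = [(f"tweet {i}", "non-offensive") for i in range(8)]
    toy += [("a", "abusive"), ("b", "hate"), ("c", "violence")]
    picked = balanced_sample(toy, per_class=3)
    print(Counter(label for _, label in picked))
```

Downsampling the majority class is only one way to balance a dataset; class weighting or oversampling are common alternatives, and which one suits a given model is an empirical question.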



