Offensive Hebrew Corpus

Name: Offensive Hebrew Corpus
Creator: Birzeit University
Published: 2023-09-06 13:18:43
License: No description available

Source: arXiv. Updated: 2023-09-06. Indexed: 2024-06-21.

Download link:

https://github.com/SinaLab/OffensiveHebrew

Resource description:

Offensive Hebrew Corpus is a Hebrew-language dataset of 15,881 tweets, created by Birzeit University to support the identification and classification of offensive language on social media. The tweets were collected through the Twitter API, and each one was labeled by Arabic-Hebrew bilingual annotators using five categories: abusive, hate, violence, pornographic, or not offensive. Annotators were required to be familiar with Israeli culture, politics, and practices so that they could interpret each tweet in context. The dataset is used mainly to train and evaluate Hebrew BERT models for offensive language detection, with applications in social media monitoring and content moderation.
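The five-category labeling scheme described above can be sketched in Python as a small validation and counting helper. This is only an illustration: the label strings, the `label` field name, the one-label-per-tweet assumption, and the `label_distribution` function are all hypothetical, and the files in the repository may be organized differently.

```python
from collections import Counter
from typing import Iterable, Mapping

# Illustrative label names (an assumption); the repository may use other strings.
LABELS = ("abusive", "hate", "violence", "pornographic", "non-offensive")


def label_distribution(
    rows: Iterable[Mapping[str, str]], label_key: str = "label"
) -> dict[str, int]:
    """Count tweets per category, rejecting labels outside the five classes."""
    counts = Counter({name: 0 for name in LABELS})
    for row in rows:
        # Normalize case and whitespace before checking the label.
        label = row[label_key].strip().lower()
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        counts[label] += 1
    return dict(counts)


# Toy rows with placeholder text; these are not taken from the corpus.
sample = [
    {"text": "<tweet 1>", "label": "hate"},
    {"text": "<tweet 2>", "label": "non-offensive"},
    {"text": "<tweet 3>", "label": "Hate"},
]
distribution = label_distribution(sample)
```

Checking the class distribution this way before fine-tuning a classifier is a common first step, since offensive-language corpora are often heavily skewed toward the non-offensive class.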

Provider:

Birzeit University

Created:

2023-09-06

Collection summary

Dataset introduction