christinacdl/offensive_language_dataset

Name: christinacdl/offensive_language_dataset
Creator: christinacdl
Published: 2024-02-01 14:59:20
License: Apache 2.0

Source: Hugging Face. Updated 2024-02-01; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/christinacdl/offensive_language_dataset

Resource description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- en
---

- 36,528 English texts in total: 12,955 NOT offensive and 23,573 OFFENSIVE.
- All duplicate values were removed.
- Split using sklearn into 80% train and 20% temporary test (stratified by label). The temporary test set was then split 50/50 into test and validation sets (stratified by label).
- Split: 80/10/10
- Train set label distribution: 0 ==> 10,364, 1 ==> 18,858
- Validation set label distribution: 0 ==> 1,296, 1 ==> 2,357
- Test set label distribution: 0 ==> 1,295, 1 ==> 2,358
- Sources: the OLID dataset (Zampieri et al., 2019) and the texts labeled "Offensive" and "Neither" from the dataset of the paper "Automated Hate Speech Detection and the Problem of Offensive Language" (Davidson et al., 2017).

Provider:

christinacdl

Summary of original information

Dataset overview

Basic dataset information

License: Apache 2.0
Task category: text classification
Language: English

Dataset size

Total texts: 36,528
Non-offensive texts: 12,955
Offensive texts: 23,573

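Data processing

Deduplication: all duplicate values were removed
Data split: split with sklearn into an 80% training set and a 20% temporary test set (stratified by label); the temporary test set was then split into a 50% test set and a 50% validation set (stratified by label)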
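
The deduplication and two-stage stratified split described above can be sketched as follows. This is a minimal illustration on toy data, not the author's actual script: the column names `text` and `label`, the helper `dedup_and_split`, and the random seed are all assumptions, since the source only says that sklearn was used with stratification by label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def dedup_and_split(df, text_col="text", label_col="label", seed=42):
    """Hypothetical helper: drop duplicate texts, then split 80/10/10."""
    # Remove repeated texts so the same example cannot land in two splits.
    df = df.drop_duplicates(subset=text_col).reset_index(drop=True)

    # Stage 1: 80% train, 20% temporary hold-out, stratified by label.
    train_df, temp_df = train_test_split(
        df, test_size=0.20, stratify=df[label_col], random_state=seed
    )
    # Stage 2: split the hold-out in half into validation and test sets.
    val_df, test_df = train_test_split(
        temp_df, test_size=0.50, stratify=temp_df[label_col], random_state=seed
    )
    return train_df, val_df, test_df


# Toy data (assumed): 100 unique texts, 35 with label 0 and 65 with label 1,
# plus 5 duplicated rows that the deduplication step should remove.
toy = pd.DataFrame(
    {"text": [f"text {i}" for i in range(100)], "label": [0] * 35 + [1] * 65}
)
toy = pd.concat([toy, toy.head(5)], ignore_index=True)

train_df, val_df, test_df = dedup_and_split(toy)
print(len(train_df), len(val_df), len(test_df))
```

Splitting in two stages, rather than in one call, is what lets both the validation and the test set keep roughly the same label ratio as the training set.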

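Split details

Split ratio: 80/10/10
Training set label distribution:
- Non-offensive texts: 10,364
- Offensive texts: 18,858
Validation set label distribution:
- Non-offensive texts: 1,296
- Offensive texts: 2,357
Test set label distribution:
- Non-offensive texts: 1,295
- Offensive texts: 2,358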
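
As a quick consistency check, the per-split counts listed above can be summed and compared against the reported totals. The sketch below uses only the numbers stated in this card; the variable names are illustrative.

```python
# Label counts per split as reported in the card (0 = not offensive, 1 = offensive).
splits = {
    "train": {0: 10364, 1: 18858},
    "validation": {0: 1296, 1: 2357},
    "test": {0: 1295, 1: 2358},
}

total = sum(sum(counts.values()) for counts in splits.values())
not_offensive = sum(counts[0] for counts in splits.values())
offensive = sum(counts[1] for counts in splits.values())

for name, counts in splits.items():
    n = sum(counts.values())
    # Share of the whole dataset and share of offensive texts within the split.
    print(f"{name}: {n} texts, {n / total:.1%} of total, "
          f"{counts[1] / n:.1%} offensive")

print(total, not_offensive, offensive)
```

The sums reproduce the reported 36,528 texts (12,955 non-offensive and 23,573 offensive), and the offensive share stays close to 65% in every split, as expected from a stratified split.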

Data sources

Dataset sources: the OLID dataset (Zampieri et al., 2019) and the labels "Offensive" and "Neither" from the dataset of the paper "Automated Hate Speech Detection and the Problem of Offensive Language" (Davidson et al., 2017)
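
One way the two sources could be merged into a single binary label is sketched below. The card does not publish the merging code, so everything here is an assumption: the column names (`tweet`, `subtask_a`, `class`) and label codes (OLID `OFF`/`NOT`; Davidson et al. `0` = hate speech, `1` = offensive, `2` = neither) follow the public releases of those datasets as commonly distributed, and the helper `to_binary` and the toy rows are hypothetical.

```python
import pandas as pd


def to_binary(olid, davidson):
    """Hypothetical helper: map both sources to 0 = not offensive, 1 = offensive."""
    # OLID sub-task A (assumed column names): OFF -> 1, NOT -> 0.
    olid_part = pd.DataFrame({
        "text": olid["tweet"],
        "label": olid["subtask_a"].map({"NOT": 0, "OFF": 1}),
    })
    # Davidson et al. (assumed codes): keep only "offensive" (1) and
    # "neither" (2), dropping the hate-speech class (0).
    kept = davidson[davidson["class"].isin([1, 2])]
    davidson_part = pd.DataFrame({
        "text": kept["tweet"],
        "label": kept["class"].map({2: 0, 1: 1}),
    })
    merged = pd.concat([olid_part, davidson_part], ignore_index=True)
    # Remove texts that appear more than once across the two sources.
    return merged.drop_duplicates(subset="text").reset_index(drop=True)


# Toy stand-ins for the two source files (invented rows, not real data).
olid = pd.DataFrame({"tweet": ["a b", "c d"], "subtask_a": ["OFF", "NOT"]})
davidson = pd.DataFrame(
    {"tweet": ["e f", "g h", "c d", "i j"], "class": [0, 1, 2, 2]}
)

combined = to_binary(olid, davidson)
print(combined)
```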
