
odegiber/hate_speech18

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/odegiber/hate_speech18
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- intent-classification
paperswithcode_id: hate-speech
pretty_name: Hate Speech
dataset_info:
  features:
  - name: text
    dtype: string
  - name: user_id
    dtype: int64
  - name: subforum_id
    dtype: int64
  - name: num_contexts
    dtype: int64
  - name: label
    dtype:
      class_label:
        names:
          '0': noHate
          '1': hate
          '2': idk/skip
          '3': relation
  splits:
  - name: train
    num_bytes: 1375340
    num_examples: 10944
  download_size: 3664530
  dataset_size: 1375340
train-eval-index:
- config: default
  task: text-classification
  task_id: multi_class_classification
  splits:
    train_split: train
  col_mapping:
    text: text
    label: target
  metrics:
  - type: accuracy
    name: Accuracy
  - type: f1
    name: F1 macro
    args:
      average: macro
  - type: f1
    name: F1 micro
    args:
      average: micro
  - type: f1
    name: F1 weighted
    args:
      average: weighted
  - type: precision
    name: Precision macro
    args:
      average: macro
  - type: precision
    name: Precision micro
    args:
      average: micro
  - type: precision
    name: Precision weighted
    args:
      average: weighted
  - type: recall
    name: Recall macro
    args:
      average: macro
  - type: recall
    name: Recall micro
    args:
      average: micro
  - type: recall
    name: Recall weighted
    args:
      average: weighted
---

# Dataset Card for [Dataset Name]

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/Vicomtech/hate-speech-dataset
- **Repository:** https://github.com/Vicomtech/hate-speech-dataset
- **Paper:** https://www.aclweb.org/anthology/W18-51.pdf
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

These files contain text extracted from Stormfront, a white supremacist forum. A random set of forum posts has been sampled from several subforums and split into sentences. Those sentences have been manually labelled as containing hate speech or not, according to certain annotation guidelines.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

English

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

- text: the provided sentence
- user_id: information that makes it possible to rebuild the conversations these sentences belong to
- subforum_id: information that makes it possible to rebuild the conversations these sentences belong to
- num_contexts: number of previous posts the annotator had to read before deciding on the category of the sentence
- label: hate, noHate, relation (the sentences in the post do not contain hate speech on their own, but the combination of several sentences does) or idk/skip (sentences that are not written in English or that do not contain enough information to be classified as hate or noHate)

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
@inproceedings{gibert2018hate,
    title = "{Hate Speech Dataset from a White Supremacy Forum}",
    author = "de Gibert, Ona and Perez, Naiara and Garc{\'\i}a-Pablos, Aitor and Cuadros, Montse",
    booktitle = "Proceedings of the 2nd Workshop on Abusive Language Online ({ALW}2)",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W18-5102",
    doi = "10.18653/v1/W18-5102",
    pages = "11--20",
}
```

### Contributions

Thanks to [@czabo](https://github.com/czabo) for adding this dataset.
Provider:
odegiber
Summary of Original Information

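Dataset Description

Dataset Summary

This dataset contains text extracted from Stormfront, a white supremacist forum. Forum posts sampled at random from several subforums were split into sentences, which were then manually labelled as containing hate speech or not, according to specific annotation guidelines.

Supported Tasks and Leaderboards

  • Task category: text classification
  • Task ID: intent classification
  • Papers with Code ID: hate-speech

Languages

English

Dataset Structure

Data Instances

  • Features:
    • text: string; the provided sentence
    • user_id: integer; used to rebuild the conversation a sentence belongs to
    • subforum_id: integer; used to rebuild the conversation a sentence belongs to
    • num_contexts: integer; number of previous posts the annotator had to read before making a decision
    • label: class label; one of noHate, hate, idk/skip, relation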
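The feature list above can be read as a simple record schema. The following is a minimal sketch, assuming the integer-to-name order given in the card metadata ('0': noHate, '1': hate, '2': idk/skip, '3': relation). The sample rows, the helper names `decode_label` and `keep_binary`, and the choice to keep only the two main classes are illustrative assumptions, not part of the dataset or its tooling.

```python
# Label names in the order listed in the dataset card metadata.
LABEL_NAMES = ["noHate", "hate", "idk/skip", "relation"]


def decode_label(label_id: int) -> str:
    """Map an integer class id to its label name."""
    return LABEL_NAMES[label_id]


def keep_binary(rows):
    """Keep only rows labelled noHate (0) or hate (1).

    Dropping idk/skip and relation is an illustrative choice, not
    something the dataset card prescribes.
    """
    return [row for row in rows if row["label"] in (0, 1)]


# Hypothetical rows with placeholder text; real rows come from the dataset.
rows = [
    {"text": "placeholder sentence A", "user_id": 1, "subforum_id": 10,
     "num_contexts": 0, "label": 0},
    {"text": "placeholder sentence B", "user_id": 2, "subforum_id": 10,
     "num_contexts": 1, "label": 1},
    {"text": "placeholder sentence C", "user_id": 3, "subforum_id": 11,
     "num_contexts": 0, "label": 2},
    {"text": "placeholder sentence D", "user_id": 3, "subforum_id": 11,
     "num_contexts": 2, "label": 3},
]

binary_rows = keep_binary(rows)
decoded = [decode_label(row["label"]) for row in rows]
```

Whether to drop or keep the idk/skip and relation rows depends on the task; the four-class setup matches the multi-class task listed in the card metadata.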

Data Splits

  • Training set:
    • name: train
    • num_bytes: 1375340
    • num_examples: 10944

Dataset Size

  • Download size: 3664530 bytes
  • Dataset size: 1375340 bytes

Train-Eval Index

  • Config: default
  • Task: text classification
  • Task ID: multi-class classification
  • Train split: train
  • Column mapping:
    • text: text
    • label: target
  • Metrics:
    • Accuracy
    • F1 macro
    • F1 micro
    • F1 weighted
    • Precision macro
    • Precision micro
    • Precision weighted
    • Recall macro
    • Recall micro
    • Recall weighted
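The three averaging modes listed above differ only in how per-class scores are combined. The following is a minimal pure-Python sketch for F1, assuming single-label multi-class predictions; the function name `f1_scores` and the toy labels are illustrative, and in practice an established metrics library would normally be used instead.

```python
from collections import Counter


def f1_scores(y_true, y_pred, labels):
    """Return (macro, micro, weighted) F1 for single-label predictions."""
    support = Counter(y_true)  # number of true examples per class
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_sum += tp
        fp_sum += fp
        fn_sum += fn
    # Macro: unweighted mean of per-class F1.
    macro = sum(per_class.values()) / len(labels)
    # Micro: F1 computed from counts pooled over all classes.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    # Weighted: mean of per-class F1 weighted by class support.
    total = sum(support[c] for c in labels)
    weighted = (
        sum(per_class[c] * support[c] for c in labels) / total if total else 0.0
    )
    return macro, micro, weighted


# Toy example with three classes (hypothetical labels, not dataset output).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 2]
macro, micro, weighted = f1_scores(y_true, y_pred, labels=[0, 1, 2])
```

With single-label data and all classes included, micro F1 equals accuracy, so macro and weighted averages are the more informative ones when classes are imbalanced.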