jtatman/civil_comments_hatebert

Name: jtatman/civil_comments_hatebert
Creator: jtatman
Published: 2023-09-06 08:15:58
License: MIT

Source: Hugging Face. Updated 2023-09-06. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/jtatman/civil_comments_hatebert

Resource description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: text_masked
    dtype: string
  - name: text_replaced
    list:
    - name: score
      dtype: float64
    - name: sequence
      dtype: string
    - name: token
      dtype: int64
    - name: token_str
      dtype: string
  splits:
  - name: train
    num_bytes: 872262083
    num_examples: 451219
  download_size: 333147199
  dataset_size: 872262083
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-classification
- text2text-generation
- fill-mask
language:
- en
tags:
- masked
- mask-scored
- comment scoring
- masked-model
pretty_name: civil comments w/hatebert scoring
size_categories:
- 100K<n<1M
---

# Dataset Card for "civil_comments_hatebert"

This is an experiment to see how "civil-comments" text can, in certain cases, be turned into offensive speech by a model with little manipulation.

This data is a reformat of the civil comments dataset: all scoring attributes for abusive speech are discarded, random tokens are masked, and the masked text is processed with hatebert to fill the masked tokens with possibly abusive language.

This sets up data for three things: fill-mask activities, text training, and scored responses based on how manipulable random tokens are according to this model.

Three columns show the progression: the original text as extracted, the randomly masked text, and the filled text, stored as a list of hatebert outputs with scores.

So far in practice, from *very* limited testing, the hatebert model mostly fills with innocuous placeholders.

Hatebert is, as it sounds, a BERT-based model trained on fill-mask activities.

[civil_comments dataset](https://huggingface.co/datasets/civil_comments)

[hatebert model](https://huggingface.co/datasets/civil_comments)

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
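The masking step described in the card ("masking random tokens") can be sketched as follows. This is only an illustration under stated assumptions, not the author's actual preprocessing: it assumes whitespace tokenization, a single mask per comment, and the BERT-style `[MASK]` placeholder. The helper name `mask_random_token` is hypothetical.

```python
import random

# Assumption: a BERT-style mask placeholder; the card does not state
# which mask string was actually used.
MASK_TOKEN = "[MASK]"


def mask_random_token(text: str, rng: random.Random) -> str:
    """Replace one randomly chosen whitespace-delimited word with MASK_TOKEN.

    Hypothetical helper: the real dataset may have masked at the
    subword level or masked several tokens per comment.
    """
    words = text.split()
    if not words:
        # Nothing to mask in an empty or whitespace-only comment.
        return text
    idx = rng.randrange(len(words))
    words[idx] = MASK_TOKEN
    return " ".join(words)


# Illustrative input, not taken from the dataset.
rng = random.Random(0)  # seeded so the sketch is reproducible
example = "Thanks for the thoughtful comment."
masked = mask_random_token(example, rng)
print(masked)
```

A seeded `random.Random` instance is passed in explicitly so that the same comment always yields the same masked position, which makes a pipeline like this easier to re-run and compare.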

Provided by:

jtatman

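Summary of original information

Dataset overview

Dataset information

Features:
- text: type string
- text_masked: type string
- text_replaced: a list whose items contain the following sub-features
  - score: type float64
  - sequence: type string
  - token: type int64
  - token_str: type string
Splits:
- train: 872262083 bytes, 451219 examples
Download size: 333147199 bytes
Dataset size: 872262083 bytes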
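To make the schema above concrete, here is a minimal sketch of selecting the highest-scoring fill from a record's `text_replaced` list. The four sub-feature names match the keys returned by the `transformers` fill-mask pipeline, which suggests (though the card does not confirm it) that each item is one pipeline candidate. The record below is hypothetical, with made-up scores and placeholder token ids, and `best_candidate` is an illustrative helper.

```python
from typing import List, Optional, TypedDict


class Candidate(TypedDict):
    """One item of `text_replaced`, following the listed sub-features."""
    score: float
    sequence: str
    token: int
    token_str: str


def best_candidate(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the highest score, or None if the list is empty."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])


# Hypothetical record: values are invented for illustration only and
# the token ids are placeholders, not real vocabulary ids.
record = {
    "text": "Thanks for the thoughtful comment.",
    "text_masked": "Thanks for the [MASK] comment.",
    "text_replaced": [
        {"score": 0.41, "sequence": "thanks for the great comment.",
         "token": 1, "token_str": "great"},
        {"score": 0.17, "sequence": "thanks for the nice comment.",
         "token": 2, "token_str": "nice"},
    ],
}

best = best_candidate(record["text_replaced"])
print(best["token_str"], best["score"])
```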

Configuration

Config name: default
Data files:
- train: path data/train-*
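Given the single `default` config and `train` split listed above, the data could plausibly be read with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository is still reachable under the same id. It streams records to avoid fetching the full download up front, and the helper name `stream_train_split` is hypothetical.

```python
from itertools import islice

# Repository id as given in the listing.
REPO_ID = "jtatman/civil_comments_hatebert"


def stream_train_split(limit: int = 3):
    """Yield up to `limit` records from the train split.

    Assumes the third-party `datasets` package is available
    (pip install datasets) and that network access is possible.
    The import is kept local so this module loads without it.
    """
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches
    # records lazily instead of downloading every shard first.
    ds = load_dataset(REPO_ID, split="train", streaming=True)
    yield from islice(ds, limit)


if __name__ == "__main__":
    for row in stream_train_split(limit=2):
        # Each row is expected to expose the three listed columns.
        print(row["text_masked"])
```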

License

License: MIT

Task categories

Text classification
Text-to-text generation
Fill-mask

Language

English

Pretty name

civil comments w/hatebert scoring

Size category

100K<n<1M
