jtatman/civil_comments_hatebert
Source: Hugging Face. Updated: 2023-09-06. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jtatman/civil_comments_hatebert
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: text_masked
    dtype: string
  - name: text_replaced
    list:
    - name: score
      dtype: float64
    - name: sequence
      dtype: string
    - name: token
      dtype: int64
    - name: token_str
      dtype: string
  splits:
  - name: train
    num_bytes: 872262083
    num_examples: 451219
  download_size: 333147199
  dataset_size: 872262083
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-classification
- text2text-generation
- fill-mask
language:
- en
tags:
- masked
- mask-scored
- comment scoring
- masked-model
pretty_name: civil comments w/hatebert scoring
size_categories:
- 100K<n<1M
---
# Dataset Card for "civil_comments_hatebert"
This is an experiment to see whether, and in which cases, a model can turn "civil-comments" text into offensive speech with little manipulation.
This data is a reformat of the civil comments dataset: all abuse-scoring attributes are discarded, random tokens are masked, and the masked text is processed with hatebert to fill the masked tokens with possibly abusive language.
This merely sets up useful data for three things: fill-mask tasks, text training, and scored responses that show how this model fills randomly masked tokens.
To show each stage of processing, three columns hold the original extracted text, the randomly masked text, and the filled text with scores, stored as a list of hatebert outputs.
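As a rough illustration of the three columns, the sketch below builds one hypothetical row shaped like the features in the metadata (`text`, `text_masked`, `text_replaced`) and picks the highest-scoring fill. The example text, scores, and token ids are made up, the `[MASK]` placeholder in `text_masked` is an assumption, and `best_fill` and `iter_rows` are illustrative helper names that are not part of the dataset.

```python
from itertools import islice
from typing import Any, Dict, Iterator, Optional

# Hypothetical row following the declared schema. The sentences, scores, and
# token ids are invented for illustration and are NOT taken from the dataset.
example_row: Dict[str, Any] = {
    "text": "Thanks for the thoughtful article.",
    "text_masked": "Thanks for the [MASK] article.",  # mask token is an assumption
    "text_replaced": [
        {"score": 0.25, "sequence": "thanks for the great article.",
         "token": 1, "token_str": "great"},
        {"score": 0.10, "sequence": "thanks for the nice article.",
         "token": 2, "token_str": "nice"},
    ],
}


def best_fill(row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the candidate with the highest score, or None if there are none."""
    candidates = row["text_replaced"]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])


def iter_rows(limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Stream a few real rows from the Hub (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    ds = load_dataset("jtatman/civil_comments_hatebert", split="train", streaming=True)
    yield from islice(ds, limit)


top = best_fill(example_row)
print(top["token_str"], top["score"])
```

Streaming avoids downloading the full train split before looking at a handful of rows; `best_fill` can be applied to each row yielded by `iter_rows`.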
In *very* limited testing so far, the hatebert model mostly fills masks with innocuous placeholders.
Hatebert is, as the name suggests, a BERT-based model trained on fill-mask activities.
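The masking and filling steps could be sketched roughly as follows. This is a guess at the general shape, not the author's actual script: whitespace tokenization, masking exactly one token per comment, and the helper names `mask_random_token` and `fill_masked` are all assumptions. The checkpoint id is left as a parameter because the card does not name one. The `transformers` fill-mask pipeline returns dictionaries with `score`, `token`, `token_str`, and `sequence` keys, which matches the sub-features of `text_replaced`.

```python
import random
from typing import Any, Dict, List, Optional

MASK_TOKEN = "[MASK]"  # mask token used by BERT-style tokenizers


def mask_random_token(text: str, rng: Optional[random.Random] = None) -> str:
    """Replace one randomly chosen whitespace-separated token with MASK_TOKEN.

    Assumption: the original masking procedure is not documented, so plain
    whitespace splitting and a single mask per comment are used here.
    """
    if rng is None:
        rng = random.Random()
    tokens = text.split()
    if not tokens:
        return text  # nothing to mask in an empty or whitespace-only string
    tokens[rng.randrange(len(tokens))] = MASK_TOKEN
    return " ".join(tokens)


def fill_masked(masked_text: str, model_name: str) -> List[Dict[str, Any]]:
    """Score candidate fills for a single-mask input with a fill-mask pipeline.

    `model_name` must be supplied by the caller. Requires `transformers`
    and downloads the model on first use.
    """
    from transformers import pipeline  # imported lazily; heavy dependency

    fill = pipeline("fill-mask", model=model_name)
    return fill(masked_text)


masked = mask_random_token("This comment is perfectly polite", random.Random(0))
print(masked)
```

Passing a seeded `random.Random` keeps the masking reproducible, which helps when comparing fills from different models on the same masked text.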
[civil_comments dataset](https://huggingface.co/datasets/civil_comments)
[hatebert model](https://huggingface.co/datasets/civil_comments)
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Provided by:
jtatman
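Summary of original information
Dataset overview
Dataset information
- Features:
  - text: string
  - text_masked: string
  - text_replaced: list with the following sub-features:
    - score: float64
    - sequence: string
    - token: int64
    - token_str: string
- Splits:
  - train: 872262083 bytes, 451219 examples
- Download size: 333147199 bytes
- Dataset size: 872262083 bytes
Configuration
- Config name: default
- Data files:
  - train: path data/train-*
License
- License: MIT
Task categories
- Text classification
- Text-to-text generation
- Fill-mask
Language
- English
Tags
- masked
- mask-scored
- comment scoring
- masked-model
Pretty name
civil comments w/hatebert scoring
Size category
- 100K<n<1M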



