
jtatman/civil_comments_hatebert

Source: Hugging Face. Updated: 2023-09-06. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jtatman/civil_comments_hatebert
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: text_masked
    dtype: string
  - name: text_replaced
    list:
    - name: score
      dtype: float64
    - name: sequence
      dtype: string
    - name: token
      dtype: int64
    - name: token_str
      dtype: string
  splits:
  - name: train
    num_bytes: 872262083
    num_examples: 451219
  download_size: 333147199
  dataset_size: 872262083
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-classification
- text2text-generation
- fill-mask
language:
- en
tags:
- masked
- mask-scored
- comment scoring
- masked-model
pretty_name: civil comments w/hatebert scoring
size_categories:
- 100K<n<1M
---

# Dataset Card for "civil_comments_hatebert"

This is an experiment to see how, in certain cases, models can turn "civil-comments" text into offensive speech without much manipulation.

This data is a reformat of the civil comments dataset: all scoring attributes for abusive speech are discarded, random tokens are masked, and the masked text is processed with hatebert to fill the masked tokens with possibly abusive language.

This merely sets up some good data for three things: fill-mask activities, text training, and scored responses that show how random tokens can be manipulated according to this model.

Three columns illustrate the progression: the original text extracted, the randomly masked text, and the filled text with scores, stored as a list of hatebert outputs.

So far in practice, from *very* limited testing, the hatebert model mostly fills with innocuous placeholders.

Hatebert is, as it sounds, a BERT-based model trained on fill-mask activities.

[civil_comments dataset](https://huggingface.co/datasets/civil_comments)
[hatebert model](https://huggingface.co/datasets/civil_comments)
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
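The card says random tokens were masked but does not say how. The following is a minimal sketch of one way such masking could work, not the author's actual procedure. It assumes whitespace tokenization and a BERT-style `[MASK]` placeholder; the helper name `mask_random_token` is hypothetical.

```python
import random

# Assumption: a BERT-style mask placeholder; the card does not state which token was used.
MASK_TOKEN = "[MASK]"


def mask_random_token(text: str, rng: random.Random) -> str:
    """Replace one randomly chosen whitespace-delimited token with MASK_TOKEN.

    Hypothetical helper: the real dataset may have masked subword tokens instead.
    """
    tokens = text.split()
    if not tokens:
        # Nothing to mask in an empty or whitespace-only string.
        return text
    index = rng.randrange(len(tokens))
    tokens[index] = MASK_TOKEN
    return " ".join(tokens)


# Illustrative input, not taken from the dataset.
example = "This is a perfectly civil comment"
masked = mask_random_token(example, random.Random(0))
print(masked)
```

Passing an explicit `random.Random` instance keeps the masking reproducible, which helps when comparing fills across model runs.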
Provider:
jtatman
Summary of original information

Dataset overview

Dataset information

  • Features:
    • text: type string
    • text_masked: type string
    • text_replaced: a list with the following sub-features
      • score: type float64
      • sequence: type string
      • token: type int64
      • token_str: type string
  • Splits:
    • train: 872262083 bytes, 451219 examples
  • Download size: 333147199 bytes
  • Dataset size: 872262083 bytes
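Each `text_replaced` entry is a list of scored candidates with the fields `score`, `sequence`, `token`, and `token_str`. As a rough sketch of how such a list might be consumed, the snippet below picks the highest-scoring candidate. The helper name `best_candidate` is hypothetical, and the sample record uses made-up values with placeholder token ids, not real rows from the dataset.

```python
from typing import List, Optional, TypedDict


class Candidate(TypedDict):
    """One scored fill, mirroring the sub-features of `text_replaced`."""
    score: float
    sequence: str
    token: int
    token_str: str


def best_candidate(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the highest score, or None for an empty list."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])


# Made-up sample for illustration only; scores and token ids are placeholders.
sample: List[Candidate] = [
    {"score": 0.27, "sequence": "this is a good comment", "token": 2, "token_str": "good"},
    {"score": 0.61, "sequence": "this is a nice comment", "token": 1, "token_str": "nice"},
]

top = best_candidate(sample)
print(top["token_str"] if top else None)
```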

Configuration

  • Config name: default
  • Data files:
    • train: path data/train-*
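With a single `default` config and a `train` split, the data could plausibly be read with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository id shown at the top of this page is still valid. The function name `stream_train_split` is hypothetical, and streaming is chosen here only to avoid fetching the full download up front.

```python
REPO_ID = "jtatman/civil_comments_hatebert"

# Column names as listed in the dataset information above.
EXPECTED_COLUMNS = ("text", "text_masked", "text_replaced")


def stream_train_split(repo_id: str = REPO_ID):
    """Return a streaming view of the train split (requires network access)."""
    # Imported lazily so that defining this function does not require the
    # third-party `datasets` package to be installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)


# Usage (fetches data on first iteration):
#   first_row = next(iter(stream_train_split()))
#   missing = [name for name in EXPECTED_COLUMNS if name not in first_row]
```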

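License

  • License: MIT

Task categories

  • Text classification
  • Text-to-text generation
  • Fill-mask

Language

  • English

Tags

  • masked
  • mask-scored
  • comment scoring
  • masked-model

Pretty name

  • civil comments w/hatebert scoring

Size category

  • 100K<n<1M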