d-llm/wildchat-toxic

Name: d-llm/wildchat-toxic
Creator: d-llm
Published: 2024-06-21 05:26:27
License: No description available

Source: Hugging Face. Updated 2024-06-21; indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/d-llm/wildchat-toxic


Resource overview:

This dataset contains multiple fields that record details of each conversation, including the conversation hash, model, timestamp, conversation content, country, hashed IP address, language, whether the content was redacted, role, state, and whether the content is toxic. It also includes content moderation results from OpenAI and Detoxify, recording the types of toxic content detected and their scores. The dataset totals 1769111498.3309424 bytes (about 1.77 GB) across 199011 examples, with a download size of 1117748557 bytes (about 1.12 GB).

Provider:

d-llm

Original information summary

Dataset overview

Dataset information

Features

conversation_hash: string
model: string
timestamp: timestamp (microsecond precision, UTC)
conversation: list
- content: string
- country: string
- hashed_ip: string
- header: struct
  - accept-language: string
  - user-agent: string
- language: string
- redacted: bool
- role: string
- state: string
- timestamp: timestamp (microsecond precision, UTC)
- toxic: bool
- turn_identifier: int64
turn: int64
language: string
openai_moderation: list
- categories: struct
  - harassment: bool
  - harassment/threatening: bool
  - harassment_threatening: bool
  - hate: bool
  - hate/threatening: bool
  - hate_threatening: bool
  - self-harm: bool
  - self-harm/instructions: bool
  - self-harm/intent: bool
  - self_harm: bool
  - self_harm_instructions: bool
  - self_harm_intent: bool
  - sexual: bool
  - sexual/minors: bool
  - sexual_minors: bool
  - violence: bool
  - violence/graphic: bool
  - violence_graphic: bool
- category_scores: struct
  - harassment: float64
  - harassment/threatening: float64
  - harassment_threatening: float64
  - hate: float64
  - hate/threatening: float64
  - hate_threatening: float64
  - self-harm: float64
  - self-harm/instructions: float64
  - self-harm/intent: float64
  - self_harm: float64
  - self_harm_instructions: float64
  - self_harm_intent: float64
  - sexual: float64
  - sexual/minors: float64
  - sexual_minors: float64
  - violence: float64
  - violence/graphic: float64
  - violence_graphic: float64
- flagged: bool
detoxify_moderation: list
- identity_attack: float64
- insult: float64
- obscene: float64
- severe_toxicity: float64
- sexual_explicit: float64
- threat: float64
- toxicity: float64
toxic: bool
redacted: bool
state: string
country: string
hashed_ip: string
header: struct
- accept-language: string
- user-agent: string
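
Because `openai_moderation` and `detoxify_moderation` are lists, a common first step is to collapse them into one conversation-level summary. The following is a minimal sketch, assuming each record arrives as a plain Python dict shaped like the feature list above and that each list holds one entry per message. The sample record, its values, and the helper name `summarize_moderation` are hypothetical and are not taken from the dataset.

```python
from typing import Any


def summarize_moderation(record: dict[str, Any]) -> dict[str, Any]:
    """Collapse per-entry moderation results into one conversation-level summary."""
    # Each openai_moderation entry has a `categories` struct (name -> bool)
    # and a `flagged` bool; collect every category that fired at least once.
    flagged_categories = set()
    for entry in record["openai_moderation"]:
        for name, hit in entry["categories"].items():
            if hit:
                flagged_categories.add(name)

    # Each detoxify_moderation entry carries float scores; keep the highest
    # `toxicity` score seen anywhere in the conversation.
    max_toxicity = max(
        (entry["toxicity"] for entry in record["detoxify_moderation"]),
        default=0.0,
    )

    return {
        "conversation_hash": record["conversation_hash"],
        "any_flagged": any(e["flagged"] for e in record["openai_moderation"]),
        "flagged_categories": sorted(flagged_categories),
        "max_detoxify_toxicity": max_toxicity,
    }


# Hypothetical record with made-up values; only the field names follow the schema.
sample = {
    "conversation_hash": "abc123",
    "openai_moderation": [
        {
            "categories": {"harassment": True, "hate": False},
            "category_scores": {"harassment": 0.91, "hate": 0.02},
            "flagged": True,
        },
        {
            "categories": {"harassment": False, "hate": False},
            "category_scores": {"harassment": 0.01, "hate": 0.01},
            "flagged": False,
        },
    ],
    "detoxify_moderation": [{"toxicity": 0.87}, {"toxicity": 0.01}],
}

summary = summarize_moderation(sample)
print(summary)
```

Note that the schema lists several OpenAI categories under two spellings (for example `harassment/threatening` and `harassment_threatening`), so a summary like this one may report both keys for the same underlying category.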

Data splits

train:
- Bytes: 1769111498.3309424
- Examples: 199011

Dataset size

Download size: 1117748557 bytes
Dataset size: 1769111498.3309424 bytes
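
As a quick sanity check, the reported sizes can be converted into more readable units. This sketch uses only the figures listed on this page; the variable names are illustrative, and the use of decimal gigabytes (10^9 bytes) is an assumption about the preferred unit.

```python
# Figures copied from the listing above.
DATASET_BYTES = 1769111498.3309424
DOWNLOAD_BYTES = 1117748557
NUM_EXAMPLES = 199011

# Decimal gigabytes (1 GB = 10**9 bytes).
dataset_gb = DATASET_BYTES / 1e9
download_gb = DOWNLOAD_BYTES / 1e9

# Average in-memory size of one example, and how much smaller the download is.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"dataset size:      {dataset_gb:.2f} GB")
print(f"download size:     {download_gb:.2f} GB")
print(f"avg bytes/example: {avg_bytes_per_example:.1f}")
print(f"download/dataset:  {download_ratio:.2f}")
```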

Configuration

config_name: default
- data_files:
  - split: train
  - path: data/train-*
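
The `default` configuration maps the `train` split to every file matching `data/train-*`. Below is a minimal sketch of how such a pattern selects shard files, followed by a loader that assumes the Hugging Face `datasets` library is installed and that the repository is still reachable. The shard filenames and both helper names are hypothetical, and `fnmatch` only approximates the glob matching that the Hub performs.

```python
from fnmatch import fnmatch

REPO_ID = "d-llm/wildchat-toxic"
TRAIN_PATTERN = "data/train-*"  # path pattern from the config above


def select_train_files(paths: list[str]) -> list[str]:
    """Return the paths that the train split's pattern would match."""
    return [p for p in paths if fnmatch(p, TRAIN_PATTERN)]


def load_train_split(streaming: bool = True):
    """Load the train split; needs `pip install datasets` and network access."""
    # Imported lazily so the rest of this sketch runs without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, "default", split="train", streaming=streaming)


# Hypothetical repository listing; the real shard names may differ.
listing = [
    "README.md",
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
]
print(select_train_files(listing))
```

Streaming is used as the default here so that records can be inspected without first downloading the full 1117748557 bytes; pass `streaming=False` to materialize the split locally instead.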
