siddharthmb/2026.transcoder-adapters.templated_chats.lmsys_lmsys-chat-1m
收藏Hugging Face2026-02-28 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/siddharthmb/2026.transcoder-adapters.templated_chats.lmsys_lmsys-chat-1m
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: conversation_id
dtype: string
- name: model
dtype: string
- name: conversation
list:
- name: content
dtype: string
- name: role
dtype: string
- name: turn
dtype: int64
- name: language
dtype: string
- name: openai_moderation
list:
- name: categories
struct:
- name: harassment
dtype: bool
- name: harassment/threatening
dtype: bool
- name: hate
dtype: bool
- name: hate/threatening
dtype: bool
- name: self-harm
dtype: bool
- name: self-harm/instructions
dtype: bool
- name: self-harm/intent
dtype: bool
- name: sexual
dtype: bool
- name: sexual/minors
dtype: bool
- name: violence
dtype: bool
- name: violence/graphic
dtype: bool
- name: category_scores
struct:
- name: harassment
dtype: float64
- name: harassment/threatening
dtype: float64
- name: hate
dtype: float64
- name: hate/threatening
dtype: float64
- name: self-harm
dtype: float64
- name: self-harm/instructions
dtype: float64
- name: self-harm/intent
dtype: float64
- name: sexual
dtype: float64
- name: sexual/minors
dtype: float64
- name: violence
dtype: float64
- name: violence/graphic
dtype: float64
- name: flagged
dtype: bool
- name: redacted
dtype: bool
- name: google/gemma-2-2b-it_templated
dtype: string
splits:
- name: train
num_bytes: 4903447865
num_examples: 1000000
download_size: 2494194946
dataset_size: 4903447865
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
数据集信息:
特征字段:
1. 对话ID(conversation_id):数据类型为字符串
2. 模型(model):数据类型为字符串
3. 对话列表(conversation):列表类型,每个列表元素包含两个子字段:
- 对话内容(content):字符串类型
- 对话角色(role):字符串类型
4. 对话轮次(turn):64位整型
5. 对话语言(language):数据类型为字符串
6. OpenAI内容审核结果(openai_moderation):列表类型,每个列表元素包含三个子字段:
- 分类标签(categories):结构体类型,包含以下布尔型字段:
骚扰(harassment)、骚扰/威胁性内容(harassment/threatening)、仇恨言论(hate)、仇恨/威胁性言论(hate/threatening)、自残(self-harm)、自残指导(self-harm/instructions)、自残意图(self-harm/intent)、色情内容(sexual)、未成年人色情(sexual/minors)、暴力内容(violence)、具象化暴力(violence/graphic)
- 分类分数(category_scores):结构体类型,对应上述分类的64位双精度浮点型分数
- 触发标记(flagged):布尔类型,标识是否触发审核拦截
7. 脱敏标记(redacted):布尔类型,标识是否已完成内容脱敏
8. google/gemma-2-2b-it_templated:字符串类型,为谷歌Gemma-2-2B-IT模板化文本
数据集划分:
训练集(train):占用存储字节数4903447865,包含1000000条样本
数据集统计:
下载大小:2494194946字节
数据集存储总大小:4903447865字节
配置信息:
默认配置(default):数据文件对应训练集划分,文件路径为data/train-*
提供机构:
siddharthmb



