SungJoo/Cradle-Dialogue
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/SungJoo/Cradle-Dialogue
Resource description:
---
dataset_info:
  features:
  - name: turn_id
    dtype: int64
  - name: text
    dtype: string
  - name: labels
    dtype: string
  splits:
  - name: train
    num_bytes: 9135003
    num_examples: 48557
  - name: validation
    num_bytes: 1252761
    num_examples: 6679
  - name: test
    num_bytes: 1903809
    num_examples: 8975
  download_size: 6552363
  dataset_size: 12291573
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
# CRADLE-Dialogue
**CRADLE-Dialogue** is a clinician-annotated benchmark for turn-level crisis detection in multi-turn mental health conversations.
> ⚠️ **Content Warning:** This dataset discusses sensitive topics including suicidal ideation, self-harm, rape, domestic violence, and child abuse.
## Overview
Real-world crisis intervention is conversational, yet prior work has focused on static texts. CRADLE-Dialogue addresses this gap with 600 expert-annotated multi-turn dialogues (8,975 turns) designed to evaluate whether models can detect *when* risk emerges across a conversation, not just *that* it exists.
## Dataset Description
- **600 dialogues** / **8,975 turns** (4,527 user turns for inference)
- Dialogues are generated by GPT-5 from real Reddit posts (sourced from [CRADLE-Bench](https://huggingface.co/datasets/SungJoo/CRADLE-Bench)), conditioned on 7 clinical scenarios and a controlled crisis-reveal timing (early / mid / late)
- Annotated at the turn level by **4 clinical experts**: two licensed psychologists, a PhD clinical postdoctoral resident, and a licensed clinical social worker
- Multi-label annotations with temporal distinction (ongoing vs. past)
## Alert–Confirm Protocol
A key contribution of this benchmark is the **Alert–Confirm** evaluation protocol:
| Label | Description |
|---|---|
| `alert_ongoing` / `alert_past` | Earliest ambiguous signal of *possible* crisis (type-agnostic) |
| `confirm_<type>_<ongoing/past>` | Turn where a specific crisis type becomes explicitly identifiable |
Each crisis event receives at most one Alert and one Confirm within a dialogue. This reflects clinical practice: intervene before risk becomes explicit.
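To make the Alert and Confirm roles concrete, the sketch below finds the earliest Alert turn and the earliest Confirm turn in one dialogue and reports the gap between them. The `alert_lead` helper, the gap measure itself, and the sample turns are hypothetical illustrations, not part of the benchmark's scoring. The sketch also assumes a single crisis event per dialogue; pairing Alerts with Confirms across several events would need rules that this card does not specify.

```python
def first_turn_with(turns, prefix):
    """Return the turn_id of the earliest turn whose labels include one
    starting with `prefix`, or None if no turn has such a label."""
    hits = [
        turn["turn_id"]
        for turn in turns
        if any(label.strip().startswith(prefix) for label in turn["labels"].split(";"))
    ]
    return min(hits) if hits else None


def alert_lead(turns):
    """Number of turns between the first Alert and the first Confirm.

    Hypothetical measure (assumes one crisis event per dialogue).
    Returns None when either label is absent.
    """
    alert = first_turn_with(turns, "alert_")
    confirm = first_turn_with(turns, "confirm_")
    if alert is None or confirm is None:
        return None
    return confirm - alert


# Made-up dialogue: only turn_id and labels matter for this sketch.
dialogue = [
    {"turn_id": 0, "labels": ""},
    {"turn_id": 2, "labels": "alert_ongoing"},
    {"turn_id": 4, "labels": ""},
    {"turn_id": 6, "labels": "confirm_SH_ongoing"},
]

lead = alert_lead(dialogue)  # 6 - 2 = 4
```

A positive value means the ambiguous signal appeared before the crisis type became explicit, which is the window in which the protocol expects intervention.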
## Crisis Types
| Label | Description |
|---|---|
| `SI_passive` | Passive suicidal ideation (wish to be dead, no plan) |
| `SI_active` | Active suicidal ideation (method, intent, or preparation) |
| `SH` | Non-suicidal self-harm |
| `DV` | Domestic violence between intimate partners |
| `CA` | Child abuse or endangerment |
| `SHA` | Sexual harassment |
| `RA` | Rape / non-consensual sexual acts with penetration |
## Data Splits
| Split | Dialogues | Turns | Labels | Source |
|---|---|---|---|---|
| `test` | 600 | 8,975 | 713 | Clinician-annotated |
| `validation` | 420 | 6,679 | 549 | GPT-5 synthetic (annotated) |
| `train` | 3,058 | 48,557 | 3,902 | GPT-5 synthetic (annotated) |
The train/validation splits are synthetically generated with structured label injection for supervised training.
## Data Fields
| Field | Type | Description |
|---|---|---|
| `dialogue_id` | string | Unique dialogue identifier |
| `turn_id` | int | Turn index within dialogue (0-indexed) |
| `text` | string | Utterance text, prefixed with `User:` or `Listener:` |
| `labels` | string | Crisis label(s) separated by `; `, or empty string for no crisis |
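Because the speaker is encoded as a prefix inside `text`, selecting the user turns used for inference requires splitting that prefix off. The following is a minimal sketch that assumes each row is a dict with the fields above; `split_speaker`, `user_turns`, and the sample rows are hypothetical names and data introduced for illustration.

```python
def split_speaker(text):
    """Split 'User: hello' into ('User', 'hello').

    Raises ValueError if the text has neither documented prefix.
    """
    for speaker in ("User", "Listener"):
        prefix = speaker + ":"
        if text.startswith(prefix):
            return speaker, text[len(prefix):].strip()
    raise ValueError(f"unexpected speaker prefix in: {text[:30]!r}")


def user_turns(rows):
    """Keep only rows spoken by the user."""
    return [row for row in rows if split_speaker(row["text"])[0] == "User"]


# Made-up rows in the documented shape.
rows = [
    {"turn_id": 0, "text": "User: I have been sleeping badly lately.", "labels": ""},
    {"turn_id": 1, "text": "Listener: That sounds exhausting.", "labels": ""},
    {"turn_id": 2, "text": "User: It has been going on for weeks.", "labels": ""},
]

selected = user_turns(rows)  # rows with turn_id 0 and 2
```

Rows obtained by iterating over a split loaded with the Hugging Face `datasets` library are dicts of this shape, so the same helpers should apply to them unchanged.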
## Label Format Examples
```
none (empty string)
alert_ongoing
alert_past
confirm_SI_passive_ongoing
confirm_SH_past
confirm_DV_ongoing; confirm_RA_past
```
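The label strings above can be turned into structured records with a small parser. This is a sketch based only on the formats shown in this card; the `CrisisLabel` type and `parse_labels` function are hypothetical names, and the parser rejects any token that does not match the documented patterns.

```python
from typing import NamedTuple, Optional

CRISIS_TYPES = {"SI_passive", "SI_active", "SH", "DV", "CA", "SHA", "RA"}
TEMPORAL = {"ongoing", "past"}


class CrisisLabel(NamedTuple):
    kind: str                   # "alert" or "confirm"
    crisis_type: Optional[str]  # None for alerts, which are type-agnostic
    temporal: str               # "ongoing" or "past"


def parse_labels(raw):
    """Parse one `labels` cell; an empty string yields an empty list."""
    parsed = []
    for token in raw.split(";"):
        token = token.strip()
        if not token:
            continue
        kind, _, rest = token.partition("_")
        # The temporal marker is always the last underscore-separated part,
        # so crisis types that contain "_" (e.g. SI_passive) stay intact.
        body, _, temporal = rest.rpartition("_")
        if kind == "alert" and body == "" and temporal in TEMPORAL:
            parsed.append(CrisisLabel("alert", None, temporal))
        elif kind == "confirm" and body in CRISIS_TYPES and temporal in TEMPORAL:
            parsed.append(CrisisLabel("confirm", body, temporal))
        else:
            raise ValueError(f"unrecognized label: {token!r}")
    return parsed


multi = parse_labels("confirm_DV_ongoing; confirm_RA_past")
# multi[0] is CrisisLabel(kind="confirm", crisis_type="DV", temporal="ongoing")
# multi[1] is CrisisLabel(kind="confirm", crisis_type="RA", temporal="past")
```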
## Dataset Statistics (Test Set)
| Statistic | Value |
|---|---|
| Total dialogues | 600 |
| Dialogues with Alert | 226 (37.7%) |
| Dialogues with Confirm | 417 (69.5%) |
| Dialogues with neither | 160 (26.7%) |
| Avg. labels per dialogue | 1.18 ± 0.89 |
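The Alert and Confirm counts sum to more than 600 because a dialogue can carry both. The arithmetic below derives the overlap by inclusion-exclusion, assuming "neither" means a dialogue with no Alert and no Confirm label; the derived counts are not reported in this card and follow only from the table.

```python
total, with_alert, with_confirm, with_neither = 600, 226, 417, 160

# Dialogues carrying at least one Alert or Confirm label.
with_any = total - with_neither                      # 440

# Inclusion-exclusion: |A and C| = |A| + |C| - |A or C|.
with_both = with_alert + with_confirm - with_any     # 203
alert_only = with_alert - with_both                  # 23
confirm_only = with_confirm - with_both              # 214

# The reported percentages are each count divided by the 600 dialogues.
percentages = [round(100 * n / total, 1) for n in (with_alert, with_confirm, with_neither)]
# percentages == [37.7, 69.5, 26.7]
```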
## Benchmark Results (Micro F1)
| Model | Turn-Level μF1 | Dialogue-Level μF1 |
|---|---|---|
| Claude-4.5-Sonnet | **56.85** | 70.11 |
| Gemini-3-Flash | 53.05 | 68.57 |
| GPT-5.1 | 48.91 | 67.00 |
| gpt-oss-120b | 47.20 | 63.22 |
| **Qwen3-32B-FT (Ours)** | **51.31** | **68.88** |
| Qwen3-32B (base) | 43.75 | 63.43 |
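Our fine-tuned model outperforms both open-source baselines (gpt-oss-120b and the base Qwen3-32B) on both metrics and is competitive with proprietary systems: it exceeds GPT-5.1 on both metrics and Gemini-3-Flash at the dialogue level, while trailing Claude-4.5-Sonnet by 5.54 points at the turn level. See the paper for full results.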
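For readers unfamiliar with the metric, the sketch below computes standard multi-label micro F1 over per-turn label sets. It is not the paper's evaluation code: the `micro_f1` helper and the sample data are hypothetical, and the dialogue-level variant shown here (taking the union of each dialogue's labels before scoring) is an assumption about how that column could be defined.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over parallel lists of label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present
        fp += len(p - g)   # labels predicted but absent
        fn += len(g - p)   # labels present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One made-up dialogue of four turns. The model raises the Alert on the
# correct turn but places the Confirm one turn late.
gold_turns = [set(), {"alert_ongoing"}, {"confirm_SH_ongoing"}, set()]
pred_turns = [set(), {"alert_ongoing"}, set(), {"confirm_SH_ongoing"}]

turn_level = micro_f1(gold_turns, pred_turns)  # tp=1, fp=1, fn=1 -> 0.5

# Assumed dialogue-level variant: score the union of labels per dialogue.
gold_dialogue = [set().union(*gold_turns)]
pred_dialogue = [set().union(*pred_turns)]
dialogue_level = micro_f1(gold_dialogue, pred_dialogue)  # 1.0
```

Under these assumptions a late Confirm counts as both a miss and a false positive at the turn level but is fully credited at the dialogue level, which is consistent with the dialogue-level scores in the table being higher than the turn-level ones.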
## Related Resources
- 📦 **Fine-tuned Model**: [SungJoo/cradle-dialogue-qwen3-32b-2epoch](https://huggingface.co/SungJoo/cradle-dialogue-qwen3-32b-2epoch)
- 📄 **CRADLE-Bench** (post-level benchmark): [SungJoo/CRADLE-Bench](https://huggingface.co/datasets/SungJoo/CRADLE-Bench)
Provider:
SungJoo



