SungJoo/Cradle-Dialogue

Name: SungJoo/Cradle-Dialogue
Creator: SungJoo
Published: 2026-04-21 07:01:37
License: No description available

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/SungJoo/Cradle-Dialogue


Description:

---
dataset_info:
  features:
  - name: turn_id
    dtype: int64
  - name: text
    dtype: string
  - name: labels
    dtype: string
  splits:
  - name: train
    num_bytes: 9135003
    num_examples: 48557
  - name: validation
    num_bytes: 1252761
    num_examples: 6679
  - name: test
    num_bytes: 1903809
    num_examples: 8975
  download_size: 6552363
  dataset_size: 12291573
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# CRADLE-Dialogue

**CRADLE-Dialogue** is a clinician-annotated benchmark for turn-level crisis detection in multi-turn mental health conversations.

> ⚠️ **Content Warning:** This dataset discusses sensitive topics including suicidal ideation, self-harm, rape, domestic violence, and child abuse.

## Overview

Real-world crisis intervention is conversational, yet prior work has focused on static texts. CRADLE-Dialogue addresses this gap with 600 expert-annotated multi-turn dialogues (8,975 turns) designed to evaluate whether models can detect *when* risk emerges across a conversation, not just *that* it exists.

## Dataset Description

- **600 dialogues** / **8,975 turns** (4,527 user turns for inference)
- Dialogues are generated from real Reddit posts (sourced from [CRADLE-Bench](https://huggingface.co/datasets/SungJoo/CRADLE-Bench)) via GPT-5, conditioned on 7 clinical scenarios and controlled crisis reveal timing (early / mid / late)
- Annotated at the turn level by **4 clinical experts**: two licensed psychologists, a PhD clinical postdoctoral resident, and a licensed clinical social worker
- Multi-label annotations with temporal distinction (ongoing vs. past)

## Alert–Confirm Protocol

A key contribution of this benchmark is the **Alert–Confirm** evaluation protocol:

| Label | Description |
|---|---|
| `alert_ongoing` / `alert_past` | Earliest ambiguous signal of *possible* crisis (type-agnostic) |
| `confirm_<type>_<ongoing/past>` | Turn where a specific crisis type becomes explicitly identifiable |

Each crisis event receives at most one Alert and one Confirm within a dialogue. This reflects clinical practice: intervene before risk becomes explicit.

## Crisis Types

| Label | Description |
|---|---|
| `SI_passive` | Passive suicidal ideation (wish to be dead, no plan) |
| `SI_active` | Active suicidal ideation (method, intent, or preparation) |
| `SH` | Non-suicidal self-harm |
| `DV` | Domestic violence between intimate partners |
| `CA` | Child abuse or endangerment |
| `SHA` | Sexual harassment |
| `RA` | Rape / non-consensual sexual acts with penetration |

## Data Splits

| Split | Dialogues | Turns | Labels | Source |
|---|---|---|---|---|
| `test` | 600 | 8,975 | 713 | Clinician-annotated |
| `validation` | 420 | 6,679 | 549 | GPT-5 synthetic (annotated) |
| `train` | 3,058 | 48,557 | 3,902 | GPT-5 synthetic (annotated) |

The train/validation splits are synthetically generated with structured label injection for supervised training.

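
The label scheme from the Alert–Confirm Protocol and Crisis Types tables can be decomposed mechanically. The following is a minimal sketch, not code shipped with the dataset: the names `CrisisLabel` and `parse_label` are hypothetical, and it assumes every label is either `alert_<temporal>` or `confirm_<type>_<temporal>` with the type codes listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Crisis type codes copied from the Crisis Types table.
CRISIS_TYPES = {"SI_passive", "SI_active", "SH", "DV", "CA", "SHA", "RA"}
TEMPORAL = {"ongoing", "past"}


@dataclass(frozen=True)
class CrisisLabel:
    """Hypothetical container for one parsed label."""
    stage: str                  # "alert" or "confirm"
    crisis_type: Optional[str]  # None for alerts, which are type-agnostic
    temporal: str               # "ongoing" or "past"


def parse_label(label: str) -> CrisisLabel:
    """Parse one label such as 'alert_past' or 'confirm_SI_passive_ongoing'."""
    stage, _, rest = label.strip().partition("_")
    # The temporal marker is the last underscore-separated token, so
    # rpartition keeps multi-token types such as 'SI_passive' intact.
    middle, _, temporal = rest.rpartition("_")
    if temporal not in TEMPORAL:
        raise ValueError(f"unknown temporal marker in {label!r}")
    if stage == "alert" and middle == "":
        return CrisisLabel("alert", None, temporal)
    if stage == "confirm" and middle in CRISIS_TYPES:
        return CrisisLabel("confirm", middle, temporal)
    raise ValueError(f"unrecognised label: {label!r}")


if __name__ == "__main__":
    print(parse_label("confirm_SI_passive_ongoing"))
    print(parse_label("alert_past"))
```

Splitting from the right for the temporal marker avoids misreading `SI_passive` and `SI_active` as two separate fields. Labels outside the documented scheme raise `ValueError` rather than being silently accepted.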
## Data Fields

| Field | Type | Description |
|---|---|---|
| `dialogue_id` | string | Unique dialogue identifier |
| `turn_id` | int | Turn index within dialogue (0-indexed) |
| `text` | string | Utterance text, prefixed with `User:` or `Listener:` |
| `labels` | string | Crisis label(s) separated by `; `, or empty string for no crisis |

## Label Format Examples

```
none (empty string)
alert_ongoing
alert_past
confirm_SI_passive_ongoing
confirm_SH_past
confirm_DV_ongoing; confirm_RA_past
```

## Dataset Statistics (Test Set)

| Statistic | Value |
|---|---|
| Total dialogues | 600 |
| Dialogues with Alert | 226 (37.7%) |
| Dialogues with Confirm | 417 (69.5%) |
| Dialogues with neither | 160 (26.7%) |
| Avg. labels per dialogue | 1.18 ± 0.89 |

## Benchmark Results (Turn-Level Micro F1)

| Model | Turn-Level μF1 | Dialogue-Level μF1 |
|---|---|---|
| Claude-4.5-Sonnet | **56.85** | 70.11 |
| Gemini-3-Flash | 53.05 | 68.57 |
| GPT-5.1 | 48.91 | 67.00 |
| gpt-oss-120b | 47.20 | 63.22 |
| **Qwen3-32B-FT (Ours)** | **51.31** | **68.88** |
| Qwen3-32B (base) | 43.75 | 63.43 |

Our fine-tuned model outperforms all open-source baselines and achieves competitive results against proprietary systems. See the paper for full results.

## Related Resources

- 📦 **Fine-tuned Model**: [SungJoo/cradle-dialogue-qwen3-32b-2epoch](https://huggingface.co/SungJoo/cradle-dialogue-qwen3-32b-2epoch)
- 📄 **CRADLE-Bench** (post-level benchmark): [SungJoo/CRADLE-Bench](https://huggingface.co/datasets/SungJoo/CRADLE-Bench)
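
For readers working with the `text` and `labels` fields described under Data Fields, the sketch below shows one way to split the speaker prefix, turn a `labels` string into a set, and compute a pooled micro-F1 over turns. The helper names (`split_labels`, `split_speaker`, `micro_f1`) are hypothetical. The scoring is a generic multi-label micro-F1 and is only an assumption about how the benchmark numbers are computed; the paper's exact protocol (for example, restricting to user turns) may differ.

```python
from typing import Sequence, Set, Tuple


def split_labels(field: str) -> Set[str]:
    """Split a `labels` string on ';'; an empty string yields an empty set."""
    return {part.strip() for part in field.split(";") if part.strip()}


def split_speaker(text: str) -> Tuple[str, str]:
    """Separate the 'User:' / 'Listener:' prefix from the utterance."""
    speaker, sep, utterance = text.partition(":")
    if not sep or speaker not in {"User", "Listener"}:
        raise ValueError(f"missing speaker prefix: {text!r}")
    return speaker, utterance.strip()


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all turns, then compute F1."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same turns")
    tp = fp = fn = 0
    for g_field, p_field in zip(gold, pred):
        g, p = split_labels(g_field), split_labels(p_field)
        tp += len(g & p)   # labels predicted and present
        fp += len(p - g)   # labels predicted but absent
        fn += len(g - p)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up label strings for illustration only, not dataset rows.
    gold = ["", "alert_ongoing", "confirm_DV_ongoing; confirm_RA_past"]
    pred = ["", "alert_ongoing", "confirm_DV_ongoing"]
    print(micro_f1(gold, pred))
```

The function returns a value in the range 0 to 1, whereas the results table reports scores on a 0 to 100 scale. Turns with an empty `labels` string contribute nothing to the counts unless a label is wrongly predicted for them.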


Provider:

SungJoo
