
SungJoo/Cradle-Dialogue

Source: Hugging Face · Updated: 2026-04-21 · Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/SungJoo/Cradle-Dialogue
Description:
---
dataset_info:
  features:
  - name: turn_id
    dtype: int64
  - name: text
    dtype: string
  - name: labels
    dtype: string
  splits:
  - name: train
    num_bytes: 9135003
    num_examples: 48557
  - name: validation
    num_bytes: 1252761
    num_examples: 6679
  - name: test
    num_bytes: 1903809
    num_examples: 8975
  download_size: 6552363
  dataset_size: 12291573
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# CRADLE-Dialogue

**CRADLE-Dialogue** is a clinician-annotated benchmark for turn-level crisis detection in multi-turn mental health conversations.

> ⚠️ **Content Warning:** This dataset discusses sensitive topics including suicide ideation, self-harm, rape, domestic violence, and child abuse.

## Overview

Real-world crisis intervention is conversational, yet prior work has focused on static texts. CRADLE-Dialogue addresses this gap with 600 expert-annotated multi-turn dialogues (8,975 turns) designed to evaluate whether models can detect *when* risk emerges across a conversation, not just *that* it exists.

## Dataset Description

- **600 dialogues** / **8,975 turns** (4,527 user turns for inference)
- Dialogues are generated from real Reddit posts (sourced from [CRADLE-Bench](https://huggingface.co/datasets/SungJoo/CRADLE-Bench)) via GPT-5, conditioned on 7 clinical scenarios and controlled crisis reveal timing (early / mid / late)
- Annotated at the turn level by **4 clinical experts**: two licensed psychologists, a PhD clinical postdoctoral resident, and a licensed clinical social worker
- Multi-label annotations with temporal distinction (ongoing vs. past)

## Alert–Confirm Protocol

A key contribution of this benchmark is the **Alert–Confirm** evaluation protocol:

| Label | Description |
|---|---|
| `alert_ongoing` / `alert_past` | Earliest ambiguous signal of *possible* crisis (type-agnostic) |
| `confirm_<type>_<ongoing/past>` | Turn where a specific crisis type becomes explicitly identifiable |

Each crisis event receives at most one Alert and one Confirm within a dialogue. This reflects clinical practice: intervene before risk becomes explicit.

## Crisis Types

| Label | Description |
|---|---|
| `SI_passive` | Passive suicidal ideation (wish to be dead, no plan) |
| `SI_active` | Active suicidal ideation (method, intent, or preparation) |
| `SH` | Non-suicidal self-harm |
| `DV` | Domestic violence between intimate partners |
| `CA` | Child abuse or endangerment |
| `SHA` | Sexual harassment |
| `RA` | Rape / non-consensual sexual acts with penetration |

## Data Splits

| Split | Dialogues | Turns | Labels | Source |
|---|---|---|---|---|
| `test` | 600 | 8,975 | 713 | Clinician-annotated |
| `validation` | 420 | 6,679 | 549 | GPT-5 synthetic (annotated) |
| `train` | 3,058 | 48,557 | 3,902 | GPT-5 synthetic (annotated) |

The train/validation splits are synthetically generated with structured label injection for supervised training.
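
The label scheme in the Alert–Confirm and Crisis Types tables above can be decoded mechanically. The following is a minimal sketch, not the authors' tooling: `ParsedLabel` and `parse_label` are hypothetical names introduced here, and the parser assumes every label follows exactly the `alert_<temporal>` or `confirm_<type>_<temporal>` pattern with the seven crisis-type codes listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Crisis-type codes and temporal markers as listed in the tables above.
CRISIS_TYPES = {"SI_passive", "SI_active", "SH", "DV", "CA", "SHA", "RA"}
TEMPORAL = {"ongoing", "past"}


@dataclass(frozen=True)
class ParsedLabel:
    """Hypothetical container for one decoded label."""
    stage: str                  # "alert" or "confirm"
    crisis_type: Optional[str]  # None for alerts, which are type-agnostic
    temporal: str               # "ongoing" or "past"


def parse_label(label: str) -> ParsedLabel:
    """Decode a single label string; raise ValueError if it does not fit the scheme."""
    stage, _, rest = label.strip().partition("_")
    if stage == "alert":
        if rest not in TEMPORAL:
            raise ValueError(f"unexpected alert label: {label!r}")
        return ParsedLabel("alert", None, rest)
    if stage == "confirm":
        # The crisis type may itself contain an underscore (e.g. SI_passive),
        # so split off the temporal marker from the right.
        crisis_type, _, temporal = rest.rpartition("_")
        if crisis_type not in CRISIS_TYPES or temporal not in TEMPORAL:
            raise ValueError(f"unexpected confirm label: {label!r}")
        return ParsedLabel("confirm", crisis_type, temporal)
    raise ValueError(f"unknown label stage: {label!r}")


if __name__ == "__main__":
    print(parse_label("confirm_SI_passive_ongoing"))
    print(parse_label("alert_past"))
```

Splitting from the right is a deliberate choice in this sketch: it keeps two-part codes such as `SI_passive` intact without a hard-coded special case.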

## Data Fields

| Field | Type | Description |
|---|---|---|
| `dialogue_id` | string | Unique dialogue identifier |
| `turn_id` | int | Turn index within dialogue (0-indexed) |
| `text` | string | Utterance text, prefixed with `User:` or `Listener:` |
| `labels` | string | Crisis label(s) separated by `; `, or empty string for no crisis |

## Label Format Examples

```
none (empty string)
alert_ongoing
alert_past
confirm_SI_passive_ongoing
confirm_SH_past
confirm_DV_ongoing; confirm_RA_past
```

## Dataset Statistics (Test Set)

| Statistic | Value |
|---|---|
| Total dialogues | 600 |
| Dialogues with Alert | 226 (37.7%) |
| Dialogues with Confirm | 417 (69.5%) |
| Dialogues with neither | 160 (26.7%) |
| Avg. labels per dialogue | 1.18 ± 0.89 |

## Benchmark Results (Turn-Level Micro F1)

| Model | Turn-Level μF1 | Dialogue-Level μF1 |
|---|---|---|
| Claude-4.5-Sonnet | **56.85** | 70.11 |
| Gemini-3-Flash | 53.05 | 68.57 |
| GPT-5.1 | 48.91 | 67.00 |
| gpt-oss-120b | 47.20 | 63.22 |
| **Qwen3-32B-FT (Ours)** | **51.31** | **68.88** |
| Qwen3-32B (base) | 43.75 | 63.43 |

Our fine-tuned model outperforms all open-source baselines and achieves competitive results against proprietary systems. See the paper for full results.

## Related Resources

- 📦 **Fine-tuned Model**: [SungJoo/cradle-dialogue-qwen3-32b-2epoch](https://huggingface.co/SungJoo/cradle-dialogue-qwen3-32b-2epoch)
- 📄 **CRADLE-Bench** (post-level benchmark): [SungJoo/CRADLE-Bench](https://huggingface.co/datasets/SungJoo/CRADLE-Bench)
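
As a rough illustration of how the `labels` field and the turn-level micro-F1 metric fit together, the sketch below splits each turn's label string on `;` and pools true positives, false positives, and false negatives over all turns. This is one plausible reading of "turn-level micro F1", not the paper's scorer: the helper names `split_labels` and `micro_f1` are hypothetical, and details such as restricting scoring to user turns or how Alert and Confirm labels are matched are assumptions left out here.

```python
from typing import List, Set


def split_labels(field: str) -> Set[str]:
    """Split a `labels` string on ';' and drop empty entries.

    An empty string (no crisis) yields an empty set.
    """
    return {part.strip() for part in field.split(";") if part.strip()}


def micro_f1(gold: List[str], pred: List[str]) -> float:
    """Micro-averaged F1 over per-turn label sets (assumed scoring scheme)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have one entry per turn")
    tp = fp = fn = 0
    for gold_field, pred_field in zip(gold, pred):
        g, p = split_labels(gold_field), split_labels(pred_field)
        tp += len(g & p)   # labels predicted and annotated on this turn
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but missed
    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: score 0.0 when nothing is labeled or predicted.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy three-turn example built from the label formats shown above.
    gold = ["", "alert_ongoing", "confirm_DV_ongoing; confirm_RA_past"]
    pred = ["", "alert_ongoing", "confirm_DV_ongoing"]
    print(micro_f1(gold, pred))
```

In the toy example, two labels match and one gold label is missed, so the pooled counts are TP = 2, FP = 0, FN = 1.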

Provider:
SungJoo