MichiganNLP/HeadRoom

Name: MichiganNLP/HeadRoom
Creator: MichiganNLP
Published: 2024-05-20 20:53:45
License: MIT

Hugging Face | Updated 2024-05-20 | Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/MichiganNLP/HeadRoom

Resource description:

---
license: mit
task_categories:
- text-classification
- zero-shot-classification
- text-generation
language:
- en
tags:
- medical
- LLM
- depression
- race
- gender
pretty_name: HeadRoom
size_categories:
- 1K<n<10K
---

# Dataset Card for HeadRoom

## Table of Contents

- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Languages](#languages)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Additional Information](#additional-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** [HeadRoom homepage](https://github.com/MichiganNLP/depression_synthetic_data)
- **Repository:** [HeadRoom repository](https://github.com/MichiganNLP/depression_synthetic_data)
- **Paper:** [Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data](https://arxiv.org/abs/2403.16909)
- **Point of Contact:** [Shinka Mori](mailto:shinkamo@umich.edu)

### Dataset Summary

This work studies the use of GPT-3 as a synthetic data generation tool for mental health by analyzing its algorithmic fidelity, a term coined by Argyle et al. (2022) to refer to the ability of LLMs to approximate real-life text distributions. Using GPT-3, we develop HeadRoom, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). We hope our work contributes to the study of synthetic data generation and helps researchers analyze and understand how closely GPT-3 can mimic real-life depression data.

### Languages

The text in the dataset is in English.

### Supported Tasks and Leaderboards

TODO

## Additional Information

### Citation Information

```bibtex
@inproceedings{mori-etal-2024-towards-algorithmic,
    title = "Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data",
    author = "Mori, Shinka and Ignat, Oana and Lee, Andrew and Mihalcea, Rada",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1423",
    pages = "16378--16391",
    abstract = "Synthetic data generation has the potential to impact applications and domains with scarce data. However, before such data is used for sensitive tasks such as mental health, we need an understanding of how different demographics are represented in it. In our paper, we analyze the potential of producing synthetic data using GPT-3 by exploring the various stressors it attributes to different race and gender combinations, to provide insight for future researchers looking into using LLMs for data generation. Using GPT-3, we develop HeadRoom, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). Using this dataset, we conduct semantic and lexical analyses to (1) identify the predominant stressors for each demographic group; and (2) compare our synthetic data to a human-generated dataset. We present the procedures to generate queries to develop depression data using GPT-3, and conduct analyzes to uncover the types of stressors it assigns to demographic groups, which could be used to test the limitations of LLMs for synthetic data generation for depression data. Our findings show that synthetic data mimics some of the human-generated data distribution for the predominant depression stressors across diverse demographics.",
}
```

### Contributions

Thanks to [@shinka](https://github.com/ShinkaM), [@oignat](https://github.com/OanaIgnat), and [@andrew](https://ajyl.github.io/).

Provider:

MichiganNLP

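Original Information Summary

Dataset Overview

Dataset Description

Dataset Summary

Name: HeadRoom
Description: A synthetic dataset of 3,120 posts about depression-triggering stressors, generated with GPT-3 while controlling for race, gender, and time frame (before and after COVID-19).
Purpose: To study the algorithmic fidelity of GPT-3, that is, how closely it approximates real-life text distributions, particularly in the mental health domain.

Languages

Primary language: English

Supported Tasks and Leaderboards

Tasks: text classification, zero-shot classification, text generation
Status: TBD

Additional Information

Citation Information

Paper: Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data
Authors: Mori, Shinka; Ignat, Oana; Lee, Andrew; Mihalcea, Rada
Venue: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Publisher: ELRA and ICCL
Year: 2024

Contributors

Main contributors: @shinka, @oignat, @andrew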

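Curated Summary

Dataset Introduction

Construction

Against a backdrop of scarce mental health data, HeadRoom was generated with GPT-3 to explore the potential of synthetic data in the mental health domain. Its construction is framed around algorithmic fidelity: prompts control for race, gender, and time frame (before and after COVID-19), yielding 3,120 text posts about depression-triggering stressors. According to the accompanying paper, the resulting data mimics some of the human-generated distribution of predominant stressors, and the controlled design gives a structured basis for semantic analysis across demographic variables.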
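The controlled-prompt idea described above can be sketched as a grid over the three attributes. This is only an illustration under stated assumptions: the attribute values, the prompt template, and the `build_prompts` helper are hypothetical, since the source does not list the exact categories or wording the authors used.

```python
from itertools import product

# Hypothetical attribute values, for illustration only; the source does not
# list the exact categories used to build HeadRoom.
RACES = ["Asian", "Black", "Hispanic", "White"]
GENDERS = ["woman", "man", "non-binary person"]
TIME_FRAMES = ["before COVID-19", "after COVID-19"]

# Hypothetical prompt template; not the authors' actual wording.
TEMPLATE = (
    "Write a short social media post from the perspective of a {race} "
    "{gender} describing what made them feel depressed {time_frame}."
)


def build_prompts(races, genders, time_frames, per_cell=1):
    """Return one record per (race, gender, time frame, repetition)."""
    records = []
    for race, gender, time_frame in product(races, genders, time_frames):
        prompt = TEMPLATE.format(race=race, gender=gender, time_frame=time_frame)
        for sample in range(per_cell):
            records.append(
                {
                    "race": race,
                    "gender": gender,
                    "time_frame": time_frame,
                    "sample": sample,
                    "prompt": prompt,
                }
            )
    return records


prompts = build_prompts(RACES, GENDERS, TIME_FRAMES, per_cell=2)
print(len(prompts))          # number of generation requests in the grid
print(prompts[0]["prompt"])  # first prompt, ready to send to a text generator
```

Keeping the attributes alongside each prompt means every generated post can later be grouped by demographic cell without parsing the prompt text.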

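Characteristics

The defining feature of HeadRoom is demographic control across several dimensions combined with sensitivity to time. The dataset covers multiple race and gender combinations and distinguishes posts set before and after COVID-19, so it can capture how a major social event may affect the way mental health is described. The text focuses on depression-related stressors, which makes it a useful corpus for studying the biases and limitations of language models when they imitate expressions of real psychological states. The dataset is in English, moderate in size (3,120 posts), and suited to NLP tasks such as classification, generation, and zero-shot learning.

Usage

The dataset is mainly intended for NLP research on mental health, in particular for evaluating algorithmic fidelity and analyzing the quality of synthetic data. Using semantic and lexical analyses, researchers can compare how stressors are distributed in the synthetic data versus real human-generated data, and from there examine how well a language model reproduces the experiences of specific demographic groups. The dataset supports text classification, zero-shot classification, and text generation, and can serve as an experimental basis for work on depression intervention systems, bias detection, and cross-cultural mental health research.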
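A minimal sketch of a first step in such an analysis: loading the data with the Hugging Face `datasets` library and counting posts per demographic group. The repository id comes from the listing above; the split name `"train"` and the column names `race` and `gender` are assumptions, so inspect the loaded dataset's features before relying on them. The helper `count_by_group` is hypothetical and is shown here on invented rows.

```python
from collections import Counter


def load_headroom(split="train"):
    """Load HeadRoom from the Hugging Face Hub.

    Requires network access and the `datasets` package. The split name is an
    assumption; check the dataset viewer for the splits that actually exist.
    """
    from datasets import load_dataset  # lazy import keeps the helper below usable offline

    return load_dataset("MichiganNLP/HeadRoom", split=split)


def count_by_group(rows, keys):
    """Count rows per combination of the given column names."""
    counts = Counter()
    for row in rows:
        counts[tuple(row[key] for key in keys)] += 1
    return counts


# Invented rows with assumed column names, standing in for real records.
toy_rows = [
    {"race": "A", "gender": "woman", "text": "example post one"},
    {"race": "A", "gender": "woman", "text": "example post two"},
    {"race": "B", "gender": "man", "text": "example post three"},
]
group_counts = count_by_group(toy_rows, keys=("race", "gender"))
print(group_counts)
```

With the real data, `count_by_group(load_headroom(), keys=(...))` would show whether posts are balanced across groups, once the actual column names are substituted.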

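Background and Challenges

Background Overview

In mental health research, obtaining real and representative data often runs into serious ethical and privacy constraints, which limits the development and evaluation of NLP models. In response, the MichiganNLP group at the University of Michigan introduced HeadRoom in 2024 to explore the potential of large language models for generating synthetic mental health data. The dataset was generated with GPT-3 and contains 3,120 texts about depression-triggering stressors, with explicit control over race, gender, and time frame (before and after COVID-19). Its central research question is the algorithmic fidelity of synthetic data, meaning the model's ability to approximate real text distributions, with the aim of providing new resources and insight for research in a sensitive, data-scarce domain.

Current Challenges

HeadRoom addresses the scarcity and limited representativeness of data in the mental health domain; its core task is to evaluate the algorithmic fidelity of synthetic text when it imitates the real distribution of depression-related stressors. The main construction challenge is prompt design: the generated content should show reasonable diversity and realism across demographic dimensions such as race and gender without amplifying the social biases already present in the model. Validating the result is also demanding, because it requires comparing the generated data with human-written data at both the semantic and the lexical level to check how similar the distributions are.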
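One simple way to quantify lexical similarity between a synthetic and a human-written corpus is the Jensen-Shannon divergence between their unigram distributions. The sketch below is a generic illustration, not the analysis used in the paper; the tokenizer is deliberately crude, and the example sentences are invented.

```python
import math
import re
from collections import Counter


def unigram_distribution(texts):
    """Return relative word frequencies over a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint vocabulary.
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL(a || mix); terms with a[w] == 0 contribute nothing.
        return sum(a[w] * math.log2(a[w] / mix[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Invented example sentences standing in for the two corpora.
synthetic_posts = ["I lost my job and I feel hopeless", "money worries keep me awake"]
human_posts = ["lost my job last month and feel so alone", "I can't sleep because of money"]

score = js_divergence(
    unigram_distribution(synthetic_posts),
    unigram_distribution(human_posts),
)
print(f"JS divergence: {score:.3f}")
```

Lower values indicate more similar word usage. Computing the score per demographic group, rather than over the whole corpus, would show where the synthetic text drifts furthest from the human-written reference.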

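Common Scenarios

Typical Use Case

At the intersection of mental health and computational linguistics, HeadRoom is a resource for studying synthetic data generation. By controlling for race, gender, and time frame (before and after COVID-19), it provides 3,120 synthetic texts about depression-triggering stressors. Its typical use is to evaluate the algorithmic fidelity of large language models when they imitate the distribution of real depression data. Researchers run semantic and lexical analyses on the dataset to examine how stressors are expressed across demographic groups, and so assess both the reliability and the limitations of synthetic data in mental health research.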

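Derived and Related Work

HeadRoom has motivated further work on synthetic data generation and mental health representation. Follow-up directions include extending the evaluation framework for algorithmic fidelity and comparing the bias and diversity of different large language models when they generate depression data. The dataset has also prompted linguistic analysis of how depression is expressed across cultures and genders, supporting improvements in the fairness and interpretability of mental health NLP models and contributing methodological groundwork for the use of synthetic data in social science and clinical research.

Recent Research on the Dataset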