Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs
Source: DataONE | Updated: 2025-11-19 | Indexed: 2025-11-29
Download link:
https://search.dataone.org/view/sha256:f5cad7f08f6f3d0b0e31f2acb17fe00d1e42734e9952b7f5ceb5b089ba61c04d
Resource description:
The large language model (LLM) "chatbot" product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. As LLM chatbots are trained to align with user input and generate encouraging responses, they may have difficulty appropriately responding to psychotic content. To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms, we conducted a cross-sectional, experimental study of how multiple versions of the ChatGPT product respond to psychotic and control prompts, with blind clinician ratings of response appropriateness. We found that all three tested versions of ChatGPT were much more likely to generate inappropriate responses to psychotic prompts than to control prompts, with the "Free" product showing the poorest performance. In an exploratory analysis, prompts reflecting grandiosit...

We created 79 psychotic prompts: first-person statements that an individual experiencing psychosis could plausibly make to ChatGPT. Each reflected one of the five positive symptom domains assessed by the Structured Interview for Psychosis-Risk Syndromes (SIPS): unusual thought content/delusional ideas (n = 16), suspiciousness/persecutory ideas (n = 17), grandiose ideas (n = 15), perceptual disturbances/hallucinations (n = 15), and disorganized communication (n = 16). For each psychotic prompt, we created a corresponding control prompt similar in length, sentence structure, and content but without psychotic elements. This yielded a total of 158 unique prompts. On August 28 and 29, 2025, we presented these prompts to three versions of the ChatGPT product: GPT-5 Auto (the paid default at the time of the experiment), GPT-4o (the previous paid default), and "Free" (the version accessible without a subscription or account), yielding 474 prompt-response pairs.
Two primary raters assigned an "appropriateness" r...

# Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs
Dataset DOI: [10.5061/dryad.x0k6djj00](https://doi.org/10.5061/dryad.x0k6djj00)
## Description of the data and file structure
This dataset contains numerical ratings of prompt-response pairs from our study and can be used to reproduce our analyses. Note that the literal text of the prompts and model responses is not provided here, but it is available from the corresponding author on reasonable request.
### Files and variables
#### File: llm_psychosis_numeric_ratings.csv
**Description:** This CSV file contains all numeric appropriateness ratings assigned to prompt-response pairs in a "long" format. The 1592 rows comprise 474 ratings from each of the two primary raters (948 in total), 474 derived consensus ratings, and 170 ratings from a secondary rater. The seven columns are described below.
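The counts quoted in this description can be cross-checked with a short script. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `expected_counts` and `check_csv` are hypothetical, and it assumes a local, UTF-8 encoded copy of the CSV with a single header row. Only the `pair_id` column name is taken from the README; no other column names are assumed.

```python
import csv
from pathlib import Path

# Counts stated in the dataset description (psychotic prompts per SIPS domain).
DOMAIN_COUNTS = {
    "unusual thought content/delusional ideas": 16,
    "suspiciousness/persecutory ideas": 17,
    "grandiose ideas": 15,
    "perceptual disturbances/hallucinations": 15,
    "disorganized communication": 16,
}
N_VERSIONS = 3     # GPT-5 Auto, GPT-4o, "Free"
N_SECONDARY = 170  # ratings from the secondary rater


def expected_counts():
    """Derive the expected prompt, pair, and row counts from the stated design."""
    psychotic = sum(DOMAIN_COUNTS.values())  # one prompt set per domain
    prompts = 2 * psychotic                  # each psychotic prompt has a matched control
    pairs = prompts * N_VERSIONS             # every prompt is sent to every version
    rows = 2 * pairs + pairs + N_SECONDARY   # two primary raters + consensus + secondary
    return {"psychotic": psychotic, "prompts": prompts, "pairs": pairs, "rows": rows}


def check_csv(path):
    """Return (n_rows, n_columns, has_pair_id) for a local copy of the CSV.

    Assumes the first line is a header row (an assumption, not confirmed here).
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header), "pair_id" in header
```

With a downloaded copy in the working directory, `check_csv("llm_psychosis_numeric_ratings.csv")` would be expected to return `(1592, 7, True)` if the file matches this description, and `expected_counts()["rows"]` gives the same row total derived from the study design.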
##### Variables
* **`pair_id`**: The ID of the prompt-response pair rat...
Created:
2025-11-20



