Replication data for the paper "Leveraging Large Language Models for Comprehensive Psychological Analysis: Insights from Four Theoretical Frameworks"
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11572549
Resource description:
This is the replication data for the paper titled "Leveraging Large Language Models for Comprehensive Psychological Analysis: Insights from Four Theoretical Frameworks", submitted for blind review.
Abstract
The rapid advancement of generative Artificial Intelligence (AI) has significantly transformed various research domains. This paper introduces a novel, fully automated methodology for applying Large Language Models (LLMs) to psychological text analysis. The approach includes prompt design for zero-shot and few-shot learning, model internal consistency analysis, autonomous machine evaluation, and additional human validation. Applied to four psychological theories—Self-Determination Theory, the Big Five Personality Traits, Psychological Well-being, and Cognitive Behavioral Therapy—this methodology is tested on a dataset of 25,780 emails written by a senior executive (called Person X) over 16 years. The analysis involves extracting psychological characteristics from the emails and regressing these characteristics against personal, professional, and environmental factors. The results demonstrate that the methodology provides unique insights into the examined psychological theories, offering a detailed understanding of how various factors influence psychological states and traits over time. This research highlights the potential of LLMs in capturing and analyzing complex psychological patterns in large text corpora, contributing a robust framework for future studies and practical applications in psychological assessment and intervention. The findings underscore the transformative impact of generative AI in psychological research, opening new avenues for understanding human behavior through advanced language models.
The zipped file contains five CSV files:
Email_classification-csv: LLM (GPT-3.5 Turbo) classification of 25,780 emails for four psychological theories: SDT, Big Five, PWB and CBT.
SDT_regression_data.csv
Big_Five_regression_data.csv
PWB_regression_data.csv
CBT_regression_data.csv
For files 2-5, the dependent variable is the monthly percentage share of emails that were assigned a given value in a category of one of the four psychological theories analyzed.
A linear regression model was applied in which the dependent variable is the percentage of emails in a given month that were assigned a specific value in a specified category. For example, in the Big Five Traits model, for the Openness category, we calculated for each month the percentage of emails that exhibit High or Low openness, or None if the content of the email does not provide enough information to assess the category. Two dependent variables were created, Openness-high and Openness-low, and each was regressed on all independent variables. Regressions were not run for the None values.
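The construction of the dependent variables can be illustrated with a minimal sketch. It assumes each classified email is a record with a month label and one value per category; the field names `month` and `openness`, the toy records, and the choice to divide by all emails in the month (including None) are assumptions made for illustration, not details confirmed by the data files.

```python
from collections import Counter, defaultdict

def monthly_shares(records, category):
    """Return {month: {value: percentage}} for one category.

    `records` is an iterable of dicts with a "month" key and one key per
    category. Both field names are hypothetical, not the CSV's actual headers.
    """
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["month"]][rec[category]] += 1
    shares = {}
    for month, counter in counts.items():
        total = sum(counter.values())  # assumed denominator: all emails that month
        shares[month] = {value: 100.0 * n / total for value, n in counter.items()}
    return shares

# Toy records, not taken from the dataset.
emails = [
    {"month": "2015-03", "openness": "High"},
    {"month": "2015-03", "openness": "High"},
    {"month": "2015-03", "openness": "Low"},
    {"month": "2015-03", "openness": "None"},
    {"month": "2015-04", "openness": "None"},
    {"month": "2015-04", "openness": "Low"},
]

shares = monthly_shares(emails, "openness")
# Two dependent variables per category; the None share is not modeled.
openness_high = {m: s.get("High", 0.0) for m, s in shares.items()}
openness_low = {m: s.get("Low", 0.0) for m, s in shares.items()}
```

Months in which a value never occurs get a share of 0.0 rather than a missing entry, which keeps the two series aligned on the same set of months.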
Descriptions of independent variables:
- income_index: Person X salary income and consulting fees in a given month, normalized to [0,1].
- card_spending: Person X credit card expenditures in a given month, normalized to [0,1].
- abroad_far: dummy variable set to 1 for months when Person X worked in Central Asia.
- abroad_near: dummy variable set to 1 for months when Person X worked in another EU country.
- death_1_war: variable set to 1 in the month when Person X's father-in-law passed away. In the same month Russia invaded Ukraine. The variable was set to 0.75 in the following month, and to 0.5 in the month after that.
- death_2: variable set to 1 in the month when Person X's mother passed away. The variable was set to 0.75 in the following month, and to 0.5 in the month after that.
- court_case: dummy variable set to 1 for months with the emotionally engaging inheritance court case involving other family members.
- BIG4_partner: dummy variable set to 1 for months when Person X worked as a partner in a Big Four accounting firm, which meant adopting a professional activity sharply different from Person X's usual habits.
- AI_company: dummy variable set to 1 for months when Person X worked as C-level executive at a company specializing in artificial intelligence.
- elections: dummy variable set to 1 for months when Person X ran unsuccessfully in parliamentary elections.
- covid_lockdown: dummy variable set to 1 for months when the Polish government imposed tough measures during two covid lockdowns.
- no_receive: number of different email recipients each month, normalized to [0,1].
- avg_length: average number of words in emails sent each month, normalized to [0,1].
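Two of the constructions in the list above can be sketched in a few lines: rescaling a monthly series to [0,1], and building an event variable that equals 1 in the event month and then 0.75 and 0.5 in the two following months. The description does not say which normalization was used, so min-max scaling is an assumption here, and the function names and toy numbers are hypothetical.

```python
def min_max(values):
    """Rescale numbers to [0, 1] by min-max scaling (assumed method).

    A constant series is mapped to all zeros to avoid division by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def decaying_event(n_months, event_index, weights=(1.0, 0.75, 0.5)):
    """Series that is 1 in the event month, 0.75 and 0.5 in the next two, else 0."""
    series = [0.0] * n_months
    for offset, weight in enumerate(weights):
        if event_index + offset < n_months:  # do not write past the sample end
            series[event_index + offset] = weight
    return series

# Toy monthly values, not taken from the dataset.
income_index = min_max([2000, 3000, 6000])
death_2 = decaying_event(n_months=6, event_index=2)
```

With these toy inputs, `income_index` is `[0.0, 0.25, 1.0]` and `death_2` is `[0.0, 0.0, 1.0, 0.75, 0.5, 0.0]`.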
While the email data was collected for the period from January 2008 to March 2024, financial data was available only from October 2009. In some months no emails with more than 10 words were sent, which leaves 166 monthly observations for the regressions, before removing outliers.
Independent variables were tested for multicollinearity, outlier months were removed, regressions were estimated with robust standard errors, and a range of standard tests were conducted for normality and autocorrelation of residuals, confirming good statistical properties of estimated models.
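For readers who want to reproduce this kind of estimation, the sketch below shows one common way to compute ordinary least squares with heteroskedasticity-robust standard errors and variance inflation factors using NumPy. The description does not state which robust estimator or multicollinearity test was used, so the HC1 estimator and the VIF check are assumptions, and the data are synthetic (only the sample size of 166 mirrors the text).

```python
import numpy as np

def ols_robust(X, y):
    """OLS coefficients with HC1 robust standard errors (assumed estimator).

    X: (n, k) regressors without a constant; y: (n,) dependent variable.
    Returns (beta, se), each of length k + 1 with the intercept first.
    """
    n = X.shape[0]
    Xc = np.column_stack([np.ones(n), X])           # prepend the intercept
    p = Xc.shape[1]
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    beta = XtX_inv @ Xc.T @ y
    resid = y - Xc @ beta
    meat = Xc.T @ (Xc * resid[:, None] ** 2)        # sum of e_i^2 * x_i x_i'
    cov = XtX_inv @ meat @ XtX_inv * n / (n - p)    # HC1 small-sample scaling
    return beta, np.sqrt(np.diag(cov))

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) on the other columns."""
    n, k = X.shape
    factors = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        centered = X[:, j] - X[:, j].mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data only: 166 months, three regressors already scaled to [0, 1].
rng = np.random.default_rng(0)
X = rng.random((166, 3))
y = 10.0 + X @ np.array([5.0, -3.0, 0.0]) + rng.normal(0.0, 1.0, 166)

beta, se = ols_robust(X, y)
vifs = vif(X)
```

A VIF close to 1 indicates that a regressor is nearly uncorrelated with the others; larger values flag potential multicollinearity. In practice a statistics library would normally be used instead of hand-written matrix algebra.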
Due to privacy concerns, the email texts cannot be publicly shared. However, the classifications of psychological categories derived from the email texts, along with all other relevant data, are made publicly available in this open access repository, with the consent of the email author.
Created:
2024-06-11



