Psychological profiling from digital traces: A case study on AI-driven longitudinal analysis of personal email communications
Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/15006845
Resource description:
This is the replication dataset for the paper titled "Psychological profiling from digital traces: A case study on AI-driven longitudinal analysis of personal email communications", submitted for blind review.
Abstract
The rapid advancement of generative Artificial Intelligence (AI) has significantly expanded opportunities for psychological research by enabling automated analysis of digital communication. This paper introduces a novel, fully automated methodology for applying Large Language Models (LLMs) to psychological text analysis, ensuring rigor through internal consistency testing, machine evaluation, and human validation. The study develops a framework for extracting psychological traits from long-form digital text and applies it across four psychological theories - Self-Determination Theory, the Big Five Personality Traits, Psychological Well-being, and Cognitive Behavioral Therapy - using a 16-year longitudinal dataset of 25,780 emails. The methodology is validated through a multi-step process, including inter-rater reliability measures and benchmarking against self-reported psychological assessments. Results confirm that LLMs can provide consistent and interpretable psychological profiling, demonstrating a structured approach that extends beyond individual-level analysis. By integrating computational psychometrics with human-computer interaction research, this study establishes a scalable method for psychological assessment from digital traces. The findings underscore the potential of generative AI to enhance behavioral research, offering a replicable framework for future studies in automated psychological analysis.
The zipped file contains five csv files:
Email_classification.csv: LLM (GPT-3.5 Turbo) classification of 25,780 emails for four psychological theories: SDT, Big Five, PWB and CBT.
SDT_regression_data.csv
Big_Five_regression_data.csv
PWB_regression_data.csv
CBT_regression_data.csv
For files 2-5 (the four regression data files), the dependent variable is the monthly percentage share of emails that were assigned a given value for the categories of one of the four psychological theories analyzed.
A linear regression model was applied in which the dependent variable is the percentage of emails in a given category that were assigned a specific value in that category. For example, in the Big Five Traits Model, for the Openness category, we calculated for each month the percentage of emails that exhibit High or Low openness, or None if the content of the email does not provide enough information to assess whether the category is relevant. Two dependent variables were created, Openness-high and Openness-low, and each was regressed on all independent variables. Regressions were not run for the None values.
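The monthly share calculation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the field names `month` and `openness` are hypothetical and may not match the column names in the CSV files, and the sketch assumes that the denominator is all classified emails in the month, including those labelled None.

```python
from collections import Counter, defaultdict

LABELS = ("High", "Low", "None")

def monthly_shares(rows, category):
    """Return {month: {label: percentage of that month's emails with the label}}.

    rows: iterable of dicts holding a 'month' key and a label under `category`
    (both field names are hypothetical, chosen for illustration only).
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["month"]][row[category]] += 1

    shares = {}
    for month, counter in counts.items():
        total = sum(counter.values())  # assumed denominator: all emails that month
        shares[month] = {label: 100.0 * counter[label] / total for label in LABELS}
    return shares

# Made-up classifications for two months.
rows = [
    {"month": "2020-01", "openness": "High"},
    {"month": "2020-01", "openness": "High"},
    {"month": "2020-01", "openness": "Low"},
    {"month": "2020-01", "openness": "None"},
    {"month": "2020-02", "openness": "Low"},
    {"month": "2020-02", "openness": "None"},
]
shares = monthly_shares(rows, "openness")
# Only the High and Low shares become dependent variables; None is not regressed.
openness_high = {m: s["High"] for m, s in shares.items()}
openness_low = {m: s["Low"] for m, s in shares.items()}
```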
Descriptions of independent variables:
- income_index: Person X's salary income and consulting fees in a given month, normalized to [0,1].
- card_spending: Person X's credit card expenditures in a given month, normalized to [0,1].
- abroad_far: dummy variable set to 1 for months when Person X worked in Central Asia.
- abroad_near: dummy variable set to 1 for months when Person X worked in another EU country.
- death_1_war: variable set to 1 in the month when Person X's father-in-law passed away. Russia invaded Ukraine in the same month. The variable was set to 0.75 in the following month and to 0.5 in the month after that.
- death_2: variable set to 1 in the month when Person X's mother passed away. The variable was set to 0.75 in the following month and to 0.5 in the month after that.
- court_case: dummy variable set to 1 for months with an emotionally engaging inheritance court case involving other family members.
- BIG4_partner: dummy variable set to 1 for months when Person X worked as a partner at a Big Four accounting firm, which meant adopting a professional activity sharply different from Person X's usual habits.
- AI_company: dummy variable set to 1 for months when Person X worked as a C-level executive at a company specializing in artificial intelligence.
- elections: dummy variable set to 1 for months when Person X ran unsuccessfully in parliamentary elections.
- covid_lockdown: dummy variable set to 1 for months when the Polish government imposed strict measures during two COVID-19 lockdowns.
- no_receive: number of different email recipients each month, normalized to [0,1].
- avg_length: average number of words in emails sent each month, normalized to [0,1].
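Two of the constructions in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the description says only that some variables are "normalized to [0,1]", so min-max scaling is assumed here, and the helper names `min_max_normalize` and `event_decay` are hypothetical. The decay helper also assumes the month list contains consecutive calendar months.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1.

    Min-max scaling is an assumption; the dataset description does not
    state which normalization was used.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def event_decay(months, event_month, weights=(1.0, 0.75, 0.5)):
    """Build a series that is 1 in the event month, 0.75 and 0.5 in the two
    following months, and 0 elsewhere (the pattern given for death_1_war
    and death_2). Assumes `months` lists consecutive calendar months.
    """
    series = [0.0] * len(months)
    start = months.index(event_month)
    for offset, weight in enumerate(weights):
        if start + offset < len(months):
            series[start + offset] = weight
    return series

# Illustrative values only.
scaled = min_max_normalize([10, 20, 30])
months = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]
decay = event_decay(months, "2022-02")
```

Plain 0/1 dummies such as abroad_far or court_case need no helper: they are 1 in the listed months and 0 otherwise.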
While the email data was collected for the January 2008 – March 2024 period, financial data was available only from October 2009. In some months no emails with more than 10 words were sent; excluding those months yielded 166 monthly observations for the regressions, before outlier removal.
Independent variables were tested for multicollinearity, outlier months were removed, regressions were estimated with robust standard errors, and a range of standard tests for normality and autocorrelation of residuals were conducted, confirming good statistical properties of the estimated models.
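To make "robust standard errors" concrete, the sketch below fits an ordinary least squares line with a single regressor and computes a heteroskedasticity-robust (White, HC0) standard error for the slope. Both simplifications are assumptions made for illustration: the description does not say which robust estimator was used, and the actual models regress each dependent variable on all independent variables at once. The function name `ols_hc0` is hypothetical.

```python
import math

def ols_hc0(x, y):
    """Fit y = intercept + slope * x by OLS.

    Returns (intercept, slope, robust_se), where robust_se is the White (HC0)
    standard error of the slope. Single-regressor illustration only.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x_dev = [xi - x_bar for xi in x]
    s_xx = sum(d * d for d in x_dev)

    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / s_xx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Sandwich estimator: each squared residual is weighted by its squared
    # deviation in x, so no constant error variance is assumed.
    meat = sum((d * e) ** 2 for d, e in zip(x_dev, residuals))
    robust_se = math.sqrt(meat / s_xx ** 2)
    return intercept, slope, robust_se

# Made-up data: a normalized regressor against a monthly percentage share.
intercept, slope, robust_se = ols_hc0([0, 1, 2, 3], [0, 1, 2, 4])
```

For the full multi-regressor case, a statistics library with built-in robust covariance options would be the more practical choice than hand-rolled code.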
Due to privacy concerns, the email texts cannot be publicly shared. However, the classifications of psychological categories derived from the email texts, along with all other relevant data, are made publicly available in this open-access repository with the consent of the email author.
Created:
2025-03-12



