Supplementary file 1_Depression subtype classification from social media posts: few-shot prompting vs. fine-tuning of large language models.docx

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://figshare.com/articles/dataset/Supplementary_file_1_Depression_subtype_classification_from_social_media_posts_few-shot_prompting_vs_fine-tuning_of_large_language_models_docx/31833523

Resource description:

Background: Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due to short, noisy text, overlapping symptomatology, and labeling bias. Large language models (LLMs) are increasingly used in mental health for tasks such as symptom extraction, risk screening, and triage, yet their reliability for fine-grained depression subtype classification from brief social media posts remains underexplored.

Objective: We benchmarked few-shot, prompt-only LLMs against parameter-efficient fine-tuned encoders for identifying depression subtypes in posts on X (formerly Twitter).

Methods: We used a curated dataset of 14,983 English-language tweets stratified into six clinically grounded categories: five depression subtypes (postpartum, major, bipolar, psychotic, atypical) and a no-depression class. We compared (i) instruction-tuned causal LLMs in a few-shot setting and (ii) supervised fine-tuning of transformer encoders (e.g., RoBERTa, DeBERTa, BERTweet) under identical splits and metrics. The primary evaluation metric was macro-F1 (with accuracy, precision, recall as secondary). We also report per-class precision, recall, and F1 scores, along with confusion matrices, for the best-performing model from each model family.

Results: Few-shot LLMs achieved macro-F1 = 0.73–0.77 (best: Llama-3-8B, accuracy 0.75). Fine-tuned encoders consistently outperformed prompt-only models, reaching macro-F1 = 0.94–0.96 (best: RoBERTa-large, accuracy 0.954). Relative improvements were largest for the clinically challenging classes. Fine-tuning increased F1 for postpartum and psychotic subtypes to ≈0.99 (substantially above few-shot) and boosted major-depression recall from ≈0.53–0.60 to ≈0.95–0.97. Error analyses showed prompt-only models frequently misclassified major and atypical depression as bipolar, patterns substantially reduced by fine-tuning.

Conclusions: On tweet-level depression subtyping, task-specific adaptation via fine-tuning yields substantially higher and more stable performance than few-shot prompting, particularly for nuanced, clinically anchored classes. These findings recommend fine-tuned encoders as strong, compute-efficient baselines for depression subtype classification from social media.
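
The description above names macro-F1 as the primary metric, with per-class precision, recall, and F1 reported alongside it. As a minimal sketch of how that metric is conventionally computed (this is not the authors' evaluation code; the function names, label strings, and toy predictions below are assumptions made for illustration), per-class F1 is calculated independently for each class and then averaged without weighting by class size:

```python
from collections import Counter


def per_class_prf(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        p_den = tp[c] + fp[c]
        r_den = tp[c] + fn[c]
        precision = tp[c] / p_den if p_den else 0.0
        recall = tp[c] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_prf(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


# Toy data (hypothetical, not from the dataset): one "major" post is
# misread as "bipolar", the confusion pattern the error analysis mentions.
toy_labels = ["major", "bipolar", "none"]
toy_true = ["major", "major", "bipolar", "none"]
toy_pred = ["major", "bipolar", "bipolar", "none"]

print(per_class_prf(toy_true, toy_pred, toy_labels))
print(macro_f1(toy_true, toy_pred, toy_labels))
```

Because the average is unweighted, a single confusion between two classes lowers the F1 of both, which is why macro-F1 is sensitive to errors on the harder subtypes even when overall accuracy looks acceptable.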

Created:

2026-03-23