Comparative evaluation of multilingual large language models for tobacco cessation referrals

Name: Comparative evaluation of multilingual large language models for tobacco cessation referrals
Creator: Villarreal-Zegarra, David
Published: 2025-10-10 00:00:00
License: No description available

Source: Figshare (updated 2025-10-10; indexed 2026-04-08)

Download link:

https://figshare.com/articles/dataset/Comparative_evaluation_of_multilingual_large_language_models_for_tobacco_cessation_referrals/30334312/1

Description:

Introduction: Large language models (LLMs) are increasingly incorporated into digital health interventions, including as components of conversational agents (i.e., chatbots) for tobacco cessation. However, their reliability and consistency across languages have been underexplored in population health settings.

Objective: To compare the performance of multilingual LLMs in answering tobacco cessation–related questions in English and Spanish within the context of a public health program.

Methods: We conducted a 4×2×2 factorial experiment evaluating four LLMs (ChatGPT-4o, Claude 3.5 Haiku, Gemini 2.5 Pro, and Llama 3.1-8B) using prompts and questions in English and Spanish. A total of 800 responses to program-related questions and 400 responses to off-topic questions were evaluated by bilingual human reviewers using established quality assessment criteria. Mixed-effects models tested hypotheses related to linguistic congruence, model-specific performance, and interaction effects.

Results: Responses were of higher quality when the prompt and question languages matched. Linguistic congruence increased the probability of correct language generation (RR = 1.15, 95% CI: 1.04–1.28; p = .006) and program accuracy (β = 0.14, 95% CI: 0.06–0.23; p = .001). English-language pairs produced more complete responses than Spanish pairs (β = -0.36, 95% CI: -0.48 to -0.24; p < .001). Model-specific differences were observed, with Llama 3.1-8B and ChatGPT-4o performing best under congruent conditions.

Conclusions: The performance of multilingual LLMs is influenced by linguistic congruence, particularly for underrepresented languages such as Spanish. These findings highlight the need for language-sensitive design and evaluation to ensure equitable and effective use of LLM-based chatbots in population health interventions.
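To make the 4×2×2 design concrete, the following Python sketch enumerates the experimental cells implied by the Methods: four models crossed with two prompt languages and two question languages. The field names (`prompt_language`, `question_language`, `congruent`) and the helper `design_cells` are illustrative assumptions, not names taken from the dataset, and the sketch says nothing about how responses were actually allocated to cells.

```python
from itertools import product

# Factor levels as named in the abstract.
MODELS = ["ChatGPT-4o", "Claude 3.5 Haiku", "Gemini 2.5 Pro", "Llama 3.1-8B"]
LANGUAGES = ["English", "Spanish"]


def design_cells():
    """Return one dict per cell of the model x prompt x question design.

    The key names are hypothetical; the dataset may label these
    factors differently.
    """
    cells = []
    for model, prompt_lang, question_lang in product(MODELS, LANGUAGES, LANGUAGES):
        cells.append(
            {
                "model": model,
                "prompt_language": prompt_lang,
                "question_language": question_lang,
                # Linguistic congruence: prompt and question share a language.
                "congruent": prompt_lang == question_lang,
            }
        )
    return cells


CELLS = design_cells()

if __name__ == "__main__":
    for cell in CELLS:
        print(cell)
```

Each model appears in four cells, two of which are congruent (English/English and Spanish/Spanish), which is the contrast the linguistic congruence hypotheses refer to.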

Provider:

Villarreal-Zegarra, David

Created:

2025-10-10
