Comparison of answers to questions about tobacco cessation services in English and Spanish from different large language models

Figshare. Updated 2025-10-10. Indexed 2026-04-28.

Download link:

https://figshare.com/articles/dataset/Comparative_evaluation_of_multilingual_large_language_models_for_tobacco_cessation_referrals/30334312

Resource summary:

Introduction: Large language models (LLMs) are increasingly incorporated into digital health interventions, for example as components of conversational agents (i.e., chatbots) for tobacco cessation. However, the reliability and consistency of these LLMs across languages remain underexplored in population health settings. Objective: To compare the performance of multilingual LLMs in answering user test questions about a tobacco cessation quitline in English and Spanish within the context of a public health program. Methods: We used a 4×2×2 factorial design with three factors: four LLMs (ChatGPT-4o, Claude 3.5 Haiku, Gemini 2.5 Pro, and Llama 3.1-8B), two prompt languages (English or Spanish), and two question languages (English or Spanish). We compared the quality of the responses generated by each model in both languages. A total of 800 responses to program-related user test questions and 400 responses to off-topic user test questions were evaluated by bilingual human reviewers using established quality assessment criteria. Mixed-effects models tested hypotheses related to linguistic congruence, model-specific performance, and interaction effects. Results: Interrater agreement across the criteria ranged from 81.4% to 100%. When the language of the question matched the language of the prompt, the probability of correct language generation was 15% higher (RR = 1.15, 95% CI: 1.04–1.28; p = .006) and program accuracy was 0.14 points higher on a 1 to 5 Likert scale (β = 0.14, 95% CI: 0.06–0.23; p = .001). English-prompted LLMs presented with questions in English achieved mean Likert scores 0.36 points higher than Spanish-prompted LLMs presented with questions in Spanish (β = –0.36, 95% CI: –0.48 to –0.24; p
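The 4×2×2 factorial design described in the Methods yields 16 combinations of model, prompt language, and question language. The following minimal Python sketch enumerates those cells and flags whether the prompt and question languages match, which is the linguistic-congruence contrast. The variable and function names are illustrative and are not taken from the dataset, and the per-cell response counts assume a balanced design, which the summary does not state.

```python
from itertools import product

# Factor levels as listed in the Methods.
MODELS = ["ChatGPT-4o", "Claude 3.5 Haiku", "Gemini 2.5 Pro", "Llama 3.1-8B"]
LANGUAGES = ["English", "Spanish"]


def design_cells():
    """Enumerate every (model, prompt language, question language) cell."""
    return [
        {
            "model": model,
            "prompt_lang": prompt_lang,
            "question_lang": question_lang,
            # Congruent means the question language matches the prompt language.
            "congruent": prompt_lang == question_lang,
        }
        for model, prompt_lang, question_lang in product(MODELS, LANGUAGES, LANGUAGES)
    ]


cells = design_cells()
n_cells = len(cells)

# Assumption: responses are spread evenly across cells (not stated in the summary).
program_per_cell = 800 // n_cells
off_topic_per_cell = 400 // n_cells

print(n_cells, program_per_cell, off_topic_per_cell)  # prints: 16 50 25
```

Under the balanced-design assumption, each cell would hold 50 program-related and 25 off-topic responses, and half of the 16 cells are language-congruent.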

Created:

2025-10-10