Comparison of answers to questions about tobacco cessation services in English and Spanish from different large language models
Source: DataCite Commons. Updated: 2026-02-12. Indexed: 2026-05-03
Download link:
https://figshare.com/articles/dataset/Comparative_evaluation_of_multilingual_large_language_models_for_tobacco_cessation_referrals/30334312
Description:
Introduction: Large language models (LLMs) are increasingly incorporated into digital health interventions, for example as components of conversational agents (i.e., chatbots) for tobacco cessation. However, the reliability and consistency of these LLMs across languages remain underexplored in population health settings.
Objective: To compare the performance of multilingual LLMs in answering user test questions about a tobacco cessation quitline in English and Spanish, within the context of a public health program.
Methods: We used a 4×2×2 factorial design with three factors: four LLMs (ChatGPT-4o, Claude 3.5 Haiku, Gemini 2.5 Pro, and Llama 3.1-8B), two prompt languages (English or Spanish), and two question languages (English or Spanish). We compared the quality of the responses generated by each model in both languages. A total of 800 responses to program-related user test questions and 400 responses to off-topic user test questions were evaluated by bilingual human reviewers using established quality assessment criteria. Mixed-effects models tested hypotheses related to linguistic congruence, model-specific performance, and interaction effects.
Results: Interrater agreement across the criteria ranged from 81.4% to 100%. When the language of the question matched the language of the prompt, the probability of correct language generation was 15% higher (RR = 1.15, 95% CI: 1.04–1.28; p = .006) and program accuracy was 0.14 points higher on a 1-to-5 Likert scale (β = 0.14, 95% CI: 0.06–0.23; p = .001). English-prompted LLMs presented with questions in English achieved mean Likert scores 0.36 points higher than Spanish-prompted LLMs presented with questions in Spanish (β = –0.36, 95% CI: –0.48 to –0.24; p < .001). Model-specific differences were observed, with Llama 3.1-8B and ChatGPT-4o showing the best performance under congruent conditions.
Conclusions: Our findings suggest that the performance of multilingual LLMs is influenced by linguistic congruence between the custom LLM prompt and the user's question. These findings highlight the need for language-sensitive design and evaluation to ensure equitable and effective use of LLM-based chatbots in population health interventions.
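For readers working with the dataset, the 4×2×2 design described in the Methods can be enumerated as in the minimal sketch below. It only lists the experimental conditions and flags which ones are linguistically congruent (question language equal to prompt language). The field names (`model`, `prompt_language`, `question_language`, `congruent`) are illustrative assumptions and are not taken from the dataset's actual column names.

```python
from itertools import product

# Factor levels as named in the Methods.
MODELS = ["ChatGPT-4o", "Claude 3.5 Haiku", "Gemini 2.5 Pro", "Llama 3.1-8B"]
LANGUAGES = ["English", "Spanish"]


def design_cells():
    """Enumerate every model x prompt-language x question-language condition.

    The dictionary keys are hypothetical labels chosen for this sketch,
    not the column names used in the published dataset.
    """
    cells = []
    for model, prompt_lang, question_lang in product(MODELS, LANGUAGES, LANGUAGES):
        cells.append(
            {
                "model": model,
                "prompt_language": prompt_lang,
                "question_language": question_lang,
                # Congruent means the question is asked in the prompt's language.
                "congruent": prompt_lang == question_lang,
            }
        )
    return cells


if __name__ == "__main__":
    for cell in design_cells():
        print(cell)
```

Half of the enumerated conditions are congruent and half are incongruent, which is the contrast the congruence hypotheses in the abstract rely on.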
Provider:
figshare
Created:
2025-10-10



