Exploring Large Language Models' Responses to Moral Reasoning Dilemmas

DataONE | Updated 2025-05-31 | Indexed 2025-11-01

Download link:

https://search.dataone.org/view/sha256:bad34051435999bf29745a547cdcb0b32c81f9d2073df516161b3f4c3481f78d

Description:

This study investigates how various large language models (LLMs) respond to moral reasoning dilemmas. It examines LLM-generated responses using the Defining Issues Test (DIT-2) and the Intermediate Concepts Measure (ICM) for Educational Leaders. Using a neo-Kohlbergian approach to moral reasoning, the study evaluates responses from multiple LLM platforms: ChatGPT-3.5, ChatGPT-4, ChatGPT-4O, Grok Premium Plus, Claude 3.5 Sonnet, Gemini, and Gemini Advanced. For DIT-2, Claude 3.5 Sonnet achieved the highest post-conventional moral reasoning score (P-score) and N2 score (P-score 72, N2 score 71.10), followed by Gemini Advanced (P-score 64, N2 score 60.31) and Gemini (P-score 58, N2 score 52.11). The other LLMs scored as follows: Grok (P-score 48, N2 score 47.98), ChatGPT-4O (P-score 44, N2 score 55.07), ChatGPT-4 (P-score 44, N2 score 46.53), and ChatGPT-3.5 (P-score 18, N2 score 36.20). For the ICM Educational Leaders version, Gemini Advanced had the highest total ICM score (0.90), followed by Claude 3.5 Sonnet and Gemini (both 0.86), ChatGPT-4O and ChatGPT-4 (both 0.78), Grok (0.61), and ChatGPT-3.5 (0.32). The findings indicate that some LLMs can generate responses consistent with sophisticated moral reasoning patterns, producing scores comparable to or exceeding those of graduate-level human participants (whose P-scores typically range from 38.5 to 42.3). The study also provides a methodological framework, consisting of standardized assessment protocols and comparative analysis techniques, for larger-scale research to improve our understanding of AI's potential in moral reasoning.
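The scores reported in the description above can be tabulated and ranked with a few lines of Python. This is only an illustrative sketch: the values are copied from the description, while the dictionary layout and the helper names `rank_by` and `exceeds_graduate_range` are assumptions made for this example and are not part of the dataset itself.

```python
# Scores copied from the description: (DIT-2 P-score, DIT-2 N2 score, ICM total).
# The structure and names below are hypothetical, chosen for illustration only.
scores = {
    "Claude 3.5 Sonnet": (72, 71.10, 0.86),
    "Gemini Advanced": (64, 60.31, 0.90),
    "Gemini": (58, 52.11, 0.86),
    "Grok Premium Plus": (48, 47.98, 0.61),
    "ChatGPT-4O": (44, 55.07, 0.78),
    "ChatGPT-4": (44, 46.53, 0.78),
    "ChatGPT-3.5": (18, 36.20, 0.32),
}

# Typical P-score range for graduate-level human participants, per the description.
HUMAN_GRAD_P_RANGE = (38.5, 42.3)


def rank_by(index):
    """Return model names sorted from highest to lowest on the chosen measure.

    index 0 = P-score, 1 = N2 score, 2 = ICM total. Ties keep insertion order
    because Python's sort is stable.
    """
    return sorted(scores, key=lambda model: scores[model][index], reverse=True)


def exceeds_graduate_range(model):
    """True if the model's P-score is above the upper end of the human range."""
    return scores[model][0] > HUMAN_GRAD_P_RANGE[1]


if __name__ == "__main__":
    print("By P-score:", rank_by(0))
    print("By N2 score:", rank_by(1))
    print("By ICM total:", rank_by(2))
    print("Above graduate range:", [m for m in scores if exceeds_graduate_range(m)])
```

Ranking by a different measure changes the order: ChatGPT-4O ties ChatGPT-4 on P-score but sits above Gemini on N2 score, and Gemini Advanced leads on the ICM total while Claude 3.5 Sonnet leads on both DIT-2 measures.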

Created:

2025-10-29
