Supplementary material for the paper "Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI"

4TU.ResearchData. Updated 2025-09-01. Indexed 2026-04-23.

Download link:

https://data.4tu.nl/datasets/cb208bd8-7cf4-42d5-ae5e-9ad2c654aeb3/1

Description:

Supplementary material for the paper: Alam, M. S., & Bazilinskyy, P. (2025). Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI. Adjunct Proceedings of the 17th International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutoUI). Brisbane, QLD, Australia. https://doi.org/10.1145/3744335.3758477

This study evaluates the effectiveness of large language model-based personas for assessing external Human-Machine Interfaces (eHMIs) in automated vehicles. Thirteen models (BakLLaVA, ChatGPT-4o, DeepSeek-VL2-Tiny, Gemma3:12B, Gemma3:27B, Granite Vision 3.2, LLaMA 3.2 Vision, LLaVA-13B, LLaVA-34B, LLaVA-LLaMA-3, LLaVA-Phi3, MiniCPM-V and Moondream) were tasked with simulating pedestrian decision-making for 227 images of vehicles equipped with an eHMI. Confidence scores (0-100) were collected under two conditions, no memory (each image assessed independently) and memory-enabled (conversation history preserved), with 15 independent trials per condition. The model outputs were compared with the ratings of 1,438 human participants. Gemma3:27B achieved the highest correlation with the human ratings without memory (r = 0.85), while ChatGPT-4o performed best with memory (r = 0.81). DeepSeek-VL2-Tiny and BakLLaVA showed little sensitivity to context, and LLaVA-LLaMA-3, LLaVA-Phi3, LLaVA-13B and Moondream consistently produced limited-range output.

The material has the following structure:

* code/
  * code/.python-version: Pins the Python interpreter version (3.9.21) for environment consistency.
  * code/analysis.py: Main analysis script that processes outputs, computes statistics (e.g., correlations with human data), and produces result figures.
  * code/common.py: Contains functions for configuration management, dictionary search, and data serialisation.
  * code/custom_logger.py: Implements a custom logger class for handling string formatting and logging at various levels.
  * code/default.config: Configuration file specifying paths for data, plotly template, and plots directory.
  * code/logmod.py: Initialises and configures the logger with customisable display and storage options, supporting coloured logs, threading, and multiprocessing.
  * code/main.py: Python script that produces all figures and analyses.
  * code/Makefile: Defines shortcut commands for setup, running analysis, and cleaning project outputs.
  * code/pyproject.toml: Defines project dependencies and metadata for the `uv` environment manager.
  * code/uv.lock: Lockfile with pinned dependency versions for reproducible builds.
* models/
  * code/models/chat_gpt.py: Wrapper for interacting with ChatGPT (Vision), including prompt formatting, sending images, and parsing responses.
  * code/models/deepseek.py: Wrapper for DeepSeek-VL2 models, coordinating inference, inputs, and outputs.
  * code/models/ollama.py: Interface to run local Ollama models with specific parameters (temperature, context, history).
* deepseek_vl2/
  * code/deepseek_vl2/__init__.py: Makes the deepseek_vl2 folder a package; initialises the DeepSeek-VL2 module structure.
  * code/deepseek_vl2/models/: Contains model definition files for DeepSeek-VL2.
  * code/deepseek_vl2/serve/: Implements server or API endpoints for running DeepSeek-VL2 inference.
  * code/deepseek_vl2/utils/: Utility scripts (helper functions, preprocessing, logging, etc.) used across DeepSeek-VL2.
* data/
  * data/avg_with_memory.csv: Stores the model confidence scores averaged across the 15 trials with conversation memory enabled, aggregated per image.
  * data/avg_without_memory.csv: Stores the model confidence scores averaged across the 15 trials without conversation memory, aggregated per image.
  * data/with_memory/: Contains all the raw output files generated directly by the LLMs under the memory condition.
  * data/with_memory/analysed/: Subdirectory that stores the numeric values extracted from the raw outputs.
  * data/without_memory/: Contains all the raw output files generated by the LLMs under the no-memory condition.
  * data/without_memory/analysed/: Subdirectory that stores the numeric values extracted from the raw outputs.
* crowd_data: Includes the original images shown to participants and the corresponding averaged human responses, which serve as the benchmark for comparison with the LLM outputs (sourced from DOI: 10.54941/ahfe1002444).
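The description above outlines a three-step pipeline: numeric confidence values are extracted from raw model replies, averaged per image across trials, and correlated with the averaged human ratings. The following is a minimal Python sketch of that idea, not the code shipped in code/analysis.py. The function names, the parsing rule (take the first number between 0 and 100 in a reply), the use of Pearson's coefficient for the reported r, and the toy data are all assumptions made for illustration.

```python
import math
import re
from statistics import mean
from typing import Dict, List, Optional


def extract_confidence(raw_output: str) -> Optional[float]:
    """Return the first number between 0 and 100 found in a reply, else None.

    Assumed parsing rule; the dataset's own extraction logic may differ.
    """
    for token in re.findall(r"\d+(?:\.\d+)?", raw_output):
        value = float(token)
        if 0.0 <= value <= 100.0:
            return value
    return None


def average_per_image(trials: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each image's score over all trials that contain it."""
    image_ids = sorted({image_id for trial in trials for image_id in trial})
    return {
        image_id: mean(trial[image_id] for trial in trials if image_id in trial)
        for image_id in image_ids
    }


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data only: three hypothetical images, two trials, made-up replies.
    raw_trials = [
        {"img_a": "Confidence: 80", "img_b": "I would say 20.", "img_c": "55"},
        {"img_a": "90", "img_b": "Around 30 out of 100", "img_c": "45"},
    ]
    parsed_trials = []
    for trial in raw_trials:
        scores = {}
        for image_id, reply in trial.items():
            score = extract_confidence(reply)
            if score is not None:  # skip replies without a usable number
                scores[image_id] = score
        parsed_trials.append(scores)

    model_avg = average_per_image(parsed_trials)
    human_avg = {"img_a": 75.0, "img_b": 30.0, "img_c": 50.0}  # made-up ratings

    shared = sorted(set(model_avg) & set(human_avg))
    r = pearson_r([model_avg[i] for i in shared], [human_avg[i] for i in shared])
    print(f"r = {r:.2f} over {len(shared)} images")
```

Averaging before correlating mirrors the per-image aggregation described for data/avg_with_memory.csv and data/avg_without_memory.csv: each image contributes a single model score and a single human score to the comparison.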

Created:

2025-09-01