Supplementary material for the paper "Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI"
Source: 4TU.ResearchData. Updated: 2025-09-01. Indexed: 2026-04-23.
Download link:
https://data.4tu.nl/datasets/cb208bd8-7cf4-42d5-ae5e-9ad2c654aeb3/1
Description:
Supplementary material for the paper: Alam, M. S., & Bazilinskyy, P. (2025). Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI. Adjunct Proceedings of the 17th International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutoUI). Brisbane, QLD, Australia. https://doi.org/10.1145/3744335.3758477

This study evaluates the effectiveness of large language model-based personas for assessing external Human-Machine Interfaces (eHMIs) in automated vehicles. Thirteen models (BakLLaVA, ChatGPT-4o, DeepSeek-VL2-Tiny, Gemma3:12B, Gemma3:27B, Granite Vision 3.2, LLaMA 3.2 Vision, LLaVA-13B, LLaVA-34B, LLaVA-LLaMA-3, LLaVA-Phi3, MiniCPM-V and Moondream) were tasked with simulating pedestrian decision-making for 227 images of vehicles equipped with an eHMI. Confidence scores (0-100) were collected under two conditions: no memory (each image assessed independently) and memory-enabled (conversation history preserved), each in 15 independent trials. The model outputs were compared with the ratings of 1,438 human participants. Gemma3:27B achieved the highest correlation with humans without memory (r = 0.85), while ChatGPT-4o performed best with memory (r = 0.81). DeepSeek-VL2-Tiny and BakLLaVA showed little sensitivity to context, and LLaVA-LLaMA-3, LLaVA-Phi3, LLaVA-13B and Moondream consistently produced limited-range output.

The dataset has the following structure:

* `code/`
* `code/.python-version`: Pins the Python interpreter version (3.9.21) for environment consistency.
* `code/analysis.py`: Main analysis script that processes outputs, computes statistics (e.g., correlations with human data), and produces result figures.
* `code/common.py`: Contains functions for configuration management, dictionary search, and data serialisation.
* `code/custom_logger.py`: Implements a custom logger class for handling string formatting and logging at various levels.
* `code/default.config`: Configuration file specifying paths for data, plotly template, and plots directory.
* `code/logmod.py`: Initialises and configures the logger with customisable display and storage options, supporting coloured logs, threading, and multiprocessing.
* `code/main.py`: Python script that produces all figures and analyses.
* `code/Makefile`: Defines shortcut commands for setup, running analysis, and cleaning project outputs.
* `code/pyproject.toml`: Defines project dependencies and metadata for the `uv` environment manager.
* `code/uv.lock`: Lockfile with pinned dependency versions for reproducible builds.
* `models/`
* `code/models/chat_gpt.py`: Wrapper for interacting with ChatGPT (Vision), including prompt formatting, sending images, and parsing responses.
* `code/models/deepseek.py`: Wrapper for DeepSeek-VL2 models, coordinating inference, inputs, and outputs.
* `code/models/ollama.py`: Interface to run local Ollama models with specific parameters (temperature, context, history).
* `deepseek_vl2/`
* `code/deepseek_vl2/__init__.py`: Makes the deepseek_vl2 folder a package; initialises the DeepSeek-VL2 module structure.
* `code/deepseek_vl2/models/`: Contains model definition files for DeepSeek-VL2.
* `code/deepseek_vl2/serve/`: Implements server or API endpoints for running DeepSeek-VL2 inference.
* `code/deepseek_vl2/utils/`: Utility scripts (helper functions, preprocessing, logging, etc.) used across DeepSeek-VL2.
* `data/`
* `data/avg_with_memory.csv`: Stores the model confidence scores averaged across 15 trials (with conversation memory enabled), aggregated per image.
* `data/avg_without_memory.csv`: Stores the model confidence scores averaged across 15 trials (without conversation memory), aggregated per image.
* `data/with_memory/`: Contains all the raw output files generated by the LLMs under the memory condition.
* `data/with_memory/analysed/`: Subdirectory that stores the numeric values extracted from the raw outputs.
* `data/without_memory/`: Contains all the raw output files generated by the LLMs under the no-memory condition.
* `data/without_memory/analysed/`: Subdirectory that stores the numeric values extracted from the raw outputs.
* `crowd_data`: Includes the original images shown to participants and the corresponding averaged human responses, which serve as the benchmark for comparing against LLM outputs (sourced from DOI: 10.54941/ahfe1002444).
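
The processing chain described above (raw replies, extracted numeric values, per-image averages over trials, correlation with human ratings) can be illustrated with a minimal sketch. It is not the code in `code/analysis.py`: all function names are hypothetical, the parsing rule (first number between 0 and 100 in the reply) is an assumption, and Pearson's r is assumed because the description reports r values without naming the coefficient. Only the standard library is used so the sketch runs on the pinned Python 3.9.

```python
import math
import re
from collections import defaultdict


def extract_confidence(raw_text):
    """Return the first number in [0, 100] found in a raw reply, or None.

    Assumed parsing rule for illustration only; the repository may
    extract scores differently.
    """
    for match in re.finditer(r"\d+(?:\.\d+)?", raw_text):
        value = float(match.group())
        if 0 <= value <= 100:
            return value
    return None


def average_per_image(trials):
    """Average parsed scores per image.

    `trials` is an iterable of (image_id, raw_text) pairs, one pair per
    trial. Replies without a usable score are skipped.
    """
    scores = defaultdict(list)
    for image_id, raw_text in trials:
        value = extract_confidence(raw_text)
        if value is not None:
            scores[image_id].append(value)
    return {image_id: sum(v) / len(v) for image_id, v in scores.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def correlate_with_humans(model_means, human_means):
    """Correlate per-image means over the images present in both dicts."""
    shared = sorted(set(model_means) & set(human_means))
    return pearson_r(
        [model_means[i] for i in shared],
        [human_means[i] for i in shared],
    )


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from the dataset.
    toy_trials = [
        ("img_1", "Confidence: 80"),
        ("img_1", "I would say 70 out of 100."),
        ("img_2", "20"),
        ("img_3", "About 55%."),
    ]
    toy_human = {"img_1": 72.0, "img_2": 30.0, "img_3": 50.0}
    print(correlate_with_humans(average_per_image(toy_trials), toy_human))
```

A model that always returns the same score would make the correlation undefined, which is why `pearson_r` raises an error for constant input rather than returning a number.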
Created:
2025-09-01



