
TLab Traffic Evaluation Dataset

ModelScope Community. Updated 2026-02-07; indexed 2026-02-21.
Download link:
https://modelscope.cn/datasets/TLabFineTuning/TLab_Traffic_Eval
Resource description:

# TRIP-Evaluate: A Vertical-Domain Evaluation Benchmark Dataset for Multimodal Large Traffic Models

**TRIP-Evaluate** is a systematic evaluation benchmark designed for multimodal large models in the traffic domain. General-purpose large models have limitations when handling China-specific traffic regulations, complex engineering design standards, and multi-agent interaction logic. To address this, TRIP-Evaluate builds an evaluation system covering four core dimensions: Vehicle Agent, Management Agent, Traveler Agent, and Planning Agent.

The project aims to quantify how industry fine-tuned models (e.g., Pangu Traffic LLM) differ from general base models in professional knowledge memorization, logical reasoning, numerical calculation, and scene semantic understanding, providing a standardized evaluation reference for deploying large models in the traffic industry.

- **Dataset Scale**: The target is over 1000 samples; the current core test set contains 198 samples (158 text-only samples and 40 multimodal image samples).
- **Language**: Simplified Chinese (zh-CN)
- **Task Types**: Multiple Choice, Visual Question Answering (VQA)
- **Usage Requirements**: Before using the dataset, please email zzhou602@seu.edu with your personal information, organization name, intended use, and ModelScope account name. We will approve your ModelScope access request as soon as possible after receiving the application.

# Evaluation Classification System

Rather than reporting a single total score, Traffic-Eval adopts a three-level evaluation framework, **"Level 1 Role - Level 2 Task - Level 3 Knowledge Point"**, which supports the generation of fine-grained capability radar charts.

## Class A: Vehicle Agent

Focuses on the perception, decision-making, and control capabilities of autonomous driving systems.
- **Perception and Localization**: Sensor fusion, object detection.
- **Decision Planning**: Path planning algorithms, behavior decision logic, local path planning.
- **Control Execution**: Longitudinal vehicle control, lateral control dynamics.
- **Traffic Safety**: Driving sight distance calculation, driving stability analysis.

## Class B: Management Agent

Focuses on traffic control, facility operation and maintenance, and regulation enforcement capabilities.

- **Traffic Signal Control**: Green split calculation, phase design optimization.
- **Road Signs and Markings**: Sign placement specifications under national standard GB5768, marking design.
- **Intelligent Roadside Equipment**: RSU deployment strategy, V2X communication.
- **Traffic Safety**: Conflict point management, accident prevention strategies.

## Class C: Traveler Agent

Focuses on user behavior analysis and subjective safety evaluation.

- **Travel Behavior Analysis**: Travel demand forecasting (four-step method), travel mode choice logic.
- **Traffic Safety**: Traffic accident analysis, pedestrian facility safety evaluation.

## Class D: Planning Agent

Focuses on road geometric design and macroscopic road network planning.

- **Road Geometric Design**: Sight distance calculation, horizontal curve and vertical profile design, clearance limits.
- **Road Network Planning**: Road network layout structure, traffic capacity and volume calculation.

# Dataset Structure

Data is stored in CSV/JSON format. Each sample carries complete metadata annotations to support multi-dimensional attribution analysis.
| Field Name | Type | Description |
| --- | --- | --- |
| id | String | Unique sample identifier (e.g., B_b_1) |
| role_tag | String | Level 1 classification: Vehicle Agent / Management Agent / Traveler Agent / Planning Agent |
| subject_tag | String | Level 2 classification: specific task scenario (e.g., Traffic Safety) |
| knowledge_point | String | Level 3 classification: specific test point (e.g., Sight Triangle) |
| question | String | Question stem, including scenario setting and question |
| options | List/Map | Option set (A/B/C/D) |
| answer | String | Standard answer option |
| explanation | String | Answer explanation, citing regulatory provisions or calculation formulas |
| difficulty | String | Difficulty level: Easy / Medium / Hard |
| modality | String | Data modality: Text-only / Image / Point Cloud |
| capability_tag | String | Capability dimension: Knowledge Memorization / Logical Reasoning / Numerical Calculation / Scene Semantics |

# Evaluation Leaderboards

Based on the core test set of **Traffic-Eval v2.0** (198 samples), we evaluated current mainstream large models separately for each modality. To keep the results comparable, the Language (text logic) and Vision (visual perception) tracks are reported separately.

## Comprehensive Performance and Level 1 Role Capability Leaderboards

The tables below show the total accuracy of each model under each modality, as well as their scores across the four Level 1 Role dimensions.
| Rank | Model Name | Text Acc (Language) | Vision Acc (Image) |
| --- | --- | --- | --- |
| 1 | **Deepseek-V3.2** | **84.2%** | - |
| 2 | **Qwen-max** | 81.0% | - |
| 3 | **Qwen2.5-coder-7b** | 79.7% | - |
| 4 | **Llama-3.2-90b-vis** | 77.2% | **92.5%** |
| 5 | **Qwen2.5-coder-32b** | 75.9% | - |
| 6 | **Gemma-2-9b-it** | 75.9% | - |
| 7 | **GPT-oss-20b** | 74.7% | - |
| 8 | **Gemma-3-27b-it** | 74.7% | - |
| 9 | **Llama-3.2-11b-vis** | 61.4% | **77.5%** |

| Model Name | Traveler Agent | Management Agent | Planning Agent | Vehicle Agent |
| --- | --- | --- | --- | --- |
| **Deepseek-V3.2** | 88.9% | **91.7%** | **76.8%** | 94.1% |
| **Qwen-max** | 85.2% | **91.7%** | 72.0% | **100.0%** |
| **Qwen2.5-coder-7b** | 81.5% | 90.3% | 73.2% | 88.2% |
| **Llama-3.2-90b-vis** | 81.5% | 90.3% | 67.1% | 88.2% |
| **Qwen2.5-coder-32b** | **85.2%** | 88.9% | 64.6% | 94.1% |
| **Gemma-2-9b-it** | 74.1% | 90.3% | 65.9% | 94.1% |
| **GPT-oss-20b** | 50.0% | 89.4% | 64.9% | 75.0% |
| **Gemma-3-27b-it** | 70.4% | 91.7% | 64.6% | 82.4% |
| **Llama-3.2-11b-vis** | 59.3% | 76.4% | 47.6% | 58.8% |

**Note**: Vision Acc lists real image test results only for the Llama-3.2 Vision series; the remaining models were not tested on this modality.

### Role-specific Capability Evaluation

A side-by-side comparison across **Level 2 Subjects** shows that the tested large models generally suffer from a structural imbalance of "strong memorization, weak reasoning". This "performance disparity" is particularly prominent in non-SOTA models.

## "Numerical Calculation" and "Engineering Design" are Industry-wide Common Bottlenecks

**General Phenomenon**: For both lightweight 7B models and 90B large models, accuracy in subjects involving "**Capacity Calculation**" and "**Road Geometric Design**" (e.g., horizontal curve parameter calculation) is 15%-25% lower than the model's own average.
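To make "lower than the model's own average" concrete, here is a minimal sketch of how a per-subject gap could be computed from graded samples. It assumes records keyed by the `subject_tag` and `answer` fields from the schema above; the `prediction` field, the helper names, and all sample values are hypothetical and are not part of the dataset or the authors' evaluation tooling.

```python
from collections import defaultdict

# Hypothetical graded records. `subject_tag` and `answer` follow the dataset
# schema; `prediction` (the model's chosen option) and every value here are
# invented purely for illustration.
records = [
    {"id": "demo_1", "subject_tag": "Traffic Signal Control", "answer": "A", "prediction": "A"},
    {"id": "demo_2", "subject_tag": "Traffic Signal Control", "answer": "C", "prediction": "C"},
    {"id": "demo_3", "subject_tag": "Road Network Planning", "answer": "B", "prediction": "B"},
    {"id": "demo_4", "subject_tag": "Road Network Planning", "answer": "D", "prediction": "A"},
    {"id": "demo_5", "subject_tag": "Road Geometric Design", "answer": "A", "prediction": "A"},
    {"id": "demo_6", "subject_tag": "Road Geometric Design", "answer": "B", "prediction": "C"},
]


def accuracy(rows):
    """Fraction of rows whose predicted option matches the standard answer."""
    return sum(r["prediction"] == r["answer"] for r in rows) / len(rows)


def subject_gaps(rows):
    """Map each subject to (subject accuracy - overall accuracy), in percentage points."""
    overall = accuracy(rows)
    by_subject = defaultdict(list)
    for r in rows:
        by_subject[r["subject_tag"]].append(r)
    return {
        subject: round(100 * (accuracy(group) - overall), 1)
        for subject, group in by_subject.items()
    }


if __name__ == "__main__":
    for subject, gap in subject_gaps(records).items():
        print(f"{subject}: {gap:+.1f} points vs. overall")
```

A negative value marks a subject where the model scores below its own overall accuracy, which is the kind of shortfall the paragraph above describes for capacity and geometric design questions.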
**Attribution**: This indicates that the general "next-token prediction" mechanism still struggles to reach industrial-grade accuracy on rigorous traffic engineering calculations, and needs to be augmented with external tools (tool use).

## "Knowledge-Application Gap": Universal Decline from Regulatory Recitation to Safety Reasoning

**General Phenomenon**: Most models score significantly higher in "Traffic Regulations" and "Management Agent" (which emphasize static knowledge memorization) than in "Traffic Safety" and "Planning Agent" (which emphasize dynamic logical reasoning).

**Data Support**:

- First-tier models: Even a top-tier model such as **Qwen-max** drops 28 percentage points from "Vehicle Agent Concepts" (100%) to "Planning and Design Applications" (72.0%).
- Mid-tier models: **Gemma-3-27b** scores 91.7% in Management Agent, but drops to 65.2% in the "Traffic Safety" subject, which involves complex multi-party interactions.
- Extreme case: **GPT-oss-20b** is an extreme example of this trend. It exhibits expert-level performance in regulatory compliance (88.5%), but nearly fails in the "Traffic Safety" subject (16.7%), which requires risk prediction based on the surrounding environment.

**Conclusion**: This reveals an industry risk point: **one cannot assume that a model has the cognitive ability for "safe driving" just because it can "recite laws and regulations"**.

## "Vision-Language" Capabilities Develop Asynchronously

**General Phenomenon**: Multimodal models often fail to align their "seeing" and "reading" capabilities.

**Case Analysis**: **Llama-3.2-11b** shows a distinct "strong vision, weak text" profile (Vision 77.5% vs Text 61.4%), while models such as **Deepseek-v3.2** rely entirely on strong text logic.
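The modality gap quoted in the case analysis can be reproduced from the leaderboard figures. The short sketch below uses only the two rows that report both a Text Acc and a Vision Acc; the dictionary layout and the `modality_gap` helper are illustrative assumptions, not part of any released tooling.

```python
# Text and Vision accuracy (in %) copied from the leaderboard for the two
# models that were tested on both modalities.
leaderboard = {
    "Llama-3.2-90b-vis": {"text": 77.2, "vision": 92.5},
    "Llama-3.2-11b-vis": {"text": 61.4, "vision": 77.5},
}


def modality_gap(scores):
    """Vision accuracy minus text accuracy, in percentage points."""
    return round(scores["vision"] - scores["text"], 1)


if __name__ == "__main__":
    for name, scores in leaderboard.items():
        print(f"{name}: vision - text = {modality_gap(scores):+.1f} points")
```

Both models score higher on the Vision track than on the Language track, by 15.3 and 16.1 percentage points respectively. The two tracks use different sample sets (40 image samples vs 158 text samples), so the gap is indicative rather than a like-for-like comparison.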
**Implications**: Future fine-tuning of large traffic models should pay special attention to **aligning** the reasoning level of the visual perception module with that of the language base model, avoiding the mismatch in which "the eyes understand but the brain analyzes incorrectly".

# Future Plans

The current results show strong text performance, large divergence in vision performance, and weak engineering calculation. In response, the next stage of Traffic-Eval will focus on:

1. **Difficulty Advancement**: Because SOTA models generally score high on simple questions, introduce more **multi-step logical reasoning** and **complex engineering calculation** questions. Add corner cases involving accident determination in extreme weather and ambiguous regulatory boundaries.
2. **Multimodal Enhancement**: Beyond Llama-3.2, add cutting-edge models such as Qwen-VL-Max and GPT-4o to the Vision track. Add LiDAR point cloud samples based on real road test data, focusing on the model's spatial perception and risk prediction capabilities in the real physical world.
3. **Data Expansion**: Use RAG over documents such as "Technical Standards for Highway Engineering" to expand the dataset to over 1000 samples, covering more fine-grained engineering scenarios.
4. **Attribution Analysis**: Develop automated evaluation tools that output capability radar charts for fine-grained knowledge points such as "green split calculation" and "sight distance verification", moving from "scoring" to "diagnosis".

## Contributor List

- **Model Contributor**: TRIP Project Team
- **Core Contributors**: Han Gong, Zhen Zhou, Qi Hong (Complex Traffic Network Research Center, Southeast University)
- **Support Team**: School of Transportation, Southeast University
Provider:
maas
Created:
2026-01-10