下载链接：

https://modelscope.cn/datasets/AI-ModelScope/CoVLA-Dataset

下载链接

链接失效反馈

官方服务：

资源简介：

# CoVLA-Dataset <h2 style="color:#FF6666;">WACV 2025 Oral</h2> **CoVLA-Dataset** is a dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. It includes 10,000 30-second video clips, paired with trajectory targets and language annotations generated from CAN data and front camera footage. For more details, please visit our project page [https://turingmotors.github.io/covla-ad/](https://turingmotors.github.io/covla-ad/). <img src="https://turingmotors.github.io/covla-ad/static/images/overview.png" alt="Overview of the Dataset" width="600"> ## Data fields | **Key** | **Value** | |-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------| | **image** | <img src="https://turingmotors.github.io/covla-ad/static/images/sample.png" alt="sample" width="300"> | **frame_id** | 329 | | **vEgo** | 10.03304386138916 | | **vEgoRaw** | 10.020833015441895 | | **aEgo** | 0.46339523792266846 | | **steeringAngleDeg** | 0.6606917381286621 | | **steeringTorque** | -83.0 | | **brake** | 0.0 | | **brakePressed** | false | | **gas** | 0.0949999988079071 | | **gasPressed** | true | | **doorOpen** | false | | **seatbeltUnlatched** | false | | **gearShifter** | drive | | **leftBlinker** | false | | **rightBlinker** | false | | **orientations_calib** | [2.3436582957260557, 0.5339828947300967, 1.3629659149020594] | | **orientations_ecef** | [2.3389552760497168, 0.5209895497170147, 1.353589728168173] | | **orientations_ned** | [0.0025234392011709832, 0.03227332984737223, -2.2615545172406692] | | **positions_ecef** | [-3980150.365520416, 3315762.367044255, 3708484.8043875922] | | **velocities_calib** | [9.879017074377433, -0.011840230995096795, 0.024564830387060477] | | **velocities_ecef** | [1.7610653813101715, 8.306048478869922, -5.0501415195236214] | | **accelerations_calib** | [0.27428175425116946, 0.12695569343062033, -0.10788516598110376] | | **accelerations_device** | [0.27649870813464505, 0.12283225142665075, -0.10699598243696486] | | **angular_velocities_calib** | [0.0026360116259363207, 0.004025109052377312, -0.00268604793365312] | | **angular_velocities_device** | [0.0027046335044321763, 0.003985098643058938, -0.0026774727056080635] | | **timestamp** | 1666768003100 | | **extrinsic_matrix** | [[-0.014968783967196942, -0.9998879633843899, -4.85357778264491e-05, 0.0], [0.003242381996824406, 1.2705494208814505e-22, -0.9999947418769201, 1.2200000286102295], [0.9998827102283637, -0.014968862590224792, 0.0032420187150516235, 0.0], [0.0, 0.0, 0.0, 1.0]] | | **intrinsic_matrix** | [[2648.0, 0.0, 964.0], [0.0, 2648.0, 604.0], [0.0, 0.0, 1.0]] | | **trajectory_count** | 60 | **trajectory** | [[0.0, -0.0, 0.0], [0.4950813837155965, 0.0002547887961875119, 0.0021622613513301494], [0.9982726849068438, 0.0056820013761280435, 0.008019814119642137], [1.5000274952496726, 0.0059424162043407655, 0.010366395805683198], [1.9714437957699504, 0.012072826164266363, 0.017691995618773503], [2.4978684260880795, 0.011601311998705278, 0.02386450425538476], [3.010815767380653, 0.01801527128027971, 0.03445721142353303], [3.507063998218958, 0.01701233281058208, 0.038337927578102234], [4.012620624170714, 0.024100599226699392, 0.045395340010689886], [4.514833598833565, 0.02495601111254716, 0.049133835162865874], [5.017161220493318, 0.03149524423866552, 0.05523633716707353], [5.51940086207554, 0.030085354586579783, 0.0629749739561262], [6.03533332268388, 0.033231232243281575, 0.07405741372199495], [6.537391640025451, 0.03051862039002601, 0.08446890058718093], [7.048671047316283, 0.038067441674022755, 0.09575308668400331], [7.55109590134654, 0.03431035592675249, 0.10149061037170799], [8.059086339126619, 0.042729229684233254, 0.10987290009657202], [8.52910950102711, 0.0361088815813233, 0.11378430761802129], [9.057420775293076, 0.04137374154529525, 0.11942084703760691], [9.56262721211865, 0.03109799109499287, 0.12977970617751178], [10.063355428131272, 0.031333084537993515, 0.14035971143267495], [10.564434359898204, 0.017195610432229166, 0.1523663378360089], [11.067897560093263, 0.015964352684473423, 0.16406888445093548], [11.538305780022156, -0.005298283878670548, 0.1742140300896913], [12.075701271632234, -0.013420317597075168, 0.18348975369247966], [12.57096145582652, -0.03554497226615074, 0.19262208554922391], [13.056727974695047, -0.049523833398930905, 0.20499172623121895], [13.57320019988525, -0.07404168277320623, 0.2147168274517664], [14.071046794195906, -0.07897020052519861, 0.2226606968611588], [14.538997968829394, -0.10256663521468153, 0.23212175013944475], [15.065563638878904, -0.1167763891342656, 0.2408806359762134], [15.55517235904856, -0.14451920391994882, 0.25216240490966046], [16.060878282608606, -0.16233348628138394, 0.2632859326068148], [16.56483257931331, -0.19450740003278352, 0.2759509621400748], [17.07504346354488, -0.20961007460225636, 0.2872020973948639], [17.570511139964328, -0.24407735023413227, 0.2988333759290186], [18.080854607007407, -0.26610472918341704, 0.31079787209427323], [18.59049646081564, -0.30203120166074393, 0.32420489211775605], [19.084604904329886, -0.32645382033607545, 0.34009129336809085], [19.63586747214387, -0.3657491267911684, 0.36160403574838995], [20.150774198207014, -0.3911628434767098, 0.3794843533824954], [20.666409031086534, -0.4277895374294906, 0.3966498889678393], [21.180019483939088, -0.458788621000815, 0.40918684690405555], [21.70154365867492, -0.490295037823036, 0.4215196756925751], [22.215836213869263, -0.5137219901398054, 0.43110400157805046], [22.728543128259094, -0.5469874293170292, 0.43765565801467715], [23.241384685764114, -0.571926428598341, 0.44717513184957286], [23.767547695050713, -0.6043245975304662, 0.461574553355278], [24.28656672722109, -0.6346200237378623, 0.4724674608211735], [24.809254106414958, -0.6666944983890827, 0.4837900904850035], [25.32670419066385, -0.6950365293779399, 0.49242971573187705], [25.821659469023185, -0.7245096058611247, 0.501958900633962], [26.370683105552995, -0.7502297676767269, 0.5119963762627946], [26.89741612568614, -0.7821684517861108, 0.5221244715437666], [27.398996546439875, -0.8079249885664631, 0.5315545062900925], [27.948541952161346, -0.8422125069605737, 0.5412863716179301], [28.473985439321677, -0.8667390346179196, 0.5504376709603114], [28.973416127630998, -0.8956524219968989, 0.5591793882574851], [29.520360419055418, -0.9238145692677926, 0.5670525689197388], [30.058969038298454, -0.9543859547220845, 0.5786333952064608]] | | **caption** | The ego vehicle is moving straight at a moderate speed following leading car with acceleration. There is a traffic light close to the ego vehicle displaying a green signal. It is sunny. The car is driving on a wide road. No pedestrians appear to be present. What the driver of ego vehicle should be careful is to keep an eye on the traffic light and be prepared to stop if the light changes. ## Usage We provide a simple tutorial. Please refer to [`tutorial.ipynb`](tutorial.ipynb) for instructions on how to load the data. ## License This repository includes data under our CoVLA-Dataset Licensing Terms and Conditions and the VideoLLaMA 2 licenses. Please make sure to review both licenses carefully. The video clips and CAN data are under the CoVLA-Dataset Licensing Terms and Conditions, while the natural language descriptions are under the [VideoLLaMA 2 license](https://github.com/DAMO-NLP-SG/VideoLLaMA2?tab=readme-ov-file#-license). VideoLLaMA2 License: > This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model Licenses of LLaMA and Mistral, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please get in touch with us if you find any potential violations. ## Acknowledgements This dataset is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO). ## Citation ``` @InProceedings{covla_wacv2025, author = {Arai, Hidehisa and Miwa, Keita and Sasaki, Kento and Watanabe, Kohei and Yamaguchi, Yu and Aoki, Shunsuke and Yamamoto, Issei}, title = {CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving}, booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)}, month = {February}, year = {2025}, pages = {1933-1943} } @article{damonlpsg2024videollama2, title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs}, author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong}, journal={arXiv preprint arXiv:2406.07476}, year={2024}, url = {https://arxiv.org/abs/2406.07476} }

# CoVLA数据集 ## WACV 2025 口头报告 **CoVLA数据集**是一项收录超80小时真实驾驶视频的专业数据集。本数据集依托一种基于自动化数据处理与字幕生成流水线的新颖可扩展方案，生成精准的驾驶轨迹，并搭配驾驶环境与操作的详细自然语言描述。数据集包含10000段30秒的视频片段，配套轨迹目标与由CAN总线数据及前置摄像头镜头画面生成的语言标注。如需了解更多细节，请访问我们的项目页面 [https://turingmotors.github.io/covla-ad/](https://turingmotors.github.io/covla-ad/)。 ![数据集概览](https://turingmotors.github.io/covla-ad/static/images/overview.png) ## 数据字段 | **键名** | **取值** | |-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------| | **图像** | ![示例图像](https://turingmotors.github.io/covla-ad/static/images/sample.png) | **帧ID** | 329 | | **车辆自身纵向速度（vEgo）** | 10.03304386138916 | | **原始车辆纵向速度（vEgoRaw）** | 10.020833015441895 | | **车辆自身纵向加速度（aEgo）** | 0.46339523792266846 | | **转向盘角度（steeringAngleDeg，单位：度）** | 0.6606917381286621 | | **转向扭矩（steeringTorque）** | -83.0 | | **制动踏板开度** | 0.0 | | **制动踏板被按下状态** | false | | **加速踏板开度** | 0.0949999988079071 | | **加速踏板被按下状态** | true | | **车门开启状态** | false | | **安全带未扣状态** | false | | **换挡杆挡位** | drive（驾驶挡） | | **左转向灯状态** | false | | **右转向灯状态** | false | | **校准后姿态角（orientations_calib）** | [2.3436582957260557, 0.5339828947300967, 1.3629659149020594] | | **地心地固坐标系下姿态角（orientations_ecef）** | [2.3389552760497168, 0.5209895497170147, 1.353589728168173] | | **北东地坐标系下姿态角（orientations_ned）** | [0.0025234392011709832, 0.03227332984737223, -2.2615545172406692] | | **地心地固坐标系下位置（positions_ecef）** | [-3980150.365520416, 3315762.367044255, 3708484.8043875922] | | **校准后速度（velocities_calib）** | [9.879017074377433, -0.011840230995096795, 0.024564830387060477] | | **地心地固坐标系下速度（velocities_ecef）** | [1.7610653813101715, 8.306048478869922, -5.0501415195236214] | | **校准后加速度（accelerations_calib）** | [0.27428175425116946, 0.12695569343062033, -0.10788516598110376] | | **设备端原始加速度（accelerations_device）** | [0.27649870813464505, 0.12283225142665075, -0.10699598243696486] | | **校准后角速度（angular_velocities_calib）** | [0.0026360116259363207, 0.004025109052377312, -0.00268604793365312] | | **设备端原始角速度（angular_velocities_device）** | [0.0027046335044321763, 0.003985098643058938, -0.0026774727056080635] | | **时间戳** | 1666768003100 | | **外参矩阵（extrinsic_matrix）** | [[-0.014968783967196942, -0.9998879633843899, -4.85357778264491e-05, 0.0], [0.003242381996824406, 1.2705494208814505e-22, -0.9999947418769201, 1.2200000286102295], [0.9998827102283637, -0.014968862590224792, 0.0032420187150516235, 0.0], [0.0, 0.0, 0.0, 1.0]] | | **内参矩阵（intrinsic_matrix）** | [[2648.0, 0.0, 964.0], [0.0, 2648.0, 604.0], [0.0, 0.0, 1.0]] | | **轨迹点数** | 60 | **轨迹数据（trajectory）** | （此处省略长数组） | | **自然语言描述（caption）** | 自车（ego vehicle）正以中等速度直行，跟随前车并处于加速状态。附近有一处交通信号灯，显示为绿灯。天气晴朗，车辆行驶在宽阔道路上，未出现行人。驾驶员需留意交通信号灯，若信号灯变色则需做好停车准备。 ## 使用方法我们提供了简单的教程，请参阅 [`tutorial.ipynb`](tutorial.ipynb) 以了解如何加载该数据集。 ## 许可协议本仓库包含的数据遵循CoVLA数据集许可条款与条件，以及VideoLLaMA 2许可协议。请务必仔细审阅两份许可协议。视频片段与CAN总线数据受CoVLA数据集许可条款与条件约束，而自然语言描述则遵循[VideoLLaMA 2许可协议](https://github.com/DAMO-NLP-SG/VideoLLaMA2?tab=readme-ov-file#-license)。 VideoLLaMA2许可协议： > 本项目基于Apache 2.0许可协议发布，详情请参见LICENSE文件。本服务仅作为研究预览版本，仅供非商业用途使用，需遵守LLaMA与Mistral的模型许可协议、OpenAI生成数据的使用条款，以及ShareGPT的隐私政策。如发现任何潜在违规情况，请与我们联系。 ## 致谢本数据集基于由新能源与工业技术开发组织（NEDO）资助的项目JPNP20017的研究成果开发。 ## 引用格式 @InProceedings{covla_wacv2025, author = {Arai, Hidehisa and Miwa, Keita and Sasaki, Kento and Watanabe, Kohei and Yamaguchi, Yu and Aoki, Shunsuke and Yamamoto, Issei}, title = {CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving}, booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)}, month = {February}, year = {2025}, pages = {1933-1943} } @article{damonlpsg2024videollama2, title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs}, author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong}, journal={arXiv preprint arXiv:2406.07476}, year={2024}, url = {https://arxiv.org/abs/2406.07476} }

应用场景：