Download link:

https://modelscope.cn/datasets/MCG-NJU/VideoChatOnline-IT

Resource description:

## Overview

This dataset provides a comprehensive collection for **Online Spatial-Temporal Understanding tasks**, covering multiple domains including Dense Video Captioning, Video Grounding, Step Localization, Spatial-Temporal Action Localization, and Object Tracking.

## Data Formation

Our pipeline begins with 96K high-quality samples curated from 5 tasks across 12 datasets. The conversion process enhances online spatiotemporal understanding through template transformation. For each video sample, we insert queries along the timeline in an organized interleaved format to help differentiate temporal contexts.

| **Category** | **Dataset** | **Count** | **Query** | **Response** |
|---|---|---|---|---|
| **Temporal Grounding** | DiDeMo | 33,002 | Identify whether a specific event is still ongoing or has concluded. Provide the start time of the event and its duration up to the query timestamp. | `<start time> - <event duration>: duration up to query timestamp.` |
| | QuerYD | 14,620 | | |
| | HiREST | 459 | | |
| | Charades-STA | 12,408 | | |
| **Object Tracking** | LaSOT | 1,120 | Track the object at the current time, given a brief description or a bounding box. | (1) Past trajectory up to the present with brief descriptions; (2) Track the object sequentially in future frames as they become available. |
| | GOT-10k | 8,250 | | |
| **Step Localization and Captioning** | COIN | 9,029 | List steps completed up to the current point, excluding previously reported ones. | `<start time> - <end time>, <step description>...` |
| | HiREST | 459 | | |
| **Dense Video Captioning** | ActivityNet Captions | 10,009 | Identify and list events up to the current point, excluding previously reported ones. | `<start time> - <end time>, <event description>...` |
| | ViTT | 5,141 | | |
| | YouCook2 | 1,192 | | |
| **Spatial Temporal Action Localization** | AVA | 160 | Identify the current and past actions of the person located in a specific box at the present time. | List actions for the person over time, with corresponding positions. |
| **Total number of samples:** | | **96K** | | |

---

### Additional Information:

- **Interleave Format:** Temporally Random Insert (T3, T2, T1)
- **Video Timeline:** Processed for **Online Video LLM**

## Data Formats

* Format 1: Conversational QA (LLaVA-style)

```json
{
  "video": "116/NLy71UrHElw.mp4",
  "conversations": [
    {
      "from": "human",
      "timestamps": 1026.0,  # Video timestamp in seconds
      "value": "<video>\nBased on current observation, list events..."
    },
    {
      "from": "gpt",
      "value": "21.0s - 22.0s (duration: 1.0s), begin to run up..."
    }
  ]
}
```

* Format 2: Template-based Tracking

```json
{
  "video": "GOT-10k_Train_006457",
  "fps": 1,  # Frame rate
  "all_image_files": ["00000001.jpg", ...],  # Keyframe paths
  "image_bboxes": [  # Temporal object tracking data
    {
      "timestamp": 0.0,
      "bbox": [0.412, 0.517, 0.452, 0.753]  # [x1,y1,x2,y2]
    },
    ...
  ],
  "query_template": {  # Randomized temporal insertion
    "from": "human",
    "value": "Track the location of \"person\" at <bbox> over time..."
  }
}
```

## Source Data

| **Task** | **Dataset** | **Source** |
|---|---|---|
| Dense Video Captioning | `ActivityNet Captions` | [Source](http://activity-net.org/download.html) |
| | `ViTT` | [Source](https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT) |
| | `YouCook2` | [Source](http://youcook2.eecs.umich.edu/) |
| Temporal Video Grounding | `DiDeMo` | [Source](https://github.com/LisaAnne/LocalizingMoments?tab=readme-ov-file#dataset) |
| | `QuerYD` | [Source](https://www.robots.ox.ac.uk/~vgg/data/queryd/) |
| | `HiREST_grounding` | [Source](https://github.com/j-min/HiREST) |
| | `Charades-STA` | [Source](https://github.com/jiyanggao/TALL) |
| Step Localization | `COIN` | [Source](https://github.com/coin-dataset/annotations) |
| | `HiREST_step` | [Source](https://github.com/j-min/HiREST) |
| Spatial Temporal Action Localization | `AVA` | [Source](https://research.google.com/ava/download.html) |
| Object Tracking | `GOT 10K` | [Source](http://got-10k.aitestunion.com/) |
| | `LaSOT` | [Source](http://vision.cs.stonybrook.edu/~lasot/) |

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{huang2024online,
  title={Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method},
  author={Huang, Zhenpeng and Li, Xinhao and Li, Jiaqi and Wang, Jing and Zeng, Xiangyu and Liang, Cheng and Wu, Tao and Chen, Xi and Li, Liang and Wang, Limin},
  journal={arXiv preprint arXiv:2501.00584},
  year={2024}
}
```
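
As an illustration of the Format 1 layout shown under Data Formats, the following minimal sketch pairs each human query with the reply that follows it. The inline record copies the README example, with its elided text left as-is. The helper name `pair_turns` and the assumption that turns strictly alternate human then gpt are illustrative only and are not confirmed by the dataset authors.

```python
import json

# Record copied from the Format 1 example; the "..." elisions are kept verbatim.
RECORD_JSON = """
{
  "video": "116/NLy71UrHElw.mp4",
  "conversations": [
    {"from": "human", "timestamps": 1026.0,
     "value": "<video>\\nBased on current observation, list events..."},
    {"from": "gpt",
     "value": "21.0s - 22.0s (duration: 1.0s), begin to run up..."}
  ]
}
"""


def pair_turns(record):
    """Return (timestamp, query, response) tuples in file order.

    Assumption: turns alternate human -> gpt and every human turn
    carries a "timestamps" field, as in the README example.
    """
    turns = record["conversations"]
    pairs = []
    for i in range(0, len(turns) - 1, 2):
        human, gpt = turns[i], turns[i + 1]
        if human["from"] != "human" or gpt["from"] != "gpt":
            raise ValueError(f"unexpected speaker order at turn {i}")
        pairs.append((float(human["timestamps"]), human["value"], gpt["value"]))
    return pairs


record = json.loads(RECORD_JSON)
pairs = pair_turns(record)
for timestamp, query, response in pairs:
    print(f"[{timestamp:.1f}s] {query!r} -> {response!r}")
```

Keeping the pairs in file order avoids guessing how the interleaved queries are ordered; sort by the first tuple element if chronological order is needed.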

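For Format 2, the sketch below shows one way a tracking entry might be mapped back to a frame file and to pixel coordinates. It assumes that `bbox` holds corner coordinates normalized to [0, 1] (the example values suggest this, but the README does not state it) and that `all_image_files` is ordered so that index `round(timestamp * fps)` is the frame at `timestamp`. The function names and the 1920x1080 frame size are hypothetical.

```python
from typing import List, Sequence, Tuple


def to_pixel_box(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized [x1, y1, x2, y2] box to integer pixel corners."""
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"not a normalized corner box: {bbox}")
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))


def frame_for(timestamp: float, fps: float, image_files: List[str]) -> str:
    """Pick the frame file for a timestamp (assumed index = timestamp * fps)."""
    index = round(timestamp * fps)
    if not 0 <= index < len(image_files):
        raise IndexError(f"timestamp {timestamp}s maps outside the frame list")
    return image_files[index]


# Values taken from the Format 2 example; only the first frame name is known.
entry = {"timestamp": 0.0, "bbox": [0.412, 0.517, 0.452, 0.753]}
files = ["00000001.jpg"]

pixel_box = to_pixel_box(entry["bbox"], width=1920, height=1080)
frame = frame_for(entry["timestamp"], fps=1, image_files=files)
print(frame, pixel_box)
```

If the boxes turn out to be stored in another convention (for example `[x, y, w, h]`), only `to_pixel_box` needs to change.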

Application scenarios: