Tarsier2-Recap-585K

Name: Tarsier2-Recap-585K
Creator: maas
Published: 2026-01-09 02:33:27
License: No description available

ModelScope community · Updated 2026-01-09 · Indexed 2025-03-22

Download link:

https://modelscope.cn/datasets/thomas/Tarsier2-Recap-585K

Resource description:

# Dataset Card for Tarsier2-Recap-585K

## Dataset Description

- **Language(s):** English
- **License:** Apache License 2.0
- **Technical Report:** https://arxiv.org/abs/2501.07888
- **Repository:** https://github.com/bytedance/tarsier/tree/main

## Introduction

✨Tarsier2-Recap-585K✨ consists of 585K **distinct** video clips, totaling **1972 hours**, drawn from open-source datasets (e.g. VATEX, TGIF, LSMDC). Each clip comes with a detailed video description annotated by **Tarsier2-7B**, _which beats GPT-4o in generating detailed and accurate video descriptions for video clips of 5~20 seconds_ (see the [DREAM-1K Leaderboard](https://tarsier-vlm.github.io/)). Experiments demonstrate its effectiveness in enhancing the capabilities of existing LVLMs for video description and general video understanding (see Section 4.3 of our [Technical Report](https://arxiv.org/abs/2501.07888)).

## Uses

**Tarsier2-Recap-585K may be used only for academic research and education purposes.**

### Dataset Composition

![images](./assets/figures/tarsier2-recap_data_composition.png)

_**Note:** For Ego4D, the raw videos are 4K resolution and too large to upload to HuggingFace, so we release only the metadata. You can download the videos from [Ego4D v2.0](https://ego4d-data.org/docs/start-here/) and map each video_file according to the vid (filename)._

### Dataset Structure

Tarsier2-Recap-585K contains 17 subsets (WebVid-10M is split into 3 parts and LSMDC is split into 4 parts). Each subset contains a `metadata.json` and `videos.tar*`, organized as follows:

```
Tarsier2-Recap-585K
  ├── ActivityNet
  │   ├── metadata.json
  │   ├── videos.tar.part-001.tar
  │   ├── ...
  ...
  │
  ├── LSMDC_part-1
  │   ├── metadata.json
  │   ├── videos.tar.part-001.tar
  │   ├── ...
  ├── LSMDC_part-2
  │   ├── ...
  ...
  ├── LSMDC_part-4
  │   ├── ...
  ├── SSV2
  │   ├── metadata.json
  │   ├── videos.tar
  ├── WebVid-10M_part-1
  │   ├── ...
  ...
  ├── WebVid-10M_part-3
  │   ├── ...
```

For subsets with `videos.tar.part-*`, concatenate the parts before extracting them.

### Data Format

Tarsier2-Recap-585K shares the same basic data format as [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL/tree/main/qwen-vl-utils):

```yaml
[
  {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "video",
            "video": {
              "video_file": "Oops/videos/25 Best Trampoline Fail Nominees - FailArmy Hall of Fame (July 2017)11.mp4", # video path
              "start_time": null,   # null means start from 0s
              "end_time": null,     # null means end at the end of the video
              "start_frame": null,  # null means start from the first frame
              "end_frame": null     # null means end at the last frame
              # time bounds and frame bounds must not both be set:
              # assert not ((start_time or end_time) and (start_frame or end_frame))
            }
          },
          {
            "type": "text",
            "text": "Describe the video in detail."
          }
        ]
      },
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "A man is seen jumping on a trampoline in a backyard with a blue above-ground pool and a black shed in the background. He continues to jump higher on the trampoline, losing balance as he approaches the edge. The man stumbles and falls forward into the pool, creating a large splash. He lands on the ground beside the pool, lying on the grass. A small black dog runs towards the man, seemingly concerned."
          }
        ]
      }],
    "dataset": "Oops",
    "task": "video/caption",
    "idx": "Oops_0"
  },
  ...
]
```

### Tips

- **Recommended subsets**: If downloading and using the complete dataset is too expensive, we recommend the LSMDC, Charades, Charades-Ego, WebVid-10M, TREC-VTT, Oops and TGIF subsets (in that order), which feature more dynamic actions and events.
- **Quick start**: The data format is the same as that of [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL/tree/main/qwen-vl-utils), except for the extra keys (_"start_time"/"end_time"_ and _"start_frame"/"end_frame"_) that control the start and end of the video clip. You can therefore quickly start fine-tuning Qwen2-VL-2B on Tarsier2-Recap-585K with this repository: [finetune-Qwen2-VL](https://github.com/zhangfaen/finetune-Qwen2-VL), a simple implementation of DDP training.

## Citation

If you find this repository useful, please consider citing our paper:

```bibtex
@misc{yuan2025tarsier2advancinglargevisionlanguage,
      title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding},
      author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
      year={2025},
      eprint={2501.07888},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.07888},
}
```
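
The card says that subsets shipped as `videos.tar.part-*` must be concatenated before extraction, but gives no command. The following is a minimal sketch of that step, not the authors' tooling. It assumes the parts are byte-wise slices of a single tar archive and that sorting the file names restores the original order (as zero-padded names such as `part-001` suggest). The function names `concatenate_parts` and `extract_archive` are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def concatenate_parts(subset_dir: Path, prefix: str = "videos.tar.part-") -> Path:
    """Join every `videos.tar.part-*` file in name order into `videos.tar`.

    Assumption: the parts are consecutive byte ranges of one archive.
    """
    parts = sorted(p for p in subset_dir.iterdir() if p.name.startswith(prefix))
    if not parts:
        raise FileNotFoundError(f"no {prefix}* files found in {subset_dir}")
    merged = subset_dir / "videos.tar"
    with merged.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # streamed copy, no full read into memory
    return merged


def extract_archive(archive: Path, dest: Path) -> None:
    """Extract a tar archive; compression, if any, is auto-detected by tarfile."""
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive) as tar:
        tar.extractall(dest, **kwargs)


# Example (paths are placeholders for wherever the subset was downloaded):
# subset = Path("Tarsier2-Recap-585K/ActivityNet")
# extract_archive(concatenate_parts(subset), subset)
```

Subsets that ship a single `videos.tar` (SSV2 in the tree above) can skip `concatenate_parts` and call `extract_archive` directly.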

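The record layout shown under Data Format can be flattened into a simple structure for training or inspection. The sketch below is illustrative only: `CaptionSample` and `parse_record` are hypothetical names, the units (seconds for times, indices for frames) are assumptions, and the check that time bounds and frame bounds are not both set is one reading of the `assert` comment in the card's example. The caption in the demo record is abbreviated to its first sentence.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptionSample:
    """Flattened view of one metadata.json record (hypothetical helper type)."""
    idx: str
    dataset: str
    video_file: str
    start_time: Optional[float]  # assumed to be seconds; None means from 0s
    end_time: Optional[float]    # None means until the end of the video
    start_frame: Optional[int]   # assumed to be a frame index; None means first frame
    end_frame: Optional[int]     # None means last frame
    prompt: str
    caption: str


def _first(content: list, kind: str) -> dict:
    """Return the first content item whose "type" equals `kind`."""
    return next(item for item in content if item["type"] == kind)


def parse_record(record: dict) -> CaptionSample:
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    video = _first(by_role["user"], "video")["video"]
    time_set = video.get("start_time") is not None or video.get("end_time") is not None
    frame_set = video.get("start_frame") is not None or video.get("end_frame") is not None
    if time_set and frame_set:
        raise ValueError(f"{record.get('idx')}: time and frame bounds are both set")
    return CaptionSample(
        idx=record["idx"],
        dataset=record["dataset"],
        video_file=video["video_file"],
        start_time=video.get("start_time"),
        end_time=video.get("end_time"),
        start_frame=video.get("start_frame"),
        end_frame=video.get("end_frame"),
        prompt=_first(by_role["user"], "text")["text"],
        caption=_first(by_role["assistant"], "text")["text"],
    )


# Demo record copied from the card's example, with the caption abbreviated.
RECORD_JSON = """
{"messages": [
   {"role": "user", "content": [
      {"type": "video", "video": {
         "video_file": "Oops/videos/25 Best Trampoline Fail Nominees - FailArmy Hall of Fame (July 2017)11.mp4",
         "start_time": null, "end_time": null,
         "start_frame": null, "end_frame": null}},
      {"type": "text", "text": "Describe the video in detail."}]},
   {"role": "assistant", "content": [
      {"type": "text", "text": "A man is seen jumping on a trampoline in a backyard with a blue above-ground pool and a black shed in the background."}]}],
 "dataset": "Oops", "task": "video/caption", "idx": "Oops_0"}
"""

sample = parse_record(json.loads(RECORD_JSON))
print(sample.idx, sample.video_file)
```

In practice the same function would be mapped over the list loaded from a subset's `metadata.json`.
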

Provider:

maas

Created:

2025-02-17
