Tarsier2-Recap-585K
Source: ModelScope community. Updated 2026-01-09; indexed 2025-03-22.
Download link:
https://modelscope.cn/datasets/thomas/Tarsier2-Recap-585K
# Dataset Card for Tarsier2-Recap-585K
## Dataset Description
- **Language(s):** English
- **License:** Apache License 2.0
- **Technical Report:** https://arxiv.org/abs/2501.07888
- **Repository:** https://github.com/bytedance/tarsier/tree/main
## Introduction
✨Tarsier2-Recap-585K✨ consists of 585K **distinct** video clips, totaling **1972 hours**, drawn from open-source datasets (e.g. VATEX, TGIF, LSMDC). Each clip is paired with a detailed video description annotated by **Tarsier2-7B**, _which beats GPT-4o in generating detailed and accurate video descriptions for video clips of 5~20 seconds_ (see the [DREAM-1K Leaderboard](https://tarsier-vlm.github.io/)). Experiments demonstrate the dataset's effectiveness in enhancing the video description and general video understanding capabilities of existing LVLMs (see Section 4.3 of our [Technical Report](https://arxiv.org/abs/2501.07888)).
## Uses
**Tarsier2-Recap-585K may be used only for academic research and educational purposes.**
### Dataset Composition

_**Note:** For Ego4D, the raw videos are in 4K resolution and too large to upload to HuggingFace, so we release only the metadata. You can download the videos from [Ego4D v2.0](https://ego4d-data.org/docs/start-here/) and map the video_file according to the vid (filename)._
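
The mapping described in the note could look roughly like the sketch below. It assumes that the filename stem of each `video_file` entry is the Ego4D video id, that the downloaded videos are `.mp4` files somewhere under one local directory, and that records follow the layout shown under Data Format. The function name `remap_ego4d` is hypothetical and not part of any released tooling.

```python
from pathlib import Path


def remap_ego4d(records, local_video_dir):
    """Point each video_file at a local file with the same stem.

    Returns the ids that had no local match, so gaps are easy to spot.
    """
    # Index downloaded videos by filename stem (assumed to be the Ego4D vid).
    local = {p.stem: p for p in Path(local_video_dir).rglob("*.mp4")}
    missing = []
    for record in records:
        for message in record["messages"]:
            for item in message["content"]:
                if item["type"] != "video":
                    continue
                vid = Path(item["video"]["video_file"]).stem
                if vid in local:
                    item["video"]["video_file"] = str(local[vid])
                else:
                    missing.append(vid)
    return missing
```

The records are modified in place; entries without a local match keep their original `video_file` value.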
### Dataset Structure
Tarsier2-Recap-585K contains 17 subsets (WebVid-10M is split into 3 parts and LSMDC into 4 parts). Each subset contains a `metadata.json` and `videos.tar*`, organized as follows:
```
Tarsier2-Recap-585K
├── ActivityNet
│ ├── metadata.json
│ ├── videos.tar.part-001.tar
│ ├── ...
...
│
├── LSMDC_part-1
│ ├── metadata.json
│ ├── videos.tar.part-001.tar
│ ├── ...
├── LSMDC_part-2
│ ├── ...
...
├── LSMDC_part-4
│ ├── ...
├── SSV2
│ ├── metadata.json
│ ├── videos.tar
├── WebVid-10M_part-1
│ ├── ...
...
├── WebVid-10M_part-3
│ ├── ...
```
For subsets with `videos.tar.part-*` files, concatenate the parts into a single archive before extracting it.
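
As a minimal sketch of the concatenate-then-extract step, the Python below joins the parts in sorted filename order and unpacks the result. It assumes the parts are plain byte-wise splits of one tar archive and that sorting by name gives the correct order. The function name `join_and_extract` and the intermediate `videos.tar` filename are illustrative choices, not part of the dataset tooling.

```python
import shutil
import tarfile
from pathlib import Path


def join_and_extract(subset_dir, out_dir):
    """Concatenate videos.tar.part-* files in name order, then extract the tar."""
    subset_dir, out_dir = Path(subset_dir), Path(out_dir)
    parts = sorted(subset_dir.glob("videos.tar.part-*"))
    if not parts:
        raise FileNotFoundError(f"no videos.tar.part-* files in {subset_dir}")

    joined = subset_dir / "videos.tar"  # assumed name for the joined archive
    with open(joined, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts never sit fully in memory.
                shutil.copyfileobj(src, dst)

    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(joined) as tar:
        tar.extractall(out_dir)  # only extract archives from a trusted source
    return joined
```

For example, `join_and_extract("ActivityNet", "ActivityNet/extracted")` would write the joined archive next to the parts and unpack it into the given output directory.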
### Data Format
Tarsier2-Recap-585K shares the same basic data format as [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL/tree/main/qwen-vl-utils):
```yaml
[
{
"messages": [
{
"role": "user",
"content": [
{
"type": "video",
"video": {
"video_file": "Oops/videos/25 Best Trampoline Fail Nominees - FailArmy Hall of Fame (July 2017)11.mp4", # video path
"start_time": null, # null means start from 0s
"end_time": null, # null means end at the end of the video
"start_frame": null, # null means start from the first frame
"end_frame": null # null means end at the last frame
# time bounds and frame bounds must not both be set:
# assert not ((start_time or end_time) and (start_frame or end_frame))
}
},
{
"type": "text",
"text": "Describe the video in detail."
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "A man is seen jumping on a trampoline in a backyard with a blue above-ground pool and a black shed in the background. He continues to jump higher on the trampoline, losing balance as he approaches the edge. The man stumbles and falls forward into the pool, creating a large splash. He lands on the ground beside the pool, lying on the grass. A small black dog runs towards the man, seemingly concerned.",
}
]
}],
"dataset": "Oops",
"task": "video/caption",
"idx": "Oops_0"
},
...
]
```
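
To make the record layout concrete, here is a small sketch that pulls the video spec, prompt, and caption out of one entry and checks the mutual-exclusion rule noted in the comment above. The helper name `parse_record` is hypothetical, and the sample entry reuses the `Oops_0` example with a shortened caption.

```python
import json


def parse_record(record):
    """Return (video spec, prompt, caption) from one metadata entry."""
    video, prompt, caption = None, None, None
    for message in record["messages"]:
        for item in message["content"]:
            if item["type"] == "video":
                video = item["video"]
            elif item["type"] == "text" and message["role"] == "user":
                prompt = item["text"]
            elif item["type"] == "text" and message["role"] == "assistant":
                caption = item["text"]
    if video is None:
        raise ValueError("record has no video item")

    # Time bounds and frame bounds should not both be set.
    # `is not None` is used so that a bound of 0 still counts as set.
    uses_time = any(video.get(k) is not None for k in ("start_time", "end_time"))
    uses_frames = any(video.get(k) is not None for k in ("start_frame", "end_frame"))
    if uses_time and uses_frames:
        raise ValueError("use either time bounds or frame bounds, not both")
    return video, prompt, caption


# One-entry stand-in for a metadata.json file (caption shortened).
records = json.loads("""
[
  {
    "messages": [
      {"role": "user", "content": [
        {"type": "video", "video": {
          "video_file": "Oops/videos/25 Best Trampoline Fail Nominees - FailArmy Hall of Fame (July 2017)11.mp4",
          "start_time": null, "end_time": null,
          "start_frame": null, "end_frame": null}},
        {"type": "text", "text": "Describe the video in detail."}]},
      {"role": "assistant", "content": [
        {"type": "text", "text": "A man is seen jumping on a trampoline in a backyard with a blue above-ground pool and a black shed in the background."}]}
    ],
    "dataset": "Oops",
    "task": "video/caption",
    "idx": "Oops_0"
  }
]
""")

video, prompt, caption = parse_record(records[0])
```

With a real subset, `records` would come from `json.load` on that subset's `metadata.json` instead of the inline string.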
### Tips
- **Recommended subsets**: If downloading and using the complete dataset is too expensive, we recommend the LSMDC, Charades, Charades-Ego, WebVid-10M, TREC-VTT, Oops and TGIF subsets (in that order), which feature more dynamic actions and events.
- **Quick start**: The data format is exactly the same as that of [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL/tree/main/qwen-vl-utils), except for the extra keys (_"start_time"/"end_time"_ and _"start_frame"/"end_frame"_) that control the start and end of the video clip. You can therefore quickly start fine-tuning Qwen2-VL-2B on Tarsier2-Recap-585K with this repository: [finetune-Qwen2-VL](https://github.com/zhangfaen/finetune-Qwen2-VL), a simple implementation of DDP training.
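
For the recommended-subsets tip, a download script might first pick and order the subset directories. The sketch below assumes the directory names match the subset names in the tip, with split subsets carrying a `_part-N` suffix as in the tree under Dataset Structure. The names `RECOMMENDED` and `recommended_dirs` are illustrative.

```python
# Recommended subsets, in the priority order given in the tip.
RECOMMENDED = ["LSMDC", "Charades", "Charades-Ego", "WebVid-10M",
               "TREC-VTT", "Oops", "TGIF"]


def recommended_dirs(subset_dirs):
    """Keep only recommended subset directories, sorted by priority then name."""
    rank = {name: i for i, name in enumerate(RECOMMENDED)}

    def base(dirname):
        # "LSMDC_part-2" -> "LSMDC"; names without a part suffix are unchanged.
        return dirname.split("_part-")[0]

    picked = [d for d in subset_dirs if base(d) in rank]
    return sorted(picked, key=lambda d: (rank[base(d)], d))
```

Sorting by name within a subset keeps `LSMDC_part-1` ahead of `LSMDC_part-2`, so parts of a split subset are fetched in sequence.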
## Citation
If you find this repository useful, please consider citing our paper:
```bibtex
@misc{yuan2025tarsier2advancinglargevisionlanguage,
title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding},
author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
year={2025},
eprint={2501.07888},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.07888},
}
```
Provider:
maas
Created:
2025-02-17



