Vision
Source: DataCite Commons | Updated: 2025-05-12 | Indexed: 2025-05-17
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/S39DQU
Description:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>VISION Dataset</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 40px;
background-color: #f9f9f9;
color: #333;
}
h1 {
color: #004080;
}
h2 {
color: #0066cc;
margin-top: 40px;
}
p {
line-height: 1.6;
}
.highlight {
background-color: #e8f0fe;
padding: 10px;
border-left: 4px solid #007acc;
}
</style>
</head>
<body>
<h1>VISION Dataset</h1>
<p><strong>VISION (Vehicle Identification and Surveillance through Interactive Natural language)</strong> is a benchmark dataset designed for natural language-based vehicle retrieval in real-world surveillance environments.</p>
<h2>Why VISION?</h2>
<div class="highlight">
<p>Traditional vehicle retrieval models rely heavily on preprocessed representations and auxiliary tools, which limits their applicability in real-world surveillance systems.</p>
</div>
<p>VISION enables retrieval directly from raw surveillance video using only a multimodal model, without complex preprocessing pipelines.</p>
<h2>Key Features</h2>
<ul>
<li>~7,000 vehicle clips, 967,705 frames</li>
<li>Collected from the United States, South Korea, and Indonesia</li>
<li>Rich, fine-grained natural language annotations</li>
<li>Context-aware descriptions including vehicle motion and interactions</li>
<li>Greater diversity in road types, weather, and environments</li>
</ul>
<h2>Limitations of Previous Datasets</h2>
<p>The previous benchmark, <strong>CityFlow-NL</strong>, suffered from:</p>
<ul>
<li>Annotation inconsistencies and errors</li>
<li>Overly simplistic descriptions (e.g., “a black sedan going straight”)</li>
<li>Lack of diversity in data (limited to daytime scenes from a single country)</li>
</ul>
<h2>Contribution</h2>
<p>VISION provides a strong foundation for building robust, generalizable retrieval models suitable for complex urban environments and real-time surveillance systems.</p>
<footer style="margin-top:50px; font-size: 0.9em; color: #888;">
&copy; 2025 VISION Dataset Team | For research use only
</footer>
</body>
</html>
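# VISION Dataset
**VISION (Vehicle Identification and Surveillance through Interactive Natural Language)** is a benchmark dataset for natural language-based vehicle retrieval in real-world surveillance settings.
## Why VISION?
Traditional vehicle retrieval models rely heavily on preprocessed representations and auxiliary tools, which limits their applicability in real-world surveillance systems.
VISION enables retrieval directly from raw surveillance video using only a multimodal model, without complex preprocessing pipelines.
## Key Features
- About 7,000 vehicle clips totaling 967,705 frames
- Collected from the United States, South Korea, and Indonesia
- Rich, fine-grained natural language annotations
- Context-aware descriptions covering vehicle motion and interactions
- Diverse road types, weather conditions, and environments
## Limitations of Previous Datasets
The previous benchmark dataset, **CityFlow-NL**, has the following shortcomings:
- Inconsistent and erroneous annotations
- Overly simplistic descriptions (e.g., "a black sedan going straight")
- Insufficient data diversity (daytime scenes only, from a single country)
## Contribution
VISION provides a strong foundation for building robust, generalizable retrieval models suited to complex urban environments and real-time surveillance systems.
© 2025 VISION Dataset Team | For research use only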
Provider:
Harvard Dataverse
Created:
2025-05-12



