
nyu-visionx/VSI-Train-10k

Hugging Face | Updated: 2025-11-07 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/nyu-visionx/VSI-Train-10k
Resource summary:
---
license: apache-2.0
task_categories:
- visual-question-answering
language:
- en
tags:
- Video
- Text
- egocentric
- spatial-reasoning
- training-data
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: vsi_train_10k.parquet
---

# VSI-Train-10k

<a href="http://arxiv.org/abs/2511.04655" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/cs.CV-arXiv:2511.04655-red?logo=arxiv" height="20" /></a>
<a href="https://vision-x-nyu.github.io/test-set-training/" target="_blank"><img alt="Website" src="https://img.shields.io/badge/🌎_Web-test--set--training-blue.svg" height="20" /></a>
<a href="https://hf.co/datasets/nyu-visionx/VSI-Bench" target="_blank"><img alt="HF" src="https://img.shields.io/badge/HF-VSI--Bench_(Debiased)-FED123.svg?style&logo=HuggingFace" height="20" /></a>
<a href="https://github.com/vision-x-nyu/test-set-training" target="_blank"><img alt="GitHub Code" src="https://img.shields.io/badge/Code-vision--x--nyu%2Ftest--set--training-white?&logo=github&logoColor=white" /></a>

**Authors:** &ensp;
<a href="https://ellisbrown.github.io/" target="_blank">Ellis Brown</a>,
<a href="https://jihanyang.github.io/" target="_blank">Jihan Yang</a>,
<a href="https://github.com/vealocia" target="_blank">Shusheng Yang</a>,
<a href="https://cs.nyu.edu/~fergus" target="_blank">Rob Fergus</a>,
<a href="https://www.sainingxie.com/" target="_blank">Saining Xie</a>

---

**VSI-Train-10k** is an in-distribution training dataset of 10,000 question-answer pairs for visual-spatial intelligence tasks from egocentric video. This dataset accompanies the [VSI-Bench](https://hf.co/datasets/nyu-visionx/VSI-Bench) test set and is introduced in our paper: [Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts](http://arxiv.org/abs/2511.04655).

## Dataset Description

VSI-Train-10k comprises 10,000 question-answer pairs generated using the same procedural logic as VSI-Bench, sourced from the **training splits** of ScanNet, ScanNet++, and ARKitScenes datasets (non-overlapping with VSI-Bench test videos). The dataset was created to study how MLLMs exploit statistical shortcuts and biases from in-distribution data.

**Key characteristics:**

- 10,000 QA pairs across 7 rule-based question types (excludes route planning)
- Generated using the same templates and logic as VSI-Bench
- Maximum 20 questions per question type per scene for diversity
- Sourced from training splits only (no overlap with VSI-Bench test data)

## Dataset Structure

### Fields

| Field Name | Description |
| :--------- | :---------- |
| `video` | Path to video file (e.g., `scannet_videos_128f/train/scene0335_02_128f.mp4`) |
| `conversations` | List of conversation turns with `from` (human/gpt) and `value` (message content) |
| `type` | Question format: `mc` (multiple choice) or `oe` (open-ended) |
| `question_type` | High-level task category (e.g., `relative_distance`, `object_counting`) |
| `question_type_detail` | Detailed task subcategory |
| `source` | Video source dataset: `scannet`, `arkitscenes`, or `scannetpp` |
| `question` | Full question text including instructions and options |
| `ground_truth` | Correct answer |

VSI-Train-10k includes 7 question types: object counting, spatial relations (closer/farther), object appearance order, size estimation, and more. Route planning questions are excluded.

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the training dataset
vsi_train = load_dataset("nyu-visionx/VSI-Train-10k")

# Access the data
for example in vsi_train['train']:
    print(example['question'])
    print(example['ground_truth'])
```

### Extracting Video Files

The video files are compressed in `.tar.zst` format using [zstd](http://www.zstd.net/) (much faster than gzip).
To extract all shards in parallel:

```bash
# Install zstd if needed: sudo apt-get install zstd (Ubuntu/Debian) or brew install zstd (macOS)
for shard in vsi_train_shard_*.tar.zst; do
    zstd -d "$shard" -c | tar -xf - &
done
wait
```

## Files

- `vsi_train_10k.parquet`: Parquet file containing dataset annotations optimized for HuggingFace Datasets
- `vsi_train_10k.jsonl`: Raw JSONL file with the same annotations
- `vsi_train_shard_*.tar.zst`: Compressed video files (9 shards total)

## Generation Methodology

VSI-Train-10k was generated following the VSI-Bench curation pipeline:

1. Object numbers, bounding boxes, and room sizes were extracted from the training splits of ScanNet, ScanNet++, and ARKitScenes
2. Question-answer pairs were generated using the same templates as VSI-Bench
3. We created `1430` question-answer pairs per question type, with a maximum of 20 questions per question type per scene
4. All questions maintain in-distribution consistency with VSI-Bench

See the paper for more details.

## Source Data

Videos from the **training splits** of [ScanNet](https://arxiv.org/abs/1702.04405), [ScanNet++](https://arxiv.org/abs/2308.11417), and [ARKitScenes](https://arxiv.org/abs/2111.08897) (non-overlapping with [VSI-Bench](https://huggingface.co/datasets/nyu-visionx/VSI-Bench) test videos).

## Citation

If you use this dataset, please cite our paper:

```bibtex
@article{brown2025benchmark,
  author  = {Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
  title   = {Benchmark Designers Should ``Train on the Test Set'' to Expose Exploitable Non-Visual Shortcuts},
  journal = {arXiv preprint arXiv:2511.04655},
  year    = {2025},
}
```
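
## Checking the Per-Scene Cap

The card states a cap of 20 questions per question type per scene. The following is a minimal sketch for checking that cap on a local copy of the annotations. It is not part of the authors' pipeline. It assumes that each distinct `video` path stands for one scene, which the card does not confirm, and the helper names `count_per_type_and_scene` and `cap_violations` are hypothetical.

```python
from collections import Counter

MAX_PER_TYPE_PER_SCENE = 20  # cap stated in the dataset card


def count_per_type_and_scene(records):
    """Count QA pairs per (question_type, video) pair.

    Assumption: each distinct `video` path stands for one scene.
    """
    return Counter((r["question_type"], r["video"]) for r in records)


def cap_violations(records, cap=MAX_PER_TYPE_PER_SCENE):
    """Return the (question_type, video) groups whose count exceeds `cap`."""
    counts = count_per_type_and_scene(records)
    return {key: n for key, n in counts.items() if n > cap}


# Tiny in-memory stand-in for dataset rows (hypothetical values, not real data).
sample = [
    {"question_type": "object_counting", "video": "a.mp4"},
    {"question_type": "object_counting", "video": "a.mp4"},
    {"question_type": "relative_distance", "video": "a.mp4"},
]

# With an artificially low cap of 1, the duplicated group is reported.
print(cap_violations(sample, cap=1))
```

To run the check on the real annotations, pass the rows of the `train` split (each row is a dict with the fields listed under Dataset Structure) in place of `sample`. An empty result would be consistent with the stated cap under the one-video-per-scene assumption.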
Provider:
nyu-visionx