PaulineLi/QuantiPhy

Name: PaulineLi/QuantiPhy
Creator: PaulineLi
Published: 2026-04-01 07:33:50
License: no description available

Hugging Face | Updated 2026-04-01 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/PaulineLi/QuantiPhy

Resource description:

---
license: cc-by-4.0
task_categories:
- video-text-to-text
- visual-question-answering
language:
- en
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: test
    path: test_dataset.parquet
---

# QuantiPhy

## Dataset Summary

**QuantiPhy** is a benchmark for evaluating whether vision–language models (VLMs) can perform **quantitative physical inference** from visual evidence, rather than producing plausible but ungrounded numerical guesses.

This repository contains the **official test set** of the QuantiPhy benchmark, consisting of **3,373 video–question (QA) pairs** across **556 videos**. Ground-truth answers are withheld to ensure fair evaluation.

Each instance requires a model to output a **single continuous numerical value** (e.g., object size, velocity, or acceleration) in real-world units, given a short video and a natural-language question.

> **Looking for the validation set?**
> A 159-sample validation split with ground-truth answers is available at [PaulineLi/QuantiPhy-validation](https://huggingface.co/datasets/PaulineLi/QuantiPhy-validation) for model development, prompt tuning, and ablation studies.

---

## Supported Tasks

- **Video-based numerical regression**
- **Quantitative visual reasoning**
- **Vision–language model evaluation**

Tasks cover three core kinematic properties:

- **Size**
- **Velocity**
- **Acceleration**

All questions are **open-ended** and require predicting a real-valued scalar.
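Because every answer is a single real-valued scalar, free-form model output usually has to be reduced to one number before it can be scored. The sketch below is one illustrative way to do that with a regular expression. The helper name `extract_scalar`, the choice to keep the last number in the text, and the rule that skips unit exponents such as the `2` in `m/s^2` are all assumptions made for this example; this is not the benchmark's official parsing or scoring protocol.

```python
import re
from typing import Optional

# Integers, decimals, and scientific notation ("3", "-0.25", "1.2e-3").
# The lookbehind skips digits glued to a letter, caret, dot, or sign, so the
# exponent in "m/s^2" or "m/s2" is not mistaken for the answer.
# Limitation: thousands separators such as "1,200" are not handled.
_NUMBER = re.compile(r"(?<![\^\w.+-])[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def extract_scalar(text: str) -> Optional[float]:
    """Return the last standalone number in `text`, or None if there is none.

    Keeping the last match is an assumption: models often restate the
    final answer at the end of their reasoning.
    """
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else None
```

A call such as `extract_scalar("Acceleration is 9.8 m/s^2")` keeps `9.8` and ignores the exponent. Prompting the model for a fixed answer format would make this step more reliable than any regex heuristic.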
---

## Dataset Structure

Each instance is represented as a structured video–text record with the following fields:

| Field | Description |
|---|---|
| `video_id` | Unique identifier for the video (maps to `<video_id>.mp4` in the video folders) |
| `video_source` | Data source (`simulation`, `lab`, `internet`, or `segmentation`) |
| `video_type` | Four-character code encoding task configuration (see below) |
| `fps` | Frame rate of the video |
| `inference_type` | Prior/target configuration: `SS`, `SD`, `DS`, or `DD` |
| `question` | Natural-language question with explicit physical units |
| `prior` | Physical prior provided in world units (e.g., object size, velocity, or acceleration) |
| `depth_info` | Depth/distance information for 3D configurations (null for 2D tasks) |

Videos are short (typically **2–3 seconds**) and recorded with a **static camera** to ensure well-defined kinematic inference.

### Video Type Code

Each `video_type` is a 4-character code `[P][D][O][B]`:

| Position | Meaning | Values |
|---|---|---|
| **P** | Physical prior | S = Size, V = Velocity, A = Acceleration |
| **D** | Dimensionality | 2 = 2D (planar), 3 = 3D (with depth) |
| **O** | Object setting | S = Single-object, M = Multi-object |
| **B** | Background | X = Plain, S = Simple, C = Complex |

### Inference Type

| Code | Prior | Target | Description |
|---|---|---|---|
| `SS` | Static | Static | Infer a static quantity from a static prior |
| `SD` | Static | Dynamic | Infer a dynamic quantity from a static prior |
| `DS` | Dynamic | Static | Infer a static quantity from a dynamic prior |
| `DD` | Dynamic | Dynamic | Infer a dynamic quantity from a dynamic prior |

---

## Task Design Overview

Each instance provides the model with:

- a short video depicting object motion, and
- **one physical prior** in world units (object size, velocity at a given timestamp, or acceleration at a given timestamp).
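The `video_type` and `inference_type` codes defined under Dataset Structure can be decoded mechanically, which is handy when grouping results by task configuration. The following is a minimal sketch that transcribes the two tables into lookup dictionaries. The names `VideoType`, `decode_video_type`, and `decode_inference_type` are hypothetical, and the code assumes the fields use the uppercase letters exactly as shown in the tables.

```python
from dataclasses import dataclass

# Lookup tables transcribed from the Video Type Code table.
PRIOR = {"S": "size", "V": "velocity", "A": "acceleration"}
DIMENSIONALITY = {"2": "2D", "3": "3D"}
OBJECT_SETTING = {"S": "single-object", "M": "multi-object"}
BACKGROUND = {"X": "plain", "S": "simple", "C": "complex"}
# Inference-type letters: S = static, D = dynamic.
KIND = {"S": "static", "D": "dynamic"}


@dataclass(frozen=True)
class VideoType:
    prior: str
    dimensionality: str
    object_setting: str
    background: str


def decode_video_type(code: str) -> VideoType:
    """Decode a 4-character [P][D][O][B] code; raise ValueError on bad input."""
    if len(code) != 4:
        raise ValueError(f"expected 4 characters, got {code!r}")
    p, d, o, b = code
    try:
        return VideoType(PRIOR[p], DIMENSIONALITY[d], OBJECT_SETTING[o], BACKGROUND[b])
    except KeyError as exc:
        raise ValueError(f"unknown character {exc.args[0]!r} in {code!r}") from None


def decode_inference_type(code: str) -> tuple[str, str]:
    """Decode an inference type such as "SD" into (prior kind, target kind)."""
    if len(code) != 2 or any(c not in KIND for c in code):
        raise ValueError(f"unknown inference type {code!r}")
    return KIND[code[0]], KIND[code[1]]
```

Under these assumptions, `decode_video_type("V3MC")` describes a velocity prior in a 3D, multi-object scene with a complex background, and `decode_inference_type("SD")` yields a static prior with a dynamic target.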
The model is then asked to infer a target kinematic quantity, possibly for a different object, expressed in real-world units.

Tasks vary along four axes:

1. **Physical prior**: Size (S), Velocity (V), Acceleration (A)
2. **Dimensionality**: 2D (planar motion) or 3D (with depth variation)
3. **Object setting**: Single-object (S) or multi-object (M)
4. **Background complexity**: Plain (X), Simple (S), Complex (C)

---

## Dataset Statistics

| Statistic | Count |
|---|---|
| QA pairs | 3,373 |
| Unique videos | 556 |
| Video types | 37 (18 × 2D + 19 × 3D) |

**By source:**

| Source | QA pairs |
|---|---|
| Simulation | 1,633 |
| Lab | 811 |
| Internet | 547 |
| Segmentation | 382 |

**By inference type:**

| Type | QA pairs |
|---|---|
| DS (Dynamic → Static) | 1,689 |
| SS (Static → Static) | 649 |
| SD (Static → Dynamic) | 585 |
| DD (Dynamic → Dynamic) | 450 |

---

## Videos

Videos are provided in two resolutions:

- `quantiphy_fullset_videos/`: original resolution
- `quantiphy_fullset_videos_480p/`: 480p (for faster download and lower-resolution evaluation)

Each video filename corresponds to the `video_id` field in the dataset (e.g., `simulation_0007.mp4`).

---

## Usage

```python
from datasets import load_dataset

# Load the official test split (ground-truth answers are withheld).
ds = load_dataset("PaulineLi/QuantiPhy", split="test")
print(ds[0])
# {'video_id': 'simulation_0007', 'video_source': 'simulation', ...}
```

For the **validation set** with ground-truth answers:

```python
from datasets import load_dataset

ds_val = load_dataset("PaulineLi/QuantiPhy-validation", split="validation")
```

---

## Data Sources and Quality Control

- **Simulation**: Blender-rendered scenes with precise physical ground truth.
- **Laboratory capture**: Real-world recordings using calibrated depth and multi-view setups.
- **Internet / author-recorded videos**: Carefully curated monocular videos meeting strict physical constraints.
- **Segmentation**: Videos with segmented objects for controlled evaluation.
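Whatever the source, each record is tied to its clip through the `video_id` convention described under Videos. A minimal sketch of that mapping follows. The function name `video_path` and the `root` argument (a local directory holding a downloaded copy of the repository) are hypothetical, and the sketch assumes the `.mp4` files sit directly inside the two listed folders with no subdirectories.

```python
from pathlib import Path

# Folder names as listed in the Videos section.
FULL_RES_DIR = "quantiphy_fullset_videos"
LOW_RES_DIR = "quantiphy_fullset_videos_480p"


def video_path(root: str, video_id: str, low_res: bool = False) -> Path:
    """Build the expected path of `<video_id>.mp4` under a local dataset copy."""
    folder = LOW_RES_DIR if low_res else FULL_RES_DIR
    return Path(root) / folder / f"{video_id}.mp4"
```

For a record `row` from the loaded split, `video_path("data", row["video_id"], low_res=True)` would point at the 480p clip, which may be preferable when bandwidth or model input resolution is limited.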
All videos undergo manual review to remove:

- excessive motion blur,
- severe occlusion,
- untrackable motion,
- personally identifiable information (PII).

---

## License

The **annotations and metadata** in this repository are released under the **Creative Commons Attribution 4.0 (CC BY 4.0)** license.

Videos originate from simulated environments, laboratory recordings, and publicly available sources. Each video remains subject to its original license and terms of use.

This release is intended for **research and evaluation purposes**.

---

## Authors

**Puyin Li\***, **Tiange Xiang\***, **Ella Mao\***, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-Fei†, Ehsan Adeli†

\* Equal contribution.

---

## Citation

If you use QuantiPhy in your work, please cite:

```bibtex
@article{li2025quantiphy,
  title   = {QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models},
  author  = {Li, Puyin and Xiang, Tiange and Mao, Ella and Wei, Shirley and Chen, Xinye and Masood, Adnan and Li, Fei-Fei and Adeli, Ehsan},
  journal = {arXiv preprint arXiv:2512.19526},
  year    = {2025}
}
```

Provided by:

PaulineLi
