TYTSTQ/ordinary-bench-subset-ablation
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TYTSTQ/ordinary-bench-subset-ablation
Dataset description:
---
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: data/default/train*.parquet
task_categories:
- visual-question-answering
language:
- en
license: mit
tags:
- spatial-reasoning
- vlm-benchmark
- ordinal-relations
- 3d-scenes
- multi-view
- ablation-study
- subset-sensitivity
size_categories:
- 100K<n<1M
---
# ORDINARY-BENCH Subset Ablation Dataset
An ablation dataset testing whether VLMs are affected by **irrelevant objects** in the scene. For each parent scene (6-10 objects), all C(N,4) four-object subsets are re-rendered, and the **full QRR question bank** is asked — including questions about objects NOT present in the subset image.
> Main benchmark: [TYTSTQ/ordinary-bench](https://huggingface.co/datasets/TYTSTQ/ordinary-bench)
>
> Multi-view version: [TYTSTQ/ordinary-bench-multiview](https://huggingface.co/datasets/TYTSTQ/ordinary-bench-multiview)
>
> Source code: [GitHub - tasd12-ty/ordinary-bench-core](https://github.com/tasd12-ty/ordinary-bench-core)
## Overview
| | |
|---|---|
| Parent scenes | 10 (n06-n10, 2 per complexity level) |
| Subsets | 912 (all C(N,4) combinations) |
| Questions per subset | Full master QRR bank from parent scene |
| Total questions | 624,963 |
| Answerable | 13,543 (2.2%) — all 4 referenced objects present |
| N/A (refusal expected) | 611,420 (97.8%) — ≥1 referenced object missing |
| Images per subset | 5 (1 single-view + 4 multi-view) |
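The small answerable share follows from counting. The sketch below computes the fraction of four-object subsets that contain every object a question references. It assumes that a `disjoint` question references 4 distinct objects and a `shared_anchor` question references 3; the card does not state the latter explicitly, and `answerable_fraction` is a hypothetical helper, not part of the source repo.

```python
from math import comb

def answerable_fraction(n_parent, k_referenced, subset_size=4):
    """Fraction of subsets of a parent scene that contain all k referenced objects."""
    if k_referenced > subset_size:
        return 0.0
    # Fix the k referenced objects, then choose the remaining slots freely.
    containing = comb(n_parent - k_referenced, subset_size - k_referenced)
    return containing / comb(n_parent, subset_size)

for n in range(6, 11):
    four_obj = answerable_fraction(n, 4)   # assumed: disjoint variant
    three_obj = answerable_fraction(n, 3)  # assumed: shared_anchor variant
    print(f"N={n}: 4-object {four_obj:.4f}, 3-object {three_obj:.4f}")
```

Under these assumptions a 4-object question is answerable in exactly one subset of its parent scene, and a 3-object question in N-3 of them, so the fraction shrinks quickly as N grows.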
## Experimental Design
1. **Parent scenes**: 10 test scenes with 6-10 objects each
2. **Subset enumeration**: All C(N,4) four-object subsets (same positions, camera unchanged)
3. **Re-rendering**: Each subset is rendered with only 4 objects (Blender, same camera angles)
4. **Master QRR bank**: All pairwise distance comparisons from the parent scene (disjoint + shared_anchor + FDR decomposition)
5. **Question assignment**: Each question is labeled `answerable` (all referenced objects present) or N/A (≥1 missing)
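Steps 2 and 5 can be sketched in a few lines. This is a minimal illustration, not the code in `enumerate_subsets.py` or `assign_subset_questions.py`; the function names, the object IDs, and the example question are all hypothetical.

```python
from itertools import combinations

def enumerate_subsets(object_ids, k=4):
    """Return every k-object subset of a parent scene, in a stable order."""
    return [frozenset(c) for c in combinations(sorted(object_ids), k)]

def label_question(referenced, subset):
    """Label one master-bank question against one subset."""
    missing = sorted(set(referenced) - subset)
    return {"answerable": not missing, "missing_objects": missing}

# Hypothetical 6-object parent scene.
parent = [f"obj_{i}" for i in range(6)]
subsets = enumerate_subsets(parent)

# Hypothetical disjoint question comparing d(obj_0, obj_1) with d(obj_2, obj_3).
question = ["obj_0", "obj_1", "obj_2", "obj_3"]
labels = [label_question(question, s) for s in subsets]
n_answerable = sum(label["answerable"] for label in labels)
print(len(subsets), n_answerable)
```

A question is answerable only when the set difference is empty, which is why most rows end up labeled N/A.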
### Key insight
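When aggregating across subsets of the same parent scene, N/A rows can be **ignored**: only predictions on answerable rows enter the aggregate. Because every subset is asked the same master question bank, a question that is answerable in more than one subset can be compared across those subsets, which enables cross-subset consistency analysis.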
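One way to run such an analysis is sketched below. It groups answerable predictions by `(parent_scene_id, qid)` and reports the share that agree with the most common answer. The `prediction` field and the agreement metric are assumptions for illustration; the dataset ships no model outputs, and `analyze_results.py` may compute something different.

```python
from collections import Counter, defaultdict

def cross_subset_consistency(records):
    """Map (parent_scene_id, qid) to the share of answerable predictions
    that match the most common prediction for that question."""
    groups = defaultdict(list)
    for r in records:
        if r["answerable"]:  # N/A rows are skipped entirely
            groups[(r["parent_scene_id"], r["qid"])].append(r["prediction"])
    result = {}
    for key, preds in groups.items():
        top_count = Counter(preds).most_common(1)[0][1]
        result[key] = top_count / len(preds)
    return result

# Hypothetical model outputs for one parent scene.
records = [
    {"parent_scene_id": "n06_000080", "qid": "mqrr_0001", "answerable": True, "prediction": "<"},
    {"parent_scene_id": "n06_000080", "qid": "mqrr_0001", "answerable": True, "prediction": "<"},
    {"parent_scene_id": "n06_000080", "qid": "mqrr_0001", "answerable": True, "prediction": ">"},
    {"parent_scene_id": "n06_000080", "qid": "mqrr_0001", "answerable": False, "prediction": "N/A"},
    {"parent_scene_id": "n06_000080", "qid": "mqrr_0002", "answerable": True, "prediction": "~="},
]
consistency = cross_subset_consistency(records)
```

A question that is answerable in only one subset always scores 1.0 under this metric, so it is informative only for questions that recur as answerable in several subsets.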
## Quick Start
```python
from datasets import load_dataset

ds = load_dataset("TYTSTQ/ordinary-bench-subset-ablation", split="train")
sample = ds[0]

# Images
sample["single_view"]      # PIL Image (single-view, 480x320)
sample["view_0"]           # PIL Image (multi-view camera 0)
sample["view_1"]           # PIL Image (multi-view camera 1)
sample["view_2"]           # PIL Image (multi-view camera 2)
sample["view_3"]           # PIL Image (multi-view camera 3)

# Labels and metadata
sample["answerable"]       # True/False
sample["missing_objects"]  # JSON string: [] or ["obj_5", "obj_7"]
sample["gt_comparator"]    # "<", "~=", or ">"
sample["parent_scene_id"]  # "n10_000082"

# Filter to answerable questions only
answerable = ds.filter(lambda x: x["answerable"])

# Filter by parent scene
parent_subset = ds.filter(lambda x: x["parent_scene_id"] == "n10_000082")
```
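Several columns hold JSON-encoded strings rather than native lists, so they need decoding before use. The helper below is a small sketch; `decode_row` is a hypothetical name, and the example row (including the shape of `pair1` and `pair2`) is made up for illustration rather than copied from the dataset.

```python
import json

JSON_COLUMNS = ["objects", "all_objects_in_parent", "missing_objects", "pair1", "pair2"]

def decode_row(row):
    """Return a copy of the row with its JSON string columns parsed."""
    decoded = dict(row)
    for col in JSON_COLUMNS:
        if isinstance(decoded.get(col), str):
            decoded[col] = json.loads(decoded[col])
    return decoded

# Hypothetical row; real rows also carry image and metadata columns.
row = {
    "qid": "mqrr_0001",
    "answerable": False,
    "missing_objects": '["obj_5", "obj_7"]',
    "pair1": '["obj_1", "obj_5"]',  # assumed pair encoding
    "pair2": '["obj_2", "obj_7"]',  # assumed pair encoding
}
decoded = decode_row(row)
print(decoded["missing_objects"])
```

Columns that are absent or already parsed are left untouched, so the helper can be applied to partial rows.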
## Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `scene_id` | string | Subset scene ID, e.g., `n10_000082__s0042` |
| `parent_scene_id` | string | Parent scene ID, e.g., `n10_000082` |
| `n_objects` | int | Objects in subset (always 4) |
| `n_objects_parent` | int | Objects in parent scene (6-10) |
| `single_view` | Image | Single-view render (480x320 PNG) |
| `view_0` | Image | Multi-view camera 0 (az=45°) |
| `view_1` | Image | Multi-view camera 1 (az=135°) |
| `view_2` | Image | Multi-view camera 2 (az=225°) |
| `view_3` | Image | Multi-view camera 3 (az=315°) |
| `objects` | string | JSON: objects visible in this subset |
| `all_objects_in_parent` | string | JSON: all objects in parent scene |
| `qid` | string | Question ID from master bank, e.g., `mqrr_0001` |
| `question_type` | string | Always `qrr` |
| `variant` | string | `disjoint` or `shared_anchor` |
| `answerable` | bool | True if all referenced objects are present |
| `missing_objects` | string | JSON list of object IDs not in subset |
| `gt_comparator` | string | Ground truth: `<`, `~=`, or `>` |
| `pair1` | string | JSON: first object pair |
| `pair2` | string | JSON: second object pair |
| `anchor` | string | Anchor object (shared_anchor variant) |
| `source` | string | `enumerate_qrr` or `fdr_decomposition` |
| `source_fdr_qid` | string | Original FDR question ID (if decomposed) |
## Parent Scenes
| Parent | N objects | C(N,4) subsets | Master QRR questions |
|--------|-----------|----------------|---------------------|
| n06_000080 | 6 | 15 | — |
| n06_000083 | 6 | 15 | — |
| n07_000087 | 7 | 35 | — |
| n07_000088 | 7 | 35 | — |
| n08_000084 | 8 | 70 | — |
| n08_000087 | 8 | 70 | — |
| n09_000083 | 9 | 126 | — |
| n09_000097 | 9 | 126 | — |
| n10_000082 | 10 | 210 | — |
| n10_000098 | 10 | 210 | — |
| **Total** | | **912** | **624,963** |
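The subset counts in the table can be reproduced directly from the parent scene sizes with the binomial coefficient, as this short check shows:

```python
from math import comb

# Parent scene IDs and object counts, taken from the table above.
parent_sizes = {
    "n06_000080": 6, "n06_000083": 6,
    "n07_000087": 7, "n07_000088": 7,
    "n08_000084": 8, "n08_000087": 8,
    "n09_000083": 9, "n09_000097": 9,
    "n10_000082": 10, "n10_000098": 10,
}

subsets_per_parent = {pid: comb(n, 4) for pid, n in parent_sizes.items()}
total_subsets = sum(subsets_per_parent.values())
print(total_subsets)  # 912
```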
## Source Code
The full subset ablation pipeline is at `experiments/subset_ablation/` in the source repo:
| Script | Purpose |
|--------|---------|
| `enumerate_subsets.py` | C(N,4) subset enumeration |
| `render_subsets.py` | Batch rendering (single + multi-view) |
| `generate_master_questions.py` | Full QRR bank (incl. FDR decomposition) |
| `assign_subset_questions.py` | Per-subset question assignment + N/A labels |
| `run_subset_eval.py` | VLM evaluation with N/A support |
| `analyze_results.py` | Cross-subset consistency analysis |
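For readers who want to score their own model outputs without the repo, the sketch below shows one plausible scoring rule with N/A support: an answerable row is correct when the prediction equals `gt_comparator`, and an N/A row is correct when the model refuses. The `prediction` field, the `"N/A"` refusal token, and both function names are assumptions; `run_subset_eval.py` may use a different protocol.

```python
def score_prediction(prediction, answerable, gt_comparator, refusal_token="N/A"):
    """Return True when the prediction matches what the row expects."""
    expected = gt_comparator if answerable else refusal_token
    return prediction.strip() == expected

def summarize(rows):
    """Report accuracy separately for answerable and N/A rows."""
    totals = {"answerable": [0, 0], "na": [0, 0]}  # [correct, count]
    for r in rows:
        bucket = "answerable" if r["answerable"] else "na"
        correct = score_prediction(r["prediction"], r["answerable"], r["gt_comparator"])
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in totals.items()}

# Hypothetical predictions.
rows = [
    {"answerable": True, "gt_comparator": "<", "prediction": "<"},
    {"answerable": True, "gt_comparator": "<", "prediction": ">"},
    {"answerable": False, "gt_comparator": "<", "prediction": "N/A"},
    {"answerable": False, "gt_comparator": "<", "prediction": "<"},
]
summary = summarize(rows)
```

Reporting the two buckets separately matters here because N/A rows make up 97.8% of the data and would otherwise dominate a single accuracy figure.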
## License
MIT
Provider:
TYTSTQ



