tiiuae/PBench

Source: Hugging Face; updated 2026-04-01; indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/tiiuae/PBench
Description:
---
task_categories:
- image-segmentation
language:
- en
tags:
- referring-expression-segmentation
- perception
- segmentation
- benchmark
pretty_name: PBench
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: image
    dtype: image
  - name: expression
    dtype: string
  - name: masks
    list:
    - name: size
      list: int64
    - name: counts
      dtype: string
  - name: count
    dtype: int32
  - name: id
    dtype: int64
  splits:
  - name: level_0
    num_bytes: 5534494360
    num_examples: 1930
  - name: level_1
    num_bytes: 2783412551
    num_examples: 1108
  - name: level_2
    num_bytes: 3352253137
    num_examples: 969
  - name: level_3
    num_bytes: 3609353809
    num_examples: 1089
  - name: level_4
    num_bytes: 2751869956
    num_examples: 861
  - name: dense
    num_bytes: 677685577
    num_examples: 381
  download_size: 18678114659
  dataset_size: 18709069390
configs:
- config_name: default
  data_files:
  - split: level_0
    path: data/level_0-*
  - split: level_1
    path: data/level_1-*
  - split: level_2
    path: data/level_2-*
  - split: level_3
    path: data/level_3-*
  - split: level_4
    path: data/level_4-*
  - split: dense
    path: data/dense-*
---

# PBench: A Perception Benchmark for Referring Expression Segmentation

Referring expression segmentation is increasingly becoming a core building block for real products and real workflows. People want to select objects using language inside creative tools, robots need to ground instructions like "pick up the blue wrench" to pixels, and video analytics systems are starting to support natural language queries like "track the red van" instead of fixed class lists.

As models improve, a simple frustration remains: a single aggregate score can look good while hiding very different failure modes. Some models handle clean object categories well but struggle when the expression depends on reading text. Others can follow attributes but break on spatial layout. When those skills are averaged together, it becomes hard to see what a model has actually learned, and what it still misses.
PBench is designed to make that diagnostic step easier. We propose a multi-level referring expression segmentation benchmark that evaluates vision-language perception across a structured hierarchy of skills. PBench assigns each sample to a **single complexity level** by construction, so you can measure performance per skill and compare models in a way that is easy to interpret.

This release contains **6,338 samples** with **83,977 instance masks** across **5,090 unique expressions**, organized into five levels plus a dense split.

## What PBench measures

PBench organizes expressions into five levels plus a dense split. Each level is meant to isolate a primary perceptual capability. By design, a sample belongs to exactly one level, so a per-level score is a closer proxy for the underlying skill than an aggregate score over mixed phenomena.

<p align="center">
  <img src="level_progression.png" width="90%"/>
</p>

### Level 0: General object classes

The foundation level tests basic object recognition and mask quality on common object categories. Expressions are short noun phrases like `car`, `person`, `tree`. The average expression length is 1.3 words. What typically breaks here is not language, but vision: boundary quality, small instances, and partial occlusion.

### Level 1: Fine-grained attributes and subtypes

Level 1 adds descriptive detail that forces the model to use attributes or subtype distinctions. Expressions include properties (color, size, material), conditions (old, broken, dirty), subtypes (sedan vs SUV), states (open door), and components (cracked windshield). Typical expressions look like `red car` or `dirty white pickup truck`. The average expression length is 3.9 words. These samples often fail when the attribute is subtle, the object is partially visible, or multiple candidates share most attributes.

### Level 2: Text as identifier (OCR)

Level 2 tests whether a model can use in-image text to identify or disambiguate an object.
Expressions reference brands and product variants (`Diet Coke`), store-specific items (`Starbucks coffee cup`), or visible signage (`Emergency exit door`). The average expression length is 3.7 words. The key failure mode is simple: if the model does not read the text reliably, it will often choose a plausible but wrong instance.

### Level 3: Spatial relationships and layout

Level 3 focuses on spatial reasoning. Expressions specify objects by relative position or scene layout: `car on the left`, `bird above the tree`, `third window from left`, `people inside the building`. This level has the longest expressions on average at 6.7 words, which mostly reflects how people naturally describe spatial constraints. Models often struggle when the reference frame is ambiguous (left of what), when depth cues are subtle, or when small spatial errors change the identity of the target.

### Level 4: Relationships and interactions

Level 4 targets relational understanding between objects. Expressions describe actions (`person holding umbrella`), functional links (`key for door`), comparisons (`tallest building in the row`), and physical interactions (`book resting on table`). The average expression length is 4.6 words. These samples are often hard because the correct mask depends on understanding who is interacting with what, not just what objects are present.

### Dense split: Dense instance segmentation

The dense split uses simple object-class expressions similar to Level 0, but in visually crowded scenes containing many instances of the same class. With an average of 181 masks per sample (up to 679), this split stress-tests whether a model can segment instances exhaustively rather than picking a single easy match.
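Because each sample sits in exactly one split, per-level reporting reduces to grouping per-sample scores by split name. The following is a minimal sketch of that grouping, not an official evaluation script: the `per_split_report` helper, the `results` format, and the toy scores are all hypothetical, and the metric that produces each score is left to you.

```python
from collections import defaultdict
from statistics import mean

# Split names as listed in the dataset card.
SPLITS = ["level_0", "level_1", "level_2", "level_3", "level_4", "dense"]


def per_split_report(results):
    """Group per-sample scores by split and average them.

    `results` is a hypothetical list of (split_name, score) pairs produced by
    whatever metric you choose; it is not part of the PBench release.
    """
    by_split = defaultdict(list)
    for split, score in results:
        if split not in SPLITS:
            raise ValueError(f"unknown split: {split}")
        by_split[split].append(score)
    # Report only splits that actually have scores, in card order.
    return {s: mean(by_split[s]) for s in SPLITS if by_split[s]}


# Made-up scores for illustration only (not real model results).
toy_results = [
    ("level_0", 0.9), ("level_0", 0.7),
    ("level_2", 0.4),
    ("level_3", 0.5), ("level_3", 0.3),
]
report = per_split_report(toy_results)
for split, score in report.items():
    print(f"{split}: {score:.2f}")
```

Keeping the per-split means separate, rather than averaging them again into one number, preserves the diagnostic view the levels are meant to give.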
## Dataset Statistics

| Split | Capability | Samples | Total Masks | Unique Expressions |
|---|---|---|---|---|
| level_0 | General object classes | 1,930 | 7,543 | 939 |
| level_1 | Attributes & subtypes | 1,108 | 2,740 | 1,085 |
| level_2 | Text / OCR | 969 | 1,361 | 954 |
| level_3 | Spatial relationships | 1,089 | 1,751 | 1,083 |
| level_4 | Relationships & interactions | 861 | 1,731 | 839 |
| dense | Dense instance segmentation | 381 | 68,851 | 264 |
| **Total** | | **6,338** | **83,977** | **5,090** |

## Schema

Each sample contains the following fields:

| Field | Type | Description |
|---|---|---|
| `id` | `int` | Global ID in range 0 to 6,634 (not necessarily contiguous) |
| `image` | `Image` | RGB image (original resolution) |
| `expression` | `string` | Referring expression |
| `masks` | `[{size: [H,W], counts: str}]` | List of COCO RLE segmentation masks |
| `count` | `int` | Number of masks for this expression |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("tiiuae/PBench")

# Access a specific level
sample = ds["level_0"][0]
print(sample["expression"])  # e.g. "car"
print(sample["count"])       # number of masks

# Decode masks with pycocotools
from pycocotools import mask as mask_utils
import numpy as np

for m in sample["masks"]:
    rle = {"size": m["size"], "counts": m["counts"].encode("utf-8")}
    binary_mask = mask_utils.decode(rle)  # H x W numpy array (0/1)
```

## Notes and limitations

- **Ambiguity is real**: some expressions can be underspecified without additional context. We keep the level assignment focused on the dominant skill, but natural language still has edge cases.
- **OCR noise**: for Level 2, visual text quality, font, and occlusion can dominate difficulty, even when the rest of the scene is simple.
- **Spatial reference frames**: for Level 3, phrases like left, right, and in front of can be sensitive to camera viewpoint and to what the model treats as the reference object.
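Once masks are decoded to binary arrays, as in the Usage section, one common way to score a prediction is intersection over union (IoU) against the union of the ground-truth instance masks. This card does not name an official metric, so treat the sketch below as an assumption for illustration: `union_mask` and `iou` are hypothetical helpers, the toy arrays stand in for decoded masks, and returning 1.0 when both masks are empty is a convention chosen here, not one taken from PBench.

```python
import numpy as np


def union_mask(instance_masks, height, width):
    """Merge a list of H x W binary instance masks into one foreground mask."""
    merged = np.zeros((height, width), dtype=bool)
    for m in instance_masks:
        merged |= np.asarray(m, dtype=bool)
    return merged


def iou(pred, gt):
    """Intersection over union of two binary masks.

    Returns 1.0 when both masks are empty (an assumed convention).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


# Toy 4 x 4 arrays standing in for decoded ground-truth instance masks.
gt_a = np.zeros((4, 4), dtype=np.uint8)
gt_a[0:2, 0:2] = 1  # top-left 2 x 2 block, 4 pixels
gt_b = np.zeros((4, 4), dtype=np.uint8)
gt_b[2:4, 2:4] = 1  # bottom-right 2 x 2 block, 4 pixels
gt = union_mask([gt_a, gt_b], 4, 4)

# Toy prediction: the top two rows, which cover gt_a but miss gt_b.
pred = np.zeros((4, 4), dtype=np.uint8)
pred[0:2, 0:4] = 1

score = iou(pred, gt)
print(f"IoU: {score:.3f}")
```

A union-level IoU does not check that instances are separated correctly, so for the dense split an instance-matching metric may be more informative.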
## Citation

If you use PBench in your work, please cite:

```bibtex
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
```
Provider:
tiiuae