HumanRef
Source: ModelScope community (updated 2025-12-05, indexed 2025-11-03)
Download link:
https://modelscope.cn/datasets/IDEA-Research/HumanRef
<div align=center>
<img src="assets/teaser.jpg" width=800 >
</div>
**This repository only contains the HumanRef Benchmark and the evaluation code.**
# 1. Introduction
HumanRef is a large-scale human-centric referring expression dataset designed for multi-instance human referring in natural scenes. Unlike traditional referring datasets that focus on one-to-one object referring, HumanRef supports referring to multiple individuals simultaneously through natural language descriptions.
Key features of HumanRef include:
- **Multi-Instance Referring**: A single referring expression can correspond to multiple individuals, better reflecting real-world scenarios
- **Diverse Referring Types**: Covers 6 major types of referring expressions:
  - Attribute-based (e.g., gender, age, clothing)
  - Position-based (relative positions between humans or with environment)
  - Interaction-based (human-human or human-environment interactions)
  - Reasoning-based (complex logical combinations)
  - Celebrity Recognition
  - Rejection Cases (non-existent references)
- **High-Quality Data**:
  - 34,806 high-resolution images (>1000×1000 pixels)
  - 103,028 referring expressions in training set
  - 6,000 carefully curated expressions in benchmark set
  - Average 8.6 persons per image
  - Average 2.2 target boxes per referring expression
The dataset aims to advance research in human-centric visual understanding and referring expression comprehension in complex, multi-person scenarios.
# 2. Statistics
## HumanRef Dataset Statistics
| Type | Attribute | Position | Interaction | Reasoning | Celebrity | Rejection | Total |
|------|-----------|----------|-------------|-----------|-----------|-----------|--------|
| **HumanRef Train** |
| Images | 8,614 | 7,577 | 1,632 | 4,474 | 4,990 | 7,519 | 34,806 |
| Referrings | 52,513 | 22,496 | 2,911 | 6,808 | 4,990 | 13,310 | 103,028 |
| Avg. boxes/ref | 2.9 | 1.9 | 3.1 | 3.0 | 1.0 | 0 | 2.2 |
| **HumanRef Benchmark** |
| Images | 838 | 972 | 940 | 982 | 1,000 | 1,000 | 5,732 |
| Referrings | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 | 6,000 |
| Avg. boxes/ref | 2.8 | 2.1 | 2.1 | 2.7 | 1.1 | 0 | 2.2 |
## Comparison with Existing Datasets
| Dataset | Images | Refs | Vocabs | Avg. Size | Avg. Person/Image | Avg. Words/Ref | Avg. Boxes/Ref |
|---------|--------|------|---------|-----------|------------------|----------------|----------------|
| RefCOCO | 1,519 | 10,771 | 1,874 | 593x484 | 5.72 | 3.43 | 1 |
| RefCOCO+ | 1,519 | 10,908 | 2,288 | 592x484 | 5.72 | 3.34 | 1 |
| RefCOCOg | 1,521 | 5,253 | 2,479 | 585x480 | 2.73 | 9.07 | 1 |
| HumanRef | 5,732 | 6,000 | 2,714 | 1432x1074 | 8.60 | 6.69 | 2.2 |
Note: For a fair comparison, the statistics for RefCOCO/+/g only include human-referring cases.
## Distribution Visualization
<div align=center>
<img src="assets/distribution.jpg" width=600 >
</div>
# 3. Usage
## 3.1 Visualization
The HumanRef Benchmark contains 6 domains, and each domain may have multiple sub-domains.
| Domain | Subdomain | Num Referrings |
|--------|-----------|--------|
| attribute | 1000_attribute_retranslated_with_mask | 1000 |
| position | 500_inner_position_data_with_mask | 500 |
| position | 500_outer_position_data_with_mask | 500 |
| celebrity | 1000_celebrity_data_with_mask | 1000 |
| interaction | 500_inner_interaction_data_with_mask | 500 |
| interaction | 500_outer_interaction_data_with_mask | 500 |
| reasoning | 229_outer_position_two_stage_with_mask | 229 |
| reasoning | 271_positive_then_negative_reasoning_with_mask | 271 |
| reasoning | 500_inner_position_two_stage_with_mask | 500 |
| rejection | 1000_rejection_referring_with_mask | 1000 |
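As a quick sanity check on the table above, the sub-domain counts can be summed per domain. This is a minimal sketch that hard-codes the numbers from the table; it does not read `annotations.jsonl`, whose schema is not described here, and the names `SUBDOMAINS` and `referrings_per_domain` are illustrative only.

```python
from collections import defaultdict

# (domain, subdomain, num_referrings), copied from the table above.
SUBDOMAINS = [
    ("attribute", "1000_attribute_retranslated_with_mask", 1000),
    ("position", "500_inner_position_data_with_mask", 500),
    ("position", "500_outer_position_data_with_mask", 500),
    ("celebrity", "1000_celebrity_data_with_mask", 1000),
    ("interaction", "500_inner_interaction_data_with_mask", 500),
    ("interaction", "500_outer_interaction_data_with_mask", 500),
    ("reasoning", "229_outer_position_two_stage_with_mask", 229),
    ("reasoning", "271_positive_then_negative_reasoning_with_mask", 271),
    ("reasoning", "500_inner_position_two_stage_with_mask", 500),
    ("rejection", "1000_rejection_referring_with_mask", 1000),
]


def referrings_per_domain(rows):
    """Sum the number of referring expressions for each domain."""
    totals = defaultdict(int)
    for domain, _subdomain, count in rows:
        totals[domain] += count
    return dict(totals)


per_domain = referrings_per_domain(SUBDOMAINS)
print(per_domain)
print(sum(per_domain.values()))
```

Each of the 6 domains sums to 1,000 referring expressions, which matches the 6,000 total reported for the benchmark.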
To visualize the dataset, you can run the following command:
```bash
python tools/visualize.py \
--anno_path annotations.jsonl \
--image_root_dir images \
--domain_anme attribute \
--sub_domain_anme 1000_attribute_retranslated_with_mask \
--vis_path visualize \
--num_images 50 \
--vis_mask True
```
## 3.2 Evaluation
### 3.2.1 Metrics
We evaluate the referring task using three main metrics: Precision, Recall, and DensityF1 Score.
#### Basic Metrics
- **Precision & Recall**: For each referring expression, a predicted bounding box is considered correct if its IoU with any ground truth box exceeds a threshold. Following the COCO evaluation protocol, we report average performance across IoU thresholds from 0.5 to 0.95 in steps of 0.05.
- **Point-based Evaluation**: For models that only output points (e.g., Molmo), a prediction is considered correct if the predicted point falls within the mask of the corresponding instance. Note that this is less strict than IoU-based metrics.
- **Rejection Accuracy**: For the rejection subset, we calculate:
```
Rejection Accuracy = Number of correctly rejected expressions / Total number of expressions
```
where a correct rejection means the model predicts no boxes for a non-existent reference.
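The box-based metrics above can be sketched as follows. This is an illustrative reading of the description, not the code in `metric/recall_precision_densityf1.py`. It assumes that a ground truth box counts as recalled when any predicted box exceeds the IoU threshold with it, that precision and recall are both 0 when either list is empty, and that an empty box (`[]`) counts as no prediction. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in [x1, y1, x2, y2] format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred_boxes, gt_boxes, threshold):
    """Precision and recall for one referring expression at one IoU threshold."""
    if not pred_boxes or not gt_boxes:
        return 0.0, 0.0  # assumption: undefined cases score 0
    correct = sum(any(iou(p, g) > threshold for g in gt_boxes) for p in pred_boxes)
    recalled = sum(any(iou(p, g) > threshold for p in pred_boxes) for g in gt_boxes)
    return correct / len(pred_boxes), recalled / len(gt_boxes)


# IoU thresholds 0.50, 0.55, ..., 0.95 (built from integers to avoid float drift).
THRESHOLDS = [(50 + 5 * k) / 100 for k in range(10)]


def averaged_precision_recall(pred_boxes, gt_boxes):
    """Mean precision and recall over the ten IoU thresholds."""
    pairs = [precision_recall(pred_boxes, gt_boxes, t) for t in THRESHOLDS]
    mean_p = sum(p for p, _ in pairs) / len(pairs)
    mean_r = sum(r for _, r in pairs) / len(pairs)
    return mean_p, mean_r


def rejection_accuracy(predictions):
    """Fraction of rejection expressions for which no box was predicted."""
    correct = sum(1 for boxes in predictions if not [b for b in boxes if b])
    return correct / len(predictions)


gt = [[0, 0, 10, 10], [20, 20, 30, 30]]
pred = [[0, 0, 10, 10], [100, 100, 110, 110]]
print(precision_recall(pred, gt, 0.5))
print(averaged_precision_recall(pred, gt))
print(rejection_accuracy([[], [[]], [[0, 0, 1, 1]]]))
```

In the example, one of the two predictions matches a ground truth box exactly and the other overlaps nothing, so both precision and recall are 0.5 at every threshold.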
#### DensityF1 Score
To penalize over-detection (predicting too many boxes), we introduce the DensityF1 Score:
```
DensityF1 = (1/N) * Σ [2 * (Precision_i * Recall_i)/(Precision_i + Recall_i) * D_i]
```
where D_i is the density penalty factor:
```
D_i = min(1.0, GT_Count_i / Predicted_Count_i)
```
where:
- N is the number of referring expressions
- GT_Count_i is the total number of persons in the image associated with referring expression i
- Predicted_Count_i is the number of predicted boxes for referring expression i
This penalty factor reduces the score when models predict significantly more boxes than the actual number of people in the image, discouraging over-detection strategies.
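A minimal sketch of the DensityF1 formula, assuming per-expression precision and recall have already been computed. The formula does not say how to handle zero denominators, so this sketch assumes an F1 of 0 when precision and recall are both 0, and a penalty factor of 1.0 when no boxes are predicted. The function name `density_f1` and the input layout are hypothetical.

```python
def density_f1(per_expression):
    """Mean of F1_i * D_i over referring expressions.

    per_expression: iterable of (precision, recall, gt_count, predicted_count),
    where gt_count is the number of persons in the image and predicted_count
    is the number of boxes predicted for that expression.
    """
    total = 0.0
    n = 0
    for precision, recall, gt_count, predicted_count in per_expression:
        n += 1
        if precision + recall == 0:
            continue  # assumption: F1 contributes 0 when undefined
        f1 = 2 * precision * recall / (precision + recall)
        if predicted_count == 0:
            density = 1.0  # assumption: no penalty when nothing is predicted
        else:
            density = min(1.0, gt_count / predicted_count)
        total += f1 * density
    return total / n if n else 0.0


# Expression 1: perfect match, 2 boxes predicted in an image with 4 persons.
# Expression 2: full recall, but 8 boxes predicted in an image with 4 persons.
score = density_f1([(1.0, 1.0, 4, 2), (0.5, 1.0, 4, 8)])
print(round(score, 4))
```

For the second expression the F1 of 2/3 is halved by D_i = 4/8, so the over-detecting prediction contributes 1/3 instead of 2/3 to the mean.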
### 3.2.2 Evaluation Script
#### Prediction Format
Before running the evaluation, you need to prepare your model's predictions in the correct format. Each prediction should be a JSON line in a JSONL file with the following structure:
```json
{
"id": "image_id",
"extracted_predictions": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]
}
```
Where:
- `id`: The image identifier matching the ground truth data
- `extracted_predictions`: A list of bounding boxes in `[x1, y1, x2, y2]` format or points in `[x, y]` format
For rejection cases (where no humans should be detected), you should either:
- Include an empty list: `"extracted_predictions": []`
- Include a list with an empty box: `"extracted_predictions": [[]]`
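A prediction file in this format could be written as in the sketch below. The helper name `write_predictions`, the sample ids, and the output file name are placeholders for illustration; real ids must match those in the ground truth annotations.

```python
import json
import os
import tempfile


def write_predictions(records, path):
    """Write one JSON object per line with the keys described above.

    records: iterable of (sample_id, predictions), where predictions is a
    list of [x1, y1, x2, y2] boxes, a list of [x, y] points, or [] for a
    rejection case.
    """
    with open(path, "w", encoding="utf-8") as f:
        for sample_id, predictions in records:
            line = {"id": sample_id, "extracted_predictions": predictions}
            f.write(json.dumps(line) + "\n")


# Placeholder ids and coordinates, for illustration only.
example = [
    ("sample_0", [[10, 20, 110, 220], [300, 40, 380, 260]]),
    ("sample_1", []),  # rejection case: no boxes predicted
]
out_path = os.path.join(tempfile.gettempdir(), "humanref_predictions_example.jsonl")
write_predictions(example, out_path)
```

The resulting file can then be passed to the evaluation script through `--pred_path`.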
#### Running the Evaluation
You can run the evaluation script using the following command:
```bash
python metric/recall_precision_densityf1.py \
--gt_path IDEA-Research/HumanRef/annotations.jsonl \
--pred_path path/to/your/predictions.jsonl \
--pred_names "Your Model Name" \
--dump_path IDEA-Research/HumanRef/evaluation_results/your_model_results
```
Parameters:
- `--gt_path`: Path to the ground truth annotations file
- `--pred_path`: Path to your prediction file(s). You can provide multiple paths to compare different models
- `--pred_names`: Names for your models (for display in the results)
- `--dump_path`: Directory to save the evaluation results in markdown and JSON formats
##### Evaluating Multiple Models
To compare multiple models, provide multiple prediction files:
```bash
python metric/recall_precision_densityf1.py \
--gt_path IDEA-Research/HumanRef/annotations.jsonl \
--pred_path model1_results.jsonl model2_results.jsonl model3_results.jsonl \
--pred_names "Model 1" "Model 2" "Model 3" \
--dump_path IDEA-Research/HumanRef/evaluation_results/comparison
```
#### Programmatic Usage
```python
from metric.recall_precision_densityf1 import recall_precision_densityf1

recall_precision_densityf1(
    gt_path="IDEA-Research/HumanRef/annotations.jsonl",
    pred_path=["path/to/your/predictions.jsonl"],
    dump_path="IDEA-Research/HumanRef/evaluation_results/your_model_results"
)
```
#### Metrics Explained
The evaluation produces several metrics:
1. For point predictions:
   - Recall@Point
   - Precision@Point
   - DensityF1@Point
2. For box predictions:
   - Recall@0.5 (IoU threshold of 0.5)
   - Recall@0.5:0.95 (mean recall across IoU thresholds from 0.5 to 0.95)
   - Precision@0.5
   - Precision@0.5:0.95
   - DensityF1@0.5
   - DensityF1@0.5:0.95
3. Rejection Score: Accuracy in correctly predicting no boxes for referring expressions whose target does not exist in the image
The results are broken down by:
- Domain and subdomain
- Box count ranges (1, 2-5, 6-10, >10)
The DensityF1 metric is particularly important as it accounts for both precision/recall and the density of humans in the image.
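The box count breakdown can be pictured as a simple bucketing step. The sketch below assumes the ranges refer to the number of ground truth boxes per referring expression, which the text does not state explicitly; the function name `box_count_range` is hypothetical.

```python
def box_count_range(num_boxes):
    """Map a box count to one of the reported ranges: 1, 2-5, 6-10, >10."""
    if num_boxes == 1:
        return "1"
    if 2 <= num_boxes <= 5:
        return "2-5"
    if 6 <= num_boxes <= 10:
        return "6-10"
    if num_boxes > 10:
        return ">10"
    return None  # 0 boxes (e.g. rejection cases) falls outside the listed ranges


for count in (1, 3, 10, 11):
    print(count, box_count_range(count))
```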
#### Output
The evaluation generates two tables:
- Comparative Domain and Subdomain Metrics
- Comparative Box Count Metrics
These are displayed in the console and saved as markdown and JSON files if a dump path is provided.
### 3.2.3 Comparison with Other Models
We provide the evaluation results of several models on HumanRef in the [evaluation_results](evaluation_results) folder.
You can also run the evaluation script to compare your model with others.
```bash
python metric/recall_precision_densityf1.py \
--gt_path IDEA-Research/HumanRef/annotations.jsonl \
--pred_path \
"IDEA-Research/HumanRef/evaluation_results/eval_deepseekvl2/deepseekvl2_small_results.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_ferret/ferret7b_results.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_groma/groma7b_results.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_internvl2/internvl2.5_8b_results.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_shikra/shikra7b_results.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_molmo/molmo-7b-d-0924_results.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_qwen2vl/qwen2.5-7B.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_chatrex/ChatRex-Vicuna7B.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_dinox/dinox_results.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_rexseek/rexseek_7b.jsonl" \
"IDEA-Research/HumanRef/evaluation_results/eval_full_gt_person/results.jsonl" \
--pred_names \
"DeepSeek-VL2-small" \
"Ferret-7B" \
"Groma-7B" \
"InternVl-2.5-8B" \
"Shikra-7B" \
"Molmo-7B-D-0924" \
"Qwen2.5-VL-7B" \
"ChatRex-7B" \
"DINOX" \
"RexSeek-7B" \
"Baseline" \
--dump_path IDEA-Research/HumanRef/evaluation_results/all_models_comparison
```
Provider: maas
Created: 2025-10-20



