NJU-LINK/DR3-Eval
Source: Hugging Face · Updated: 2026-04-20 · Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/NJU-LINK/DR3-Eval
Resource description:
---
license: apache-2.0
pretty_name: "DR³-Eval"
configs:
- config_name: en
  data_files:
  - split: test
    path: datasets_en/query.jsonl
- config_name: zh
  data_files:
  - split: test
    path: datasets_zh/query.jsonl
size_categories:
- n<1K
task_categories:
- text-generation
- question-answering
language:
- en
- zh
tags:
- deep-research
- multimodal
- benchmark
- evaluation
- report-generation
- RAG
---
<h1 align="center">DR<sup>3</sup>-Eval: Towards Realistic and Reproducible<br>Deep Research Evaluation</h1>
<p align="center">
<a href="https://arxiv.org/abs/2604.14683">
<img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Arxiv Paper">
</a>
<a href="https://huggingface.co/papers/2604.14683">
<img src="https://img.shields.io/badge/🤗%20HuggingFace-Paper-orange.svg" alt="HuggingFace Paper">
</a>
<a href="https://huggingface.co/datasets/NJU-LINK/DR3-Eval">
<img src="https://img.shields.io/badge/🤗%20HuggingFace-Dataset-yellow.svg" alt="HuggingFace Dataset">
</a>
<a href="https://github.com/NJU-LINK/DR3-Eval">
<img src="https://img.shields.io/badge/GitHub-Code-blue.svg" alt="GitHub">
</a>
<a href="https://nju-link.github.io/DR3-Eval/">
<img src="https://img.shields.io/badge/🌐%20Homepage-Project%20Page-orange.svg" alt="Homepage">
</a>
<a href="LICENSE">
<img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License">
</a>
</p>
---
## ✨ Overview
**DR³-Eval** is a **realistic, reproducible, and multimodal** evaluation benchmark for Deep Research Agents, focusing on multi-file report generation tasks.
Existing benchmarks face a fundamental tension between **realism**, **controllability**, and **reproducibility** when evaluating deep research agents. DR³-Eval addresses this through the following design:
- 🔬 **Real User Scenarios**: Tasks are constructed from real user-provided multimodal files, covering **3 major domains and 13 sub-domains**
- 📦 **Static Sandbox Corpora**: An independent static research sandbox is built for each task, containing supportive, distracting, and noisy documents
- 🎯 **Reverse Construction Method**: Queries are reverse-engineered from verified evidence documents, eliminating evaluation ambiguity
- 📊 **Multi-dimensional Evaluation**: Five dimensions — Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality
<p align="center">
<img src="assets/intro.png" width="88%" alt="Comparison of DR³-Eval with other benchmarks">
<br>
<em>Figure 1. Comparison of DR³-Eval with existing deep research benchmarks. DR³-Eval supports both user files and sandbox corpora, providing a realistic and reproducible multimodal evaluation environment.</em>
</p>
---
## 🏆 Benchmark Comparison
DR³-Eval is the first deep research evaluation benchmark that simultaneously satisfies all of the following: user file input, static sandbox corpora, multimodality, real-world scenarios, multi-file upload, and reverse construction.
<p align="center">
<img src="assets/benchmark_comparison.png" width="88%" alt="Benchmark Comparison">
<br>
<em>Figure 2. Comprehensive comparison of DR³-Eval with representative benchmarks.</em>
</p>
---
## 🧩 Framework and Pipeline
The overall framework of DR³-Eval consists of three core components:
1. 📝 **Data Construction**: Synthesizes search paths from real multimodal files through a diverge-converge mechanism, establishes static sandboxes with controllable signal-to-noise ratios, and generates queries via reverse engineering
2. 🤖 **DR³-Agent**: Hierarchical multi-agent architecture
3. 📊 **Evaluation Protocol**: A multi-dimensional metric suite that comprehensively evaluates evidence retrieval and report generation performance
<p align="center">
<img src="assets/framework.png" width="88%" alt="Framework Overview">
<br>
<em>Figure 3. DR³-Eval framework overview, comprising data construction, the DR³-Agent multi-agent system, and the multi-dimensional evaluation protocol.</em>
</p>
---
## 📊 Dataset Statistics
- **100** independent tasks (50 English + 50 Chinese)
- **3** major domains, **13** sub-domains
- **68%** of tasks involve multimodal input
- Average of **2.24** user files per task, up to 6
- Sandbox corpora contain an average of **465.5** web pages under the 512k configuration
<p align="center">
<img src="assets/data_stas.png" width="88%" alt="Dataset Statistics">
<br>
<em>Figure 4. Dataset statistics. (a) Domain distribution. (b) File type distribution. (c) Distribution of user files per task.</em>
</p>
### File Type Distribution
| Category | Count | Subtypes |
|----------|-------|----------|
| Document | 103 | PDF(76), Markdown(10), Word(7), Text(5), PPT(5) |
| Image | 62 | PNG(37), JPEG(21), WebP(4) |
| Video | 31 | MP4(31) |
| Audio | 4 | MP3(4) |
| Data | 14 | CSV(7), Excel(7) |
| Other | 10 | HTML(10) |
| **Total** | **224** | - |
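
To reproduce a tally like the one above from a local copy, files can be grouped by extension. This is a minimal sketch: the extension-to-category mapping is an assumption inferred from the subtype names in the table (for example `.docx` for Word) and is not shipped with the dataset, and `categorize` and `tally` are hypothetical helper names.

```python
from collections import Counter
from pathlib import Path

# Assumed mapping from file extension to the categories in the table above.
EXT_TO_CATEGORY = {
    ".pdf": "Document", ".md": "Document", ".doc": "Document",
    ".docx": "Document", ".txt": "Document", ".ppt": "Document",
    ".pptx": "Document",
    ".png": "Image", ".jpg": "Image", ".jpeg": "Image", ".webp": "Image",
    ".mp4": "Video",
    ".mp3": "Audio",
    ".csv": "Data", ".xls": "Data", ".xlsx": "Data",
    ".html": "Other",
}


def categorize(file_name: str) -> str:
    """Return the table category for a file name, or "Unknown" if unmapped."""
    return EXT_TO_CATEGORY.get(Path(file_name).suffix.lower(), "Unknown")


def tally(file_names: list[str]) -> Counter:
    """Count files per category."""
    return Counter(categorize(name) for name in file_names)


# Example with made-up file names.
print(tally(["file1.pdf", "file2.mp4", "notes.MD", "chart.png"]))
```
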
### Language Distribution
| Language | Cases | Avg Files/Case | Avg Multimodal Files/Case |
|----------|-------|----------------|--------------------------|
| Chinese | 50 | 1.82 | 0.82 |
| English | 50 | 2.66 | 1.12 |
| **Total** | **100** | **2.24** | **0.97** |
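
The totals row can be cross-checked against the file type table. With 50 cases per language, multiplying each per-case average by its case count should recover the overall file counts. The short check below uses only numbers from the two tables; treating image, video, and audio files as the multimodal ones is an assumption.

```python
# Per-language rows from the Language Distribution table.
cases = {"Chinese": 50, "English": 50}
avg_files = {"Chinese": 1.82, "English": 2.66}
avg_multimodal = {"Chinese": 0.82, "English": 1.12}

total_cases = sum(cases.values())
total_files = sum(cases[k] * avg_files[k] for k in cases)
total_multimodal = sum(cases[k] * avg_multimodal[k] for k in cases)

# Category counts from the File Type Distribution table.
file_types = {"Document": 103, "Image": 62, "Video": 31,
              "Audio": 4, "Data": 14, "Other": 10}
# Assumption: these three categories are what the table calls multimodal.
multimodal_count = sum(file_types[k] for k in ("Image", "Video", "Audio"))

print(round(total_files))                        # 224 user files in total
print(round(total_files / total_cases, 2))       # 2.24 files per case
print(round(total_multimodal / total_cases, 2))  # 0.97 multimodal files per case
print(sum(file_types.values()))                  # 224, matches the Total row
print(multimodal_count)                          # 97, matches 0.97 x 100 cases
```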
---
## 🧩 Dataset Structure
```
DR3-Eval/
├── datasets_en/
│   ├── query.jsonl          # English task queries (50 tasks)
│   ├── 001/                 # Task folder with user files
│   │   ├── file1.pdf
│   │   ├── file2.mp4
│   │   └── ...
│   ├── 002/
│   └── ...
├── datasets_zh/
│   ├── query.jsonl          # Chinese task queries (50 tasks)
│   ├── 001/
│   └── ...
└── README.md
```
Each task folder contains the **user-provided multimodal files** referenced in the corresponding query. Each line of `query.jsonl` is one task query with the following fields:
| Field | Description |
|-------|-------------|
| `task` | Task ID (e.g., "001") |
| `query` | The natural language research query |
| `user_files` | List of user-provided file names |
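
After downloading, it can be worth checking that every record carries these fields and that the listed files are present on disk. The sketch below assumes that `user_files` is a list of plain file names and that each task's files sit in a folder named after its `task` ID, as the directory tree suggests. `check_queries` is a hypothetical helper, not part of the dataset tooling.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("task", "query", "user_files")


def check_queries(dataset_dir: str) -> list[str]:
    """Return human-readable problems found in one language folder."""
    root = Path(dataset_dir)
    problems = []
    with open(root / "query.jsonl", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                problems.append(f"line {line_no}: missing fields {missing}")
                continue
            # Assumption: user files live in a folder named after the task ID.
            task_dir = root / str(record["task"])
            for name in record["user_files"]:
                if not (task_dir / name).is_file():
                    problems.append(
                        f"task {record['task']}: file not found: {name}"
                    )
    return problems
```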
---
## 📐 Evaluation Metrics
| Dimension | Metric | Description |
|-----------|--------|-------------|
| **Information Retrieval** | IR (Information Recall) | Coverage of key insights from user files and sandbox corpus |
| **Information Retrieval** | CC (Citation Coverage) | Extent to which the report cites necessary source documents |
| **Report Generation** | FA (Factual Accuracy) | Factual correctness of cited claims in the report |
| **Report Generation** | IF (Instruction Following) | Whether the report satisfies all requirements in the query |
| **Report Generation** | DQ (Depth Quality) | Analytical depth and logical rigor of the report |
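
When recording results, it may help to keep the five scores together with the dimension each belongs to. The sketch below only mirrors the grouping in the table. `ReportScores` is a hypothetical container, the score scale is not specified here, and no aggregation formula is implied.

```python
from dataclasses import dataclass

# Metric abbreviation -> dimension, as listed in the table above.
METRIC_DIMENSION = {
    "IR": "Information Retrieval",
    "CC": "Information Retrieval",
    "FA": "Report Generation",
    "IF": "Report Generation",
    "DQ": "Report Generation",
}


@dataclass
class ReportScores:
    """Hypothetical container for one report's five metric scores."""
    IR: float
    CC: float
    FA: float
    IF: float
    DQ: float

    def by_dimension(self) -> dict[str, dict[str, float]]:
        """Group the scores under their evaluation dimension."""
        grouped: dict[str, dict[str, float]] = {}
        for metric, dimension in METRIC_DIMENSION.items():
            grouped.setdefault(dimension, {})[metric] = getattr(self, metric)
        return grouped


# Placeholder values for illustration only, not results from the paper.
scores = ReportScores(IR=0.5, CC=0.5, FA=0.5, IF=0.5, DQ=0.5)
print(scores.by_dimension())
```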
---
## 🚀 Quick Start
### 📥 Download
```python
from huggingface_hub import snapshot_download

# Download the full dataset repository into ./DR3-Eval
snapshot_download(
    repo_id="NJU-LINK/DR3-Eval",
    repo_type="dataset",
    local_dir="./DR3-Eval",
)
```
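
Because the user files include video and audio, downloading a single language subset may be preferable. A minimal sketch using the `allow_patterns` argument of `snapshot_download`: the `download_language` helper is hypothetical, and the patterns assume the folder layout shown under Dataset Structure.

```python
def download_language(lang: str, local_dir: str = "./DR3-Eval") -> str:
    """Download only one language subset ("en" or "zh") of the dataset."""
    if lang not in ("en", "zh"):
        raise ValueError(f"unsupported language: {lang!r}")
    # Imported lazily so the helper can be defined without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="NJU-LINK/DR3-Eval",
        repo_type="dataset",
        local_dir=local_dir,
        # Glob patterns are matched against file paths in the repository.
        allow_patterns=[f"datasets_{lang}/*", "README.md"],
    )
```

Calling `download_language("en")` should then fetch `datasets_en/` and the README only.
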
### 📖 Load Queries
```python
import json

# Load English queries (one JSON object per line)
with open("DR3-Eval/datasets_en/query.jsonl", encoding="utf-8") as f:
    en_queries = [json.loads(line) for line in f]

# Load Chinese queries; UTF-8 is set explicitly so this also works on
# platforms whose default encoding differs
with open("DR3-Eval/datasets_zh/query.jsonl", encoding="utf-8") as f:
    zh_queries = [json.loads(line) for line in f]

print(f"English tasks: {len(en_queries)}")
print(f"Chinese tasks: {len(zh_queries)}")
```
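
The card metadata declares `en` and `zh` configs, each with a `test` split pointing at the corresponding `query.jsonl`, so the queries can probably also be loaded through the `datasets` library. The sketch below assumes that library is installed and that the JSONL schema loads cleanly. `load_queries` is a hypothetical wrapper. It returns only the query records, so the user files still have to be downloaded separately, for example with `snapshot_download`.

```python
def load_queries(lang: str):
    """Load the `test` split of the `en` or `zh` config via `datasets`."""
    if lang not in ("en", "zh"):
        raise ValueError(f"unsupported config: {lang!r}")
    # Imported lazily; requires `pip install datasets`.
    from datasets import load_dataset

    # The config name and split come from the `configs` block in the card metadata.
    return load_dataset("NJU-LINK/DR3-Eval", lang, split="test")
```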
---
## 📝 Citation
If you find this work useful, please cite:
```bibtex
@article{dr3eval2026,
title={DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
author={},
journal={arXiv preprint arXiv:2604.14683},
year={2026}
}
```
## 🌟 License
This project is licensed under the Apache License 2.0.
Provider:
NJU-LINK



