NJU-LINK/DR3-Eval

Name: NJU-LINK/DR3-Eval
Creator: NJU-LINK
Published: 2026-04-20 09:23:23
License: Apache-2.0

Hugging Face | Updated 2026-04-20 | Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/NJU-LINK/DR3-Eval

Resource description:

---
license: apache-2.0
pretty_name: "DR³-Eval"
configs:
- config_name: en
  data_files:
  - split: test
    path: datasets_en/query.jsonl
- config_name: zh
  data_files:
  - split: test
    path: datasets_zh/query.jsonl
size_categories:
- n<1K
task_categories:
- text-generation
- question-answering
language:
- en
- zh
tags:
- deep-research
- multimodal
- benchmark
- evaluation
- report-generation
- RAG
---

<h1 align="center">DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation</h1>

<a href="https://arxiv.org/abs/2604.14683"> <img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Arxiv Paper"> </a>
<a href="https://huggingface.co/papers/2604.14683"> <img src="https://img.shields.io/badge/🤗%20HuggingFace-Paper-orange.svg" alt="HuggingFace Paper"> </a>
<a href="https://huggingface.co/datasets/NJU-LINK/DR3-Eval"> <img src="https://img.shields.io/badge/🤗%20HuggingFace-Dataset-yellow.svg" alt="HuggingFace Dataset"> </a>
<a href="https://github.com/NJU-LINK/DR3-Eval"> <img src="https://img.shields.io/badge/GitHub-Code-blue.svg" alt="GitHub"> </a>
<a href="https://nju-link.github.io/DR3-Eval/"> <img src="https://img.shields.io/badge/🌐%20Homepage-Project%20Page-orange.svg" alt="Homepage"> </a>
<a href="LICENSE"> <img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"> </a>

---

## ✨ Overview

**DR³-Eval** is a **realistic, reproducible, and multimodal** evaluation benchmark for Deep Research Agents, focusing on multi-file report generation tasks. Existing benchmarks face a fundamental tension between **realism**, **controllability**, and **reproducibility** when evaluating deep research agents.
DR³-Eval addresses this through the following design:

- 🔬 **Real User Scenarios**: Tasks are constructed from real user-provided multimodal files, covering **3 major domains and 13 sub-domains**
- 📦 **Static Sandbox Corpora**: An independent static research sandbox is built for each task, containing supportive, distracting, and noisy documents
- 🎯 **Reverse Construction Method**: Queries are reverse-engineered from verified evidence documents, eliminating evaluation ambiguity
- 📊 **Multi-dimensional Evaluation**: Five dimensions: Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality

<img src="assets/intro.png" width="88%" alt="Comparison of DR³-Eval with other benchmarks">

Figure 1. Comparison of DR³-Eval with existing deep research benchmarks. DR³-Eval supports both user files and sandbox corpora, providing a realistic and reproducible multimodal evaluation environment.

---

## 🏆 Benchmark Comparison

DR³-Eval is the first deep research evaluation benchmark that simultaneously satisfies all of the following: user file input, static sandbox corpora, multimodality, real-world scenarios, multi-file upload, and reverse construction.

<img src="assets/benchmark_comparison.png" width="88%" alt="Benchmark Comparison">

Figure 2. Comprehensive comparison of DR³-Eval with representative benchmarks.

---

## 🧩 Framework and Pipeline

The overall framework of DR³-Eval consists of three core components:

1. 📝 **Data Construction**: Synthesizes search paths from real multimodal files through a diverge-converge mechanism, establishes static sandboxes with controllable signal-to-noise ratios, and generates queries via reverse engineering
2. 🤖 **DR³-Agent**: Hierarchical multi-agent architecture
3. 📊 **Evaluation Protocol**: A multi-dimensional metric suite that comprehensively evaluates evidence retrieval and report generation performance

<img src="assets/framework.png" width="88%" alt="Framework Overview">

Figure 3.
DR³-Eval framework overview. Includes data construction, DR³-Agent multi-agent system, and multi-dimensional evaluation protocol.

---

## 📊 Dataset Statistics

- **100** independent tasks (50 English + 50 Chinese)
- **3** major domains, **13** sub-domains
- **68%** of tasks involve multimodal input
- Average of **2.24** user files per task, up to 6
- Sandbox corpus contains an average of **465.5** web pages under the 512k configuration

<img src="assets/data_stas.png" width="88%" alt="Dataset Statistics">

Figure 4. Dataset statistics. (a) Domain distribution. (b) File type distribution. (c) Distribution of user files per task.

### File Type Distribution

| Category | Count | Subtypes |
|----------|-------|----------|
| Document | 103 | PDF(76), Markdown(10), Word(7), Text(5), PPT(5) |
| Image | 62 | PNG(37), JPEG(21), WebP(4) |
| Video | 31 | MP4(31) |
| Audio | 4 | MP3(4) |
| Data | 14 | CSV(7), Excel(7) |
| Other | 10 | HTML(10) |
| **Total** | **224** | - |

### Language Distribution

| Language | Cases | Avg Files/Case | Avg Multimodal Files/Case |
|----------|-------|----------------|--------------------------|
| Chinese | 50 | 1.82 | 0.82 |
| English | 50 | 2.66 | 1.12 |
| **Total** | **100** | **2.24** | **0.97** |

---

## 🧩 Dataset Structure

```
DR3-Eval/
├── datasets_en/
│   ├── query.jsonl          # English task queries (50 tasks)
│   ├── 001/                 # Task folder with user files
│   │   ├── file1.pdf
│   │   ├── file2.mp4
│   │   └── ...
│   ├── 002/
│   └── ...
├── datasets_zh/
│   ├── query.jsonl          # Chinese task queries (50 tasks)
│   ├── 001/
│   └── ...
└── README.md
```

Each task folder contains the **user-provided multimodal files** referenced in the corresponding query.
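To get a quick view of what a task folder holds, the files can be grouped by the categories used in the File Type Distribution table. The following is a minimal sketch, not part of the official tooling: the function name `count_categories` and the extension map `CATEGORY_BY_EXT` are hypothetical, and the exact file extensions are assumptions inferred from the subtype names in the table.

```python
from collections import Counter
from pathlib import Path

# Extension-to-category map mirroring the File Type Distribution table.
# The concrete extensions are assumptions; adjust them to the files present.
CATEGORY_BY_EXT = {
    ".pdf": "Document", ".md": "Document", ".doc": "Document",
    ".docx": "Document", ".txt": "Document", ".ppt": "Document",
    ".pptx": "Document",
    ".png": "Image", ".jpg": "Image", ".jpeg": "Image", ".webp": "Image",
    ".mp4": "Video",
    ".mp3": "Audio",
    ".csv": "Data", ".xls": "Data", ".xlsx": "Data",
    ".html": "Other",
}


def count_categories(task_dir):
    """Count the files directly inside one task folder by category."""
    counts = Counter()
    for path in sorted(Path(task_dir).iterdir()):
        if path.is_file():
            # Unrecognized extensions are reported as "Unknown" rather than dropped.
            category = CATEGORY_BY_EXT.get(path.suffix.lower(), "Unknown")
            counts[category] += 1
    return counts
```

For example, `count_categories("DR3-Eval/datasets_en/001")` would return a `Counter` keyed by category name for the first English task, assuming the dataset was downloaded to `./DR3-Eval`.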
Each `query.jsonl` file contains the task queries with the following fields:

| Field | Description |
|-------|-------------|
| `task` | Task ID (e.g., "001") |
| `query` | The natural language research query |
| `user_files` | List of user-provided file names |

---

## 📐 Evaluation Metrics

| Dimension | Metric | Description |
|-----------|--------|-------------|
| **Information Retrieval** | IR (Information Recall) | Coverage of key insights from user files and sandbox corpus |
| **Information Retrieval** | CC (Citation Coverage) | Extent to which the report cites necessary source documents |
| **Report Generation** | FA (Factual Accuracy) | Factual correctness of cited claims in the report |
| **Report Generation** | IF (Instruction Following) | Whether the report satisfies all requirements in the query |
| **Report Generation** | DQ (Depth Quality) | Analytical depth and logical rigor of the report |

---

## 🚀 Quick Start

### 📥 Download

```python
from huggingface_hub import snapshot_download

# Download the full dataset repository into ./DR3-Eval
snapshot_download(
    repo_id="NJU-LINK/DR3-Eval",
    repo_type="dataset",
    local_dir="./DR3-Eval"
)
```

### 📖 Load Queries

```python
import json

# Load English queries (one JSON object per line)
with open("DR3-Eval/datasets_en/query.jsonl", encoding="utf-8") as f:
    en_queries = [json.loads(line) for line in f]

# Load Chinese queries; explicit UTF-8 avoids platform-default decoding errors
with open("DR3-Eval/datasets_zh/query.jsonl", encoding="utf-8") as f:
    zh_queries = [json.loads(line) for line in f]

print(f"English tasks: {len(en_queries)}")
print(f"Chinese tasks: {len(zh_queries)}")
```

---

## 📝 Citation

If you find this work useful, please cite:

```bibtex
@article{dr3eval2026,
  title={DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
  author={},
  journal={arXiv preprint arXiv:2604.14683},
  year={2026}
}
```

## 🌟 License

This project is licensed under the Apache License 2.0.
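As a follow-up to the Quick Start, the `task` and `user_files` fields described under Dataset Structure can be combined to check that every file a query references is present after download. This is a minimal sketch under two assumptions that the README does not state explicitly: that the `task` value matches the task folder name, and that `user_files` holds bare file names relative to that folder. The helper name `resolve_user_files` is hypothetical.

```python
import json
from pathlib import Path


def resolve_user_files(dataset_dir):
    """Map each task ID to (existing, missing) lists of user-file paths.

    Assumes <dataset_dir>/<task>/<file name> is where each user file lives.
    """
    dataset_dir = Path(dataset_dir)
    resolved = {}
    with open(dataset_dir / "query.jsonl", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            record = json.loads(line)
            task_dir = dataset_dir / record["task"]
            paths = [task_dir / name for name in record["user_files"]]
            resolved[record["task"]] = (
                [p for p in paths if p.exists()],
                [p for p in paths if not p.exists()],
            )
    return resolved
```

Calling `resolve_user_files("DR3-Eval/datasets_en")` returns a dictionary keyed by task ID; a non-empty second list for any task suggests an incomplete download or that one of the two assumptions above does not hold.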

Provider:

NJU-LINK
