
NJU-LINK/DR3-Eval

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/NJU-LINK/DR3-Eval
Resource description:
---
license: apache-2.0
pretty_name: "DR³-Eval"
configs:
- config_name: en
  data_files:
  - split: test
    path: datasets_en/query.jsonl
- config_name: zh
  data_files:
  - split: test
    path: datasets_zh/query.jsonl
size_categories:
- n<1K
task_categories:
- text-generation
- question-answering
language:
- en
- zh
tags:
- deep-research
- multimodal
- benchmark
- evaluation
- report-generation
- RAG
---

<h1 align="center">DR<sup>3</sup>-Eval: Towards Realistic and Reproducible<br>Deep Research Evaluation</h1>

<p align="center">
  <a href="https://arxiv.org/abs/2604.14683"> <img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Arxiv Paper"> </a>
  <a href="https://huggingface.co/papers/2604.14683"> <img src="https://img.shields.io/badge/🤗%20HuggingFace-Paper-orange.svg" alt="HuggingFace Paper"> </a>
  <a href="https://huggingface.co/datasets/NJU-LINK/DR3-Eval"> <img src="https://img.shields.io/badge/🤗%20HuggingFace-Dataset-yellow.svg" alt="HuggingFace Dataset"> </a>
  <a href="https://github.com/NJU-LINK/DR3-Eval"> <img src="https://img.shields.io/badge/GitHub-Code-blue.svg" alt="GitHub"> </a>
  <a href="https://nju-link.github.io/DR3-Eval/"> <img src="https://img.shields.io/badge/🌐%20Homepage-Project%20Page-orange.svg" alt="Homepage"> </a>
  <a href="LICENSE"> <img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"> </a>
</p>

---

## ✨ Overview

**DR³-Eval** is a **realistic, reproducible, and multimodal** evaluation benchmark for Deep Research Agents, focusing on multi-file report generation tasks.

Existing benchmarks face a fundamental tension between **realism**, **controllability**, and **reproducibility** when evaluating deep research agents.
DR³-Eval addresses this through the following design:

- 🔬 **Real User Scenarios**: Tasks are constructed from real user-provided multimodal files, covering **3 major domains and 13 sub-domains**
- 📦 **Static Sandbox Corpora**: An independent static research sandbox is built for each task, containing supportive, distracting, and noisy documents
- 🎯 **Reverse Construction Method**: Queries are reverse-engineered from verified evidence documents, eliminating evaluation ambiguity
- 📊 **Multi-dimensional Evaluation**: Five dimensions: Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality

<p align="center">
  <img src="assets/intro.png" width="88%" alt="Comparison of DR³-Eval with other benchmarks">
  <br>
  <em>Figure 1. Comparison of DR³-Eval with existing deep research benchmarks. DR³-Eval supports both user files and sandbox corpora, providing a realistic and reproducible multimodal evaluation environment.</em>
</p>

---

## 🏆 Benchmark Comparison

DR³-Eval is the first deep research evaluation benchmark that simultaneously satisfies all of the following: user file input, static sandbox corpora, multimodality, real-world scenarios, multi-file upload, and reverse construction.

<p align="center">
  <img src="assets/benchmark_comparison.png" width="88%" alt="Benchmark Comparison">
  <br>
  <em>Figure 2. Comprehensive comparison of DR³-Eval with representative benchmarks.</em>
</p>

---

## 🧩 Framework and Pipeline

The overall framework of DR³-Eval consists of three core components:

1. 📝 **Data Construction**: Synthesizes search paths from real multimodal files through a diverge-converge mechanism, establishes static sandboxes with controllable signal-to-noise ratios, and generates queries via reverse engineering
2. 🤖 **DR³-Agent**: Hierarchical multi-agent architecture
3. 📊 **Evaluation Protocol**: A multi-dimensional metric suite that comprehensively evaluates evidence retrieval and report generation performance

<p align="center">
  <img src="assets/framework.png" width="88%" alt="Framework Overview">
  <br>
  <em>Figure 3. DR³-Eval framework overview. Includes data construction, DR³-Agent multi-agent system, and multi-dimensional evaluation protocol.</em>
</p>

---

## 📊 Dataset Statistics

- **100** independent tasks (50 English + 50 Chinese)
- **3** major domains, **13** sub-domains
- **68%** of tasks involve multimodal input
- Average of **2.24** user files per task, up to 6
- Sandbox corpus contains an average of **465.5** web pages under the 512k configuration

<p align="center">
  <img src="assets/data_stas.png" width="88%" alt="Dataset Statistics">
  <br>
  <em>Figure 4. Dataset statistics. (a) Domain distribution. (b) File type distribution. (c) Distribution of user files per task.</em>
</p>

### File Type Distribution

| Category | Count | Subtypes |
|----------|-------|----------|
| Document | 103 | PDF(76), Markdown(10), Word(7), Text(5), PPT(5) |
| Image | 62 | PNG(37), JPEG(21), WebP(4) |
| Video | 31 | MP4(31) |
| Audio | 4 | MP3(4) |
| Data | 14 | CSV(7), Excel(7) |
| Other | 10 | HTML(10) |
| **Total** | **224** | - |

### Language Distribution

| Language | Cases | Avg Files/Case | Avg Multimodal Files/Case |
|----------|-------|----------------|--------------------------|
| Chinese | 50 | 1.82 | 0.82 |
| English | 50 | 2.66 | 1.12 |
| **Total** | **100** | **2.24** | **0.97** |

---

## 🧩 Dataset Structure

```
DR3-Eval/
├── datasets_en/
│   ├── query.jsonl          # English task queries (50 tasks)
│   ├── 001/                 # Task folder with user files
│   │   ├── file1.pdf
│   │   ├── file2.mp4
│   │   └── ...
│   ├── 002/
│   └── ...
├── datasets_zh/
│   ├── query.jsonl          # Chinese task queries (50 tasks)
│   ├── 001/
│   └── ...
└── README.md
```

Each task folder contains the **user-provided multimodal files** referenced in the corresponding query.
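
The pairing between a query record and its task folder can be sketched as follows. This is a minimal illustration rather than the project's own tooling: it assumes that the `task` field from this card's field table is also the folder name (for example `001`) and that `user_files` holds bare file names relative to that folder. The helper names `load_queries`, `resolve_user_files`, and `find_missing` are hypothetical.

```python
import json
from pathlib import Path


def load_queries(query_path):
    """Read one JSON object per non-empty line of a query.jsonl file."""
    with open(query_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def resolve_user_files(split_dir, record):
    """Map a record's `user_files` names to paths inside its task folder.

    Assumption: `task` is the folder name and `user_files` lists file
    names relative to that folder. The dataset card does not confirm this.
    """
    task_dir = Path(split_dir) / str(record["task"])
    return [task_dir / name for name in record.get("user_files", [])]


def find_missing(split_dir, records):
    """Return {task_id: [file names]} for listed files not found on disk."""
    missing = {}
    for record in records:
        absent = [
            path.name
            for path in resolve_user_files(split_dir, record)
            if not path.is_file()
        ]
        if absent:
            missing[str(record["task"])] = absent
    return missing


if __name__ == "__main__":
    # Path matches the local_dir used in the card's download snippet.
    split_dir = Path("DR3-Eval/datasets_en")
    records = load_queries(split_dir / "query.jsonl")
    print(find_missing(split_dir, records))
```

Running a check like this after downloading is one way to confirm that every file a query refers to is present before handing the task to an agent. If the real `user_files` values turn out to include sub-paths, the join in `resolve_user_files` would need adjusting.
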
The `query.jsonl` file contains the task queries with the following fields:

| Field | Description |
|-------|-------------|
| `task` | Task ID (e.g., "001") |
| `query` | The natural language research query |
| `user_files` | List of user-provided file names |

---

## 📐 Evaluation Metrics

| Dimension | Metric | Description |
|-----------|--------|-------------|
| **Information Retrieval** | IR (Information Recall) | Coverage of key insights from user files and sandbox corpus |
| **Information Retrieval** | CC (Citation Coverage) | Extent to which the report cites necessary source documents |
| **Report Generation** | FA (Factual Accuracy) | Factual correctness of cited claims in the report |
| **Report Generation** | IF (Instruction Following) | Whether the report satisfies all requirements in the query |
| **Report Generation** | DQ (Depth Quality) | Analytical depth and logical rigor of the report |

---

## 🚀 Quick Start

### 📥 Download

```python
from huggingface_hub import snapshot_download

# Download the full dataset repository into ./DR3-Eval
snapshot_download(
    repo_id="NJU-LINK/DR3-Eval",
    repo_type="dataset",
    local_dir="./DR3-Eval"
)
```

### 📖 Load Queries

```python
import json

# Load English queries (explicit UTF-8 so the result does not depend on the platform default)
with open("DR3-Eval/datasets_en/query.jsonl", encoding="utf-8") as f:
    en_queries = [json.loads(line) for line in f]

# Load Chinese queries
with open("DR3-Eval/datasets_zh/query.jsonl", encoding="utf-8") as f:
    zh_queries = [json.loads(line) for line in f]

print(f"English tasks: {len(en_queries)}")
print(f"Chinese tasks: {len(zh_queries)}")
```

---

## 📝 Citation

If you find this work useful, please cite:

```bibtex
@article{dr3eval2026,
  title={DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
  author={},
  journal={arXiv preprint arXiv:2604.14683},
  year={2026}
}
```

## 🌟 License

This project is licensed under the Apache License 2.0.
Provider:
NJU-LINK