hal-utokyo/PaperWrite-Bench
Hugging Face · Updated 2026-04-14 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/hal-utokyo/PaperWrite-Bench
Resource description:
---
task_categories:
- text-generation
language:
- en
tags:
- agent
size_categories:
- n<1K
---
# Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
<p align="left">
<a href="https://atsumiyai.github.io/">Atsuyuki Miyai</a>,
Mashiro Toyooka*,
<a href="https://zaiyingzhao.github.io/">Zaiying Zhao</a>*,
Kenta Watanabe*,
<br>
<a href="https://scholar.google.com/citations?user=rE9iY5MAAAAJ&hl=ja">Toshihiko Yamasaki</a>,
<a href="https://scholar.google.co.jp/citations?user=CJRhhi0AAAAJ&hl=en">Kiyoharu Aizawa</a>
<br>
The University of Tokyo
<br>
*: Equal Contribution
</p>
<p align="left">
<a href="https://agent4science-utokyo.github.io/PaperRecon_HP/">🌐 Project Page</a> |
<a href="https://arxiv.org/pdf/2604.01128">📄 Paper</a> |
<a href="https://github.com/Agent4Science-UTokyo/PaperRecon">💻 Code</a> |
<a href="https://huggingface.co/datasets/hal-utokyo/PaperWrite-Bench">🤗 Dataset</a>
</p>
## Background
As coding agents advance rapidly, rigorous evaluation of AI-driven research automation and its risks is essential for sustainable scientific progress. With AI-written paper submissions to academic venues already observed and AI Scientists growing rapidly, the research community must continuously monitor both the capabilities and risks of AI-driven writing through reliable evaluation.
## Overview
**We introduce Paper Reconstruction Evaluation (PaperRecon)**, an evaluation framework with three steps: an overview (overview.md) is created from an existing paper, an agent generates a full paper from the overview and minimal additional resources, and the result is compared against the original paper. PaperRecon disentangles the evaluation of AI-written papers into two orthogonal dimensions, Presentation and Hallucination: Presentation is evaluated using a rubric, and Hallucination is assessed via agentic evaluation grounded in the original paper source.
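The two evaluation dimensions can be illustrated with a toy sketch. This is not the PaperRecon implementation: the names `RubricItem`, `ReconResult`, `score_presentation`, `find_hallucinations`, and `evaluate` are hypothetical, the rubric checks are simple keyword tests, and the agentic hallucination evaluation is replaced by a naive substring match against the original text. It only shows how a presentation score and a hallucination list can be reported separately for one generated paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RubricItem:
    """One presentation criterion (hypothetical; a real rubric would use a judge)."""
    description: str
    check: Callable[[str], bool]


@dataclass
class ReconResult:
    presentation: float  # fraction of rubric items satisfied, in [0, 1]
    hallucinations: List[str] = field(default_factory=list)  # unsupported claims


def score_presentation(paper: str, rubric: List[RubricItem]) -> float:
    """Return the fraction of rubric items the generated paper satisfies."""
    if not rubric:
        return 0.0
    return sum(item.check(paper) for item in rubric) / len(rubric)


def find_hallucinations(claims: List[str], source: str) -> List[str]:
    """Toy grounding check: a claim is supported only if it appears in the source."""
    source_lower = source.lower()
    return [claim for claim in claims if claim.lower() not in source_lower]


def evaluate(generated: str, claims: List[str], original: str,
             rubric: List[RubricItem]) -> ReconResult:
    """Score the two dimensions independently of each other."""
    return ReconResult(
        presentation=score_presentation(generated, rubric),
        hallucinations=find_hallucinations(claims, original),
    )


# Toy inputs, written by hand for illustration only.
original = "We evaluate on a toy task. Accuracy is 90%."
generated = "Introduction: ... Method: ... Accuracy is 90%. We use 8 GPUs."
claims = ["Accuracy is 90%", "We use 8 GPUs"]  # extracted by hand here
rubric = [
    RubricItem("Has an introduction", lambda p: "introduction" in p.lower()),
    RubricItem("Describes the method", lambda p: "method" in p.lower()),
]

result = evaluate(generated, claims, original, rubric)
print(result.presentation)    # 1.0
print(result.hallucinations)  # ['We use 8 GPUs']
```

In this toy case the generated paper satisfies every rubric item and still contains an unsupported claim, which is why the two scores are kept apart.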
**We introduce PaperWrite-Bench**, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our key findings are:
1. **Claude Code achieves higher presentation quality than Codex.** Claude Code better captures the key elements required for scientific writing across sections.
2. **Codex produces fewer hallucinations than Claude Code.** While Claude Code exhibits more than 10 hallucinations per paper on average, Codex limits this to around 3.
3. **Writing capability improves with model advances.** This also suggests that Paper Reconstruction Evaluation serves as a reliable metric for tracking progress in writing ability.
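
Per-agent figures such as "hallucinations per paper on average" are means over per-paper results. The sketch below shows one way to compute them, assuming each evaluated paper yields a presentation score and a hallucination count. The `PaperScore` record, the `summarize` helper, the agent names, and all numbers are hypothetical placeholders, not results from the benchmark.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, NamedTuple


class PaperScore(NamedTuple):
    """Per-paper result for one agent (hypothetical record layout)."""
    agent: str
    presentation: float   # rubric-based score for this paper
    hallucinations: int   # number of hallucinations found in this paper


def summarize(scores: List[PaperScore]) -> Dict[str, Dict[str, float]]:
    """Average each dimension per agent, keeping the two dimensions separate."""
    by_agent = defaultdict(list)
    for score in scores:
        by_agent[score.agent].append(score)
    return {
        agent: {
            "mean_presentation": mean(r.presentation for r in rows),
            "mean_hallucinations": mean(r.hallucinations for r in rows),
        }
        for agent, rows in by_agent.items()
    }


# Placeholder values for illustration only.
scores = [
    PaperScore("agent_a", 0.75, 4),
    PaperScore("agent_a", 0.50, 2),
    PaperScore("agent_b", 0.50, 1),
    PaperScore("agent_b", 0.50, 1),
]

summary = summarize(scores)
print(summary["agent_a"])
print(summary["agent_b"])
```

Reporting the two means side by side makes it visible when one agent leads on presentation while the other leads on hallucination.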
## PaperWrite-Bench
PaperWrite-Bench consists of 51 papers from top-tier venues (NeurIPS, ICML, ICLR, CVPR, ECCV, ACL, NAACL, etc.) across diverse domains published after 2025. The full list of papers is available [here](https://docs.google.com/spreadsheets/d/1MXg8oEP_Aw3aldz-3hzpTkH2UK7Ju_CHi7lyfTEcOxE/edit?gid=0#gid=0).
We sincerely thank the authors of these papers for their efforts in making their work publicly available, including code releases.
## Usage
Refer to the <a href="https://github.com/Agent4Science-UTokyo/PaperRecon">💻 Code</a> repository for usage instructions.
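
To fetch the raw dataset files before following the code repository, one option is the `huggingface_hub` client. The sketch below assumes that package is installed; `download_benchmark` is a hypothetical helper name. It makes no assumption about the file layout inside the dataset and simply downloads a snapshot of the whole repository.

```python
from typing import Optional

REPO_ID = "hal-utokyo/PaperWrite-Bench"


def download_benchmark(local_dir: Optional[str] = None) -> str:
    """Download a snapshot of the dataset repository and return its local path.

    Requires network access and the `huggingface_hub` package.
    """
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",   # the repository is a dataset, not a model
        local_dir=local_dir,   # None keeps the files in the default cache
    )


if __name__ == "__main__":
    path = download_benchmark()
    print(f"Dataset files downloaded to: {path}")
```

The evaluation workflow itself (overview creation, paper generation, scoring) is defined in the PaperRecon code repository, not by this helper.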
## LICENSE
The papers, LaTeX sources, and codebases included in PaperWrite-Bench are the intellectual property of their respective authors and are subject to their original licenses. We have excluded repositories that explicitly prohibit redistribution. Please refer to each paper's repository for license details.
The full list of papers is available [here](https://docs.google.com/spreadsheets/d/1MXg8oEP_Aw3aldz-3hzpTkH2UK7Ju_CHi7lyfTEcOxE/edit?gid=0#gid=0).
Provider:
hal-utokyo
Collected Summary
Dataset Introduction

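Construction
Against the backdrop of AI-driven scientific writing, PaperWrite-Bench was built according to rigorous academic selection criteria. The dataset comprises 51 papers published after 2025, selected from top-tier venues such as NeurIPS, ICML, ICLR, CVPR, ECCV, ACL, and NAACL, and spanning multiple research domains. The included material covers the papers, their LaTeX sources, and their codebases. The construction process respects the original authors' intellectual property: repositories that explicitly prohibit redistribution are excluded. The result is high-quality, diverse benchmark material for evaluating AI paper-writing ability.
Features
The defining feature of PaperWrite-Bench is that it is designed for systematic evaluation of the quality of AI-generated academic papers. By framing the task as paper reconstruction, it separates evaluation into two dimensions: presentation quality and degree of hallucination. Presentation quality is measured against a structured rubric, while hallucination is quantified through agentic evaluation grounded in the original paper source. This two-dimensional framework captures the strengths and weaknesses of AI models in scientific writing in detail, and provides a reliable, reproducible metric for tracking progress in academic writing ability.
Usage
To use the dataset, researchers follow the Paper Reconstruction Evaluation framework. First, an overview file is generated from the original paper. Next, an AI agent attempts to reconstruct the full paper using only that overview and limited additional resources. Finally, the AI-generated paper is compared with the original in terms of presentation and factual consistency. For implementation details, including the evaluation rubric and the agentic evaluation procedure, refer to the project's official code repository, which keeps the evaluation consistent and the results comparable.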
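Background and Challenges
Background Overview
As AI advances rapidly in research automation, rigorous evaluation of the capabilities and risks of AI-generated academic papers has become key to sustainable scientific progress. PaperWrite-Bench was recently built by a research team at The University of Tokyo to systematically measure, through the Paper Reconstruction Evaluation framework, the presentation quality and factual consistency of AI models in academic writing. The dataset targets the text-generation task. Its core research question is how to disentangle and quantify presentation ability and hallucination when AI writes papers, giving the research community an empirical basis for monitoring how AI writing evolves.
Current Challenges
The dataset addresses the dual challenge of evaluating presentation quality and factual hallucination in AI academic writing, and its construction faced several difficulties. At the problem level, the evaluation dimensions must decouple presentation from hallucination orthogonally, and the metrics must both cover the structural elements of scientific writing and remain grounded in the original paper for fact-checking. At the dataset level, the challenges include selecting representative papers published after 2025 from top-tier venues, handling intellectual property and redistribution licensing, and maintaining diversity across domains at a limited scale so that the evaluation generalizes.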
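Common Scenarios
Typical Use Case
In AI-driven scientific writing, PaperWrite-Bench provides a standardized test environment for evaluating how well large language models generate academic papers. Under the Paper Reconstruction Evaluation framework, an agent rewrites the full paper from only an overview and limited resources, and the result is then compared with the original along two dimensions: presentation quality and factual consistency. This scenario simulates the actual workflow of AI-assisted research writing and establishes a quantifiable evaluation scheme, offering a baseline for comparing different models on academic writing tasks.
Academic Problems Addressed
The dataset addresses the problem that presentation quality and factual consistency are hard to evaluate quantitatively for AI-generated academic content. By decoupling the presentation and hallucination dimensions, researchers can separately examine a model's expressive ability in structure and argumentation and its reliability in citing facts and reporting data. This fine-grained evaluation provides a methodological basis for identifying the capability boundaries of AI writing systems, advances interpretable evaluation frameworks, and is relevant to guarding against academic misconduct.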
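Derived and Related Work
Work derived from this dataset includes the Paper Reconstruction Evaluation framework proposed by the University of Tokyo team, which was the first to analyze presentation quality and factual hallucination as orthogonally decoupled dimensions. Later studies reportedly extended the evaluation dimensions: the SciCheck system developed at Stanford University added a metric for reference credibility, and an MIT team built a cross-disciplinary paper generation benchmark. Together, these works have moved academic writing evaluation from a single quality score toward multi-dimensional diagnosis.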
The content above was collected and summarized by 遇见数据集.



