SpreadsheetBench

Name: SpreadsheetBench
Creator: maas
Published: 2025-12-05 16:54:40
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/SpreadsheetBench

Description:

# SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

| [Paper](https://arxiv.org/abs/2406.14991) | [GitHub](https://github.com/RUCKBReasoning/SpreadsheetBench) | [Homepage](https://spreadsheetbench.github.io/) |

We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from **912 real questions** gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements.

Furthermore, we propose a more reliable evaluation metric akin to online judge platforms: multiple spreadsheet files are created as test cases for each instruction, so that a solution is credited only if it is robust to spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.

# Data Statistics

SpreadsheetBench comprises 912 instructions and 2,729 test cases, an average of three test cases per instruction. The instructions cover a broad spectrum of spreadsheet manipulation types, including find, extract, sum, highlight, remove, modify, count, delete, calculate, and display. The spreadsheet files contain tabular data with varying row counts, column counts, numbers of tables, and table formats.

# Dataset Introduction

The data are located in ``all_data_912.tar.gz``, which contains 912 data points in JSON format. Each data point includes the following five attributes:

- ``id``: The unique id of the data point.
- ``instruction``: The question about spreadsheet manipulation.
- ``spreadsheet_path``: The folder path that stores the test cases.
- ``instruction_type``: The type of the question (i.e., Cell-Level Manipulation or Sheet-Level Manipulation).
- ``answer_position``: The cell position where the answer needs to be filled in.

The ``spreadsheet`` folder contains the corresponding spreadsheet files of the data points. There are two hundred folders in the ``spreadsheet`` folder, each named with a unique id. Each folder holds multiple test cases named ``{No.}_{id}_input.xlsx`` and ``{No.}_{id}_answer.xlsx``, representing the input file and the answer file, respectively.

# Environment Setup

To run the evaluation of SpreadsheetBench, please refer to our [GitHub repo](https://github.com/RUCKBReasoning/SpreadsheetBench) for detailed configuration.

# Evaluation Results

Our benchmark is hard for current LLMs and LLM agents. Even the SOTA model ([ChatGPT Agent](https://openai.com/index/introducing-chatgpt-agent/)) achieves only 45.5%.
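As an illustration of the test-case layout described under Dataset Introduction, the following is a minimal sketch of how the ``{No.}_{id}_input.xlsx`` / ``{No.}_{id}_answer.xlsx`` files in one data point folder might be paired up, and how per-test-case outcomes might be aggregated in an online-judge style. It is not the official evaluation code (see the GitHub repo for that). The function names ``pair_test_cases`` and ``aggregate`` are hypothetical, the sketch assumes ``{No.}`` is an integer, and the two aggregate scores (fraction of test cases passed, and all-or-nothing) are one plausible reading of the metric rather than something this page specifies.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed filename pattern: {No.}_{id}_input.xlsx or {No.}_{id}_answer.xlsx,
# with {No.} taken to be an integer (an assumption, not stated on this page).
CASE_RE = re.compile(r"^(?P<no>\d+)_(?P<id>.+)_(?P<kind>input|answer)\.xlsx$")


def pair_test_cases(folder):
    """Return {case_no: {"input": Path, "answer": Path}} for one data point folder.

    Test cases missing either the input or the answer file are skipped.
    """
    cases = defaultdict(dict)
    for path in Path(folder).glob("*.xlsx"):
        match = CASE_RE.match(path.name)
        if match:
            cases[int(match.group("no"))][match.group("kind")] = path
    return {
        no: files
        for no, files in sorted(cases.items())
        if {"input", "answer"} <= set(files)
    }


def aggregate(case_results):
    """Aggregate per-test-case pass/fail booleans for a single instruction.

    Returns (fraction_passed, all_passed) where all_passed is 1.0 only if
    every test case passed. Both scores are illustrative assumptions.
    """
    if not case_results:
        raise ValueError("an instruction needs at least one test case")
    fraction_passed = sum(case_results) / len(case_results)
    all_passed = 1.0 if all(case_results) else 0.0
    return fraction_passed, all_passed
```

Requiring every test case to pass mirrors how online judges reject solutions that only work for one particular input, which is the motivation the benchmark gives for supplying several spreadsheets per instruction.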


Provider: maas

Created: 2025-10-13
