VisPlotBench
Source: ModelScope community | Updated: 2025-12-05 | Indexed: 2025-11-03
Download link:
https://modelscope.cn/datasets/TIGER-Lab/VisPlotBench
Description:
# VisPlotBench (A Multi-Language Benchmark for Visualization Coding Agents)
[**🌐 Homepage**](https://tiger-ai-lab.github.io/VisCoder2) | [**💻 GitHub**](https://github.com/TIGER-AI-Lab/VisCoder2) | [**📖 Paper**](https://arxiv.org/abs/2510.23642) | [**🤗 VisCoder2**](https://hf.co/collections/TIGER-Lab/viscoder2)
---
## 🔔 News
- **🔥 [2025-10-25]** VisPlotBench is released as part of the **VisCoder2** project, providing the first systematic benchmark for multi-language visualization coding agents.
- **📦 [2025-10-25]** Evaluation scripts are now available on the [GitHub repository](https://github.com/TIGER-AI-Lab/VisCoder2/tree/main/VisPlotBench).
---
## Dataset Description
**VisPlotBench** is a benchmark for evaluating visualization coding agents across **eight programming languages**.
Unlike prior efforts that target a single language or chart type, VisPlotBench features **888 executable tasks**, **rendered outputs**, and a standardized **execute–render–score** protocol for both initial generation and multi-round self-debug evaluation.
Each task provides:
- a **natural-language instruction** describing the visualization goal,
- corresponding **reference code** in one of eight supported languages, and
- the **rendered reference image** for visual alignment evaluation.
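
The three per-task components listed above can be pictured as a small record type. The following is a minimal sketch only: the class name `VisTask`, its field names, the lowercase language identifiers, and the sample values are hypothetical and are not taken from the released dataset schema.

```python
from dataclasses import dataclass
from pathlib import Path

# Lowercase identifiers for the eight languages named in this card.
# The exact spelling used by the dataset is an assumption.
SUPPORTED_LANGUAGES = (
    "python", "vega-lite", "lilypond", "mermaid",
    "svg", "latex", "asymptote", "html",
)


@dataclass(frozen=True)
class VisTask:
    """One benchmark task (hypothetical field names)."""
    instruction: str        # natural-language description of the visualization goal
    reference_code: str     # reference solution in one supported language
    language: str           # which of the eight languages the code is written in
    reference_image: Path   # rendered output of the reference code

    def __post_init__(self) -> None:
        # Reject tasks whose language is outside the supported set.
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language!r}")


# Made-up example values, for illustration only.
task = VisTask(
    instruction="Draw a bar chart of monthly sales.",
    reference_code="import matplotlib.pyplot as plt",
    language="python",
    reference_image=Path("reference.png"),
)
```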

---
## Data Construction
VisPlotBench combines curated examples from library documentation, high-quality open-source code, and programmatic rendering pipelines. All code snippets are executed in isolated environments to ensure **valid rendering and executability**, and visually trivial outputs are removed. Each task is annotated with a **Visual Category** and **Subtype**, covering **13 categories** such as Bars, Lines, Areas, 3D, Scatter, Hierarchies, Networks & Flows, and Music.
Tasks are then extended with a five-component instruction schema:
> **Setup → Plot Instruction → Data Instruction → Task Description → Style Description**
This ensures consistent structure across languages while preserving language-specific syntax and conventions.
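
As a rough illustration of the five-component schema, the sketch below joins the components in the stated order into a single instruction string. The class name `Instruction`, the `render` method, the blank-line separator, and the sample text are all assumptions made for this example; the benchmark's actual prompt format may differ.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Instruction:
    """Five instruction components, declared in schema order (hypothetical names)."""
    setup: str
    plot_instruction: str
    data_instruction: str
    task_description: str
    style_description: str

    def render(self) -> str:
        # Join the non-empty components in declaration order,
        # separated by blank lines (the separator is an assumption).
        parts = (getattr(self, f.name).strip() for f in fields(self))
        return "\n\n".join(p for p in parts if p)


# Made-up component text, for illustration only.
example = Instruction(
    setup="Use Python with matplotlib.",
    plot_instruction="Create a grouped bar chart.",
    data_instruction="Use the columns 'month' and 'sales'.",
    task_description="Compare sales across months.",
    style_description="Use a blue palette and label both axes.",
)
prompt = example.render()
```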
---
## Evaluation Protocol
VisPlotBench defines a unified **execute–render–score** evaluation pipeline:
1. **Execution Pass Rate (Exec Pass)** — checks if generated code runs successfully and produces a valid visualization.
2. **Task Score** — assesses instruction compliance using an LLM-based semantic rubric.
3. **Visual Score** — measures perceptual similarity between generated and reference images.
The benchmark also supports **multi-round self-debugging**, where models can refine code up to three times using feedback from execution logs, simulating real-world visualization correction loops.
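
The self-debug loop described above can be sketched as follows. This is not the project's evaluation script: `ExecResult`, `self_debug`, `toy_execute`, and `toy_repair` are hypothetical names, the executor only runs Python source in-process without checking that an image was produced, and the Task Score and Visual Score stages are left out. The sketch shows only the control flow of "execute, then repair from the execution log for at most three rounds".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExecResult:
    ok: bool    # True if the code ran without raising
    log: str    # execution log fed back to the model on failure


def self_debug(
    code: str,
    execute: Callable[[str], ExecResult],
    repair: Callable[[str, str], str],
    max_rounds: int = 3,
) -> tuple[str, ExecResult, int]:
    """Return (final code, final result, number of repair rounds used)."""
    result = execute(code)
    rounds = 0
    # Keep repairing until the code runs or the round budget is spent.
    while not result.ok and rounds < max_rounds:
        code = repair(code, result.log)
        result = execute(code)
        rounds += 1
    return code, result, rounds


def toy_execute(code: str) -> ExecResult:
    # Stand-in executor: runs Python source in an empty namespace.
    try:
        exec(compile(code, "<generated>", "exec"), {})
    except Exception as exc:
        return ExecResult(ok=False, log=f"{type(exc).__name__}: {exc}")
    return ExecResult(ok=True, log="")


def toy_repair(code: str, log: str) -> str:
    # Stand-in for a model call that rewrites code using the log.
    return code.replace("pritn", "print")


final_code, result, rounds = self_debug("pritn('chart')", toy_execute, toy_repair)
```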
---
## Language Configurations
VisPlotBench provides eight separate configurations, each corresponding to a supported visualization language:
| Language | #Test Samples |
|-----------|---------------|
| Python | 196 |
| Vega-Lite | 129 |
| LilyPond | 55 |
| Mermaid | 131 |
| SVG | 65 |
| LaTeX | 112 |
| Asymptote | 92 |
| HTML | 108 |
Each configuration includes verified executable examples with paired natural-language descriptions and rendered outputs.
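
As a quick consistency check, the per-language counts in the table can be summed and compared with the 888 tasks stated in the dataset description. The dictionary below copies the table; the variable names are chosen for this example.

```python
# Test-sample counts copied from the table above.
TEST_SAMPLES = {
    "Python": 196,
    "Vega-Lite": 129,
    "LilyPond": 55,
    "Mermaid": 131,
    "SVG": 65,
    "LaTeX": 112,
    "Asymptote": 92,
    "HTML": 108,
}

total = sum(TEST_SAMPLES.values())
assert total == 888  # matches the task count given in the description

# Share of the benchmark contributed by each language, largest first.
shares = {lang: n / total for lang, n in TEST_SAMPLES.items()}
for lang, n in sorted(TEST_SAMPLES.items(), key=lambda kv: -kv[1]):
    print(f"{lang:<10} {n:>4} {shares[lang]:6.1%}")
```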
---
## Citation
```bibtex
@article{ni2025viscoder2,
title={VisCoder2: Building Multi-Language Visualization Coding Agents},
author={Ni, Yuansheng and Cai, Songcheng and Chen, Xiangchao and Liang, Jiarong and Lyu, Zhiheng and Deng, Jiaqi and Zou, Kai and Nie, Ping and Yuan, Fei and Yue, Xiang and others},
journal={arXiv preprint arXiv:2510.23642},
year={2025}
}
@article{ni2025viscoder,
title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
journal={arXiv preprint arXiv:2506.03930},
year={2025}
}
```
Provider: maas
Created: 2025-10-29



