VisCode-200K
ModelScope Community · Updated 2025-12-05 · Indexed 2025-06-07
Download link:
https://modelscope.cn/datasets/TIGER-Lab/VisCode-200K
Description:
# VisCode-200K
[🏠 Project Page](https://tiger-ai-lab.github.io/VisCoder) | [💻 GitHub](https://github.com/TIGER-AI-Lab/VisCoder) | [📖 Paper](https://arxiv.org/abs/2506.03930) | [🤗 VisCoder-3B](https://huggingface.co/TIGER-Lab/VisCoder-3B) | [🤗 VisCoder-7B](https://huggingface.co/TIGER-Lab/VisCoder-7B)
**VisCode-200K** is a large-scale instruction-tuning dataset for training language models to generate and debug **executable Python visualization code**.
## 🧠 Overview
VisCode-200K contains over **200,000** samples for executable Python visualization tasks. Each sample includes a natural language instruction and the corresponding Python code, structured as a `messages` list in ChatML format.
We construct VisCode-200K through a scalable pipeline that integrates cleaned plotting code, synthetic instruction generation, runtime validation, and multi-turn dialogue construction.
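The runtime-validation step can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper name `runs_cleanly`, the 30-second timeout, and the "exit status 0" criterion are all assumptions chosen for illustration, and a real filter for plotting code would likely also check that a figure was produced.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if `code` exits with status 0 within the timeout.

    Hypothetical helper: the criterion and timeout are assumptions,
    not the dataset authors' actual validation rule.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep any files the snippet writes in the temp dir
                capture_output=True,  # swallow stdout/stderr of the snippet
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


candidates = [
    "print(sum(range(10)))",           # valid: exits with status 0
    "import not_a_real_module_xyz",    # invalid: raises ModuleNotFoundError
]
# Keep only the snippets that execute without error.
kept = [code for code in candidates if runs_cleanly(code)]
```

Running each snippet in a separate interpreter process isolates failures and hangs from the filtering script itself, at the cost of interpreter start-up time per sample.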

## 📁 Data Format
Each sample is a JSON object with the following two keys:
```json
{
  "uuid": "6473df7ef4704da0a218ea71dc2d641b",
  "messages": [
    {"role": "user", "content": "Instruction..."},
    {"role": "assistant", "content": "Visualization Python code..."}
  ]
}
```
- `uuid`: A unique identifier for the sample.
- `messages`: A list of dialogue turns in the following format:
  - The **user** provides a natural language instruction describing a visualization task.
  - The **assistant** responds with Python code that generates the corresponding plot using a variety of libraries.
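
As a minimal sketch of working with this format, the snippet below parses one sample, pulls out the instruction and the code, and renders the turns as ChatML-style text. The helper names `split_turns` and `to_chatml` are hypothetical, and the `<|im_start|>`/`<|im_end|>` delimiters are the common ChatML convention; the exact special tokens depend on the tokenizer of the model being trained.

```python
import json

# The placeholder sample from the format description above.
raw = """
{
  "uuid": "6473df7ef4704da0a218ea71dc2d641b",
  "messages": [
    {"role": "user", "content": "Instruction..."},
    {"role": "assistant", "content": "Visualization Python code..."}
  ]
}
"""


def split_turns(sample: dict) -> tuple[str, str]:
    """Return (first user instruction, last assistant reply) of a sample."""
    messages = sample["messages"]
    instruction = next(m["content"] for m in messages if m["role"] == "user")
    reply = next(
        m["content"] for m in reversed(messages) if m["role"] == "assistant"
    )
    return instruction, reply


def to_chatml(messages: list[dict]) -> str:
    """Render turns with ChatML-style delimiters (assumed token spelling)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


sample = json.loads(raw)
instruction, code = split_turns(sample)
text = to_chatml(sample["messages"])
```

In practice a tokenizer's own chat template would normally replace `to_chatml`; the hand-rolled version is shown only to make the layout of the turns explicit.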
## 🧪 Use Cases
VisCode-200K is designed for:
- 📊 Instruction tuning for Python visualization code generation.
- 🔁 Multi-turn self-correction via dialogue with execution feedback.
- 🧠 Training models to align natural language, code semantics, and visual outputs.
This dataset supports the development of [VisCoder](https://huggingface.co/collections/TIGER-Lab/viscoder-6840333efe87c4888bc93046) models, including [VisCoder-3B](https://huggingface.co/TIGER-Lab/VisCoder-3B) and [VisCoder-7B](https://huggingface.co/TIGER-Lab/VisCoder-7B), evaluated on [PandasPlotBench](https://github.com/TIGER-AI-Lab/VisCoder/tree/main/eval).
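
The multi-turn self-correction use case can be sketched as follows: each failed attempt is followed by a user turn carrying the execution error, until an attempt succeeds. This is an illustrative assumption about how such dialogues might be assembled, not the dataset's actual construction code; the function names and the feedback wording are hypothetical.

```python
import subprocess
import sys


def execute(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run `code` in a fresh interpreter; return (succeeded, stderr text)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout_s} seconds"
    return result.returncode == 0, result.stderr


def correction_dialogue(instruction: str, attempts: list[str]) -> list[dict]:
    """Chain attempts into turns, adding error feedback after each failure.

    Stops at the first attempt that runs; if every attempt fails, the
    dialogue ends with a feedback turn.
    """
    messages = [{"role": "user", "content": instruction}]
    for code in attempts:
        messages.append({"role": "assistant", "content": code})
        ok, stderr = execute(code)
        if ok:
            break
        # Hypothetical feedback wording; the dataset's real phrasing may differ.
        feedback = (
            "The code failed with this error:\n" + stderr.strip() + "\nPlease fix it."
        )
        messages.append({"role": "user", "content": feedback})
    return messages


dialogue = correction_dialogue(
    "Print the mean of [1, 2, 3].",
    [
        "print(mean([1, 2, 3]))",     # fails: `mean` is undefined
        "print(sum([1, 2, 3]) / 3)",  # succeeds
    ],
)
```

The resulting `dialogue` uses the same `messages` layout as the samples above, so it could be rendered or tokenized with the same tooling.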
## 📖 Citation
```bibtex
@article{ni2025viscoder,
  title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
  author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2506.03930},
  year={2025}
}
```
Provider: maas
Created: 2025-06-06



