MermaidSeqBench
Source: ModelScope community. Updated 2025-11-27; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/ibm-research/MermaidSeqBench
Resource description:
# Dataset Card for MermaidSeqBench
## Dataset Summary
This dataset provides a **human-verified benchmark** for assessing large language models (LLMs) on their ability to generate **Mermaid sequence diagrams** from natural language prompts.
The dataset was synthetically generated using large language models (LLMs), starting from a small set of seed examples provided by a subject-matter expert. All outputs were subsequently **manually verified and corrected by human annotators** to ensure validity and quality.
The dataset consists of **132 samples**, each containing:
- `nl_task_title`: Short title of the task
- `nl_task_desc`: Detailed natural language description of the task
- `input_prompt`: Full instructional natural language prompt for the task (given as input to the model)
- `expected_output`: The target Mermaid diagram code expected for the task
## Intended Uses
- **Benchmarking**: Evaluate LLM performance on generating valid Mermaid sequence diagrams
- **Instruction Tuning**: Can serve as a high-quality dataset for fine-tuning or adapting LLMs to produce structured diagram outputs
- **Evaluation**: Use as a test set for model comparison across diagram generation tasks
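Before sending model outputs to a judge, a cheap structural pre-check can filter responses that are clearly not sequence diagrams. The sketch below is only an illustration and is not part of the benchmark's evaluation pipeline; the function names are hypothetical. It relies on one fact about Mermaid: a sequence diagram begins with the `sequenceDiagram` keyword. It does not validate the rest of the syntax.

```python
import re

# Matches an optional Markdown code fence (three backticks, optionally tagged "mermaid").
FENCE_RE = re.compile(r"^`{3}(?:mermaid)?\s*\n(.*?)\n`{3}\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the text inside a surrounding code fence, or the text itself."""
    text = text.strip()
    match = FENCE_RE.match(text)
    return match.group(1) if match else text


def looks_like_sequence_diagram(text: str) -> bool:
    """True if the first meaningful line is the `sequenceDiagram` keyword."""
    body = strip_code_fence(text)
    for line in body.splitlines():
        stripped = line.strip()
        # Skip blank lines and Mermaid `%%` comment lines.
        if not stripped or stripped.startswith("%%"):
            continue
        return stripped == "sequenceDiagram"
    return False


FENCE = "`" * 3
fenced_output = f"{FENCE}mermaid\nsequenceDiagram\n    Alice->>Bob: Hello\n{FENCE}"
flowchart_output = "flowchart TD\n    A-->B"
chatty_output = "Here is the diagram:\nsequenceDiagram\n    Alice->>Bob: Hello"

print(looks_like_sequence_diagram(fenced_output))
print(looks_like_sequence_diagram(flowchart_output))
print(looks_like_sequence_diagram(chatty_output))
```

A response that passes this check can still be syntactically or logically wrong, so it complements rather than replaces rendering with Mermaid or scoring with a judge.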
## Limitations
- **Task Scope**: The dataset is limited to **sequence diagrams** and does not cover other Mermaid diagram types (e.g., flowcharts, class diagrams)
- **Size**: With 132 examples, the dataset is relatively small and best suited for **evaluation**, not large-scale training
- **Synthetic Origins**: While corrected by humans, initial examples were LLM-generated and may reflect limitations or biases in those generative models
## Data Structure
Each row corresponds to one benchmark example.
| Column | Description |
|-------------------|-------------|
| `nl_task_title` | Short task title |
| `nl_task_desc` | Detailed description of the natural language task |
| `input_prompt` | Instructional natural language prompt given to the model |
| `expected_output` | Target Mermaid sequence diagram code |
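As a minimal sketch of how a row with these four columns might be represented in code, the example below defines a small record type and a validator. The class name `BenchmarkExample`, the helper `parse_row`, and the sample row are illustrative assumptions; only the column names come from the table above, and the sample text is invented rather than taken from the dataset.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkExample:
    """One benchmark row; field names mirror the dataset columns."""
    nl_task_title: str
    nl_task_desc: str
    input_prompt: str
    expected_output: str


def parse_row(row: dict) -> BenchmarkExample:
    """Build an example from a dict-like row, rejecting missing or empty columns."""
    values = {}
    for field in fields(BenchmarkExample):
        value = row.get(field.name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"column {field.name!r} is missing or empty")
        values[field.name] = value
    return BenchmarkExample(**values)


# Invented sample row, for illustration only.
sample_row = {
    "nl_task_title": "User login",
    "nl_task_desc": "A user logs in through a web client that calls an auth service.",
    "input_prompt": "Generate a Mermaid sequence diagram for a user login flow.",
    "expected_output": "sequenceDiagram\n    User->>Client: Enter credentials",
}

example = parse_row(sample_row)
print(example.nl_task_title)
```

Validating rows up front keeps a malformed record from silently reaching the model or the judge.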
## Leaderboard
**Scoring.** Each model is evaluated on six criteria (Syntax, Mermaid Only, Logic, Completeness, Activation Handling, Error & Status Tracking) by two LLM-as-a-Judge evaluators: **DeepSeek-V3 (671B)** and **GPT-OSS (120B)**.
For each model, we report:
- **DeepSeek-V3 Avg** = mean of its six DeepSeek-V3 scores
- **GPT-OSS Avg** = mean of its six GPT-OSS scores
- **Overall Avg** = mean of the above two averages
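The three aggregates above can be sketched in a few lines. The per-criterion scores below are hypothetical placeholders (the card reports only averages), and the function names are assumptions made for illustration; the official scoring logic lives in the evaluation repository.

```python
from statistics import mean

CRITERIA = (
    "Syntax",
    "Mermaid Only",
    "Logic",
    "Completeness",
    "Activation Handling",
    "Error & Status Tracking",
)


def judge_average(scores: dict) -> float:
    """Mean of the six per-criterion scores given by one judge."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return mean(scores[c] for c in CRITERIA)


def overall_average(deepseek_scores: dict, gpt_oss_scores: dict) -> float:
    """Mean of the two per-judge averages."""
    return mean([judge_average(deepseek_scores), judge_average(gpt_oss_scores)])


# Hypothetical scores for a single model, not taken from the paper.
deepseek_scores = dict(zip(CRITERIA, [90.0, 100.0, 80.0, 70.0, 60.0, 50.0]))
gpt_oss_scores = dict(zip(CRITERIA, [80.0, 90.0, 70.0, 60.0, 50.0, 40.0]))

print(judge_average(deepseek_scores))   # 75.0
print(judge_average(gpt_oss_scores))    # 65.0
print(overall_average(deepseek_scores, gpt_oss_scores))  # 70.0
```

Because both judges score the same six criteria, the mean of the two averages equals the mean of all twelve scores; the two-step form simply keeps each judge's contribution visible.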
**Evaluation Code & Criteria.** The full evaluation pipeline, including prompt templates, judge setup, and per-criterion scoring logic, is available [on GitHub](https://github.com/IBM/MermaidSeqBench-Eval).
### Overall Ranking (higher is better)
| Rank | Model | DeepSeek-V3 Avg | GPT-OSS Avg | Overall Avg |
|---|-------------------------|----------------:|------------:|------------:|
| 1 | Qwen 2.5-7B-Instruct | 87.44 | 77.47 | 82.45 |
| 2 | Llama 3.1-8B-Instruct | 87.69 | 73.47 | 80.58 |
| 3 | Granite 3.3-8B-Instruct | 83.16 | 64.63 | 73.90 |
| 4 | Granite 3.3-2B-Instruct | 74.43 | 46.96 | 60.70 |
| 5 | Llama 3.2-1B-Instruct | 55.78 | 29.76 | 42.77 |
| 6 | Qwen 2.5-0.5B-Instruct | 46.99 | 29.41 | 38.20 |
> **Note.** The leaderboard aggregates across criteria for concise comparison; see the accompanying paper for per-criterion scores. Evaluations rely on LLM-as-a-Judge and may vary with judge choice and prompts.
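To make the judge-dependence concrete, the following sketch re-ranks the six models under each judge separately, using only the averages reported in the table above. The data structure and helper names are illustrative assumptions.

```python
# (DeepSeek-V3 Avg, GPT-OSS Avg, reported Overall Avg), copied from the table.
LEADERBOARD = {
    "Qwen 2.5-7B-Instruct": (87.44, 77.47, 82.45),
    "Llama 3.1-8B-Instruct": (87.69, 73.47, 80.58),
    "Granite 3.3-8B-Instruct": (83.16, 64.63, 73.90),
    "Granite 3.3-2B-Instruct": (74.43, 46.96, 60.70),
    "Llama 3.2-1B-Instruct": (55.78, 29.76, 42.77),
    "Qwen 2.5-0.5B-Instruct": (46.99, 29.41, 38.20),
}


def rank_by(column: int) -> list:
    """Model names sorted by the given column, best first."""
    return sorted(LEADERBOARD, key=lambda m: LEADERBOARD[m][column], reverse=True)


def recomputed_overall(model: str) -> float:
    """Mean of the two per-judge averages for one model."""
    deepseek_avg, gpt_oss_avg, _ = LEADERBOARD[model]
    return (deepseek_avg + gpt_oss_avg) / 2


by_deepseek = rank_by(0)
by_gpt_oss = rank_by(1)
by_overall = rank_by(2)

print("Top under DeepSeek-V3:", by_deepseek[0])
print("Top under GPT-OSS:", by_gpt_oss[0])
print("Top overall:", by_overall[0])
```

On these numbers, Llama 3.1-8B-Instruct leads under DeepSeek-V3 (87.69 vs. 87.44) while Qwen 2.5-7B-Instruct leads under GPT-OSS and overall; the remaining four positions are the same under both judges.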
## Citation
If you would like to cite this work in a paper or a presentation, the following is recommended (BibTeX entry):
```
@misc{shbita2025mermaidseqbenchevaluationbenchmarkllmtomermaid,
title={MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation},
author={Basel Shbita and Farhan Ahmed and Chad DeLuca},
year={2025},
eprint={2511.14967},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2511.14967},
}
```
Provider: maas
Created: 2025-10-04



