MermaidSeqBench

Name: MermaidSeqBench
Creator: maas
Published: 2025-11-27 16:51:57
License: No description provided

ModelScope community. Updated 2025-11-27; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/ibm-research/MermaidSeqBench


Resource description:

# Dataset Card for MermaidSeqBench

## Dataset Summary

This dataset provides a **human-verified benchmark** for assessing large language models (LLMs) on their ability to generate **Mermaid sequence diagrams** from natural language prompts.

The dataset was synthetically generated using LLMs, starting from a small set of seed examples provided by a subject-matter expert. All outputs were subsequently **manually verified and corrected by human annotators** to ensure validity and quality.

The dataset consists of **132 samples**, each containing:

- `nl_task_title`: Short title of the task
- `nl_task_desc`: Detailed natural language description of the task
- `input_prompt`: A full instructive natural language description of the task (to be given as input to the model)
- `expected_output`: The target Mermaid diagram code expected for the task

## Intended Uses

- **Benchmarking**: Evaluate LLM performance on generating valid Mermaid sequence diagrams
- **Instruction Tuning**: Can serve as a high-quality dataset for fine-tuning or adapting LLMs to produce structured diagram outputs
- **Evaluation**: Use as a test set for model comparison across diagram generation tasks

## Limitations

- **Task Scope**: The dataset is limited to **sequence diagrams** and does not cover other Mermaid diagram types (e.g., flowcharts, class diagrams)
- **Size**: With 132 examples, the dataset is relatively small and best suited for **evaluation**, not large-scale training
- **Synthetic Origins**: While corrected by humans, initial examples were LLM-generated and may reflect limitations or biases in those generative models

## Data Structure

Each row corresponds to one benchmark example.

| Column            | Description |
|-------------------|-------------|
| `nl_task_title`   | Short task title |
| `nl_task_desc`    | Detailed description of the natural language task |
| `input_prompt`    | Instructional natural language prompt given to the model |
| `expected_output` | Target Mermaid sequence diagram code |

## Leaderboard

**Scoring.** Each model is evaluated on six criteria (Syntax, Mermaid Only, Logic, Completeness, Activation Handling, Error & Status Tracking) by two LLM-as-a-Judge evaluators: **DeepSeek-V3 (671B)** and **GPT-OSS (120B)**. For each model, we report:

- **DeepSeek-V3 Avg** = mean of its six DeepSeek-V3 scores
- **GPT-OSS Avg** = mean of its six GPT-OSS scores
- **Overall Avg** = mean of the above two averages

**Evaluation Code & Criteria.** The full evaluation pipeline, including prompt templates, judge setup, and per-criterion scoring logic, is available [on GitHub](https://github.com/IBM/MermaidSeqBench-Eval).

### Overall Ranking (higher is better)

| Rank | Model                   | DeepSeek-V3 Avg | GPT-OSS Avg | Overall Avg |
|---|-------------------------|----------------:|------------:|------------:|
| 1 | Qwen 2.5-7B-Instruct    | 87.44 | 77.47 | 82.45 |
| 2 | Llama 3.1-8B-Instruct   | 87.69 | 73.47 | 80.58 |
| 3 | Granite 3.3-8B-Instruct | 83.16 | 64.63 | 73.90 |
| 4 | Granite 3.3-2B-Instruct | 74.43 | 46.96 | 60.70 |
| 5 | Llama 3.2-1B-Instruct   | 55.78 | 29.76 | 42.77 |
| 6 | Qwen 2.5-0.5B-Instruct  | 46.99 | 29.41 | 38.20 |

> **Note.** The leaderboard aggregates across criteria for concise comparison; see the accompanying paper for per-criterion scores. Evaluations rely on LLM-as-a-Judge and may vary with judge choice and prompts.

## Citation

If you would like to cite this work in a paper or a presentation, the following is recommended (BibTeX entry):

```
@misc{shbita2025mermaidseqbenchevaluationbenchmarkllmtomermaid,
  title={MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation},
  author={Basel Shbita and Farhan Ahmed and Chad DeLuca},
  year={2025},
  eprint={2511.14967},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2511.14967},
}
```
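The aggregation described under **Scoring** in the Leaderboard section can be illustrated with a minimal Python sketch. It is not the official evaluation code (that lives in the linked GitHub repository); the function names `judge_average` and `overall_average` are hypothetical, and the only data used are the two judge averages copied from the leaderboard table.

```python
from statistics import mean

# The six criteria named in the dataset card.
CRITERIA = (
    "Syntax",
    "Mermaid Only",
    "Logic",
    "Completeness",
    "Activation Handling",
    "Error & Status Tracking",
)


def judge_average(scores: dict[str, float]) -> float:
    """Mean of one judge's six per-criterion scores (hypothetical helper)."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return mean(scores[c] for c in CRITERIA)


def overall_average(deepseek_avg: float, gpt_oss_avg: float) -> float:
    """Overall Avg = mean of the two judge averages (hypothetical helper)."""
    return (deepseek_avg + gpt_oss_avg) / 2


# (DeepSeek-V3 Avg, GPT-OSS Avg) pairs copied from the leaderboard table.
leaderboard = {
    "Qwen 2.5-7B-Instruct": (87.44, 77.47),
    "Llama 3.1-8B-Instruct": (87.69, 73.47),
    "Granite 3.3-8B-Instruct": (83.16, 64.63),
    "Granite 3.3-2B-Instruct": (74.43, 46.96),
    "Llama 3.2-1B-Instruct": (55.78, 29.76),
    "Qwen 2.5-0.5B-Instruct": (46.99, 29.41),
}

# Rank models by Overall Avg, highest first.
ranking = sorted(
    leaderboard,
    key=lambda model: overall_average(*leaderboard[model]),
    reverse=True,
)

for rank, model in enumerate(ranking, start=1):
    print(rank, model, round(overall_average(*leaderboard[model]), 2))
```

Because the ranking uses the Overall Avg, Qwen 2.5-7B-Instruct places first even though Llama 3.1-8B-Instruct has the slightly higher DeepSeek-V3 Avg. Recomputing from the rounded table values may differ from the published Overall Avg in the last digit.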


Provider:

maas

Created:

2025-10-04
