MermaidSeqBench

Source: ModelScope community. Updated 2025-11-27; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/ibm-research/MermaidSeqBench
Resource description:
# Dataset Card for MermaidSeqBench

## Dataset Summary

This dataset provides a **human-verified benchmark** for assessing large language models (LLMs) on their ability to generate **Mermaid sequence diagrams** from natural language prompts.

The dataset was synthetically generated using LLMs, starting from a small set of seed examples provided by a subject-matter expert. All outputs were subsequently **manually verified and corrected by human annotators** to ensure validity and quality.

The dataset consists of **132 samples**, each containing:

- `nl_task_title`: Short title of the task
- `nl_task_desc`: Detailed natural language description of the task
- `input_prompt`: A full instructive natural language description of the task (to be given as input to the model)
- `expected_output`: The target Mermaid diagram code expected for the task

## Intended Uses

- **Benchmarking**: Evaluate LLM performance on generating valid Mermaid sequence diagrams
- **Instruction Tuning**: Can serve as a high-quality dataset for fine-tuning or adapting LLMs to produce structured diagram outputs
- **Evaluation**: Use as a test set for model comparison across diagram generation tasks

## Limitations

- **Task Scope**: The dataset is limited to **sequence diagrams** and does not cover other Mermaid diagram types (e.g., flowcharts, class diagrams)
- **Size**: With 132 examples, the dataset is relatively small and best suited for **evaluation**, not large-scale training
- **Synthetic Origins**: While corrected by humans, initial examples were LLM-generated and may reflect limitations or biases in those generative models

## Data Structure

Each row corresponds to one benchmark example.

| Column | Description |
|-------------------|-------------|
| `nl_task_title` | Short task title |
| `nl_task_desc` | Detailed description of the natural language task |
| `input_prompt` | Instructional natural language prompt given to the model |
| `expected_output` | Target Mermaid sequence diagram code |

## Leaderboard

**Scoring.** Each model is evaluated on six criteria (Syntax, Mermaid Only, Logic, Completeness, Activation Handling, Error & Status Tracking) by two LLM-as-a-Judge evaluators: **DeepSeek-V3 (671B)** and **GPT-OSS (120B)**. For each model, we report:

- **DeepSeek-V3 Avg** = mean of its six DeepSeek-V3 scores
- **GPT-OSS Avg** = mean of its six GPT-OSS scores
- **Overall Avg** = mean of the above two averages

**Evaluation Code & Criteria.** The full evaluation pipeline, including prompt templates, judge setup, and per-criterion scoring logic, is available [on GitHub](https://github.com/IBM/MermaidSeqBench-Eval).

### Overall Ranking (higher is better)

| Rank | Model | DeepSeek-V3 Avg | GPT-OSS Avg | Overall Avg |
|---|-------------------------|----------------:|------------:|------------:|
| 1 | Qwen 2.5-7B-Instruct | 87.44 | 77.47 | 82.45 |
| 2 | Llama 3.1-8B-Instruct | 87.69 | 73.47 | 80.58 |
| 3 | Granite 3.3-8B-Instruct | 83.16 | 64.63 | 73.90 |
| 4 | Granite 3.3-2B-Instruct | 74.43 | 46.96 | 60.70 |
| 5 | Llama 3.2-1B-Instruct | 55.78 | 29.76 | 42.77 |
| 6 | Qwen 2.5-0.5B-Instruct | 46.99 | 29.41 | 38.20 |

> **Note.** The leaderboard aggregates across criteria for concise comparison; see the accompanying paper for per-criterion scores. Evaluations rely on LLM-as-a-Judge and may vary with judge choice and prompts.

## Citation

If you would like to cite this work in a paper or a presentation, the following is recommended (BibTeX entry):

```
@misc{shbita2025mermaidseqbenchevaluationbenchmarkllmtomermaid,
  title={MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation},
  author={Basel Shbita and Farhan Ahmed and Chad DeLuca},
  year={2025},
  eprint={2511.14967},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2511.14967},
}
```
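The aggregation described under **Scoring** is plain averaging, so it can be cross-checked against the published table. The sketch below is illustrative only and is not taken from the official evaluation repository. The function names (`judge_avg`, `overall_avg`, `mismatched_rows`) and the 0.01 rounding tolerance are assumptions made for this example; the numbers are copied from the Overall Ranking table.

```python
# Illustrative sketch, not the official evaluation code.
# All function names and the tolerance are assumptions for this example.

CRITERIA = (
    "Syntax",
    "Mermaid Only",
    "Logic",
    "Completeness",
    "Activation Handling",
    "Error & Status Tracking",
)

# (model, DeepSeek-V3 Avg, GPT-OSS Avg, reported Overall Avg), from the table.
LEADERBOARD = [
    ("Qwen 2.5-7B-Instruct", 87.44, 77.47, 82.45),
    ("Llama 3.1-8B-Instruct", 87.69, 73.47, 80.58),
    ("Granite 3.3-8B-Instruct", 83.16, 64.63, 73.90),
    ("Granite 3.3-2B-Instruct", 74.43, 46.96, 60.70),
    ("Llama 3.2-1B-Instruct", 55.78, 29.76, 42.77),
    ("Qwen 2.5-0.5B-Instruct", 46.99, 29.41, 38.20),
]


def judge_avg(scores_by_criterion):
    """Mean of one judge's six per-criterion scores."""
    missing = [c for c in CRITERIA if c not in scores_by_criterion]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return sum(scores_by_criterion[c] for c in CRITERIA) / len(CRITERIA)


def overall_avg(deepseek_avg, gpt_oss_avg):
    """Mean of the two per-judge averages."""
    return (deepseek_avg + gpt_oss_avg) / 2


def mismatched_rows(rows, tol=0.01):
    """Models whose reported Overall Avg differs from the recomputed one."""
    return [
        model
        for model, ds, oss, reported in rows
        if abs(overall_avg(ds, oss) - reported) > tol
    ]


def ranked_models(rows):
    """Model names sorted by recomputed Overall Avg, best first."""
    ordered = sorted(rows, key=lambda r: overall_avg(r[1], r[2]), reverse=True)
    return [model for model, *_ in ordered]


if __name__ == "__main__":
    print("rows that do not match:", mismatched_rows(LEADERBOARD))
    for rank, model in enumerate(ranked_models(LEADERBOARD), start=1):
        print(rank, model)
```

Because the ranking is by Overall Avg, Llama 3.1-8B-Instruct places second despite having the highest DeepSeek-V3 Avg: its GPT-OSS Avg is four points lower than that of Qwen 2.5-7B-Instruct.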

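As a rough illustration of the four-column row structure, the sketch below checks that a record carries every field and that its `expected_output` at least starts like a Mermaid sequence diagram. It is an assumption-laden sanity check, not the benchmark's judge. The helper names and the sample record are hypothetical. The only Mermaid fact it relies on is that a sequence diagram begins with the `sequenceDiagram` keyword, optionally preceded by `%%` comment or directive lines. It does not handle YAML front matter.

```python
# Illustrative sketch: a cheap structural check for one benchmark row.
# Helper names and the sample record are hypothetical, not from the dataset.

REQUIRED_FIELDS = (
    "nl_task_title",
    "nl_task_desc",
    "input_prompt",
    "expected_output",
)


def missing_fields(record):
    """Return the required fields that are absent or not non-empty strings."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            missing.append(field)
    return missing


def looks_like_sequence_diagram(code):
    """True if the first meaningful line is the `sequenceDiagram` keyword.

    Blank lines and `%%` comment/directive lines are skipped. This does not
    parse the diagram, so it cannot confirm that the syntax is valid.
    """
    for line in code.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("%%"):
            continue
        return stripped == "sequenceDiagram"
    return False


# Hypothetical record shaped like a dataset row (content invented here).
SAMPLE_RECORD = {
    "nl_task_title": "User login",
    "nl_task_desc": "A user logs in through a web client and an auth service.",
    "input_prompt": "Generate a Mermaid sequence diagram for a user login.",
    "expected_output": (
        "sequenceDiagram\n"
        "    participant U as User\n"
        "    participant A as AuthService\n"
        "    U->>A: Submit credentials\n"
        "    A-->>U: Return session token\n"
    ),
}

if __name__ == "__main__":
    print("missing fields:", missing_fields(SAMPLE_RECORD))
    print(
        "starts as sequence diagram:",
        looks_like_sequence_diagram(SAMPLE_RECORD["expected_output"]),
    )
```

A check like this catches only gross failures, such as prose wrapped around the diagram or the wrong diagram type. The six criteria in the leaderboard are scored by LLM judges and go well beyond it.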
Provider: maas
Created: 2025-10-04