SpongeBOB9684/mermaid-text-to-diagram
收藏Hugging Face2026-04-15 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/SpongeBOB9684/mermaid-text-to-diagram
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
language:
- en
tags:
- Mermaid
datasets:
- Celiadraw/text-to-mermaid
- djds4rce/mermaid-synthetic
---
# Mermaid 11.14 Training Dataset
Dataset for fine-tuning Qwen3.5-0.8B for Mermaid Studio's text-to-Mermaid generation.
## Overview
This dataset contains **9,913** validated Mermaid 11.14 examples for training an ultra-light (<1B parameter) instruct model.
### Documentation Base
This dataset is **strictly based on Mermaid 11.14.0** official documentation:
- 📖 Official documentation: https://mermaid.js.org/intro/
- 🔧 Syntax specifications: https://mermaid.js.org/syntax/
- ✅ All examples conform to version 11.14.0
### Sources
This dataset combines three sources:
**1. Celiadraw/text-to-mermaid** - Existing examples (non-LLM generated)
- 📦 Dataset: https://huggingface.co/datasets/Celiadraw/text-to-mermaid
- 📊 Contribution: 8,912 examples (~90%)
- ℹ️ Type: Existing data (not LLM generated)
**2. djds4rce/mermaid-synthetic** - Existing examples (non-LLM generated)
- 📦 Dataset: https://huggingface.co/datasets/djds4rce/mermaid-synthetic
- 📊 Contribution: 926 examples (~9%)
- ℹ️ Type: Existing data (not LLM generated)
- 🔑 License: MIT License
**3. Edge Cases (LLM Generated)** - New examples for Mermaid 11.14 features
- 📊 Contribution: 75 examples (~1%)
- 🤖 Type: Generated via LLM to cover missing Mermaid 11.14 features
- ⚠️ **Important**: Only ~1% of dataset, strictly validated before inclusion
**Distribution Summary**:
- Existing data (non-LLM): 9,838 examples (~99%)
- LLM generated (edge cases): 75 examples (~1%)
## Dataset Statistics
### Diagram Types
| Type | Count | Percentage |
|------|-------|------------|
| flowchart | 5,418 | 54.7% |
| sequence | 1,501 | 15.1% |
| class | 1,026 | 10.4% |
| er | 715 | 7.2% |
| state | 503 | 5.1% |
| gantt | 408 | 4.1% |
| mindmap | 165 | 1.7% |
| unknown | 74 | 0.7% |
| pie | 71 | 0.7% |
| git | 30 | 0.3% |
| journey | 2 | 0.0% |
### Complexity Distribution
| Level | Description | Count | Percentage |
|-------|-------------|-------|------------|
| 1 | Simple (≤3 nodes) | 757 | 7.6% |
| 2 | Low (4-6 nodes) | 1,634 | 16.5% |
| 3 | Medium (7-12 nodes) | 4,591 | 46.3% |
| 4 | High (13-25 nodes, subgraphs) | 2,493 | 25.1% |
| 5 | Very Complex (>25 nodes, styling) | 438 | 4.4% |
### Feature Coverage
| Feature | Count |
|---------|-------|
| markdown-in-nodes | 5,341 |
| basic | 4,411 |
| subgraphs | 529 |
| styling | 161 |
| multi-directional-arrows | 49 |
| event-nodes | 17 |
| bolt-nodes | 17 |
| window-nodes | 17 |
## Schema
Each example contains the following fields:
```json
{
"instruction": "Natural language description of the diagram",
"context": "Optional conversational context",
"mermaid": "Validated Mermaid 11.14 code",
"version": "11.14",
"complexity_score": 1-5,
"diagram_type": "flowchart|sequence|state|class|er|...",
"feature_tags": ["feature1", "feature2", ...],
"validation_status": "validated",
"source": "celiadraw|djds4rce|generated|...",
"timestamp": "2026-04-12T..."
}
```
## Splits
- **Train**: 80% of examples
- **Validation**: 10% of examples
- **Test**: 10% of examples
## Methodology
### Validation
All examples have been validated using `@mermaid-js/mermaid-cli` (mmdc) to ensure:
- ✅ Correct Mermaid 11.14 syntax
- ✅ Successful rendering
- ✅ No parse errors
- ✅ Conformity with official specifications
### Data Sources
datasets:
**Existing Examples (Celiadraw, djds4rce)**
- Used as-is from original datasets
- No LLM modifications
- Upgraded to Mermaid 11.14 syntax where needed
**Edge Cases (LLM Generated)**
- ~1% of total dataset
- Generated via LLM specifically to cover missing Mermaid 11.14 features
- Strictly validated against mermaid-cli before inclusion
- Covers edge cases, complex scenarios, and new syntax features
**Why LLM Generation?**
- To cover Mermaid 11.14 features not present in existing datasets
- To provide diverse edge cases for robust model training
- All LLM-generated examples undergo same rigorous validation as existing data
## Documentation
### Official Mermaid Resources
This dataset is strictly based on **Mermaid 11.14.0** specifications:
- 📖 **Documentation**: https://mermaid.js.org/intro/
- 🔧 **Syntax Guide**: https://mermaid.js.org/syntax/
- 📚 **API Reference**: https://mermaid.js.org/intro/#Syntax
**Conformity**
- ✅ All examples validated against Mermaid 11.14.0
- ✅ Rendering tested with mermaid-cli
- ✅ No deprecated syntax
## Usage
### Loading the Dataset (Python)
```python
import json
# Load training data
train_data = []
with open('train.jsonl', 'r') as f:
for line in f:
train_data.append(json.loads(line))
# Example format
print(train_data[0].keys())
# dict_keys(['instruction', 'context', 'mermaid', 'version',
# 'complexity_score', 'diagram_type', 'feature_tags',
# 'validation_status', 'source', 'timestamp'])
```
### Fine-Tuning with HuggingFace
```python
from datasets import Dataset
# Load JSONL dataset
dataset = Dataset.from_json('train.jsonl')
# Format for instruction tuning
def format_example(example):
return {
"instruction": example["instruction"],
"input": example["context"] or "",
"output": example["mermaid"]
}
formatted_dataset = dataset.map(format_example)
```
## Validation
All examples have been validated using `@mermaid-js/mermaid-cli` (mmdc) to ensure:
- Correct Mermaid 11.14 syntax
- Successful rendering
- No parse errors
## License
This dataset is derived from sources under the following licenses:
- **Celiadraw/text-to-mermaid**: Public dataset
- **djds4rce/mermaid-synthetic**: MIT License
- **Edge cases (LLM generated)**: Mermaid Studio project
Modifications and improvements by the Mermaid Studio project are licensed under **Apache 2.0**.
---
Generated: 2026-04-13T09:48:28.385943
Mermaid Version: 11.14
Purpose: Ultra-light model training for browser deployment
提供机构:
SpongeBOB9684



