
SpongeBOB9684/mermaid-text-to-diagram

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/SpongeBOB9684/mermaid-text-to-diagram
Description:
---
license: mit
language:
- en
tags:
- Mermaid
datasets:
- Celiadraw/text-to-mermaid
- djds4rce/mermaid-synthetic
---

# Mermaid 11.14 Training Dataset

Dataset for fine-tuning Qwen3.5-0.8B for Mermaid Studio's text-to-Mermaid generation.

## Overview

This dataset contains **9,913** validated Mermaid 11.14 examples for training an ultra-light (<1B parameter) instruct model.

### Documentation Base

This dataset is **strictly based on Mermaid 11.14.0** official documentation:

- 📖 Official documentation: https://mermaid.js.org/intro/
- 🔧 Syntax specifications: https://mermaid.js.org/syntax/
- ✅ All examples conform to version 11.14.0

### Sources

This dataset combines three sources:

**1. Celiadraw/text-to-mermaid** - Existing examples (non-LLM generated)

- 📦 Dataset: https://huggingface.co/datasets/Celiadraw/text-to-mermaid
- 📊 Contribution: 8,912 examples (~90%)
- ℹ️ Type: Existing data (not LLM generated)

**2. djds4rce/mermaid-synthetic** - Existing examples (non-LLM generated)

- 📦 Dataset: https://huggingface.co/datasets/djds4rce/mermaid-synthetic
- 📊 Contribution: 926 examples (~9%)
- ℹ️ Type: Existing data (not LLM generated)
- 🔑 License: MIT License

**3. Edge Cases (LLM Generated)** - New examples for Mermaid 11.14 features

- 📊 Contribution: 75 examples (~1%)
- 🤖 Type: Generated via LLM to cover missing Mermaid 11.14 features
- ⚠️ **Important**: Only ~1% of the dataset, strictly validated before inclusion

**Distribution Summary**:

- Existing data (non-LLM): 9,838 examples (~99%)
- LLM generated (edge cases): 75 examples (~1%)

## Dataset Statistics

### Diagram Types

| Type | Count | Percentage |
|------|-------|------------|
| flowchart | 5,418 | 54.7% |
| sequence | 1,501 | 15.1% |
| class | 1,026 | 10.4% |
| er | 715 | 7.2% |
| state | 503 | 5.1% |
| gantt | 408 | 4.1% |
| mindmap | 165 | 1.7% |
| unknown | 74 | 0.7% |
| pie | 71 | 0.7% |
| git | 30 | 0.3% |
| journey | 2 | 0.0% |

### Complexity Distribution

| Level | Description | Count | Percentage |
|-------|-------------|-------|------------|
| 1 | Simple (≤3 nodes) | 757 | 7.6% |
| 2 | Low (4-6 nodes) | 1,634 | 16.5% |
| 3 | Medium (7-12 nodes) | 4,591 | 46.3% |
| 4 | High (13-25 nodes, subgraphs) | 2,493 | 25.1% |
| 5 | Very Complex (>25 nodes, styling) | 438 | 4.4% |

### Feature Coverage

| Feature | Count |
|---------|-------|
| markdown-in-nodes | 5,341 |
| basic | 4,411 |
| subgraphs | 529 |
| styling | 161 |
| multi-directional-arrows | 49 |
| event-nodes | 17 |
| bolt-nodes | 17 |
| window-nodes | 17 |

## Schema

Each example contains the following fields:

```json
{
  "instruction": "Natural language description of the diagram",
  "context": "Optional conversational context",
  "mermaid": "Validated Mermaid 11.14 code",
  "version": "11.14",
  "complexity_score": 1-5,
  "diagram_type": "flowchart|sequence|state|class|er|...",
  "feature_tags": ["feature1", "feature2", ...],
  "validation_status": "validated",
  "source": "celiadraw|djds4rce|generated|...",
  "timestamp": "2026-04-12T..."
}
```

## Splits

- **Train**: 80% of examples
- **Validation**: 10% of examples
- **Test**: 10% of examples

## Methodology

### Validation

All examples have been validated using `@mermaid-js/mermaid-cli` (mmdc) to ensure:

- ✅ Correct Mermaid 11.14 syntax
- ✅ Successful rendering
- ✅ No parse errors
- ✅ Conformity with official specifications

### Data Sources

**Existing Examples (Celiadraw, djds4rce)**

- Used as-is from the original datasets, apart from the syntax upgrades noted below
- No LLM modifications
- Upgraded to Mermaid 11.14 syntax where needed

**Edge Cases (LLM Generated)**

- ~1% of the total dataset
- Generated via LLM specifically to cover missing Mermaid 11.14 features
- Strictly validated against mermaid-cli before inclusion
- Covers edge cases, complex scenarios, and new syntax features

**Why LLM Generation?**

- To cover Mermaid 11.14 features not present in existing datasets
- To provide diverse edge cases for robust model training
- All LLM-generated examples undergo the same rigorous validation as existing data

## Documentation

### Official Mermaid Resources

This dataset is strictly based on **Mermaid 11.14.0** specifications:

- 📖 **Documentation**: https://mermaid.js.org/intro/
- 🔧 **Syntax Guide**: https://mermaid.js.org/syntax/
- 📚 **API Reference**: https://mermaid.js.org/intro/#Syntax

**Conformity**

- ✅ All examples validated against Mermaid 11.14.0
- ✅ Rendering tested with mermaid-cli
- ✅ No deprecated syntax

## Usage

### Loading the Dataset (Python)

```python
import json

# Load training data: one JSON object per line (JSONL)
train_data = []
with open('train.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        train_data.append(json.loads(line))

# Example format
print(train_data[0].keys())
# dict_keys(['instruction', 'context', 'mermaid', 'version',
#            'complexity_score', 'diagram_type', 'feature_tags',
#            'validation_status', 'source', 'timestamp'])
```

### Fine-Tuning with HuggingFace

```python
from datasets import Dataset

# Load JSONL dataset
dataset = Dataset.from_json('train.jsonl')

# Format for instruction tuning
def format_example(example):
    return {
        "instruction": example["instruction"],
        "input": example["context"] or "",  # context is optional
        "output": example["mermaid"]
    }

formatted_dataset = dataset.map(format_example)
```

## License

This dataset is derived from sources under the following licenses:

- **Celiadraw/text-to-mermaid**: Public dataset
- **djds4rce/mermaid-synthetic**: MIT License
- **Edge cases (LLM generated)**: Mermaid Studio project

Modifications and improvements by the Mermaid Studio project are licensed under **Apache 2.0**.

---

Generated: 2026-04-13T09:48:28.385943

Mermaid Version: 11.14

Purpose: Ultra-light model training for browser deployment
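The Schema section of the card lists ten fields per example. A minimal sketch of a record check follows, assuming the field types implied by that listing (strings throughout, an integer `complexity_score` between 1 and 5, a list for `feature_tags`) and assuming `context` may be null because the card calls it optional. `EXPECTED_FIELDS` and `schema_problems` are hypothetical names introduced for illustration and are not part of the dataset.

```python
from typing import Any

# Field names come from the card's Schema section; the types are an
# assumption inferred from the example values shown there.
EXPECTED_FIELDS: dict[str, type] = {
    "instruction": str,
    "context": str,
    "mermaid": str,
    "version": str,
    "complexity_score": int,
    "diagram_type": str,
    "feature_tags": list,
    "validation_status": str,
    "source": str,
    "timestamp": str,
}


def schema_problems(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the record looks fine."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: "context" is optional and may be null in the JSONL.
        if name == "context" and value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected_type) or isinstance(value, bool):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    score = record.get("complexity_score")
    if isinstance(score, int) and not isinstance(score, bool) and not 1 <= score <= 5:
        problems.append(f"complexity_score out of range 1-5: {score}")
    return problems
```

Running `schema_problems` over each line loaded from `train.jsonl` and collecting the non-empty results is one way to spot malformed rows before formatting them for instruction tuning.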
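The card states an 80/10/10 train/validation/test split but does not say how it was produced. The sketch below shows one common way to get such a split, a seeded shuffle followed by two cuts. The function name `split_examples`, the seed, and the rounding rule are assumptions, so the resulting sizes need not match the published files.

```python
import random


def split_examples(examples, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle a copy of `examples` and cut it into train/validation/test lists."""
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible for a given seed
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        # The test split takes whatever remains, so no example is dropped.
        "test": shuffled[n_train + n_val:],
    }


# Stand-in for the 9,913 examples: integer ids instead of real records.
splits = split_examples(range(9913))
print({name: len(rows) for name, rows in splits.items()})
```

With 9,913 items this particular rounding rule gives 7,930 training, 991 validation, and 992 test examples.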
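The Methodology section says every example was rendered with `@mermaid-js/mermaid-cli` (`mmdc`). The dataset's actual validation script is not shown, so the following is only a rough sketch of such a check: it writes the Mermaid code to a temporary file, calls `mmdc` with its `-i` (input) and `-o` (output) options, and treats a zero exit code plus an existing output file as success. The helper names `mmdc_command` and `renders_ok` are hypothetical, and the sketch assumes `mmdc` exits non-zero on a parse or render failure.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def mmdc_command(input_path, output_path, executable="mmdc"):
    """Build the mermaid-cli invocation that renders input_path to output_path."""
    return [executable, "-i", str(input_path), "-o", str(output_path)]


def renders_ok(mermaid_code, timeout=60):
    """Return True/False for render success, or None when mmdc is not installed."""
    executable = shutil.which("mmdc")
    if executable is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "diagram.mmd"
        target = Path(tmp) / "diagram.svg"
        source.write_text(mermaid_code, encoding="utf-8")
        try:
            result = subprocess.run(
                mmdc_command(source, target, executable),
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat a hung render as a failed validation.
            return False
        # Assumption: a non-zero exit code signals a parse or render error.
        return result.returncode == 0 and target.exists()
```

Returning `None` when the CLI is missing keeps "not checked" distinct from "failed", which matters when filtering examples on the result.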
Provider:
SpongeBOB9684