haowu1234/signal-dsl-dataset
Source: Hugging Face | Updated 2026-03-13 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/haowu1234/signal-dsl-dataset
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
language:
- en
- zh
tags:
- dsl
- domain-specific-language
- code-generation
- routing
- llm-routing
- signal-router
- synthetic
pretty_name: Signal DSL Dataset
size_categories:
- 100K<n<1M
---
# Signal DSL Dataset
A synthetic dataset for training models to generate **Signal DSL** (Domain-Specific Language) configurations from natural language descriptions.
## Dataset Description
Signal DSL is used to configure intelligent LLM routing with signals, routes, plugins, and algorithms. This dataset contains:
| Split | Samples | Description |
|-------|---------|-------------|
| **stage1_syntax_pt** | 18000 | Pure DSL for syntax pre-training |
| **stage2_sft** | 102087 | NL→DSL pairs for instruction following |
| **stage3_dpo** | 52532 | Preference pairs for DPO training |
| **eval_benchmark** | 200 | Held-out evaluation set |
## Signal DSL Overview
### Core Components
1. **SIGNAL**: Detection signals (keyword, domain, embedding, etc.)
2. **ROUTE**: Conditional routing rules based on signals
3. **PLUGIN**: Extra capabilities (RAG, cache, memory, etc.)
4. **ALGORITHM**: Ranking/selection algorithms
5. **BACKEND**: External service configurations
### Example DSL
```dsl
SIGNAL keyword code_keywords {
  keywords: ["code", "programming", "debug", "function"]
  threshold: 0.8
}

SIGNAL domain code_domain {
  description: "Code and programming related queries"
}

ROUTE code_route (description = "Route code queries to specialist") {
  PRIORITY 100
  WHEN keyword("code_keywords") OR domain("code_domain")
  MODEL "deepseek-coder" (
    reasoning = true,
    temperature = 0.1
  )
}
```
## Data Format
### Stage 1: Syntax Pre-training (Completion)
```json
{
  "id": "dsl_001",
  "dsl": "SIGNAL keyword kw_1 { keywords: [\"urgent\"] }",
  "complexity": "L1"
}
```
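As a rough sketch of how a Stage 1 record might be consumed, the snippet below parses one JSON line and turns the `dsl` field into a plain completion text. The `to_completion_text` helper and the `</s>` end-of-sequence marker are assumptions made for illustration; they are not part of the dataset or of any particular tokenizer.

```python
import json

# One record in the Stage 1 shape shown above, as a single JSON line.
raw = r'{"id": "dsl_001", "dsl": "SIGNAL keyword kw_1 { keywords: [\"urgent\"] }", "complexity": "L1"}'


def to_completion_text(record: dict, eos: str = "</s>") -> str:
    """Return the DSL text followed by an end-of-sequence marker (hypothetical choice)."""
    return record["dsl"].strip() + eos


record = json.loads(raw)
text = to_completion_text(record)
print(text)
```

In practice the marker would come from whichever tokenizer is used for pre-training.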
### Stage 2: SFT (Instruction-Input-Output)
```json
{
  "id": "sft_001",
  "instruction": "Convert the following natural language description into Signal DSL configuration.",
  "input": "Create a route that sends math questions to GPT-4",
  "output": "SIGNAL domain math { ... } ROUTE math_route { ... }",
  "style": "en_formal",
  "complexity": "L2"
}
```
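One possible way to turn a Stage 2 record into a prompt/completion pair is sketched below. The `build_sft_example` helper and the blank-line-separated prompt layout are assumptions; the dataset card does not prescribe a prompt template.

```python
def build_sft_example(record: dict) -> dict:
    """Join instruction and input into a prompt; keep the DSL output as the target."""
    prompt = f"{record['instruction']}\n\n{record['input']}\n\n"
    return {"prompt": prompt, "completion": record["output"]}


# The sample record from the format description above, copied verbatim.
record = {
    "id": "sft_001",
    "instruction": "Convert the following natural language description into Signal DSL configuration.",
    "input": "Create a route that sends math questions to GPT-4",
    "output": "SIGNAL domain math { ... } ROUTE math_route { ... }",
    "style": "en_formal",
    "complexity": "L2",
}

example = build_sft_example(record)
print(example["prompt"] + example["completion"])
```

Keeping prompt and completion separate makes it straightforward to mask the prompt tokens from the loss if the training setup calls for that.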
### Stage 3: DPO (Preference Pairs)
```json
{
  "id": "dpo_001",
  "prompt": "Generate a valid Signal DSL configuration.",
  "chosen": "SIGNAL keyword kw { keywords: [\"test\"] }",
  "rejected": "SIGNAL keyword kw { keywords: [\"test\" }",
  "mutation_type": "syntax_error",
  "mutation_category": "missing_bracket"
}
```
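The `missing_bracket` mutation in the sample above can be detected with a simple stack-based bracket check. This is a minimal sketch, not the dataset's own validator: the `brackets_balanced` helper is hypothetical, and it assumes that DSL string literals are double-quoted with backslash escapes.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(text: str) -> bool:
    """Return True if (), [] and {} are balanced, ignoring double-quoted strings."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack and not in_string


chosen = 'SIGNAL keyword kw { keywords: ["test"] }'
rejected = 'SIGNAL keyword kw { keywords: ["test" }'
print(brackets_balanced(chosen), brackets_balanced(rejected))  # True False
```

A check like this only covers bracket structure; other mutation types would need a real parser for the DSL grammar.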
## Complexity Levels
| Level | Description | Signals | Routes | Plugins |
|-------|-------------|---------|--------|---------|
| L1 | Simple | 1-2 | 1 | 0 |
| L2 | Basic | 2-3 | 1-2 | 0-1 |
| L3 | Medium | 3-5 | 2-3 | 1-2 |
| L4 | Complex | 5-8 | 3-5 | 2-4 |
| L5 | Expert | 8+ | 5+ | 4+ |
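The ranges in the table overlap (for example, two signals fit both L1 and L2), so block counts alone do not always determine a level. The sketch below counts top-level keywords and lists every level whose ranges admit the counts. How the published `complexity` labels were actually assigned is not stated here; the `count_blocks` and `matching_levels` helpers and the sample string are assumptions for illustration.

```python
import re

# (signals, routes, plugins) ranges copied from the table; None means no upper bound.
LEVELS = {
    "L1": ((1, 2), (1, 1), (0, 0)),
    "L2": ((2, 3), (1, 2), (0, 1)),
    "L3": ((3, 5), (2, 3), (1, 2)),
    "L4": ((5, 8), (3, 5), (2, 4)),
    "L5": ((8, None), (5, None), (4, None)),
}


def count_blocks(dsl: str) -> tuple:
    """Naive keyword count; it would miscount keywords inside string literals."""
    return tuple(
        len(re.findall(rf"\b{kw}\b", dsl)) for kw in ("SIGNAL", "ROUTE", "PLUGIN")
    )


def in_range(n: int, bounds: tuple) -> bool:
    lo, hi = bounds
    return n >= lo and (hi is None or n <= hi)


def matching_levels(dsl: str) -> list:
    """Return every level whose three ranges all contain the observed counts."""
    counts = count_blocks(dsl)
    return [
        level
        for level, ranges in LEVELS.items()
        if all(in_range(n, r) for n, r in zip(counts, ranges))
    ]


dsl = (
    'SIGNAL keyword a { keywords: ["x"] } '
    'SIGNAL domain b { description: "y" } '
    "ROUTE r { PRIORITY 1 }"
)
print(count_blocks(dsl), matching_levels(dsl))  # (2, 1, 0) ['L1', 'L2']
```

Some count combinations match no level at all (one signal with three routes, for instance), in which case the list is empty.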
## Usage
```python
from datasets import load_dataset

# Load all splits (returns a DatasetDict keyed by split name)
dataset = load_dataset("haowu1234/signal-dsl-dataset")

# Load a specific split
sft_data = load_dataset("haowu1234/signal-dsl-dataset", split="stage2_sft")

# Iterate through samples
for sample in sft_data:
    print(f"Input: {sample['input']}")
    print(f"Output: {sample['output']}")
```
## Training with this Dataset
This dataset is designed for 3-stage training:
1. **Stage 1 (Syntax PT)**: Train a language model on the pure DSL in `stage1_syntax_pt` so it learns the syntax
2. **Stage 2 (SFT)**: Fine-tune on the NL→DSL pairs in `stage2_sft` for instruction following
3. **Stage 3 (DPO)**: Run preference optimization on `stage3_dpo` so the model prefers correct DSL over incorrect DSL
## Generation Process
Data was generated using:
- **CFG Random Walk**: Grammar-based generation ensuring syntactic correctness
- **Template Expansion**: Schema-aware field value generation
- **Negative Sampling**: Systematic mutation for preference pairs
- **NL Paraphrasing**: Multiple linguistic styles (formal/casual, EN/ZH)
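To make the first and third techniques concrete, here is a toy sketch of a CFG random walk followed by a bracket-dropping mutation. The grammar, the depth limit, and both helper functions are hypothetical and far smaller than anything a real generator would use; they only illustrate the idea, not the pipeline that produced this dataset.

```python
import random

# Toy grammar: keys in angle brackets are nonterminals; other symbols are emitted as-is.
GRAMMAR = {
    "<signal>": [["SIGNAL keyword ", "<name>", " { keywords: [", "<words>", "] }"]],
    "<name>": [["kw_1"], ["kw_2"], ["urgent_kw"]],
    "<words>": [["<word>"], ["<word>", ", ", "<words>"]],
    "<word>": [['"urgent"'], ['"code"'], ['"debug"']],
}


def random_walk(symbol: str, rng: random.Random, depth: int = 0) -> str:
    """Expand a symbol by picking random productions until only terminals remain."""
    if symbol not in GRAMMAR:
        return symbol
    options = GRAMMAR[symbol]
    # Past the depth limit, take the first (non-recursive) production so the walk ends.
    production = options[0] if depth > 8 else rng.choice(options)
    return "".join(random_walk(s, rng, depth + 1) for s in production)


def drop_last_closing_bracket(dsl: str) -> str:
    """Negative sample: remove the final ']' to create a missing-bracket error."""
    i = dsl.rfind("]")
    return dsl if i < 0 else dsl[:i] + dsl[i + 1:]


rng = random.Random(0)  # fixed seed so the walk is reproducible
chosen = random_walk("<signal>", rng)
rejected = drop_last_closing_bracket(chosen)
print(chosen)
print(rejected)
```

Because every string comes from grammar productions, the `chosen` side is syntactically valid by construction, while the mutation yields a `rejected` side that differs by exactly one character.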
## Citation
```bibtex
@dataset{signal-dsl-dataset,
  author = {Signal Router Team},
  title = {Signal DSL Dataset: Synthetic Training Data for DSL Generation},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/haowu1234/signal-dsl-dataset}
}
```
## License
Apache 2.0 - See LICENSE for details.
Provider:
haowu1234



