
haowu1234/signal-dsl-dataset

Source: Hugging Face. Updated 2026-03-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/haowu1234/signal-dsl-dataset
Description:
---
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
language:
- en
- zh
tags:
- dsl
- domain-specific-language
- code-generation
- routing
- llm-routing
- signal-router
- synthetic
pretty_name: Signal DSL Dataset
size_categories:
- 100K<n<1M
---

# Signal DSL Dataset

A synthetic dataset for training models to generate **Signal DSL** (Domain-Specific Language) configurations from natural language descriptions.

## Dataset Description

Signal DSL is used to configure intelligent LLM routing with signals, routes, plugins, and algorithms. This dataset contains:

| Split | Samples | Description |
|-------|---------|-------------|
| **stage1_syntax_pt** | 18000 | Pure DSL for syntax pre-training |
| **stage2_sft** | 102087 | NL→DSL pairs for instruction following |
| **stage3_dpo** | 52532 | Preference pairs for DPO training |
| **eval_benchmark** | 200 | Held-out evaluation set |

## Signal DSL Overview

### Core Components

1. **SIGNAL**: Define detection signals (keyword, domain, embedding, etc.)
2. **ROUTE**: Conditional routing rules based on signals
3. **PLUGIN**: Add capabilities (RAG, cache, memory, etc.)
4. **ALGORITHM**: Ranking/selection algorithms
5. **BACKEND**: External service configurations

### Example DSL

```dsl
SIGNAL keyword code_keywords {
  keywords: ["code", "programming", "debug", "function"]
  threshold: 0.8
}

SIGNAL domain code_domain {
  description: "Code and programming related queries"
}

ROUTE code_route (description = "Route code queries to specialist") {
  PRIORITY 100
  WHEN keyword("code_keywords") OR domain("code_domain")
  MODEL "deepseek-coder" (
    reasoning = true,
    temperature = 0.1
  )
}
```

## Data Format

### Stage 1: Syntax Pre-training (Completion)

```json
{
  "id": "dsl_001",
  "dsl": "SIGNAL keyword kw_1 { keywords: [\"urgent\"] }",
  "complexity": "L1"
}
```

### Stage 2: SFT (Instruction-Input-Output)

```json
{
  "id": "sft_001",
  "instruction": "Convert the following natural language description into Signal DSL configuration.",
  "input": "Create a route that sends math questions to GPT-4",
  "output": "SIGNAL domain math { ... } ROUTE math_route { ... }",
  "style": "en_formal",
  "complexity": "L2"
}
```

### Stage 3: DPO (Preference Pairs)

```json
{
  "id": "dpo_001",
  "prompt": "Generate a valid Signal DSL configuration.",
  "chosen": "SIGNAL keyword kw { keywords: [\"test\"] }",
  "rejected": "SIGNAL keyword kw { keywords: [\"test\" }",
  "mutation_type": "syntax_error",
  "mutation_category": "missing_bracket"
}
```

## Complexity Levels

| Level | Description | Signals | Routes | Plugins |
|-------|-------------|---------|--------|---------|
| L1 | Simple | 1-2 | 1 | 0 |
| L2 | Basic | 2-3 | 1-2 | 0-1 |
| L3 | Medium | 3-5 | 2-3 | 1-2 |
| L4 | Complex | 5-8 | 3-5 | 2-4 |
| L5 | Expert | 8+ | 5+ | 4+ |

## Usage

```python
from datasets import load_dataset

# Load all splits
dataset = load_dataset("haowu1234/signal-dsl-dataset")

# Load specific split
sft_data = load_dataset("haowu1234/signal-dsl-dataset", split="stage2_sft")

# Iterate through samples
for sample in sft_data:
    print(f"Input: {sample['input']}")
    print(f"Output: {sample['output']}")
```

## Training with this Dataset

This dataset is designed for 3-stage training:

1. **Stage 1 (Syntax PT)**: Train language model on pure DSL to learn syntax
2. **Stage 2 (SFT)**: Fine-tune on NL→DSL pairs for instruction following
3. **Stage 3 (DPO)**: Preference optimization to prefer correct over incorrect DSL

## Generation Process

Data was generated using:

- **CFG Random Walk**: Grammar-based generation ensuring syntactic correctness
- **Template Expansion**: Schema-aware field value generation
- **Negative Sampling**: Systematic mutation for preference pairs
- **NL Paraphrasing**: Multiple linguistic styles (formal/casual, EN/ZH)

## Citation

```bibtex
@dataset{signal-dsl-dataset,
  author = {Signal Router Team},
  title = {Signal DSL Dataset: Synthetic Training Data for DSL Generation},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/haowu1234/signal-dsl-dataset}
}
```

## License

Apache 2.0 - See LICENSE for details.
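
As a rough illustration of how the Stage 2 and Stage 3 records from the Data Format section might be handled before training, the sketch below joins an SFT record into a prompt/completion pair and runs a simple bracket-balance check on a DPO pair. Both helpers (`format_sft_example`, `brackets_balanced`) are hypothetical names introduced here; they are not part of the dataset or of any Signal DSL tooling, and the balance check is only a crude stand-in for a real DSL parser.

```python
# Illustrative helpers only; the function names are assumptions, not dataset tooling.

CLOSERS = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(text: str) -> bool:
    """Return True if (), [] and {} are balanced, skipping double-quoted strings."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            # Inside a string literal: only look for the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "([{":
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack and not in_string


def format_sft_example(record: dict) -> dict:
    """Join instruction and input into one prompt; the output becomes the completion."""
    prompt = f"{record['instruction']}\n\n{record['input']}"
    return {"prompt": prompt, "completion": record["output"]}


# Records copied from the Data Format section (unused fields omitted).
sft_record = {
    "instruction": "Convert the following natural language description into Signal DSL configuration.",
    "input": "Create a route that sends math questions to GPT-4",
    "output": "SIGNAL domain math { ... } ROUTE math_route { ... }",
}
dpo_record = {
    "chosen": 'SIGNAL keyword kw { keywords: ["test"] }',
    "rejected": 'SIGNAL keyword kw { keywords: ["test" }',
    "mutation_category": "missing_bracket",
}

example = format_sft_example(sft_record)
print(example["prompt"])
print(brackets_balanced(dpo_record["chosen"]))    # True
print(brackets_balanced(dpo_record["rejected"]))  # False
```

A check like this only catches the `missing_bracket` style of mutation; other mutation categories would need a grammar-aware validator.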
Provider:
haowu1234