lyraaaa/synthprompts_148k

Name: lyraaaa/synthprompts_148k
Creator: lyraaaa
Published: 2026-04-18 07:46:43
License: no description available

Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/lyraaaa/synthprompts_148k

Description:

---
license: mit
language:
- en
---

# Diverse User Prompts

148,464 synthetic first-turn user prompts across 150 topics, 56 writing styles, 8 lengths, and 12 complexity types. Prompts are meant to be answerable and self-contained, with no prior context needed.

**Warning:** some prompts (especially the short ones) may not be answerable without further context. These will be filtered out in future versions.

## Schema

JSONL, one object per line:

```json
{
  "topic": "backend web development",
  "style": "frustrated and at the end of their rope — has tried everything, patience exhausted, needs this to work",
  "length": "a few paragraphs — 150-300 words, thorough setup with multiple aspects",
  "complexity": "works on my machine — something works in one context but fails in another for unclear reasons",
  "prompt": "I've been debugging this for six hours and I'm about to lose it..."
}
```

| Field | Type | Description |
|-------|------|-------------|
| `topic` | string | Domain category. 150 values (software engineering, cooking, philosophy, etc.) |
| `style` | string | How the person writes. 56 values (casual, clinical, slang-heavy, academic, etc.) |
| `length` | string | Target word count bucket. 8 values, from "3-8 words" to "500+" |
| `complexity` | string | Problem structure. 12 values (straightforward, XY problem, has a gotcha, etc.) |
| `prompt` | string | The user prompt text |

## Stats

| Metric | Value |
|---|---|
| Prompts | 153,829 |
| Avg length | 176 words |
| Median length | 118 words |
| Range | 3 – 1,691 words |
| Duplicates | 166 (0.1%) |
| Unique topics | 150 |
| Unique styles | 56 |

### Word count distribution

| Range (words) | % |
|-------|---|
| 0–10 | 9% |
| 10–50 | 22% |
| 50–150 | 30% |
| 150–350 | 25% |
| 350+ | 14% |

## Axes

Prompts are generated from four independent axes, permuted combinatorially:

**Topic** (150): plumbing, backend web development, consciousness and subjective experience, baking and pastry, competitive card games, ...

**Style** (56): "super casual, like texting a friend", "terse and clinical — spec-heavy, no wasted words", "4chan greentext energy", "ESL speaker — grammatically creative", ...

**Length** (8): "extremely terse — 3-8 words" through "an essay — 500+ words"

**Complexity** (12): straightforward, has a gotcha, XY problem, works on my machine, overconstrained, missing context, time pressure, already tried everything, scale problem, legacy constraints, conflicting information, multiple interacting issues

20 verbose styles are excluded from the 2 shortest lengths since those voices can't work in under 15 words. This reduces the 806,400 naive permutations (150 × 56 × 8 × 12) to 734,400 valid ones, since the exclusion removes 150 × 20 × 2 × 12 = 72,000 combinations.

## Generation

Each prompt comes from a single API call to Gemma 4 31B (`google/gemma-4-31b-it` via OpenRouter). One permutation per call, 128 calls in parallel. The model receives the spec and returns raw prompt text. Metadata is attached by the script, not the model.

The system prompt tells the model:

- Every prompt must be answerable (not a thank-you, not a monologue)
- Every prompt is a first turn (no prior context)
- Style is for variety, not rigid compliance (substance over style-matching)
- No example words in style descriptions (the model uses its own vocabulary)

## Use cases

- Fine-tuning: pair with responses from your target model for SFT datasets
- Evaluation: test how models handle diverse styles, complexity types, and domains
- Robustness: stress-test against misspelled, slang-heavy, or unusual inputs
- Classification: train on the labeled style/topic/complexity metadata
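
As a rough illustration of the schema and the permutation arithmetic described above, the following Python sketch recomputes the naive and valid permutation counts from the stated axis sizes and parses JSONL records while checking for the five documented fields. The function names (`count_permutations`, `parse_jsonl`) are hypothetical helpers written for this example, not part of the dataset or its generation script.

```python
import json

# Axis sizes as stated in the dataset card.
TOPICS, STYLES, LENGTHS, COMPLEXITIES = 150, 56, 8, 12
# 20 verbose styles are excluded from the 2 shortest length buckets.
VERBOSE_STYLES, SHORTEST_LENGTHS = 20, 2

# The five fields documented in the schema table.
EXPECTED_FIELDS = {"topic", "style", "length", "complexity", "prompt"}


def count_permutations():
    """Return (naive, valid) permutation counts for the four axes."""
    naive = TOPICS * STYLES * LENGTHS * COMPLEXITIES
    excluded = TOPICS * VERBOSE_STYLES * SHORTEST_LENGTHS * COMPLEXITIES
    return naive, naive - excluded


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts.

    Blank lines are skipped; a record lacking any documented field
    raises ValueError.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        records.append(record)
    return records


if __name__ == "__main__":
    naive, valid = count_permutations()
    print(f"naive={naive:,} valid={valid:,}")
```

To read a local copy of the data, an open file handle can be passed directly, for example `parse_jsonl(open(path, encoding="utf-8"))`, where `path` is wherever the JSONL file was saved; the actual file name inside the repository is not stated in the card.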

Provider:

lyraaaa
