
lyraaaa/synthprompts_148k

Hugging Face | Updated 2026-04-18 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/lyraaaa/synthprompts_148k
Description:
---
license: mit
language:
- en
---

# Diverse User Prompts

148,464 synthetic first-turn user prompts across 150 topics, 56 writing styles, 8 lengths, and 12 complexity types. Every prompt is meant to be answerable and self-contained, with no prior context needed.

**Warning: some prompts (especially the short ones) may not be answerable without further context. These will be filtered in future versions.**

## Schema

JSONL, one object per line:

```json
{
  "topic": "backend web development",
  "style": "frustrated and at the end of their rope — has tried everything, patience exhausted, needs this to work",
  "length": "a few paragraphs — 150-300 words, thorough setup with multiple aspects",
  "complexity": "works on my machine — something works in one context but fails in another for unclear reasons",
  "prompt": "I've been debugging this for six hours and I'm about to lose it..."
}
```

| Field | Type | Description |
|-------|------|-------------|
| `topic` | string | Domain category. 150 values (software engineering, cooking, philosophy, etc.) |
| `style` | string | How the person writes. 56 values (casual, clinical, slang-heavy, academic, etc.) |
| `length` | string | Target word count bucket. 8 values, from "3-8 words" to "500+" |
| `complexity` | string | Problem structure. 12 values (straightforward, XY problem, has a gotcha, etc.) |
| `prompt` | string | The user prompt text |

## Stats

| Metric | Value |
|---|---|
| Prompts | 153,829 |
| Avg length | 176 words |
| Median length | 118 words |
| Range | 3 – 1,691 words |
| Duplicates | 166 (0.1%) |
| Unique topics | 150 |
| Unique styles | 56 |

### Word count distribution

| Range (words) | % |
|-------|---|
| 0–10 | 9% |
| 10–50 | 22% |
| 50–150 | 30% |
| 150–350 | 25% |
| 350+ | 14% |

## Axes

Prompts are generated from four independent axes, permuted combinatorially:

**Topic** (150): plumbing, backend web development, consciousness and subjective experience, baking and pastry, competitive card games, ...

**Style** (56): "super casual, like texting a friend", "terse and clinical — spec-heavy, no wasted words", "4chan greentext energy", "ESL speaker — grammatically creative", ...

**Length** (8): "extremely terse — 3-8 words" through "an essay — 500+ words"

**Complexity** (12): straightforward, has a gotcha, XY problem, works on my machine, overconstrained, missing context, time pressure, already tried everything, scale problem, legacy constraints, conflicting information, multiple interacting issues

20 verbose styles are excluded from the 2 shortest lengths, since those voices can't work in under 15 words. This reduces 806,400 naive permutations to 734,400 valid ones.

## Generation

Each prompt comes from a single API call to Gemma 4 31B (`google/gemma-4-31b-it` via OpenRouter). One permutation per call, 128 calls in parallel. The model receives the spec and returns raw prompt text. Metadata is attached by the script, not the model.

The system prompt tells the model:

- Every prompt must be answerable (not a thank-you, not a monologue)
- Every prompt is a first turn (no prior context)
- Style is for variety, not rigid compliance (substance over style-matching)
- No example words in style descriptions (the model uses its own vocabulary)

## Use cases

- Fine-tuning: pair with responses from your target model for SFT datasets
- Evaluation: test how models handle diverse styles, complexity types, and domains
- Robustness: stress-test against misspelled, slang-heavy, or unusual inputs
- Classification: train on the labeled style/topic/complexity metadata
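As a rough illustration of the Schema section, the sketch below parses JSONL records and tallies a few simple statistics. The helper names (`iter_records`, `summarize`) and the two sample records are made up for this example; the card does not name the data file, so the code reads from any iterable of lines rather than assuming a path.

```python
import json
from collections import Counter
from statistics import median
from typing import Iterable, Iterator

# The five fields listed in the Schema table.
FIELDS = ("topic", "style", "length", "complexity", "prompt")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, checking the schema fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [f for f in FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


def summarize(records: Iterable[dict]) -> dict:
    """Count records per topic and compute the median prompt length in words."""
    topics = Counter()
    word_counts = []
    for record in records:
        topics[record["topic"]] += 1
        # Whitespace splitting is an assumption; the card does not say
        # how its word counts were computed.
        word_counts.append(len(record["prompt"].split()))
    return {
        "n": len(word_counts),
        "topics": topics,
        "median_words": median(word_counts) if word_counts else 0,
    }


# Two hypothetical records standing in for real dataset lines.
sample_lines = [
    json.dumps({
        "topic": "plumbing",
        "style": "super casual, like texting a friend",
        "length": "placeholder length label",
        "complexity": "straightforward",
        "prompt": "How do I fix a leaking tap?",
    }),
    json.dumps({
        "topic": "baking and pastry",
        "style": "super casual, like texting a friend",
        "length": "placeholder length label",
        "complexity": "has a gotcha",
        "prompt": "Why does my sourdough collapse after proofing overnight in the fridge?",
    }),
]

summary = summarize(iter_records(sample_lines))
```

With a downloaded copy of the data, the same functions could be fed an open file handle instead of `sample_lines`, since a text file iterates line by line.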
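The permutation counts quoted in the Axes section can be checked with a few lines of arithmetic. This is only a sanity check based on the axis sizes stated in the card; the constant names, the `valid_specs` helper, and its toy inputs are invented for illustration and are not the author's script.

```python
from itertools import product

# Axis sizes as stated in the card.
N_TOPICS = 150
N_STYLES = 56
N_LENGTHS = 8
N_COMPLEXITIES = 12
N_VERBOSE_STYLES = 20    # styles excluded from the shortest lengths
N_SHORTEST_LENGTHS = 2   # lengths those styles are excluded from

naive = N_TOPICS * N_STYLES * N_LENGTHS * N_COMPLEXITIES
excluded = N_TOPICS * N_VERBOSE_STYLES * N_SHORTEST_LENGTHS * N_COMPLEXITIES
valid = naive - excluded


def valid_specs(topics, styles, lengths, complexities, verbose_styles, short_lengths):
    """Yield every (topic, style, length, complexity) tuple, skipping
    verbose styles paired with the shortest lengths."""
    for topic, style, length, complexity in product(topics, styles, lengths, complexities):
        if style in verbose_styles and length in short_lengths:
            continue
        yield topic, style, length, complexity


# Toy axes (hypothetical values) to show the same filter at small scale.
toy = list(valid_specs(
    topics=["t1", "t2"],
    styles=["terse", "verbose"],
    lengths=["short", "long"],
    complexities=["c1"],
    verbose_styles={"verbose"},
    short_lengths={"short"},
))
```

Here `naive` evaluates to 806,400 and `valid` to 734,400, matching the figures in the card; the toy run keeps 6 of its 8 combinations.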
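The Generation section describes one model call per permutation, many calls in parallel, and metadata attached by the script rather than the model. A minimal sketch of that shape, assuming a thread pool and a caller-supplied `call_model` function, might look like the following. The real OpenRouter request is not shown in the card, so `fake_model` is a stand-in and every function name here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

MAX_PARALLEL_CALLS = 128  # parallelism figure stated in the card


def build_record(spec: dict, call_model: Callable[[dict], str]) -> dict:
    """Call the model once for a spec and attach the spec's metadata
    to the returned text, so the model never writes the metadata itself."""
    prompt_text = call_model(spec).strip()
    return {**spec, "prompt": prompt_text}


def generate(
    specs: Iterable[dict],
    call_model: Callable[[dict], str],
    max_workers: int = MAX_PARALLEL_CALLS,
) -> List[dict]:
    """Run one call per spec in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda spec: build_record(spec, call_model), specs))


def fake_model(spec: dict) -> str:
    """Stand-in for the real API call (an assumption for this sketch)."""
    return f"  Placeholder question about {spec['topic']}?  "


specs = [
    {"topic": "plumbing", "style": "s", "length": "l", "complexity": "straightforward"},
    {"topic": "baking and pastry", "style": "s", "length": "l", "complexity": "has a gotcha"},
]
records = generate(specs, fake_model, max_workers=2)
```

Threads are used here because the work is I/O-bound; an async HTTP client would be an equally reasonable choice, and the card does not say which approach the original script took.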
Provided by:
lyraaaa