five

Yusiko/prompter

收藏
Hugging Face2026-03-20 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/Yusiko/prompter
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - az - en - tr - ru - de - fr - zh - ar - es - ja license: apache-2.0 task_categories: - text-generation task_ids: - language-modeling tags: - prompt-engineering - alpaca - multilingual - instruction-following - llm - fine-tuning - qlora - unsloth pretty_name: Prompter — Multilingual Prompt Engineering Dataset size_categories: - 1K<n<10K --- # 🧠 Prompter — Multilingual Prompt Engineering Dataset A high-quality **5,000-sample** instruction dataset for fine-tuning language models to act as expert prompt engineers. Given any simple user input, the model learns to expand it into a detailed, structured, professional prompt following best practices from Google's Prompt Engineering whitepaper. --- ## 📋 Dataset Summary | Field | Value | |-------|-------| | 🌍 Languages | 10 (az, en, tr, ru, de, fr, zh, ar, es, ja) | | 📦 Format | Alpaca (instruction / input / output) | | 📐 Size | 5,000 samples | | 🗂️ File format | Parquet | | 📜 License | Apache 2.0 | --- ## 🎯 What Does This Dataset Teach? The model learns to take a **short, vague prompt** and transform it into a **complete, production-ready prompt** with: - 🎭 **Role assignment** — "You are a senior software engineer..." - 🗂️ **System prompting** — defining big-picture context and output format - 📝 **Contextual prompting** — task-specific background and constraints - 🔗 **Chain-of-Thought (CoT)** — "Let's think step by step..." - 🎯 **Few-shot examples** — showing patterns before the task - 🔙 **Step-back prompting** — general principles before specific implementation - ✅ **Positive instructions** — what to do, not what to avoid - 📊 **Structured output** — JSON, markdown, tables where appropriate --- ## 💡 Sample **Instruction** (in Azerbaijani): ``` Prompt mühəndisi kimi aşağıdakı sadə promptu tam detallı, professional versiyaya çevir ``` **Input:** ``` Write a Python function ``` **Output (truncated):** ```markdown ## System Prompt You are a senior software engineer with 10+ years of Python/FastAPI experience. Your task is to write a Python function with production-quality standards. ## Role & Context Act as a senior engineer conducting a thorough code review and implementation session... ## Step-by-Step Instructions 1. Clarify requirements, edge cases, and constraints before writing any code 2. Design the interface and data structures first, then implement logic 3. Write the implementation with comprehensive inline documentation 4. Add input validation and robust error handling for all edge cases 5. Write unit tests covering happy path, edge cases, and error scenarios ## Output Requirements - Implementation: Complete, working code with no placeholders - Documentation: Inline comments explaining non-obvious logic - Tests: At minimum 3 test cases ... ``` --- ## 🏗️ Dataset Structure ```python { "instruction": str, # Task description in one of 10 languages "input": str, # Simple user prompt (1–6 words) "output": str, # Full expanded professional prompt } ``` ### 📊 Output Type Distribution | Type | Count | Description | |------|-------|-------------| | 🔷 Standard | ~3,280 | Role + system + contextual prompting | | 🔶 Few-shot | ~1,000 | 2 examples shown before the main task | | 🔹 Chain-of-Thought | ~460 | Step-by-step reasoning structure | | 🔸 Step-back | ~260 | General principles → specific implementation | ### 🗂️ Domain Coverage | Domain | Examples | |--------|----------| | 💻 Coding | Python, APIs, databases, testing, DevOps | | ✍️ Writing | Blog posts, docs, emails, reports | | 📊 Analysis | Code review, architecture, performance | | 🤖 ML / AI | Fine-tuning, RAG, agents, embeddings | | ☁️ DevOps | CI/CD, Kubernetes, Terraform, monitoring | | 📦 Data | ETL pipelines, schemas, data quality | | 💼 Business | OKRs, strategy, product roadmaps | ### 🌍 Language Distribution Instructions are evenly distributed across 10 languages: `Azerbaijani` · `English` · `Turkish` · `Russian` · `German` · `French` · `Chinese` · `Arabic` · `Spanish` · `Japanese` --- ## 🚀 Usage ### Load the dataset ```python from datasets import load_dataset dataset = load_dataset("Yusiko/prompter") print(dataset["train"][0]) ``` ### Fine-tune with Unsloth (recommended) ```python from unsloth import FastLanguageModel from trl import SFTTrainer, SFTConfig model, tokenizer = FastLanguageModel.from_pretrained( model_name = "unsloth/Qwen3-4B", max_seq_length = 1024, dtype = torch.float32, load_in_4bit = True, ) # Alpaca prompt template HEADER = ( "Below is an instruction that describes a task, paired with an input " "that provides further context. Write a response that appropriately " "completes the request.\n\n" ) def format_sample(examples): texts = [] for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"]): text = ( HEADER + "### Instruction:\n" + str(inst or "") + "\n\n" + "### Input:\n" + str(inp or "") + "\n\n" + "### Response:\n" + str(out or "") + tokenizer.eos_token ) texts.append(text) return {"text": texts} ``` > ⚠️ **Important:** Use string concatenation instead of `.format()` — the output texts contain `{curly braces}` that will cause `KeyError` with `str.format()`. ### Inference prompt format ``` Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: As a prompt engineer, transform this simple input into a fully detailed, professional prompt ### Input: {your simple prompt here} ### Response: ``` --- ## 📐 Design Principles This dataset was built following the **Google Prompt Engineering Whitepaper (February 2025)** by Lee Boonstra. Every output follows these rules: - 🟢 **Role prompting** — assigns a specific expert persona to the model - 🟢 **System prompting** — sets overarching context and output requirements - 🟢 **Positive instructions** — tells the model what to do, not what to avoid - 🟢 **Specific output format** — each prompt specifies the desired response structure - 🟢 **Action verbs** — uses verbs like Analyze, Generate, Implement, Write, Evaluate - 🟢 **Structured reasoning** — CoT entries guide the model through intermediate steps - 🟢 **Step-back abstraction** — 5% of entries establish general principles before specifics --- ## 🤗 Models Trained on This Dataset | Model | Base | Method | Link | |---------------------|------------|-----------------|-----------------| | Qwen3.5-4B Prompter | Qwen3.5-4B | QLoRA (Unsloth) | *(coming soon)* | --- ## 📄 Citation If you use this dataset in your work, please cite: ```bibtex @dataset{yusiko_prompter_2025, author = {Yusif}, title = {Prompter: Multilingual Prompt Engineering Dataset}, year = {2025}, publisher = {HuggingFace}, url = {https://huggingface.co/datasets/Yusiko/prompter} } ``` --- ## 🙏 Acknowledgements - [Google Prompt Engineering Whitepaper](https://drive.google.com/file/d/1AbaBYbEa_EbPelsT40-vj64L-2IwUJHG/view) — Lee Boonstra et al. - [Unsloth](https://github.com/unslothai/unsloth) — 2x faster fine-tuning - [TRL](https://github.com/huggingface/trl) — SFTTrainer --- *Built with ❤️ by [Yusif](https://huggingface.co/Yusiko) · Apache 2.0 License*
提供机构:
Yusiko
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作