
nchapman/figaro-data-v1

Hugging Face · Updated 2026-03-09 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/nchapman/figaro-data-v1
Description:
---
license: other
task_categories:
- text-generation
language:
- en
tags:
- chat
- sft
- system-prompt
- roleplay
- creative-writing
- function-calling
pretty_name: Figaro Data
size_categories:
- 100K<n<1M
---

# Figaro Data

Training dataset for [Figaro](https://github.com/nchapman/figaro), a fine-tuned language model that follows its system prompt and hopefully sounds more like a person, not a chatbot.

Built from 11 public conversation datasets, cleaned and enriched through an automated [pipeline](https://github.com/nchapman/figaro-data). Every conversation has a system message that specifically describes how the response should sound: a persona, a writing style, a character card, or a tool definition that matches what the conversation actually contains. The model learns that the system prompt *means something*.

## Format

Standard chat messages format. Every example has a `messages` array with `role` and `content` fields:

```json
{
  "messages": [
    {"role": "system", "content": "You are a noir fiction writer. Your prose is tight and cynical..."},
    {"role": "user", "content": "Write the opening scene of a detective story set in 1940s Chicago."},
    {"role": "assistant", "content": "Rain hit the window like it had a grudge..."}
  ]
}
```

Every example has a system message. This is by design.
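The format rules above can be checked mechanically. What follows is a minimal sketch, not code from the Figaro pipeline: the function name `check_example` is hypothetical, and the requirement that the system message comes first is inferred from the example rather than stated by the card. Role names are deliberately not restricted, since function-calling conversations may use roles beyond the three shown.

```python
def check_example(example):
    """Return a list of problems found in one example (an empty list means it passed)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        # Every message needs string-valued `role` and `content` fields.
        for key in ("role", "content"):
            if not isinstance(msg.get(key), str):
                problems.append(f"message {i} lacks a string '{key}' field")

    # Assumption: the system message is always the first message.
    first = messages[0]
    if not (isinstance(first, dict) and first.get("role") == "system"):
        problems.append("first message is not a system message")
    return problems


sample = {
    "messages": [
        {"role": "system", "content": "You are a noir fiction writer."},
        {"role": "user", "content": "Write the opening scene of a detective story."},
        {"role": "assistant", "content": "Rain hit the window like it had a grudge..."},
    ]
}
print(check_example(sample))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to summarize failures across a whole split.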

## Blend

~175K conversations across these categories:

| Category | Source | Count | % |
|---|---|---|---|
| General Q&A | Magpie-Pro-300K | 60,000 | 34% |
| Instruction following | tulu-3-personas | 30,000 | 17% |
| Prose | kalo-opus-instruct | 22,000 | 13% |
| System compliance | SystemChat-2.0 | 20,000 | 11% |
| RP: character | OpenCharacter | 15,000 | 9% |
| Function calling | hermes-function-calling | 8,000 | 5% |
| Creative writing | nopm_claude_writing | 6,350 | 4% |
| RP: fandom | bluemoon-fandom | 5,000 | 3% |
| RP: movie scripts | cinematika | 5,000 | 3% |
| Literary | Gutenberg-SFT | 3,000 | 2% |
| RP: human-written | LimaRP | 800 | <1% |

## How it was built

Nine pipeline stages:

1. **Pull**: Download and normalize all source datasets into a common messages format.
2. **Validate**: Enforce conversation structure: proper turn alternation, no blank messages, no duplicates.
3. **Filter**: Remove refusals via regex patterns, a [Minos-v1](https://huggingface.co/NousResearch/Minos-v1) classifier, and [LlamaGuard](https://huggingface.co/meta-llama/Llama-Guard-4-12B) content safety filtering on select datasets.
4. **Deslop**: Score assistant responses for AI-speak density using the [antislop](https://github.com/sam-paech/antislop-sampler) phrase list. Mid-range slop is rewritten by an LLM. High-density slop is removed.
5. **Enrich**: Give every conversation a system message that matches its content. Nine strategies handle different dataset types, including curated personas for general Q&A, extracted personas from instruction-following prompts, genre-matched prompts for creative writing, LLM-generated character cards for roleplay, and preserved tool definitions for function calling.
6. **Dedup**: MinHash LSH deduplication within and across datasets, keeping higher-priority sources on conflicts.
7. **Decontaminate**: Remove examples too similar to eval benchmarks (IFEval, MT-Bench, MMLU, OR-Bench, Sorry-Bench) via embedding cosine similarity.
8. **Blend**: Sample each dataset to its target count, add source metadata, shuffle.
9. **Push**: Upload to Hugging Face Hub.

The pipeline code is at [nchapman/figaro-data](https://github.com/nchapman/figaro-data).

## Source datasets and licenses

| Dataset | License |
|---|---|
| [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) | Llama 3.1 Community |
| [allenai/tulu-3-sft-personas-instruction-following](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) | ODC-BY |
| [cognitivecomputations/SystemChat-2.0](https://huggingface.co/datasets/cognitivecomputations/SystemChat-2.0) | Apache 2.0 |
| [anthracite-org/kalo-opus-instruct-22k-no-refusal](https://huggingface.co/datasets/anthracite-org/kalo-opus-instruct-22k-no-refusal) | Not listed |
| [anthracite-org/nopm_claude_writing_fixed](https://huggingface.co/datasets/anthracite-org/nopm_claude_writing_fixed) | Not listed |
| [ConicCat/Gutenberg-SFT](https://huggingface.co/datasets/ConicCat/Gutenberg-SFT) | Not listed |
| [grimulkan/LimaRP-augmented](https://huggingface.co/datasets/grimulkan/LimaRP-augmented) | Not listed |
| [xywang1/OpenCharacter](https://huggingface.co/datasets/xywang1/OpenCharacter) | Apache 2.0 |
| [Squish42/bluemoon-fandom-1-1-rp-cleaned](https://huggingface.co/datasets/Squish42/bluemoon-fandom-1-1-rp-cleaned) | Not listed |
| [jondurbin/cinematika-v0.1](https://huggingface.co/datasets/jondurbin/cinematika-v0.1) | Not listed |
| [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | Apache 2.0 |

This dataset is a blend of the above sources, each under its own license. Use accordingly.
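
The Blend stage listed under "How it was built" (sample each dataset to its target count, add source metadata, shuffle) can be sketched roughly as below. This is an illustration under stated assumptions, not the pipeline's actual code: the function name `blend`, the `source` field name, and the fixed seed are all hypothetical, and the counts in the usage lines are scaled-down stand-ins for the Blend table.

```python
import random


def blend(datasets, targets, seed=0):
    """Sample each dataset to its target count, tag each example with its source, shuffle.

    datasets: dict mapping source name -> list of example dicts
    targets:  dict mapping source name -> number of examples to keep
    """
    rng = random.Random(seed)  # assumption: a fixed seed, so the blend is reproducible
    blended = []
    for name, examples in datasets.items():
        # Never ask for more examples than the dataset holds.
        k = min(targets.get(name, 0), len(examples))
        for example in rng.sample(examples, k):
            # Build a new dict so the input examples are left untouched.
            blended.append({**example, "source": name})
    rng.shuffle(blended)
    return blended


pools = {
    "Magpie-Pro-300K": [{"messages": []} for _ in range(100)],
    "LimaRP": [{"messages": []} for _ in range(10)],
}
mix = blend(pools, {"Magpie-Pro-300K": 60, "LimaRP": 8})
print(len(mix))  # prints 68
```

Sampling without replacement keeps any example from appearing twice within a source, which fits a stage that runs after deduplication.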
Provider:
nchapman