nchapman/figaro-data-v1
Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nchapman/figaro-data-v1
Description:
---
license: other
task_categories:
- text-generation
language:
- en
tags:
- chat
- sft
- system-prompt
- roleplay
- creative-writing
- function-calling
pretty_name: Figaro Data
size_categories:
- 100K<n<1M
---
# Figaro Data
Training dataset for [Figaro](https://github.com/nchapman/figaro), a fine-tuned language model that follows its system prompt and aims to sound more like a person than a chatbot.
Built from 11 public conversation datasets, cleaned and enriched through an automated [pipeline](https://github.com/nchapman/figaro-data). Every conversation has a system message that specifically describes how the response should sound — a persona, a writing style, a character card, or a tool definition that matches what the conversation actually contains. The model learns that the system prompt *means something*.
## Format
Standard chat messages format. Every example has a `messages` array with `role` and `content` fields:
```json
{
  "messages": [
    {"role": "system", "content": "You are a noir fiction writer. Your prose is tight and cynical..."},
    {"role": "user", "content": "Write the opening scene of a detective story set in 1940s Chicago."},
    {"role": "assistant", "content": "Rain hit the window like it had a grudge..."}
  ]
}
```
Every example has a system message. This is by design.
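To make the format concrete, here is a minimal sketch of a structural check for one example. The function name `check_example` is hypothetical, and the rule that the system message comes first is an assumption drawn from the sample above, not something the card states outright.

```python
from typing import Any


def check_example(example: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one example; an empty list means none."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' is missing or empty"]

    problems = []
    for i, msg in enumerate(messages):
        # Each message is expected to carry both fields named in the card.
        if not isinstance(msg, dict) or not {"role", "content"} <= set(msg):
            problems.append(f"message {i} lacks 'role' or 'content'")

    # Assumption: the system message is the first message, as in the sample.
    first = messages[0]
    if not (isinstance(first, dict) and first.get("role") == "system"):
        problems.append("first message is not a system message")
    return problems
```

Running `check_example` over a handful of rows before training is a cheap way to confirm that a local copy matches the layout described here.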
## Blend
~175K conversations across these categories:
| Category | Source | Count | % |
|---|---|---|---|
| General Q&A | Magpie-Pro-300K | 60,000 | 34% |
| Instruction following | tulu-3-personas | 30,000 | 17% |
| Prose | kalo-opus-instruct | 22,000 | 13% |
| System compliance | SystemChat-2.0 | 20,000 | 11% |
| RP — character | OpenCharacter | 15,000 | 9% |
| Function calling | hermes-function-calling | 8,000 | 5% |
| Creative writing | nopm_claude_writing | 6,350 | 4% |
| RP — fandom | bluemoon-fandom | 5,000 | 3% |
| RP — movie scripts | cinematika | 5,000 | 3% |
| Literary | Gutenberg-SFT | 3,000 | 2% |
| RP — human-written | LimaRP | 800 | <1% |
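The total and the percentage column can be re-derived from the counts in the table. The sketch below only reproduces that arithmetic; the helper name `share_label` is hypothetical, and rounding to the nearest whole percent is an assumption about how the column was produced.

```python
# Counts copied from the table above, keyed by source name.
counts = {
    "Magpie-Pro-300K": 60_000,
    "tulu-3-personas": 30_000,
    "kalo-opus-instruct": 22_000,
    "SystemChat-2.0": 20_000,
    "OpenCharacter": 15_000,
    "hermes-function-calling": 8_000,
    "nopm_claude_writing": 6_350,
    "bluemoon-fandom": 5_000,
    "cinematika": 5_000,
    "Gutenberg-SFT": 3_000,
    "LimaRP": 800,
}

total = sum(counts.values())


def share_label(count: int, total: int) -> str:
    """Format a count as a whole-number share of the total, or '<1%' if tiny."""
    pct = 100 * count / total
    return "<1%" if pct < 1 else f"{round(pct)}%"


for source, count in counts.items():
    print(f"{source:24s} {count:>7,} {share_label(count, total):>4}")
print(f"{'total':24s} {total:>7,}")
```

The counts sum to 175,150, which is where the ~175K figure comes from.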
## How it was built
Nine pipeline stages:
1. **Pull** — Download and normalize all source datasets into a common messages format.
2. **Validate** — Enforce conversation structure: proper turn alternation, no blank messages, no duplicates.
3. **Filter** — Remove refusals via regex patterns, a [Minos-v1](https://huggingface.co/NousResearch/Minos-v1) classifier, and [LlamaGuard](https://huggingface.co/meta-llama/Llama-Guard-4-12B) content safety filtering on select datasets.
4. **Deslop** — Score assistant responses for AI-speak density using the [antislop](https://github.com/sam-paech/antislop-sampler) phrase list. Mid-range slop is rewritten by an LLM. High-density slop is removed.
5. **Enrich** — Give every conversation a system message that matches its content. Nine strategies handle different dataset types: curated personas for general Q&A, extracted personas from instruction-following prompts, genre-matched prompts for creative writing, LLM-generated character cards for roleplay, preserved tool definitions for function calling.
6. **Dedup** — MinHash LSH deduplication within and across datasets, keeping higher-priority sources on conflicts.
7. **Decontaminate** — Remove examples too similar to eval benchmarks (IFEval, MT-Bench, MMLU, OR-Bench, Sorry-Bench) via embedding cosine similarity.
8. **Blend** — Sample each dataset to its target count, add source metadata, shuffle.
9. **Push** — Upload to HuggingFace Hub.
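As an illustration of the Validate stage, the sketch below checks the three rules it names. It is not the pipeline's code: the function name `is_valid_conversation` is hypothetical, strict user/assistant alternation is an assumption (function-calling data with tool turns would need a looser rule), and "no duplicates" is read here as no repeated message within a single conversation.

```python
def is_valid_conversation(messages: list[dict[str, str]]) -> bool:
    """Check turn alternation, blank messages, and repeated messages."""
    # An optional leading system message is set aside before checking turns.
    has_system = bool(messages) and messages[0]["role"] == "system"
    turns = messages[1:] if has_system else messages
    if not turns:
        return False

    # Rule 1: no blank messages anywhere, including the system message.
    if any(not m["content"].strip() for m in messages):
        return False

    # Rule 2 (assumed form): turns alternate user, assistant, user, ...
    expected = "user"
    for m in turns:
        if m["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"

    # Rule 3 (assumed reading): the same message never appears twice.
    seen = set()
    for m in turns:
        key = (m["role"], m["content"].strip())
        if key in seen:
            return False
        seen.add(key)
    return True
```

A filter built on this would simply drop any conversation for which the function returns `False`.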
The pipeline code is at [nchapman/figaro-data](https://github.com/nchapman/figaro-data).
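For readers unfamiliar with the Dedup stage's technique, here is a small standard-library sketch of MinHash with LSH banding that keeps the higher-priority document when two collide. Everything specific in it is an assumption for illustration: the word-trigram shingles, the 64 hash functions split into 16 bands, the 0.8 threshold, and the `priority` field (lower number wins). The actual pipeline may use different parameters or a library.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64               # hypothetical signature length
BANDS = 16                    # hypothetical: 16 bands of 4 rows each
ROWS = NUM_HASHES // BANDS


def shingles(text: str, n: int = 3) -> set[str]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text: str) -> tuple[int, ...]:
    """One minimum per seeded hash function over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(b, digest_size=8, salt=salt).digest(), "little")
            for b in items
        ))
    return tuple(sig)


def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(docs: list[dict], threshold: float = 0.8) -> list[dict]:
    """Drop near-duplicates, visiting higher-priority docs first."""
    buckets = defaultdict(list)   # (band index, band values) -> kept doc indices
    sigs, kept = {}, []
    for i in sorted(range(len(docs)), key=lambda i: docs[i]["priority"]):
        sig = minhash(docs[i]["text"])
        keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        # Only docs sharing at least one band are compared in full.
        candidates = {j for k in keys for j in buckets.get(k, [])}
        if any(similarity(sig, sigs[j]) >= threshold for j in candidates):
            continue
        sigs[i] = sig
        kept.append(i)
        for k in keys:
            buckets[k].append(i)
    return [docs[i] for i in sorted(kept)]
```

Visiting documents in priority order means a lower-priority copy is always the one discarded, which matches the "keeping higher-priority sources on conflicts" behavior described in the Dedup stage.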
## Source datasets and licenses
| Dataset | License |
|---|---|
| [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) | Llama 3.1 Community |
| [allenai/tulu-3-sft-personas-instruction-following](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) | ODC-BY |
| [cognitivecomputations/SystemChat-2.0](https://huggingface.co/datasets/cognitivecomputations/SystemChat-2.0) | Apache 2.0 |
| [anthracite-org/kalo-opus-instruct-22k-no-refusal](https://huggingface.co/datasets/anthracite-org/kalo-opus-instruct-22k-no-refusal) | — |
| [anthracite-org/nopm_claude_writing_fixed](https://huggingface.co/datasets/anthracite-org/nopm_claude_writing_fixed) | — |
| [ConicCat/Gutenberg-SFT](https://huggingface.co/datasets/ConicCat/Gutenberg-SFT) | — |
| [grimulkan/LimaRP-augmented](https://huggingface.co/datasets/grimulkan/LimaRP-augmented) | — |
| [xywang1/OpenCharacter](https://huggingface.co/datasets/xywang1/OpenCharacter) | Apache 2.0 |
| [Squish42/bluemoon-fandom-1-1-rp-cleaned](https://huggingface.co/datasets/Squish42/bluemoon-fandom-1-1-rp-cleaned) | — |
| [jondurbin/cinematika-v0.1](https://huggingface.co/datasets/jondurbin/cinematika-v0.1) | — |
| [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | Apache 2.0 |
This dataset is a blend of the above sources, each under their own license. Use accordingly.
Provider:
nchapman



