hunt3rx99/meowllm-miso

Name: hunt3rx99/meowllm-miso
Creator: hunt3rx99
Published: 2026-04-06 21:11:38
License: 暂无描述

Hugging Face2026-04-06 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/hunt3rx99/meowllm-miso

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: mit task_categories: - text-generation language: - en tags: - tiny - character - synthetic - cat - educational pretty_name: MeowLLM Miso Dataset size_categories: - 10K<n<100K --- # MeowLLM / Miso Dataset Training data for MeowLLM — a ~3.5M parameter character language model that speaks as a house cat named Miso. ## Dataset Summary 20,000 single-turn (input, output) pairs across 15 topical categories plus a hard-negative "assistant trap" deflection category. All outputs are in a single consistent character voice (lowercase, short, cat-themed, no assistant phrases). The dataset is produced by a **slot-based compositional template generator** with strict per-category filtering and de-duplication. No human annotation — everything is deterministic from hand-written slot banks plus a random seed. ## Supported Tasks - **Text generation**: training a tiny character model that produces responses in a consistent voice. - **Character fidelity evaluation**: the held-out prompt suite in `meow/eval_cases.py` is explicitly excluded from the training split. ## Languages English (lowercase only). ## Dataset Structure ### Data Fields Each sample is a JSON object with: | field | type | description | |----------|------|--------------------------------------------------| | input | str | Human prompt (things a person might say to a cat)| | output | str | Miso's in-character response | | category | str | One of 15 topical category names | | source | str | "template" or "llm" (augmented) | ### Data Splits | split | samples | |-------|---------| | train | 19,000 | | val | 1,000 | ### Categories `greeting`, `hunger`, `naps`, `boxes`, `windows`, `birds`, `humans`, `dogs`, `vacuum`, `rain`, `affection`, `territory`, `nonsense_questions`, `being_picked_up`, `jealousy` Category balance in the train split is between 5.4% and 7.7% per category (fairly uniform via round-robin generation). ## Data Creation ### Source Data All data is synthetically generated from hand-written slot banks in [`meow/generate_data.py`](https://github.com/phanii9/MeowLLM/blob/main/meow/generate_data.py). Each category has: - `inputs`: a bank of prompt phrasings - `cores`: the main response clause bank - `openers`: optional leading phrases - `sensories`: optional sensory detail extensions - `redirects`: optional secondary clauses During generation, each sample is assembled by picking one core and optionally attaching other slots according to per-category probabilities. The result is then validated against [`meow/rules.py`](https://github.com/phanii9/MeowLLM/blob/main/meow/rules.py) before being written. ### Filtering Every generated sample must pass: 1. **Strict lowercase** — no capital letters anywhere 2. **Length**: 1–3 sentences, ≤35 words, ≥1 word 3. **No banned phrases** — whole-phrase matching against a list of ~40 assistant-speak patterns ("as an ai", "i can help you", "certainly", "of course", etc.) 4. **Per-category keyword requirement** — e.g., `hunger` outputs must mention food-related words; `vacuum` outputs must mention hiding or the vacuum itself 5. **Cat-framing fallback** — long outputs without a category-specific rule must contain general cat vocabulary 6. **Deduplication** — exact (input, output) duplicates are rejected 7. **Eval leakage check** — any input matching a held-out eval prompt is rejected ### Known Limitations - The dataset is deliberately narrow. Miso knows about 15 topics and nothing else. - Vocabulary is small (~1700 BPE tokens) and lowercase-only. - Responses are short by design (1–3 sentences). - Outputs are in English and reflect a specific cultural framing of "house cat" (American/European domestic cat). ## Personal and Sensitive Information None. All samples are synthetic and describe a fictional cat's views on food, naps, and boxes. ## Considerations for Using the Data ### Social Impact This is a tiny educational dataset for a character model. It has no realistic misuse surface — the model trained on it is too small for open-ended generation and too in-character to be useful as a generic assistant. ### Bias Miso is a slightly smug indoor cat. That bias is intentional and documented in [`persona.md`](https://github.com/phanii9/MeowLLM/blob/main/persona.md). ## Additional Information ### Dataset Curators phanii9 ### Licensing MIT License. ### Citation ```bibtex @software{meowllm2026, author = {phanii9}, title = {MeowLLM: a tiny character language model that talks like a house cat}, year = {2026}, url = {https://github.com/phanii9/MeowLLM} } ```

提供机构：

hunt3rx99

5,000+

优质数据集

54 个

任务类型

进入经典数据集