Blackbean109/caveman-world-knowledge-150k
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Blackbean109/caveman-world-knowledge-150k
Resource description:
---
language:
- en
license: mit
task_categories:
- text-generation
- question-answering
pretty_name: Caveman World Knowledge 150K
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: validation
    path: validation.parquet
---
# Caveman World Knowledge 150K
## Dataset description
A caveman-style instruction dataset that blends two behaviors:
- known world knowledge responses (Wikipedia-like content rewritten in caveman voice)
- unknown-question reactions with mood labels: `angry`, `argue`, `attack`

This dataset is intended for instruction tuning and style conditioning.
## Dataset structure
Each row is a JSON object with the following fields:
- `id`: unique row id
- `source`: `wikipedia`, `fallback`, or `synthetic`
- `topic`: world topic or `unknown`
- `instruction`: user question/prompt
- `response`: caveman-style answer
- `mood`: `neutral`, `angry`, `argue`, or `attack`
- `knowledge_status`: `known` or `unknown`
- `style`: `caveman`
- `language`: `en`
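
The field list above can be turned into a small row validator. The following is a minimal sketch, not part of the dataset's tooling: the allowed values are copied from the list above, while the function name `validate_row` and the example row are hypothetical and written only for illustration.

```python
# Allowed values copied from the field list in this card.
ALLOWED = {
    "source": {"wikipedia", "fallback", "synthetic"},
    "mood": {"neutral", "angry", "argue", "attack"},
    "knowledge_status": {"known", "unknown"},
    "style": {"caveman"},
    "language": {"en"},
}
REQUIRED = (
    "id", "source", "topic", "instruction", "response",
    "mood", "knowledge_status", "style", "language",
)


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    for field in REQUIRED:
        if field not in row:
            problems.append(f"missing field: {field}")
    for field, allowed in ALLOWED.items():
        if field in row and row[field] not in allowed:
            problems.append(f"unexpected {field}: {row[field]!r}")
    return problems


# Made-up row for illustration; it is NOT taken from the dataset.
example_row = {
    "id": "example-0",
    "source": "synthetic",
    "topic": "unknown",
    "instruction": "What is the airspeed of a zorblat?",
    "response": "Me not know zorblat. Why you ask?",
    "mood": "argue",
    "knowledge_status": "unknown",
    "style": "caveman",
    "language": "en",
}
problems = validate_row(example_row)
```

Assuming the Hugging Face `datasets` library is installed, real rows could be obtained with `load_dataset("Blackbean109/caveman-world-knowledge-150k", split="train")` and passed to `validate_row` one at a time.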
## Intended use
- style transfer experiments
- robust unknown-question behavior tuning
- synthetic instruction tuning with persona control
## Limitations
- automatically generated paraphrases can contain factual simplifications
- persona language is intentionally ungrammatical
- responses to unknown questions use an aggressive tone and should be reviewed for suitability before deployment
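
One way to act on the last limitation is to drop rows with selected moods before training. This is a sketch under assumptions: the helper `drop_moods` is hypothetical, and blocking only `attack` by default is an arbitrary choice for illustration, not a recommendation from the dataset author.

```python
def drop_moods(rows, blocked=frozenset({"attack"})):
    """Return the rows whose `mood` value is not in `blocked`."""
    return [row for row in rows if row.get("mood") not in blocked]


# Tiny made-up rows; only the `mood` field matters here.
rows = [
    {"id": "a", "mood": "neutral"},
    {"id": "b", "mood": "attack"},
    {"id": "c", "mood": "argue"},
]
kept = drop_moods(rows)
```

Passing a larger set such as `{"angry", "argue", "attack"}` would keep only the `neutral` rows.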
## Generation process
- topic list from `topics_world.txt`
- unknown prompt list from `unknown_questions.txt`
- summaries fetched from Wikipedia when available
- fallback local facts for offline generation
- text rewritten to caveman style via rule-based transforms
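
To give a feel for what a rule-based rewrite can look like, here is a toy sketch. The rules below (dropping articles and forms of "to be", replacing "I" with "me") are assumptions made up for this example; the actual transforms used to build the dataset are not documented in this card.

```python
# Hypothetical rules for illustration only; not the dataset's real transforms.
DROP_WORDS = {"a", "an", "the", "is", "are", "am", "was", "were"}
REPLACEMENTS = {"i": "me"}
PUNCT = ".,!?;:"


def to_caveman(text: str) -> str:
    """Apply the toy rules word by word, keeping trailing punctuation."""
    out = []
    for token in text.split():
        core = token.rstrip(PUNCT)
        trailing = token[len(core):]
        key = core.lower()
        if key in DROP_WORDS:
            # Keep sentence punctuation by attaching it to the previous word.
            if trailing and out:
                out[-1] += trailing
            continue
        out.append(REPLACEMENTS.get(key, core) + trailing)
    return " ".join(out)


result = to_caveman("The Nile is a river in Africa.")
```

A real pipeline would likely need more rules and care around capitalization and quoting; this sketch only shows the general shape of a word-level transform.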
## Recommended checks
- sample and manually audit factual quality
- apply profanity and safety filtering if used in public products
- domain balancing checks (science/history/geography/etc.)
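
The first and last checks can be started with a few lines of standard-library Python. This is a minimal sketch with hypothetical helper names (`sample_for_audit`, `field_share`). Note that the dataset has a `topic` field but no domain field, so grouping topics into domains such as science or history would need a mapping that is not part of the dataset.

```python
import random
from collections import Counter


def sample_for_audit(rows, k, seed=0):
    """Draw up to k rows reproducibly for manual review."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def field_share(rows, field):
    """Return the fraction of rows for each value of `field`."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


# Made-up rows for illustration; only `topic` and `source` are used.
rows = [
    {"topic": "rivers", "source": "wikipedia"},
    {"topic": "rivers", "source": "fallback"},
    {"topic": "unknown", "source": "synthetic"},
    {"topic": "volcanoes", "source": "wikipedia"},
]
audit_batch = sample_for_audit(rows, k=2)
topic_shares = field_share(rows, "topic")
```

The same `field_share` call works for `source`, `mood`, or `knowledge_status` when checking how the two blended behaviors are balanced.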
Provider:
Blackbean109



