
Crownelius/Word-Puzzles-ARC-Unique-50000

Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Crownelius/Word-Puzzles-ARC-Unique-50000
Resource description:
---
license: mit
task_categories:
- text-classification
- question-answering
language:
- en
pretty_name: Word Puzzles ARC Unique 46000
size_categories:
- 10K<n<100K
tags:
- puzzles
- reasoning
- synthetic
- word-games
- logic
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
  - split: validation
    path: validation.jsonl
  - split: test
    path: test.jsonl
---

[<img src="https://huggingface.co/crownelius/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5/resolve/main/banner.png" width="350"/>](https://ko-fi.com/abcuo)

# Word-Puzzles-ARC-Unique-46000

This dataset is a synthetic 46,000-row word-puzzle corpus focused on answerable reasoning tasks with explicit gold answers.

## Version

This upload corresponds to the harder `v2` build.

## Bucket mix

- `15,000` formal deduction
- `12,500` constraint-based lexical deduction
- `10,000` symbolic substitution
- `7,500` semantic association
- `1,000` riddles

## Hardening changes in v2

- formal puzzles use 6 entities instead of 5
- lexical puzzles use two-stage elimination
- cryptograms are longer and reveal fewer mappings
- semantic tasks lean more on analogies, synonyms, and antonyms

## Usefulness

This dataset is useful when you want a medium-scale corpus of answerable language reasoning tasks with explicit gold targets.
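
The counts in the Bucket mix section can be cross-checked against the stated 46,000-row total with a few lines of Python. This is only a sanity-check sketch: the dictionary keys are the human-readable bucket labels from the list, the helper name `bucket_shares` is hypothetical, and nothing here reads the actual data files.

```python
# Bucket counts copied from the "Bucket mix" section of the dataset card.
BUCKET_MIX = {
    "formal deduction": 15_000,
    "constraint-based lexical deduction": 12_500,
    "symbolic substitution": 10_000,
    "semantic association": 7_500,
    "riddles": 1_000,
}


def bucket_shares(mix):
    """Return each bucket's fraction of the total row count."""
    total = sum(mix.values())
    return {name: count / total for name, count in mix.items()}


total_rows = sum(BUCKET_MIX.values())
shares = bucket_shares(BUCKET_MIX)

# Print one line per bucket, then the overall total.
for name, share in shares.items():
    print(f"{name}: {BUCKET_MIX[name]:,} rows ({share:.1%})")
print(f"total: {total_rows:,}")
```

The five counts sum to 46,000, which matches the row count given in the card text and in `pretty_name`.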

### Good use cases

- Training or fine-tuning models on structured verbal reasoning
- Evaluating multi-step deduction, elimination, and constraint tracking
- Stress-testing symbolic consistency on cryptogram-style substitution tasks
- Measuring whether a model can move between lexical, semantic, and formal puzzle regimes inside one dataset
- Generating curriculum mixtures where the task family is explicit in the `bucket` and `puzzle_type` fields

### Why it is useful

- Every row has a non-empty gold `answer`
- The dataset is bucketed, so you can train or evaluate per reasoning type
- The `v2` build is harder than the first version, especially in formal deduction, lexical elimination, and symbolic substitution
- The rows are synthetic and internally consistent, which makes large-scale filtering and sampling easier
- Because this corpus was uniquely generated rather than copied from standard public benchmark sets, the risk of benchmark contamination from prior memorization is substantially lower

### Especially strong buckets

- `formal_deduction`: good for explicit consistency and ordering reasoning
- `constraint_based_lexical_deduction`: good for hypothesis pruning under partial evidence
- `symbolic_substitution`: good for maintaining and updating a structured mapping hypothesis

## Limitations

- This is not a factual knowledge benchmark
- Some semantic and riddle items are still easier or noisier than the strongest formal/symbolic buckets
- The data is synthetic, so it is better for reasoning supervision than for measuring real-world knowledge coverage

## Recommended Splits And Evaluation

### Recommended split strategy

- `80/10/10` train/dev/test is a reasonable default for fine-tuning
- Keep the bucket ratio approximately constant across splits
- If you want a harder evaluation, build bucket-wise answer-disjoint test sets where the exact `answer` string does not appear in training for that bucket
- For lexical tasks, a stricter setting is to hold out both answer strings and nearby prompt templates when possible

### Suggested evaluation views

- Overall exact-match accuracy across the whole dataset
- Exact-match accuracy by `bucket`
- Exact-match accuracy by `puzzle_type`
- Calibration by `quality_tier`
- Error slices on the strongest reasoning buckets:
  - `formal_deduction`
  - `constraint_based_lexical_deduction`
  - `symbolic_substitution`

### Good benchmark settings

- `In-distribution`: random split with preserved bucket ratios
- `Answer-holdout`: test answers are unseen within the same bucket
- `Template-stress`: evaluate on held-out puzzle types or prompt styles within a bucket
- `Mixed-reasoning`: evaluate on the full distribution to test switching between reasoning modes

### Leakage cautions

- The dataset is synthetic, so template overlap is possible even when exact prompts differ
- Some semantic and riddle rows reuse small source banks, so they should not carry the full weight of the benchmark
- If you want the cleanest benchmark, report both:
  - full-corpus score
  - score on the higher-signal subset of formal, lexical, and symbolic buckets

## Files

- `train.jsonl`
- `validation.jsonl`
- `test.jsonl`
- `word_reasoning_puzzles.jsonl`
- `word_reasoning_puzzles.csv`
- `build_summary.json`
- `hf_split_summary.json`

## Row schema

Each row includes:

- `id`
- `bucket`
- `puzzle_type`
- `prompt`
- `answer`
- `rationale`
- `quality_tier`
- `metadata`

## Notes

- All rows have non-empty gold answers.
- The dataset is synthetic and intended for reasoning/data-generation use, not as a factual knowledge benchmark.
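
The split and evaluation recommendations in this card (an `80/10/10` split with preserved bucket ratios, an answer-holdout test set, and exact-match accuracy by `bucket`) can be sketched in plain Python as follows. This is a minimal illustration, not the author's build script: the function names `stratified_split`, `answer_disjoint`, and `exact_match_by_bucket` are hypothetical, the toy rows are invented placeholders that only borrow the `id`, `bucket`, and `answer` field names from the row schema, and whitespace-stripped string equality is an assumed definition of exact match.

```python
import random
from collections import defaultdict


def stratified_split(rows, percents=(80, 10, 10), seed=0):
    """Split rows into train/validation/test, bucket by bucket, so ratios stay roughly constant."""
    by_bucket = defaultdict(list)
    for row in rows:
        by_bucket[row["bucket"]].append(row)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for bucket in sorted(by_bucket):
        items = by_bucket[bucket][:]
        rng.shuffle(items)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(items) * percents[0] // 100
        n_val = len(items) * percents[1] // 100
        train += items[:n_train]
        validation += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, validation, test


def answer_disjoint(train, test):
    """Keep test rows whose exact answer string is unseen in training for the same bucket."""
    seen = {(r["bucket"], r["answer"]) for r in train}
    return [r for r in test if (r["bucket"], r["answer"]) not in seen]


def exact_match_by_bucket(rows, predictions):
    """Per-bucket exact-match accuracy; `predictions` maps row id to a predicted answer string."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["bucket"]] += 1
        if predictions.get(row["id"], "").strip() == row["answer"].strip():
            hits[row["bucket"]] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy rows (invented for illustration; not taken from the dataset).
rows = [
    {"id": f"{bucket}-{i}", "bucket": bucket, "answer": f"ans{i % 5}"}
    for bucket in ("formal_deduction", "symbolic_substitution")
    for i in range(20)
]

train, validation, test_rows = stratified_split(rows)
holdout = answer_disjoint(train, test_rows)

# Pretend a model is always right on one bucket and always wrong on the other.
predictions = {
    r["id"]: r["answer"] if r["bucket"] == "formal_deduction" else "wrong"
    for r in test_rows
}
scores = exact_match_by_bucket(test_rows, predictions)
print(len(train), len(validation), len(test_rows), len(holdout))
print(scores)
```

To run this against the real files, the toy `rows` list would be replaced by rows parsed from `word_reasoning_puzzles.jsonl`, and the same per-group accuracy pattern could be keyed on `puzzle_type` or `quality_tier` instead of `bucket`.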
Provider:
Crownelius