Crownelius/Word-Puzzles-ARC-Unique-50000
Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Crownelius/Word-Puzzles-ARC-Unique-50000
Resource overview:
---
license: mit
task_categories:
- text-classification
- question-answering
language:
- en
pretty_name: Word Puzzles ARC Unique 46000
size_categories:
- 10K<n<100K
tags:
- puzzles
- reasoning
- synthetic
- word-games
- logic
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
  - split: validation
    path: validation.jsonl
  - split: test
    path: test.jsonl
---
[<img src="https://huggingface.co/crownelius/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5/resolve/main/banner.png" width="350"/>](https://ko-fi.com/abcuo)
# Word-Puzzles-ARC-Unique-46000
This dataset is a synthetic 46,000-row word-puzzle corpus focused on answerable reasoning tasks with explicit gold answers.
## Version
This upload corresponds to the harder `v2` build.
## Bucket mix
- `15,000` formal deduction
- `12,500` constraint-based lexical deduction
- `10,000` symbolic substitution
- `7,500` semantic association
- `1,000` riddles
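As a quick sanity check on the mix above, the listed counts can be summed and turned into shares. This is a minimal sketch: the dictionary keys are descriptive labels taken from the list, not necessarily the exact values stored in the `bucket` field.

```python
# Bucket counts as listed in this card. The keys are descriptive labels
# (an assumption), not confirmed `bucket` field values.
BUCKET_COUNTS = {
    "formal deduction": 15_000,
    "constraint-based lexical deduction": 12_500,
    "symbolic substitution": 10_000,
    "semantic association": 7_500,
    "riddles": 1_000,
}


def bucket_shares(counts):
    """Return each bucket's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_rows = sum(BUCKET_COUNTS.values())
shares = bucket_shares(BUCKET_COUNTS)

print(f"total rows: {total_rows}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The counts add up to the 46,000 rows stated in the description, and riddles make up only a small slice of the corpus.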
## Hardening changes in v2
- formal puzzles use 6 entities instead of 5
- lexical puzzles use two-stage elimination
- cryptograms are longer and reveal fewer mappings
- semantic tasks lean more on analogies, synonyms, and antonyms
## Usefulness
This dataset is useful when you want a medium-scale corpus of answerable language reasoning tasks with explicit gold targets.
### Good use cases
- Training or fine-tuning models on structured verbal reasoning
- Evaluating multi-step deduction, elimination, and constraint tracking
- Stress-testing symbolic consistency on cryptogram-style substitution tasks
- Measuring whether a model can move between lexical, semantic, and formal puzzle regimes inside one dataset
- Generating curriculum mixtures where the task family is explicit in the `bucket` and `puzzle_type` fields
### Why it is useful
- Every row has a non-empty gold `answer`
- The dataset is bucketed, so you can train or evaluate per reasoning type
- The `v2` build is harder than the first version, especially in formal deduction, lexical elimination, and symbolic substitution
- The rows are synthetic and internally consistent, which makes large-scale filtering and sampling easier
- Because this corpus was generated from scratch rather than copied from standard public benchmark sets, the risk of benchmark contamination from prior memorization is substantially lower
### Especially strong buckets
- `formal_deduction`: good for explicit consistency and ordering reasoning
- `constraint_based_lexical_deduction`: good for hypothesis pruning under partial evidence
- `symbolic_substitution`: good for maintaining and updating a structured mapping hypothesis
## Limitations
- This is not a factual knowledge benchmark
- Some semantic and riddle items are still easier or noisier than the strongest formal/symbolic buckets
- The data is synthetic, so it is better for reasoning supervision than for measuring real-world knowledge coverage
## Recommended Splits And Evaluation
### Recommended split strategy
- `80/10/10` train/validation/test is a reasonable default for fine-tuning
- Keep the bucket ratio approximately constant across splits
- If you want a harder evaluation, build bucket-wise answer-disjoint test sets where the exact `answer` string does not appear in training for that bucket
- For lexical tasks, a stricter setting is to hold out both answer strings and nearby prompt templates when possible
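One way to keep the bucket ratio roughly constant across splits is to shuffle and cut each bucket separately. The sketch below assumes rows are plain dicts with a `bucket` key. The function name `stratified_split` and the fixed seed are illustrative choices, not part of the dataset's own build script.

```python
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split rows into three lists, cutting each bucket by the same ratios.

    `rows` is a list of dicts that each carry a `bucket` key. Remainders from
    integer rounding fall into the last (test) split.
    """
    by_bucket = defaultdict(list)
    for row in rows:
        by_bucket[row["bucket"]].append(row)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for bucket in sorted(by_bucket):  # sorted for reproducible ordering
        items = by_bucket[bucket]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_validation = int(len(items) * ratios[1])
        train.extend(items[:n_train])
        validation.extend(items[n_train:n_train + n_validation])
        test.extend(items[n_train + n_validation:])
    return train, validation, test
```

Cutting per bucket rather than over the whole corpus avoids a random split that under-samples the small riddle bucket in validation or test.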
### Suggested evaluation views
- Overall exact-match accuracy across the whole dataset
- Exact-match accuracy by `bucket`
- Exact-match accuracy by `puzzle_type`
- Calibration by `quality_tier`
- Error slices on the strongest reasoning buckets:
  - `formal_deduction`
  - `constraint_based_lexical_deduction`
  - `symbolic_substitution`
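The per-field views above can be computed with one grouping function. This is a sketch under stated assumptions: predictions are supplied as a dict keyed by row `id`, and answers are compared after trimming, lowercasing, and collapsing whitespace. That normalization is an assumption of this example, and a stricter comparison may be more appropriate for some puzzle types.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return " ".join(text.strip().lower().split())


def exact_match_by(rows, predictions, field):
    """Return exact-match accuracy grouped by `field` (e.g. `bucket`).

    `predictions` maps row `id` to the predicted answer string. A missing
    prediction counts as wrong.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        key = row[field]
        totals[key] += 1
        predicted = predictions.get(row["id"], "")
        if normalize(predicted) == normalize(row["answer"]):
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```

Calling it with `field="bucket"`, `field="puzzle_type"`, or `field="quality_tier"` gives the corresponding slice without changing the scoring logic.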
### Good benchmark settings
- `In-distribution`: random split with preserved bucket ratios
- `Answer-holdout`: test answers are unseen within the same bucket
- `Template-stress`: evaluate on held-out puzzle types or prompt styles within a bucket
- `Mixed-reasoning`: evaluate on the full distribution to test switching between reasoning modes
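The `Answer-holdout` setting can be sketched by holding out a fraction of the distinct answers in each bucket and sending every row with a held-out answer to the test side. The function name, the default `test_fraction`, and the seed are hypothetical. Answers are compared as raw strings, so near-duplicates that differ only in casing would need normalizing first.

```python
import random
from collections import defaultdict


def answer_holdout_split(rows, test_fraction=0.1, seed=0):
    """Split rows so that no (bucket, answer) pair appears on both sides."""
    answers_by_bucket = defaultdict(set)
    for row in rows:
        answers_by_bucket[row["bucket"]].add(row["answer"])

    rng = random.Random(seed)
    held_out = set()
    for bucket in sorted(answers_by_bucket):
        answers = sorted(answers_by_bucket[bucket])  # sort before shuffling for reproducibility
        rng.shuffle(answers)
        n_test = max(1, round(len(answers) * test_fraction))
        held_out.update((bucket, answer) for answer in answers[:n_test])

    train = [r for r in rows if (r["bucket"], r["answer"]) not in held_out]
    test = [r for r in rows if (r["bucket"], r["answer"]) in held_out]
    return train, test
```

Because the holdout is drawn over answers rather than rows, the test share in rows can drift from `test_fraction` when some answers are much more frequent than others.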
### Leakage cautions
- The dataset is synthetic, so template overlap is possible even when exact prompts differ
- Some semantic and riddle rows reuse small source banks, so they should not carry the full weight of the benchmark
- If you want the cleanest benchmark, report both:
  - full-corpus score
  - score on the higher-signal subset of formal, lexical, and symbolic buckets
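Reporting the two scores side by side could look like the sketch below. It assumes the higher-signal subset is exactly the three bucket identifiers named in this card, and that per-row correctness has already been computed into a dict keyed by row `id`. The names `accuracy` and `dual_report` are illustrative.

```python
# Bucket identifiers as they appear elsewhere in this card.
HIGH_SIGNAL_BUCKETS = {
    "formal_deduction",
    "constraint_based_lexical_deduction",
    "symbolic_substitution",
}


def accuracy(rows, is_correct):
    """Fraction of rows marked correct, or None when there are no rows."""
    if not rows:
        return None
    return sum(1 for row in rows if is_correct[row["id"]]) / len(rows)


def dual_report(rows, is_correct):
    """Return the full-corpus score and the higher-signal subset score."""
    subset = [row for row in rows if row["bucket"] in HIGH_SIGNAL_BUCKETS]
    return {
        "full_corpus": accuracy(rows, is_correct),
        "high_signal": accuracy(subset, is_correct),
    }
```

A large gap between the two numbers suggests the semantic and riddle rows are moving the headline score more than their share of the corpus warrants.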
## Files
- `train.jsonl`
- `validation.jsonl`
- `test.jsonl`
- `word_reasoning_puzzles.jsonl`
- `word_reasoning_puzzles.csv`
- `build_summary.json`
- `hf_split_summary.json`
## Row schema
Each row includes:
- `id`
- `bucket`
- `puzzle_type`
- `prompt`
- `answer`
- `rationale`
- `quality_tier`
- `metadata`
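A small reader can confirm that rows carry the fields listed above and that `answer` is non-empty. This is a sketch using only the standard library. The card does not state field types, so the check only tests for presence and treats `answer` as something that can be converted to a string.

```python
import json

REQUIRED_FIELDS = (
    "id",
    "bucket",
    "puzzle_type",
    "prompt",
    "answer",
    "rationale",
    "quality_tier",
    "metadata",
)


def validate_row(row):
    """Return a list of problems found in one row (empty when the row is fine)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    if not str(row.get("answer", "")).strip():
        problems.append("empty answer")
    return problems


def read_jsonl(path):
    """Yield one parsed JSON object per non-blank line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For example, `[p for row in read_jsonl("train.jsonl") for p in validate_row(row)]` would collect every problem in the training file, assuming the file sits in the working directory.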
## Notes
- All rows have non-empty gold answers.
- The dataset is synthetic and intended for reasoning/data-generation use, not as a factual knowledge benchmark.
Provider:
Crownelius



