beta3/GridCorpus_9M_Sudoku_Puzzles_Enriched
Source: Hugging Face. Updated 2026-03-07; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/beta3/GridCorpus_9M_Sudoku_Puzzles_Enriched
Resource description:
---
configs:
- config_name: gridcorpus
  data_files:
  - split: data
    path: gridcorpus.csv
- config_name: pipeline_log
  data_files:
  - split: logs
    path: pipeline_log.csv
license: cc-by-4.0
language:
- en
tags:
- sudoku
- puzzle
- constraint-satisfaction
- backtracking
- solver
- difficulty
- benchmarking
- feature-extraction
- llm-evaluation
- reasoning
task_categories:
- feature-extraction
- text-generation
- other
size_categories:
- 1M<n<10M
---
[Kaggle](https://www.kaggle.com/datasets/beta3logic/gridcorpus-9m-sudoku-puzzles-enriched/data?select=gridcorpus.csv) [Hugging Face](https://huggingface.co/datasets/beta3/GridCorpus_9M_Sudoku_Puzzles_Enriched) [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
```
╔══════════════════════════════════════════════════════════════════════╗
║ ║
║ G R I D C O R P U S ║
║ ║
║ "004300209005009001070060043..." ║
║ │ ║
║ ▼ ║
║ ┌───────┬───────┬───────┐ missing_cells → 36 ║
║ │ · · 4 │ 3 · · │ 2 · 9 │ given_cells → 45 ║
║ │ · · 5 │ · · 9 │ · · 1 │ given_ratio → 0.556 ║
║ │ · 7 · │ · 6 · │ · 4 3 │ row_given_counts → [3,2,3,...] ║
║ ├───────┼───────┼───────┤ col_given_counts → [4,3,5,...] ║
║ │ · · 6 │ · · 2 │ · 8 7 │ naked_singles → 8 ║
║ │ 1 9 · │ · · 7 │ 4 · · │ hidden_singles → 15 ║
║ │ · 5 · │ · 8 3 │ 1 6 · │ initial_res_rate → 0.489 ║
║ ├───────┼───────┼───────┤ requires_backtrack → False ║
║ │ · · · │ · · · │ 6 · 9 │ backtrack_depth → 0 ║
║ │ · 7 3 │ · · · │ · 5 · │ propagation_steps → 45 ║
║ │ 8 · · │ 2 · · │ 1 · 3 │ difficulty_tier → medium ║
║ └───────┴───────┴───────┘ ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
```
---
## What is this
GridCorpus is an enriched version of the original [9 Million Sudoku Puzzles and Solutions](https://www.kaggle.com/datasets/rohanrao/sudoku) dataset by Rohan Rao. The raw dataset gives you 9 million puzzle-solution pairs as 81-character strings. That's useful, but it tells you nothing about how hard each puzzle actually is or what its structure looks like numerically.
This dataset takes those 9 million puzzles, runs each one through a purpose-built instrumented solver, and extracts a set of features that describe both the structure of the puzzle and the computational effort required to solve it. The result is a dataset where you can filter by difficulty, analyze solving complexity, or feed structured grids directly into models without any preprocessing.
It is the backing data for the [NineGrid LLM benchmark](https://www.kaggle.com/benchmarks/beta3logic/ninegrid-sudoku-reasoning-benchmark-for-llms), which evaluates Large Language Models on Sudoku constraint satisfaction.
---
## Source
**Original dataset:** [9 Million Sudoku Puzzles and Solutions](https://www.kaggle.com/datasets/rohanrao/sudoku) — Rohan Rao, Kaggle.
No puzzles were filtered, reordered, or modified. All added columns are derived computationally from the original `puzzle` and `solution` strings.
---
### Grid representation
| Column | Type | Description |
|-----------------|--------|--------------------------------------------------|
| `puzzle_grid` | string | JSON-encoded `list[list[int]]`, 9×9. Zeros for missing cells |
| `solution_grid` | string | JSON-encoded `list[list[int]]`, 9×9. No zeros |
These are the canonical grid format used by the benchmark. To use them in Python:
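```python
import json
import pandas as pd
# Read a small sample; the full gridcorpus.csv has 9 million rows.
df = pd.read_csv("gridcorpus.csv", nrows=1000)
puzzle_grid = json.loads(df["puzzle_grid"].iloc[0])  # list[list[int]]
solution_grid = json.loads(df["solution_grid"].iloc[0])
```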
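If you start from the raw 81-character `puzzle` string instead of the JSON column, the conversion can be sketched as below. It assumes row-major order (the first nine characters form the first row), which is what the banner at the top of this card shows. The helper name `string_to_grid` and the sample string are illustrative and not part of the dataset tooling.

```python
import json

def string_to_grid(s):
    """Convert an 81-character digit string into a 9x9 list of ints (row-major)."""
    if len(s) != 81 or not s.isdigit():
        raise ValueError("expected exactly 81 digit characters")
    return [[int(ch) for ch in s[r * 9:(r + 1) * 9]] for r in range(9)]

# Illustrative puzzle string (not taken from the dataset); 0 marks a missing cell.
sample = (
    "530070000" "600195000" "098000060"
    "800060003" "400803001" "700020006"
    "060000280" "000419005" "000080079"
)

grid = string_to_grid(sample)
encoded = json.dumps(grid)  # same JSON shape as described for puzzle_grid
print(grid[0])              # [5, 3, 0, 0, 7, 0, 0, 0, 0]
```

Round-tripping through `json.dumps` and `json.loads` returns the same nested list, so the string form and the grid form can be used interchangeably.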
---
### Structural features
These are computed directly from the puzzle string — no solver required.
| Column | Type | Description |
|--------------------|-------|----------------------------------------------------------|
| `missing_cells` | int | Count of `0` characters. Range: [17, 64] |
| `given_cells` | int | 81 − missing_cells |
| `given_ratio` | float | given_cells / 81 |
| `row_given_counts` | string | JSON list of 9 ints — given cells per row |
| `col_given_counts` | string | JSON list of 9 ints — given cells per column |
**How they're calculated:**
`missing_cells` is a straight count of zeros in the 81-char string. `given_ratio` divides the given-cell count by 81, mapping it to [0, 1] so that clue density can be compared directly across puzzles. `row_given_counts` and `col_given_counts` split the 81 cells into rows and columns respectively and count the non-zero values in each; each list sums to `given_cells`.
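The calculation above can be sketched in a few lines of plain Python. This is a minimal reconstruction from the column descriptions, not the pipeline's actual code; the function name `structural_features` and the sample puzzle are assumptions made for illustration.

```python
def structural_features(puzzle):
    """Derive the structural columns from an 81-character puzzle string."""
    rows = [puzzle[r * 9:(r + 1) * 9] for r in range(9)]
    missing = puzzle.count("0")
    given = 81 - missing
    return {
        "missing_cells": missing,
        "given_cells": given,
        "given_ratio": given / 81,
        # Non-zero characters per row and per column.
        "row_given_counts": [9 - row.count("0") for row in rows],
        "col_given_counts": [sum(row[c] != "0" for row in rows) for c in range(9)],
    }

# Illustrative puzzle string (not taken from the dataset).
sample = (
    "530070000" "600195000" "098000060"
    "800060003" "400803001" "700020006"
    "060000280" "000419005" "000080079"
)

features = structural_features(sample)
print(features["missing_cells"], features["given_cells"])  # 51 30
print(features["row_given_counts"])  # [3, 4, 3, 3, 4, 3, 3, 4, 3]
```

In the dataset the two count lists are stored as JSON strings, so apply `json.loads` to a stored value before comparing it with the output of a function like this.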
---
### Solving complexity features
These require actually running a solver on the puzzle. The solver is a backtracking implementation with a **Minimum Remaining Values (MRV)** heuristic: at each branching point, it picks the empty cell with the fewest valid candidates, which tends to keep the search tree small. The solver is fully deterministic: the same puzzle always produces identical metrics.
| Column | Type | Description |
|--------------------------------|-------|----------------------------------------------------------------|
| `naked_singles_count` | int | Empty cells with exactly one valid candidate at the initial puzzle state |
| `hidden_singles_count` | int | Empty cells that are the only valid position for some digit in their row, column, or box — excluding naked singles |
| `initial_resolution_rate` | float | (naked + hidden) / missing_cells. Always in [0, 1] |
| `requires_backtrack` | bool | Whether the solver needed any branching beyond propagation |
| `backtrack_depth` | int | Maximum recursion depth reached. 0 = solved by propagation only |
| `constraint_propagation_steps` | int | Cell assignments made during propagation before any branching |
**How they're calculated:**
A **naked single** is a cell where, after eliminating all digits already present in its row, column, and box, only one digit remains. It can be assigned directly without guessing.
A **hidden single** is subtler: the cell may have multiple candidates, but one of those digits has no other valid position within the row, column, or box — so it must go there. These are detected by scanning all 27 units (9 rows + 9 columns + 9 boxes) and finding digits with only one candidate position. Naked singles are excluded from this count to avoid double-counting the same cell.
`initial_resolution_rate` measures what fraction of the missing cells can be filled immediately from the starting state using only these two deterministic techniques, before any guessing. A value of 1.0 means every missing cell is a naked or hidden single at the initial state. Values close to 0 mean almost nothing is determined up front; such puzzles may still yield to repeated propagation, and `backtrack_depth` records whether branching was actually needed.
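The two definitions and the resulting rate can be sketched as follows. This is an illustrative reconstruction from the descriptions above, not the instrumented solver itself; the function names and the example grid are assumptions.

```python
from itertools import product

def candidates(grid, r, c):
    """Digits not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used

def all_units():
    """The 27 units: 9 rows, 9 columns, 9 boxes, each a list of (r, c) cells."""
    rows = [[(r, c) for c in range(9)] for r in range(9)]
    cols = [[(r, c) for r in range(9)] for c in range(9)]
    boxes = [[(br + i, bc + j) for i in range(3) for j in range(3)]
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return rows + cols + boxes

def count_singles(grid):
    """Return (naked, hidden) counts at the initial state; hidden excludes naked."""
    cand = {(r, c): candidates(grid, r, c)
            for r, c in product(range(9), repeat=2) if grid[r][c] == 0}
    naked = {cell for cell, digits in cand.items() if len(digits) == 1}
    hidden = set()
    for unit in all_units():
        for d in range(1, 10):
            spots = [cell for cell in unit if d in cand.get(cell, ())]
            if len(spots) == 1:
                hidden.add(spots[0])
    return len(naked), len(hidden - naked)

def initial_resolution_rate(grid):
    missing = sum(v == 0 for row in grid for v in row)
    naked, hidden = count_singles(grid)
    # Returning 0.0 for a fully filled grid is a convention chosen for this sketch.
    return (naked + hidden) / missing if missing else 0.0

# Hypothetical grid: four 1s leave (0, 0) as the only place for a 1 in box 0,
# although that cell still has all nine digits as candidates.
example = [[0] * 9 for _ in range(9)]
for r, c in [(1, 3), (2, 6), (3, 1), (6, 2)]:
    example[r][c] = 1

print(count_singles(example))  # (0, 1)
```

Collecting hidden singles as a set of cells means a cell that is the unique position in several units at once (here its row, column, and box) is still counted only once.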
The solver then runs to completion, counting how deep the backtracking search goes. `backtrack_depth = 0` means no branching was needed at all. Each unit of depth represents one guess that couldn't be resolved by propagation alone.
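To make the depth metric concrete, here is a simplified sketch of an MRV backtracking solver that records the same kind of metrics. It is not the pipeline's solver: propagation here applies naked singles only, and the function names are assumptions, so its numbers are not guaranteed to match the dataset columns.

```python
def cell_candidates(grid, r, c):
    """Digits that do not clash with the row, column, or box of cell (r, c)."""
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def propagate(grid):
    """Fill naked singles in place until none remain.

    Returns the number of cells assigned, or None on a dead end
    (an empty cell with no candidates left).
    """
    steps, changed = 0, True
    while changed:
        changed = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] != 0:
                    continue
                cands = cell_candidates(grid, r, c)
                if not cands:
                    return None
                if len(cands) == 1:
                    grid[r][c] = cands[0]
                    steps += 1
                    changed = True
    return steps

def _search(grid, depth, stats):
    steps = propagate(grid)
    if steps is None:
        return None
    if depth == 0:
        # Assignments made before any branching.
        stats["constraint_propagation_steps"] = steps
    empties = [(len(cell_candidates(grid, r, c)), r, c)
               for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return grid
    _, r, c = min(empties)  # MRV: fewest candidates, ties broken by position
    stats["backtrack_depth"] = max(stats["backtrack_depth"], depth + 1)
    for d in cell_candidates(grid, r, c):
        trial = [row[:] for row in grid]
        trial[r][c] = d
        solved = _search(trial, depth + 1, stats)
        if solved is not None:
            return solved
    return None

def solve_with_metrics(puzzle_grid):
    """Solve a copy of the grid and return (solution, metrics)."""
    stats = {"constraint_propagation_steps": 0, "backtrack_depth": 0}
    solved = _search([row[:] for row in puzzle_grid], 0, stats)
    stats["requires_backtrack"] = stats["backtrack_depth"] > 0
    return solved, stats

# Hypothetical input: a valid completed grid with one cell blanked out.
full = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [row[:] for row in full]
puzzle[4][4] = 0
solution, metrics = solve_with_metrics(puzzle)
print(metrics["backtrack_depth"], metrics["constraint_propagation_steps"])  # 0 1
```

Breaking MRV ties by cell position keeps the search order fixed, so repeated runs on the same puzzle report the same depth.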
---
### Difficulty label
| Column | Type | Values |
|-------------------|--------|-------------------------------------------|
| `difficulty_tier` | string | `easy` / `medium` / `hard` / `expert` |
**Tier assignment logic:**
```
easy → backtrack_depth = 0 and initial_resolution_rate > 0.70
medium → backtrack_depth = 0 and initial_resolution_rate ≤ 0.70
hard → backtrack_depth ∈ [1, 5]
expert → backtrack_depth > 5
```
The threshold at 0.70 for easy/medium separates puzzles where most missing cells are immediately deterministic (easy) from those that require more careful scanning of units even without backtracking (medium). The hard/expert split at depth 5 is based on the observed distribution across 9M puzzles — depth > 5 represents the tail of genuinely search-intensive puzzles.
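The tier rules translate directly into a small function. The sketch below restates the logic listed above; the function name `difficulty_tier` is an assumption, and `backtrack_depth` is taken to be a non-negative integer.

```python
def difficulty_tier(backtrack_depth, initial_resolution_rate):
    """Map the two solver metrics to a tier label, following the rules above."""
    if backtrack_depth == 0:
        return "easy" if initial_resolution_rate > 0.70 else "medium"
    return "hard" if backtrack_depth <= 5 else "expert"

print(difficulty_tier(0, 0.85))  # easy
print(difficulty_tier(0, 0.70))  # medium (the 0.70 boundary belongs to medium)
print(difficulty_tier(5, 0.40))  # hard
print(difficulty_tier(6, 0.40))  # expert
```

Once any backtracking is required, `initial_resolution_rate` no longer affects the tier.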
**This is solver complexity, not human difficulty.** The tiers correlate well with human-perceived difficulty, but they measure computational search effort rather than puzzle aesthetics. A puzzle rated `expert` here required more than 5 levels of recursive search from the solver, which is a meaningful signal regardless of how a human might approach it.
---
## Companion file — pipeline_log.csv
Alongside `gridcorpus.csv`, this dataset includes `pipeline_log.csv` — one row per processed chunk, generated during the enrichment pipeline. It serves as a processing audit trail and lets you verify that metrics are consistent across the full dataset, not just the final aggregate.
| Column | Type | Description |
|--------------------------|-------|---------------------------------------------------------------|
| `chunk_idx` | int | Chunk index (0-based). Each chunk is 100,000 rows |
| `rows` | int | Number of rows in this chunk |
| `elapsed_s` | float | Processing time in seconds for this chunk |
| `missing_cells_mean` | float | Average missing cells across puzzles in this chunk |
| `given_ratio_mean` | float | Average given_ratio in this chunk |
| `naked_singles_mean` | float | Average naked_singles_count in this chunk |
| `hidden_singles_mean` | float | Average hidden_singles_count in this chunk |
| `resolution_rate_mean` | float | Average initial_resolution_rate in this chunk |
| `requires_backtrack_pct` | float | Percentage of puzzles in this chunk that required backtracking |
| `backtrack_depth_max` | int | Maximum backtrack_depth observed in this chunk |
| `backtrack_depth_mean` | float | Average backtrack_depth in this chunk |
| `propagation_steps_mean` | float | Average constraint_propagation_steps in this chunk |
| `tier_easy` | int | Count of easy puzzles in this chunk |
| `tier_medium` | int | Count of medium puzzles in this chunk |
| `tier_hard` | int | Count of hard puzzles in this chunk |
| `tier_expert` | int | Count of expert puzzles in this chunk |
| `cumulative_rows` | int | Total rows processed up to and including this chunk |
The log is append-only and chunk-safe: even if the pipeline was interrupted and resumed, each chunk appears exactly once. Consistent `missing_cells_mean` and `resolution_rate_mean` values across chunks indicate that the source dataset is homogeneous across its row order, with no ordering artifacts.
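A few of these properties can be checked mechanically. The sketch below uses only the standard library and assumes the integer columns parse with `int()`; the function name `check_pipeline_log` and the two sample rows are made up for illustration and are not real log entries.

```python
import csv
import io

TIER_COLUMNS = ("tier_easy", "tier_medium", "tier_hard", "tier_expert")

def check_pipeline_log(rows):
    """Return a list of human-readable problems found in the log rows."""
    problems = []
    seen = set()
    running_total = 0
    for row in sorted(rows, key=lambda r: int(r["chunk_idx"])):
        idx, n = int(row["chunk_idx"]), int(row["rows"])
        if idx in seen:
            problems.append(f"chunk {idx} appears more than once")
        seen.add(idx)
        running_total += n
        if int(row["cumulative_rows"]) != running_total:
            problems.append(f"chunk {idx}: cumulative_rows mismatch")
        if sum(int(row[t]) for t in TIER_COLUMNS) != n:
            problems.append(f"chunk {idx}: tier counts do not sum to rows")
    return problems

# Made-up sample with only the columns this check reads.
sample_log = (
    "chunk_idx,rows,tier_easy,tier_medium,tier_hard,tier_expert,cumulative_rows\n"
    "0,100000,13400,68400,16600,1600,100000\n"
    "1,100000,13500,68300,16600,1600,200000\n"
)
problems = check_pipeline_log(list(csv.DictReader(io.StringIO(sample_log))))
print(problems)  # []

# For the real file, something like:
# with open("pipeline_log.csv", newline="") as f:
#     problems = check_pipeline_log(list(csv.DictReader(f)))
```

An empty list means every chunk index is unique, the running row total matches `cumulative_rows`, and the four tier counts add up to `rows` in each chunk.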
---
## Observed distributions (9,000,000 puzzles)
```
Total puzzles 9,000,000
missing_cells avg 42.08
given_ratio avg 0.4805
resolution_rate avg 0.4526
requires_backtrack 1,639,390 puzzles (18.2%)
difficulty_tier:
easy 1,210,679 (13.5%)
medium 6,149,931 (68.3%)
hard 1,493,604 (16.6%)
expert 145,786 ( 1.6%)
```
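The figures above are internally consistent, and the check is short. Because `hard` and `expert` are the only tiers with `backtrack_depth` ≥ 1, their counts should add up to the `requires_backtrack` total. This assumes `requires_backtrack` is equivalent to `backtrack_depth` ≥ 1, as the column descriptions suggest.

```python
total = 9_000_000
tiers = {"easy": 1_210_679, "medium": 6_149_931, "hard": 1_493_604, "expert": 145_786}
requires_backtrack = 1_639_390

# Tier counts cover every puzzle exactly once.
assert sum(tiers.values()) == total

# Puzzles needing backtracking are exactly the hard and expert tiers.
assert tiers["hard"] + tiers["expert"] == requires_backtrack

shares = {name: round(100 * count / total, 1) for name, count in tiers.items()}
print(shares)  # {'easy': 13.5, 'medium': 68.3, 'hard': 16.6, 'expert': 1.6}

# given_ratio is given_cells / 81, so its mean follows from the mean missing count.
print(round((81 - 42.08) / 81, 4))  # 0.4805
```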
---
## License
Derived dataset. Source data credited to Rohan Rao under its original Kaggle terms. All derived columns are released under CC BY 4.0.
---
```
· · · · · · · · ·
· · · · · · · · ·
· · · · · · · · ·
Nine boxes. Nine rows. Nine columns.
About 6.67 sextillion possible grids.
One solution.
```
Provided by:
beta3



