Name: beta3/GridCorpus_9M_Sudoku_Puzzles_Enriched
Creator: beta3
Published: 2026-03-07 21:40:49
License: No description available

Download link:

https://hf-mirror.com/datasets/beta3/GridCorpus_9M_Sudoku_Puzzles_Enriched

Resource description:

---
configs:
- config_name: gridcorpus
  data_files:
  - split: data
    path: gridcorpus.csv
- config_name: pipeline_log
  data_files:
  - split: logs
    path: pipeline_log.csv
license: cc-by-4.0
language:
- en
tags:
- sudoku
- puzzle
- constraint-satisfaction
- backtracking
- solver
- difficulty
- benchmarking
- feature-extraction
- llm-evaluation
- reasoning
task_categories:
- feature-extraction
- text-generation
- other
size_categories:
- 1M<n<10M
---

[![Kaggle](https://img.shields.io/badge/Kaggle-GridCorpus-20BEFF?style=flat&logo=kaggle&logoColor=white)](https://www.kaggle.com/datasets/beta3logic/gridcorpus-9m-sudoku-puzzles-enriched/data?select=gridcorpus.csv)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-GridCorpus-FFD21E?style=flat&logo=huggingface&logoColor=black)](https://huggingface.co/datasets/beta3/GridCorpus_9M_Sudoku_Puzzles_Enriched)
[![License](https://img.shields.io/badge/license-CC%20BY%204.0-blue?style=flat)](https://creativecommons.org/licenses/by/4.0/)

```
╔══════════════════════════════════════════════════════════════════════╗
║                                                                      ║
║                        G R I D   C O R P U S                         ║
║                                                                      ║
║  "004300209005009001070060043..."                                    ║
║              │                                                       ║
║              ▼                                                       ║
║  ┌───────┬───────┬───────┐   missing_cells      → 36                 ║
║  │ · · 4 │ 3 · · │ 2 · 9 │   given_cells        → 45                 ║
║  │ · · 5 │ · · 9 │ · · 1 │   given_ratio        → 0.556              ║
║  │ · 7 · │ · 6 · │ · 4 3 │   row_given_counts   → [3,2,3,...]        ║
║  ├───────┼───────┼───────┤   col_given_counts   → [4,3,5,...]        ║
║  │ · · 6 │ · · 2 │ · 8 7 │   naked_singles      → 8                  ║
║  │ 1 9 · │ · · 7 │ 4 · · │   hidden_singles     → 15                 ║
║  │ · 5 · │ · 8 3 │ 1 6 · │   initial_res_rate   → 0.489              ║
║  ├───────┼───────┼───────┤   requires_backtrack → False              ║
║  │ · · · │ · · · │ 6 · 9 │   backtrack_depth    → 0                  ║
║  │ · 7 3 │ · · · │ · 5 · │   propagation_steps  → 45                 ║
║  │ 8 · · │ 2 · · │ 1 · 3 │   difficulty_tier    → medium             ║
║  └───────┴───────┴───────┘                                           ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝
```

---

## What is this

GridCorpus is an enriched version of the original [9 Million Sudoku Puzzles and Solutions](https://www.kaggle.com/datasets/rohanrao/sudoku) dataset by Rohan Rao.

The raw dataset gives you 9 million puzzle-solution pairs as 81-character strings. That's useful, but it tells you nothing about how hard each puzzle actually is or what its structure looks like numerically.

This dataset takes those 9 million puzzles, runs each one through a purpose-built instrumented solver, and extracts a set of features that describe both the structure of the puzzle and the computational effort required to solve it. The result is a dataset where you can filter by difficulty, analyze solving complexity, or feed structured grids into models with little preprocessing.

It is the backing data for the [NineGrid LLM benchmark](https://www.kaggle.com/benchmarks/beta3logic/ninegrid-sudoku-reasoning-benchmark-for-llms), which evaluates Large Language Models on Sudoku constraint satisfaction.

---

## Source

**Original dataset:** [9 Million Sudoku Puzzles and Solutions](https://www.kaggle.com/datasets/rohanrao/sudoku) (Rohan Rao, Kaggle).

No puzzles were filtered, reordered, or modified. All added columns are derived computationally from the original `puzzle` and `solution` strings.

---

### Grid representation

| Column          | Type   | Description                                                   |
|-----------------|--------|---------------------------------------------------------------|
| `puzzle_grid`   | string | JSON-encoded `list[list[int]]`, 9×9. Zeros for missing cells  |
| `solution_grid` | string | JSON-encoded `list[list[int]]`, 9×9. No zeros                 |

These are the canonical grid format used by the benchmark. To use them in Python (assuming `gridcorpus.csv` has been downloaded locally):

```python
import json
import pandas as pd

df = pd.read_csv("gridcorpus.csv", nrows=1000)  # small sample of the 9M-row file
puzzle_grid = json.loads(df["puzzle_grid"].iloc[0])      # list[list[int]]
solution_grid = json.loads(df["solution_grid"].iloc[0])  # list[list[int]], no zeros
```

---

### Structural features

These are computed directly from the puzzle string; no solver is required.

| Column             | Type   | Description                                   |
|--------------------|--------|-----------------------------------------------|
| `missing_cells`    | int    | Count of `0` characters. Range: [17, 64]      |
| `given_cells`      | int    | 81 − missing_cells                            |
| `given_ratio`      | float  | given_cells / 81                              |
| `row_given_counts` | string | JSON list of 9 ints: given cells per row      |
| `col_given_counts` | string | JSON list of 9 ints: given cells per column   |

**How they're calculated:**

`missing_cells` is a straight count of zeros in the 81-character string. `given_ratio` divides the number of given cells by 81, which places clue density on a [0, 1] scale so puzzles can be compared directly. `row_given_counts` and `col_given_counts` split the 81 cells into their respective units and count the non-zero values in each; each list sums to `given_cells`.

---

### Solving complexity features

These require actually running a solver on the puzzle. The solver is a backtracking implementation with a **Minimum Remaining Values (MRV)** heuristic: at each branching point, it picks the empty cell with the fewest valid candidates, which tends to keep the search tree small. The solver is fully deterministic, so the same puzzle always produces identical metrics.

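
The MRV selection step described above can be sketched as follows. This is a minimal illustration, not the dataset's actual solver (which this card does not show): it assumes a `list[list[int]]` grid with zeros for empty cells, as in `puzzle_grid`, and the helper names `candidates` and `pick_mrv_cell` as well as the row-major tie-break are hypothetical choices.

```python
def candidates(grid, r, c):
    """Digits not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def pick_mrv_cell(grid):
    """Return ((r, c), candidate_set) for the empty cell with the fewest
    candidates, or None if the grid has no empty cells.
    Ties go to the first cell in row-major order (an assumption)."""
    empties = [((r, c), candidates(grid, r, c))
               for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return None
    return min(empties, key=lambda item: len(item[1]))


# Toy grid: only the first row is filled, except its last cell.
grid = [[0] * 9 for _ in range(9)]
grid[0] = [1, 2, 3, 4, 5, 6, 7, 8, 0]
cell, cands = pick_mrv_cell(grid)
print(cell, cands)  # (0, 8) {9}
```

A backtracking solver would call `pick_mrv_cell` at each branching point and try the returned candidates in turn; a cell with an empty candidate set signals a dead end.
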
| Column                         | Type  | Description                                                                                                        |
|--------------------------------|-------|--------------------------------------------------------------------------------------------------------------------|
| `naked_singles_count`          | int   | Empty cells with exactly one valid candidate at the initial puzzle state                                           |
| `hidden_singles_count`         | int   | Empty cells that are the only valid position for some digit in their row, column, or box (excluding naked singles) |
| `initial_resolution_rate`      | float | (naked + hidden) / missing_cells. Always in [0, 1]                                                                 |
| `requires_backtrack`           | bool  | Whether the solver needed any branching beyond propagation                                                         |
| `backtrack_depth`              | int   | Maximum recursion depth reached. 0 = solved by propagation only                                                    |
| `constraint_propagation_steps` | int   | Cell assignments made during propagation before any branching                                                      |

**How they're calculated:**

A **naked single** is a cell where, after eliminating all digits already present in its row, column, and box, only one digit remains. It can be assigned directly without guessing.

A **hidden single** is subtler: the cell may have multiple candidates, but one of those digits has no other valid position within the row, column, or box, so it must go there. These are detected by scanning all 27 units (9 rows + 9 columns + 9 boxes) and finding digits with only one candidate position. Naked singles are excluded from this count to avoid double-counting the same cell.

`initial_resolution_rate` measures how much of the puzzle is solvable from the starting state using only these two deterministic techniques, before any guessing is needed. A value of 1.0 means the puzzle collapses entirely under propagation. Values close to 0 indicate puzzles where almost nothing can be determined without branching.

The solver then runs to completion, counting how deep the backtracking search goes. `backtrack_depth = 0` means no branching was needed at all. Each unit of depth represents one guess that couldn't be resolved by propagation alone.

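
The two single counts and the resolution rate can be sketched as below. This is an illustration under stated assumptions rather than the pipeline's code: it applies the definitions above once to the initial puzzle state (no re-propagation after a single is found), counts each cell at most once, and uses hypothetical helper names (`candidates`, `all_units`, `count_singles`).

```python
def candidates(grid, r, c):
    """Digits not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def all_units():
    """The 27 units: 9 rows, 9 columns, 9 boxes, each a list of (r, c)."""
    rows = [[(r, c) for c in range(9)] for r in range(9)]
    cols = [[(r, c) for r in range(9)] for c in range(9)]
    boxes = [[(br + i, bc + j) for i in range(3) for j in range(3)]
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return rows + cols + boxes


def count_singles(grid):
    """Return (naked_count, hidden_count, initial_resolution_rate)."""
    cands = {(r, c): candidates(grid, r, c)
             for r in range(9) for c in range(9) if grid[r][c] == 0}
    naked = {cell for cell, cs in cands.items() if len(cs) == 1}
    hidden = set()
    for unit in all_units():
        for digit in range(1, 10):
            spots = [cell for cell in unit
                     if cell in cands and digit in cands[cell]]
            # Exactly one position for this digit in the unit; skip cells
            # already counted as naked singles.
            if len(spots) == 1 and spots[0] not in naked:
                hidden.add(spots[0])
    missing = len(cands)
    rate = (len(naked) + len(hidden)) / missing if missing else 0.0
    return len(naked), len(hidden), rate


# Toy grid: four 1s force a 1 into cell (0, 2) of the top-left box,
# even though that cell still has all nine candidates.
grid = [[0] * 9 for _ in range(9)]
for r, c in [(1, 3), (2, 6), (3, 0), (6, 1)]:
    grid[r][c] = 1
naked_n, hidden_n, rate = count_singles(grid)
print(naked_n, hidden_n)  # 0 1
```

Using a set for hidden singles means a cell that is the unique position in both its row and its box is still counted once.
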
---

### Difficulty label

| Column            | Type   | Values                                |
|-------------------|--------|---------------------------------------|
| `difficulty_tier` | string | `easy` / `medium` / `hard` / `expert` |

**Tier assignment logic:**

```
easy   → backtrack_depth = 0  and  initial_resolution_rate > 0.70
medium → backtrack_depth = 0  and  initial_resolution_rate ≤ 0.70
hard   → backtrack_depth ∈ [1, 5]
expert → backtrack_depth > 5
```

The threshold at 0.70 for easy/medium separates puzzles where most missing cells are immediately deterministic (easy) from those that require more careful scanning of units even without backtracking (medium). The hard/expert split at depth 5 is based on the observed distribution across 9M puzzles: depth > 5 represents the tail of genuinely search-intensive puzzles.

**This is solver complexity, not human difficulty.** The tiers generally track human-perceived difficulty, but they measure computational search effort rather than puzzle aesthetics. A puzzle rated `expert` here required more than 5 levels of recursive search from the solver, which is a meaningful signal regardless of how a human might approach it.

---

## Companion file: pipeline_log.csv

Alongside `gridcorpus.csv`, this dataset includes `pipeline_log.csv`, with one row per processed chunk, generated during the enrichment pipeline. It serves as a processing audit trail and lets you verify that metrics are consistent across the full dataset, not just in the final aggregate.

| Column                   | Type  | Description                                                    |
|--------------------------|-------|----------------------------------------------------------------|
| `chunk_idx`              | int   | Chunk index (0-based). Each chunk is 100,000 rows              |
| `rows`                   | int   | Number of rows in this chunk                                   |
| `elapsed_s`              | float | Processing time in seconds for this chunk                      |
| `missing_cells_mean`     | float | Average missing cells across puzzles in this chunk             |
| `given_ratio_mean`       | float | Average given_ratio in this chunk                              |
| `naked_singles_mean`     | float | Average naked_singles_count in this chunk                      |
| `hidden_singles_mean`    | float | Average hidden_singles_count in this chunk                     |
| `resolution_rate_mean`   | float | Average initial_resolution_rate in this chunk                  |
| `requires_backtrack_pct` | float | Percentage of puzzles in this chunk that required backtracking |
| `backtrack_depth_max`    | int   | Maximum backtrack_depth observed in this chunk                 |
| `backtrack_depth_mean`   | float | Average backtrack_depth in this chunk                          |
| `propagation_steps_mean` | float | Average constraint_propagation_steps in this chunk             |
| `tier_easy`              | int   | Count of easy puzzles in this chunk                            |
| `tier_medium`            | int   | Count of medium puzzles in this chunk                          |
| `tier_hard`              | int   | Count of hard puzzles in this chunk                            |
| `tier_expert`            | int   | Count of expert puzzles in this chunk                          |
| `cumulative_rows`        | int   | Total rows processed up to and including this chunk            |

The log is append-only and chunk-safe: even if the pipeline was interrupted and resumed, each chunk appears exactly once. Consistent `missing_cells_mean` and `resolution_rate_mean` values across chunks indicate that puzzles of different kinds are evenly mixed throughout the source file, with no obvious ordering artifacts.

---

## Observed distributions (9,000,000 puzzles)

```
Total puzzles          9,000,000
missing_cells avg      42.08
given_ratio avg        0.4805
resolution_rate avg    0.4526
requires_backtrack     1,639,390 puzzles (18.2%)

difficulty_tier:
  easy       1,210,679  (13.5%)
  medium     6,149,931  (68.3%)
  hard       1,493,604  (16.6%)
  expert       145,786  ( 1.6%)
```

---

## License

Derived dataset. Source data credited to Rohan Rao under its original Kaggle terms. All derived columns are released under CC BY 4.0.

---

```
· · · · · · · · ·
· · · · · · · · ·
· · · · · · · · ·

Nine cells. Nine rows. Nine columns.
About 6.67 sextillion possible grids.
One solution.
```
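
For reference, the tier assignment rules from the Difficulty label section can be written as a small function. This is a sketch that assumes the thresholds exactly as stated there; the function name `difficulty_tier_for` is hypothetical and not part of the dataset or its pipeline.

```python
def difficulty_tier_for(backtrack_depth: int, initial_resolution_rate: float) -> str:
    """Map the two solver metrics to a tier, following the stated rules."""
    if backtrack_depth == 0:
        # No branching needed: split on how much resolves from the start state.
        return "easy" if initial_resolution_rate > 0.70 else "medium"
    # Branching was needed: split on the maximum recursion depth.
    return "hard" if backtrack_depth <= 5 else "expert"


# Values from the example in the header diagram.
print(difficulty_tier_for(0, 0.489))  # medium
```

Applied to the `backtrack_depth` and `initial_resolution_rate` columns, this should reproduce `difficulty_tier` if the published rules are complete.
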

Application scenarios: