Chess-Nut-Engine/chess-sft-eval

Name: Chess-Nut-Engine/chess-sft-eval
Creator: Chess-Nut-Engine
Published: 2026-03-20 04:08:19
License: apache-2.0

Source: Hugging Face. Updated 2026-03-20. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Chess-Nut-Engine/chess-sft-eval

Description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
size_categories:
- 10K<n<100K
tags:
- chess
- sft
- evaluation
- benchmark
- chess960
pretty_name: Chess SFT Eval & Benchmark
configs:
- config_name: eval_splits
  data_files:
  - split: test
    path: "eval_splits/*.jsonl"
- config_name: benchmark
  data_files:
  - split: test
    path: "benchmark/*.jsonl"
- config_name: eval_chess960
  data_files:
  - split: test
    path: "eval_splits/chess960.jsonl"
- config_name: eval_endgames
  data_files:
  - split: test
    path: "eval_splits/endgames.jsonl"
- config_name: eval_evaluation
  data_files:
  - split: test
    path: "eval_splits/evaluation.jsonl"
- config_name: eval_mate
  data_files:
  - split: test
    path: "eval_splits/mate.jsonl"
- config_name: eval_openings
  data_files:
  - split: test
    path: "eval_splits/openings.jsonl"
- config_name: eval_perception
  data_files:
  - split: test
    path: "eval_splits/perception.jsonl"
- config_name: eval_planning
  data_files:
  - split: test
    path: "eval_splits/planning.jsonl"
- config_name: eval_rules
  data_files:
  - split: test
    path: "eval_splits/rules.jsonl"
- config_name: eval_tactics
  data_files:
  - split: test
    path: "eval_splits/tactics.jsonl"
- config_name: bench_chess960
  data_files:
  - split: test
    path: "benchmark/chess960.jsonl"
- config_name: bench_endgames
  data_files:
  - split: test
    path: "benchmark/endgames.jsonl"
- config_name: bench_evaluation
  data_files:
  - split: test
    path: "benchmark/evaluation.jsonl"
- config_name: bench_mate
  data_files:
  - split: test
    path: "benchmark/mate.jsonl"
- config_name: bench_openings
  data_files:
  - split: test
    path: "benchmark/openings.jsonl"
- config_name: bench_perception
  data_files:
  - split: test
    path: "benchmark/perception.jsonl"
- config_name: bench_planning
  data_files:
  - split: test
    path: "benchmark/planning.jsonl"
- config_name: bench_rules
  data_files:
  - split: test
    path: "benchmark/rules.jsonl"
- config_name: bench_tactics
  data_files:
  - split: test
    path: "benchmark/tactics.jsonl"
---

# Chess SFT Eval & Benchmark

Held-out evaluation splits and a frozen benchmark for the [Chess SFT training pipeline](https://huggingface.co/datasets/Chess-Nut-Engine/chess-sft-data). Every FEN in these files is **excluded from training data** via a blocklist to guarantee zero contamination.

| | |
|---|---|
| **Eval examples** | 13,000 |
| **Benchmark examples** | 13,000 |
| **Splits** | 9 (perception, rules, tactics, evaluation, openings, endgames, planning, chess960, mate) |
| **Format** | JSONL |
| **Training companion** | [`Chess-Nut-Engine/chess-sft-data`](https://huggingface.co/datasets/Chess-Nut-Engine/chess-sft-data) |

## How eval and benchmark differ

- **Eval splits** contain raw held-out positions with ground-truth labels (FENs, legal moves, puzzle solutions, etc.). Use these for flexible evaluation with custom metrics.
- **Frozen benchmark** is a deterministic, versioned subset of the eval splits (seed=42). Each row has a standardized `prompt` + `gold_answer` + `metric_type` format for reproducible scoring. The `manifest.json` file tracks version and split sizes.

## Eval Split Schema

Eval split rows vary by split but share a common core:

```json
{
  "fen": "r2q1rk1/4P1pp/3p4/2pN4/...",
  "side": "black",
  "legal_moves": ["g8h8", "g8f7", ...],
  ...
}
```

Fields are task-specific (e.g. `legal_moves` for rules, `puzzle_moves` for tactics).

## Benchmark Schema

Every benchmark row follows a uniform structure for automated scoring:

```json
{
  "example_id": "perception_00000",
  "split": "perception",
  "task_type": "board_print",
  "fen": "qrkrn1bb/pp1p2pp/2p1np2/...",
  "prompt": "Position (FEN): ...\nShow me the board.",
  "gold_answer": "8 q r k r n . b b\n7 ...",
  "metric_type": "exact_match",
  "metadata": {}
}
```

| Field | Description |
|-------|-------------|
| `example_id` | Unique identifier (`{split}_{index}`) |
| `split` | Which eval category this belongs to |
| `task_type` | Specific task within the split |
| `fen` | Chess position in FEN notation |
| `prompt` | The question to pose to the model |
| `gold_answer` | Ground truth answer for scoring |
| `metric_type` | How to score: `exact_match`, `set_match`, `f1`, or `numeric_tolerance` |
| `metadata` | Optional extra context (puzzle rating, eval depth, etc.) |

## Loading

```python
from datasets import load_dataset

# Load all eval splits
ds = load_dataset("Chess-Nut-Engine/chess-sft-eval", "eval_splits")

# Load all benchmark rows
ds = load_dataset("Chess-Nut-Engine/chess-sft-eval", "benchmark")

# Load a single split
ds = load_dataset("Chess-Nut-Engine/chess-sft-eval", "eval_tactics")
ds = load_dataset("Chess-Nut-Engine/chess-sft-eval", "bench_tactics")
```

## Eval Splits

| Split | Description | Examples | Size |
|-------|-------------|----------|------|
| `chess960` | Fischer Random: all task types applied to Chess960 positions | 500 | 0.1 MB |
| `endgames` | Endgame play: classification, WDL prediction, tablebase best moves | 1,500 | 0.2 MB |
| `evaluation` | Position assessment: material balance, Stockfish-calibrated evaluation | 1,500 | 0.3 MB |
| `mate` | Checkmate patterns: forced mate detection and execution | 1,000 | 0.1 MB |
| `openings` | Opening knowledge: identification, continuation, principles (ECO holdout) | 500 | 0.2 MB |
| `perception` | Board reading: FEN-to-board, piece identification, state tracking | 2,000 | 0.3 MB |
| `planning` | Strategic planning: best move selection, puzzle solving, move consequences | 2,000 | 0.9 MB |
| `rules` | Move legality: legal move generation, check detection, special rules | 2,000 | 0.3 MB |
| `tactics` | Tactical patterns: captures, threats, pins, forks, hanging pieces | 2,000 | 1.0 MB |

## Frozen Benchmark

Deterministically sampled from eval splits (seed=42, version tracked in `manifest.json`).

| Split | Examples | Size |
|-------|----------|------|
| `chess960` | 500 | 0.2 MB |
| `endgames` | 1,500 | 0.5 MB |
| `evaluation` | 1,500 | 0.6 MB |
| `mate` | 1,000 | 0.4 MB |
| `openings` | 500 | 0.3 MB |
| `perception` | 2,000 | 1.0 MB |
| `planning` | 2,000 | 0.8 MB |
| `rules` | 2,000 | 0.8 MB |
| `tactics` | 2,000 | 0.9 MB |

## Decontamination

Eval/benchmark positions are held out from training through multiple mechanisms:

1. **FEN blocklist**: every FEN in these splits is on a blocklist checked during generation
2. **ECO holdout**: opening eval positions come from held-out ECO code families
3. **Depth filtering**: evaluation benchmark uses only depth-40+ Stockfish analyses

## License

This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
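
## Scoring sketch (illustrative)

The benchmark schema names four `metric_type` values but this card does not define how each is computed. The following is a minimal sketch of one plausible scorer, not the dataset authors' implementation. The function name `score`, the whitespace tokenization used for `set_match` and `f1`, and the default `tolerance` of 0.5 are all assumptions made for illustration.

```python
from collections import Counter


def _tokens(text):
    """Split an answer into whitespace-separated tokens (assumed format)."""
    return text.split()


def score(prediction, gold_answer, metric_type, tolerance=0.5):
    """Return a score in [0, 1] for one benchmark row.

    The semantics of each metric are assumptions; the dataset card
    only lists the metric names.
    """
    if metric_type == "exact_match":
        # Compare after trimming surrounding whitespace.
        return float(prediction.strip() == gold_answer.strip())

    if metric_type == "set_match":
        # Order-insensitive comparison, e.g. for lists of legal moves.
        return float(set(_tokens(prediction)) == set(_tokens(gold_answer)))

    if metric_type == "f1":
        # Token-level F1 based on multiset overlap.
        pred = Counter(_tokens(prediction))
        gold = Counter(_tokens(gold_answer))
        overlap = sum((pred & gold).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(pred.values())
        recall = overlap / sum(gold.values())
        return 2 * precision * recall / (precision + recall)

    if metric_type == "numeric_tolerance":
        # Correct if the parsed number is within `tolerance` of the gold value.
        try:
            return float(abs(float(prediction) - float(gold_answer)) <= tolerance)
        except ValueError:
            return 0.0

    raise ValueError(f"unknown metric_type: {metric_type!r}")
```

To score a run, one could call `score(model_output, row["gold_answer"], row["metric_type"])` for each benchmark row and average the results per `split`. If the authors publish official scoring code, prefer it over this sketch.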
