MostLime/chess-elite-uci

Name: MostLime/chess-elite-uci
Creator: MostLime
Published: 2026-03-03 14:35:10
License: CC BY 4.0 (from the dataset card; the listing field is empty)

Hugging Face · Updated 2026-03-03 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/MostLime/chess-elite-uci


Resource description:

---
license: cc-by-4.0
task_categories:
- text-generation
- reinforcement-learning
language:
- en
tags:
- chess
- uci
- transformer
- games
- elite
- lichess
- tokenized
size_categories:
- 1M<n<10M
---

# chess-elite-uci

A transformer-ready dataset of ~7.8 million elite chess games, pre-tokenized in UCI notation with a deterministic 1,977-token vocabulary. Built for training chess language models directly, with no preprocessing required.

## Dataset Summary

| Field | Value |
|---|---|
| Total games | 7,805,503 |
| Average sequence length | 94.24 tokens |
| Max sequence length | 255 tokens |
| Vocabulary size | 1,977 tokens |
| Mean combined Elo | 5,211 (~2,606 per player) |

## Sources

**Lichess Elite Database** (June 2020 – November 2025)

Games in which one player is rated 2500+ and the other 2300+ (from 2022 onwards; before that, 2400+ vs 2200+). Source: [database.nikonoel.fr](https://database.nikonoel.fr). Licensed CC0.

## Vocabulary

The vocabulary contains **1,977 tokens**. It is fully deterministic: it is enumerated from chess geometry rather than derived from data, so it never produces OOV tokens for any legal chess game.

| ID | Token | Description |
|---|---|---|
| 0 | `<PAD>` | Padding |
| 1 | `<W>` | POV token: white wins / white side for draws |
| 2 | `` | POV token: black wins / black side for draws |
| 3 | `<CHECKMATE>` | Terminal: game ended in checkmate |
| 4 | `<RESIGN>` | Terminal: losing side resigned (≥ 40 ply) |
| 5 | `<STALEMATE>` | Terminal: draw by stalemate |
| 6 | `<REPETITION>` | Terminal: draw by threefold repetition |
| 7 | `<FIFTY_MOVE>` | Terminal: draw by 50-move rule |
| 8 | `<INSUFF_MATERIAL>` | Terminal: draw by insufficient material |
| 9+ | a1a2 … h7h8q | 1,968 UCI move strings, sorted lexicographically |

The full vocabulary is provided in `vocab.json` as `{ token_str: int_id }`.

## Sequence Format

Every game is encoded as a flat list of integer token IDs:

```
[ <POV> | m1 | m2 | m3 | ... | mN | <TERMINAL> ]
```

- **POV token** (position 0): `<W>` if white wins, the black POV token (ID 2) if black wins. For draws, one of the two POV tokens is assigned at random (50/50).
- **Move tokens** (positions 1 to N): UCI half-moves alternating white/black, e.g. `e2e4`, `e7e5`, `g1f3`, `e1g1` (castling), `e7e8q` (promotion).
- **Terminal token** (position N+1): encodes why the game ended.

Maximum sequence length is **255 tokens** (1 POV + 253 moves + 1 terminal). Sequences are variable length; pad to 255 with `<PAD>` (ID 0) in your DataLoader.

## NTP Loss Mask

The `ntp_mask` column contains a binary list of the same length as `token_ids`. It indicates which positions should have next-token-prediction (NTP) loss applied during training:

```
Position             NTP loss
─────────────────────────────
POV token            1  (always)
Winning side move    1
Losing side move     0  (context only)
Terminal token       1  (always)
Draw game moves      1  (both sides, since neither lost)
```

This implements win-conditioned training: the model learns to predict the winning side's moves given the POV token, while still attending to the losing side's moves as context.

Usage in PyTorch:

```python
from torch.nn.functional import cross_entropy

# Assumes logits is (N, vocab_size) and labels / ntp_mask are flattened to (N,)
loss = cross_entropy(logits, labels, reduction="none")
loss = (loss * ntp_mask).sum() / ntp_mask.sum()
```

## Filtering

Games were filtered as follows before inclusion:

**Decisive games (1-0 / 0-1):**

- **Checkmates**: verified by `board.is_checkmate()` on the final position. No length minimum.
- **Resignations**: not checkmate, minimum 40 halfmoves (20 moves each side).

**Draws (1/2-1/2):**

- Only **forced draws** are included: stalemate, insufficient material, 50-move rule, threefold repetition.
- Draw-by-agreement is excluded (`board.is_game_over(claim_draw=True)` must return True).

**All games:**

- Maximum 253 halfmoves (fits within 255-token sequence budget).
- Both player Elo values must be present and non-zero.
- All moves must be legally parseable by python-chess.
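The filtering rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the pipeline actually used to build the dataset. The function name `classify_game`, its parameters, and the draw-reason strings are hypothetical; the board-level facts (`ends_in_checkmate`, `draw_reason`) are assumed to have been computed upstream, for example with python-chess.

```python
from typing import Optional

MAX_HALFMOVES = 253          # leaves room for the POV and terminal tokens
MIN_RESIGN_HALFMOVES = 40    # 20 moves per side
# Hypothetical labels for the four forced-draw reasons named in the rules.
FORCED_DRAW_REASONS = {"stalemate", "insufficient_material", "fifty_move", "repetition"}


def classify_game(
    result: str,
    n_halfmoves: int,
    white_elo: Optional[int],
    black_elo: Optional[int],
    ends_in_checkmate: bool,
    draw_reason: Optional[str] = None,
) -> Optional[str]:
    """Return a game type for a kept game, or None if the game is filtered out."""
    # Rules that apply to all games.
    if n_halfmoves > MAX_HALFMOVES:
        return None
    if not white_elo or not black_elo:  # missing or zero Elo
        return None

    # Decisive games: checkmates have no length minimum, resignations do.
    if result in ("1-0", "0-1"):
        if ends_in_checkmate:
            return "checkmate"
        if n_halfmoves >= MIN_RESIGN_HALFMOVES:
            return "resignation"
        return None

    # Draws: only forced draws are kept, so draws by agreement fall through.
    if result == "1/2-1/2":
        return "forced_draw" if draw_reason in FORCED_DRAW_REASONS else None

    return None
```

For example, `classify_game("1-0", 30, 2600, 2500, ends_in_checkmate=False)` returns `None`, because a decisive game that does not end in checkmate must reach 40 halfmoves to count as a resignation.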
**Game type breakdown:**

| Type | Count | % |
|---|---|---|
| White checkmate | 1,702,751 | 21.8% |
| White resignation | 2,000,000 | 25.6% |
| Black checkmate | 1,702,752 | 21.8% |
| Black resignation | 2,000,000 | 25.6% |
| Forced draw | 400,000 | 5.1% |

## Schema

```python
{
    "white_elo":    int32,        # white player Elo
    "black_elo":    int32,        # black player Elo
    "combined_elo": int32,        # white_elo + black_elo
    "result":       string,       # "1-0", "0-1", or "1/2-1/2"
    "game_type":    string,       # "checkmate", "resignation", or "forced_draw"
    "pov":          string,       # "<W>" or the black POV token (ID 2)
    "terminal":     string,       # "<CHECKMATE>", "<RESIGN>", "<STALEMATE>", ...
    "source":       string,       # "lichess"
    "moves_uci":    string,       # space-separated UCI moves, human-readable
    "token_ids":    list[int32],  # encoded sequence, use this for training
    "ntp_mask":     list[int32],  # 1 = apply NTP loss, 0 = skip
    "seq_len":      int32,        # len(token_ids), always in [3, 255]
}
```

## Usage

```python
import json

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load dataset
ds = load_dataset("MostLime/chess-elite-uci", split="train")

# Load vocabulary
with open("vocab.json") as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

# Decode a game
row = ds[0]
tokens = [id_to_token[i] for i in row["token_ids"]]
print(" ".join(tokens))
# → <W> e2e4 e7e5 g1f3 b8c6 f1b5 ... <RESIGN>

# PyTorch DataLoader
def collate(batch):
    max_len = 255
    # Zero-initialised tensors double as <PAD> (ID 0) padding and a zero loss mask
    token_ids = torch.zeros(len(batch), max_len, dtype=torch.long)
    ntp_mask = torch.zeros(len(batch), max_len, dtype=torch.float)
    for i, row in enumerate(batch):
        n = row["seq_len"]
        token_ids[i, :n] = torch.tensor(row["token_ids"], dtype=torch.long)
        ntp_mask[i, :n] = torch.tensor(row["ntp_mask"], dtype=torch.float)
    return {"token_ids": token_ids, "ntp_mask": ntp_mask}

loader = DataLoader(ds, batch_size=32, collate_fn=collate)
```

## Inference

At inference time, prepend the POV token for the side the model plays as, then feed opponent moves as context and sample responses:

```python
# Model plays as white
sequence = [vocab["<W>"]]
# White moves first: sample the model's opening move and append it (e.g. e2e4)
sequence.append(vocab["e2e4"])
# Opponent replies e7e5: append as context
sequence.append(vocab["e7e5"])
# Sample model's next move from legal UCI moves for the current position
```

Terminal tokens are never generated during normal play. The game ends when the opponent resigns or a draw is claimed externally.

## Citation

```bibtex
@dataset{mostlime2026chessEliteUCI,
  author    = {MostLime},
  title     = {chess-elite-uci: A Transformer-Ready Dataset of Elite Chess Games},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/MostLime/chess-elite-uci}
}
```

## Acknowledgements

- [Lichess Elite Database](https://database.nikonoel.fr) by nikonoel (CC0)
- [python-chess](https://python-chess.readthedocs.io) for move parsing and board state verification
- [Modal](https://modal.com) for distributed compute


Provider:

MostLime
