
MostLime/chess-elite-uci

Hugging Face · Updated 2026-03-03 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/MostLime/chess-elite-uci
Description:
---
license: cc-by-4.0
task_categories:
- text-generation
- reinforcement-learning
language:
- en
tags:
- chess
- uci
- transformer
- games
- elite
- lichess
- tokenized
size_categories:
- 1M<n<10M
---

# chess-elite-uci

A transformer-ready dataset of ~7.8 million elite chess games, pre-tokenized in UCI notation with a deterministic 1,977-token vocabulary. Built for training chess language models directly, with no preprocessing required.

## Dataset Summary

| Field | Value |
|---|---|
| Total games | 7,805,503 |
| Average sequence length | 94.24 tokens |
| Max sequence length | 255 tokens |
| Vocabulary size | 1,977 tokens |
| Mean combined Elo | 5,211 (~2,606 per player) |

## Sources

**Lichess Elite Database** (June 2020 – November 2025)

Games in which one player is rated 2500+ and the opponent is rated 2300+ (2022 onwards). Before 2022 the thresholds were 2400+ and 2200+. Source: [database.nikonoel.fr](https://database.nikonoel.fr). Licensed CC0.

## Vocabulary

The vocabulary contains **1,977 tokens** (9 special tokens plus 1,968 UCI move strings). It is fully deterministic: it is enumerated from chess geometry, not derived from data, so it will never produce OOV tokens for any legal chess game.

| ID | Token | Description |
|---|---|---|
| 0 | `<PAD>` | Padding |
| 1 | `<W>` | POV token: white wins / white side for draws |
| 2 | `<B>` | POV token: black wins / black side for draws |
| 3 | `<CHECKMATE>` | Terminal: game ended in checkmate |
| 4 | `<RESIGN>` | Terminal: losing side resigned (≥ 40 ply) |
| 5 | `<STALEMATE>` | Terminal: draw by stalemate |
| 6 | `<REPETITION>` | Terminal: draw by threefold repetition |
| 7 | `<FIFTY_MOVE>` | Terminal: draw by 50-move rule |
| 8 | `<INSUFF_MATERIAL>` | Terminal: draw by insufficient material |
| 9+ | a1a2 … h7h8q | 1968 UCI move strings, sorted lexicographically |

The full vocabulary is provided in `vocab.json` as `{ token_str: int_id }`.

## Sequence Format

Every game is encoded as a flat list of integer token IDs:

```
[ <POV> | m1 | m2 | m3 | ... | mN | <TERMINAL> ]
```

- **POV token** (position 0): `<W>` if white wins, `<B>` if black wins. For draws, assigned randomly 50/50 between `<W>` and `<B>`.
- **Move tokens** (positions 1 to N): UCI half-moves alternating white/black, e.g. `e2e4`, `e7e5`, `g1f3`, `e1g1` (castling), `e7e8q` (promotion).
- **Terminal token** (position N+1): encodes why the game ended.

Maximum sequence length is **255 tokens** (1 POV + 253 moves + 1 terminal). Sequences are variable length; pad them to 255 with `<PAD>` (ID 0) in your DataLoader.

## NTP Loss Mask

The `ntp_mask` column contains a binary list of the same length as `token_ids`. It indicates which positions should have next-token-prediction (NTP) loss applied during training:

```
Position            NTP loss
─────────────────────────────
POV token           1  (always)
Winning side move   1
Losing side move    0  (context only)
Terminal token      1  (always)
Draw game moves     1  (both sides, since neither lost)
```

This implements win-conditioned training: the model learns to predict the winning side's moves given the POV token, while still attending to the losing side's moves as context.

Usage in PyTorch:

```python
from torch.nn.functional import cross_entropy

# Per-token loss, averaged over the positions where ntp_mask == 1
loss = cross_entropy(logits, labels, reduction="none")
loss = (loss * ntp_mask).sum() / ntp_mask.sum()
```

## Filtering

Games were filtered as follows before inclusion:

**Decisive games (1-0 / 0-1):**
- **Checkmates**: verified by `board.is_checkmate()` on the final position. No length minimum.
- **Resignations**: not checkmate, minimum 40 halfmoves (20 moves each side).

**Draws (1/2-1/2):**
- Only **forced draws** are included: stalemate, insufficient material, 50-move rule, threefold repetition.
- Draw-by-agreement is excluded (`board.is_game_over(claim_draw=True)` must return True).

**All games:**
- Maximum 253 halfmoves (fits within 255-token sequence budget).
- Both player Elo values must be present and non-zero.
- All moves must be legally parseable by python-chess.
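The filtering rules above can be read as a single decision function. The following is a minimal sketch, not the author's pipeline: `GameSummary` and `classify` are hypothetical names, and the two boolean flags are assumed to have been computed beforehand (for example with python-chess, as the rules describe).

```python
from dataclasses import dataclass
from typing import Optional

MAX_HALFMOVES = 253         # 255-token budget minus the POV and terminal tokens
MIN_RESIGN_HALFMOVES = 40   # 20 moves per side


@dataclass
class GameSummary:
    """Hypothetical container of facts already extracted from one game."""
    result: str                  # "1-0", "0-1", or "1/2-1/2"
    num_halfmoves: int
    white_elo: Optional[int]
    black_elo: Optional[int]
    ends_in_checkmate: bool      # assumed: board.is_checkmate() on the final position
    ends_in_forced_draw: bool    # assumed: board.is_game_over(claim_draw=True)
    all_moves_legal: bool        # assumed: every move parsed legally


def classify(game: GameSummary) -> Optional[str]:
    """Return the game_type string, or None if the game is filtered out."""
    # Rules that apply to all games
    if not game.all_moves_legal:
        return None
    if not game.white_elo or not game.black_elo:   # missing or zero Elo
        return None
    if game.num_halfmoves > MAX_HALFMOVES:
        return None

    # Decisive games: checkmate has no length minimum, resignation does
    if game.result in ("1-0", "0-1"):
        if game.ends_in_checkmate:
            return "checkmate"
        if game.num_halfmoves >= MIN_RESIGN_HALFMOVES:
            return "resignation"
        return None

    # Draws: only forced draws are kept
    if game.result == "1/2-1/2" and game.ends_in_forced_draw:
        return "forced_draw"
    return None
```

The returned strings mirror the values of the `game_type` column, so a sketch like this could be used to sanity-check rows of the dataset against the stated rules.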
**Game type breakdown:**

| Type | Count | % |
|---|---|---|
| White checkmate | 1,702,751 | 21.8% |
| White resignation | 2,000,000 | 25.6% |
| Black checkmate | 1,702,752 | 21.8% |
| Black resignation | 2,000,000 | 25.6% |
| Forced draw | 400,000 | 5.1% |

## Schema

```python
{
    "white_elo":    int32,        # white player Elo
    "black_elo":    int32,        # black player Elo
    "combined_elo": int32,        # white_elo + black_elo
    "result":       string,       # "1-0", "0-1", or "1/2-1/2"
    "game_type":    string,       # "checkmate", "resignation", or "forced_draw"
    "pov":          string,       # "<W>" or "<B>"
    "terminal":     string,       # "<CHECKMATE>", "<RESIGN>", "<STALEMATE>", ...
    "source":       string,       # "lichess"
    "moves_uci":    string,       # space-separated UCI moves, human-readable
    "token_ids":    list[int32],  # encoded sequence, use this for training
    "ntp_mask":     list[int32],  # 1 = apply NTP loss, 0 = skip
    "seq_len":      int32,        # len(token_ids), always in [3, 255]
}
```

## Usage

```python
from datasets import load_dataset
import json

# Load dataset
ds = load_dataset("MostLime/chess-elite-uci", split="train")

# Load vocabulary
with open("vocab.json") as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

# Decode a game
row = ds[0]
tokens = [id_to_token[i] for i in row["token_ids"]]
print(" ".join(tokens))
# → <W> e2e4 e7e5 g1f3 b8c6 f1b5 ... <RESIGN>

# PyTorch DataLoader
import torch
from torch.utils.data import DataLoader

def collate(batch):
    max_len = 255
    token_ids = torch.zeros(len(batch), max_len, dtype=torch.long)
    ntp_mask = torch.zeros(len(batch), max_len, dtype=torch.float)
    for i, row in enumerate(batch):
        n = row["seq_len"]
        token_ids[i, :n] = torch.tensor(row["token_ids"], dtype=torch.long)
        ntp_mask[i, :n] = torch.tensor(row["ntp_mask"], dtype=torch.float)
    return {"token_ids": token_ids, "ntp_mask": ntp_mask}

loader = DataLoader(ds, batch_size=32, collate_fn=collate)
```

## Inference

At inference time, prepend the POV token for the side the model plays as, then feed opponent moves as context and sample responses:

```python
# Model plays as white
sequence = [vocab["<W>"]]

# White moves first: append the model's own move, e.g. e2e4
sequence.append(vocab["e2e4"])

# Opponent replies e7e5: append as context
sequence.append(vocab["e7e5"])

# Sample the model's next move from the legal UCI moves for the current position
```

Terminal tokens are never generated during normal play. The game ends when the opponent resigns or a draw is claimed externally.

## Citation

```bibtex
@dataset{mostlime2026chessEliteUCI,
  author    = {MostLime},
  title     = {chess-elite-uci: A Transformer-Ready Dataset of Elite Chess Games},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/MostLime/chess-elite-uci}
}
```

## Acknowledgements

- [Lichess Elite Database](https://database.nikonoel.fr) by nikonoel (CC0)
- [python-chess](https://python-chess.readthedocs.io) for move parsing and board state verification
- [Modal](https://modal.com) for distributed compute
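As a cross-check on the Vocabulary section, the stated count of 1,968 move strings can be reproduced from board geometry alone. The sketch below is one possible enumeration, not necessarily the author's script: it collects every from/to square pair reachable by a queen-like or knight-like step, then adds pawn promotions to `q`, `r`, `b`, and `n`. The helper names are hypothetical, and the ID assignment simply follows the "special tokens first, then moves sorted lexicographically" description; `vocab.json` remains the authoritative mapping.

```python
FILES = "abcdefgh"
PROMOTION_PIECES = "qrbn"
SPECIAL_TOKENS = [
    "<PAD>", "<W>", "<B>", "<CHECKMATE>", "<RESIGN>",
    "<STALEMATE>", "<REPETITION>", "<FIFTY_MOVE>", "<INSUFF_MATERIAL>",
]


def square(file_idx: int, rank_idx: int) -> str:
    """Convert 0-based file/rank indices to a square name such as 'e4'."""
    return f"{FILES[file_idx]}{rank_idx + 1}"


def enumerate_uci_moves() -> list:
    """Enumerate every UCI move string that some piece could ever play."""
    moves = set()
    for f1 in range(8):
        for r1 in range(8):
            for f2 in range(8):
                for r2 in range(8):
                    df, dr = abs(f2 - f1), abs(r2 - r1)
                    if df == 0 and dr == 0:
                        continue
                    # Queen-like lines also cover king, rook, bishop,
                    # pawn pushes/captures, and castling (e.g. e1g1).
                    queen_like = df == 0 or dr == 0 or df == dr
                    knight_like = {df, dr} == {1, 2}
                    if queen_like or knight_like:
                        moves.add(square(f1, r1) + square(f2, r2))
    # Promotions: white from rank 7 to 8, black from rank 2 to 1,
    # straight ahead or capturing onto an adjacent file.
    for from_rank, to_rank in ((6, 7), (1, 0)):
        for f1 in range(8):
            for f2 in (f1 - 1, f1, f1 + 1):
                if 0 <= f2 < 8:
                    base = square(f1, from_rank) + square(f2, to_rank)
                    for piece in PROMOTION_PIECES:
                        moves.add(base + piece)
    return sorted(moves)


uci_moves = enumerate_uci_moves()
vocab_sketch = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + uci_moves)}
print(len(uci_moves), len(vocab_sketch))
# → 1968 1977
```

The total breaks down as 1,792 queen-like and knight-like square pairs plus 176 promotion strings (22 pawn steps per colour, times 4 pieces, times 2 colours), which matches the 1,968 move tokens and, with the 9 special tokens, the 1,977-token vocabulary.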

Provider:
MostLime