MostLime/chess-elite-uci
Source: Hugging Face · Updated 2026-03-03 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/MostLime/chess-elite-uci
---
license: cc-by-4.0
task_categories:
- text-generation
- reinforcement-learning
language:
- en
tags:
- chess
- uci
- transformer
- games
- elite
- lichess
- tokenized
size_categories:
- 1M<n<10M
---
# chess-elite-uci
A transformer-ready dataset of ~7.8 million elite chess games, pre-tokenized in UCI notation with a deterministic 1977-token vocabulary. Built for training chess language models directly with no preprocessing required.
## Dataset Summary
| Field | Value |
|---|---|
| Total games | 7,805,503 |
| Average sequence length | 94.24 tokens |
| Max sequence length | 255 tokens |
| Vocabulary size | 1,977 tokens |
| Mean combined Elo | 5,211 (~2,606 per player) |
## Sources
**Lichess Elite Database** (June 2020 – November 2025)
Games between a player rated 2500+ and an opponent rated 2300+ (2022 onwards; before 2022 the thresholds were 2400+ and 2200+). Source: [database.nikonoel.fr](https://database.nikonoel.fr). Licensed CC0.
## Vocabulary
The vocabulary contains **1,977 tokens** and is fully deterministic and enumerated from chess geometry, not derived from data. It will never produce OOV tokens for any legal chess game.
| ID | Token | Description |
|---|---|---|
| 0 | `<PAD>` | Padding |
| 1 | `<W>` | POV token: white wins / white side for draws |
| 2 | `<B>` | POV token: black wins / black side for draws |
| 3 | `<CHECKMATE>` | Terminal: game ended in checkmate |
| 4 | `<RESIGN>` | Terminal: losing side resigned (≥ 40 ply) |
| 5 | `<STALEMATE>` | Terminal: draw by stalemate |
| 6 | `<REPETITION>` | Terminal: draw by threefold repetition |
| 7 | `<FIFTY_MOVE>` | Terminal: draw by 50-move rule |
| 8 | `<INSUFF_MATERIAL>` | Terminal: draw by insufficient material |
| 9+ | `a1a2` … `h7h8q` | 1,968 UCI move strings, sorted lexicographically |
The full vocabulary is provided in `vocab.json` as `{ token_str: int_id }`.
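The count of 1,968 move tokens can be reproduced from board geometry alone. The sketch below is not the author's generator. It assumes the move set is every queen-like or knight-like displacement between two distinct squares, plus pawn promotions to `q`, `r`, `b`, or `n` on both back ranks. The function name `enumerate_uci_moves` is hypothetical, and the sort order here is not guaranteed to match the ID assignment in `vocab.json`, which remains the authoritative mapping.

```python
FILES = "abcdefgh"
RANKS = "12345678"


def enumerate_uci_moves():
    """Enumerate UCI move strings from geometry only (assumed construction)."""
    moves = set()
    for f1 in range(8):
        for r1 in range(8):
            for f2 in range(8):
                for r2 in range(8):
                    df, dr = abs(f2 - f1), abs(r2 - r1)
                    if df == 0 and dr == 0:
                        continue  # a move must change squares
                    queen_like = df == 0 or dr == 0 or df == dr
                    knight_like = {df, dr} == {1, 2}
                    if queen_like or knight_like:
                        moves.add(FILES[f1] + RANKS[r1] + FILES[f2] + RANKS[r2])
    # Promotions: a pawn steps or captures onto the back rank (file shift <= 1).
    for f1 in range(8):
        for f2 in range(8):
            if abs(f2 - f1) <= 1:
                for piece in "qrbn":
                    moves.add(f"{FILES[f1]}7{FILES[f2]}8{piece}")  # white
                    moves.add(f"{FILES[f1]}2{FILES[f2]}1{piece}")  # black
    return sorted(moves)


moves = enumerate_uci_moves()
print(len(moves))
```

Under these assumptions the enumeration yields 1,792 plain moves and 176 promotions, which together with the 9 special tokens gives the stated 1,977. Castling needs no extra entries because `e1g1` and similar king moves are already covered as horizontal displacements.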
## Sequence Format
Every game is encoded as a flat list of integer token IDs:
```
[ <POV> | m1 | m2 | m3 | ... | mN | <TERMINAL> ]
```
- **POV token** (position 0): `<W>` if white wins, `<B>` if black wins. For draws, assigned randomly 50/50 between `<W>` and `<B>`.
- **Move tokens** (positions 1 to N): UCI half-moves alternating white/black, e.g. `e2e4`, `e7e5`, `g1f3`, `e1g1` (castling), `e7e8q` (promotion).
- **Terminal token** (position N+1): encodes why the game ended.
Maximum sequence length is **255 tokens** (1 POV + 253 moves + 1 terminal). Sequences are variable length; pad them to 255 with `<PAD>` (ID 0) in your DataLoader.
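As an illustration of the layout above, the following sketch assembles one sequence from its parts and right-pads it. The helper `encode_game` and the toy vocabulary are hypothetical; real IDs come from `vocab.json`, and the dataset already ships the encoded `token_ids`, so this is only useful for encoding new games in the same format.

```python
def encode_game(pov, moves_uci, terminal, vocab, max_len=255):
    """Build [POV, m1..mN, TERMINAL] and a right-padded copy of it."""
    moves = moves_uci.split()
    if len(moves) > max_len - 2:
        raise ValueError("too many half-moves for the sequence budget")
    ids = [vocab[pov]] + [vocab[m] for m in moves] + [vocab[terminal]]
    padded = ids + [vocab["<PAD>"]] * (max_len - len(ids))
    return ids, padded


# Toy vocabulary for illustration only; move IDs here are made up.
toy_vocab = {"<PAD>": 0, "<W>": 1, "<B>": 2, "<CHECKMATE>": 3,
             "f2f3": 9, "e7e5": 10, "g2g4": 11, "d8h4": 12}

# Fool's mate: Black wins by checkmate after four half-moves.
ids, padded = encode_game("<B>", "f2f3 e7e5 g2g4 d8h4", "<CHECKMATE>", toy_vocab)
print(ids, len(padded))
```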
## NTP Loss Mask
The `ntp_mask` column contains a binary list of the same length as `token_ids`. It indicates which positions should have next-token-prediction (NTP) loss applied during training:
```
Position NTP loss
─────────────────────────────
POV token 1 (always)
Winning side move 1
Losing side move 0 (context only)
Terminal token 1 (always)
Draw game moves 1 (both sides, since neither lost)
```
This implements win-conditioned training: the model learns to predict the winning side's moves given the POV token, while still attending to the losing side's moves as context.
Usage in PyTorch:
```python
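import torch.nn.functional as F
# Assumed shapes: logits (N, vocab_size); labels and ntp_mask (N,), flattened over batch and time
loss = F.cross_entropy(logits, labels, reduction="none")
loss = (loss * ntp_mask).sum() / ntp_mask.sum()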
```
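The mask rules listed above can be sketched as a small function. This is a reconstruction from the table, not the author's code; `build_ntp_mask` is a hypothetical name, and it relies on the fact that the first move token always belongs to White.

```python
def build_ntp_mask(result, n_moves):
    """Return the mask for [POV, m1..mN, TERMINAL] given the game result."""
    if result not in ("1-0", "0-1", "1/2-1/2"):
        raise ValueError(f"unexpected result: {result!r}")
    mask = [1]  # POV token: always trained
    for ply in range(n_moves):
        white_moved = ply % 2 == 0  # even plies are White's moves
        if result == "1/2-1/2":
            mask.append(1)  # draws: both sides contribute loss
        elif result == "1-0":
            mask.append(1 if white_moved else 0)
        else:  # "0-1"
            mask.append(0 if white_moved else 1)
    mask.append(1)  # terminal token: always trained
    return mask


print(build_ntp_mask("1-0", 3))
```

For a three-ply White win the mask is `[1, 1, 0, 1, 1]`: POV, White's move, Black's move (context only), White's move, terminal.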
## Filtering
Games were filtered as follows before inclusion:
**Decisive games (1-0 / 0-1):**
- **Checkmates**: verified by `board.is_checkmate()` on the final position. No length minimum.
- **Resignations**: not checkmate, minimum 40 halfmoves (20 moves each side).
**Draws (1/2-1/2):**
- Only **forced draws** are included: stalemate, insufficient material, 50-move rule, threefold repetition.
- Draw-by-agreement is excluded (`board.is_game_over(claim_draw=True)` must return True).
**All games:**
- Maximum 253 halfmoves (fits within 255-token sequence budget).
- Both player Elo values must be present and non-zero.
- All moves must be legally parseable by python-chess.
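The filtering rules above can be summarised as one decision function. The sketch below is an assumption about how the rules combine, not the pipeline that built the dataset. The name `classify_game` and its parameters are hypothetical, and the checkmate flag and draw reason are expected to be computed upstream, for example with python-chess calls such as `board.is_checkmate()` on the final position.

```python
DRAW_TERMINALS = {
    "stalemate": "<STALEMATE>",
    "repetition": "<REPETITION>",
    "fifty_move": "<FIFTY_MOVE>",
    "insufficient_material": "<INSUFF_MATERIAL>",
}


def classify_game(result, n_halfmoves, is_checkmate, draw_reason=None,
                  white_elo=0, black_elo=0):
    """Return (game_type, terminal) for a kept game, or None if filtered out."""
    if not white_elo or not black_elo:
        return None  # both Elo values must be present and non-zero
    if n_halfmoves > 253:
        return None  # would not fit the 255-token budget
    if result in ("1-0", "0-1"):
        if is_checkmate:
            return ("checkmate", "<CHECKMATE>")  # no length minimum
        if n_halfmoves >= 40:
            return ("resignation", "<RESIGN>")
        return None  # short non-checkmate decisive game
    if result == "1/2-1/2" and draw_reason in DRAW_TERMINALS:
        return ("forced_draw", DRAW_TERMINALS[draw_reason])
    return None  # agreed draws and unknown results are dropped


print(classify_game("0-1", 40, False, white_elo=2550, black_elo=2610))
```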
**Game type breakdown:**
| Type | Count | % |
|---|---|---|
| White checkmate | 1,702,751 | 21.8% |
| White resignation | 2,000,000 | 25.6% |
| Black checkmate | 1,702,752 | 21.8% |
| Black resignation | 2,000,000 | 25.6% |
| Forced draw | 400,000 | 5.1% |
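As a quick arithmetic check, the counts in the table can be summed and converted to percentages. The snippet only restates the numbers above; the variable names are illustrative.

```python
counts = {
    "White checkmate": 1_702_751,
    "White resignation": 2_000_000,
    "Black checkmate": 1_702_752,
    "Black resignation": 2_000_000,
    "Forced draw": 400_000,
}

total = sum(counts.values())
# Share of each game type, rounded to one decimal place as in the table.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The total matches the 7,805,503 games reported in the Dataset Summary.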
## Schema
```python
{
    "white_elo": int32,        # white player Elo
    "black_elo": int32,        # black player Elo
    "combined_elo": int32,     # white_elo + black_elo
    "result": string,          # "1-0", "0-1", or "1/2-1/2"
    "game_type": string,       # "checkmate", "resignation", or "forced_draw"
    "pov": string,             # "<W>" or "<B>"
    "terminal": string,        # "<CHECKMATE>", "<RESIGN>", "<STALEMATE>", ...
    "source": string,          # "lichess"
    "moves_uci": string,       # space-separated UCI moves, human-readable
    "token_ids": list[int32],  # encoded sequence, use this for training
    "ntp_mask": list[int32],   # 1 = apply NTP loss, 0 = skip
    "seq_len": int32,          # len(token_ids), always in [3, 255]
}
```
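The schema implies several invariants that are cheap to check after loading. The sketch below is a hypothetical sanity check, not part of the dataset tooling. `validate_row` is an invented name, and the toy vocabulary stands in for `vocab.json`.

```python
def validate_row(row, vocab):
    """Return a list of human-readable problems; empty means the row looks consistent."""
    problems = []
    n = len(row["token_ids"])
    if row["seq_len"] != n:
        problems.append("seq_len does not match len(token_ids)")
    if len(row["ntp_mask"]) != n:
        problems.append("ntp_mask length differs from token_ids")
    if not 3 <= n <= 255:
        problems.append("sequence length outside [3, 255]")
    if row["combined_elo"] != row["white_elo"] + row["black_elo"]:
        problems.append("combined_elo is not the sum of both Elo values")
    if n and row["token_ids"][0] != vocab[row["pov"]]:
        problems.append("first token is not the POV token")
    if n and row["token_ids"][-1] != vocab[row["terminal"]]:
        problems.append("last token is not the terminal token")
    if len(row["moves_uci"].split()) != n - 2:
        problems.append("moves_uci count does not match token count")
    return problems


# Toy data for illustration only; move IDs are made up.
toy_vocab = {"<PAD>": 0, "<W>": 1, "<B>": 2, "<CHECKMATE>": 3,
             "f2f3": 9, "e7e5": 10, "g2g4": 11, "d8h4": 12}
toy_row = {
    "white_elo": 2510, "black_elo": 2620, "combined_elo": 5130,
    "result": "0-1", "game_type": "checkmate",
    "pov": "<B>", "terminal": "<CHECKMATE>", "source": "lichess",
    "moves_uci": "f2f3 e7e5 g2g4 d8h4",
    "token_ids": [2, 9, 10, 11, 12, 3],
    "ntp_mask": [1, 0, 1, 0, 1, 1],
    "seq_len": 6,
}
print(validate_row(toy_row, toy_vocab))
```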
## Usage
```python
from datasets import load_dataset
import json

# Load dataset
ds = load_dataset("MostLime/chess-elite-uci", split="train")

# Load vocabulary
with open("vocab.json") as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

# Decode a game
row = ds[0]
tokens = [id_to_token[i] for i in row["token_ids"]]
print(" ".join(tokens))
# → <W> e2e4 e7e5 g1f3 b8c6 f1b5 ... <RESIGN>

# PyTorch DataLoader
import torch
from torch.utils.data import DataLoader

def collate(batch):
    max_len = 255
    token_ids = torch.zeros(len(batch), max_len, dtype=torch.long)
    ntp_mask = torch.zeros(len(batch), max_len, dtype=torch.float)
    for i, row in enumerate(batch):
        n = row["seq_len"]
        token_ids[i, :n] = torch.tensor(row["token_ids"], dtype=torch.long)
        ntp_mask[i, :n] = torch.tensor(row["ntp_mask"], dtype=torch.float)
    return {"token_ids": token_ids, "ntp_mask": ntp_mask}

loader = DataLoader(ds, batch_size=32, collate_fn=collate)
```
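The `collate` function above pads every batch to 255 tokens. Since the average sequence is about 94 tokens, padding to the longest sequence in each batch is a common alternative that reduces wasted computation. The sketch below shows that variant with plain Python lists. `pad_to_batch_max` is a hypothetical helper, and whether dynamic padding suits a given training setup is left as an assumption.

```python
def pad_to_batch_max(rows, pad_id=0):
    """Right-pad token_ids and ntp_mask to the longest sequence in the batch."""
    width = max(len(r["token_ids"]) for r in rows)
    token_ids = [r["token_ids"] + [pad_id] * (width - len(r["token_ids"]))
                 for r in rows]
    # Padded positions get mask 0 so they never contribute to the loss.
    ntp_mask = [r["ntp_mask"] + [0] * (width - len(r["ntp_mask"]))
                for r in rows]
    return token_ids, ntp_mask


# Toy rows for illustration only.
rows = [
    {"token_ids": [1, 9, 4], "ntp_mask": [1, 1, 1]},
    {"token_ids": [2, 9, 10, 11, 3], "ntp_mask": [1, 0, 1, 0, 1]},
]
token_ids, ntp_mask = pad_to_batch_max(rows)
print(token_ids)
print(ntp_mask)
```

The resulting nested lists can be passed to `torch.tensor` inside a custom `collate_fn` in place of the fixed-width tensors.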
## Inference
At inference time, prepend the POV token for the side the model plays as, then feed opponent moves as context and sample responses:
```python
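# Model plays as white
sequence = [vocab["<W>"]]
# White moves first: sample the model's opening move (e2e4 is only an example) and append it
sequence.append(vocab["e2e4"])
# Opponent replies e7e5: append it as context
sequence.append(vocab["e7e5"])
# Sample model's next move from legal UCI moves for the current position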
```
Terminal tokens are never generated during normal play. The game ends when the opponent resigns or a draw is claimed externally.
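Restricting sampling to legal moves can be sketched as follows. This is an illustrative helper, not part of the dataset. `sample_legal_move` is a hypothetical name, `logits` is assumed to be a per-token score list indexed by vocabulary ID, and the legal move list is assumed to come from a move generator such as python-chess (`[m.uci() for m in board.legal_moves]`).

```python
import math
import random


def sample_legal_move(logits, legal_uci, vocab, temperature=1.0, rng=random):
    """Sample one UCI move, considering only the legal moves of the position."""
    if not legal_uci:
        raise ValueError("no legal moves: the game is already over")
    scaled = [logits[vocab[m]] / temperature for m in legal_uci]
    top = max(scaled)
    # Softmax over legal moves only; subtracting the max keeps exp() stable.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(legal_uci, weights=weights, k=1)[0]


# Toy vocabulary and scores for illustration only.
toy_vocab = {"e2e4": 9, "d2d4": 10, "g1f3": 11}
logits = [0.0] * 12
logits[10] = 1000.0  # the model strongly prefers d2d4
print(sample_legal_move(logits, ["e2e4", "d2d4"], toy_vocab))
```

Because special tokens and illegal moves are never in `legal_uci`, they cannot be sampled regardless of their scores.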
## Citation
```bibtex
@dataset{mostlime2026chessEliteUCI,
author = {MostLime},
title = {chess-elite-uci: A Transformer-Ready Dataset of Elite Chess Games},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/MostLime/chess-elite-uci}
}
```
## Acknowledgements
- [Lichess Elite Database](https://database.nikonoel.fr) by nikonoel — CC0
- [python-chess](https://python-chess.readthedocs.io) for move parsing and board state verification
- [Modal](https://modal.com) for distributed compute
Provider:
MostLime



