mateuszgrzyb/lichess-stockfish-normalized
Hugging Face | Updated: 2025-11-19 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/mateuszgrzyb/lichess-stockfish-normalized
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-regression
language:
- en
tags:
- chess
- lichess
- stockfish
- position-evaluation
- game-ai
size_categories:
- 100M<n<1B
---
# Lichess Chess Positions: ML-Ready Deduplicated Evaluations
## Dataset Description
A curated dataset of **316,072,343 unique chess positions** with Stockfish evaluations, optimized for training neural networks. This is a deduplicated, ML-ready version of the [Lichess evaluation database](https://database.lichess.org/#evals).
### Why This Dataset?
While Lichess provides deduplicated evaluations in JSONL.zst format, and Hugging Face hosts the full (non-deduplicated) version, this dataset offers:
**Unique advantages:**
- ✅ Deduplicated (like Lichess source)
- ✅ Parquet format (5-10x faster loading than JSONL.zst)
- ✅ Split into 10 manageable parts (easy incremental downloads)
- ✅ Optimized for ML (removed unnecessary columns)
- ✅ 80% smaller than non-deduplicated version
**Comparison:**
| Source | Duplicates | Format | Size | Splits |
|--------|-----------|--------|------|--------|
| [Lichess DB](https://database.lichess.org/#evals) | None | JSONL.zst | \~17GB (\~83GB decompressed) | 1 file |
| [HF Lichess](https://huggingface.co/datasets/Lichess/chess-position-evaluations) | Yes (784M rows) | Parquet | 30GB+ | 16 parts |
| **This dataset** | None (316M rows) | Parquet | ~7GB | 10 parts |
Suited to researchers who want deduplicated data without decompressing the ~17GB JSONL.zst archive into ~83GB of JSONL.
## Dataset Structure
### Data Instance
One row of the dataset looks like this:
```json
{
  "fen": "2bq1rk1/pr3ppn/1p2p3/7P/2pP1B1P/2P5/PPQ2PB1/R3R1K1 w - -",
  "depth": 36,
  "cp": 311,
  "mate": null
}
```
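The `fen` value in the example row carries four space-separated fields (piece placement, active color, castling rights, en passant square) and no halfmove or fullmove counters. The following is a minimal sketch of unpacking such a string with the standard library only; `expand_rank` and `split_fen` are illustrative names, not part of the dataset or of any library.

```python
def expand_rank(rank: str) -> str:
    """Expand each digit in one FEN rank into that many '.' empty squares."""
    return "".join("." * int(ch) if ch.isdigit() else ch for ch in rank)


def split_fen(fen: str) -> dict:
    """Split the first four FEN fields into named parts."""
    placement, side, castling, en_passant = fen.split()[:4]
    board = [expand_rank(rank) for rank in placement.split("/")]
    return {
        "board": board,            # 8 strings, rank 8 first, 8 squares each
        "side_to_move": side,      # "w" or "b"
        "castling": castling,      # e.g. "KQkq", or "-" if none
        "en_passant": en_passant,  # target square, or "-" if none
    }


parsed = split_fen("2bq1rk1/pr3ppn/1p2p3/7P/2pP1B1P/2P5/PPQ2PB1/R3R1K1 w - -")
for row in parsed["board"]:
    print(row)
```

Tools that require a full six-field FEN may need placeholder counters appended; whether that is appropriate depends on the tool.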
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `fen` | string | Chess position in [FEN notation](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation) (pieces, active color, castling rights, en passant) |
| `depth` | int | Search depth reached by Stockfish engine |
| `cp` | int or null | Centipawn evaluation as a signed integer with no fixed bound. `null` if mate is certain |
| `mate` | int or null | Moves until mate. `null` if mate is not certain |
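Because a row carries either `cp` or `mate`, a regression target needs some rule for merging the two. The sketch below shows one possible rule: clip centipawns, treat any forced mate as the clip bound, then squash with `tanh`. The function name `eval_to_target`, the constant `CP_CLIP`, and the `scale` default are illustrative assumptions, and the code assumes the sign of `mate` follows the same side convention as `cp`.

```python
import math
from typing import Optional

CP_CLIP = 1000  # assumed clipping bound in centipawns (illustrative choice)


def eval_to_target(cp: Optional[int], mate: Optional[int], scale: float = 400.0) -> float:
    """Map a (cp, mate) pair to a bounded value in (-1, 1)."""
    if mate is not None:
        # Assumption: the sign of `mate` follows the same side convention as `cp`.
        cp_equiv = CP_CLIP if mate > 0 else -CP_CLIP
    elif cp is not None:
        # Clip extreme centipawn values so outliers do not dominate the loss.
        cp_equiv = max(-CP_CLIP, min(CP_CLIP, cp))
    else:
        raise ValueError("row has neither cp nor mate")
    return math.tanh(cp_equiv / scale)


print(eval_to_target(311, None))  # the example row above
print(eval_to_target(None, 3))    # a forced mate
```

Other mappings (for example a logistic win-probability curve) are equally plausible; the right choice depends on the model and loss.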
### Data Splits
The dataset is split into 10 equally-sized parts (~32M positions each) for convenient downloading:
```python
from datasets import load_dataset
# Load full dataset (all 10 parts)
dataset = load_dataset("mateuszgrzyb/lichess-stockfish-normalized", split="train")
# Or load a percentage slice (the full split may still be downloaded and prepared first)
dataset = load_dataset("mateuszgrzyb/lichess-stockfish-normalized", split="train[:10%]")
# Or load by number of examples
dataset = load_dataset("mateuszgrzyb/lichess-stockfish-normalized", split="train[:1000000]")
```
## Dataset Creation
### Source Data
Original data: [Lichess evaluation database](https://database.lichess.org/#evals)
- **Source**: Lichess analysis board
- **Evaluator**: Stockfish (various versions and depths)
- **Collection**: Produced by Lichess users running Stockfish in the browser during analysis
- **Update frequency**: Monthly (last updated: November 2025)
### Preprocessing Pipeline
The preprocessing was performed as part of the [Searchless Chess project](https://github.com/mateuszgrzyb-pl/searchless-chess):
1. **Data Loading**: Loaded the denormalized positions and evaluations from `Lichess/chess-position-evaluations` (~37GB) in parts
2. **Deduplication**: For each unique FEN, retained only the evaluation with maximum `depth`
   - Original: 784M rows with duplicates
   - After dedup: 316M unique positions
3. **Column removal**: Removed `line` and `knodes` fields (not needed for position evaluation)
4. **Format conversion**: JSONL (the original file format in the Lichess database) → Parquet (faster I/O for ML workflows)
5. **Partitioning**: Split into 10 equal parts for manageable downloads
**Size reduction**: ~83GB (decompressed JSONL) or ~37GB (non-deduplicated Parquet) → ~7GB (deduplicated Parquet), an **over 80% reduction**
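The deduplication and column-removal steps can be illustrated on a toy scale. This is a minimal in-memory sketch, not the project's actual pipeline: the function name `dedupe_by_max_depth` and the sample rows are made up for illustration, and a dictionary over 784M rows would likely need an out-of-core approach instead.

```python
from typing import Dict, Iterable


def dedupe_by_max_depth(rows: Iterable[dict]) -> Dict[str, dict]:
    """Keep, for each FEN, the row with the greatest `depth`; drop extra columns."""
    keep = ("fen", "depth", "cp", "mate")
    best: Dict[str, dict] = {}
    for row in rows:
        current = best.get(row["fen"])
        # Strict comparison: on equal depth the first row seen is retained.
        if current is None or row["depth"] > current["depth"]:
            best[row["fen"]] = {k: row.get(k) for k in keep}
    return best


# Toy rows (made up for illustration, not taken from the dataset).
rows = [
    {"fen": "A", "depth": 20, "cp": 15, "mate": None, "line": "e2e4", "knodes": 100},
    {"fen": "A", "depth": 36, "cp": 31, "mate": None, "line": "d2d4", "knodes": 900},
    {"fen": "B", "depth": 12, "cp": None, "mate": 3, "line": "h5f7", "knodes": 50},
]
unique = dedupe_by_max_depth(rows)
print(len(unique), unique["A"]["depth"])
```

How ties on `depth` were broken in the real pipeline is not stated; the sketch simply keeps the first row encountered.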
### Quality Metrics
- **Unique positions**: 316,072,343
- **Average file size**: ~650MB per part
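As a quick consistency check, the figures quoted in this card can be related to each other with a few lines of arithmetic. All inputs below are the rounded values stated in the card; the variable names are illustrative.

```python
unique_positions = 316_072_343
original_rows = 784_000_000  # "784M rows", rounded in the card
parts = 10
avg_part_mb = 650            # "~650MB per part"
source_parquet_gb = 37       # "~37GB" non-deduplicated Parquet
dedup_parquet_gb = 7         # "~7GB" deduplicated Parquet

positions_per_part = unique_positions / parts
total_gb = parts * avg_part_mb / 1000
row_reduction = 1 - unique_positions / original_rows
size_reduction = 1 - dedup_parquet_gb / source_parquet_gb

print(f"positions per part: {positions_per_part / 1e6:.1f}M")
print(f"total size from parts: {total_gb:.1f} GB")
print(f"row reduction: {row_reduction:.1%}")
print(f"size reduction: {size_reduction:.1%}")
```

With these rounded inputs, each part holds about 31.6M positions, the ten parts total about 6.5GB, roughly 60% of rows are removed, and the Parquet size drops by about 81%.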
## Usage Example
### Basic Loading
```python
from datasets import load_dataset
# Load full dataset
dataset = load_dataset("mateuszgrzyb/lichess-stockfish-normalized", split="train")
# Access data
print(f"Total: {len(dataset)}")
print(dataset[0])
```
### Incremental Loading (Memory-Efficient)
```python
from datasets import load_dataset
# Load one 10% slice at a time
for i in range(10):
    part = load_dataset(
        "mateuszgrzyb/lichess-stockfish-normalized",
        split=f"train[{i*10}%:{(i+1)*10}%]",
    )
    # Process part; train_on_part is a placeholder for your own training step
    train_on_part(part)
```
```
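When memory or disk is the constraint, the streaming mode of the `datasets` library is another option, since it iterates over rows without materializing the whole split locally. The sketch below is one way to consume such a stream in fixed-size batches; `batched`, `stream_positions`, and `preview` are illustrative helper names, and the batch size of 1024 is arbitrary.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def stream_positions():
    """Open the train split in streaming mode (needs `datasets` and network access)."""
    from datasets import load_dataset  # lazy import keeps `batched` stdlib-only

    return load_dataset(
        "mateuszgrzyb/lichess-stockfish-normalized",
        split="train",
        streaming=True,
    )


def preview(n_batches: int = 2, batch_size: int = 1024) -> None:
    """Print the size and first FEN of the first few streamed batches."""
    for batch in islice(batched(stream_positions(), batch_size), n_batches):
        print(len(batch), batch[0]["fen"])
```

Calling `preview()` fetches only as much data as the first batches require. Streamed rows arrive in file order, so shuffling needs separate handling.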
## Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{grzyb2025lichess,
author = {Grzyb, Mateusz},
title = {Lichess Chess Positions: ML-Ready Deduplicated Evaluations},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/mateuszgrzyb/lichess-stockfish-normalized}}
}
```
And cite the original Lichess database:
```bibtex
@misc{lichess2024database,
author = {Lichess},
title = {Lichess Evaluation Database},
year = {2024},
url = {https://database.lichess.org}
}
```
## Related Resources
- 📂 **Project Repository**: [Searchless Chess on GitHub](https://github.com/mateuszgrzyb-pl/searchless-chess)
- 📄 **Inspiration**: [Grandmaster-Level Chess Without Search](https://arxiv.org/abs/2402.04494) (DeepMind, 2024)
- ♟️ **Original Dataset**: [Lichess Evaluation Database](https://database.lichess.org/#evals)
- 🤗 **Non-Deduplicated Version**: [Lichess/chess-position-evaluations](https://huggingface.co/datasets/Lichess/chess-position-evaluations)
## License
This dataset is licensed under **CC BY 4.0**.
Original data from [Lichess](https://database.lichess.org/#evals) is licensed under CC0 1.0 (Public Domain).
## Dataset Curator
Created by **Mateusz Grzyb** as part of the [Searchless Chess project](https://github.com/mateuszgrzyb-pl/searchless-chess).
## Changelog
**v1.0.0** (November 2025)
- Initial release
- 316M deduplicated positions
- 10-part split in Parquet format
---
*Dataset last updated: November 2025*
Provider:
mateuszgrzyb



