dotneB/chess_lakehouse
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/dotneB/chess_lakehouse
---
license: cc-by-nc-sa-4.0
pretty_name: Chess Lakehouse
tags:
- chess
- games
- game
- lumbrasgigabase
---
# Chess Lakehouse
Is it a dataset, a data warehouse or a data lake? It's a lakehouse 🏠🏞️🛶
## Versions
- [Lichess's openings dataset](https://huggingface.co/datasets/Lichess/chess-openings): 2026-01-05
- [Lumbras Gigabase](https://lumbrasgigabase.com/en/download-in-pgn-format-en/): 2026-03-03
- [duckdb chess extension](https://github.com/dotneB/duckdb-chess): 0.6.0
## Process
Pipeline process: https://github.com/dotneB/chess_lakehouse
- Uses DVC
- Data sources:
  - [Lumbras Gigabase](https://lumbrasgigabase.com/en/download-in-pgn-format-en/) PGNs
- Uses the DuckDB [chess extension](https://duckdb.org/community_extensions/extensions/chess) ([source](https://github.com/dotneB/duckdb-chess)) to parse PGN into DuckDB
- Exports to Parquet with a schema similar to the Lichess games datasets.
### Dataset Enrichment
- ECO, Opening:
  - Uses Lichess's [openings dataset](https://huggingface.co/datasets/Lichess/chess-openings) to resolve the ECO and Opening fields.
- UTCDate:
  - Uses the first, most complete date value available, following the fallback chain `UTCDate` -> `Date` -> `EventDate`
  - Some dates are incomplete or use placeholders. An unknown month or day (`??`) becomes `01`: `2000.??.??` -> `2000-01-01`, `2000.06.??` -> `2000-06-01`
  - An out-of-range day is clamped to the end of the month when the year and month are valid: `2015.11.31` -> `2015-11-30`, `1997.02.29` -> `1997-02-28`
- TimeControl:
  - Best-effort, lenient parsing into a normalized form such as `40/5400+30:1800+30`
  - Lenient shorthands: `3+2` -> `180+2`, `75|30` -> `4500+30`, `15 + 10` -> `900+10`
  - Apostrophe forms: `10'+5''` -> `600+5`, `3' + 2''/mv from move 1` -> `180+2`
  - Free text: `3 mins + 2 seconds increment` -> `180+2`, `90 minutes for 40 moves + 30 minutes for the rest + 30 seconds per move from move one` -> `40/5400+30:1800+30`
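
The UTCDate rules above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the pipeline's actual code: the function names `normalize_pgn_date` and `resolve_utc_date` are hypothetical, and the sketch assumes that "first, most complete" can be approximated by taking the first header in the fallback chain that parses at all.

```python
import calendar
import re
from datetime import date
from typing import Optional

# PGN dates look like YYYY.MM.DD, where MM and DD may be the placeholder "??".
_DATE_RE = re.compile(r"^(\d{4})\.(\d{2}|\?\?)\.(\d{2}|\?\?)$")


def normalize_pgn_date(raw: Optional[str]) -> Optional[date]:
    """Turn a PGN date string into a date, or None if it cannot be used."""
    if not raw:
        return None
    match = _DATE_RE.match(raw.strip())
    if match is None:
        return None
    year = int(match.group(1))
    # Unknown month/day placeholders become 01.
    month = 1 if match.group(2) == "??" else int(match.group(2))
    day = 1 if match.group(3) == "??" else int(match.group(3))
    # Assumption: an invalid year or month, or a day of 00, is rejected.
    if year < 1 or not 1 <= month <= 12 or day < 1:
        return None
    # Clamp an out-of-range day to the last day of that month.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day, last_day))


def resolve_utc_date(headers: dict) -> Optional[date]:
    """Apply the fallback chain UTCDate -> Date -> EventDate."""
    for key in ("UTCDate", "Date", "EventDate"):
        parsed = normalize_pgn_date(headers.get(key))
        if parsed is not None:
            return parsed
    return None
```

For example, `resolve_utc_date({"Date": "2015.11.31"})` falls through the missing `UTCDate` header and clamps the day to the end of November.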
## Dataset Structure
This dataset is hive-partitioned into multiple Parquet files on three keys: `DataSource`, `year`, and `month`:
```
├─ data
│  └─ DataSource=LumbrasGigabase_Online
│     └─ year=2015
│        └─ month=01
│           └─ data_0.parquet
```
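
Because the partition keys are encoded in the directory names, a subset of the data can be selected from the path alone. The sketch below is a minimal, standard-library illustration of that layout. The helper names `parse_hive_path` and `partition_glob` are hypothetical, and the zero-padded month is an assumption based on the `month=01` example above.

```python
from pathlib import PurePosixPath


def parse_hive_path(path: str) -> dict:
    """Extract the key=value partition segments from a hive-style path."""
    partitions = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def partition_glob(data_source: str = "*", year="*", month="*", root: str = "data") -> str:
    """Build a glob pattern matching the Parquet files of the chosen partitions."""
    # Months appear zero-padded in the layout (month=01), so pad integers.
    month_part = month if month == "*" else f"{int(month):02d}"
    return f"{root}/DataSource={data_source}/year={year}/month={month_part}/*.parquet"
```

A pattern such as `partition_glob(year=2015, month=1)` can then be handed to any Parquet reader that accepts globs and understands hive partitioning, so that only the matching files are scanned.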
## Dataset Fields
```
┌─────────────┬─────────────────────┐
│ column_name │     column_type     │
│   varchar   │       varchar       │
├─────────────┼─────────────────────┤
│ Event       │ VARCHAR             │
│ Site        │ VARCHAR             │
│ White       │ VARCHAR             │
│ Black       │ VARCHAR             │
│ Result      │ VARCHAR             │
│ WhiteTitle  │ VARCHAR             │
│ BlackTitle  │ VARCHAR             │
│ WhiteElo    │ UINTEGER            │
│ BlackElo    │ UINTEGER            │
│ UTCDate     │ DATE                │
│ UTCTime     │ TIME WITH TIME ZONE │
│ ECO         │ VARCHAR             │
│ Opening     │ VARCHAR             │
│ Termination │ VARCHAR             │
│ TimeControl │ VARCHAR             │
│ Source      │ VARCHAR             │
│ movetext    │ VARCHAR             │
│ DataSource  │ VARCHAR             │
│ month       │ VARCHAR             │
│ year        │ BIGINT              │
├─────────────┴─────────────────────┤
│ 20 columns                        │
└───────────────────────────────────┘
```
## Credits
- [Lumbras Gigabase](https://lumbrasgigabase.com/en/download-in-pgn-format-en/): for collecting and curating a high-quality collection of chess game PGNs
Provider: dotneB



