mauricett/lichess_sf

Name: mauricett/lichess_sf
Creator: mauricett
Published: 2024-02-15 13:47:15
License: not specified on this page (the dataset card's front matter declares cc0-1.0)

Hugging Face, updated 2024-02-15, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/mauricett/lichess_sf


Resource description:

---
license: cc0-1.0
tags:
- chess
- stockfish
pretty_name: Lichess Games With Stockfish Analysis
---

# Condensed Lichess Database

This dataset is a condensed version of the Lichess database. It only includes games for which Stockfish evaluations were available. Currently, the dataset contains the entire year 2023, which consists of >100M games and >2B positions. Games are stored in a format that is much faster to process than the original PGN data.

Requirements:

```
pip install zstandard python-chess datasets
```

# Quick Guide

In the following, I explain the data format and how to use the dataset. At the end, you will find a complete example script.

### 1. Loading The Dataset

You can stream the data without storing it locally (~100 GB currently). The dataset requires `trust_remote_code=True` to execute the [custom data loading script](https://huggingface.co/datasets/mauricett/lichess_sf/blob/main/lichess_sf.py), which is necessary to decompress the files. See [HuggingFace's documentation](https://huggingface.co/docs/datasets/main/en/load_hub#remote-code) if you're unsure.

```py
from datasets import load_dataset

# Load dataset.
dataset = load_dataset(path="mauricett/lichess_sf",
                       split="train",
                       streaming=True,
                       trust_remote_code=True)
```

### 2. Data Format

The following definitions are important to understand. Please read this section slowly and carefully before you decide how to draw FENs, moves and scores from the dataset. Let's draw a single sample and discuss it.

```py
example = next(iter(dataset))
```

A single sample from the dataset contains one complete chess game as a dictionary. The dictionary keys are as follows:

1. `example['fens']` --- A list of FENs in a slightly stripped format, missing the halfmove clock and fullmove number (see [definitions on wiki](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation#Definition)). The starting positions have been excluded (no player made a move yet).
2. `example['moves']` --- A list of moves in [UCI format](https://en.wikipedia.org/wiki/Universal_Chess_Interface). `example['moves'][42]` is the move that **led to** position `example['fens'][42]`, etc.
3. `example['scores']` --- A list of Stockfish evaluations (in centipawns) and the game's terminal outcome condition if one exists. Evaluations are from the perspective of the player who is next to move. If `example['fens'][42]` is black's turn, `example['scores'][42]` will be from black's perspective. If the game ended with a terminal condition, the last element of the list is a string 'C' (checkmate), 'S' (stalemate) or 'I' (insufficient material). Games with other outcome conditions have been excluded.
4. `example['WhiteElo'], example['BlackElo']` --- The players' Elo ratings.

### 3. Define Functions for Preprocessing

To use the data, you will need to define your own functions for transforming the data into your desired format. For this guide, let's define a few mock functions so I can show you how to use them.

```py
import random

# A mock tokenizer and functions for demonstration.
class Tokenizer:
    def __init__(self):
        pass

    def __call__(self, example):
        return example


# Transform Stockfish score and terminal outcomes.
def score_fn(score):
    return score


def preprocess(example, tokenizer, score_fn):
    # Get number of moves made in the game...
    max_ply = len(example['moves'])
    # ...and pick a position at random.
    random_position = random.randint(0, max_ply-2)

    # Get the FEN of our random choice.
    fen = example['fens'][random_position]

    # To get the move that leads to the *next* FEN, we have to add
    # +1 to the index. Same with the score, which is the evaluation
    # of that move. Please read the section about the data format carefully!
    move = example['moves'][random_position + 1]
    score = example['scores'][random_position + 1]

    # Transform data into the format of your choice.
    example['fens'] = tokenizer(fen)
    example['moves'] = tokenizer(move)
    example['scores'] = score_fn(score)
    return example


tokenizer = Tokenizer()
```

### 4. Shuffle And Preprocess

Use `dataset.shuffle()` to properly shuffle the dataset. Use `dataset.map()` to apply our preprocessors. This will process individual samples in parallel if you're using multiprocessing (e.g. with PyTorch dataloader).

```py
# Shuffle and apply your own preprocessing.
dataset = dataset.shuffle(seed=42)
dataset = dataset.map(preprocess, fn_kwargs={'tokenizer': tokenizer,
                                             'score_fn': score_fn})
```

# COMPLETE EXAMPLE

You can try pasting this into Colab and it should work fine. Have fun!

```py
import random
from datasets import load_dataset
from torch.utils.data import DataLoader


# A mock tokenizer and functions for demonstration.
class Tokenizer:
    def __init__(self):
        pass

    def __call__(self, example):
        return example


def score_fn(score):
    # Transform Stockfish score and terminal outcomes.
    return score


def preprocess(example, tokenizer, score_fn):
    # Get number of moves made in the game...
    max_ply = len(example['moves'])
    # ...and pick a position at random.
    random_position = random.randint(0, max_ply-2)

    # Get the FEN of our random choice.
    fen = example['fens'][random_position]

    # To get the move that leads to the *next* FEN, we have to add
    # +1 to the index. Same with the score, which is the evaluation
    # of that move. Please read the section about the data format carefully!
    move = example['moves'][random_position + 1]
    score = example['scores'][random_position + 1]

    # Transform data into the format of your choice.
    example['fens'] = tokenizer(fen)
    example['moves'] = tokenizer(move)
    example['scores'] = score_fn(score)
    return example


tokenizer = Tokenizer()

# Load dataset.
dataset = load_dataset(path="mauricett/lichess_sf",
                       split="train",
                       streaming=True,
                       trust_remote_code=True)

# Shuffle and apply your own preprocessing.
dataset = dataset.shuffle(seed=42)
dataset = dataset.map(preprocess, fn_kwargs={'tokenizer': tokenizer,
                                             'score_fn': score_fn})

# PyTorch dataloader
dataloader = DataLoader(dataset, batch_size=1, num_workers=1)

for batch in dataloader:
    # do stuff
    print(batch)
    break

# Batch now looks like:
# {'WhiteElo': tensor([1361]), 'BlackElo': tensor([1412]), 'fens': ['3R4/5ppk/p1b2rqp/1p6/8/5P1P/1PQ3P1/7K w - -'], 'moves': ['g8h7'], 'scores': ['-535']}
# Much better!
```
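The mock `score_fn` in the guide returns its input unchanged. As a rough sketch of what a real replacement might look like, the function below squashes centipawn scores into the range [-1, 1] and maps the terminal codes to fixed values. Several things here are assumptions rather than facts from the dataset card: that centipawn scores arrive as integers or integer-like strings (the sample batch shows `'-535'`), that `'C'` should count as a loss for the side to move, that draws are worth 0, and the `scale` constant of 400, which is arbitrary.

```python
import math

# Assumed values for the terminal codes described in the data format:
# 'C' = checkmate (side to move is mated), 'S' = stalemate,
# 'I' = insufficient material. The numeric targets are our own choice.
TERMINAL_VALUES = {"C": -1.0, "S": 0.0, "I": 0.0}


def score_fn(score, scale=400.0):
    """Map a raw score to [-1, 1] from the side to move's perspective."""
    if score in TERMINAL_VALUES:
        return TERMINAL_VALUES[score]
    # int() raises ValueError on any format we did not anticipate,
    # which is safer than silently guessing.
    centipawns = int(score)
    return math.tanh(centipawns / scale)


print(score_fn("-535"))  # a negative value: the side to move is worse
print(score_fn("C"))     # the assumed value for checkmate
```

Because the function has the same name and first argument as the mock, it could be passed to `dataset.map()` through `fn_kwargs` in the same way.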

Provided by:

mauricett

Summary of the original information

Dataset overview

Dataset name

Lichess Games With Stockfish Analysis

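Dataset description

This dataset is a condensed version of the Lichess database that includes only games with Stockfish evaluations. It currently covers all of 2023, totalling more than 100 million games and more than 2 billion positions. Games are stored in a format that is faster to process than the original PGN data.

Data format

Each sample contains one complete chess game, stored as a dictionary with the following keys:

`example['fens']` - A list of FENs in a slightly stripped format, without the halfmove clock and fullmove number. Starting positions are excluded.
`example['moves']` - A list of moves in UCI format. `example['moves'][42]` is the move that led to position `example['fens'][42]`.
`example['scores']` - A list of Stockfish evaluations (in centipawns), plus the game's terminal outcome condition if there is one. Evaluations are from the perspective of the player who is next to move: if `example['fens'][42]` is black's turn, `example['scores'][42]` is from black's perspective. If the game ended with a terminal condition, the last element of the list is the string C (checkmate), S (stalemate) or I (insufficient material). Games with other outcome conditions are excluded.
`example['WhiteElo']`, `example['BlackElo']` - The players' Elo ratings.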
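The format rules above can be turned into a few small helpers. This is only a sketch: the function names are hypothetical, the `0` and `1` used to pad the FEN are placeholder values (the real halfmove clock and fullmove number are not in the dataset), and the score passed in is an illustrative number rather than a value read from the data.

```python
def full_fen(stripped_fen, halfmove_clock=0, fullmove_number=1):
    """Append placeholder halfmove clock and fullmove number fields."""
    return f"{stripped_fen} {halfmove_clock} {fullmove_number}"


def side_to_move(fen):
    """Return 'w' or 'b', which is the second FEN field."""
    return fen.split()[1]


def white_pov(centipawns, fen):
    """Convert a side-to-move score into a score from White's perspective."""
    return centipawns if side_to_move(fen) == "w" else -centipawns


# Stripped FEN taken from the sample batch in the dataset card.
fen = "3R4/5ppk/p1b2rqp/1p6/8/5P1P/1PQ3P1/7K w - -"

print(full_fen(fen))         # 3R4/5ppk/p1b2rqp/1p6/8/5P1P/1PQ3P1/7K w - - 0 1
print(white_pov(-535, fen))  # -535, because White is already the side to move
```

Padding the FEN this way is mainly useful for tools that insist on all six FEN fields. The sign flip in `white_pov` follows directly from the rule that every evaluation is stored from the perspective of the player to move.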

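Loading the data

The dataset supports streaming, so it does not have to be stored locally (about 100 GB at present). Loading it requires `trust_remote_code=True` so that the custom data loading script can run.

Load the dataset

    from datasets import load_dataset

    dataset = load_dataset(path="mauricett/lichess_sf",
                           split="train",
                           streaming=True,
                           trust_remote_code=True)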

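Preprocessing

Users need to define their own functions to transform the data into the format they want. An example preprocessing function follows:

Example preprocessing function

    import random

    def preprocess(example, tokenizer, score_fn):
        # Pick a random position that still has a following move.
        max_ply = len(example['moves'])
        random_position = random.randint(0, max_ply-2)
        fen = example['fens'][random_position]
        # Index +1 gives the move played from that position and its score.
        move = example['moves'][random_position + 1]
        score = example['scores'][random_position + 1]
        example['fens'] = tokenizer(fen)
        example['moves'] = tokenizer(move)
        example['scores'] = score_fn(score)
        return example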
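One edge case is worth noting: `random.randint(0, max_ply-2)` raises `ValueError` when a game has fewer than two moves, because the range is empty. Whether such games actually occur in the dataset is not stated, so the following is a defensive sketch only. The helper names and the three-ply toy game are hypothetical.

```python
import random


def has_enough_plies(example, min_plies=2):
    """True if the game is long enough to sample a position and its successor."""
    return len(example["moves"]) >= min_plies


def sample_training_triple(example, rng=random):
    """Pick a random position and return (fen, next_move, next_score)."""
    max_ply = len(example["moves"])
    if max_ply < 2:
        raise ValueError("game too short to sample a position and its successor")
    i = rng.randint(0, max_ply - 2)
    # moves[i + 1] is the move played from fens[i]; scores[i + 1] belongs
    # to the position that move leads to.
    return example["fens"][i], example["moves"][i + 1], example["scores"][i + 1]


# Hypothetical three-ply game with placeholder FEN strings, for illustration.
game = {
    "fens": ["fen_after_e4", "fen_after_e5", "fen_after_Nf3"],
    "moves": ["e2e4", "e7e5", "g1f3"],
    "scores": ["30", "-25", "35"],
}

print(has_enough_plies(game))
print(sample_training_triple(game, random.Random(0)))
```

A predicate like `has_enough_plies` could be applied with `dataset.filter(...)` before `dataset.map(...)`, so that short games never reach the preprocessing function.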

Dataset operations

Use `dataset.shuffle()` to shuffle the dataset and `dataset.map()` to apply the preprocessing function.

Shuffling and preprocessing

    dataset = dataset.shuffle(seed=42)
    dataset = dataset.map(preprocess, fn_kwargs={'tokenizer': tokenizer,
                                                 'score_fn': score_fn})
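With `streaming=True` the data is never fully in memory, so a shuffle can only be approximate. To my understanding, the `datasets` library handles this with a fixed-size shuffle buffer (the `buffer_size` argument of `shuffle()`). The pure-Python sketch below illustrates that general technique. It is not the library's actual implementation, and `buffer_shuffle` is a hypothetical name.

```python
import random


def buffer_shuffle(stream, buffer_size, seed=None):
    """Yield the items of `stream` in approximately shuffled order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Stream exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=42))
print(shuffled)
```

Every item is emitted exactly once, but an item can only move a limited distance from its original position, so a larger buffer gives a better shuffle at the cost of more memory.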

Complete example

The following complete example shows how to load, preprocess and use the dataset:

    import random
    from datasets import load_dataset
    from torch.utils.data import DataLoader

    # A mock tokenizer for demonstration.
    class Tokenizer:
        def __init__(self):
            pass

        def __call__(self, example):
            return example

    def score_fn(score):
        return score

    tokenizer = Tokenizer()

    dataset = load_dataset(path="mauricett/lichess_sf",
                           split="train",
                           streaming=True,
                           trust_remote_code=True)

    # preprocess() is the example preprocessing function defined above.
    dataset = dataset.shuffle(seed=42)
    dataset = dataset.map(preprocess, fn_kwargs={'tokenizer': tokenizer,
                                                 'score_fn': score_fn})

    dataloader = DataLoader(dataset, batch_size=1, num_workers=1)

    for batch in dataloader:
        print(batch)
        break
