Waterhorse/chess_data
Chess dataset
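Dataset Description
Dataset Summary
This dataset contains the three data sources described in the paper:
- ChessCLIP dataset: annotated PGNs used to train CLIP.
- ChessGPT Base dataset: the game, language, and mixture datasets used to train ChessGPT-Base.
- ChessGPT Chat dataset: the conversational dataset used to train ChessGPT-Chat.
For legal reasons, the ChessGPT datasets do not release the chess-book, chess-forum, chess-blog, and YouTube-transcript datasets, and the ChessCLIP dataset does not release the two commercial annotated datasets we used.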
Languages
The primary language of the dataset is English.
Dataset Structure
- ChessCLIP dataset: annotated PGNs used to train CLIP.
- ChessGPT Base dataset:
  - Game dataset: ccrl, pro_player, lichess_db_37, chess_puzzles, chess_modeling.
  - Language dataset: redpajama, oscar, c4, pile, wikipedia, stackexchange.
  - Mixture dataset: annotated_pgn.
- ChessGPT Chat dataset: chess-related conversational dataset.
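The ChessGPT Base subsets listed above can be kept in a small lookup table when only some of them are needed. The following is a minimal sketch: the dictionary name, its keys (`game`, `language`, `mixture`), and the `subsets` helper are illustrative assumptions, and the subset names are copied from the list above without any claim that they match file or directory names in the repository.

```python
# Hypothetical index of the ChessGPT Base subsets named in this card.
# The keys are illustrative labels, not guaranteed paths in the repository.
CHESSGPT_BASE_SUBSETS = {
    "game": ["ccrl", "pro_player", "lichess_db_37", "chess_puzzles", "chess_modeling"],
    "language": ["redpajama", "oscar", "c4", "pile", "wikipedia", "stackexchange"],
    "mixture": ["annotated_pgn"],
}


def subsets(kinds=None):
    """Return subset names for the requested kinds (all kinds if None)."""
    kinds = list(CHESSGPT_BASE_SUBSETS) if kinds is None else kinds
    return [name for kind in kinds for name in CHESSGPT_BASE_SUBSETS[kind]]


# Example: select only the game subsets.
game_only = subsets(["game"])
print(game_only)
```

A flat list of names like this is convenient for filtering before download, since the legal restrictions noted earlier mean not every source used in the paper is available.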
Data Instances
- ChessCLIP dataset:

    [Event "GMA, Wijk aan Zee NED"]
    [Site "?"]
    [Date "2003.??.??"]
    [Round "1"]
    [White "Anand,V"]
    [Black "Radjabov,T"]
    [Result "1/2"]
    [WhiteElo "2750"]
    [BlackElo "2620"]
    [ECO "C12"]
    [PlyCount "55"]
    [Annotator "Hathaway"]

    1. e4 e6 { I'm not terribly familiar with the style of Radjabov, so I don't know if this is his usual opening. }
    2. d4 d5 3. Nc3 Nf6 (3...Bb4 { The Winawer Variation is probably best, though not as easy to play. }) 4. Bg5 { threatens e4-e5xf6 } (4. e5 { keeps pieces on the board and avoids ...dxe4 }) 4...Bb4 (4...Be7 { is more common and aims to trade dark-square bishops to ease Black's cramp }) (4...dxe4 { aims to avoid any cramp by bringing pieces into alignment for trading, though White does get at least one very good piece (Ne4 or Bg5) and an easier time castling queen-side, to stir up king-side threats } ...
- ChessGPT dataset:

    {"metadata": {"id": "65268522", "title": "1923 USSR Chess Championship", "url": "https://en.wikipedia.org/wiki/1923%20USSR%20Chess%20Championship"}, "text": "The 1923 USSR Chess Championship was the second edition of USSR Chess Championship. Held from 8 to 24 July in Petrograd. The tournament was won by Peter Romanovsky. ", "pipeline_key": "wiki-squad-train.jsonl.zst-10549"}
- ChessGPT Chat dataset:

    {"conversations": [
      {"author": "Human 0", "text": "Analyze the given sentence and identify its grammatical errors, if any. These type of problems are difficult to solve.", "url": "gpt-4-alpaca", "score": 0},
      {"author": "Human 1", "text": "The sentence is incorrect. The error is in the usage of \"these\". Since \"these\" is a plural determiner, it should be used with a plural noun. Similarly, \"type\" is singular noun and it is used with the singular determiner, \"this\".\nThe correct sentence should be \"This type of problem is difficult to solve.\" or \"These types of problems are difficult to solve.\"", "url": "gpt-4-alpaca", "score": 0}
    ]}
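The three record shapes shown above can be read with the Python standard library alone. The sketch below is illustrative rather than the authors' loading code: the function names are hypothetical, the PGN and JSON samples are abbreviated copies of the instances above, and the regular expressions cover only simple tag pairs and non-nested `{ ... }` comments. A full PGN parser would be needed to handle variations and move trees.

```python
import json
import re

# Tag pairs look like [White "Anand,V"]; annotations look like { text }.
HEADER_RE = re.compile(r'\[(\w+)\s+"([^"]*)"\]')
COMMENT_RE = re.compile(r"\{([^{}]*)\}")


def parse_pgn_headers(pgn):
    """Collect PGN tag pairs into a dict (hypothetical helper)."""
    return dict(HEADER_RE.findall(pgn))


def extract_comments(pgn):
    """Return the text of every { ... } annotation, whitespace-trimmed."""
    return [comment.strip() for comment in COMMENT_RE.findall(pgn)]


def conversation_to_text(record):
    """Join the turns of a chat record as 'author: text' lines."""
    return "\n".join(f'{t["author"]}: {t["text"]}' for t in record["conversations"])


# Abbreviated ChessCLIP-style sample (variations removed for brevity).
pgn_sample = '''[White "Anand,V"]
[Black "Radjabov,T"]
[ECO "C12"]

1. e4 e6 2. d4 d5 3. Nc3 Nf6 4. Bg5 { threatens e4-e5xf6 } 4...Bb4
'''
headers = parse_pgn_headers(pgn_sample)
comments = extract_comments(pgn_sample)

# Abbreviated ChessGPT-style record: one JSON object per line.
line = '{"metadata": {"title": "1923 USSR Chess Championship"}, "text": "Held from 8 to 24 July in Petrograd."}'
doc = json.loads(line)

# Abbreviated ChessGPT Chat-style record.
chat = {"conversations": [
    {"author": "Human 0", "text": "Analyze the given sentence.", "url": "gpt-4-alpaca", "score": 0},
    {"author": "Human 1", "text": "The sentence is incorrect.", "url": "gpt-4-alpaca", "score": 0},
]}

print(headers["White"], "vs", headers["Black"], comments)
print(doc["metadata"]["title"])
print(conversation_to_text(chat))
```

Pairing each extracted comment with the moves that precede it is one plausible way to build text and board pairs for contrastive training, but how the paper constructs those pairs is not stated in this card.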
Data Splits
The data is not split.
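Because no splits are provided, anyone who needs a held-out set has to create one. One common approach, sketched here as an assumption rather than anything the authors prescribe, is to hash a stable record identifier so that the assignment is reproducible across runs and machines. The function name `assign_split` and the 5% validation fraction are hypothetical choices.

```python
import hashlib


def assign_split(record_id, validation_fraction=0.05):
    """Deterministically map a record id to 'train' or 'validation'.

    The id is hashed with SHA-256 and the first 8 bytes are scaled to
    [0, 1); ids whose value falls below the fraction go to validation.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "train"


# Example: the same id always lands in the same split.
print(assign_split("65268522"))
```

Hashing avoids storing a separate split file and keeps the assignment stable when new records are appended, at the cost of the split sizes being only approximately equal to the requested fraction.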
Dataset Creation
For more details, see the paper "ChessGPT: Bridging Policy Learning and Language Modeling".
Licensing Information
- Annotated PGN: PGNlib, lichess, GameKnot, pathtomaster
- Game dataset: Lichess dataset, CCRL, pro-player, puzzle, Modeling data (Apache-2.0)
- Language dataset: Wikipedia, Redpajama, Oscar, Pile, StackExchange, C4
- Conversation dataset: Chessable forums, Reddit, gpt-4, sharegpt, oasst1 (Apache-2.0), dolly-v2 (MIT)
Citation Information

    @article{feng2023chessgpt,
      title={ChessGPT: Bridging Policy Learning and Language Modeling},
      author={Feng, Xidong and Luo, Yicheng and Wang, Ziyan and Tang, Hongrui and Yang, Mengyue and Shao, Kun and Mguni, David and Du, Yali and Wang, Jun},
      journal={arXiv preprint arXiv:2306.09200},
      year={2023}
    }




