Sudoku-CTC-Reasoning
Source: ModelScope community; updated 2025-12-04; indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/SakanaAI/Sudoku-CTC-Reasoning
Description:
<h1 align="center">
<b>Sudoku-Bench</b><br>
</h1>
<p align="center">
📝 <a href="https://pub.sakana.ai/sudoku">[Leaderboard]</a>
📝 <a href="https://arxiv.org/abs/2505.16135">[Technical Report]</a>
📝 <a href="https://sakana.ai/sudoku-bench">[Blog Post]</a><br>
🤗 <a href="https://huggingface.co/datasets/SakanaAI/Sudoku-Bench">[Sudoku-Bench puzzle dataset]</a>
🤗 <a href="https://huggingface.co/datasets/SakanaAI/Sudoku-CTC-Reasoning">[Sudoku-CTC-Reasoning dataset]</a>
</p>
## Sudoku-CTC-Reasoning dataset
The Sudoku-CTC-Reasoning dataset contains reasoning traces for 1351 puzzles featured on the [Cracking the Cryptic](https://www.youtube.com/c/CrackingTheCryptic) YouTube channel. It provides rich learning signals for training LMs to reason in Sudoku and, more broadly, in other reasoning-intensive tasks.
> [!NOTE]
> This dataset is provided with permission from [Cracking the Cryptic](https://www.youtube.com/c/CrackingTheCryptic).
## Data statistics
- Videos with reasoning traces: 1351
- Total actions: 3539008
- Total actions (excluding highlights): 818145
- Total words: 8921707
> [!NOTE]
> There will be updates to the dataset with more reasoning traces, so please stay tuned.
## Combined ASR and Action Dataset
<img width="1403" alt="Image" src="https://github.com/user-attachments/assets/e8ff55ef-ebbe-4488-a045-57ba8c1f8d64" />
For each video we provide `action_data` and `asr_data`. The `action_data` is extracted from the YouTube video using a video-to-actions pipeline (described below) and consists of the sequence of actions taken in the SudokuPad app as the host solves the puzzle. The `asr_data` is extracted from the YouTube video using [Whisper](https://github.com/openai/whisper).
## Loading the dataset
The dataset has two subsets. We document the `raw` subset here, and the `processed` subset is described in the [Sudoku-Bench's data_processing README](https://github.com/SakanaAI/Sudoku-Bench/tree/main/src/ctc_processing).
```python
import datasets
dataset = datasets.load_dataset('SakanaAI/Sudoku-CTC-Reasoning', 'raw')
```
Each entry of `dataset` contains the following fields:
- `youtube_id`: the id of the YouTube video
- `action_data`: the SudokuPad actions from the video
- `asr_data`: the audio transcript of the puzzle
- `puzzle_id`: the id of the puzzle
- `puzzle_data`: the puzzle data following the format of the Sudoku puzzle datasets.
## Actions format
The `action_data` for each video is a list of action groups. Each action group has the following fields:
- `idx`: the index of the action group in the video
- `frame`: the frame number of the corresponding frame in the video
- `time`: the time in seconds of the action group using the video's time axis
- `actions`: a list of strings (serialized actions) taken on the SudokuPad board between the previous frame and the current frame
- `serialized_state`: the serialized state of the SudokuPad board.
Typically each action group contains a single action.
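
As a minimal sketch of how the action groups can be traversed, the snippet below tallies serialized actions by their type prefix. It assumes `action_data` has already been decoded into a Python list of dicts with the fields listed above; the example groups are hypothetical and not taken from the dataset.

```python
from collections import Counter


def count_action_types(action_data):
    """Count serialized actions by their type prefix (e.g. 'cd', 'vl')."""
    counts = Counter()
    for group in action_data:
        for action in group["actions"]:
            # The action type is the text before the first ':'.
            counts[action.split(":", 1)[0]] += 1
    return counts


# Hypothetical action groups shaped like the fields described above.
example_groups = [
    {"idx": 0, "frame": 120, "time": 4.0,
     "actions": ["cd:+:7:r3c5"], "serialized_state": "..."},
    {"idx": 1, "frame": 150, "time": 5.0,
     "actions": ["cd:+:8:r3c5"], "serialized_state": "..."},
    {"idx": 2, "frame": 300, "time": 10.0,
     "actions": ["vl:+:7:r3c5"], "serialized_state": "..."},
]

print(count_action_types(example_groups))
```

A tally like this is a quick way to check how much of a trace consists of each action type before deciding which ones to keep.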
### Serialized action format
The serialized action `'cd:+:7:r3c5'` denotes "add a candidate 7 to cell r3c5". In general we use the following action types:
Action type:
- `vl`: value (i.e. the actual proposed value of the cell)
- `cd`: candidate or center pencilmark
- `pm`: corner pencilmark
- `co`: color
- `sl`: select
- `ds`: deselect
Operation type:
- `+`: add to the current list
- `-`: remove from the current list
Value:
- `1-9`: for value, candidates, and pencilmarks
- `0-9`: for color, with mapping [here](https://github.com/SakanaAI/Sudoku-Bench/blob/main/src/sudokupad_interaction/app.py#L26).
Coordinates:
- `rxcy`: row and column of the action. In `sl` and `ds` actions, `rxcy` is a comma-separated list.
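
The format above can be parsed with a few lines of Python. This is an illustrative sketch, not code from Sudoku-Bench: the `Action` class and both helper names are hypothetical. Because the text does not spell out whether `sl` and `ds` strings carry an operation or value field, the parser treats the first field as the action type, the last field as the coordinates, and anything in between as optional.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Action:
    kind: str             # 'vl', 'cd', 'pm', 'co', 'sl' or 'ds'
    op: Optional[str]     # '+' or '-', if present
    value: Optional[str]  # digit or color index, if present
    cells: List[str]      # e.g. ['r3c5'] (several entries for sl/ds)


def parse_action(serialized: str) -> Action:
    """Split a serialized action into its colon-separated fields."""
    parts = serialized.split(":")
    kind = parts[0]
    # Assumption: coordinates are always the last field when present.
    coords = parts[-1] if len(parts) > 1 else ""
    middle = parts[1:-1]
    op = middle[0] if len(middle) >= 1 else None
    value = middle[1] if len(middle) >= 2 else None
    cells = [c for c in coords.split(",") if c]
    return Action(kind, op, value, cells)


def cell_to_rc(cell: str) -> Tuple[int, int]:
    """Convert 'r3c5' into the 1-indexed (row, column) pair (3, 5)."""
    match = re.fullmatch(r"r(\d+)c(\d+)", cell)
    if match is None:
        raise ValueError(f"not a cell reference: {cell!r}")
    return int(match.group(1)), int(match.group(2))


action = parse_action("cd:+:7:r3c5")
print(action, cell_to_rc(action.cells[0]))
```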
### Serialized state format
The serialized state can be loaded into [SudokuPad](https://github.com/SakanaAI/Sudoku-Bench/tree/main/src/sudokupad_interaction) by
```python
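import requests

# Requires sudokupad_interaction/app.py to be running locally.
# `serialized_state` is the `serialized_state` field of an action group.
response = requests.put("http://localhost:8000/set_state", json={"serialized_state": serialized_state})
response.raise_for_status()  # fail loudly if the request was rejected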
```
The format of serialized state follows that used internally by SudokuPad: For each cell, use `/` to separate value, candidate, pencilmark, color, highlight, pen-tool, respectively, with trailing `/`s removed.
Example: A `serialized_state` of `{"cells":["6","/1,2,4,5", ...` indicates that `r1c1` has a value of 6, and `r1c2` has candidates (center small digits) 1, 2, 4, 5.
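
The per-cell layout can be unpacked as sketched below. The helper `parse_cell` and the `FIELDS` tuple are hypothetical names. The sketch assumes that `serialized_state` is a JSON string with a `cells` list, and that entries within a field are comma-separated, as the example above suggests.

```python
import json

# Field order within a cell string, as described above.
FIELDS = ("value", "candidates", "pencilmarks", "colors", "highlight", "pen")


def parse_cell(cell: str) -> dict:
    """Split one cell string on '/' and restore the trimmed trailing fields."""
    parts = cell.split("/")
    parts += [""] * (len(FIELDS) - len(parts))  # pad removed trailing fields
    fields = dict(zip(FIELDS, parts))
    parsed = {"value": fields["value"] or None}
    for name in FIELDS[1:]:
        # Assumption: multiple entries in a field are comma-separated.
        parsed[name] = [tok for tok in fields[name].split(",") if tok]
    return parsed


# Hypothetical two-cell state, using the cell strings from the example.
state = json.loads('{"cells": ["6", "/1,2,4,5"]}')
for cell in state["cells"]:
    print(parse_cell(cell))
```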
## ASR format
The `asr_data` is the output of Whisper ASR using `model.transcribe(audio_file, language="en", task="transcribe", word_timestamps=True)` for `model = whisper.load_model("turbo")`. Please see [Whisper's documentation](https://github.com/openai/whisper) for details.
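
With `word_timestamps=True`, Whisper's result is a dict whose `segments` each carry a `words` list with `word`, `start` and `end` entries. The sketch below collects the words spoken in a time window, which is one way to line up the transcript with the `time` of an action group. It assumes `asr_data` is available as a decoded dict in that layout; the function name and the miniature transcript are hypothetical.

```python
def words_between(asr_result, start, end):
    """Return the words whose start time falls in [start, end)."""
    words = []
    for segment in asr_result["segments"]:
        for word in segment.get("words", []):
            if start <= word["start"] < end:
                # Whisper words usually carry a leading space; strip it.
                words.append(word["word"].strip())
    return words


# Hypothetical miniature transcript in Whisper's result layout.
example_asr = {
    "text": " So this cell must be a seven.",
    "segments": [
        {
            "start": 10.0,
            "end": 12.5,
            "text": " So this cell must be a seven.",
            "words": [
                {"word": " So", "start": 10.0, "end": 10.2},
                {"word": " this", "start": 10.3, "end": 10.5},
                {"word": " cell", "start": 10.6, "end": 10.9},
                {"word": " must", "start": 11.0, "end": 11.2},
                {"word": " be", "start": 11.3, "end": 11.5},
                {"word": " a", "start": 11.6, "end": 11.7},
                {"word": " seven.", "start": 11.8, "end": 12.4},
            ],
        }
    ],
}

print(words_between(example_asr, 11.0, 12.5))
```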
## Video-to-actions summary
Extracting SudokuPad actions from the video is a multi-step process:
1. Detect the x, y, height, width of the Sudoku board in the video
2. Detect the x, y, height, width of the corresponding Sudoku board in the SudokuPad app with the same puzzle loaded
3. Using 1. and 2. and the location of individual cell rendering in the browser window in SudokuPad (the lines that make up the typically 9x9 grid), determine the corresponding cell locations in the YouTube video.
4. Take a sequence of keyframes from the video cropped to the Sudoku board. A keyframe is a frame where a pixel-wise change occurred above a threshold. A keyframe is a candidate for when an action was taken.
5. For each keyframe, use a trained ResNet classifier to map the pixel input to a multiclass prediction for each of the (typically 81) SudokuPad cell states, which include the colors, pencilmarks (corner small digits), candidates (center small digits), and current value (center large digit). The ResNet was trained on synthetic data from the SudokuPad app. The exact images to feed into the ResNet are determined by the cell locations from 1. and 2. Combine the individual cell state predictions to obtain a single board state for this keyframe.
6. From the sequence of states, determine the state-diffs to get the sequence of actions, saved as `action_data`.
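
Since the pipeline code is not published, the following is only an illustrative sketch of steps 3 and 6, not the authors' implementation. All names are hypothetical. Boxes are assumed to be `(x, y, width, height)` tuples, a board state is assumed to be a dict from cell name to a dict holding a value and sets of candidates, pencilmarks and colors, and removal actions are assumed to follow the same `type:op:value:cell` layout as additions.

```python
def map_cell_to_video(cell_box, app_board, video_board):
    """Step 3: map a cell box from SudokuPad app coordinates to video coordinates."""
    cx, cy, cw, ch = cell_box
    ax, ay, aw, ah = app_board
    vx, vy, vw, vh = video_board
    sx, sy = vw / aw, vh / ah  # scale factors between the two boards
    return (vx + (cx - ax) * sx, vy + (cy - ay) * sy, cw * sx, ch * sy)


def make_cell(value=None, candidates=(), pencilmarks=(), colors=()):
    """Build one cell of a hypothetical board-state representation."""
    return {"value": value, "candidates": set(candidates),
            "pencilmarks": set(pencilmarks), "colors": set(colors)}


KIND_BY_FIELD = {"candidates": "cd", "pencilmarks": "pm", "colors": "co"}


def diff_states(prev, curr):
    """Step 6: turn the difference between two board states into actions."""
    actions = []
    for cell in sorted(curr):
        before = prev.get(cell, make_cell())
        after = curr[cell]
        if before["value"] != after["value"]:
            if before["value"] is not None:
                actions.append(f"vl:-:{before['value']}:{cell}")
            if after["value"] is not None:
                actions.append(f"vl:+:{after['value']}:{cell}")
        for field, kind in KIND_BY_FIELD.items():
            for item in sorted(before[field] - after[field]):
                actions.append(f"{kind}:-:{item}:{cell}")
            for item in sorted(after[field] - before[field]):
                actions.append(f"{kind}:+:{item}:{cell}")
    return actions


# A cell at (200, 150) in the app maps into a half-size board in the video.
print(map_cell_to_video((200, 150, 100, 100), (100, 50, 900, 900), (20, 10, 450, 450)))

# Placing a 7 in r3c5 also clears its two candidates.
prev_state = {"r3c5": make_cell(candidates={"7", "8"})}
curr_state = {"r3c5": make_cell(value="7")}
print(diff_states(prev_state, curr_state))
```

Diffing consecutive keyframe states rather than classifying actions directly means that a single misread frame shows up as a spurious add followed by a remove, which is easy to spot when inspecting `action_data`.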
> [!NOTE]
> The code for the video-to-actions pipeline itself is not open-sourced as part of [Sudoku-Bench](https://github.com/SakanaAI/Sudoku-Bench).
## References
- [CTC YouTube channel](https://www.youtube.com/c/CrackingTheCryptic)
- [CTC catalogue](https://ctc-catalogue.com/)
## Citation
```bibtex
@misc{seely2025sudoku-bench,
title={{Sudoku-Bench}},
author={Seely, Jeffrey and Imajuku, Yuki and Zhao, Tianyu and Cetin, Edoardo and Jones, Llion},
howpublished = {\url{https://github.com/SakanaAI/Sudoku-Bench}},
year={2025}
}
```
Provider:
maas
Created:
2025-04-22



