mhieuuu/ac-solver-dataset
收藏Hugging Face2026-02-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/mhieuuu/ac-solver-dataset
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
language:
- en
task_categories:
- reinforcement-learning
- text-generation
tags:
- mathematics
- group-theory
- andrews-curtis
- miller-schupp
- combinatorial-group-theory
- transformer
- hardness-oracle
size_categories:
- 1M<n<10M
pretty_name: AC-Solver Transformer Dataset
---
# AC-Solver Transformer Dataset
Training corpus for a decoder-only Transformer language model on the
**Andrews-Curtis (AC) conjecture** — one of the longest-standing open problems in
combinatorial group theory, and a challenging benchmark for reinforcement learning.
> **Paper:** [What Makes Math Problems Hard for Reinforcement Learning: A Case Study](https://arxiv.org/abs/2408.15332)
> **Code:** [github.com/shehper/AC-Solver](https://github.com/shehper/AC-Solver)
---
## Table of Contents
1. [Background: The Andrews-Curtis Conjecture](#background-the-andrews-curtis-conjecture)
2. [The Miller-Schupp Series](#the-miller-schupp-series)
3. [Why This Dataset Exists](#why-this-dataset-exists)
4. [Generation Process](#generation-process)
5. [Schema](#schema)
6. [Encoding Details](#encoding-details)
7. [Dataset Statistics](#dataset-statistics)
8. [Usage](#usage)
9. [Intended Use & Limitations](#intended-use--limitations)
10. [Related Resources](#related-resources)
11. [Citation](#citation)
---
## Background: The Andrews-Curtis Conjecture
A **balanced presentation** of a group is a description of the form
```
⟨ x, y | r₀, r₁ ⟩
```
where `x` and `y` are generators (and their inverses `x⁻¹`, `y⁻¹`), and `r₀`, `r₁`
are relator words — strings over `{x, x⁻¹, y, y⁻¹}`. When the relators completely
determine that `x = y = 1` (the identity), the presentation describes the **trivial group**,
and the simplest such presentation is `⟨ x, y | x, y ⟩`.
The **Andrews-Curtis (AC) conjecture** (1965) asserts:
> *Every balanced presentation of the trivial group can be reduced to* `⟨ x, y | x, y ⟩`
> *by a finite sequence of AC moves.*
The **AC moves** are:
1. Replace `rᵢ` with `rᵢ · rⱼ` (concatenate relator `j` onto relator `i`)
2. Replace `rᵢ` with `w · rᵢ · w⁻¹` for any word `w` (conjugate a relator)
3. Replace `rᵢ` with `rᵢ⁻¹` (invert a relator)
Despite being 60 years old, the conjecture remains open. The difficulty is that some
presentations may require solution paths of **millions or billions of steps** — far
beyond current search algorithms.
In code, a presentation `⟨ x, y | r₀, r₁ ⟩` is represented as a pair of integer arrays
using the encoding `x=1, x⁻¹=-1, y=2, y⁻¹=-2`, with zeros as right-padding:
```python
# ⟨ x, y | x²y⁻³, xy x⁻¹y⁻¹x⁻¹y⁻¹ ⟩
r0 = [1, 1, -2, -2, -2, 0, 0, ...] # x x y⁻¹ y⁻¹ y⁻¹
r1 = [1, 2, 1, -2, -1, -2, 0, ...] # x y x⁻¹ y⁻¹ x⁻¹ y⁻¹
presentation = r0 + r1 # length 2 × lmax
```
---
## The Miller-Schupp Series
The **Miller-Schupp series** is a parametric family of balanced trivial-group
presentations studied extensively as potential hard instances of the AC conjecture:
```
MS(n, w) = ⟨ x, y | xⁿ y⁻¹ x⁻¹ y x y⁻ⁿ y⁻¹, w x w⁻¹ y ⟩
```
This dataset is built from the **1,190 Miller-Schupp seed presentations** used in
the paper, spanning a range of parameters that produce presentations of varying
difficulty. They are stored in
`ac_solver/search/miller_schupp/data/all_presentations.txt`.
Of these 1,190 seeds:
- **533 are GS-solved**: reachable from the trivial state by greedy search (a
polynomial-time heuristic). These are "easy" in the sense that short solution
paths exist.
- **657 are GS-unsolved**: not solved by greedy search within the allowed budget.
Some may be provably AC-trivial via longer paths; others are potential
counterexamples to the conjecture.
---
## Why This Dataset Exists
The Transformer is trained as a **language model over presentation sequences** to
learn the statistical structure of the AC graph neighbourhood. Its learned
representations serve two downstream purposes:
1. **Hardness Oracle `H(P)`**: The EOS-token embedding of a presentation is fed
to a small MLP classifier to predict whether the presentation is GS-solved or
GS-unsolved (F1 ≈ 0.962 target). This oracle scores the difficulty of any
presentation encountered during RL training.
2. **Enriched State Representation**: The 512-dimensional Transformer embedding
can be concatenated to the raw RL state, giving the PPO policy access to
long-range structural information beyond what the raw int8 array conveys.
---
## Generation Process
Presentations are generated via **Algorithm 6** from the paper: a stratified
Markov-Chain Monte Carlo (MCMC) walk in presentation space.
```
for each of the 1,190 seed presentations p:
for phase in 0 … 127: # gradually increase lmax
lmax_phase = l_min + phase * step # presentation space grows
for chain in 0 … 11: # 12 independent chains
start from p
apply 1,000 random AC′ moves # free+cyclic reduction after each
record the resulting presentation
```
**Key design choices:**
- **Phased lmax**: Early phases restrict relator length, producing presentations
near the seed. Later phases allow longer relators, exploring harder regions.
- **12 independent chains**: Reduces correlation between samples from the same seed.
- **AC′ moves** (the 12 AC move variants used in training) include concatenation
and conjugation, each with all four generator choices — but exclude inversion
(which doesn't change relator length and adds little signal).
- **Free + cyclic reduction** after every move ensures canonical form.
This produces a balanced corpus: every seed contributes exactly
`128 × 12 = 1,536` presentations, totalling `1,190 × 1,536 = 1,827,840` rows.
---
## Schema
| Column | Dtype | Shape | Description |
|---|---|---|---|
| `presentation` | `int8` | `(256,)` | Two relators concatenated and zero-padded to `lmax=128` each. |
| `pres_idx` | `int32` | scalar | Index `0–1189` into the 1,190 Miller-Schupp seeds. |
| `phase` | `int32` | scalar | MCMC phase index `0–127`. |
| `chain` | `int32` | scalar | Markov chain index `0–11`. |
| `gs_solved` | `bool` | scalar | `True` if the seed (`pres_idx`) is GS-solved. |
---
## Encoding Details
### PPO / RL encoding (used in this dataset)
| Symbol | `int8` value |
|---|---|
| `x` | `1` |
| `x⁻¹` | `-1` |
| `y` | `2` |
| `y⁻¹` | `-2` |
| padding | `0` |
`presentation[0:128]` = relator `r0`, right-padded with zeros to length 128.
`presentation[128:256]` = relator `r1`, right-padded with zeros to length 128.
### Transformer token encoding (for language model training)
The companion tokenizer (`ac_solver/transformer/tokenizer.py`) converts the PPO
encoding to a 6-token vocabulary for the Transformer:
| Token ID | Symbol | PPO value |
|---|---|---|
| `0` | `x` | `1` |
| `1` | `x⁻¹` | `-1` |
| `2` | `y` | `2` |
| `3` | `y⁻¹` | `-2` |
| `4` | `SEP` | *(separator between r₀ and r₁)* |
| `5` | `EOS` | *(end of sequence)* |
A tokenized presentation looks like:
```
[r0_tokens...] SEP [r1_tokens...] EOS
```
Zeros (padding) are stripped before tokenization. The EOS token's hidden state
is used as the presentation's embedding vector.
---
## Dataset Statistics
| Statistic | Value |
|---|---|
| Total presentations | 1,827,840 |
| Seed presentations (Miller-Schupp) | 1,190 |
| GS-solved seeds | 533 (44.8%) |
| GS-unsolved seeds | 657 (55.2%) |
| Presentations per seed | 1,536 (128 phases × 12 chains) |
| Max relator length (`lmax`) | 128 |
| Presentation array length | 256 (2 × 128) |
| AC′ moves per chain | 1,000 |
| Random seed | 42 |
| Generation time | ~3.5 hours (CPU) |
| Raw shard size | ~38 MB/shard × 12 shards |
---
## Usage
### Basic loading
```python
from datasets import load_dataset
import numpy as np
ds = load_dataset("mhieuuu/ac-solver-dataset", split="train")
print(ds)
# Dataset({features: ['presentation', 'pres_idx', 'phase', 'chain', 'gs_solved'], num_rows: 1827840})
```
### Decode a single presentation
```python
row = ds[42]
presentation = np.array(row["presentation"], dtype=np.int8) # (256,)
r0 = presentation[:128]
r1 = presentation[128:]
# Strip padding (zeros) to get actual words
r0_word = r0[r0 != 0].tolist()
r1_word = r1[r1 != 0].tolist()
print(f"Seed index : {row['pres_idx']}")
print(f"Phase : {row['phase']}")
print(f"Chain : {row['chain']}")
print(f"GS-solved : {row['gs_solved']}")
print(f"r0 word : {r0_word}") # e.g. [1, 1, -2, -2, -2]
print(f"r1 word : {r1_word}") # e.g. [1, 2, 1, -2, -1, -2]
```
### Filter by difficulty label
```python
# Only presentations from GS-solved seeds
easy = ds.filter(lambda row: row["gs_solved"])
# Only presentations from GS-unsolved seeds
hard = ds.filter(lambda row: not row["gs_solved"])
print(f"Easy: {len(easy):,} Hard: {len(hard):,}")
# Easy: 818,688 Hard: 1,009,152
```
### Group by seed and iterate over chains
```python
import pandas as pd
df = ds.to_pandas()
# All 1536 presentations derived from seed 0
seed0 = df[df["pres_idx"] == 0].sort_values(["phase", "chain"])
print(seed0.shape) # (1536, 5)
```
### Streaming (memory-efficient for large jobs)
```python
from datasets import load_dataset
ds = load_dataset("mhieuuu/ac-solver-dataset", split="train", streaming=True)
for row in ds.take(3):
pres = row["presentation"]
print(f"pres_idx={row['pres_idx']} gs_solved={row['gs_solved']} "
f"r0_len={sum(1 for v in pres[:128] if v != 0)}")
```
### Use with the AC-Solver Transformer
```python
import torch
import numpy as np
from ac_solver.transformer.model import ACTransformer
from ac_solver.transformer.tokenizer import presentation_to_tokens
# Load model (after training)
model = ACTransformer()
model.load_state_dict(torch.load("ac_solver/transformer/checkpoints/lm_checkpoint.pt")["model"])
model.eval()
# Get 512-dim embedding for a presentation from this dataset
from datasets import load_dataset
ds = load_dataset("mhieuuu/ac-solver-dataset", split="train", streaming=True)
row = next(iter(ds))
pres = np.array(row["presentation"], dtype=np.int8)
tokens = presentation_to_tokens(pres, max_relator_length=128)
token_ids = torch.tensor(tokens, dtype=torch.long).unsqueeze(0) # (1, T)
with torch.no_grad():
hidden = model.get_hidden_states(token_ids) # (1, T, 512)
embedding = hidden[0, -1, :] # EOS token → (512,)
print(embedding.shape) # torch.Size([512])
```
---
## Intended Use & Limitations
**Intended uses:**
- Training the ACTransformer language model as a prior over AC presentation space
- Learning a hardness oracle `H(P)` to predict GS-solved / GS-unsolved
- Providing enriched state representations for PPO-based RL agents
- Benchmarking representation learning on combinatorial group theory objects
**Limitations:**
- All presentations are generated from the Miller-Schupp family only; the
distribution may not generalise to other families of balanced presentations.
- The `gs_solved` label reflects **greedy search reachability**, not AC-triviality.
A `gs_solved=False` seed may still be AC-trivial via a longer path.
- The dataset does not include solution paths — only the presentations themselves.
- Presentations from the same `pres_idx` are correlated (same seed, nearby MCMC
walks). Stratify splits by `pres_idx` to avoid leakage.
---
## Related Resources
| Resource | Link |
|---|---|
| Paper (arXiv) | [arxiv.org/abs/2408.15332](https://arxiv.org/abs/2408.15332) |
| Code (GitHub) | [github.com/shehper/AC-Solver](https://github.com/shehper/AC-Solver) |
| W&B Training Runs | [wandb.ai/shehper/AC-Solver-PPO](https://wandb.ai/shehper/AC-Solver-PPO) |
| What is AC? (primer) | [What_is_AC.md](https://github.com/shehper/AC-Solver/blob/main/What_is_AC.md) |
---
## Citation
If you use this dataset, please cite the accompanying paper:
```bibtex
@misc{shehper2024makesmathproblemshard,
title={What makes math problems hard for reinforcement learning: a case study},
author={Ali Shehper and Anibal M. Medina-Mardones and Bartłomiej Lewandowski
and Angus Gruen and Piotr Kucharski and Sergei Gukov},
year={2024},
eprint={2408.15332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2408.15332},
}
```
提供机构:
mhieuuu



