mhieuuu/ac-solver-dataset

Name: mhieuuu/ac-solver-dataset
Creator: mhieuuu
Published: 2026-02-24 21:49:32
License: 暂无描述

Hugging Face2026-02-24 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/mhieuuu/ac-solver-dataset

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: mit language: - en task_categories: - reinforcement-learning - text-generation tags: - mathematics - group-theory - andrews-curtis - miller-schupp - combinatorial-group-theory - transformer - hardness-oracle size_categories: - 1M<n<10M pretty_name: AC-Solver Transformer Dataset --- # AC-Solver Transformer Dataset Training corpus for a decoder-only Transformer language model on the **Andrews-Curtis (AC) conjecture** — one of the longest-standing open problems in combinatorial group theory, and a challenging benchmark for reinforcement learning. > **Paper:** [What Makes Math Problems Hard for Reinforcement Learning: A Case Study](https://arxiv.org/abs/2408.15332) > **Code:** [github.com/shehper/AC-Solver](https://github.com/shehper/AC-Solver) --- ## Table of Contents 1. [Background: The Andrews-Curtis Conjecture](#background-the-andrews-curtis-conjecture) 2. [The Miller-Schupp Series](#the-miller-schupp-series) 3. [Why This Dataset Exists](#why-this-dataset-exists) 4. [Generation Process](#generation-process) 5. [Schema](#schema) 6. [Encoding Details](#encoding-details) 7. [Dataset Statistics](#dataset-statistics) 8. [Usage](#usage) 9. [Intended Use & Limitations](#intended-use--limitations) 10. [Related Resources](#related-resources) 11. [Citation](#citation) --- ## Background: The Andrews-Curtis Conjecture A **balanced presentation** of a group is a description of the form ``` ⟨ x, y | r₀, r₁ ⟩ ``` where `x` and `y` are generators (and their inverses `x⁻¹`, `y⁻¹`), and `r₀`, `r₁` are relator words — strings over `{x, x⁻¹, y, y⁻¹}`. When the relators completely determine that `x = y = 1` (the identity), the presentation describes the **trivial group**, and the simplest such presentation is `⟨ x, y | x, y ⟩`. The **Andrews-Curtis (AC) conjecture** (1965) asserts: > *Every balanced presentation of the trivial group can be reduced to* `⟨ x, y | x, y ⟩` > *by a finite sequence of AC moves.* The **AC moves** are: 1. Replace `rᵢ` with `rᵢ · rⱼ` (concatenate relator `j` onto relator `i`) 2. Replace `rᵢ` with `w · rᵢ · w⁻¹` for any word `w` (conjugate a relator) 3. Replace `rᵢ` with `rᵢ⁻¹` (invert a relator) Despite being 60 years old, the conjecture remains open. The difficulty is that some presentations may require solution paths of **millions or billions of steps** — far beyond current search algorithms. In code, a presentation `⟨ x, y | r₀, r₁ ⟩` is represented as a pair of integer arrays using the encoding `x=1, x⁻¹=-1, y=2, y⁻¹=-2`, with zeros as right-padding: ```python # ⟨ x, y | x²y⁻³, xy x⁻¹y⁻¹x⁻¹y⁻¹ ⟩ r0 = [1, 1, -2, -2, -2, 0, 0, ...] # x x y⁻¹ y⁻¹ y⁻¹ r1 = [1, 2, 1, -2, -1, -2, 0, ...] # x y x⁻¹ y⁻¹ x⁻¹ y⁻¹ presentation = r0 + r1 # length 2 × lmax ``` --- ## The Miller-Schupp Series The **Miller-Schupp series** is a parametric family of balanced trivial-group presentations studied extensively as potential hard instances of the AC conjecture: ``` MS(n, w) = ⟨ x, y | xⁿ y⁻¹ x⁻¹ y x y⁻ⁿ y⁻¹, w x w⁻¹ y ⟩ ``` This dataset is built from the **1,190 Miller-Schupp seed presentations** used in the paper, spanning a range of parameters that produce presentations of varying difficulty. They are stored in `ac_solver/search/miller_schupp/data/all_presentations.txt`. Of these 1,190 seeds: - **533 are GS-solved**: reachable from the trivial state by greedy search (a polynomial-time heuristic). These are "easy" in the sense that short solution paths exist. - **657 are GS-unsolved**: not solved by greedy search within the allowed budget. Some may be provably AC-trivial via longer paths; others are potential counterexamples to the conjecture. --- ## Why This Dataset Exists The Transformer is trained as a **language model over presentation sequences** to learn the statistical structure of the AC graph neighbourhood. Its learned representations serve two downstream purposes: 1. **Hardness Oracle `H(P)`**: The EOS-token embedding of a presentation is fed to a small MLP classifier to predict whether the presentation is GS-solved or GS-unsolved (F1 ≈ 0.962 target). This oracle scores the difficulty of any presentation encountered during RL training. 2. **Enriched State Representation**: The 512-dimensional Transformer embedding can be concatenated to the raw RL state, giving the PPO policy access to long-range structural information beyond what the raw int8 array conveys. --- ## Generation Process Presentations are generated via **Algorithm 6** from the paper: a stratified Markov-Chain Monte Carlo (MCMC) walk in presentation space. ``` for each of the 1,190 seed presentations p: for phase in 0 … 127: # gradually increase lmax lmax_phase = l_min + phase * step # presentation space grows for chain in 0 … 11: # 12 independent chains start from p apply 1,000 random AC′ moves # free+cyclic reduction after each record the resulting presentation ``` **Key design choices:** - **Phased lmax**: Early phases restrict relator length, producing presentations near the seed. Later phases allow longer relators, exploring harder regions. - **12 independent chains**: Reduces correlation between samples from the same seed. - **AC′ moves** (the 12 AC move variants used in training) include concatenation and conjugation, each with all four generator choices — but exclude inversion (which doesn't change relator length and adds little signal). - **Free + cyclic reduction** after every move ensures canonical form. This produces a balanced corpus: every seed contributes exactly `128 × 12 = 1,536` presentations, totalling `1,190 × 1,536 = 1,827,840` rows. --- ## Schema | Column | Dtype | Shape | Description | |---|---|---|---| | `presentation` | `int8` | `(256,)` | Two relators concatenated and zero-padded to `lmax=128` each. | | `pres_idx` | `int32` | scalar | Index `0–1189` into the 1,190 Miller-Schupp seeds. | | `phase` | `int32` | scalar | MCMC phase index `0–127`. | | `chain` | `int32` | scalar | Markov chain index `0–11`. | | `gs_solved` | `bool` | scalar | `True` if the seed (`pres_idx`) is GS-solved. | --- ## Encoding Details ### PPO / RL encoding (used in this dataset) | Symbol | `int8` value | |---|---| | `x` | `1` | | `x⁻¹` | `-1` | | `y` | `2` | | `y⁻¹` | `-2` | | padding | `0` | `presentation[0:128]` = relator `r0`, right-padded with zeros to length 128. `presentation[128:256]` = relator `r1`, right-padded with zeros to length 128. ### Transformer token encoding (for language model training) The companion tokenizer (`ac_solver/transformer/tokenizer.py`) converts the PPO encoding to a 6-token vocabulary for the Transformer: | Token ID | Symbol | PPO value | |---|---|---| | `0` | `x` | `1` | | `1` | `x⁻¹` | `-1` | | `2` | `y` | `2` | | `3` | `y⁻¹` | `-2` | | `4` | `SEP` | *(separator between r₀ and r₁)* | | `5` | `EOS` | *(end of sequence)* | A tokenized presentation looks like: ``` [r0_tokens...] SEP [r1_tokens...] EOS ``` Zeros (padding) are stripped before tokenization. The EOS token's hidden state is used as the presentation's embedding vector. --- ## Dataset Statistics | Statistic | Value | |---|---| | Total presentations | 1,827,840 | | Seed presentations (Miller-Schupp) | 1,190 | | GS-solved seeds | 533 (44.8%) | | GS-unsolved seeds | 657 (55.2%) | | Presentations per seed | 1,536 (128 phases × 12 chains) | | Max relator length (`lmax`) | 128 | | Presentation array length | 256 (2 × 128) | | AC′ moves per chain | 1,000 | | Random seed | 42 | | Generation time | ~3.5 hours (CPU) | | Raw shard size | ~38 MB/shard × 12 shards | --- ## Usage ### Basic loading ```python from datasets import load_dataset import numpy as np ds = load_dataset("mhieuuu/ac-solver-dataset", split="train") print(ds) # Dataset({features: ['presentation', 'pres_idx', 'phase', 'chain', 'gs_solved'], num_rows: 1827840}) ``` ### Decode a single presentation ```python row = ds[42] presentation = np.array(row["presentation"], dtype=np.int8) # (256,) r0 = presentation[:128] r1 = presentation[128:] # Strip padding (zeros) to get actual words r0_word = r0[r0 != 0].tolist() r1_word = r1[r1 != 0].tolist() print(f"Seed index : {row['pres_idx']}") print(f"Phase : {row['phase']}") print(f"Chain : {row['chain']}") print(f"GS-solved : {row['gs_solved']}") print(f"r0 word : {r0_word}") # e.g. [1, 1, -2, -2, -2] print(f"r1 word : {r1_word}") # e.g. [1, 2, 1, -2, -1, -2] ``` ### Filter by difficulty label ```python # Only presentations from GS-solved seeds easy = ds.filter(lambda row: row["gs_solved"]) # Only presentations from GS-unsolved seeds hard = ds.filter(lambda row: not row["gs_solved"]) print(f"Easy: {len(easy):,} Hard: {len(hard):,}") # Easy: 818,688 Hard: 1,009,152 ``` ### Group by seed and iterate over chains ```python import pandas as pd df = ds.to_pandas() # All 1536 presentations derived from seed 0 seed0 = df[df["pres_idx"] == 0].sort_values(["phase", "chain"]) print(seed0.shape) # (1536, 5) ``` ### Streaming (memory-efficient for large jobs) ```python from datasets import load_dataset ds = load_dataset("mhieuuu/ac-solver-dataset", split="train", streaming=True) for row in ds.take(3): pres = row["presentation"] print(f"pres_idx={row['pres_idx']} gs_solved={row['gs_solved']} " f"r0_len={sum(1 for v in pres[:128] if v != 0)}") ``` ### Use with the AC-Solver Transformer ```python import torch import numpy as np from ac_solver.transformer.model import ACTransformer from ac_solver.transformer.tokenizer import presentation_to_tokens # Load model (after training) model = ACTransformer() model.load_state_dict(torch.load("ac_solver/transformer/checkpoints/lm_checkpoint.pt")["model"]) model.eval() # Get 512-dim embedding for a presentation from this dataset from datasets import load_dataset ds = load_dataset("mhieuuu/ac-solver-dataset", split="train", streaming=True) row = next(iter(ds)) pres = np.array(row["presentation"], dtype=np.int8) tokens = presentation_to_tokens(pres, max_relator_length=128) token_ids = torch.tensor(tokens, dtype=torch.long).unsqueeze(0) # (1, T) with torch.no_grad(): hidden = model.get_hidden_states(token_ids) # (1, T, 512) embedding = hidden[0, -1, :] # EOS token → (512,) print(embedding.shape) # torch.Size([512]) ``` --- ## Intended Use & Limitations **Intended uses:** - Training the ACTransformer language model as a prior over AC presentation space - Learning a hardness oracle `H(P)` to predict GS-solved / GS-unsolved - Providing enriched state representations for PPO-based RL agents - Benchmarking representation learning on combinatorial group theory objects **Limitations:** - All presentations are generated from the Miller-Schupp family only; the distribution may not generalise to other families of balanced presentations. - The `gs_solved` label reflects **greedy search reachability**, not AC-triviality. A `gs_solved=False` seed may still be AC-trivial via a longer path. - The dataset does not include solution paths — only the presentations themselves. - Presentations from the same `pres_idx` are correlated (same seed, nearby MCMC walks). Stratify splits by `pres_idx` to avoid leakage. --- ## Related Resources | Resource | Link | |---|---| | Paper (arXiv) | [arxiv.org/abs/2408.15332](https://arxiv.org/abs/2408.15332) | | Code (GitHub) | [github.com/shehper/AC-Solver](https://github.com/shehper/AC-Solver) | | W&B Training Runs | [wandb.ai/shehper/AC-Solver-PPO](https://wandb.ai/shehper/AC-Solver-PPO) | | What is AC? (primer) | [What_is_AC.md](https://github.com/shehper/AC-Solver/blob/main/What_is_AC.md) | --- ## Citation If you use this dataset, please cite the accompanying paper: ```bibtex @misc{shehper2024makesmathproblemshard, title={What makes math problems hard for reinforcement learning: a case study}, author={Ali Shehper and Anibal M. Medina-Mardones and Bartłomiej Lewandowski and Angus Gruen and Piotr Kucharski and Sergei Gukov}, year={2024}, eprint={2408.15332}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2408.15332}, } ```

提供机构：

mhieuuu

5,000+

优质数据集

54 个

任务类型

进入经典数据集