edev2000/amc12-full
Source: Hugging Face · Updated 2026-03-17 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/edev2000/amc12-full
Resource description:
# AMC12 Dataset (Research-Oriented)
A structured dataset derived from the AMC 12 (American Mathematics Competitions), designed for **LLM training, evaluation, and reinforcement learning (RL)** on mathematical reasoning tasks.
This repository contains **all AMC 12 problems from 2000–2025**, making it one of the most complete AMC12 datasets available for research.
---
## 📘 Introduction
The AMC 12 is a **25-question, 75-minute multiple-choice examination** aimed at high school students. Problems are designed to **increase in difficulty progressively**, requiring a combination of algebra, geometry, combinatorics, and number theory reasoning.
* **Format:** 25 multiple-choice questions (A–E)
* **Duration:** 75 minutes
* **Difficulty progression:** Problems 1 → 25 increase in complexity
* **Calculator policy:**
  * Since 2008, calculators are **not permitted**
  * Problems are designed to be solvable without computational aids
Top-performing students (roughly the top 6%) are invited to the American Invitational Mathematics Examination (AIME), which makes the AMC 12 a strong proxy for **high-level mathematical reasoning ability**.
---
## 📦 Dataset Overview
Each sample corresponds to a single AMC 12 problem.
### Example (JSONL)
```json
{
  "year": 2019,
  "problem_id": "2019A-15",
  "question": "...",
  "answer": "D",
  "difficulty": 3
}
```
---
## 🧱 Schema
Each entry in the dataset follows this structure:
| Field | Type | Description |
| ------------ | ------ | --------------------------------------------------------------------------------- |
| `problem_id` | string | Unique identifier in the format `{year}{A/B}-{problem_number}` (e.g., `2019A-15`) |
| `year` | int | Competition year (2000–2025) |
| `question` | string | Full problem statement (including choices) |
| `answer` | string | Correct answer option (`A`–`E`) |
| `difficulty` | int    | Difficulty level (`2`, `3`, or `4`) derived from the problem's position in the exam |
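
As a rough illustration of the schema above, the following sketch parses JSONL text and checks each record against the documented fields. The helper names `load_jsonl` and `validate_record` are hypothetical and not part of the dataset; the allowed answer letters and year range are taken from the table.

```python
import json
from typing import Dict, List

# Field names and Python types, copied from the schema table.
SCHEMA = {
    "problem_id": str,
    "year": int,
    "question": str,
    "answer": str,
    "difficulty": int,
}


def load_jsonl(text: str) -> List[Dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def validate_record(record: Dict) -> List[str]:
    """Return a list of problems found; an empty list means the record passed."""
    errors = []
    for field, expected_type in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[field], bool) or not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    if not errors:
        if record["answer"] not in {"A", "B", "C", "D", "E"}:
            errors.append("answer: must be one of A-E")
        if not 2000 <= record["year"] <= 2025:
            errors.append("year: outside 2000-2025")
    return errors
```

Running `validate_record` over every row before training is a cheap way to catch truncated or mis-typed entries early.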
### 🔑 Problem ID Definition
The `problem_id` encodes the full provenance of each problem:
```
{year}{A/B}-{problem_number}
```
* `{year}` → competition year
* `{A/B}` → AMC12A or AMC12B
* `{problem_number}` → position in the exam (1–25)
#### Examples
* `2007B-5` → AMC 12B, 2007, Problem 5
* `2019A-15` → AMC 12A, 2019, Problem 15
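
The ID format above can be decoded with a small regular expression. This is a minimal sketch: `ProblemRef` and `parse_problem_id` are hypothetical names, and the pattern covers only the documented `{year}{A/B}-{problem_number}` form, so any ID in the data that deviates from it would raise `ValueError` and need separate handling.

```python
import re
from typing import NamedTuple


class ProblemRef(NamedTuple):
    year: int
    contest: str  # "A" or "B"
    number: int   # position in the exam, 1-25


# Four-digit year, contest letter, hyphen, one- or two-digit problem number.
_PROBLEM_ID = re.compile(r"(\d{4})([AB])-(\d{1,2})")


def parse_problem_id(problem_id: str) -> ProblemRef:
    """Split an ID such as '2019A-15' into year, contest letter, and problem number."""
    match = _PROBLEM_ID.fullmatch(problem_id)
    if match is None:
        raise ValueError(f"unrecognized problem_id: {problem_id!r}")
    number = int(match.group(3))
    if not 1 <= number <= 25:
        raise ValueError(f"problem number out of range in {problem_id!r}")
    return ProblemRef(int(match.group(1)), match.group(2), number)
```

For example, `parse_problem_id("2007B-5")` yields `ProblemRef(year=2007, contest='B', number=5)`.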
---
### Difficulty Structure
We adopt a coarse-grained difficulty approximation aligned with problem order:
| Problem Range | Difficulty |
| ------------- | --------------- |
| 1–10 | Easy–Medium (2) |
| 11–20 | Medium–Hard (3) |
| 21–25 | Hard (4) |
This structure enables:
* Curriculum learning
* Difficulty-aware evaluation
* Model capability stratification
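
The mapping in the difficulty table, and the difficulty-aware evaluation it enables, might look like the sketch below. Both function names are hypothetical, and `predictions` is assumed to be a plain dict from `problem_id` to a predicted answer letter.

```python
from collections import defaultdict
from typing import Dict, Iterable


def difficulty_from_number(problem_number: int) -> int:
    """Map a problem's position (1-25) to the coarse level from the table."""
    if 1 <= problem_number <= 10:
        return 2
    if 11 <= problem_number <= 20:
        return 3
    if 21 <= problem_number <= 25:
        return 4
    raise ValueError(f"problem number out of range: {problem_number}")


def accuracy_by_difficulty(records: Iterable[Dict], predictions: Dict[str, str]) -> Dict[int, float]:
    """Compute accuracy separately for each difficulty level present in `records`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        level = record["difficulty"]
        total[level] += 1
        # A missing prediction counts as wrong.
        if predictions.get(record["problem_id"]) == record["answer"]:
            correct[level] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}
```

Reporting accuracy per level rather than as a single number makes it easier to see whether a model fails only on the late, harder problems.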
---
## 🚀 Why This Dataset?
### Compared to Other Math Datasets
* **Lower pretraining exposure**
  Compared with widely used benchmarks such as GSM8K, AMC-style problems are less likely to have been memorized by models
* **Higher reasoning complexity**
  Problems typically require **multi-step, structured reasoning**, and are often harder than those in datasets like MATH500
* **Clean evaluation signal**
  * Multiple-choice format eliminates answer-format ambiguity
  * No unit mismatch issues (e.g., “8 months vs 240 days”)
* **Fully verifiable**
  Every problem has a **unique, discrete answer**, ideal for RL reward design
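
Because each problem has a single correct letter, a binary reward can be computed by string matching. The sketch below assumes, purely for illustration, that the model states its final choice either as `\boxed{D}` or as `Answer: D` / `answer is D`; the dataset does not prescribe an output format, and `extract_choice` and `reward` are hypothetical helpers.

```python
import re
from typing import Optional

# Assumed output conventions (not defined by the dataset).
_BOXED = re.compile(r"\\boxed\{\s*\(?([A-E])\)?\s*\}")
# "answer"/"is" match case-insensitively; the choice letter must be uppercase
# and must not be the first letter of a longer word.
_STATED = re.compile(r"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([A-E])\)?(?![A-Za-z])")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last choice letter stated in the completion, or None."""
    found = []
    for pattern in (_BOXED, _STATED):
        found.extend((m.start(), m.group(1)) for m in pattern.finditer(completion))
    if not found:
        return None
    # Keep the match that starts latest, so a revised answer wins.
    return max(found)[1]


def reward(completion: str, gold: str) -> float:
    """1.0 if the extracted choice equals the gold answer, else 0.0."""
    return 1.0 if extract_choice(completion) == gold else 0.0
```

Taking the last stated choice rather than the first is a design decision: it credits a model that corrects itself mid-solution, at the cost of also accepting a late lucky guess.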
---
### Compared to Other AMC Datasets
* **Complete coverage (2000–2025)**
  Includes all AMC12A and AMC12B problems from the 2000–2025 contest years
* **Fully indexed & traceable**
  Each problem maps directly to its original contest and position
* **Structured for ML pipelines**
  Ready for:
  * RL training (PPO / GRPO)
  * Pass@k evaluation
  * Verifier-based reward systems
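
For Pass@k evaluation, one common choice is the unbiased estimator popularized by code-generation benchmarks: with `n` sampled completions per problem, of which `c` are correct, pass@k is estimated as 1 − C(n−c, k) / C(n, k). The dataset itself does not mandate this estimator; the sketch below is one way to compute it, with `pass_at_k` as a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: completions sampled for the problem
    c: how many of those completions were correct
    k: evaluation budget (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct completion;
    # it is 0 whenever fewer than k incorrect completions exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The dataset-level score is then the mean of `pass_at_k` over all problems, using the same `n` and `k` for each.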
---
## 📚 Data Source & Attribution
This dataset is curated from publicly available resources, with primary reference to:
* Art of Problem Solving
All AMC problems are **copyrighted by the Mathematical Association of America (MAA)** under the American Mathematics Competitions program.
This repository does **not claim ownership** of the original problem statements and provides them solely for research and educational purposes.
Provider:
edev2000



