dongboklee/math-eval
Source: Hugging Face · Updated 2025-12-04 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/dongboklee/math-eval
Resource description:
---
configs:
- config_name: all
  data_files:
  - split: dev
    path: all/dev-*
  - split: test
    path: all/test-*
- config_name: theoremqa
  data_files:
  - split: dev
    path: theoremqa/dev-*
  - split: test
    path: theoremqa/test-*
- config_name: math
  data_files:
  - split: dev
    path: math/dev-*
  - split: test
    path: math/test-*
- config_name: gsm8k
  data_files:
  - split: dev
    path: gsm8k/dev-*
  - split: test
    path: gsm8k/test-*
- config_name: gpqa_diamond
  data_files:
  - split: dev
    path: gpqa_diamond/dev-*
  - split: test
    path: gpqa_diamond/test-*
- config_name: mmlu_stem
  data_files:
  - split: dev
    path: mmlu_stem/dev-*
  - split: test
    path: mmlu_stem/test-*
- config_name: arc
  data_files:
  - split: dev
    path: arc/dev-*
  - split: test
    path: arc/test-*
- config_name: bbh
  data_files:
  - split: dev
    path: bbh/dev-*
  - split: test
    path: bbh/test-*
---
# Math Evaluation Dataset Collection
This dataset contains multiple math and reasoning evaluation benchmarks, each available as a separate configuration.
## Available Configurations
- **all**: Combined dataset containing all benchmarks (includes additional 'dataset' field)
- **theoremqa**: TheoremQA dataset
- **math**: MATH dataset
- **gsm8k**: GSM8K dataset
- **gpqa_diamond**: GPQA Diamond dataset
- **mmlu_stem**: MMLU STEM subset
- **arc**: ARC dataset
- **bbh**: Big Bench Hard (BBH) dataset
## Usage
### Load everything at once:
```python
import json

from datasets import load_dataset

# Load all datasets combined
dataset = load_dataset("dongboklee/math-eval", "all")

# Access splits
dev_set = dataset["dev"]
test_set = dataset["test"]

# The 'all' configuration has an additional 'dataset' field
for row in test_set.select(range(5)):
    # Answers are serialized JSON; an empty string means no answer
    answer = json.loads(row["answer"]) if row["answer"] else None
    print(f"Dataset: {row['dataset']}")
    print(f"Task: {row['task']}")
    print(f"Question: {row['question'][:100]}...")
    print(f"Answer: {answer}")
    print("---")
```
### Load a specific dataset:
```python
import json

from datasets import load_dataset

# Load a specific dataset (e.g., math)
dataset = load_dataset("dongboklee/math-eval", "math")

# Access splits
dev_set = dataset["dev"]
test_set = dataset["test"]

# Parse answers (they are serialized JSON) and show the first test row
for row in test_set:
    answer = json.loads(row["answer"]) if row["answer"] else None
    print(f"Question: {row['question'][:100]}...")
    print(f"Answer: {answer}")
    break
```
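The inline `json.loads(...) if row["answer"] else None` pattern above can be wrapped in a small helper. The following is a minimal sketch: the name `parse_answer`, the sample rows, and the choice to fall back to the raw string on malformed JSON are all assumptions for illustration, not part of the dataset.

```python
import json


def parse_answer(raw):
    """Decode the serialized-JSON `answer` field; return None when it is empty."""
    if not raw:  # covers None and "" (dev rows have an empty answer)
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: keep the raw string instead of aborting the whole loop.
        return raw


# Hypothetical rows shaped like the fields described in this card.
rows = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "Solve x^2 = 4.", "answer": "[-2, 2]"},
    {"question": "Few-shot example", "answer": ""},
]

parsed = [parse_answer(row["answer"]) for row in rows]
print(parsed)
```

Because the helper only reads a string, it works unchanged on rows yielded by iterating a loaded split.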
### Load BBH and filter by task:
```python
from datasets import load_dataset

# Load BBH dataset
bbh_dataset = load_dataset("dongboklee/math-eval", "bbh")

# Filter by specific task
boolean_expr = bbh_dataset["test"].filter(lambda x: x["task"] == "boolean_expressions")
```
### Filter the combined dataset:
```python
from datasets import load_dataset

# Load all data
all_data = load_dataset("dongboklee/math-eval", "all")

# Filter for specific dataset
math_only = all_data["test"].filter(lambda x: x["dataset"] == "math")

# Filter for specific BBH task
bbh_boolean = all_data["test"].filter(
    lambda x: x["dataset"] == "bbh" and x["task"] == "boolean_expressions"
)
```
## Dataset Structure
Each configuration has the following structure:
- **dev**: Development set with few-shot examples (includes chain-of-thought)
- **test**: Test set with questions and ground truth answers
### Fields
- `question`: The question text
- `cot`: Chain of thought reasoning (only in dev set for few-shot examples)
- `answer`: Serialized JSON answer (empty for dev set)
- `task`: Task name (particularly relevant for BBH which contains multiple sub-tasks)
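Since the dev split carries few-shot examples with chain-of-thought, one natural use of the `question` and `cot` fields is assembling a few-shot prompt. The sketch below is only one possible layout: the function name `build_prompt`, the `Question:`/`Reasoning:` labels, and the sample rows are assumptions, not a format prescribed by the dataset.

```python
def build_prompt(dev_rows, question, n_shots=3):
    """Join up to `n_shots` dev examples, then append the unanswered test question."""
    parts = []
    for row in dev_rows[:n_shots]:
        # Each shot shows a question followed by its chain-of-thought.
        parts.append(f"Question: {row['question']}\nReasoning: {row['cot']}")
    # The final block is left open for the model to complete.
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


# Hypothetical dev rows; real ones would come from list(dataset["dev"]).
dev_rows = [
    {"question": "What is 2 + 3?", "cot": "2 + 3 = 5. The answer is 5.", "answer": ""},
    {"question": "What is 4 * 5?", "cot": "4 * 5 = 20. The answer is 20.", "answer": ""},
]

prompt = build_prompt(dev_rows, "What is 7 * 6?", n_shots=1)
print(prompt)
```

Note that slicing a loaded split directly (`dev_set[:3]`) returns a dict of columns rather than a list of rows, so converting with `list(dev_set)` first keeps the helper's row-wise loop valid.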
## Dataset Statistics
When using the 'all' configuration, you get:
- All 7 evaluation benchmarks in one place
- Consistent formatting across all datasets
- Easy filtering by dataset or task
- Preserved chain-of-thought examples in dev sets
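The `dataset` field in the combined configuration also makes it straightforward to report a score per benchmark. Here is a minimal sketch, assuming exact-match scoring against the decoded `answer`; the function name `accuracy_by_dataset`, the sample rows, and the predictions are hypothetical, and real benchmarks may need a more lenient comparison.

```python
import json
from collections import defaultdict


def accuracy_by_dataset(rows, predictions):
    """Return exact-match accuracy for each value of the `dataset` field."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, pred in zip(rows, predictions):
        # Answers are serialized JSON; an empty string decodes to None here.
        gold = json.loads(row["answer"]) if row["answer"] else None
        total[row["dataset"]] += 1
        correct[row["dataset"]] += int(pred == gold)
    return {name: correct[name] / total[name] for name in total}


# Hypothetical test rows and model predictions, for illustration only.
rows = [
    {"dataset": "gsm8k", "answer": "18"},
    {"dataset": "gsm8k", "answer": "3"},
    {"dataset": "bbh", "answer": "\"True\""},
]
predictions = [18, 4, "True"]

scores = accuracy_by_dataset(rows, predictions)
print(scores)
```

Grouping on `task` instead of `dataset` would give the same breakdown per BBH sub-task.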
Provided by:
dongboklee



