dongboklee/math-eval

Name: dongboklee/math-eval
Creator: dongboklee
Published: 2025-12-04 06:20:00
License: No description available

Source: Hugging Face. Updated 2025-12-04. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/dongboklee/math-eval


Resource description:

---
configs:
- config_name: all
  data_files:
  - split: dev
    path: all/dev-*
  - split: test
    path: all/test-*
- config_name: theoremqa
  data_files:
  - split: dev
    path: theoremqa/dev-*
  - split: test
    path: theoremqa/test-*
- config_name: math
  data_files:
  - split: dev
    path: math/dev-*
  - split: test
    path: math/test-*
- config_name: gsm8k
  data_files:
  - split: dev
    path: gsm8k/dev-*
  - split: test
    path: gsm8k/test-*
- config_name: gpqa_diamond
  data_files:
  - split: dev
    path: gpqa_diamond/dev-*
  - split: test
    path: gpqa_diamond/test-*
- config_name: mmlu_stem
  data_files:
  - split: dev
    path: mmlu_stem/dev-*
  - split: test
    path: mmlu_stem/test-*
- config_name: arc
  data_files:
  - split: dev
    path: arc/dev-*
  - split: test
    path: arc/test-*
- config_name: bbh
  data_files:
  - split: dev
    path: bbh/dev-*
  - split: test
    path: bbh/test-*
---

# Math Evaluation Dataset Collection

This dataset contains multiple math and reasoning evaluation benchmarks, each available as a separate configuration.

## Available Configurations

- **all**: Combined dataset containing all benchmarks (includes an additional `dataset` field)
- **theoremqa**: TheoremQA dataset
- **math**: MATH dataset
- **gsm8k**: GSM8K dataset
- **gpqa_diamond**: GPQA Diamond dataset
- **mmlu_stem**: MMLU STEM subset
- **arc**: ARC dataset
- **bbh**: Big Bench Hard (BBH) dataset

## Usage

### Load everything at once:

```python
from datasets import load_dataset
import json

# Load all datasets combined
dataset = load_dataset("dongboklee/math-eval", "all")

# Access splits
dev_set = dataset["dev"]
test_set = dataset["test"]

# The 'all' configuration has an additional 'dataset' field
for row in test_set.select(range(5)):
    # Answers are serialized JSON; an empty string means no answer
    answer = json.loads(row["answer"]) if row["answer"] else None
    print(f"Dataset: {row['dataset']}")
    print(f"Task: {row['task']}")
    print(f"Question: {row['question'][:100]}...")
    print(f"Answer: {answer}")
    print("---")
```

### Load a specific dataset:

```python
from datasets import load_dataset
import json

# Load a specific dataset (e.g., math)
dataset = load_dataset("dongboklee/math-eval", "math")

# Access splits
dev_set = dataset["dev"]
test_set = dataset["test"]

# Parse answers (they are serialized JSON); print the first row only
for row in test_set:
    answer = json.loads(row["answer"]) if row["answer"] else None
    print(f"Question: {row['question'][:100]}...")
    print(f"Answer: {answer}")
    break
```

### Load BBH and filter by task:

```python
from datasets import load_dataset

# Load BBH dataset
bbh_dataset = load_dataset("dongboklee/math-eval", "bbh")

# Filter by specific task
boolean_expr = bbh_dataset["test"].filter(lambda x: x["task"] == "boolean_expressions")
```

### Filter the combined dataset:

```python
from datasets import load_dataset

# Load all data
all_data = load_dataset("dongboklee/math-eval", "all")

# Filter for specific dataset
math_only = all_data["test"].filter(lambda x: x["dataset"] == "math")

# Filter for specific BBH task
bbh_boolean = all_data["test"].filter(
    lambda x: x["dataset"] == "bbh" and x["task"] == "boolean_expressions"
)
```

## Dataset Structure

Each configuration has the following structure:

- **dev**: Development set with few-shot examples (includes chain-of-thought)
- **test**: Test set with questions and ground truth answers

### Fields

- `question`: The question text
- `cot`: Chain-of-thought reasoning (only in the dev set, for few-shot examples)
- `answer`: Serialized JSON answer (empty in the dev set)
- `task`: Task name (particularly relevant for BBH, which contains multiple sub-tasks)
- `dataset`: Source benchmark name (only in the `all` configuration)

## Dataset Statistics

When using the `all` configuration, you get:

- All 7 evaluation benchmarks in one place
- Consistent formatting across all datasets
- Easy filtering by dataset or task
- Preserved chain-of-thought examples in dev sets
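The field layout described above (dev rows carrying `question` and `cot`, test rows carrying a JSON-serialized `answer`) suggests a simple few-shot workflow. The following is a minimal sketch of that idea, not part of the dataset's own tooling: the helper names `decode_answer` and `build_few_shot_prompt`, the prompt wording, and the sample rows are all hypothetical, and the real JSON shape of `answer` may differ per benchmark.

```python
import json
from itertools import islice


def decode_answer(raw):
    """Decode the JSON-serialized `answer` field; return None when it is empty."""
    return json.loads(raw) if raw else None


def build_few_shot_prompt(dev_rows, question, max_shots=3):
    """Join up to `max_shots` dev question/CoT pairs, then append the target question."""
    parts = []
    # islice works for plain lists and for any iterable that yields row dicts
    for row in islice(dev_rows, max_shots):
        parts.append(f"Question: {row['question']}\nReasoning: {row['cot']}")
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


# Hypothetical rows that only mimic the documented fields (assumed values).
dev_rows = [
    {"question": "What is 2 + 3?", "cot": "2 + 3 = 5. The answer is 5.",
     "answer": "", "task": "gsm8k"},
]
test_row = {"question": "What is 4 * 6?", "cot": "",
            "answer": json.dumps("24"), "task": "gsm8k"}

prompt = build_few_shot_prompt(dev_rows, test_row["question"])
gold = decode_answer(test_row["answer"])

print(prompt)
print("Gold answer:", gold)
```

With the real data, the `dev` and `test` splits returned by `load_dataset` could be passed in place of the sample rows, since iterating over a split yields one dictionary per row.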

Provider:

dongboklee
