agicorp/StackMathQA
Source: Hugging Face | Updated: 2024-03-23 | Indexed: 2024-06-11
Download link:
https://hf-mirror.com/datasets/agicorp/StackMathQA
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
pretty_name: StackMathQA
size_categories:
- 1M<n<10M
configs:
- config_name: stackmathqa1600k
  data_files: data/stackmathqa1600k/all.jsonl
  default: true
- config_name: stackmathqa800k
  data_files: data/stackmathqa800k/all.jsonl
- config_name: stackmathqa400k
  data_files: data/stackmathqa400k/all.jsonl
- config_name: stackmathqa200k
  data_files: data/stackmathqa200k/all.jsonl
- config_name: stackmathqa100k
  data_files: data/stackmathqa100k/all.jsonl
- config_name: stackmathqafull-1q1a
  data_files: preprocessed/stackexchange-math--1q1a/*.jsonl
- config_name: stackmathqafull-qalist
  data_files: preprocessed/stackexchange-math/*.jsonl
tags:
- mathematical-reasoning
- reasoning
- finetuning
- pretraining
- llm
---
# StackMathQA
StackMathQA is a curated collection of about **2 million** mathematical questions and answers (1,957,006 question-answer pairs) sourced from four Stack Exchange sites: math.stackexchange.com, mathoverflow.net, stats.stackexchange.com, and physics.stackexchange.com. It is intended as a resource for researchers, educators, and enthusiasts working in mathematics and AI.
## Configs
```YAML
configs:
- config_name: stackmathqa1600k
  data_files: data/stackmathqa1600k/all.jsonl
  default: true
- config_name: stackmathqa800k
  data_files: data/stackmathqa800k/all.jsonl
- config_name: stackmathqa400k
  data_files: data/stackmathqa400k/all.jsonl
- config_name: stackmathqa200k
  data_files: data/stackmathqa200k/all.jsonl
- config_name: stackmathqa100k
  data_files: data/stackmathqa100k/all.jsonl
- config_name: stackmathqafull-1q1a
  data_files: preprocessed/stackexchange-math--1q1a/*.jsonl
- config_name: stackmathqafull-qalist
  data_files: preprocessed/stackexchange-math/*.jsonl
```
How to load data:
```python
from datasets import load_dataset
ds = load_dataset("math-ai/StackMathQA", "stackmathqa1600k") # or any valid config_name
```
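For a quick look at the data without downloading a whole subset first, the same call can be combined with a config-name check and streaming. The sketch below is illustrative only: `CONFIG_NAMES` copies the list from the Configs section, while `check_config`, `load_stackmathqa`, and `peek` are hypothetical helper names, and it assumes that each config's data files are exposed as the default `train` split.

```python
from itertools import islice

# Config names copied from the Configs section above.
CONFIG_NAMES = (
    "stackmathqa1600k",
    "stackmathqa800k",
    "stackmathqa400k",
    "stackmathqa200k",
    "stackmathqa100k",
    "stackmathqafull-1q1a",
    "stackmathqafull-qalist",
)


def check_config(name):
    """Return the name unchanged, or raise ValueError if it is not a listed config."""
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIG_NAMES}")
    return name


def load_stackmathqa(name="stackmathqa100k", streaming=True):
    """Load one config; with streaming=True records are read lazily."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    # Assumption: configs without an explicit split are served as "train".
    return load_dataset(
        "math-ai/StackMathQA",
        check_config(name),
        split="train",
        streaming=streaming,
    )


def peek(records, n=3):
    """Return the first n records from any iterable of records."""
    return list(islice(records, n))
```

Under these assumptions, `peek(load_stackmathqa("stackmathqa100k"))` would return the first three records as dictionaries.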
## Preprocessed Data
The `./preprocessed/stackexchange-math` and `./preprocessed/stackexchange-math--1q1a` directories hold the full data in two formats:
1. **Question and List of Answers Format**:
Each entry is structured as `{"Q": "question", "A_List": ["answer1", "answer2", ...]}`.
- `math.stackexchange.com.jsonl`: 827,439 lines
- `mathoverflow.net.jsonl`: 90,645 lines
- `stats.stackexchange.com.jsonl`: 103,024 lines
- `physics.stackexchange.com.jsonl`: 117,318 lines
- In total: **1,138,426** questions
```YAML
dataset_info:
  features:
    - name: Q
      dtype: string
      description: "The mathematical question in LaTeX-encoded format."
    - name: A_list
      dtype: sequence
      description: "The list of answers to the mathematical question, also LaTeX-encoded."
    - name: meta
      dtype: dict
      description: "A collection of metadata for each question and its corresponding answer list."
```
2. **Question and Single Answer Format**:
Each line contains a question and one corresponding answer, structured as `{"Q": "question", "A": "answer"}`. A question with several answers appears on several lines, one per answer.
- `math.stackexchange.com.jsonl`: 1,407,739 lines
- `mathoverflow.net.jsonl`: 166,592 lines
- `stats.stackexchange.com.jsonl`: 156,143 lines
- `physics.stackexchange.com.jsonl`: 226,532 lines
- In total: **1,957,006** answers
```YAML
dataset_info:
  features:
    - name: Q
      dtype: string
      description: "The mathematical question in LaTeX-encoded format."
    - name: A
      dtype: string
      description: "The answer to the mathematical question, also LaTeX-encoded."
    - name: meta
      dtype: dict
      description: "A collection of metadata for each question-answer pair."
```
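The two formats are related by a simple expansion: each question in the list format yields one line per answer in the single-answer format. A minimal sketch of that expansion follows. `expand_record` and `expand_jsonl_lines` are hypothetical helpers, not the authors' preprocessing script; the code accepts both `A_List` (as in the example record) and `A_list` (as in the schema) because the key is spelled both ways above, and copying `meta` onto every pair is an assumption.

```python
import json


def expand_record(record):
    """Turn one {"Q", "A_List"} record into a list of {"Q", "A"} records."""
    # The key is spelled "A_List" in the example and "A_list" in the schema,
    # so accept either (assumption: only one of them is present).
    answers = record.get("A_List", record.get("A_list", []))
    pairs = []
    for answer in answers:
        pair = {"Q": record["Q"], "A": answer}
        if "meta" in record:
            pair["meta"] = record["meta"]  # assumption: metadata is shared by all pairs
        pairs.append(pair)
    return pairs


def expand_jsonl_lines(lines):
    """Yield single-answer JSON lines from an iterable of list-format JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for pair in expand_record(json.loads(line)):
            yield json.dumps(pair, ensure_ascii=False)
```

For scale, the line counts listed above work out to roughly 1.72 answers per question (1,957,006 / 1,138,426).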
## Selected Data
The subsets below were selected from the single-answer data (`./preprocessed/stackexchange-math--1q1a`) using importance sampling and come in several sizes to suit different needs. All of them share the following schema:
```YAML
dataset_info:
  features:
    - name: Q
      dtype: string
      description: "The mathematical question in LaTeX-encoded format."
    - name: A
      dtype: string
      description: "The answer to the mathematical question, also LaTeX-encoded."
    - name: meta
      dtype: dict
      description: "A collection of metadata for each question-answer pair."
```
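For fine-tuning, each record with this schema can be flattened into a single training string. The sketch below shows one possible convention; the `Question:`/`Answer:` template and the helper names `to_training_text` and `is_usable` are assumptions made for illustration, not a format prescribed by the dataset.

```python
def to_training_text(record, template="Question: {q}\n\nAnswer: {a}"):
    """Format one {"Q": ..., "A": ...} record as a single training string."""
    # Only the template is parsed by str.format, so LaTeX braces inside
    # the question or answer text are passed through untouched.
    return template.format(q=record["Q"].strip(), a=record["A"].strip())


def is_usable(record, min_answer_chars=1):
    """Keep records whose question and answer are non-empty strings."""
    q, a = record.get("Q"), record.get("A")
    if not isinstance(q, str) or not isinstance(a, str):
        return False
    return bool(q.strip()) and len(a.strip()) >= min_answer_chars
```

With a dataset object `ds` loaded as shown earlier, `ds.filter(is_usable).map(lambda r: {"text": to_training_text(r)})` would add a `text` column holding the formatted strings.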
### StackMathQA1600K
- Location: `./data/stackmathqa1600k`
- Contents:
  - `all.jsonl`: Containing 1.6 million entries.
  - `meta.json`: Metadata and additional information.
```bash
Source: Stack Exchange (Math), Count: 1244887
Source: MathOverflow, Count: 110041
Source: Stack Exchange (Stats), Count: 99878
Source: Stack Exchange (Physics), Count: 145194
```
The StackMathQA800K, StackMathQA400K, StackMathQA200K, and StackMathQA100K subsets follow the same layout.
### StackMathQA800K
- Location: `./data/stackmathqa800k`
- Contents:
  - `all.jsonl`: Containing 800k entries.
  - `meta.json`: Metadata and additional information.
```bash
Source: Stack Exchange (Math), Count: 738850
Source: MathOverflow, Count: 24276
Source: Stack Exchange (Stats), Count: 15046
Source: Stack Exchange (Physics), Count: 21828
```
### StackMathQA400K
- Location: `./data/stackmathqa400k`
- Contents:
  - `all.jsonl`: Containing 400k entries.
  - `meta.json`: Metadata and additional information.
```bash
Source: Stack Exchange (Math), Count: 392940
Source: MathOverflow, Count: 3963
Source: Stack Exchange (Stats), Count: 1637
Source: Stack Exchange (Physics), Count: 1460
```
### StackMathQA200K
- Location: `./data/stackmathqa200k`
- Contents:
  - `all.jsonl`: Containing 200k entries.
  - `meta.json`: Metadata and additional information.
```bash
Source: Stack Exchange (Math), Count: 197792
Source: MathOverflow, Count: 1367
Source: Stack Exchange (Stats), Count: 423
Source: Stack Exchange (Physics), Count: 418
```
### StackMathQA100K
- Location: `./data/stackmathqa100k`
- Contents:
  - `all.jsonl`: Containing 100k entries.
  - `meta.json`: Metadata and additional information.
```bash
Source: Stack Exchange (Math), Count: 99013
Source: MathOverflow, Count: 626
Source: Stack Exchange (Stats), Count: 182
Source: Stack Exchange (Physics), Count: 179
```
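The per-source counts above can be checked against the nominal subset sizes, and they show how the source mix shifts as the subsets shrink. The following sketch copies the counts from the listings; the short source labels (`math`, `mathoverflow`, `stats`, `physics`) and the `source_shares` helper are naming choices made here for illustration.

```python
# Per-source counts copied from the subset listings above.
SUBSET_COUNTS = {
    "stackmathqa1600k": {"math": 1244887, "mathoverflow": 110041, "stats": 99878, "physics": 145194},
    "stackmathqa800k": {"math": 738850, "mathoverflow": 24276, "stats": 15046, "physics": 21828},
    "stackmathqa400k": {"math": 392940, "mathoverflow": 3963, "stats": 1637, "physics": 1460},
    "stackmathqa200k": {"math": 197792, "mathoverflow": 1367, "stats": 423, "physics": 418},
    "stackmathqa100k": {"math": 99013, "mathoverflow": 626, "stats": 182, "physics": 179},
}


def source_shares(counts):
    """Return each source's fraction of the subset total."""
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}


for name, counts in SUBSET_COUNTS.items():
    shares = source_shares(counts)
    print(f"{name}: total={sum(counts.values()):,} math={shares['math']:.1%}")
```

Each subset sums exactly to its nominal size, and the share of Math Stack Exchange entries rises from about 78% in the 1600K subset to about 99% in the 100K subset.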
## Citation
We appreciate your use of StackMathQA in your work. If you find this repository helpful, please consider citing it and starring the repo. Feel free to contact zhangyif21@tsinghua.edu.cn or open an issue if you have any questions.
```bibtex
@misc{stackmathqa2024,
title={StackMathQA: A Curated Collection of 2 Million Mathematical Questions and Answers Sourced from Stack Exchange},
author={Zhang, Yifan},
year={2024},
}
```
Provider:
agicorp
Background overview:
StackMathQA is a large-scale dataset containing about 2 million mathematical questions and answers from Stack Exchange, designed for AI research and educational purposes. It includes multiple subsets of different sizes and is available in two formats: a question with a list of answers, and a question with a single answer.
The summary above was collected and generated by 遇见数据集.