
agicorp/StackMathQA

Source: Hugging Face. Updated 2024-03-23. Indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/agicorp/StackMathQA
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
pretty_name: StackMathQA
size_categories:
- 1B<n<10B
configs:
- config_name: stackmathqa1600k
  data_files: data/stackmathqa1600k/all.jsonl
  default: true
- config_name: stackmathqa800k
  data_files: data/stackmathqa800k/all.jsonl
- config_name: stackmathqa400k
  data_files: data/stackmathqa400k/all.jsonl
- config_name: stackmathqa200k
  data_files: data/stackmathqa200k/all.jsonl
- config_name: stackmathqa100k
  data_files: data/stackmathqa100k/all.jsonl
- config_name: stackmathqafull-1q1a
  data_files: preprocessed/stackexchange-math--1q1a/*.jsonl
- config_name: stackmathqafull-qalist
  data_files: preprocessed/stackexchange-math/*.jsonl
tags:
- mathematical-reasoning
- reasoning
- finetuning
- pretraining
- llm
---

# StackMathQA

StackMathQA is a meticulously curated collection of **2 million** mathematical questions and answers, sourced from various Stack Exchange sites. This repository is designed to serve as a comprehensive resource for researchers, educators, and enthusiasts in the field of mathematics and AI research.
## Configs

```YAML
configs:
- config_name: stackmathqa1600k
  data_files: data/stackmathqa1600k/all.jsonl
  default: true
- config_name: stackmathqa800k
  data_files: data/stackmathqa800k/all.jsonl
- config_name: stackmathqa400k
  data_files: data/stackmathqa400k/all.jsonl
- config_name: stackmathqa200k
  data_files: data/stackmathqa200k/all.jsonl
- config_name: stackmathqa100k
  data_files: data/stackmathqa100k/all.jsonl
- config_name: stackmathqafull-1q1a
  data_files: preprocessed/stackexchange-math--1q1a/*.jsonl
- config_name: stackmathqafull-qalist
  data_files: preprocessed/stackexchange-math/*.jsonl
```

How to load data:

```python
from datasets import load_dataset

ds = load_dataset("math-ai/StackMathQA", "stackmathqa1600k") # or any valid config_name
```

## Preprocessed Data

In the `./preprocessed/stackexchange-math` directory and `./preprocessed/stackexchange-math--1q1a` directory, you will find the data structured in two formats:

1. **Question and List of Answers Format**:
   Each entry is structured as {"Q": "question", "A_List": ["answer1", "answer2", ...]}.

   - `math.stackexchange.com.jsonl`: 827,439 lines
   - `mathoverflow.net.jsonl`: 90,645 lines
   - `stats.stackexchange.com.jsonl`: 103,024 lines
   - `physics.stackexchange.com.jsonl`: 117,318 lines
   - In total: **1,138,426** questions

   ```YAML
   dataset_info:
     features:
       - name: Q
         dtype: string
         description: "The mathematical question in LaTeX encoded format."
       - name: A_list
         dtype: sequence
         description: "The list of answers to the mathematical question, also in LaTeX encoded."
       - name: meta
         dtype: dict
         description: "A collection of metadata for each question and its corresponding answer list."
   ```

2. **Question and Single Answer Format**:
   Each line contains a question and one corresponding answer, structured as {"Q": "question", "A": "answer"}. Multiple answers for the same question are separated into different lines.
   - `math.stackexchange.com.jsonl`: 1,407,739 lines
   - `mathoverflow.net.jsonl`: 166,592 lines
   - `stats.stackexchange.com.jsonl`: 156,143 lines
   - `physics.stackexchange.com.jsonl`: 226,532 lines
   - In total: **1,957,006** answers

   ```YAML
   dataset_info:
     features:
       - name: Q
         dtype: string
         description: "The mathematical question in LaTeX encoded format."
       - name: A
         dtype: string
         description: "The answer to the mathematical question, also in LaTeX encoded."
       - name: meta
         dtype: dict
         description: "A collection of metadata for each question-answer pair."
   ```

## Selected Data

The dataset has been carefully curated using importance sampling. We offer selected subsets of the dataset (`./preprocessed/stackexchange-math--1q1a`) with different sizes to cater to varied needs:

```YAML
dataset_info:
  features:
    - name: Q
      dtype: string
      description: "The mathematical question in LaTeX encoded format."
    - name: A
      dtype: string
      description: "The answer to the mathematical question, also in LaTeX encoded."
    - name: meta
      dtype: dict
      description: "A collection of metadata for each question-answer pair."
```

### StackMathQA1600K

- Location: `./data/stackmathqa1600k`
- Contents:
  - `all.jsonl`: Containing 1.6 million entries.
  - `meta.json`: Metadata and additional information.

```bash
Source: Stack Exchange (Math), Count: 1244887
Source: MathOverflow, Count: 110041
Source: Stack Exchange (Stats), Count: 99878
Source: Stack Exchange (Physics), Count: 145194
```

Similar structures are available for StackMathQA800K, StackMathQA400K, StackMathQA200K, and StackMathQA100K subsets.

### StackMathQA800K

- Location: `./data/stackmathqa800k`
- Contents:
  - `all.jsonl`: Containing 800k entries.
  - `meta.json`: Metadata and additional information.
```bash
Source: Stack Exchange (Math), Count: 738850
Source: MathOverflow, Count: 24276
Source: Stack Exchange (Stats), Count: 15046
Source: Stack Exchange (Physics), Count: 21828
```

### StackMathQA400K

- Location: `./data/stackmathqa400k`
- Contents:
  - `all.jsonl`: Containing 400k entries.
  - `meta.json`: Metadata and additional information.

```bash
Source: Stack Exchange (Math), Count: 392940
Source: MathOverflow, Count: 3963
Source: Stack Exchange (Stats), Count: 1637
Source: Stack Exchange (Physics), Count: 1460
```

### StackMathQA200K

- Location: `./data/stackmathqa200k`
- Contents:
  - `all.jsonl`: Containing 200k entries.
  - `meta.json`: Metadata and additional information.

```bash
Source: Stack Exchange (Math), Count: 197792
Source: MathOverflow, Count: 1367
Source: Stack Exchange (Stats), Count: 423
Source: Stack Exchange (Physics), Count: 418
```

### StackMathQA100K

- Location: `./data/stackmathqa100k`
- Contents:
  - `all.jsonl`: Containing 100k entries.
  - `meta.json`: Metadata and additional information.

```bash
Source: Stack Exchange (Math), Count: 99013
Source: MathOverflow, Count: 626
Source: Stack Exchange (Stats), Count: 182
Source: Stack Exchange (Physics), Count: 179
```

## Citation

We appreciate your use of StackMathQA in your work. If you find this repository helpful, please consider citing it and starring this repo. Feel free to contact zhangyif21@tsinghua.edu.cn or open an issue if you have any questions.

```bibtex
@misc{stackmathqa2024,
  title={StackMathQA: A Curated Collection of 2 Million Mathematical Questions and Answers Sourced from Stack Exchange},
  author={Zhang, Yifan},
  year={2024},
}
```

Provider:
agicorp
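Summary of original information

Dataset overview

Basic information

  • Name: StackMathQA
  • License: CC-BY-4.0
  • Language: English (en)
  • Task categories:
    • Text generation
    • Question answering
  • Size range: 1B<n<10B
  • Tags:
    • Mathematical reasoning
    • Reasoning
    • Fine-tuning
    • Pretraining
    • LLM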

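Dataset contents

  • Description: StackMathQA is a carefully curated collection of 2 million mathematical questions and answers, sourced from various Stack Exchange sites.
  • Data structure:
    • Question and list of answers format:
      • Each entry is structured as {"Q": "question", "A_List": ["answer1", "answer2", ...]}.
      • Total: 1,138,426 questions.
    • Question and single answer format:
      • Each line contains one question and one corresponding answer, structured as {"Q": "question", "A": "answer"}.
      • Total: 1,957,006 answers.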
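
The relationship between the two formats can be sketched as follows: one question-and-answer-list entry expands into one question-and-single-answer row per answer. This is only an illustration, not the authors' preprocessing code. The helper name `explode_answers` and the sample record are made up, the `meta` field is ignored, and the key `A_List` follows the structure shown above (the schema in the dataset card spells it `A_list`, so check the actual files).

```python
from typing import Dict, Iterator


def explode_answers(record: Dict[str, object]) -> Iterator[Dict[str, str]]:
    """Yield one {"Q", "A"} row for each answer in a {"Q", "A_List"} record."""
    question = record["Q"]
    for answer in record.get("A_List", []):
        yield {"Q": question, "A": answer}


# Made-up record for illustration; real entries hold LaTeX-encoded text.
example = {"Q": "What is $1+1$?", "A_List": ["$2$", "It equals $2$."]}
rows = list(explode_answers(example))
print(rows)
```

Applied across all files, this kind of expansion is consistent with the reported totals: 1,138,426 questions become 1,957,006 question-answer rows.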

Configuration options

  • Config names:
    • stackmathqa1600k
    • stackmathqa800k
    • stackmathqa400k
    • stackmathqa200k
    • stackmathqa100k
    • stackmathqafull-1q1a
    • stackmathqafull-qalist
  • Default config: stackmathqa1600k
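
Choosing among the selected subsets can be sketched with a small lookup. The helper `pick_config` is hypothetical and not part of the dataset or of any library. The sizes are the nominal ones implied by the config names, and the two `stackmathqafull-*` configs are left out because they are not size-selected subsets.

```python
# Nominal sizes implied by the config names (e.g. "1600k" means 1,600,000 entries).
SUBSET_SIZES = {
    "stackmathqa100k": 100_000,
    "stackmathqa200k": 200_000,
    "stackmathqa400k": 400_000,
    "stackmathqa800k": 800_000,
    "stackmathqa1600k": 1_600_000,
}


def pick_config(min_rows: int) -> str:
    """Return the smallest selected subset holding at least `min_rows` entries."""
    for name, size in sorted(SUBSET_SIZES.items(), key=lambda item: item[1]):
        if size >= min_rows:
            return name
    raise ValueError(f"no selected subset has {min_rows:,} or more entries")


print(pick_config(150_000))
```

The returned name can then be passed as the second argument to `load_dataset`.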

Dataset subsets

  • StackMathQA1600K:

    • Location: ./data/stackmathqa1600k
    • Contents:
      • all.jsonl: contains 1.6 million entries.
      • meta.json: metadata and additional information.
    • Sources and counts:
      • Stack Exchange (Math): 1244887
      • MathOverflow: 110041
      • Stack Exchange (Stats): 99878
      • Stack Exchange (Physics): 145194
  • Other subsets:

    • StackMathQA800K, StackMathQA400K, StackMathQA200K, and StackMathQA100K have a similar structure and contents.
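
As a quick sanity check, the per-source counts listed in the dataset card for each subset can be added up and compared with the nominal subset size. The sketch below only does arithmetic on those published numbers. The names `SOURCE_COUNTS`, `totals`, and `math_share` are chosen here for illustration.

```python
# Per-source counts as listed in the dataset card, in the order:
# Stack Exchange (Math), MathOverflow, Stack Exchange (Stats), Stack Exchange (Physics).
SOURCE_COUNTS = {
    "stackmathqa1600k": (1244887, 110041, 99878, 145194),
    "stackmathqa800k": (738850, 24276, 15046, 21828),
    "stackmathqa400k": (392940, 3963, 1637, 1460),
    "stackmathqa200k": (197792, 1367, 423, 418),
    "stackmathqa100k": (99013, 626, 182, 179),
}

# Total entries per subset and the fraction coming from Math Stack Exchange.
totals = {name: sum(counts) for name, counts in SOURCE_COUNTS.items()}
math_share = {name: counts[0] / totals[name] for name, counts in SOURCE_COUNTS.items()}

for name in SOURCE_COUNTS:
    print(f"{name}: total={totals[name]:,} math_share={math_share[name]:.1%}")
```

The totals match the nominal sizes exactly, and the Math Stack Exchange share grows as the subsets shrink: roughly 78% of the 1600K subset versus roughly 99% of the 100K subset.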

Data loading example

```python
from datasets import load_dataset

ds = load_dataset("math-ai/StackMathQA", "stackmathqa1600k")  # or any valid config_name
```
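
If the JSONL files have already been downloaded, they can also be read without the `datasets` library, since each line is one JSON object. The sketch below is a minimal reader using only the standard library. The helper name `iter_jsonl` and the in-memory sample are illustrative assumptions, and the local path in the closing comment presumes the repository layout described in the dataset card.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def iter_jsonl(handle: TextIO) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# In-memory stand-in for a file in the question-and-single-answer format.
sample = io.StringIO('{"Q": "q1", "A": "a1"}\n\n{"Q": "q2", "A": "a2"}\n')
records = list(iter_jsonl(sample))
print(len(records), records[0]["Q"])

# With a local copy, the same helper would be used as:
# with open("data/stackmathqa100k/all.jsonl", encoding="utf-8") as f:
#     for record in iter_jsonl(f):
#         ...
```

Reading line by line keeps memory use flat, which matters for the larger subsets.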

Collected summary
Dataset introduction
Background and challenges
Background overview
StackMathQA is a large-scale dataset of about 2 million mathematical questions and answers from Stack Exchange, designed for AI research and educational purposes. It includes multiple subsets for flexibility and is available in both question-and-answer-list and question-and-single-answer formats.
The content above was collected and summarized by 遇见数据集.