principia-collection
ModelScope community. Updated: 2026-01-09. Indexed: 2025-11-15.
Download link:
https://modelscope.cn/datasets/facebook/principia-collection
Resource description:
# Principia Collection

Principia Collection is a large-scale dataset designed to improve language models’ ability to derive **mathematical objects** from **STEM-related problem statements**. Each instance contains a problem statement, a ground truth answer, an answer type, and a topic label. The topics are drawn from [*Physics Subject Headings (PhySH)*](https://physh.org/) and [*Mathematics Subject Classification (MSC 2020)*](https://zbmath.org/static/msc2020.pdf). Both problem statements and ground truth labels were generated by our synthetic data generation pipeline, with [GPT-OSS-120B](https://huggingface.co/openai/gpt-oss-120b) as the proposer. The pipeline will be described in detail in the paper (to be released soon).
The Principia Collection includes **250K instances**, all of which require deriving *mathematical objects*. Practitioners can incorporate Principia Collection into their training data or extend the synthesis pipeline (e.g., by integrating additional topic taxonomies beyond MSC 2020 and PhySH) to generate problems that elicit deeper reasoning capabilities.
The dataset can be loaded as follows:
```python
from datasets import load_dataset
data = load_dataset("facebook/principia-collection", split="mathematical_object")
```
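As a minimal sketch of how the four per-instance fields might be inspected once the split is loaded: the column names `problem_statement`, `answer`, `answer_type`, and `topic` below are assumptions that this card does not confirm (check `data.column_names` for the real ones), and the records are placeholders, not rows from the dataset.

```python
from collections import Counter

# Placeholder records standing in for rows of the loaded split.
# The keys are ASSUMED column names; the values are illustrative only.
records = [
    {"problem_statement": "<problem text>", "answer": "<ground truth>",
     "answer_type": "Matrix", "topic": "<topic label>"},
    {"problem_statement": "<problem text>", "answer": "<ground truth>",
     "answer_type": "Interval", "topic": "<topic label>"},
    {"problem_statement": "<problem text>", "answer": "<ground truth>",
     "answer_type": "Matrix", "topic": "<topic label>"},
]


def count_by(rows, key):
    """Count how many rows carry each distinct value of `key`."""
    return Counter(row[key] for row in rows)


type_counts = count_by(records, "answer_type")
print(type_counts)
```

A loaded split yields one dict per row when iterated, so `count_by(data, "answer_type")` should work unchanged if the column name matches.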
We also release a **300K-instance** subset that shares the same topics but requires **numerical answers**. Additionally training on this subset provided consistent performance improvements on widely used reasoning benchmarks such as **SuperGPQA**, **GPQA-Diamond**, and **AIME**.
The numerical subset can be loaded as follows:
```python
from datasets import load_dataset
data = load_dataset("facebook/principia-collection", split="numerical")
```
---
## Statistics of Principia Collection
The following tables outline the statistics of our dataset:
### 1. *Mathematical Objects* Subset
#### Answer Type counts
| Answer Type | Count |
|--------------------|--------:|
| Equation | 42,714 |
| Inequality | 41,670 |
| Interval | 42,728 |
| Set | 46,869 |
| Matrix | 39,386 |
| Piecewise Function | 35,381 |
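To relate these counts to the "250K instances" figure quoted earlier, the short sketch below sums the table and prints each answer type's share. The numbers are copied from the table above; only the variable names are introduced here.

```python
# Counts copied from the "Answer Type counts" table of the mathematical-objects subset.
answer_type_counts = {
    "Equation": 42_714,
    "Inequality": 41_670,
    "Interval": 42_728,
    "Set": 46_869,
    "Matrix": 39_386,
    "Piecewise Function": 35_381,
}

total = sum(answer_type_counts.values())
shares = {name: count / total for name, count in answer_type_counts.items()}

print(f"total: {total:,}")
for name, share in shares.items():
    print(f"{name:<20}{share:6.1%}")
```

The total comes to 248,748, in line with the rounded 250K figure, and every answer type accounts for roughly 14% to 19% of the subset.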
#### Token length of the answer (mathematical object)
| Statistic | Value |
|--------|------:|
| Mean Answer Length | 135.3 tokens |
| Median Answer Length | 73 tokens |
| Q1 (25th percentile) | 28 tokens |
| Q3 (75th percentile) | 161 tokens |
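A quick reading of these figures: the mean is well above the median, which suggests a right-skewed length distribution with a tail of long answers. The sketch below computes the interquartile range and Tukey's conventional upper fence from the reported quartiles. The fence is a generic rule of thumb applied here for illustration, not a threshold the dataset authors describe.

```python
# Summary statistics copied from the table above (answer length in tokens).
mean_len = 135.3
median_len = 73
q1 = 28
q3 = 161

# Interquartile range: the spread of the middle 50% of answers.
iqr = q3 - q1

# Tukey's rule of thumb: values above Q3 + 1.5 * IQR are unusually long.
upper_fence = q3 + 1.5 * iqr

# A mean larger than the median is a common sign of right skew.
right_skewed = mean_len > median_len

print(f"IQR: {iqr} tokens, upper fence: {upper_fence} tokens, right-skewed: {right_skewed}")
```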
### 2. *Numerical* Subset
#### Answer Type counts
| Answer Type | Count |
|------------------------|--------:|
| Integer (no unit) | 54,077 |
| Integer (with unit) | 51,612 |
| Decimal (no unit) | 42,986 |
| Decimal (with unit) | 43,994 |
| Fraction (no unit) | 58,039 |
| Fraction (with unit) | 54,953 |
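The six categories above combine a number kind (integer, decimal, fraction) with the presence or absence of a unit. As a small sketch, the counts can be regrouped along each axis. The numbers are copied from the table; the tuple keys and variable names are introduced here for illustration.

```python
from collections import defaultdict

# (number kind, has unit) -> count, copied from the numerical-subset table.
numerical_counts = {
    ("Integer", False): 54_077,
    ("Integer", True): 51_612,
    ("Decimal", False): 42_986,
    ("Decimal", True): 43_994,
    ("Fraction", False): 58_039,
    ("Fraction", True): 54_953,
}

by_kind = defaultdict(int)
by_unit = defaultdict(int)
for (kind, has_unit), count in numerical_counts.items():
    by_kind[kind] += count
    by_unit["with unit" if has_unit else "no unit"] += count

numerical_total = sum(numerical_counts.values())
print(f"total: {numerical_total:,}")
print(dict(by_kind))
print(dict(by_unit))
```

The total is 305,661, consistent with the rounded 300K figure. Answers split almost evenly between those with a unit (150,559) and those without (155,102).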
---
### Performance from training on the Principia Collection
To be released with the paper soon!
---
### Citation
A detailed technical report is forthcoming. Please check back soon for the official citation.
Provider:
maas
Created:
2025-11-10



