karankhatavkar/polynomial-root-finding-dataset
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/karankhatavkar/polynomial-root-finding-dataset
# Polynomial Root Finding Dataset (With Out-of-Distribution Gaps)
## Dataset Description
This dataset provides a large, synthetically generated collection of polynomial equations of degree 1 to 4, each paired with its real roots. It is designed for benchmarking machine learning models (such as Transformers or Mixture Density Networks) on mathematical reasoning, continuous numerical embeddings, and **Out-of-Distribution (OoD) generalization**.
The dataset features deliberately placed interpolation and extrapolation "blind spots" (gaps), high-precision cases (close roots), and unsolvable states (equations with only complex roots), which also makes it a stress test for uncertainty quantification.
### Data Format & Structure
The dataset is provided in `JSONL` (JSON Lines) format.
Because a single polynomial can have multiple distinct real roots, the dataset is **unrolled**. An equation with 3 real roots appears as 3 separate rows that share the identical input string, each with a different single float target.
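To recover the full root set of an equation, the unrolled rows can be grouped by input string. The sketch below assumes each row is a plain dict with the `text_input`, `target`, and `is_solvable` fields of this dataset; `collect_roots` is a hypothetical helper, and the second and third sample rows are illustrative rather than copied from the data.

```python
from collections import defaultdict

def collect_roots(rows):
    """Group unrolled rows by equation and gather the real roots of each."""
    roots_by_equation = defaultdict(list)
    for row in rows:
        # Touch the key first so unsolvable equations still appear, with [].
        roots = roots_by_equation[row["text_input"]]
        # Skip the 0.0 placeholder that marks an equation with no real roots.
        if row["is_solvable"] == 1:
            roots.append(row["target"])
    return {eq: sorted(r) for eq, r in roots_by_equation.items()}

# Illustrative rows only (assumed, not taken from the dataset files).
rows = [
    {"text_input": "+1.0000x^2 -3.0000x^1 +2.0000x^0", "target": 2.0, "is_solvable": 1, "degree": 2},
    {"text_input": "+1.0000x^2 -3.0000x^1 +2.0000x^0", "target": 1.0, "is_solvable": 1, "degree": 2},
    {"text_input": "+1.0000x^2 +0.0000x^1 +1.0000x^0", "target": 0.0, "is_solvable": 0, "degree": 2},
]
grouped = collect_roots(rows)
```

Grouping this way is useful when scoring a model that predicts several roots at once, since each prediction can then be compared against the nearest true root of the same equation.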
* **`text_input` (string):** The polynomial equation in standard algebraic notation, e.g. `+1.2345x^2 -3.1234x^1 +2.0000x^0`. Leading zero-padding is removed to force active parsing of degree tokens.
* **`target` (float):** A single real root of the equation. If the equation has no real roots, this is set to a dummy value of `0.0`. Because `0.0` also lies inside the in-distribution root range, use `is_solvable`, not `target`, to detect such rows.
* **`is_solvable` (int):** A binary flag (`1` or `0`). `0` indicates that the equation has only complex roots (no real solutions).
* **`degree` (int):** The degree of the polynomial (1, 2, 3, or 4).
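A row can be sanity-checked by parsing `text_input` back into coefficients and evaluating the polynomial at `target`. This is a minimal sketch, assuming every term carries an explicit sign and an explicit `x^k` factor as in the examples shown here; `parse_polynomial` and the `TERM` pattern are hypothetical names, not part of the dataset tooling.

```python
import re
import numpy as np

# One term looks like "+1.2345x^2": signed decimal coefficient, then x^power.
TERM = re.compile(r"([+-]\d+(?:\.\d+)?)x\^(\d+)")

def parse_polynomial(text_input):
    """Return coefficients ordered from the highest power down to x^0."""
    terms = [(int(power), float(coef)) for coef, power in TERM.findall(text_input)]
    degree = max(power for power, _ in terms)
    coeffs = [0.0] * (degree + 1)
    for power, coef in terms:
        # Index 0 holds the leading coefficient, as np.polyval expects.
        coeffs[degree - power] += coef
    return coeffs

coeffs = parse_polynomial("+1.0000x^2 -3.0000x^1 +2.0000x^0")
# A true root should give a residual close to zero.
residual = np.polyval(coeffs, 2.0)
```

Since the coefficients are printed with four decimals, residuals for real dataset rows may be small but nonzero, so a tolerance is more appropriate than an exact comparison.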
### Equation Demographics
To prevent dilution of lower-degree equations during the unrolling process, the dataset uses inverse-frequency sampling. The final unrolled dataset contains roughly equal row representation across all four degrees:
* **Linear (Degree 1):** ~25% of rows
* **Quadratic (Degree 2):** ~25% of rows
* **Cubic (Degree 3):** ~25% of rows
* **Quartic (Degree 4):** ~25% of rows
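The idea behind inverse-frequency sampling can be sketched as follows: if a degree contributes more rows per equation after unrolling, it is sampled proportionally less often. The function name and the rows-per-equation averages below are hypothetical; the card does not state the actual values or the exact procedure used.

```python
def inverse_frequency_weights(rows_per_equation):
    """Sampling weight per degree, proportional to 1 / (rows per equation)."""
    inverse = {d: 1.0 / r for d, r in rows_per_equation.items()}
    total = sum(inverse.values())
    return {d: v / total for d, v in inverse.items()}

# Hypothetical averages, NOT taken from the dataset: the real values depend on
# how many unsolvable and mixed-root equations each degree contains.
rows_per_equation = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
weights = inverse_frequency_weights(rows_per_equation)

# Expected share of unrolled rows per degree under these weights.
raw = {d: weights[d] * rows_per_equation[d] for d in weights}
row_share = {d: v / sum(raw.values()) for d, v in raw.items()}
```

Under these assumed averages, linear equations are drawn four times as often as quartics, and every degree ends up with the same expected share of rows.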
---
## Dataset Splits & Out-of-Distribution (OoD) Design
The dataset contains a total of **672,841 unrolled rows** generated from 400,000 unique base equations. It is divided into four distinct splits to rigorously test interpolation and extrapolation.
### 1. `train` (468,609 rows) & `test_id` (66,863 rows)
* **The Safe Zone:** All real roots are generated strictly within `(-10, -5) ∪ (-2, +2) ∪ (+5, +10)`.
* These splits contain no roots in the designated OoD gaps. The intervals are open, so the endpoints themselves belong to the gap splits.
### 2. `ood_gap1` (68,646 rows) - The Interpolation Void
* All real roots in this split are explicitly placed inside the `[-5, -2] ∪ [+2, +5]` bounds.
* *Purpose:* Tests a model's ability to interpolate inside a domain "blind spot" that was entirely absent during training.
### 3. `ood_gap2` (68,723 rows) - The Extrapolation Zone
* All real roots in this split are explicitly placed at the extreme edges: `[-15, -10] ∪ [+10, +15]`.
* *Purpose:* Tests a model's ability to extrapolate mathematical rules beyond the numerical boundaries it was trained on.
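The interval definitions above can be turned into a small classifier, for example to check that a split contains only the roots it should. This is a sketch based solely on the intervals listed in this section; `root_region` and its return labels are hypothetical names.

```python
def root_region(root):
    """Map a real root to the split whose interval contains it."""
    magnitude = abs(root)  # every interval is symmetric around zero
    if magnitude < 2 or 5 < magnitude < 10:
        return "in_distribution"  # train / test_id: open intervals
    if 2 <= magnitude <= 5:
        return "ood_gap1"  # closed interval, endpoints included
    if 10 <= magnitude <= 15:
        return "ood_gap2"  # closed interval, endpoints included
    return "outside"  # beyond +/-15, not covered by any split

regions = [root_region(r) for r in (0.0, -7.5, 3.0, -5.0, 10.0, -12.0, 20.0)]
```

Rows with `is_solvable = 0` should be filtered out before applying such a check, because their placeholder target of `0.0` would otherwise be counted as an in-distribution root.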
---
## Mathematical Complexities & Stress Tests
To prevent models from learning trivial shortcuts, the training and in-distribution test sets are injected with specific mathematical edge cases:
1. **Close Root Precision (10% of applicable equations):** Approximately 16,000 equations have two real roots forced within a microscopic distance of **0.01 to 0.05** of each other. This stress tests continuous numerical embeddings and a model's ability to resolve overlapping probability peaks without merging them.
2. **Unsolvable States (20% of even-degree equations):** Approximately 19,000 quadratics and quartics are explicitly generated from complex conjugate pairs, yielding **0 real roots** (`is_solvable = 0`). This trains the model to recognize undefined states and collapse its probability weights.
3. **Mixed Root States (50% of multi-root equations):** Many cubics and quartics are generated with a mix of real roots and complex conjugate pairs, forcing the architecture to isolate only the valid real targets.
4. **Coefficient Normalization:** All polynomial coefficients are normalized so that their peak absolute values fall within the `[-10, 10]` window, mirroring the root domain and preventing exploding gradients during neural network training.
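The edge cases above can be illustrated by building polynomials from chosen roots. The sketch below is not the dataset's generator: `normalize` and `format_equation` are hypothetical helpers, and rescaling only when the peak exceeds the limit is an assumption about how the normalization might work. Multiplying all coefficients by a constant leaves the roots unchanged.

```python
import numpy as np

def normalize(coeffs, limit=10.0):
    """Rescale coefficients so the largest magnitude is at most `limit`."""
    coeffs = np.asarray(coeffs, dtype=float)
    peak = np.max(np.abs(coeffs))
    return coeffs * (limit / peak) if peak > limit else coeffs

def format_equation(coeffs):
    """Render coefficients (highest power first) in the `text_input` style."""
    degree = len(coeffs) - 1
    return " ".join(f"{c:+.4f}x^{degree - i}" for i, c in enumerate(coeffs))

# Close roots: two real roots 0.03 apart (inside the 0.01 to 0.05 band).
close = normalize(np.poly([6.00, 6.03]))
# Unsolvable: the conjugate pair 1 +/- 2i gives x^2 - 2x + 5, with no real roots.
unsolvable = normalize(np.real(np.poly([1 + 2j, 1 - 2j])))

close_roots = np.sort(np.roots(close).real)
unsolvable_roots = np.roots(unsolvable)
text = format_equation(unsolvable)
```

A mixed-root cubic or quartic could be built the same way by passing one or two real roots together with a conjugate pair to `np.poly`.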
---
## Usage
Loading the dataset via Hugging Face `datasets`:
```python
from datasets import load_dataset

# Load all splits (repository ID as in the dataset URL above)
dataset = load_dataset("karankhatavkar/polynomial-root-finding-dataset")

# Example: view the first training sample
print(dataset['train'][0])
# Output: {'text_input': '+1.0000x^2 -3.0000x^1 +2.0000x^0', 'target': 2.0, 'is_solvable': 1, 'degree': 2}
```
Provider:
karankhatavkar



