karankhatavkar/polynomial-root-finding-dataset

Name: karankhatavkar/polynomial-root-finding-dataset
Creator: karankhatavkar
Published: 2026-04-21 12:57:28
License: No description provided

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/karankhatavkar/polynomial-root-finding-dataset


Resource description:

# Polynomial Root Finding Dataset (With Out-of-Distribution Gaps)

## Dataset Description

This dataset provides a large, synthetically generated collection of polynomial equations (degrees 1 to 4) alongside their real roots. It is designed for benchmarking machine learning models (such as Transformers or Mixture Density Networks) on mathematical reasoning, continuous numerical embeddings, and **Out-of-Distribution (OoD) generalization**.

The dataset features deliberately engineered interpolation and extrapolation "blind spots" (gaps), high-precision stress tests (close roots), and unsolvable states (equations with only complex roots), which also makes it a useful stress test for uncertainty quantification.

### Data Format & Structure

The dataset is provided in `JSONL` (JSON Lines) format. Because a single polynomial can have multiple distinct real roots, the dataset is **unrolled**: an equation with 3 real roots appears as 3 separate rows that share the same input string but each carry a different single-float target.

* **`text_input` (string):** The polynomial equation in standard algebraic notation. Leading zero-padding is removed to force active parsing of degree tokens (e.g., `+1.2345x^2 -3.1234x^1 +2.0000x^0`).
* **`target` (float):** A single real root of the equation. If the equation has 0 real roots, this is set to a dummy value of `0.0`.
* **`is_solvable` (int):** A binary flag (`1` or `0`). `0` indicates that the equation has only complex roots (no real solutions).
* **`degree` (int):** The highest degree of the polynomial (1, 2, 3, or 4).

### Equation Demographics

Higher-degree equations can have more real roots and therefore produce more rows when unrolled. To prevent lower-degree equations from being diluted, the dataset uses inverse-frequency sampling. The final unrolled dataset contains roughly equal row representation across all four degrees:

* **Linear (Degree 1):** ~25% of rows
* **Quadratic (Degree 2):** ~25% of rows
* **Cubic (Degree 3):** ~25% of rows
* **Quartic (Degree 4):** ~25% of rows

---

## Dataset Splits & Out-of-Distribution (OoD) Design

The dataset contains a total of **672,841 unrolled rows** generated from 400,000 unique base equations. It is divided into four splits to test interpolation and extrapolation.

### 1. `train` (468,609 rows) & `test_id` (66,863 rows)

* **The Safe Zone:** All real roots are generated strictly within `(-10, -5) ∪ (-2, +2) ∪ (+5, +10)`.
* These sets have strict boundaries: they contain no roots in the designated OoD gaps.

### 2. `ood_gap1` (68,646 rows) - The Interpolation Void

* All real roots in this split lie inside `[-5, -2] ∪ [+2, +5]`.
* *Purpose:* Tests a model's ability to interpolate inside a domain "blind spot" that was entirely absent during training.

### 3. `ood_gap2` (68,723 rows) - The Extrapolation Zone

* All real roots in this split lie at the extreme edges: `[-15, -10] ∪ [+10, +15]`.
* *Purpose:* Tests a model's ability to extrapolate mathematical rules beyond the numerical boundaries it was trained on.

---

## Mathematical Complexities & Stress Tests

To prevent models from learning trivial shortcuts, specific mathematical edge cases are injected into the training and in-distribution test sets:

1. **Close Root Precision (10% of applicable equations):** Approximately 16,000 equations have two real roots forced to lie within **0.01 to 0.05** of each other. This stress-tests continuous numerical embeddings and a model's ability to resolve overlapping probability peaks without merging them.
2. **Unsolvable States (20% of even-degree equations):** Approximately 19,000 quadratics and quartics are generated from complex conjugate pairs, yielding **0 real roots** (`is_solvable = 0`). This trains the model to recognize undefined states and collapse its probability weights.
3. **Mixed Root States (50% of multi-root equations):** Many cubics and quartics are generated with a mix of real roots and complex conjugate pairs, forcing the architecture to isolate only the valid real targets.
4. **Coefficient Normalization:** All polynomial coefficients are normalized so that their absolute peaks land in the `[-10, 10]` window, mirroring the root domain and preventing exploding gradients during neural network training.

---

## Usage

Loading the dataset via Hugging Face `datasets`:

```python
from datasets import load_dataset

# Load all splits (repository ID as listed on this page)
dataset = load_dataset("karankhatavkar/polynomial-root-finding-dataset")

# Example: View the first training sample
print(dataset['train'][0])
# Output: {'text_input': '+1.0000x^2 -3.0000x^1 +2.0000x^0', 'target': 2.0, 'is_solvable': 1, 'degree': 2}
```
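To make the row format concrete, the following is a minimal sketch of how a `text_input` string might be parsed back into coefficients and checked against its `target`. It assumes every term follows the `±c x^k` pattern of the card's example, with no space between the sign and the number; the helper names `parse_polynomial`, `residual`, and `real_roots` are hypothetical and not part of the dataset.

```python
import re
import numpy as np

# One term such as "+1.2345x^2" or "-3.1234x^1" (pattern assumed from the card's example).
TERM_RE = re.compile(r"([+-]?\d+(?:\.\d+)?)x\^(\d+)")

def parse_polynomial(text_input: str) -> np.ndarray:
    """Return coefficients ordered from the highest power down to x^0."""
    terms = {int(power): float(coef) for coef, power in TERM_RE.findall(text_input)}
    if not terms:
        raise ValueError(f"no terms found in {text_input!r}")
    degree = max(terms)
    # Powers that do not appear in the string are treated as zero coefficients.
    return np.array([terms.get(p, 0.0) for p in range(degree, -1, -1)])

def residual(row: dict) -> float:
    """Absolute value of the polynomial evaluated at the row's target."""
    coeffs = parse_polynomial(row["text_input"])
    return abs(float(np.polyval(coeffs, row["target"])))

def real_roots(coeffs: np.ndarray, tol: float = 1e-9) -> np.ndarray:
    """Sorted real roots; roots whose imaginary part exceeds tol are dropped."""
    roots = np.roots(coeffs)
    return np.sort(roots[np.abs(roots.imag) < tol].real)

# The sample row shown in the Usage section.
row = {"text_input": "+1.0000x^2 -3.0000x^1 +2.0000x^0",
       "target": 2.0, "is_solvable": 1, "degree": 2}
coeffs = parse_polynomial(row["text_input"])
print(coeffs)              # coefficients for x^2, x^1, x^0
print(residual(row))       # near zero when target is a root
print(real_roots(coeffs))  # every real root of the same equation
```

Because the dataset is unrolled, `real_roots` would return all targets that share one `text_input`, while each row stores only one of them. Rows with `is_solvable == 0` carry the dummy target `0.0`, so a residual check is only meaningful on rows where `is_solvable == 1`.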
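The split boundaries listed in the card can be expressed as a small classifier, which may help when checking that a loaded split contains only the roots it is supposed to. This is a sketch under the interval definitions stated in the card; the function names `root_region` and `misplaced_rows`, the region labels, and the toy rows are hypothetical illustrations rather than part of the dataset.

```python
def root_region(x: float) -> str:
    """Map a real root to the region named in the card's split design."""
    a = abs(x)  # every interval on the card is symmetric about zero
    if a < 2 or 5 < a < 10:
        return "in_distribution"  # (-10, -5) U (-2, +2) U (+5, +10)
    if 2 <= a <= 5:
        return "ood_gap1"         # [-5, -2] U [+2, +5]
    if 10 <= a <= 15:
        return "ood_gap2"         # [-15, -10] U [+10, +15]
    return "out_of_range"

def misplaced_rows(rows, expected_region: str) -> list:
    """Return solvable rows whose target falls outside the expected region.

    Rows with is_solvable == 0 are skipped because their target is the
    dummy value 0.0, which would otherwise count as in-distribution.
    """
    return [
        r for r in rows
        if r["is_solvable"] == 1 and root_region(r["target"]) != expected_region
    ]

# Toy rows for illustration only (not taken from the dataset).
toy_rows = [
    {"target": 1.5, "is_solvable": 1},
    {"target": -7.25, "is_solvable": 1},
    {"target": 0.0, "is_solvable": 0},  # dummy target, ignored
    {"target": 3.0, "is_solvable": 1},  # lies inside the interpolation gap
]
print(misplaced_rows(toy_rows, "in_distribution"))
```

On real data, the same check could be run as `misplaced_rows(dataset["train"], "in_distribution")`, or with `"ood_gap1"` and `"ood_gap2"` for the corresponding splits, assuming the split names match those listed in the card.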


Provider:

karankhatavkar
