
karankhatavkar/polynomial-root-finding-dataset

Source: Hugging Face | Updated: 2026-04-21 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/karankhatavkar/polynomial-root-finding-dataset
Resource description:
# Polynomial Root Finding Dataset (With Out-of-Distribution Gaps)

## Dataset Description

This dataset provides a large, synthetically generated collection of polynomial equations (degrees 1 to 4) alongside their real roots. It is designed for benchmarking machine learning models (such as Transformers or Mixture Density Networks) on mathematical reasoning, continuous numerical embeddings, and rigorous **Out-of-Distribution (OoD) generalization**.

The dataset features carefully engineered interpolation and extrapolation "blind spots" (gaps), high-precision stress tests (close roots), and unsolvable states (complex roots), which makes it a useful stress test for uncertainty quantification.

### Data Format & Structure

The dataset is provided in `JSONL` (JSON Lines) format.

Because a single polynomial can have multiple distinct real roots, the dataset is **unrolled**. If an equation has 3 real roots, it appears as 3 separate rows that share the identical input string but carry different single-float targets.

* **`text_input` (string):** The polynomial equation formatted in standard algebraic notation. Leading zero-padding is removed to force active parsing of degree tokens (e.g., `+1.2345x^2 -3.1234x^1 +2.0000x^0`).
* **`target` (float):** A single real root of the equation. If the equation has 0 real roots, this is set to a dummy value of `0.0`.
* **`is_solvable` (int):** A binary flag (`1` or `0`). `0` indicates that the equation has only complex roots (no real solutions).
* **`degree` (int):** The highest degree of the polynomial (1, 2, 3, or 4).

### Equation Demographics

To prevent dilution of lower-degree equations during the unrolling process, the dataset uses inverse-frequency sampling. The final unrolled dataset contains roughly equal row representation across all four degrees:

* **Linear (Degree 1):** ~25% of rows
* **Quadratic (Degree 2):** ~25% of rows
* **Cubic (Degree 3):** ~25% of rows
* **Quartic (Degree 4):** ~25% of rows

---

## Dataset Splits & Out-of-Distribution (OoD) Design

The dataset contains a total of **672,841 unrolled rows** generated from 400,000 unique base equations. It is divided into four distinct splits to test interpolation and extrapolation.

### 1. `train` (468,609 rows) & `test_id` (66,863 rows)

* **The Safe Zone:** All real roots are generated strictly within `(-10, -5) ∪ (-2, +2) ∪ (+5, +10)`.
* These sets have strict boundaries: they contain no roots in the designated OoD gaps.

### 2. `ood_gap1` (68,646 rows) - The Interpolation Void

* All real roots in this split lie inside `[-5, -2] ∪ [+2, +5]`.
* *Purpose:* Tests a model's ability to interpolate inside a domain "blind spot" that was entirely absent during training.

### 3. `ood_gap2` (68,723 rows) - The Extrapolation Zone

* All real roots in this split lie at the extreme edges: `[-15, -10] ∪ [+10, +15]`.
* *Purpose:* Tests a model's ability to extrapolate mathematical rules beyond the numerical boundaries it was trained on.

---

## Mathematical Complexities & Stress Tests

To prevent models from learning trivial shortcuts, the training and in-distribution test sets are injected with specific mathematical edge cases:

1. **Close Root Precision (10% of applicable equations):** Approximately 16,000 equations have two real roots forced within a distance of **0.01 to 0.05** of each other. This stress-tests continuous numerical embeddings and a model's ability to resolve overlapping probability peaks without merging them.
2. **Unsolvable States (20% of even-degree equations):** Approximately 19,000 quadratics and quartics are generated from complex conjugate pairs, yielding **0 real roots** (`is_solvable = 0`). This trains the model to recognize undefined states and collapse its probability weights.
3. **Mixed Root States (50% of multi-root equations):** Many cubics and quartics are generated with a mix of real roots and complex conjugate pairs, forcing the architecture to isolate only the valid real targets.
4. **Coefficient Normalization:** All polynomial coefficients are normalized so that their absolute peaks land in the `[-10, 10]` window, mirroring the root domain and preventing exploding gradients during neural network training.

---

## Usage

Loading the dataset via Hugging Face `datasets`:

```python
from datasets import load_dataset

# Load all splits.
# The repository ID below follows this page's download link; the original
# snippet used "karankhatavkar/polynomial_roots".
dataset = load_dataset("karankhatavkar/polynomial-root-finding-dataset")

# Example: view the first training sample
print(dataset['train'][0])
# Output: {'text_input': '+1.0000x^2 -3.0000x^1 +2.0000x^0', 'target': 2.0, 'is_solvable': 1, 'degree': 2}
```
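To sanity-check a row, it can help to turn `text_input` back into coefficients and confirm that `target` is actually a root. The following is a minimal sketch, not part of the dataset's tooling: the helper names `parse_polynomial` and `real_roots`, the regular expression, and the tolerance `tol` are all assumptions, and the parser assumes every term follows the `+1.2345x^2` pattern shown in the format description.

```python
import re

import numpy as np

# Assumed term pattern: signed decimal coefficient, then "x^<power>".
TERM = re.compile(r"([+-]?\d+(?:\.\d+)?)x\^(\d+)")


def parse_polynomial(text_input):
    """Return coefficients ordered from the highest power down to x^0."""
    terms = {int(power): float(coef) for coef, power in TERM.findall(text_input)}
    if not terms:
        raise ValueError(f"no terms found in {text_input!r}")
    degree = max(terms)
    # Powers missing from the string are treated as zero coefficients.
    return [terms.get(power, 0.0) for power in range(degree, -1, -1)]


def real_roots(coeffs, tol=1e-6):
    """Real roots via numpy.roots; tol is an assumed cutoff on the imaginary part."""
    roots = np.roots(coeffs)
    return sorted(float(r.real) for r in roots if abs(r.imag) < tol)


# The sample row shown in the usage section.
row = {"text_input": "+1.0000x^2 -3.0000x^1 +2.0000x^0",
       "target": 2.0, "is_solvable": 1, "degree": 2}

coeffs = parse_polynomial(row["text_input"])
residual = abs(np.polyval(coeffs, row["target"]))  # |p(target)|, near 0 for a root
print(coeffs, real_roots(coeffs), residual)
```

Because coefficients are stored with four decimal places, the residual for a genuine root may be small rather than exactly zero, so a tolerance-based comparison is safer than an equality check.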
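The split boundaries described above can be expressed as a small classifier, which is handy for checking that a split contains only the roots it should. This is an illustrative sketch: `root_region`, `region_counts`, the label `"outside"`, and the sample rows are hypothetical. It relies on the stated intervals being symmetric about zero, so only the absolute value of the root matters.

```python
from collections import Counter


def root_region(root):
    """Map a real root to the region named in the split descriptions."""
    a = abs(root)  # all stated intervals are symmetric about 0
    if a < 2 or 5 < a < 10:
        return "in_distribution"  # (-10, -5) U (-2, +2) U (+5, +10)
    if 2 <= a <= 5:
        return "ood_gap1"         # [-5, -2] U [+2, +5]
    if 10 <= a <= 15:
        return "ood_gap2"         # [-15, -10] U [+10, +15]
    return "outside"              # beyond every documented interval


def region_counts(rows):
    """Count solvable rows per region, skipping rows with the dummy 0.0 target."""
    return Counter(root_region(r["target"]) for r in rows if r["is_solvable"] == 1)


# Hypothetical rows, for illustration only.
sample_rows = [
    {"target": 1.25, "is_solvable": 1},
    {"target": -3.5, "is_solvable": 1},
    {"target": 12.0, "is_solvable": 1},
    {"target": 0.0, "is_solvable": 0},
]
print(region_counts(sample_rows))
```

Rows with `is_solvable == 0` are filtered out first because their `target` is the placeholder `0.0`, which would otherwise be counted as an in-distribution root.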

Provider:
karankhatavkar