yoshitomo-matsubara/srsd-feynman_hard_dummy

Name: yoshitomo-matsubara/srsd-feynman_hard_dummy
Creator: yoshitomo-matsubara
Published: 2024-03-05 07:24:00
License: No description available

Hugging Face. Updated 2024-03-05. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/yoshitomo-matsubara/srsd-feynman_hard_dummy


Resource description:

---
pretty_name: SRSD-Feynman (Hard w/ Dummy Variables)
annotations_creators:
- expert
language_creators:
- expert-generated
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- tabular-regression
task_ids: []
---

# Dataset Card for SRSD-Feynman (Hard set with Dummy Variables)

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** https://github.com/omron-sinicx/srsd-benchmark
- **Paper:** [Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery](https://arxiv.org/abs/2206.10540)
- **Point of Contact:** [Yoshitaka Ushiku](mailto:yoshitaka.ushiku@sinicx.com)

### Dataset Summary

Our SRSD (Feynman) datasets are designed to discuss the performance of Symbolic Regression for Scientific Discovery.
We carefully reviewed the properties of each formula and its variables in [the Feynman Symbolic Regression Database](https://space.mit.edu/home/tegmark/aifeynman.html) to design reasonably realistic sampling ranges of values, so that our SRSD datasets can be used for evaluating the potential of SRSD, such as whether or not an SR method can (re)discover physical laws from such datasets.

This is the ***Hard set with dummy variables*** of our SRSD-Feynman datasets, which consists of the following 50 different physics formulas:

[![Click here to open a PDF file](problem_table.png)](https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_hard_dummy/resolve/main/problem_table.pdf)

Dummy variables were randomly generated, and symbolic regression models should not use the dummy variables as part of their predictions.
The following datasets contain

**1 dummy variable**: I.15.3x, I.30.3, II.6.15a, II.11.17, II.11.28, II.13.23, II.13.34, II.24.17, B1, B6, B12, B16, B17

**2 dummy variables**: I.6.20, I.6.20b, I.9.18, I.15.3t, I.29.16, I.34.14, I.39.22, I.44.4, II.11.20, II.11.27, II.35.18, III.9.52, III.10.19, III.21.20, B2, B3, B7, B9

**3 dummy variables**: I.6.20a, I.32.17, I.37.4, I.40.1, I.41.16, I.50.26, II.6.15b, II.35.21, II.36.38, III.4.33, B4, B5, B10, B11, B13, B14, B15, B19, B20

More details of these datasets are provided in [the paper and its supplementary material](https://openreview.net/forum?id=qrUdrXsiXX).

### Supported Tasks and Leaderboards

Symbolic Regression

## Dataset Structure

### Data Instances

Tabular data + Ground-truth equation per equation

Tabular data: (num_samples, num_variables+1), where the last (rightmost) column indicates the output of the target function for the given variables.
Note that the number of variables (`num_variables`) varies from equation to equation.

Ground-truth equation: *pickled* symbolic representation (equation with symbols in sympy) of the target function.

### Data Fields

For each dataset, we have

1. train split (txt file, whitespace as a delimiter)
2. val split (txt file, whitespace as a delimiter)
3. test split (txt file, whitespace as a delimiter)
4. true equation (pickle file for sympy object)

### Data Splits

- train: 8,000 samples per equation
- val: 1,000 samples per equation
- test: 1,000 samples per equation

## Dataset Creation

### Curation Rationale

We chose target equations based on [the Feynman Symbolic Regression Database](https://space.mit.edu/home/tegmark/aifeynman.html).

### Annotations

#### Annotation process

We significantly revised the sampling range for each variable from the annotations in the Feynman Symbolic Regression Database.
First, we checked the properties of each variable and treated physical constants (e.g., light speed, gravitational constant) as constants.
Next, variable ranges were defined to correspond to a typical physics experiment that confirms the physical phenomenon for each equation.
In cases where a specific experiment is difficult to assume, ranges were set within which the corresponding physical phenomenon can be seen.
Generally, the ranges are set to be sampled on log scales within their orders (10^2) so that both large and small changes in value are captured as the order changes.
Variables such as angles, for which a linear distribution is expected, are set to be sampled uniformly.
In addition, variables that take a specific sign were set to be sampled within that range.

#### Who are the annotators?

The main annotators are
- Naoya Chiba (@nchiba)
- Ryo Igarashi (@rigarash)

### Personal and Sensitive Information

N/A

## Considerations for Using the Data

### Social Impact of Dataset

We annotated this dataset, assuming typical physical experiments. The dataset will encourage research on symbolic regression for scientific discovery (SRSD) and help researchers discuss the potential of symbolic regression methods towards data-driven scientific discovery.

### Discussion of Biases

Our choices of target equations are based on [the Feynman Symbolic Regression Database](https://space.mit.edu/home/tegmark/aifeynman.html), which is focused on the field of physics.

### Other Known Limitations

Some variables used in our datasets represent counts, which should be treated as integers.
Due to the capacity of a 32-bit integer, however, we treated some of these variables as floats, e.g., the number of molecules (10^{23} - 10^{25}).

## Additional Information

### Dataset Curators

The main curators are
- Naoya Chiba (@nchiba)
- Ryo Igarashi (@rigarash)

### Licensing Information

Creative Commons Attribution 4.0

### Citation Information

[[OpenReview](https://openreview.net/forum?id=qrUdrXsiXX)] [[Video](https://www.youtube.com/watch?v=MmeOXuUUAW0)] [[Preprint](https://arxiv.org/abs/2206.10540)]

```bibtex
@article{matsubara2024rethinking,
  title={Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery},
  author={Matsubara, Yoshitomo and Chiba, Naoya and Igarashi, Ryo and Ushiku, Yoshitaka},
  journal={Journal of Data-centric Machine Learning Research},
  year={2024},
  url={https://openreview.net/forum?id=qrUdrXsiXX}
}
```

### Contributions

Authors:
- Yoshitomo Matsubara (@yoshitomo-matsubara)
- Naoya Chiba (@nchiba)
- Ryo Igarashi (@rigarash)
- Yoshitaka Ushiku (@yushiku)
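The tabular layout described in the card (whitespace-delimited text, target in the rightmost column) could be read roughly as follows. This is only a sketch: the helper name `load_split` is hypothetical, and the two in-memory rows are made-up stand-ins for a real split file, whose path is not given here.

```python
import io

import numpy as np


def load_split(source):
    """Read one whitespace-delimited split and separate inputs from the target.

    `source` may be a file path or a file-like object. The last column is
    taken as the target, as described in the dataset card.
    """
    table = np.atleast_2d(np.loadtxt(source))  # whitespace is the default delimiter
    features, target = table[:, :-1], table[:, -1]
    return features, target


# Made-up stand-in for a split file: 3 input columns + 1 target column.
demo_file = io.StringIO("1.0 2.0 0.5 3.0\n4.0 5.0 0.1 9.0\n")
X, y = load_split(demo_file)
print(X.shape, y.shape)
```

Because `num_variables` differs between equations, the number of feature columns is inferred from the file rather than fixed in advance.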

Provided by:

yoshitomo-matsubara

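Summary of Original Information

Dataset Overview

Dataset Name

Name: SRSD-Feynman (Hard w/ Dummy Variables)
Alias: Hard set with Dummy Variables

Dataset Attributes

Language: English (en)
License: Creative Commons Attribution 4.0 International (cc-by-4.0)
Multilinguality: monolingual
Size category: 100K<n<1M
Task category: tabular regression

Dataset Contents

Purpose: evaluating the performance of symbolic regression for scientific discovery.
Composition: 50 different physics formulas, each with training, validation, and test splits plus a symbolic representation of the true equation.
Special feature: contains randomly generated dummy variables, which symbolic regression models should not use in their predictions.

Dataset Structure

Data instances: tabular data plus the ground-truth equation for each formula.
- Tabular data: (num_samples, num_variables+1), where the last column is the output of the target function for the given variables.
- Ground-truth equation: a serialized (pickled) sympy object.
Data fields: each dataset contains text files for the train, val, and test splits, and a pickle file for the true equation.
Data splits:
- Train: 8,000 samples per formula
- Validation: 1,000 samples per formula
- Test: 1,000 samples per formula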
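Since the ground-truth equation is stored as a pickled sympy object, the following sketch shows one way such an object could be restored and evaluated. The expression `x0 * sin(x1)` and the symbol names are hypothetical stand-ins, and the pickle round trip happens in memory instead of reading a real file. Note that unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.

```python
import pickle

import sympy

# Hypothetical stand-in for a ground-truth equation (not taken from the dataset).
x0, x1 = sympy.symbols("x0 x1")
stand_in_eq = x0 * sympy.sin(x1)

# Round trip through pickle in memory; with a real file one would use
# pickle.load(open(path, "rb")) on a trusted source.
blob = pickle.dumps(stand_in_eq)
true_eq = pickle.loads(blob)

# Turn the symbolic expression into a plain Python function.
variables = sorted(true_eq.free_symbols, key=lambda s: s.name)
f = sympy.lambdify(variables, true_eq, modules="math")
value = f(2.0, 0.0)  # evaluates 2.0 * sin(0.0)
print(variables, value)
```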

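Dataset Creation

Source data: based on the Feynman Symbolic Regression Database.
Annotation process: carried out by experts, who significantly revised the sampling range of each variable to match typical physics experiments.
Annotators: Naoya Chiba (@nchiba) and Ryo Igarashi (@rigarash)

Considerations for Use

Social impact: encourages research and discussion on symbolic regression for scientific discovery.
Discussion of biases: the target equations were chosen from the field of physics.
Known limitations: some variables should be treated as integers, but because of technical constraints some of them are handled as floats.

Additional Information

Dataset curators: Naoya Chiba and Ryo Igarashi
Contributors: Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku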

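Collected Summary

Dataset Introduction

Construction

This dataset is derived from a substantial reworking of the Feynman Symbolic Regression Database and is intended to provide a more challenging benchmark for symbolic regression in scientific discovery. During construction, the researchers reviewed the properties of each physics formula and its variables and set, for each variable, a reasonable sampling range that matches a typical physics experiment. Physical constants (e.g., the speed of light, the gravitational constant) are held fixed, while variables are sampled on a log scale or uniformly depending on their physical meaning, so that the data reflect how the physical phenomenon varies. A distinctive feature is the addition of randomly generated dummy variables, which are unrelated to the target function and test whether a symbolic regression model can identify and exclude irrelevant features. The dataset covers 50 different physics formulas, each with a training set (8,000 samples), a validation set (1,000 samples), and a test set (1,000 samples), stored as text files that can be loaded directly.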
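The two sampling schemes mentioned above (log scale for most variables, uniform for variables such as angles) can be sketched as follows. This is not the authors' generation script: the function names, the seed, and the specific ranges are assumptions chosen purely for illustration.

```python
import numpy as np


def sample_log_uniform(low, high, size, rng):
    """Draw values whose base-10 logarithm is uniform on [log10(low), log10(high))."""
    return 10.0 ** rng.uniform(np.log10(low), np.log10(high), size)


def sample_uniform(low, high, size, rng):
    """Draw values uniformly on [low, high), e.g. for angles."""
    return rng.uniform(low, high, size)


rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Hypothetical ranges: a positive quantity spanning two orders of magnitude,
# and an angle in radians.
mass = sample_log_uniform(1e-2, 1e0, 8000, rng)
angle = sample_uniform(0.0, 2.0 * np.pi, 8000, rng)

print(mass.min(), mass.max())
print(angle.min(), angle.max())
```

With log-scale sampling, roughly as many points fall between 0.01 and 0.1 as between 0.1 and 1, whereas uniform sampling over the same interval would place about 90% of the points in the upper decade.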

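Usage

The dataset can be obtained directly from Hugging Face. The data for each formula are organized as text files; the train, validation, and test splits are whitespace-delimited, and the last column is the output of the target function. Note that the number of variables differs from formula to formula and that dummy variables are mixed into the input features; models should not use them for prediction. A symbolic regression algorithm (e.g., genetic programming or a neural-network-based method) can be fitted to the data, and the result compared against the provided true symbolic equation (stored in pickle format) to assess how accurately and concisely the model recovers the physical law. Researchers can also refer to the experimental settings in the accompanying paper to reproduce baseline results or explore new methods.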
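One simple way to check whether a candidate equation relies on a dummy variable is to compare its free symbols with those of the true equation. The sketch below assumes that both expressions are available as sympy objects sharing the same symbol names; the helper `extra_symbols` and the example expressions are hypothetical.

```python
import sympy


def extra_symbols(predicted, true_eq):
    """Return symbols used by `predicted` that do not occur in `true_eq`.

    Under the assumption that dummy variables never appear in the true
    equation, a non-empty result means the prediction uses at least one
    dummy variable (or some other unexpected symbol).
    """
    return predicted.free_symbols - true_eq.free_symbols


x0, x1, x2 = sympy.symbols("x0 x1 x2")
true_eq = x0 * x1                  # here x2 plays the role of a dummy variable
good = x1 * x0                     # same law, different term order
bad = x0 * x1 + 0.01 * x2          # leaks the dummy variable

print(extra_symbols(good, true_eq))
print(extra_symbols(bad, true_eq))
```

This only checks which variables are used. Judging whether two expressions are equivalent is a separate problem and is not covered by this sketch.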

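Background and Challenges

Background

Symbolic regression bridges data-driven modeling and scientific discovery: it aims to extract concise mathematical expressions from observed data and thereby reveal underlying physical laws. Against this background, the SRSD-Feynman (Hard set with Dummy Variables) dataset, created in 2022 by Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, and Yoshitaka Ushiku, focuses on evaluating symbolic regression methods on scientific discovery tasks. Built on the Feynman Symbolic Regression Database, it contains 50 physics formulas and assigns each variable a sampling range that matches a realistic physics experiment. By adding randomly generated dummy variables, the dataset tests whether a symbolic regression model can identify and reconstruct the target physical law from features that include irrelevant variables.

Current Challenges

The domain challenge addressed by this dataset is the robustness of symbolic regression in scientific discovery: conventional methods often perform well on idealized data, but real physical data frequently contain irrelevant variables or redundant features, so a model must be able to distinguish the core variables from distractors. Construction also involved several difficulties. First, defining a reasonable sampling range for each variable of each formula requires a deep understanding of the variable's physical meaning and of typical experimental setups. Second, for quantities with very large magnitudes, such as the number of molecules, the capacity of a 32-bit integer forces a floating-point representation, which may introduce precision error. Finally, the randomly generated dummy variables must have no statistical association with the target function, so that models cannot exploit spurious correlations; this calls for careful randomization and validation.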
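The integer-capacity limitation can be made concrete with a short calculation. The sketch below assumes a 64-bit float is used for illustration; the source only states that such variables are treated as floats, not which float type.

```python
import numpy as np

int32_max = np.iinfo(np.int32).max   # 2147483647
int64_max = np.iinfo(np.int64).max

n_molecules = 1e23                   # lower end of the range cited in the card

# A count of this size fits in neither a 32-bit nor a 64-bit integer.
fits_int32 = n_molecules <= int32_max
fits_int64 = n_molecules <= int64_max

# Distance from 1e23 to the next representable float64 value: at this
# magnitude, individual integers can no longer be distinguished.
gap = np.spacing(np.float64(n_molecules))

print(int32_max, fits_int32, fits_int64, gap)
```

The spacing between neighboring float64 values near 10^23 is in the tens of millions, which is negligible relative to the value itself but means such counts are effectively continuous quantities in the data.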

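Common Scenarios

Typical Use Cases

The SRSD-Feynman (Hard w/ Dummy Variables) dataset is designed to evaluate the potential of symbolic regression methods for scientific discovery. Its typical use is to rediscover, or approximate, the physical law behind physical observations that include irrelevant variables. The dataset contains 50 formulas drawn from the Feynman Symbolic Regression Database, with randomly generated dummy variables added to mimic experimental settings in which irrelevant factors are present. Researchers use it to test whether a symbolic regression algorithm can identify the variables that actually govern a physical phenomenon and recover the correct analytical expression, and thus to measure robustness and generalization on data that contain redundant information.

Academic Problems Addressed

The dataset addresses a long-standing benchmarking problem in symbolic regression research: how to build a standardized evaluation platform that is both close to real physics experiments and challenging. Existing symbolic regression benchmarks often use idealized sampling ranges or ignore the fact that physical constants are fixed, which can lead to overestimated performance. SRSD-Feynman reviews the physical meaning of each formula, sets a sampling range for each variable based on a typical experiment, and keeps physical constants as fixed parameters. It therefore better reflects the core difficulty of data-driven scientific discovery: distilling concise, interpretable, and physically consistent models from limited and noisy observations. This design makes the evaluation of symbolic regression methods for scientific discovery more rigorous and credible.

Practical Applications

In practice, the dataset serves as a validation tool for data-driven scientific discovery workflows, particularly in fields such as physics, materials science, and bioinformatics, where analytical laws must be derived from experimental data. Researchers can use it to develop and select symbolic regression models that identify the key variables in multivariate experimental data containing redundant information and build concise mathematical expressions from them. For example, in the development of new functional materials, such a model could uncover hidden structure-property relationships from measurements of composition and performance; in astronomy, it could extract basic equations describing celestial motion or galaxy evolution from complex observations, speeding up the generation and testing of scientific hypotheses.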

Recent Research on the Dataset