openforcefield/descent-format-spice
收藏Hugging Face2026-04-13 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/openforcefield/descent-format-spice
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
language:
- en
size_categories:
- 1M<n<10M
pretty_name: Meta-OMol25 Descent Formatted SPICE2 v1.0
tags:
- openff
- molecular-mechanics
- force-field
- vdw
- valence
- chemistry
configs:
- config_name: descent_data
data_files:
- split: train
path: descent_data/train-*
- config_name: metadata
data_files:
- split: train
path: metadata/train-*
dataset_info:
- config_name: descent_data
features:
- name: smiles
dtype: string
- name: coords
list: float32
- name: energy
list: float32
- name: forces
list: float32
splits:
- name: train
num_bytes: 2168837221
num_examples: 1956538
download_size: 2515274513
dataset_size: 2168837221
- config_name: metadata
features:
- name: OMol25_data_id
dtype: string
- name: OMol25_id
dtype: int64
- name: OMol25_split
dtype: string
- name: OpenFF_id
dtype: int64
- name: formula
dtype: string
- name: charge
dtype: int64
- name: OpenFF_Elements
dtype: bool
- name: OpenFF_abs(q)<=1
dtype: bool
- name: OpenFF_spin=1
dtype: bool
- name: smiles
dtype: string
- name: source
dtype: string
- name: lowdin_charges
list: float64
download_size: 725442298
dataset_size: 1323994687
---
# Dataset Card for Meta-OMol25 Descent Formatted SPICE2 v1.0
## Dataset Details
### Dataset Description
Meta-OMol25 provides molecular structures, coordinates, energies, and forces, and we derived mapped SMILES for broad OpenFF parameter fitting workflows. This release is designed for general fitting and evaluation of van der Waals and valence terms.
- Curated by: Jennifer A Clark; jaclark5
- Funded by: Open Force Field Initiative
- Shared by: Open Force Field Initiative, Open Molecular Software Foundation
- License: CC-BY-4.0
- Dataset version: v1.0
### Dataset Sources
- Repository: https://huggingface.co/facebook/OMol25
- Filtering: https://github.com/jaclark5/MetaOMol25_org_smiles
## Uses
### Direct Use
- Fit or benchmark van der Waals parameters.
- Fit or benchmark valence terms (bonds, angles, torsions).
- Provide aligned molecular metadata and per-structure coordinates / energies / forces for OpenFF workflows.
## Dataset Structure
### Overall Statistics
- Meta-OMol25 rows: 1985965
- Metadata rows: 1956538
- Smee rows: 1956538
- Failed rows: 29427
- Failed unique indices: 29427
- Accounted total (metadata + failed - overlap): 1985965
- All Meta-OMol25 structures accounted for: True
- Uncovered Meta-OMol25 rows: 0
### Metadata Split and Chemistry Summary
- Sample size used for summary: 1956538
- Sample split counts: {
"val": 9554,
"train": 1946984
}
- Sample charge range: [-5, 3]
- Sample molecular weight (g/mol, min / max / mean): 4.032000 / 1451.552762 / 285.915873
- Molecular weight sample count: 1956538
### Metadata Schema
- OMol25_data_id: Value('string')
- OMol25_id: Value('int64')
- OMol25_split: Value('string')
- OpenFF_id: Value('int64')
- formula: Value('string')
- charge: Value('int64')
- OpenFF_Elements: Value('bool')
- OpenFF_abs(q)<=1: Value('bool')
- OpenFF_spin=1: Value('bool')
- smiles: Value('string')
- source: Value('string')
- lowdin_charges: List(Value('float64'))
### Smee Schema
- smiles: Value('string')
- coords: List(Value('float32'))
- energy: List(Value('float32'))
- forces: List(Value('float32'))
Sample energy stats (kcal/mol):
- Min: -17642260.000000
- Max: -1401.253174
- Mean: -856549.566531
### Cross-Table Consistency
- Metadata and smee row count match: True
- Sample metadata/smee SMILES mismatches: 0
## Dataset Creation
### Curation Rationale
- This dataset is intended for broad force-field parameterization workloads and diagnostics.
- This release emphasizes direct support for van der Waals and valence fitting tasks.
### Data Collection and Processing
- Upstream rows are converted into metadata and smee datasets.
- Rows that fail conversion are logged in failed_rows JSONL and used for reconciliation.
- Coverage checks validate whether all Meta-OMol25 rows are represented by either successful records or failed-row entries.
### Quality and Validation
- smee importable: True
- descent importable: True
- descent create_dataset on sample succeeded: True
- Validation sample size: 256
- Validation error (if any):
Top failure modes:
- InconsistentStereochemistryError('Programming error: OpenEye atom stereochemistry assumptions failed. The atom in the oemol has stereochemistry None and the atom in the offmol has stereochemistry S.'): 6777
- InconsistentStereochemistryError('Programming error: OpenEye atom stereochemistry assumptions failed. The atom in the oemol has stereochemistry None and the atom in the offmol has stereochemistry R.'): 6156
- RadicalsNotSupportedError('The OpenFF Toolkit does not currently support parsing molecules with S- and P-block radicals. Found 1 radical electrons on molecule [H][O+][H].[K].'): 2845
- RadicalsNotSupportedError('The OpenFF Toolkit does not currently support parsing molecules with S- and P-block radicals. Found 1 radical electrons on molecule [H][O+][H].[Na].'): 2745
- RadicalsNotSupportedError('The OpenFF Toolkit does not currently support parsing molecules with S- and P-block radicals. Found 1 radical electrons on molecule [H][O+][H].[Li].'): 2677
- RadicalsNotSupportedError('The OpenFF Toolkit does not currently support parsing molecules with S- and P-block radicals. Found 1 radical electrons on molecule [Ca+].[H][O+][H].'): 1411
- RadicalsNotSupportedError('The OpenFF Toolkit does not currently support parsing molecules with S- and P-block radicals. Found 1 radical electrons on molecule [H][O+][H].[Mg+].'): 1411
- ValueError('Inconsistent charge with target!'): 268
- InconsistentStereochemistryError('Programming error: OpenEye bond stereochemistry assumptions failed. The bond in the oemol has stereochemistry None and the bond in the offmol has stereochemistry E.'): 84
- InconsistentStereochemistryError('Programming error: OpenEye bond stereochemistry assumptions failed. The bond in the oemol has stereochemistry None and the bond in the offmol has stereochemistry Z.'): 79
## Dataset Card Authors
- Jennifer A Clark (Open Force Field Initiative); jaclark5
## Dataset Card Contact
- Primary contact: info@openforcefield.org
提供机构:
openforcefield
搜集汇总
数据集介绍

构建方式
descent-format-spice数据集源自Meta-OMol25,后者提供了丰富的分子结构、坐标、能量与受力数据。在此基础上,通过映射SMILES(简化分子线性输入规范)以适配OpenFF力场参数化工作流。构建过程中,原始数据被转化为元数据(metadata)与smee两大子集,并利用JSONL记录转换失败的行以进行完整性校验。为确保数据一致性,所有Meta-OMol25条目均经过覆盖检查,验证成功记录与失败条目是否共同代表全部原始数据。
特点
该数据集包含近200万条分子结构记录,涵盖广泛的化学空间,分子量范围从4.03至1451.55 g/mol,电荷从-5到+3。元数据详细记录了分子的OMol25标识符、化学式、电荷、Lowdin电荷分布及来源等信息;smee子集则提供SMILES、三维坐标、能量及受力数据,能量范围跨越多个数量级。数据集还附带了过滤与验证信息,如立体化学一致性错误与自由基支持问题,便于用户评估数据质量。
使用方法
用户可直接利用该数据集拟合或评估范德华与价键参数(键、角、二面角),适用于OpenFF力场参数化工作流。通过加载metadata与descent_data两个配置(config),可分别获取分子元数据与结构-能量-受力数据。使用时应根据任务选择数据拆分(训练集与验证集),并注意过滤掉标记为失败的条目。建议结合OpenFF Toolkit与descent库进行数据处理,以确保与现有工作流的兼容性。
背景与挑战
背景概述
在分子力学领域,精确的力场参数化是模拟分子结构与性质的核心。Meta-OMol25 Descent Formatted SPICE2 v1.0数据集由Open Force Field Initiative于近期创建,主要研究人员包括Jennifer A Clark等。该数据集基于SPICE2的QM计算结果,针对范德华和价键项的参数拟合与评估进行了专门设计,包含约196万条分子结构、能量和力信息。通过提供对齐的分子元数据及坐标,它显著简化了OpenFF力场工作流的输入,支持广泛的小分子力场参数化任务,对推动分子力学精确性研究具有重要影响。
当前挑战
该数据集旨在解决传统力场参数化中数据不一致与覆盖不足的挑战,特别是范德华和价键项的精确拟合。在构建过程中,面临了多重困难:首先,从原始Meta-OMol25数据转换为标准格式时,约2.9万条记录因立体化学不一致(如InconsistentStereochemistryError)、自由基不支持(RadicalsNotSupportedError)或电荷冲突等错误而失败,需逐类错误处理。其次,确保元数据与结构数据行数精确匹配,并验证所有原始记录被完整覆盖,涉及复杂的数据清洗与一致性校验。此外,还需要处理分子电荷范围从-5到+3、分子量跨度达4至1451 g/mol的大化学空间,保证参数化工作的普适性。
常用场景
经典使用场景
在计算化学与分子模拟领域,力场参数的精确性直接决定了模拟结果的可靠性。descent-format-spice数据集专为分子力学力场的参数化与基准测试而生,其核心使用场景聚焦于范德华和价键项的拟合与评估。研究者可借助该数据集提供的分子结构、坐标、能量和力场数据,系统性地优化并验证力场中描述非键相互作用与化学键、键角、二面角等内坐标行为的各项参数,从而提升分子模拟的物理真实性与预测精度。
衍生相关工作
该数据集的发布催生了一系列与之紧密相关的后续研究工作。基于其高质量参考数据,研究人员开发了多款面向OpenFF框架的力场参数自动拟合工具与交互式诊断平台,极大提升了力场开发的效率与透明度。同时,它也被用作基准来评估机器学习势函数与经典力场的性能差异,促进了混合建模方法的演进。此外,围绕该数据集衍生的分析工作流,如构象采样与能量分解协议,已成为领域内新一代力场验证的标准化模板。
数据集最近研究
最新研究方向
在当前分子力学与力场参数化领域,该数据集聚焦于精准拟合与评估范德华及价键相互作用参数,为新一代通用力场OpenFF的优化提供高质量基准。研究前沿集中于利用超过190万条分子构象的能量与受力数据,结合低丁电荷与立体化学元数据,推动力场模型从经典精度向量子力学级准确演进。该工作与开放科学运动紧密关联,通过开源框架与社区校验机制,显著提升了力场开发的透明性与可复现性,对药物分子设计及材料模拟的预测可靠性具有关键意义。
以上内容由遇见数据集搜集并总结生成



