SpectrumWorld/opendatalab-experimental-nmr-peaks
Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/SpectrumWorld/opendatalab-experimental-nmr-peaks
Resource description:
---
license: mit
task_categories:
- other
language:
- en
tags:
- chemistry
- nmr
- spectroscopy
- experimental
- molecular-formula
- smiles
size_categories:
- 100K<n<1M
---
# OpenDataLab Experimental NMR Peaks Dataset
## Dataset Description
This dataset contains experimental NMR (Nuclear Magnetic Resonance) peak sequences extracted from the OpenDataLab experimental spectra database. The dataset includes both H-NMR and C-NMR peak sequences for chemical compounds, along with their SMILES representations and molecular formulas.
### Dataset Summary
- **Total Samples**: 533,595 compounds
- **Batches**: 333 batch files
- **Data Source**: Experimental spectra from OpenDataLab
- **Format**: Parquet files (one file per batch)
### Data Fields
Each sample contains the following fields:
- `smiles`: Standardized SMILES string representation of the molecule
- `molecular_formula`: Molecular formula (e.g., "C9H12O6")
- `h_nmr_peaks_sequence`: Space-separated H-NMR peak positions (chemical shifts in ppm, 2 decimal places)
  - Empty string if H-NMR data is not available
- `c_nmr_peaks_sequence`: Space-separated C-NMR peak positions (chemical shifts in ppm, 2 decimal places)
  - Empty string if C-NMR data is not available
- `source`: Data source identifier, always "experimental" for this dataset
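
Because the peak fields are stored as strings, most downstream code will want them as numbers. The following is a minimal sketch of one way to do that; the helper name `parse_peaks` is hypothetical and not part of the dataset or any library.

```python
def parse_peaks(sequence: str) -> list[float]:
    """Convert a space-separated peak string into a list of floats.

    An empty string (spectrum not available) yields an empty list.
    """
    return [float(token) for token in sequence.split()]


# Values copied from the sample record shown in the Usage section.
sample = {
    "smiles": "O=CC1(O)CC(O)(C=O)CC(O)(C=O)C1",
    "molecular_formula": "C9H12O6",
    "h_nmr_peaks_sequence": "9.96 9.96 9.96",
    "c_nmr_peaks_sequence": "",
    "source": "experimental",
}

h_peaks = parse_peaks(sample["h_nmr_peaks_sequence"])
c_peaks = parse_peaks(sample["c_nmr_peaks_sequence"])
print(h_peaks)  # [9.96, 9.96, 9.96]
print(c_peaks)  # []
```

Repeated values are kept as they appear, so the list length matches the number of peaks recorded in the string.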
### Data Structure
The dataset is organized into 333 parquet files, each corresponding to a batch:
- `batch_0000_peaks.parquet`
- `batch_0001_peaks.parquet`
- ...
- `batch_0332_peaks.parquet`
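
The naming pattern above is regular enough to generate programmatically. This sketch builds the file names from a batch index; `batch_filename` and `NUM_BATCHES` are hypothetical names introduced here for illustration.

```python
NUM_BATCHES = 333  # batch count stated in this card


def batch_filename(index: int) -> str:
    """Return the parquet file name for a zero-based batch index."""
    if not 0 <= index < NUM_BATCHES:
        raise ValueError(f"batch index must be in [0, {NUM_BATCHES - 1}], got {index}")
    return f"batch_{index:04d}_peaks.parquet"


all_files = [batch_filename(i) for i in range(NUM_BATCHES)]
print(all_files[0])   # batch_0000_peaks.parquet
print(all_files[-1])  # batch_0332_peaks.parquet
```

A subset of these names could be passed to the `data_files` argument of `load_dataset` to load only some batches. That assumes the files sit at the repository root, which this card does not state.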
### Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("SpectrumWorld/opendatalab-experimental-nmr-peaks", split="train")
# Access a sample
sample = dataset[0]
print(sample)
# {
# 'smiles': 'O=CC1(O)CC(O)(C=O)CC(O)(C=O)C1',
# 'molecular_formula': 'C9H12O6',
# 'h_nmr_peaks_sequence': '9.96 9.96 9.96',
# 'c_nmr_peaks_sequence': '',
# 'source': 'experimental'
# }
```
### Dataset Statistics
- **Total compounds**: 533,595
- **Average per batch**: ~1,602 compounds
- **Batch size range**: 545 - 3,170 compounds
- **H-NMR data availability**: Most samples have H-NMR data
- **C-NMR data availability**: Some samples have C-NMR data
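
The card does not give exact availability counts. Since a missing spectrum is encoded as an empty string, the counts can be computed directly from the records. Below is a rough sketch; `nmr_availability` is a hypothetical helper, and the three toy records are invented purely to demonstrate it.

```python
def nmr_availability(samples):
    """Count how many records carry non-empty H-NMR and C-NMR peak strings."""
    counts = {"total": 0, "h_nmr": 0, "c_nmr": 0}
    for record in samples:
        counts["total"] += 1
        # strip() also treats whitespace-only strings as missing.
        if record["h_nmr_peaks_sequence"].strip():
            counts["h_nmr"] += 1
        if record["c_nmr_peaks_sequence"].strip():
            counts["c_nmr"] += 1
    return counts


# Invented toy records, not taken from the dataset.
toy_samples = [
    {"h_nmr_peaks_sequence": "7.26 3.50", "c_nmr_peaks_sequence": "128.40 55.10"},
    {"h_nmr_peaks_sequence": "9.96 9.96 9.96", "c_nmr_peaks_sequence": ""},
    {"h_nmr_peaks_sequence": "", "c_nmr_peaks_sequence": ""},
]

print(nmr_availability(toy_samples))
# {'total': 3, 'h_nmr': 2, 'c_nmr': 1}
```

The same function accepts the object returned by `load_dataset`, since iterating over it yields one dictionary per sample.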
### Data Processing
The dataset was processed from the original experimental spectra data:
1. Extracted NMR peak sequences from parsed chemical shift data
2. Calculated molecular formulas from SMILES using RDKit
3. Formatted peak sequences to 2 decimal places
4. Added source identifier for data provenance
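
Steps 3 and 4 can be illustrated with a short sketch. This is not the authors' actual pipeline: the function names are hypothetical, the molecular formula is passed in rather than computed with RDKit, and the input shifts are invented for the example.

```python
def format_peaks(shifts):
    """Join chemical shifts into a space-separated string with 2 decimal places.

    An empty input yields an empty string, matching the convention used
    for unavailable spectra.
    """
    return " ".join(f"{shift:.2f}" for shift in shifts)


def make_record(smiles, molecular_formula, h_shifts, c_shifts):
    """Assemble one record using the field names listed under Data Fields."""
    return {
        "smiles": smiles,
        "molecular_formula": molecular_formula,  # assumed precomputed (step 2)
        "h_nmr_peaks_sequence": format_peaks(h_shifts),
        "c_nmr_peaks_sequence": format_peaks(c_shifts),
        "source": "experimental",  # provenance tag (step 4)
    }


# Invented shift values for illustration only.
record = make_record(
    "O=CC1(O)CC(O)(C=O)CC(O)(C=O)C1",
    "C9H12O6",
    h_shifts=[9.961, 9.958, 9.96],
    c_shifts=[],
)
print(record["h_nmr_peaks_sequence"])  # 9.96 9.96 9.96
```

The sketch keeps the input order and any duplicate values. Whether the original processing sorted or deduplicated peaks is not stated in this card.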
### Citation
If you use this dataset, please cite:
```bibtex
@dataset{opendatalab_experimental_nmr_peaks,
title={OpenDataLab Experimental NMR Peaks Dataset},
author={SpectrumWorld},
year={2024},
url={https://huggingface.co/datasets/SpectrumWorld/opendatalab-experimental-nmr-peaks}
}
```
### License
This dataset is released under the MIT License.
Provider:
SpectrumWorld



