
SpectrumWorld/opendatalab-experimental-nmr-peaks

Source: Hugging Face. Updated 2025-12-08. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/SpectrumWorld/opendatalab-experimental-nmr-peaks
Resource description:
---
license: mit
task_categories:
- other
language:
- en
tags:
- chemistry
- nmr
- spectroscopy
- experimental
- molecular-formula
- smiles
size_categories:
- 100K<n<1M
---

# OpenDataLab Experimental NMR Peaks Dataset

## Dataset Description

This dataset contains experimental NMR (Nuclear Magnetic Resonance) peak sequences extracted from the OpenDataLab experimental spectra database. The dataset includes both H-NMR and C-NMR peak sequences for chemical compounds, along with their SMILES representations and molecular formulas.

### Dataset Summary

- **Total Samples**: 533,595 compounds
- **Batches**: 333 batch files
- **Data Source**: Experimental spectra from OpenDataLab
- **Format**: Parquet files (one file per batch)

### Data Fields

Each sample contains the following fields:

- `smiles`: Standardized SMILES string representation of the molecule
- `molecular_formula`: Molecular formula (e.g., "C9H12O6")
- `h_nmr_peaks_sequence`: Space-separated H-NMR peak positions (2 decimal places)
  - Empty string if H-NMR data is not available
- `c_nmr_peaks_sequence`: Space-separated C-NMR peak positions (2 decimal places)
  - Empty string if C-NMR data is not available
- `source`: Data source identifier, always "experimental" for this dataset

### Data Structure

The dataset is organized into 333 parquet files, each corresponding to a batch:

- `batch_0000_peaks.parquet`
- `batch_0001_peaks.parquet`
- ...
- `batch_0332_peaks.parquet`

### Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("SpectrumWorld/opendatalab-experimental-nmr-peaks", split="train")

# Access a sample
sample = dataset[0]
print(sample)
# {
#     'smiles': 'O=CC1(O)CC(O)(C=O)CC(O)(C=O)C1',
#     'molecular_formula': 'C9H12O6',
#     'h_nmr_peaks_sequence': '9.96 9.96 9.96',
#     'c_nmr_peaks_sequence': '',
#     'source': 'experimental'
# }
```

### Dataset Statistics

- **Total compounds**: 533,595
- **Average per batch**: ~1,602 compounds
- **Batch size range**: 545 - 3,170 compounds
- **H-NMR data availability**: Most samples have H-NMR data
- **C-NMR data availability**: Some samples have C-NMR data

### Data Processing

The dataset was processed from the original experimental spectra data:

1. Extracted NMR peak sequences from parsed chemical shift data
2. Calculated molecular formulas from SMILES using RDKit
3. Formatted peak sequences to 2 decimal places
4. Added source identifier for data provenance

### Citation

If you use this dataset, please cite:

```bibtex
@dataset{opendatalab_experimental_nmr_peaks,
  title={OpenDataLab Experimental NMR Peaks Dataset},
  author={SpectrumWorld},
  year={2024},
  url={https://huggingface.co/datasets/SpectrumWorld/opendatalab-experimental-nmr-peaks}
}
```

### License

This dataset is released under the MIT License.
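### Example: Parsing Peak Sequences

The Data Fields section describes the two peak fields as space-separated strings, with an empty string meaning the spectrum is unavailable. The sketch below shows one way to turn such a string into a list of floats. The helper name `parse_peaks` is hypothetical and not part of the dataset or of any library; the sample record is copied from the Usage section.

```python
from typing import List


def parse_peaks(sequence: str) -> List[float]:
    """Convert a space-separated peak string into a list of floats.

    An empty or whitespace-only string (spectrum not available)
    yields an empty list, because str.split() returns no tokens.
    """
    return [float(token) for token in sequence.split()]


# Sample record copied from the Usage section of this card.
sample = {
    "smiles": "O=CC1(O)CC(O)(C=O)CC(O)(C=O)C1",
    "molecular_formula": "C9H12O6",
    "h_nmr_peaks_sequence": "9.96 9.96 9.96",
    "c_nmr_peaks_sequence": "",
    "source": "experimental",
}

h_peaks = parse_peaks(sample["h_nmr_peaks_sequence"])
c_peaks = parse_peaks(sample["c_nmr_peaks_sequence"])

# An empty list signals that no C-NMR data exists for this compound.
has_c_nmr = bool(c_peaks)

print(h_peaks)    # parsed H-NMR shifts
print(has_c_nmr)  # whether C-NMR data is present
```

Returning an empty list rather than `None` keeps downstream code uniform: length checks and iteration work the same way whether or not a spectrum is present.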
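### Example: Enumerating Batch Files

The Data Structure section lists batch files named `batch_0000_peaks.parquet` through `batch_0332_peaks.parquet`. The sketch below reconstructs those names from a batch index, which may help when only a subset of batches is needed. The helper names `batch_filename` and `all_batch_filenames` are hypothetical, and the sketch assumes the four-digit zero-padded pattern holds for every batch.

```python
from typing import List

REPO_ID = "SpectrumWorld/opendatalab-experimental-nmr-peaks"
NUM_BATCHES = 333  # batch count stated in the Dataset Summary


def batch_filename(index: int) -> str:
    """Return the parquet file name for a zero-based batch index."""
    if not 0 <= index < NUM_BATCHES:
        raise ValueError(f"batch index must be in [0, {NUM_BATCHES - 1}]")
    return f"batch_{index:04d}_peaks.parquet"


def all_batch_filenames() -> List[str]:
    """Return the file names of all 333 batches, in index order."""
    return [batch_filename(i) for i in range(NUM_BATCHES)]


first_three = [batch_filename(i) for i in range(3)]
print(first_three)

# Assumption: the parquet files sit at the repository root. If so, a
# subset could be loaded with the `data_files` argument of load_dataset:
#
#   from datasets import load_dataset
#   subset = load_dataset(REPO_ID, data_files=first_three, split="train")
```

Whether the files live at the repository root or in a subdirectory is not stated in this card, so the commented-out `load_dataset` call should be checked against the repository's file listing before use.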
Provider:
SpectrumWorld