hheiden/PubChem-124M-SMILES-SELFIES-InChI-IUPAC

Name: hheiden/PubChem-124M-SMILES-SELFIES-InChI-IUPAC
Creator: hheiden
Published: 2026-01-13 03:30:25
License: no description available

Source: Hugging Face. Updated 2026-01-13; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/hheiden/PubChem-124M-SMILES-SELFIES-InChI-IUPAC


Description:

---
license: cc0-1.0
task_categories:
- text-generation
- translation
- fill-mask
tags:
- chemistry
- biology
- molecules
- smiles
- selfies
- inchi
- iupac
- pubchem
size_categories:
- 100M<n<1B
---

# PubChem-124M-Canonicalized-SELFIES-InChI-IUPAC

## Dataset Summary

This dataset contains **~124 million** chemical structures sourced from PubChem (as of Jan 2026), processed into a clean, machine-learning-ready Parquet format.

Unlike raw XML/JSON dumps or standard CSVs, this dataset provides a unified, tabular structure that joins multiple chemical identifiers and descriptors into a single sharded resource:

* **SMILES:** Raw and RDKit-Canonicalized.
* **SELFIES:** Pre-computed 100% robust molecular string representations.
* **InChI:** Standard InChI strings (mapped from PubChem auxiliary files).
* **IUPAC:** Preferred IUPAC names (mapped from PubChem auxiliary files).
* **Mass:** Monoisotopic and Exact Mass information.

It is sharded into 124 files (~1M rows each) and globally shuffled to ensure I.I.D. distribution for streaming training.

## Data Fields

| Column | Description | Fill Rate |
|---|---|---|
| `CID` | PubChem Compound ID | 100% |
| `SMILES` | Original SMILES from PubChem | 100% |
| `SMILES_Canonical` | Canonicalized using RDKit | 99.97% |
| `SELFIES` | Generated using `selfies` library | 99.63% |
| `formula` | Molecular Formula | 97.31% |
| `inchi` | Standard InChI String | 77.30% |
| `iupac` | Preferred IUPAC Name | 75.71% |

*Note: Missing values in SELFIES usually indicate structures that violate standard valence rules or cannot be represented in the SELFIES grammar. Rows with missing InChI or IUPAC values correspond to PubChem entries where that specific auxiliary data was not provided in the dump.*

## Usage

### Loading with Hugging Face

```python
from datasets import load_dataset

# Load the entire dataset (warning: large)
ds = load_dataset("hheiden/PubChem-124M-Canonicalized-SELFIES-InChI-IUPAC")

# Streaming (Recommended for training)
ds_stream = load_dataset("hheiden/PubChem-124M-Canonicalized-SELFIES-InChI-IUPAC", streaming=True)
for sample in ds_stream['train']:
    print(sample['SMILES'], sample['SELFIES'])
```

### Filtering for Translation Tasks

This dataset is particularly powerful for **chemical translation tasks** (e.g., SMILES $\rightarrow$ IUPAC). You can filter for rows where your target logic is present:

```python
# Filter for samples with both SELFIES and IUPAC names
ds = ds.filter(lambda x: x['SELFIES'] is not None and x['iupac'] is not None)
```

## Processing Steps

1. **Source:** Downloaded raw Compound data and auxiliary files (`CID-InChI-Key.gz`, `CID-IUPAC.gz`, `CID-Mass.gz`) from PubChem.
2. **Canonicalization:** Processed all SMILES through RDKit to ensure validity and standard formatting.
3. **SELFIES:** Converted Canonical SMILES to SELFIES representations.
4. **Sharding & Join:** Globally shuffled the base SMILES into 124 shards, then joined auxiliary data (InChI, IUPAC, Mass) per shard using DuckDB to ensure atomic row alignment.

## License

The original data is from PubChem (Public Domain/US Government Work). This collation and processing is released under **CC0 1.0 Universal**.

Raw data available [here](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/)
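
The filter described under "Filtering for Translation Tasks" can also be written as a standalone predicate, which is convenient in streaming mode where rows arrive one at a time as dictionaries. The following is a minimal sketch and not part of the dataset card: the helper names `has_pair` and `iter_pairs` are hypothetical, the sample rows (including their `CID` values) are placeholders made up for illustration, and it is an assumption that missing values appear as `None`, as the card's own filter suggests.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Row = Dict[str, Any]


def has_pair(row: Row, source: str = "SELFIES", target: str = "iupac") -> bool:
    """Return True when both the source and target columns are filled.

    Assumption: missing values are None; empty strings are treated as missing too.
    """
    return bool(row.get(source)) and bool(row.get(target))


def iter_pairs(
    rows: Iterable[Row], source: str = "SELFIES", target: str = "iupac"
) -> Iterator[Tuple[str, str]]:
    """Lazily yield (source, target) string pairs, skipping incomplete rows."""
    for row in rows:
        if has_pair(row, source, target):
            yield row[source], row[target]


# Illustrative rows only; the CID values are placeholders, not real lookups.
sample_rows = [
    {"CID": 1, "SMILES": "CCO", "SELFIES": "[C][C][O]", "iupac": "ethanol"},
    {"CID": 2, "SMILES": "C", "SELFIES": "[C]", "iupac": None},
    {"CID": 3, "SMILES": "O", "SELFIES": None, "iupac": "oxidane"},
]

pairs = list(iter_pairs(sample_rows))
print(pairs)
```

With the streaming dataset loaded as in the card, the same predicate could be passed as `ds_stream['train'].filter(has_pair)`, which is evaluated lazily as rows are read rather than requiring the full download first.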

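
The fill rates in the Data Fields table give rough bounds on how many rows survive a filter that requires two columns at once. The sketch below is a worked estimate, not a measured result: it assumes a round total of 124 million rows (the card only says "~124 million"), and because the card does not report how missing values overlap between columns, it returns an inclusion-exclusion lower bound and a best-case upper bound rather than a single figure. The function name `pair_bounds` is hypothetical.

```python
# Assumption: the card states "~124 million" rows; a round figure is used here.
TOTAL_ROWS = 124_000_000

# Fill rates copied from the Data Fields table.
FILL_RATE = {
    "SMILES_Canonical": 0.9997,
    "SELFIES": 0.9963,
    "formula": 0.9731,
    "inchi": 0.7730,
    "iupac": 0.7571,
}


def pair_bounds(a: str, b: str, total: int = TOTAL_ROWS) -> tuple:
    """Return (lower, upper) bounds on rows where columns a and b are both filled.

    lower: inclusion-exclusion bound, max(0, p_a + p_b - 1)
    upper: the smaller of the two fill rates (missing sets fully nested)
    """
    p_a, p_b = FILL_RATE[a], FILL_RATE[b]
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return round(lower * total), round(upper * total)


low, high = pair_bounds("SELFIES", "iupac")
print(f"Rows with both SELFIES and IUPAC: between {low:,} and {high:,}")
```

Under these assumptions, a SELFIES-to-IUPAC translation set would keep roughly three quarters of the dataset; the exact count depends on the true row total and on how the missing values overlap.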
