
th-laurel/PubChem-124M-SMILES-SELFIES-InChI-IUPAC

Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/th-laurel/PubChem-124M-SMILES-SELFIES-InChI-IUPAC
Description:
---
license: cc0-1.0
task_categories:
- text-generation
- translation
- fill-mask
tags:
- chemistry
- biology
- molecules
- smiles
- selfies
- inchi
- iupac
- pubchem
size_categories:
- 100M<n<1B
---

# PubChem-124M-Canonicalized-SELFIES-InChI-IUPAC

## Dataset Summary

This dataset contains **~124 million** chemical structures sourced from PubChem (as of Jan 2026), processed into a clean, machine-learning-ready Parquet format.

Unlike raw XML/JSON dumps or standard CSVs, this dataset provides a unified, tabular structure that joins multiple chemical identifiers and descriptors into a single sharded resource:

* **SMILES:** Raw and RDKit-Canonicalized.
* **SELFIES:** Pre-computed 100% robust molecular string representations.
* **InChI:** Standard InChI strings (mapped from PubChem auxiliary files).
* **IUPAC:** Preferred IUPAC names (mapped from PubChem auxiliary files).
* **Mass:** Monoisotopic and Exact Mass information.

It is sharded into 124 files (~1M rows each) and globally shuffled to ensure I.I.D. distribution for streaming training.

## Data Fields

| Column | Description | Fill Rate |
|---|---|---|
| `CID` | PubChem Compound ID | 100% |
| `SMILES` | Original SMILES from PubChem | 100% |
| `SMILES_Canonical` | Canonicalized using RDKit | 99.97% |
| `SELFIES` | Generated using `selfies` library | 99.63% |
| `formula` | Molecular Formula | 97.31% |
| `inchi` | Standard InChI String | 77.30% |
| `iupac` | Preferred IUPAC Name | 75.71% |

*Note: Missing values in SELFIES usually indicate structures that violate standard valence rules or cannot be represented in the SELFIES grammar. Rows with missing InChI or IUPAC values correspond to PubChem entries where that specific auxiliary data was not provided in the dump.*

## Usage

### Loading with Hugging Face

```python
from datasets import load_dataset

# Load the entire dataset (warning: large)
ds = load_dataset("hheiden/PubChem-124M-Canonicalized-SELFIES-InChI-IUPAC")

# Streaming (Recommended for training)
ds_stream = load_dataset("hheiden/PubChem-124M-Canonicalized-SELFIES-InChI-IUPAC", streaming=True)
for sample in ds_stream['train']:
    print(sample['SMILES'], sample['SELFIES'])
```

### Filtering for Translation Tasks

This dataset is particularly powerful for **chemical translation tasks** (e.g., SMILES $\rightarrow$ IUPAC). You can filter for rows where your target logic is present:

```python
# Filter for samples with both SELFIES and IUPAC names
ds = ds.filter(lambda x: x['SELFIES'] is not None and x['iupac'] is not None)
```

## Processing Steps

1. **Source:** Downloaded raw Compound data and auxiliary files (`CID-InChI-Key.gz`, `CID-IUPAC.gz`, `CID-Mass.gz`) from PubChem.
2. **Canonicalization:** Processed all SMILES through RDKit to ensure validity and standard formatting.
3. **SELFIES:** Converted Canonical SMILES to SELFIES representations.
4. **Sharding & Join:** Globally shuffled the base SMILES into 124 shards, then joined auxiliary data (InChI, IUPAC, Mass) per shard using DuckDB to ensure atomic row alignment.

## License

The original data is from PubChem (Public Domain/US Government Work). This collation and processing is released under **CC0 1.0 Universal**.

Raw data available [here](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/)
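
## Example: Building Translation Pairs

The filtering idea from the Usage section can also be written as a plain Python generator, which may be convenient in streaming mode where rows arrive one at a time. This is a minimal sketch, not part of the dataset's own tooling: the helper name `translation_pairs` and the three sample rows are hypothetical, and it assumes each row is a dict keyed by the column names listed under Data Fields.

```python
from typing import Iterable, Iterator, Mapping, Optional, Tuple


def translation_pairs(
    rows: Iterable[Mapping[str, Optional[str]]],
    source: str = "SELFIES",
    target: str = "iupac",
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs, skipping rows where either field is missing."""
    for row in rows:
        src, tgt = row.get(source), row.get(target)
        # Missing values are expected: not every row has SELFIES, InChI, or IUPAC.
        if src and tgt:
            yield src, tgt


# Hypothetical in-memory rows standing in for dataset samples
# (CIDs and values are illustrative, not copied from the dataset).
sample_rows = [
    {"CID": 1, "SMILES": "CCO", "SELFIES": "[C][C][O]", "iupac": "ethanol"},
    {"CID": 2, "SMILES": "C", "SELFIES": "[C]", "iupac": None},
    {"CID": 3, "SMILES": "O", "SELFIES": None, "iupac": "oxidane"},
]

pairs = list(translation_pairs(sample_rows))
print(pairs)
```

Only the first sample row survives, because it is the only one with both fields present. Under the same assumptions, passing `ds_stream['train']` from the loading example in place of `sample_rows` should yield pairs lazily without materializing the full dataset; choosing `source="SMILES"` switches the input representation.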

Provider:
th-laurel