
Derify/ReactionSmiles

Hugging Face | Updated 2026-01-22 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Derify/ReactionSmiles
Resource description:
---
license: cc-by-4.0
tags:
- chemistry
- molecules
- reactions
- smiles
- cheminformatics
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
pretty_name: Reaction SMILES Dataset
dataset_info:
  features:
  - name: smiles
    dtype: string
  - name: source_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 747528582
    num_examples: 2752563
  - name: validation
    num_bytes: 93440971
    num_examples: 344070
  - name: test
    num_bytes: 93441242
    num_examples: 344071
  download_size: 406610470
  dataset_size: 934410795
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Reaction SMILES Dataset

A collated dataset of 3.4M unique chemical reaction SMILES strings compiled from multiple public sources for use in pre-training and fine-tuning chemical language models.

Reaction SMILES (Simplified Molecular Input Line Entry System) extend the standard SMILES notation to represent complete chemical reactions. They encode reactants, reagents/catalysts, and products in a single text string using the `>` delimiter:

```text
reactants>reagents>products
```

For example: `CC(=O)O.CCO>[H+]>CC(=O)OCC.O` represents the esterification of acetic acid with ethanol to form ethyl acetate.
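The `reactants>reagents>products` layout can be taken apart with plain string handling. The following is a minimal sketch, not part of the dataset's own tooling: the names `Reaction` and `parse_reaction_smiles` are hypothetical, and the code assumes each string has exactly two `>` delimiters and that `.` separates the molecules within a field. It does no chemical validation; a cheminformatics toolkit would be needed for that.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    """Hypothetical container for the three fields of a reaction SMILES."""
    reactants: List[str]
    reagents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split a reaction SMILES string into its three fields.

    Assumes the `reactants>reagents>products` layout. Fields may be
    empty (for example `A>>B` has no reagents).
    """
    fields = rxn.strip().split(">")
    if len(fields) != 3:
        raise ValueError(f"expected 3 '>'-separated fields, got {len(fields)}")

    def split_molecules(field: str) -> List[str]:
        # '.' separates disconnected species; drop empty entries.
        return [mol for mol in field.split(".") if mol]

    reactants, reagents, products = (split_molecules(f) for f in fields)
    return Reaction(reactants, reagents, products)


example = parse_reaction_smiles("CC(=O)O.CCO>[H+]>CC(=O)OCC.O")
print(example.reactants)  # ['CC(=O)O', 'CCO']
print(example.reagents)   # ['[H+]']
print(example.products)   # ['CC(=O)OCC', 'O']
```

Note that splitting on `.` also separates the ions of a salt into individual entries, so the lists hold SMILES fragments rather than guaranteed whole compounds.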

## Dataset Features

- **smiles**: The reaction SMILES string
- **source_id**: Original data source identifier

## Data Sources

| Source ID | Source URL |
| --------- | ---------- |
| 1 | [US Patents 1976-Sep2016 Grants](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) |
| 2 | [US Patents 2001-Sep2016 Applications](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) |
| 3 | [CRD 1.37M Dataset (2024)](https://figshare.com/articles/dataset/Reaction_SMILES_b_CRD_b_1_37M_dataset/28230053/1) |
| 4 | [USPTO Year 2023](https://figshare.com/articles/dataset/Reaction_SMILES_USPTO_year_2023/24921555) |
| 5 | [Reaction SMILES Dataset (2023)](https://figshare.com/articles/dataset/Reaction_SMILES_dataset/22491730) |

## License

This dataset aggregates publicly available reaction data. Please refer to the individual source links for specific licensing terms.
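
Because licensing depends on the original source, it can help to translate the integer `source_id` of each record back into the source names listed in the Data Sources table. The sketch below assumes records are dictionaries with the `smiles` and `source_id` fields described under Dataset Features. `SOURCE_NAMES` and `count_by_source` are hypothetical names, and the sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable

# Source IDs and names copied from the Data Sources table.
SOURCE_NAMES: Dict[int, str] = {
    1: "US Patents 1976-Sep2016 Grants",
    2: "US Patents 2001-Sep2016 Applications",
    3: "CRD 1.37M Dataset (2024)",
    4: "USPTO Year 2023",
    5: "Reaction SMILES Dataset (2023)",
}


def count_by_source(records: Iterable[dict]) -> Counter:
    """Count records per named source; unrecognized IDs go to 'unknown'."""
    counts: Counter = Counter()
    for record in records:
        name = SOURCE_NAMES.get(record["source_id"], "unknown")
        counts[name] += 1
    return counts


# Illustrative rows only (not real dataset entries).
sample = [
    {"smiles": "CC(=O)O.CCO>[H+]>CC(=O)OCC.O", "source_id": 1},
    {"smiles": "CC(=O)O.CCO>[H+]>CC(=O)OCC.O", "source_id": 1},
    {"smiles": "CC(=O)O.CCO>[H+]>CC(=O)OCC.O", "source_id": 4},
]
for name, n in count_by_source(sample).items():
    print(f"{name}: {n}")
```

The same function works on any iterable of row dictionaries, so it could be applied to a loaded split without changes, assuming the rows expose these two fields.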
Provider:
Derify