
nfsrulesFR/mega-moledit-large

Source: Hugging Face | Updated: 2025-12-04 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/nfsrulesFR/mega-moledit-large
Resource description:
---
license: gpl-3.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- molecular-editing
- drug-discovery
- smiles
- molecule-generation
pretty_name: MEGA Molecular Editing Dataset (Large - 62M)
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: task_id
    dtype: int64
  - name: prompt
    dtype: string
  - name: input_smiles
    dtype: string
  - name: output_smiles
    dtype: string
  - name: action_type
    dtype: string
  - name: edit
    dtype: string
  - name: target_delta
    dtype: float64
  - name: SA_delta
    dtype: float64
  - name: MW_delta
    dtype: float64
  - name: QED_delta
    dtype: float64
  - name: murcko_scaffold_retained
    dtype: bool
  splits:
  - name: train
    num_bytes: 10576225637
    num_examples: 28219060
  - name: validation
    num_bytes: 1175208625
    num_examples: 3135462
  - name: train_neg
    num_bytes: 8843010060
    num_examples: 23684301
  - name: validation_neg
    num_bytes: 982619844
    num_examples: 2631604
  download_size: 6506509462
  dataset_size: 21577064166
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: train_neg
    path: data/train_neg-*
  - split: validation_neg
    path: data/validation_neg-*
---

# MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization

Large-scale annotated molecular editing dataset with 57M examples for training models to modify molecular structures based on natural language instructions.

**Paper**: [MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization](https://openreview.net/pdf?id=MaS7e2EVFm)

**Official Repository**: [https://github.com/nfsrules/MEGA-moledit](https://github.com/nfsrules/MEGA-moledit)

## Dataset Structure

Each example contains:

- `task_id`: Task identifier
- `prompt`: Natural language instruction
- `input_smiles`: Input molecule
- `output_smiles`: Target molecule
- `action_type`: Edit operation type
- `edit`: Specific edit applied
- `target_delta`: Change in target property
- `SA_delta`: Change in Synthetic Accessibility
- `MW_delta`: Change in Molecular Weight
- `QED_delta`: Change in Drug-likeness
- `murcko_scaffold_retained`: Scaffold preservation flag

## Supported Tasks

| Task ID | Description |
|---------|-------------|
| 101 | Increase water solubility |
| 102 | Decrease water solubility |
| 103 | Increase drug-likeness |
| 104 | Decrease drug-likeness |
| 105 | Increase permeability |
| 106 | Decrease permeability |
| 107 | Increase hydrogen bond acceptors |
| 108 | Increase hydrogen bond donors |
| 201 | Increase solubility + HBA |
| 202 | Decrease solubility + increase HBA |
| 203 | Increase solubility + HBD |
| 204 | Decrease solubility + increase HBD |
| 205 | Increase solubility + permeability |
| 206 | Increase solubility + decrease permeability |

**Splits**: `train` and `validation` (31M positive examples in total), `train_neg` and `validation_neg` (26M negative examples in total)

## Smaller Variant

Check out the smaller version: [nfsrulesFR/mega-moledit-522K](https://huggingface.co/datasets/nfsrulesFR/mega-moledit-522K)

## Trained Models

Llama 3 8B-based models for molecular optimization:

- **MEGA-SFT**: [nfsrulesFR/mega-sft](https://huggingface.co/nfsrulesFR/mega-sft) - Supervised fine-tuning model
- **MEGA-GRPO**: [nfsrulesFR/mega-grpo](https://huggingface.co/nfsrulesFR/mega-grpo) - Tanimoto-GRPO optimized model

## Citation

```bibtex
@article{fernandezillouz2025mega,
  title={MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization},
  author={Nelson Fernandez and Maxime Illouz and Luis Pinto and Entao Yang and Habiboulaye Amadou Boubacar},
  journal={Under review at International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/pdf?id=MaS7e2EVFm}
}
```
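
## Example: Working with the Fields and Splits

The card does not include usage code, so the following is only a minimal sketch of how the documented fields and split sizes might be used. The split counts, field names, and task descriptions are copied from the card. The helper names (`split_totals`, `select_rows`, `stream_split`) and the three sample rows are hypothetical, made up for illustration, and not taken from the dataset. The sketch also assumes that the `_neg` splits hold the negative examples, which the counts suggest but the card does not state outright. `stream_split` relies on the `datasets` library and network access, so it is defined but not called here.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Split sizes copied from the dataset card metadata.
SPLIT_SIZES: Dict[str, int] = {
    "train": 28_219_060,
    "validation": 3_135_462,
    "train_neg": 23_684_301,
    "validation_neg": 2_631_604,
}

# Task IDs and descriptions copied from the "Supported Tasks" table.
TASKS: Dict[int, str] = {
    101: "Increase water solubility",
    102: "Decrease water solubility",
    103: "Increase drug-likeness",
    104: "Decrease drug-likeness",
    105: "Increase permeability",
    106: "Decrease permeability",
    107: "Increase hydrogen bond acceptors",
    108: "Increase hydrogen bond donors",
    201: "Increase solubility + HBA",
    202: "Decrease solubility + increase HBA",
    203: "Increase solubility + HBD",
    204: "Decrease solubility + increase HBD",
    205: "Increase solubility + permeability",
    206: "Increase solubility + decrease permeability",
}


def split_totals(sizes: Dict[str, int]) -> Tuple[int, int, int]:
    """Return (positive, negative, total), assuming `_neg` splits are negatives."""
    positive = sizes["train"] + sizes["validation"]
    negative = sizes["train_neg"] + sizes["validation_neg"]
    return positive, negative, positive + negative


def select_rows(rows: Iterable[dict], task_id: int,
                keep_scaffold: bool = True) -> Iterator[dict]:
    """Yield rows for one task whose scaffold flag equals `keep_scaffold`."""
    if task_id not in TASKS:
        raise ValueError(f"unknown task_id: {task_id}")
    for row in rows:
        if (row["task_id"] == task_id
                and bool(row["murcko_scaffold_retained"]) == keep_scaffold):
            yield row


def stream_split(split: str = "train"):
    """Stream one split from the Hub without downloading it in full.

    Requires the `datasets` package and network access.
    """
    from datasets import load_dataset
    return load_dataset("nfsrulesFR/mega-moledit-large",
                        split=split, streaming=True)


# Hypothetical rows (NOT taken from the dataset) with a subset of the fields.
SAMPLE_ROWS = [
    {"task_id": 101, "input_smiles": "Cc1ccccc1",
     "output_smiles": "OCc1ccccc1", "murcko_scaffold_retained": True},
    {"task_id": 102, "input_smiles": "Oc1ccccc1",
     "output_smiles": "COc1ccccc1", "murcko_scaffold_retained": True},
    {"task_id": 101, "input_smiles": "c1ccccc1",
     "output_smiles": "OC1CCCCC1", "murcko_scaffold_retained": False},
]

if __name__ == "__main__":
    positive, negative, total = split_totals(SPLIT_SIZES)
    print(f"positive={positive:,} negative={negative:,} total={total:,}")
    for row in select_rows(SAMPLE_ROWS, task_id=101):
        print(TASKS[row["task_id"]], row["input_smiles"], "->", row["output_smiles"])
```

With the counts from the card, the totals come to 31,354,522 positive and 26,315,905 negative examples, 57,670,427 overall, which is consistent with the 57M figure in the description. Because the full download is about 6.5 GB, streaming a split and filtering on `task_id` is likely a more practical starting point than loading everything into memory.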
Provider:
nfsrulesFR