nfsrulesFR/mega-moledit-522K

Name: nfsrulesFR/mega-moledit-522K
Creator: nfsrulesFR
Published: 2025-12-04 10:10:19
License: gpl-3.0 (from the dataset card metadata)

Hugging Face · Updated 2025-12-04 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/nfsrulesFR/mega-moledit-522K


Resource description:

---
license: gpl-3.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- molecular-editing
- drug-discovery
- smiles
- molecule-generation
pretty_name: MEGA Molecular Editing Dataset (522K)
size_categories:
- 100K<n<1M
---

# MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization

Large-scale molecular editing dataset with 522K examples for training models to modify molecular structures based on natural language instructions.

**Paper**: [MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization](https://openreview.net/pdf?id=MaS7e2EVFm)

**Official Repository**: [https://github.com/nfsrules/MEGA-moledit](https://github.com/nfsrules/MEGA-moledit)

## Usage

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("nfsrulesFR/mega-moledit-522K")

# Access splits
# Positive examples (successful edits)
train_data = dataset["train"]
val_data = dataset["validation"]

# Negative examples (unsuccessful edits)
train_neg_data = dataset["train_neg"]
val_neg_data = dataset["validation_neg"]

# Example
example = train_data[0]
print(f"Prompt: {example['prompt']}")
print(f"Input: {example['input_smiles']}")
print(f"Output: {example['output_smiles']}")
```

## Supported Tasks

| Task ID | Description |
|---------|-------------|
| 101 | Increase water solubility |
| 102 | Decrease water solubility |
| 103 | Increase drug-likeness |
| 104 | Decrease drug-likeness |
| 105 | Increase permeability |
| 106 | Decrease permeability |
| 107 | Increase hydrogen bond acceptors |
| 108 | Increase hydrogen bond donors |
| 201 | Increase solubility + HBA |
| 202 | Decrease solubility + increase HBA |
| 203 | Increase solubility + HBD |
| 204 | Decrease solubility + increase HBD |
| 205 | Increase solubility + permeability |
| 206 | Increase solubility + decrease permeability |

## Dataset Structure

Each example contains:

- `task_id`: Task identifier
- `prompt`: Natural language instruction
- `input_smiles`: Input molecule
- `output_smiles`: Target molecule
- `action_type`: Edit operation type
- `edit`: Specific edit applied
- `target_delta`: Change in target property
- `SA_delta`: Change in Synthetic Accessibility
- `MW_delta`: Change in Molecular Weight
- `QED_delta`: Change in Drug-likeness
- `murcko_scaffold_retained`: Scaffold preservation flag

**Splits**: `train` (469K), `validation` (52K), `train_neg` (469K), `validation_neg` (52K)

## Trained Models

Llama 3 8B-based models for molecular optimization:

- **MEGA-SFT**: [nfsrulesFR/mega-sft](https://huggingface.co/nfsrulesFR/mega-sft) - Supervised fine-tuning model
- **MEGA-GRPO**: [nfsrulesFR/mega-grpo](https://huggingface.co/nfsrulesFR/mega-grpo) - Tanimoto-GRPO optimized model

## Citation

```bibtex
@article{
fernandezillouz2025mega,
title={MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization},
author={Nelson Fernandez and Maxime Illouz and Luis Pinto and Entao Yang and Habiboulaye Amadou Boubacar},
journal={Under review at International Conference on Learning Representations},
year={2025},
url={https://openreview.net/pdf?id=MaS7e2EVFm}
}
```
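
## Example: Selecting Edits by Task

The fields listed under Dataset Structure lend themselves to simple subsetting, for instance keeping only the edits for one task ID that also preserve the Murcko scaffold. The sketch below is a minimal illustration and not part of the official repository. It runs on a few hypothetical in-memory rows whose field names follow the card but whose SMILES strings and delta values are invented. The helper name `select_edits` is an assumption, as is the guess that `murcko_scaffold_retained` is stored as a boolean (or 0/1) and that `task_id` may be either an integer or a string.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]

# Hypothetical rows: field names follow the dataset card, values are invented
# purely for illustration and do not come from the real dataset.
SAMPLE_RECORDS: List[Record] = [
    {"task_id": 101, "prompt": "Increase water solubility",
     "input_smiles": "CCc1ccccc1", "output_smiles": "OCCc1ccccc1",
     "target_delta": 0.8, "murcko_scaffold_retained": True},
    {"task_id": 101, "prompt": "Increase water solubility",
     "input_smiles": "CCc1ccccc1", "output_smiles": "CCC1CCCCC1O",
     "target_delta": 0.3, "murcko_scaffold_retained": False},
    {"task_id": 105, "prompt": "Increase permeability",
     "input_smiles": "OCCc1ccccc1", "output_smiles": "CCc1ccccc1",
     "target_delta": 0.5, "murcko_scaffold_retained": True},
]


def select_edits(records: Iterable[Record], task_id: Any,
                 keep_scaffold_only: bool = True) -> List[Record]:
    """Return the rows for one task, optionally dropping scaffold-changing edits.

    Task IDs are compared as strings so that 101 and "101" both match,
    since the stored type is an assumption here.
    """
    wanted = str(task_id)
    selected: List[Record] = []
    for row in records:
        if str(row["task_id"]) != wanted:
            continue
        # Assumes the flag is a boolean or 0/1 value.
        if keep_scaffold_only and not bool(row["murcko_scaffold_retained"]):
            continue
        selected.append(row)
    return selected


if __name__ == "__main__":
    for row in select_edits(SAMPLE_RECORDS, 101):
        print(row["input_smiles"], "->", row["output_smiles"])
```

Because the function only iterates over dictionary-like rows, it should also accept a split such as `dataset["train"]` from the Usage section, although for the full 469K-row split the `filter` method of the `datasets` library is likely the more efficient route.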

Provider:

nfsrulesFR
