
nfsrulesFR/mega-moledit-522K

Hugging Face | Updated 2025-12-04 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/nfsrulesFR/mega-moledit-522K
Resource description:
---
license: gpl-3.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- molecular-editing
- drug-discovery
- smiles
- molecule-generation
pretty_name: MEGA Molecular Editing Dataset (522K)
size_categories:
- 100K<n<1M
---

# MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization

Large-scale molecular editing dataset with 522K examples for training models to modify molecular structures based on natural language instructions.

**Paper**: [MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization](https://openreview.net/pdf?id=MaS7e2EVFm)

**Official Repository**: [https://github.com/nfsrules/MEGA-moledit](https://github.com/nfsrules/MEGA-moledit)

## Usage

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("nfsrulesFR/mega-moledit-522K")

# Access splits
# Positive examples (successful edits)
train_data = dataset["train"]
val_data = dataset["validation"]

# Negative examples (unsuccessful edits)
train_neg_data = dataset["train_neg"]
val_neg_data = dataset["validation_neg"]

# Example
example = train_data[0]
print(f"Prompt: {example['prompt']}")
print(f"Input: {example['input_smiles']}")
print(f"Output: {example['output_smiles']}")
```

## Supported Tasks

| Task ID | Description |
|---------|-------------|
| 101 | Increase water solubility |
| 102 | Decrease water solubility |
| 103 | Increase drug-likeness |
| 104 | Decrease drug-likeness |
| 105 | Increase permeability |
| 106 | Decrease permeability |
| 107 | Increase hydrogen bond acceptors |
| 108 | Increase hydrogen bond donors |
| 201 | Increase solubility + HBA |
| 202 | Decrease solubility + increase HBA |
| 203 | Increase solubility + HBD |
| 204 | Decrease solubility + increase HBD |
| 205 | Increase solubility + permeability |
| 206 | Increase solubility + decrease permeability |

## Dataset Structure

Each example contains:

- `task_id`: Task identifier
- `prompt`: Natural language instruction
- `input_smiles`: Input molecule
- `output_smiles`: Target molecule
- `action_type`: Edit operation type
- `edit`: Specific edit applied
- `target_delta`: Change in target property
- `SA_delta`: Change in Synthetic Accessibility
- `MW_delta`: Change in Molecular Weight
- `QED_delta`: Change in Drug-likeness
- `murcko_scaffold_retained`: Scaffold preservation flag

**Splits**: `train` (469K), `validation` (52K), `train_neg` (469K), `validation_neg` (52K)

## Trained Models

Llama 3 8B-based models for molecular optimization:

- **MEGA-SFT**: [nfsrulesFR/mega-sft](https://huggingface.co/nfsrulesFR/mega-sft) - Supervised fine-tuning model
- **MEGA-GRPO**: [nfsrulesFR/mega-grpo](https://huggingface.co/nfsrulesFR/mega-grpo) - Tanimoto-GRPO optimized model

## Citation

```bibtex
@article{
fernandezillouz2025mega,
title={MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization},
author={Nelson Fernandez and Maxime Illouz and Luis Pinto and Entao Yang and Habiboulaye Amadou Boubacar},
journal={Under review at International Conference on Learning Representations},
year={2025},
url={https://openreview.net/pdf?id=MaS7e2EVFm}
}
```
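
## Filtering Examples by Task (illustrative sketch)

The fields listed under Dataset Structure can be combined to select a subset of edits, for example one task restricted to scaffold-preserving edits. The sketch below is not part of the official repository. The helper names `keep_edit` and `select_edits`, the `max_sa_delta` threshold, and the toy rows are all hypothetical. It assumes `task_id` compares equal to the IDs in the task table once converted to a string, that `murcko_scaffold_retained` is truthy when the scaffold is kept, and that a positive `SA_delta` means the edit made the molecule harder to synthesize. Check those assumptions against the actual data before relying on it.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def keep_edit(
    row: Row,
    task_id: int,
    max_sa_delta: float = 0.5,      # hypothetical threshold, not from the dataset card
    require_scaffold: bool = True,
) -> bool:
    """Return True if the row belongs to `task_id` and passes both filters."""
    # Compare as strings because the stored type of task_id is an assumption.
    if str(row["task_id"]) != str(task_id):
        return False
    # Assumption: positive SA_delta = harder to synthesize after the edit.
    if float(row["SA_delta"]) > max_sa_delta:
        return False
    # Assumption: the flag is truthy (True or 1) when the scaffold is retained.
    if require_scaffold and not bool(row["murcko_scaffold_retained"]):
        return False
    return True


def select_edits(rows: Iterable[Row], task_id: int, **kwargs: Any) -> List[Row]:
    """Collect every row accepted by keep_edit."""
    return [row for row in rows if keep_edit(row, task_id, **kwargs)]


# Made-up placeholder rows that only mimic the documented field names;
# they are NOT taken from the dataset.
toy_rows: List[Row] = [
    {"task_id": 101, "input_smiles": "CCO", "output_smiles": "OCCO",
     "SA_delta": 0.1, "murcko_scaffold_retained": True},
    {"task_id": 101, "input_smiles": "CCO", "output_smiles": "OCC(O)CO",
     "SA_delta": 1.2, "murcko_scaffold_retained": True},
    {"task_id": 102, "input_smiles": "CCO", "output_smiles": "CCCO",
     "SA_delta": 0.0, "murcko_scaffold_retained": True},
    {"task_id": 101, "input_smiles": "CCO", "output_smiles": "CO",
     "SA_delta": 0.0, "murcko_scaffold_retained": False},
]

selected = select_edits(toy_rows, task_id=101)
for row in selected:
    print(row["input_smiles"], "->", row["output_smiles"])
```

With the real data, the same predicate could be passed to the `filter` method of a loaded split, for instance `dataset["train"].filter(lambda r: keep_edit(r, 101))`, provided the field types match the assumptions above.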
Provided by:
nfsrulesFR