nfsrulesFR/mega-moledit-522K
Source: Hugging Face. Updated 2025-12-04; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/nfsrulesFR/mega-moledit-522K
Resource description:
---
license: gpl-3.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- molecular-editing
- drug-discovery
- smiles
- molecule-generation
pretty_name: MEGA Molecular Editing Dataset (522K)
size_categories:
- 100K<n<1M
---
# MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization
MEGA is a large-scale molecular editing dataset with 522K examples, intended for training models that modify molecular structures according to natural language instructions.
**Paper**: [MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization](https://openreview.net/pdf?id=MaS7e2EVFm)
**Official Repository**: [https://github.com/nfsrules/MEGA-moledit](https://github.com/nfsrules/MEGA-moledit)
## Usage
```python
from datasets import load_dataset
# Load dataset
dataset = load_dataset("nfsrulesFR/mega-moledit-522K")
# Access splits
# Positive examples (successful edits)
train_data = dataset["train"]
val_data = dataset["validation"]
# Negative examples (unsuccessful edits)
train_neg_data = dataset["train_neg"]
val_neg_data = dataset["validation_neg"]
# Example
example = train_data[0]
print(f"Prompt: {example['prompt']}")
print(f"Input: {example['input_smiles']}")
print(f"Output: {example['output_smiles']}")
```
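For supervised fine-tuning, each record can be turned into a prompt/completion pair. The snippet below is a minimal sketch, not the authors' template: the helper name `to_sft_pair`, the prompt layout, and the sample record are all assumptions made for illustration. The `prompt` field may already embed the input molecule, in which case the `Input SMILES` line would be redundant.

```python
def to_sft_pair(example):
    """Build a (prompt, completion) pair from one record.

    Hypothetical layout: the instruction, then the input SMILES on its own
    line. This is an illustrative template, not the one used in the paper.
    """
    prompt = (
        f"{example['prompt'].strip()}\n"
        f"Input SMILES: {example['input_smiles']}\n"
        "Output SMILES:"
    )
    # Leading space so prompt + completion reads naturally when concatenated.
    completion = " " + example["output_smiles"]
    return prompt, completion


# Made-up record that only mirrors the dataset's field names (not real data).
sample = {
    "prompt": "Increase the water solubility of this molecule.",
    "input_smiles": "CCO",
    "output_smiles": "OCCO",
}
prompt, completion = to_sft_pair(sample)
print(prompt + completion)
```

Keeping the target molecule in a separate completion string makes it straightforward to mask the prompt tokens in the loss, if the training framework supports that.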
## Supported Tasks
| Task ID | Description |
|---------|-------------|
| 101 | Increase water solubility |
| 102 | Decrease water solubility |
| 103 | Increase drug-likeness |
| 104 | Decrease drug-likeness |
| 105 | Increase permeability |
| 106 | Decrease permeability |
| 107 | Increase hydrogen bond acceptors |
| 108 | Increase hydrogen bond donors |
| 201 | Increase solubility + HBA |
| 202 | Decrease solubility + increase HBA |
| 203 | Increase solubility + HBD |
| 204 | Decrease solubility + increase HBD |
| 205 | Increase solubility + permeability |
| 206 | Increase solubility + decrease permeability |
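The task table can be mirrored in code to label or select examples by task. This is a sketch under stated assumptions: the helper names are hypothetical, `task_id` is coerced to `int` because its stored type is not documented here, and treating IDs in the 200s as two-property tasks is an inference from the table rather than a documented rule.

```python
# Task IDs and descriptions copied from the table above.
TASKS = {
    101: "Increase water solubility",
    102: "Decrease water solubility",
    103: "Increase drug-likeness",
    104: "Decrease drug-likeness",
    105: "Increase permeability",
    106: "Decrease permeability",
    107: "Increase hydrogen bond acceptors",
    108: "Increase hydrogen bond donors",
    201: "Increase solubility + HBA",
    202: "Decrease solubility + increase HBA",
    203: "Increase solubility + HBD",
    204: "Decrease solubility + increase HBD",
    205: "Increase solubility + permeability",
    206: "Increase solubility + decrease permeability",
}


def describe_task(task_id):
    """Return the description for a task ID given as an int or numeric string."""
    return TASKS[int(task_id)]


def is_multi_objective(task_id):
    """Assumption: IDs in the 200s denote two-property tasks."""
    return 200 <= int(task_id) < 300


def filter_by_task(records, task_id):
    """Keep records whose task_id matches, comparing as integers."""
    wanted = int(task_id)
    return [r for r in records if int(r["task_id"]) == wanted]


# Made-up records that only carry the field used here (not real data).
records = [{"task_id": 101}, {"task_id": "205"}, {"task_id": 101}]
print(describe_task(205), "|", len(filter_by_task(records, 101)))
```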
## Dataset Structure
Each example contains:
- `task_id`: Task identifier
- `prompt`: Natural language instruction
- `input_smiles`: Input molecule
- `output_smiles`: Target molecule
- `action_type`: Edit operation type
- `edit`: Specific edit applied
- `target_delta`: Change in target property
- `SA_delta`: Change in Synthetic Accessibility
- `MW_delta`: Change in Molecular Weight
- `QED_delta`: Change in Drug-likeness
- `murcko_scaffold_retained`: Scaffold preservation flag
**Splits**: `train` (469K), `validation` (52K), `train_neg` (469K), `validation_neg` (52K)
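The per-example fields make it possible to select a subset of edits before training. The predicate below is a minimal sketch: the function name and the `0.5` threshold are arbitrary, and it assumes that `murcko_scaffold_retained` is a boolean or 0/1 flag and that `SA_delta` is the output value minus the input value. If the flag is stored as a string, compare it explicitly instead of relying on `bool()`.

```python
def keep_example(example, max_sa_increase=0.5):
    """Return True for edits that keep the scaffold and limit SA growth.

    Assumptions: `murcko_scaffold_retained` is truthy when the scaffold is
    preserved, and `SA_delta` is output minus input, so a positive value
    means the edited molecule has a higher SA score. The default threshold
    is arbitrary and only for illustration.
    """
    retained = bool(example["murcko_scaffold_retained"])
    return retained and example["SA_delta"] <= max_sa_increase


# Made-up records that only carry the two fields used here (not real data).
records = [
    {"murcko_scaffold_retained": True, "SA_delta": 0.1},
    {"murcko_scaffold_retained": True, "SA_delta": 1.2},
    {"murcko_scaffold_retained": False, "SA_delta": -0.3},
]
kept = [r for r in records if keep_example(r)]
print(len(kept))  # prints 1
```

With the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `dataset["train"].filter(keep_example)`.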
## Trained Models
Llama 3 8B-based models for molecular optimization:
- **MEGA-SFT**: [nfsrulesFR/mega-sft](https://huggingface.co/nfsrulesFR/mega-sft) - Supervised fine-tuning model
- **MEGA-GRPO**: [nfsrulesFR/mega-grpo](https://huggingface.co/nfsrulesFR/mega-grpo) - Tanimoto-GRPO optimized model
## Citation
```bibtex
@article{fernandezillouz2025mega,
title={MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization},
author={Nelson Fernandez and Maxime Illouz and Luis Pinto and Entao Yang and Habiboulaye Amadou Boubacar},
journal={Under review at International Conference on Learning Representations},
year={2025},
url={https://openreview.net/pdf?id=MaS7e2EVFm}
}
```
Provider: nfsrulesFR



