SCIR-HI/PseudoMD-1M
Source: Hugging Face | Updated: 2023-12-20 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/SCIR-HI/PseudoMD-1M
Resource description:
---
license: apache-2.0
task_categories:
- translation
- text2text-generation
language:
- en
tags:
- chemistry
- biology
- medical
size_categories:
- 1M<n<10M
---
Pre-training dataset used in the paper "[From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery](https://arxiv.org/abs/2309.05203)" (AAAI 2024).
The PseudoMD-1M dataset is the first artificially-real dataset for cross-modal molecule discovery. It consists of 1,020,139 pseudo molecule-description pairs. Every molecule is represented by its Canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens.
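As a rough illustration of how per-description statistics like the ones above could be computed, here is a minimal sketch. The record layout (a SMILES string paired with a description string), the sample records, and the helper names `description_stats` and `average_stats` are assumptions made for this example, not the dataset's documented schema. The splitting rules are deliberately naive, and the token count is omitted because it depends on a tokenizer that is not specified here.

```python
import re
from statistics import mean

# Hypothetical records for illustration only: (SMILES, description) tuples.
# The real dataset's field names and layout are not stated in this card.
pairs = [
    ("CCO", "Ethanol is a primary alcohol. It is a volatile, flammable liquid."),
    ("O", "Water is an inorganic compound."),
]


def description_stats(description):
    """Return (sentence_count, word_count) using naive splitting rules."""
    # Split after '.', '!' or '?' when followed by whitespace; drop empty pieces.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", description.strip()) if s]
    # Words are whitespace-separated chunks; punctuation stays attached.
    words = description.split()
    return len(sentences), len(words)


def average_stats(records):
    """Average the sentence and word counts over all descriptions."""
    counts = [description_stats(desc) for _, desc in records]
    return mean(c[0] for c in counts), mean(c[1] for c in counts)


avg_sentences, avg_words = average_stats(pairs)
print(f"{avg_sentences:.2f} sentences, {avg_words:.2f} words per description")
```

Numbers produced this way will not necessarily match the reported figures, since the sentence and word segmentation used by the authors is not described in this card.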
### Citation
If you find the dataset useful, please cite:
```bibtex
@article{chen2023artificially,
title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Chen, Jianyu and Zhao, Sendong and Qin, Bing},
journal={arXiv preprint arXiv:2309.05203},
year={2023}
}
```
Provided by:
SCIR-HI
Summary of original information
Dataset overview
Basic information
- License: Apache 2.0
- Task categories:
  - Translation
  - Text-to-text generation
- Language: English
- Tags:
  - Chemistry
  - Biology
  - Medical
- Dataset size: 1M<n<10M
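Detailed description
- Dataset name: PseudoMD-1M
- Purpose: pseudo dataset for cross-modal molecule discovery
- Composition: 1,020,139 pseudo molecule-description pairs
- Molecule representation: Canonical SMILES notation, sourced from PubChem via the PUG View API
- Description statistics: on average, each description contains 5.11 sentences, 106.47 words, and 165.07 tokens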
Citation information
- Paper title: From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery
- Authors: Chen, Yuhan et al.
- Published in: arXiv preprint arXiv:2309.05203
- Year: 2023



