SCIR-HI/PseudoMD-1M

Name: SCIR-HI/PseudoMD-1M
Creator: SCIR-HI
Published: 2023-12-20 11:19:29
License: Apache 2.0

Source: Hugging Face. Updated 2023-12-20; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/SCIR-HI/PseudoMD-1M

Resource description:

---
license: apache-2.0
task_categories:
- translation
- text2text-generation
language:
- en
tags:
- chemistry
- biology
- medical
size_categories:
- 1M<n<10M
---

Pre-training dataset used in paper "[From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery](https://arxiv.org/abs/2309.05203)" (AAAI 2024)

PseudoMD-1M dataset is the first artificially-real dataset for cross-modal molecule discovery, which consists of 1,020,139 pseudo molecule-description pairs. Every molecule is represented using its Canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description within PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens.

### Citation

If you found the dataset useful, please cite:

```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Jianyu, Chen and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```

Provider:

SCIR-HI

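Summary of original information

Dataset overview

Basic information

License: Apache 2.0
Task categories:
- Translation
- Text-to-text generation
Language: English
Tags:
- Chemistry
- Biology
- Medical
Dataset size: 1M<n<10M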

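Detailed description

Dataset name: PseudoMD-1M
Purpose: pseudo dataset for cross-modal molecule discovery
Composition: 1,020,139 pseudo molecule-description pairs
Molecule representation: Canonical SMILES notation, sourced from PubChem via the PUG View API
Description statistics: on average, each description contains 5.11 sentences, 106.47 words, and 165.07 tokens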
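
The per-description averages listed above can be reproduced in spirit with a short script. The following is only a minimal sketch: the field names `smiles` and `description`, the two sample records, and the naive sentence splitter are assumptions made for illustration, not the dataset's documented schema or the authors' counting method. Token counts are left out because the source does not say which tokenizer produced the 165.07 figure.

```python
import re


def description_stats(descriptions):
    """Return average sentence and word counts over a list of descriptions."""
    if not descriptions:
        raise ValueError("descriptions must be non-empty")
    sentence_total = 0
    word_total = 0
    for text in descriptions:
        # Naive sentence split: terminal punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        sentence_total += len(sentences)
        # Whitespace-delimited words; punctuation stays attached to words.
        word_total += len(text.split())
    n = len(descriptions)
    return {"avg_sentences": sentence_total / n, "avg_words": word_total / n}


# Hypothetical records: field names and contents are illustrative only.
samples = [
    {"smiles": "CCO",
     "description": "The molecule is a primary alcohol. It is a common solvent."},
    {"smiles": "O",
     "description": "The molecule is water."},
]

stats = description_stats([record["description"] for record in samples])
print(stats)
```

Applied to the full set of descriptions, a function like this would give figures comparable to the reported averages, although exact values depend on how sentences and words are delimited.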

Citation information

Paper title: From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery
Authors: Chen, Yuhan et al.
Journal: arXiv preprint arXiv:2309.05203
Year: 2023
