
SCIR-HI/PseudoMD-1M

Source: Hugging Face. Updated 2023-12-20. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/SCIR-HI/PseudoMD-1M
Resource description:
---
license: apache-2.0
task_categories:
- translation
- text2text-generation
language:
- en
tags:
- chemistry
- biology
- medical
size_categories:
- 1M<n<10M
---

Pre-training dataset used in paper "[From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery](https://arxiv.org/abs/2309.05203)" (AAAI 2024)

PseudoMD-1M dataset is the first artificially-real dataset for cross-modal molecule discovery, which consists of 1,020,139 pseudo molecule-description pairs. Every molecule is represented using its Canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description within PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens.

### Citation

If you found the dataset useful, please cite:

```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Jianyu, Chen and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```
Provided by:
SCIR-HI
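Summary of original information

Dataset overview

Basic information

  • License: Apache 2.0
  • Task categories
    • Translation
    • Text-to-text generation
  • Language: English
  • Tags
    • Chemistry
    • Biology
    • Medical
  • Dataset size: 1M<n<10M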

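Detailed description

  • Dataset name: PseudoMD-1M
  • Purpose: pseudo dataset for cross-modal molecule discovery
  • Composition: 1,020,139 pseudo molecule-description pairs
  • Molecule representation: Canonical SMILES notation, sourced from PubChem via the PUG View API
  • Description statistics: on average, each description contains 5.11 sentences, 106.47 words, and 165.07 tokens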
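The per-description averages listed above can be approximated on any list of description strings. The following is a minimal sketch, not the authors' procedure: the source does not say how sentences, words, or tokens were counted, so the regex-based sentence split, the whitespace word split, the function name `description_stats`, and the sample strings are all assumptions made for illustration. Token counts are omitted because the tokenizer is not specified.

```python
import re
from statistics import mean


def description_stats(descriptions):
    """Return average sentence and word counts over a list of descriptions.

    Hypothetical helper: the counting rules below are assumptions,
    not the method used to produce the published figures.
    """
    sentence_counts = []
    word_counts = []
    for text in descriptions:
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        # Assumption: words are whitespace-separated chunks.
        words = text.split()
        sentence_counts.append(len(sentences))
        word_counts.append(len(words))
    return {
        "avg_sentences": mean(sentence_counts),
        "avg_words": mean(word_counts),
    }


# Made-up sample descriptions, not taken from the dataset.
sample = [
    "The molecule is an alcohol. It is soluble in water.",
    "The molecule is a fatty acid.",
]
stats = description_stats(sample)
print(stats)
```

Applied to the description column of the real dataset, the same function would give figures comparable to the published ones only to the extent that these counting rules match the ones the authors used.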

Citation information

  • Paper title: From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery
  • Authors: Chen, Yuhan et al.
  • Venue: arXiv preprint arXiv:2309.05203
  • Year: 2023