
chebi_20

ModelScope Community · Updated 2025-10-09 · Indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/jablonkagroup/chebi_20
Resource overview:
## Dataset Details

### Dataset Description

A dataset of pairs of natural language descriptions and SMILES strings.

- **Curated by:**
- **License:** CC BY 4.0

### Dataset Sources

- [Original Text2Mol paper, which introduced the chebi_20 dataset.](https://aclanthology.org/2021.emnlp-main.47/)
- [Text2Mol original data repository on GitHub.](https://github.com/cnedwards/text2mol)
- [Hugging Face dataset uploaded to the OpenBioML organisation.](https://huggingface.co/datasets/OpenBioML/chebi_20)

## Citation

**BibTeX:**

```bibtex
@inproceedings{edwards2021text2mol,
  title = {Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries},
  author = {Edwards, Carl and Zhai, ChengXiang and Ji, Heng},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages = {595--607},
  year = {2021},
  url = {https://aclanthology.org/2021.emnlp-main.47/}
}

@inproceedings{edwards-etal-2022-translation,
  title = "Translation between Molecules and Natural Language",
  author = "Edwards, Carl and Lai, Tuan and Ros, Kevin and Honke, Garrett and Cho, Kyunghyun and Ji, Heng",
  booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
  month = dec,
  year = "2022",
  address = "Abu Dhabi, United Arab Emirates",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.emnlp-main.26",
  pages = "375--413",
  abstract = "We present MolT5 - a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.",
}
```
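
The description/SMILES pairing named under Dataset Description can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the card does not state the dataset's column names, so the `description` and `smiles` fields, the `DescriptionSmilesPair` class, and the `looks_balanced` helper are hypothetical, and the three example records are hand-written rather than copied from chebi_20. The helper only checks that parentheses and brackets are balanced; it is not a chemical validity check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptionSmilesPair:
    """One (description, SMILES) record. Field names are assumptions."""
    description: str
    smiles: str


def looks_balanced(smiles: str) -> bool:
    """Return True if '(' / ')' and '[' / ']' are properly paired.

    A cheap syntactic sanity check only; it says nothing about
    whether the string encodes a chemically valid molecule.
    """
    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


# Hand-written illustrative pairs, NOT rows taken from chebi_20.
examples = [
    DescriptionSmilesPair("Water is an oxygen hydride.", "O"),
    DescriptionSmilesPair("Ethanol is a primary alcohol.", "CCO"),
    DescriptionSmilesPair("Acetic acid is a simple carboxylic acid.", "CC(=O)O"),
]

for pair in examples:
    print(f"{pair.smiles!r}: balanced={looks_balanced(pair.smiles)}")
```

A filter like this could serve as a first pass before handing strings to a real cheminformatics parser, which remains necessary for actual validation.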

Provided by:
maas
Created:
2025-05-29