Download link:

https://modelscope.cn/datasets/jablonkagroup/chebi_20

Resource overview:

## Dataset Details

### Dataset Description

A dataset of pairs of natural language descriptions and SMILES (Simplified Molecular Input Line Entry System) strings.

- **Curated by:**
- **License:** CC BY 4.0

### Dataset Sources

- [Original Text2Mol paper which introduced the chebi_20 dataset.](https://aclanthology.org/2021.emnlp-main.47/)
- [Text2Mol original data repository on GitHub.](https://github.com/cnedwards/text2mol)
- [Hugging Face dataset uploaded to the OpenBioML organisation.](https://huggingface.co/datasets/OpenBioML/chebi_20)

## Citation

**BibTeX:**

```bibtex
@inproceedings{edwards2021text2mol,
  title={Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries},
  author={Edwards, Carl and Zhai, ChengXiang and Ji, Heng},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={595--607},
  year={2021},
  url = {https://aclanthology.org/2021.emnlp-main.47/}
}

@inproceedings{edwards-etal-2022-translation,
  title = "Translation between Molecules and Natural Language",
  author = "Edwards, Carl and Lai, Tuan and Ros, Kevin and Honke, Garrett and Cho, Kyunghyun and Ji, Heng",
  booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
  month = dec,
  year = "2022",
  address = "Abu Dhabi, United Arab Emirates",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.emnlp-main.26",
  pages = "375--413",
  abstract = "We present MolT5 - a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.",
}
```
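As a rough illustration of what a dataset of description and SMILES pairs looks like in code, the sketch below reads tab-separated rows into typed records using only the Python standard library. The column names `SMILES` and `description`, the `MoleculeTextPair` class, and the sample rows are assumptions made for this example; they are not taken from chebi_20, so check the actual files for the real schema before adapting it.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeTextPair:
    """One (SMILES, description) pair. Hypothetical record type for this sketch."""
    smiles: str
    description: str


# Illustrative rows only, NOT taken from chebi_20.
# The column names "SMILES" and "description" are assumed.
SAMPLE_TSV = (
    "SMILES\tdescription\n"
    "CCO\tEthanol is a primary alcohol.\n"
    "O\tWater is an oxygen hydride.\n"
)


def read_pairs(handle):
    """Read tab-separated rows into MoleculeTextPair records, skipping incomplete rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = []
    for row in reader:
        smiles = (row.get("SMILES") or "").strip()
        description = (row.get("description") or "").strip()
        if smiles and description:
            pairs.append(MoleculeTextPair(smiles, description))
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE_TSV))
for pair in pairs:
    print(f"{pair.smiles}\t{pair.description}")
```

To use the same function on a real file, pass an open file handle instead of the in-memory `io.StringIO` buffer and adjust the column names to match the file's header.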


Use cases: