chebi_20
Source: ModelScope community. Updated 2025-10-09; indexed 2025-05-31.
Download link: https://modelscope.cn/datasets/jablonkagroup/chebi_20
## Dataset Details
### Dataset Description
A dataset of natural language descriptions paired with SMILES strings.
- **Curated by:**
- **License:** CC BY 4.0
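To make the pair structure concrete, here is a minimal sketch of how one record might be represented and lightly sanity-checked. The class name, the field names `description` and `smiles`, and the helper `has_balanced_brackets` are assumptions made for illustration and are not taken from the dataset files; the example record is likewise written for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptionSmilesPair:
    """One record: a text description paired with a SMILES string.

    Field names are hypothetical; check the dataset files for the real ones.
    """
    description: str
    smiles: str


def has_balanced_brackets(smiles: str) -> bool:
    """Cheap structural check: parentheses and square brackets must nest properly.

    This is not a chemical validity check; a cheminformatics toolkit
    is needed for that.
    """
    closers = {")": "(", "]": "["}
    stack = []
    for char in smiles:
        if char in "([":
            stack.append(char)
        elif char in closers:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != closers[char]:
                return False
    return not stack


# Illustrative record written for this sketch, not taken from the dataset.
example = DescriptionSmilesPair(
    description="Ethanol is a primary alcohol with a two-carbon chain.",
    smiles="CCO",
)
```

A check like this only catches truncated or malformed strings; it says nothing about whether the molecule is chemically valid or matches its description.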
### Dataset Sources
- [Original Text2Mol paper which introduced the chebi_20 dataset.](https://aclanthology.org/2021.emnlp-main.47/)
- [Text2Mol original data repository on GitHub.](https://github.com/cnedwards/text2mol)
- [Hugging Face dataset uploaded to the OpenBioML organisation.](https://huggingface.co/datasets/OpenBioML/chebi_20)
## Citation
**BibTeX:**
```bibtex
@inproceedings{edwards2021text2mol,
title={Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries},
author={Edwards, Carl and Zhai, ChengXiang and Ji, Heng},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
pages={595--607},
year={2021},
url = {https://aclanthology.org/2021.emnlp-main.47/}
}
@inproceedings{edwards-etal-2022-translation,
title = "Translation between Molecules and Natural Language",
author = "Edwards, Carl and
Lai, Tuan and
Ros, Kevin and
Honke, Garrett and
Cho, Kyunghyun and
Ji, Heng",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.26",
pages = "375--413",
abstract = "We present MolT5 - a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.",
}
```
Provider: maas
Created: 2025-05-29



