
scbirlab/wong-2024-ai

Source: Hugging Face. Updated 2025-11-19; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/scbirlab/wong-2024-ai
Resource description:
---
license: cc-by-4.0
task_categories:
  - text-classification
tags:
  - chemistry
  - biology
  - antibiotics
  - SMILES
size_categories:
  - 10K<n<100K
pretty_name: Data from Wong, 2024
configs:
  - config_name: wong24-sau
    data_files:
      - split: train
        path: "*-train.csv.gz"
      - split: test
        path: "*-test.csv.gz"
      - split: validation
        path: "*-validation.csv.gz"
---

# wong-2024-ai

SMILES of compounds used for training and prediction in:

> Wong F, Zheng EJ, Valeri JA, Donghia NM, Anahtar MN, Omori S, Li A, Cubillos-Ruiz A, Krishnan A, Jin W, Manson AL, Friedrichs J, Helbig R, Hajian B, Fiejtek DK, Wagner FF, Soutter HH, Earl AM, Stokes JM, Renner LD, Collins JJ.
> Discovery of a structural class of antibiotics with explainable deep learning.
> Nature. 2024 Feb;626(7997):177-185.
> doi: 10.1038/s41586-023-06887-8. Epub 2023 Dec 20.
> PMID: 38123686; PMCID: PMC10866013.

The SMILES strings have been canonicalized, and split into training (70%), validation (15%), and test (15%) sets by Murcko scaffold. Additional features like molecular weight and topological polar surface area have also been calculated.

## Dataset Details

### Dataset Description

- **Curated by:** [@eachanjohnson](https://huggingface.co/eachanjohnson)
- **Funded by:** The Francis Crick Institute
- **License:** CC-by-4.0

### Dataset Sources

<!-- - **Repository:** https://doi.org/10.5281/zenodo.8136904 -->
- **Paper:** https://doi.org/10.1038/s41586-023-06887-8

## Uses

Developing chemistry models.

## Dataset Structure

- **SMILES**: SMILES string of compound
- **id**: Numerical almost-unique identifier of compound
- **inchikey**: Unique identifier for compound
- **smiles**: RDKit-canonicalized SMILES string of compound
- **pubchem_name**: Compound name pulled from PubChem
- **pubchem_id**: PubChem compound ID
- **scaffold**: Murcko scaffold of compound
- **mwt**: Molecular weight of compound
- **clogp**: Crippen LogP of compound
- **tpsa**: Topological polar surface area of compound
- **is_train**: In training split
- **is_test**: In test split
- **is_validation**: In validation split

## Dataset Creation

### Curation Rationale

To make available a large dataset of SMILES strings for DOS compounds, as distinct from commonly encountered virtual libraries from conventional combinatorial chemistry.

#### Data Collection and Processing

Data were processed using [schemist](https://github.com/scbirlab/schemist), a tool for processing chemical datasets.

#### Who are the source data producers?

Liu G, Catacutan DB, Rathod K, Swanson K, Jin W, Mohammed JC, Chiappino-Pepe A, Syed SA, Fragis M, Rachwalski K, Magolan J, Surette MG, Coombes BK, Jaakkola T, Barzilay R, Collins JJ, Stokes JM

#### Personal and Sensitive Information

None.

## Citation

**BibTeX:**

```
@article{10.1038/s41586-023-06887-8,
  author = {Wong, Felix and Zheng, Erica J. and Valeri, Jacqueline A. and Donghia, Nina M. and Anahtar, Melis N. and Omori, Satotaka and Li, Alicia and Cubillos-Ruiz, Andres and Krishnan, Aarti and Jin, Wengong and Manson, Abigail L. and Friedrichs, Jens and Helbig, Ralf and Hajian, Behnoush and Fiejtek, Dawid K. and Wagner, Florence F. and Soutter, Holly H. and Earl, Ashlee M. and Stokes, Jonathan M. and Renner, Lars D. and Collins, James J.},
  title = {Discovery of a structural class of antibiotics with explainable deep learning},
  journal = {Nature},
  volume = {626},
  number = {7997},
  pages = {177-185},
  abstract = {The discovery of novel structural classes of antibiotics is urgently needed to address the ongoing antibiotic resistance crisis1–9. Deep learning approaches have aided in exploring chemical spaces1,10–15; these typically use black box models and do not provide chemical insights. Here we reasoned that the chemical substructures associated with antibiotic activity learned by neural network models can be identified and used to predict structural classes of antibiotics. We tested this hypothesis by developing an explainable, substructure-based approach for the efficient, deep learning-guided exploration of chemical spaces. We determined the antibiotic activities and human cell cytotoxicity profiles of 39,312 compounds and applied ensembles of graph neural networks to predict antibiotic activity and cytotoxicity for 12,076,365 compounds. Using explainable graph algorithms, we identified substructure-based rationales for compounds with high predicted antibiotic activity and low predicted cytotoxicity. We empirically tested 283 compounds and found that compounds exhibiting antibiotic activity against Staphylococcus aureus were enriched in putative structural classes arising from rationales. Of these structural classes of compounds, one is selective against methicillin-resistant S. aureus (MRSA) and vancomycin-resistant enterococci, evades substantial resistance, and reduces bacterial titres in mouse models of MRSA skin and systemic thigh infection. Our approach enables the deep learning-guided discovery of structural classes of antibiotics and demonstrates that machine learning models in drug discovery can be explainable, providing insights into the chemical substructures that underlie selective antibiotic activity.},
  ISSN = {1476-4687},
  DOI = {10.1038/s41586-023-06887-8},
  url = {https://doi.org/10.1038/s41586-023-06887-8},
  year = {2024},
  type = {Journal Article}
}
```

**APA:**

> Wong, F., Zheng, E. J., Valeri, J. A., Donghia, N. M., Anahtar, M. N., Omori, S., Li, A., Cubillos-Ruiz, A., Krishnan, A., Jin, W., Manson, A. L., Friedrichs, J., Helbig, R., Hajian, B., Fiejtek, D. K., Wagner, F. F., Soutter, H. H., Earl, A. M., Stokes, J. M., Renner, L. D., … Collins, J. J. (2024). Discovery of a structural class of antibiotics with explainable deep learning. Nature, 626(7997), 177–185. https://doi.org/10.1038/s41586-023-06887-8

## Dataset Card Contact

[@eachanjohnson](https://huggingface.co/eachanjohnson)
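
## Example: checking the scaffold split

The card states that the compounds are split by Murcko scaffold and lists a `scaffold` column alongside the `is_train`, `is_validation`, and `is_test` flags. The sketch below shows one way to confirm that no scaffold is shared between splits once the `*-train.csv.gz`, `*-validation.csv.gz`, and `*-test.csv.gz` files have been downloaded. It uses only the Python standard library. The helper names (`read_rows`, `scaffolds_by_split`, `shared_scaffolds`) are hypothetical, the `True`/`False` encoding of the flag columns is an assumption rather than something the card specifies, and the toy rows are invented for illustration.

```python
import csv
import gzip
import io
from collections import defaultdict

# Split flag columns named in the card's "Dataset Structure" section.
SPLIT_FLAGS = ("is_train", "is_validation", "is_test")


def _is_true(value):
    """Interpret a CSV cell as a boolean flag (the encoding is an assumption)."""
    return str(value).strip().lower() in {"true", "1", "1.0"}


def read_rows(path):
    """Yield rows from one gzip-compressed CSV file as dictionaries."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle)


def scaffolds_by_split(rows):
    """Collect the set of Murcko scaffolds seen under each split flag."""
    seen = defaultdict(set)
    for row in rows:
        for flag in SPLIT_FLAGS:
            if _is_true(row.get(flag, "")):
                seen[flag].add(row["scaffold"])
    return dict(seen)


def shared_scaffolds(seen):
    """Return scaffolds that occur in more than one split (ideally none)."""
    flags = list(seen)
    shared = set()
    for i, first in enumerate(flags):
        for second in flags[i + 1:]:
            shared |= seen[first] & seen[second]
    return shared


# Toy rows standing in for the real files; the values are made up.
toy_csv = """smiles,scaffold,is_train,is_validation,is_test
CCO,,True,False,False
c1ccccc1O,c1ccccc1,True,False,False
c1ccccc1N,c1ccccc1,True,False,False
c1ccncc1C,c1ccncc1,False,True,False
C1CCOC1C,C1CCOC1,False,False,True
"""
toy_rows = list(csv.DictReader(io.StringIO(toy_csv)))
toy_seen = scaffolds_by_split(toy_rows)
toy_shared = shared_scaffolds(toy_seen)
```

For the real data, the rows from each downloaded file can be chained together, for example with `itertools.chain(read_rows(train_path), read_rows(validation_path), read_rows(test_path))`, and passed to `scaffolds_by_split`. If the flag columns turn out to be encoded differently, only `_is_true` needs adjusting.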
Provided by:
scbirlab