
scbirlab/stokes-2020-ai

Source: Hugging Face. Updated 2025-11-19; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/scbirlab/stokes-2020-ai
Resource description:
---
license: cc-by-4.0
task_categories:
  - text-classification
tags:
  - chemistry
  - biology
  - antibiotics
  - SMILES
size_categories:
  - 10K<n<100K
pretty_name: Data from Stokes, 2020
configs:
  - config_name: eco
    data_files:
      - split: train
        path: "*-eco0-train.csv.gz"
      - split: test
        path: "*-eco0-test.csv.gz"
      - split: validation
        path: "*-eco0-validation.csv.gz"
  - config_name: broad-repurposing
    data_files:
      - split: train
        path: "*-broad-repurposing-train.csv.gz"
      - split: test
        path: "*-broad-repurposing-test.csv.gz"
      - split: validation
        path: "*-broad-repurposing-validation.csv.gz"
  - config_name: zinc-predictions
    data_files:
      - split: train
        path: "*-zinc-predictions-train.csv.gz"
      - split: test
        path: "*-zinc-predictions-test.csv.gz"
      - split: validation
        path: "*-zinc-predictions-validation.csv.gz"
---

# stokes-2020-ai

SMILES of compounds used for training and prediction in:

> Stokes, J. M., Yang, K., ..., Collins, J. J. (2020).
> A Deep Learning Approach to Antibiotic Discovery.
> Cell, 180(4), 688–702.e13.
> https://doi.org/10.1016/j.cell.2020.01.021
> PMID: 32084340; PMCID: PMC8349178.

The SMILES strings have been canonicalized, and split into training (70%), validation (15%), and test (15%) sets by Murcko scaffold. Additional features like molecular weight and topological polar surface area have also been calculated.

## Dataset Details

### Dataset Description

- **Curated by:** [@eachanjohnson](https://huggingface.co/eachanjohnson)
- **Funded by:** The Francis Crick Institute
- **License:** CC-by-4.0

### Dataset Sources

<!-- - **Repository:** https://doi.org/10.5281/zenodo.8136904 -->
- **Paper:** https://doi.org/10.1016/j.cell.2020.01.021

## Uses

Developing chemistry models.

## Dataset Structure

- **SMILES**: SMILES string of compound
- **id**: Numerical almost-unique identifier of compound
- **inchikey**: Unique identifier for compound
- **smiles**: RDKit-canonicalized SMILES string of compound
- **pubchem_name**: Compound name pulled from PubChem
- **pubchem_id**: PubChem compound ID
- **scaffold**: Murcko scaffold of compound
- **mwt**: Molecular weight of compound
- **clogp**: Crippen LogP of compound
- **tpsa**: Topological polar surface area of compound
- **is_train**: In training split
- **is_test**: In test split
- **is_validation**: In validation split

## Dataset Creation

### Curation Rationale

To make available a large dataset of SMILES strings for DOS compounds, as distinct from commonly encountered virtual libraries from conventional combinatorial chemistry.

#### Data Collection and Processing

Data were processed using [schemist](https://github.com/scbirlab/schemist), a tool for processing chemical datasets.

#### Who are the source data producers?

Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N. M., MacNair, C. R., French, S., Carfrae, L. A., Bloom-Ackermann, Z., Tran, V. M., Chiappino-Pepe, A., Badran, A. H., Andrews, I. W., Chory, E. J., Church, G. M., Brown, E. D., Jaakkola, T. S., Barzilay, R., & Collins, J. J.

#### Personal and Sensitive Information

None.

## Citation

**BibTeX:**

```
@article{10.1016/j.cell.2020.01.021,
  title = {A Deep Learning Approach to Antibiotic Discovery},
  journal = {Cell},
  volume = {180},
  number = {4},
  pages = {688-702.e13},
  year = {2020},
  issn = {0092-8674},
  doi = {https://doi.org/10.1016/j.cell.2020.01.021},
  url = {https://www.sciencedirect.com/science/article/pii/S0092867420301021},
  author = {Jonathan M. Stokes and Kevin Yang and Kyle Swanson and Wengong Jin and Andres Cubillos-Ruiz and Nina M. Donghia and Craig R. MacNair and Shawn French and Lindsey A. Carfrae and Zohar Bloom-Ackermann and Victoria M. Tran and Anush Chiappino-Pepe and Ahmed H. Badran and Ian W. Andrews and Emma J. Chory and George M. Church and Eric D. Brown and Tommi S. Jaakkola and Regina Barzilay and James J. Collins},
  keywords = {antibiotics, antibiotic resistance, antibiotic tolerance, machine learning, drug discovery},
  abstract = {Summary Due to the rapid emergence of antibiotic-resistant bacteria, there is a growing need to discover new antibiotics. To address this challenge, we trained a deep neural network capable of predicting molecules with antibacterial activity. We performed predictions on multiple chemical libraries and discovered a molecule from the Drug Repurposing Hub---halicin---that is structurally divergent from conventional antibiotics and displays bactericidal activity against a wide phylogenetic spectrum of pathogens including Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Halicin also effectively treated Clostridioides difficile and pan-resistant Acinetobacter baumannii infections in murine models. Additionally, from a discrete set of 23 empirically tested predictions from >107 million molecules curated from the ZINC15 database, our model identified eight antibacterial compounds that are structurally distant from known antibiotics. This work highlights the utility of deep learning approaches to expand our antibiotic arsenal through the discovery of structurally distinct antibacterial molecules.}
}
```

**APA:**

> Stokes, J. M., Yang, K., ..., Collins, J. J. (2020).
> A Deep Learning Approach to Antibiotic Discovery.
> Cell, 180(4), 688–702.e13.
> https://doi.org/10.1016/j.cell.2020.01.021

## Dataset Card Contact

[@eachanjohnson](https://huggingface.co/eachanjohnson)
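
## Example: checking the split flags

As a rough illustration of the `scaffold`, `is_train`, `is_test`, and `is_validation` columns listed under Dataset Structure, the sketch below groups rows by their split flag and checks that no Murcko scaffold appears in more than one split, which is the property a scaffold-based split is meant to provide. Everything in it is an assumption made for illustration: the sample rows are invented, the card does not say how the flags are serialized (the code accepts `True`, `1`, or `yes`), and the helper names `split_rows` and `shared_scaffolds` are hypothetical. It is not the procedure used by schemist.

```python
import csv
import gzip
import io
from collections import defaultdict

# Hypothetical rows using column names from "Dataset Structure".
# The values are invented for illustration, not taken from the dataset.
SAMPLE_CSV = """id,smiles,scaffold,is_train,is_test,is_validation
1,CCO,,True,False,False
2,c1ccccc1O,c1ccccc1,True,False,False
3,c1ccccc1N,c1ccccc1,True,False,False
4,c1ccncc1C,c1ccncc1,False,True,False
5,C1CCOC1C,C1CCOC1,False,False,True
"""

# Assumption: flags are stored as text; accept a few common truthy spellings.
TRUTHY = {"true", "1", "yes"}


def split_rows(rows):
    """Group rows by whichever split flag is set on each of them."""
    splits = defaultdict(list)
    for row in rows:
        for name in ("train", "test", "validation"):
            if row[f"is_{name}"].strip().lower() in TRUTHY:
                splits[name].append(row)
    return dict(splits)


def shared_scaffolds(splits):
    """Return the scaffolds that occur in more than one split."""
    seen = defaultdict(set)
    for name, rows in splits.items():
        for row in rows:
            seen[row["scaffold"]].add(name)
    return {scaffold for scaffold, names in seen.items() if len(names) > 1}


# Round-trip through gzip to mimic reading one of the *.csv.gz files.
compressed = gzip.compress(SAMPLE_CSV.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as handle:
    splits = split_rows(csv.DictReader(handle))

print({name: len(rows) for name, rows in splits.items()})
print(shared_scaffolds(splits))
```

On the invented rows this reports three training rows and one row each for test and validation, with no scaffold shared between splits. To run the same check on a real file, replace the in-memory buffer with the path to a downloaded `.csv.gz` file and adjust `TRUTHY` to match the flag encoding actually found there.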
Provider:
scbirlab