scbirlab/liu-2023-ai
Source: Hugging Face (updated 2025-11-19; indexed 2026-01-03)
Download link: https://hf-mirror.com/datasets/scbirlab/liu-2023-ai
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
tags:
- chemistry
- biology
- antibiotics
- SMILES
size_categories:
- 10K<n<100K
pretty_name: Data from Liu, 2023
configs:
- config_name: liu23
  data_files:
  - split: train
    path: "*-train.csv.gz"
  - split: test
    path: "*-test.csv.gz"
  - split: validation
    path: "*-validation.csv.gz"
---
# liu-2023-ai
SMILES of compounds used for training and prediction in:
> Liu G, Catacutan DB, Rathod K, Swanson K, Jin W, Mohammed JC, Chiappino-Pepe A, Syed SA, Fragis M, Rachwalski K, Magolan J, Surette MG, Coombes BK, Jaakkola T, Barzilay R, Collins JJ, Stokes JM.
> Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii.
> Nat Chem Biol. 2023 Nov;19(11):1342-1350.
> doi: 10.1038/s41589-023-01349-8. Epub 2023 May 25.
> PMID: 37231267.
The SMILES strings have been canonicalized and split into training (70%), validation (15%), and test (15%) sets by Murcko scaffold.
Additional features such as molecular weight and topological polar surface area have also been calculated.
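
To illustrate what a split by Murcko scaffold involves, here is a minimal sketch in plain Python. It assumes each record already carries a `scaffold` string (as in the `scaffold` column of this dataset) and assigns whole scaffold groups to splits, so no scaffold is shared between them. The function name `scaffold_split` and the greedy "largest remaining deficit" rule are illustrative assumptions; this is not necessarily the procedure used to produce the published splits.

```python
from collections import defaultdict


def scaffold_split(records, fractions=(0.70, 0.15, 0.15)):
    """Assign whole scaffold groups to train/validation/test.

    `records` is a list of dicts with a "scaffold" key. Returns a dict
    mapping split name to a list of record indices.
    """
    names = ("train", "validation", "test")

    # Group record indices by their scaffold string.
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record["scaffold"]].append(index)

    # Target size of each split, in number of records.
    targets = {name: frac * len(records) for name, frac in zip(names, fractions)}
    splits = {name: [] for name in names}

    # Place the largest groups first; ties are broken by scaffold string
    # so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    for _scaffold, members in ordered:
        # Give the group to the split that is furthest below its target.
        neediest = max(names, key=lambda name: targets[name] - len(splits[name]))
        splits[neediest].extend(members)
    return splits


# Toy input: three scaffolds covering ten records.
example = [
    {"scaffold": s}
    for s in ["c1ccccc1"] * 6 + ["c1ccncc1"] * 2 + ["C1CCCCC1"] * 2
]
example_splits = scaffold_split(example)
```

Because groups are never broken up, the realized split sizes only approximate the requested fractions, especially when a few scaffolds dominate the data.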
## Dataset Details
### Dataset Description
- **Curated by:** [@eachanjohnson](https://huggingface.co/eachanjohnson)
- **Funded by:** The Francis Crick Institute
- **License:** CC-BY-4.0
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
<!-- - **Repository:** https://doi.org/10.5281/zenodo.8136904 -->
- **Paper:** https://doi.org/10.1038/s41589-023-01349-8
<!-- - **Demo [optional]:** [More Information Needed] -->
## Uses
Developing chemistry models.
<!-- ### Direct Use -->
<!-- This section describes suitable use cases for the dataset. -->
<!-- [More Information Needed] -->
<!-- ### Out-of-Scope Use -->
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
<!-- [More Information Needed] -->
## Dataset Structure
- **SMILES**: SMILES string of compound
- **id**: Numerical almost-unique identifier of compound
- **inchikey**: Unique identifier for compound
- **smiles**: RDKit-canonicalized SMILES string of compound
- **pubchem_name**: Compound name pulled from PubChem
- **pubchem_id**: PubChem compound ID
- **scaffold**: Murcko scaffold of compound
- **mwt**: Molecular weight of compound
- **clogp**: Crippen LogP of compound
- **tpsa**: Topological polar surface area of compound
- **is_train**: Whether the compound is in the training split
- **is_test**: Whether the compound is in the test split
- **is_validation**: Whether the compound is in the validation split
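
For readers working with a locally downloaded split file, the following sketch reads one gzip-compressed CSV using only the Python standard library and reports a few summary values from the columns listed above. The helper names `read_split` and `summarize` are hypothetical, and the code assumes the file has a header row containing `scaffold` and `mwt` columns.

```python
import csv
import gzip


def read_split(path):
    """Read one gzip-compressed CSV split into a list of row dicts."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize(rows):
    """Return row count, distinct scaffold count, and mean molecular weight."""
    # Skip rows where the molecular weight field is empty or missing.
    weights = [float(row["mwt"]) for row in rows if row.get("mwt")]
    return {
        "n_rows": len(rows),
        "n_scaffolds": len({row["scaffold"] for row in rows}),
        "mean_mwt": sum(weights) / len(weights) if weights else None,
    }
```

A call such as `summarize(read_split("liu23-train.csv.gz"))` would then give a quick overview of a split; that file name is only a placeholder matching the `*-train.csv.gz` pattern in the card's configuration.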
## Dataset Creation
### Curation Rationale
To make available a large dataset of SMILES strings for DOS compounds, as distinct from commonly encountered
virtual libraries from conventional combinatorial chemistry.
#### Data Collection and Processing
Data were processed using [schemist](https://github.com/scbirlab/schemist), a tool for processing chemical datasets.
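
As a rough illustration of this kind of processing, the sketch below derives several of the listed columns directly with RDKit. It is an assumption that these particular RDKit calls correspond to the published values; the code does not use or reproduce schemist's own API, and the helper name `describe` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors
from rdkit.Chem.Scaffolds import MurckoScaffold


def describe(smiles):
    """Return canonical SMILES, Murcko scaffold, and basic descriptors.

    Returns None if RDKit cannot parse the input string.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "smiles": Chem.MolToSmiles(mol),  # RDKit canonical SMILES
        "scaffold": MurckoScaffold.MurckoScaffoldSmiles(mol=mol),
        "mwt": Descriptors.MolWt(mol),  # average molecular weight
        "clogp": Crippen.MolLogP(mol),  # Crippen LogP estimate
        "tpsa": rdMolDescriptors.CalcTPSA(mol),  # topological polar surface area
    }


# Two different spellings of aspirin yield the same canonical SMILES.
aspirin_a = describe("CC(=O)Oc1ccccc1C(=O)O")
aspirin_b = describe("O=C(O)c1ccccc1OC(C)=O")
```

Canonicalizing before comparison matters because the same compound can be written as many different SMILES strings, as the two aspirin inputs show.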
#### Who are the source data producers?
Liu G, Catacutan DB, Rathod K, Swanson K, Jin W, Mohammed JC, Chiappino-Pepe A, Syed SA,
Fragis M, Rachwalski K, Magolan J, Surette MG, Coombes BK, Jaakkola T, Barzilay R,
Collins JJ, Stokes JM
#### Personal and Sensitive Information
None.
<!-- ## Bias, Risks, and Limitations -->
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
<!-- [More Information Needed] -->
<!-- ### Recommendations -->
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
<!-- Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. -->
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```
@article{10.1038/s41589-023-01349-8,
author = {Liu, Gary and Catacutan, Denise B. and Rathod, Khushi and Swanson, Kyle and Jin, Wengong and Mohammed, Jody C. and Chiappino-Pepe, Anush and Syed, Saad A. and Fragis, Meghan and Rachwalski, Kenneth and Magolan, Jakob and Surette, Michael G. and Coombes, Brian K. and Jaakkola, Tommi and Barzilay, Regina and Collins, James J. and Stokes, Jonathan M.},
title = {Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii},
journal = {Nature Chemical Biology},
volume = {19},
number = {11},
pages = {1342-1350},
abstract = {Acinetobacter baumannii is a nosocomial Gram-negative pathogen that often displays multidrug resistance. Discovering new antibiotics against A. baumannii has proven challenging through conventional screening approaches. Fortunately, machine learning methods allow for the rapid exploration of chemical space, increasing the probability of discovering new antibacterial molecules. Here we screened ~7,500 molecules for those that inhibited the growth of A. baumannii in vitro. We trained a neural network with this growth inhibition dataset and performed in silico predictions for structurally new molecules with activity against A. baumannii. Through this approach, we discovered abaucin, an antibacterial compound with narrow-spectrum activity against A. baumannii. Further investigations revealed that abaucin perturbs lipoprotein trafficking through a mechanism involving LolE. Moreover, abaucin could control an A. baumannii infection in a mouse wound model. This work highlights the utility of machine learning in antibiotic discovery and describes a promising lead with targeted activity against a challenging Gram-negative pathogen.},
ISSN = {1552-4469},
DOI = {10.1038/s41589-023-01349-8},
url = {https://doi.org/10.1038/s41589-023-01349-8},
year = {2023},
type = {Journal Article}
}
```
**APA:**
> Liu, G., Catacutan, D. B., Rathod, K., Swanson, K., Jin, W., Mohammed, J. C., Chiappino-Pepe, A., Syed, S. A., Fragis, M., Rachwalski, K., Magolan, J., Surette, M. G., Coombes, B. K., Jaakkola, T., Barzilay, R., Collins, J. J., & Stokes, J. M. (2023). Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nature Chemical Biology, 19(11), 1342–1350. https://doi.org/10.1038/s41589-023-01349-8
<!-- ## Glossary [optional] -->
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
<!-- [More Information Needed] -->
<!-- ## More Information [optional] -->
<!-- [More Information Needed] -->
<!-- ## Dataset Card Authors [optional] -->
<!-- [More Information Needed] -->
## Dataset Card Contact
[@eachanjohnson](https://huggingface.co/eachanjohnson)
Provided by: scbirlab



