
jjingliu/approved_drug_target

Source: Hugging Face. Updated: 2026-03-31. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/jjingliu/approved_drug_target
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- biology
- medical
pretty_name: approved_drug_target
size_categories:
- 10K<n<100K
configs:
- config_name: approved_drug_target
  data_files:
  - split: train
    path: approved_drug_target.json
- config_name: uniprot_sequence
  data_files:
  - split: uniprot_seq
    path: uniprotId_sequence_2024_11_01.json
---

# Approved Drug SMILES and Protein Sequence Dataset

This dataset provides a curated collection of approved drug Simplified Molecular Input Line Entry System (SMILES) strings and their associated protein sequences. Each small molecule has been approved by at least one regulatory body, ensuring the safety and relevance of the data for computational applications. The dataset includes 1,660 approved small molecules and their 2,093 related protein targets.

# Dataset

The data comes from the following sources:
- DrugBank
- UniProt
- ChEMBL
- ZINC20

# Data verification and processing

A total of 1,710 approved small molecules were retrieved from the DrugBank database, 117 of which were labeled as withdrawn. After assessment by a physician (Ali Motahharynia) and a pharmacist (Mahsa Sheikholeslami), 50 withdrawn drugs were excluded due to safety concerns, resulting in 1,660 approved small molecules.

2,116 protein targets were associated with these drugs, but 27 proteins were missing or unverified in the UniProt database. These were manually replaced or verified using UniProt IDs, identical protein names, or the Basic Local Alignment Search Tool (BLAST) for alignment matching, ultimately leading to 2,093 verified protein targets. The protein with UniProt ID “Q5JXX5” was deleted from the UniProt database and was therefore excluded from the dataset.

# Data structure

- SMILES: Contains the SMILES strings for each of the approved molecules. These SMILES were retrieved from the DrugBank, ChEMBL, and ZINC20 databases.
- Sequences: Contains protein sequences retrieved from the UniProt database.
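The card names two JSON files in its metadata (`approved_drug_target.json` and `uniprotId_sequence_2024_11_01.json`) but does not document their internal layout. The following is a minimal sketch, not part of the original card, for inspecting a downloaded copy before relying on any field names. The helper `summarize_json` is hypothetical, and the local file paths are an assumption (the files are expected in the current working directory).

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a small summary of the top-level structure of a JSON file.

    Hypothetical helper for illustration; it makes no assumption about
    the dataset's actual field names.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)

    summary = {"type": type(data).__name__, "size": None, "sample_keys": []}
    if isinstance(data, dict):
        summary["size"] = len(data)
        summary["sample_keys"] = list(data)[:5]
    elif isinstance(data, list):
        summary["size"] = len(data)
        # If the entries are objects, report the field names of the first one.
        if data and isinstance(data[0], dict):
            summary["sample_keys"] = list(data[0])[:5]
    return summary


if __name__ == "__main__":
    # File names come from the card metadata; local paths are an assumption.
    for name in ("approved_drug_target.json", "uniprotId_sequence_2024_11_01.json"):
        if Path(name).exists():
            print(name, summarize_json(name))
        else:
            print(name, "not found locally")
```

Checking the reported size and keys against the counts stated in the card (1,660 molecules, 2,093 protein targets) is one way to confirm that a download is complete, though how records map to those counts depends on the file layout.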
# You can load this dataset with:

```python
from datasets import load_dataset

# Load the "approved_drug_target" config (its "train" split is defined in the card metadata)
dataset = load_dataset("alimotahharynia/approved_drug_target", "approved_drug_target")
```

You can also download the dataset directly in JSON format.

# Citation

If you use this dataset in your research, please cite our paper:

```
Sheikholeslami, M., Mazrouei, N., Gheisari, Y., Fasihi, A., Irajpour, M., & Motahharynia, A*.
DrugGen enhances drug discovery with large language models and reinforcement learning.
Sci Rep 15, 13445 (2025). https://doi.org/10.1038/s41598-025-98629-1
```
Provider:
jjingliu