
lukaskim/ChEMBL-36

Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/lukaskim/ChEMBL-36
Resource description:
---
license: cc-by-sa-4.0
pretty_name: ChEMBL 36
tags:
- chemistry
- drug-discovery
- biology
- smiles
- proteins
- bioactivity
size_categories:
- 1M<n<10M
configs:
- config_name: molecules
  data_files: molecules/train-*.parquet
- config_name: targets
  data_files: targets/train-*.parquet
- config_name: molecule_target_pairs
  data_files: molecule_target_pairs/train-*.parquet
---

# ChEMBL 36

ChEMBL 36 converted to HuggingFace `datasets` format. Source data is from the [ChEMBL database](https://www.ebi.ac.uk/chembl/) (EMBL-EBI), a manually curated database of bioactive molecules with drug-like properties.

## Dataset configs

### `molecules` (~2.4M rows)

All compounds in ChEMBL with canonical SMILES representations.

| Column | Type | Description |
|---|---|---|
| `chembl_id` | string | ChEMBL compound identifier |
| `canonical_smiles` | string | Canonical SMILES representation |
| `standard_inchi` | string | Standard InChI representation |
| `standard_inchi_key` | string | Standard InChI key (27-char hash; use for cross-database deduplication) |
| `molecule_type` | string | Compound type (e.g. `Small molecule`) |
| `max_phase` | float64 | Highest clinical trial phase reached (0–4; 4 = approved drug; null = not tested) |
| `first_approval` | float64 | Year of first regulatory approval, if approved |
| `oral` | int64 | Orally bioavailable flag (1/0) |
| `prodrug` | int64 | Prodrug flag (1/0) |
| `natural_product` | int64 | Natural product flag (1/0) |
| `black_box_warning` | int64 | FDA black-box warning flag (1/0) |
| `withdrawn_flag` | int64 | Withdrawn from market flag (1/0) |
| `therapeutic_flag` | int64 | Has a documented therapeutic use (1/0) |
| `mw_freebase` | float64 | Molecular weight of the free base form |
| `alogp` | float64 | Calculated lipophilicity (Wildman–Crippen LogP) |
| `hba` | float64 | Number of hydrogen bond acceptors |
| `hbd` | float64 | Number of hydrogen bond donors |
| `psa` | float64 | Polar surface area (Ų) |
| `rtb` | float64 | Number of rotatable bonds |
| `aromatic_rings` | float64 | Number of aromatic rings |
| `heavy_atoms` | float64 | Number of heavy atoms |
| `qed_weighted` | float64 | Quantitative Estimate of Drug-likeness (0–1) |
| `num_ro5_violations` | float64 | Number of Lipinski Rule-of-Five violations (0–4) |

### `targets` (~15K rows)

Protein targets with amino-acid sequences. Multi-component targets (protein complexes) produce one row per component.

| Column | Type | Description |
|---|---|---|
| `target_chembl_id` | string | ChEMBL target identifier |
| `pref_name` | string | Preferred target name |
| `target_type` | string | Target type (e.g. `SINGLE PROTEIN`, `PROTEIN COMPLEX`) |
| `organism` | string | Source organism (e.g. `Homo sapiens`) |
| `tax_id` | int64 | NCBI taxonomy ID |
| `accession` | string | UniProt accession |
| `sequence` | string | Amino-acid sequence |
| `gene_names` | string | Comma-separated HGNC gene symbol(s) (e.g. `EGFR`); null for non-gene targets |
| `protein_class_l1` | string | Top-level protein family (e.g. `Kinase`, `GPCR`); null if unclassified |
| `protein_class_l2` | string | Second-level protein family; null if unclassified |

### `molecule_target_pairs` (~1–5M rows)

Paired bioactivity records linking compounds to protein targets. Filtered to rows with non-null SMILES, sequence, and pChEMBL value. All included measurements use `standard_relation = '='` and `standard_units = 'nM'`.

| Column | Type | Description |
|---|---|---|
| `chembl_id` | string | ChEMBL compound identifier |
| `canonical_smiles` | string | Canonical SMILES representation |
| `target_chembl_id` | string | ChEMBL target identifier |
| `target_pref_name` | string | Preferred target name |
| `organism` | string | Target organism |
| `sequence` | string | Amino-acid sequence (`\|`-separated for multi-component targets) |
| `standard_type` | string | Activity type (e.g. `IC50`, `Ki`, `Kd`) |
| `standard_value` | float64 | Raw activity value (nM) |
| `pchembl_value` | float64 | Standardized −log₁₀ activity value |
| `assay_type` | string | Assay classification (e.g. `B` for binding) |
| `confidence_score` | int64 | Target-assignment confidence 0–9 (9 = direct single-protein assay) |

## Usage

```python
from datasets import load_dataset

# "your-org/chembl-36" is a placeholder repo id; substitute the actual one
# (this page lists the dataset as lukaskim/ChEMBL-36).
molecules = load_dataset("your-org/chembl-36", "molecules")
targets = load_dataset("your-org/chembl-36", "targets")
pairs = load_dataset("your-org/chembl-36", "molecule_target_pairs")
```

## Source

Built from the [ChEMBL 36 SQLite release](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_36/) using [chembl-hf](https://github.com/your-org/chembl-hf).

## License

ChEMBL data is released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). If you use this dataset, please cite the ChEMBL publication:

> Zdrazil B, et al. (2024). The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. *Nucleic Acids Research*, 52(D1), D1180–D1192. https://doi.org/10.1093/nar/gkad1004
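
The `molecule_target_pairs` columns described above relate to each other in a few simple ways, which the following minimal sketch tries to illustrate without downloading anything. It assumes that `pchembl_value` is the negative base-10 logarithm of the activity expressed in molar units (so a `standard_value` in nM maps to `9 - log10(value)`), that stored values may be rounded, and that multi-component sequences are joined with `|` as the table states. The helper names `pchembl_from_nm` and `split_components` and the two sample rows are hypothetical, not part of the dataset or of any library.

```python
import math


def pchembl_from_nm(standard_value_nm: float) -> float:
    """Return -log10 of an activity given in nM, after converting to molar."""
    if standard_value_nm <= 0:
        raise ValueError("activity value must be positive")
    # 1 nM = 1e-9 M, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(standard_value_nm)


def split_components(sequence: str) -> list[str]:
    """Split a `|`-separated multi-component sequence into its chains."""
    return sequence.split("|")


# Hypothetical rows (made-up identifiers and values) shaped like
# records of the `molecule_target_pairs` config.
rows = [
    {"chembl_id": "CHEMBL_A", "standard_type": "IC50",
     "standard_value": 10.0, "sequence": "MKTAYIAK", "confidence_score": 9},
    {"chembl_id": "CHEMBL_B", "standard_type": "Ki",
     "standard_value": 1000.0, "sequence": "MSTNPK|MAAGLR", "confidence_score": 7},
]

# Keep only measurements with the highest target-assignment confidence.
direct = [r for r in rows if r["confidence_score"] == 9]

for r in rows:
    n_chains = len(split_components(r["sequence"]))
    print(r["chembl_id"], round(pchembl_from_nm(r["standard_value"]), 2), n_chains)
```

With real data, the same filter and conversion could be applied to the rows returned by `load_dataset`; comparing the recomputed value against the stored `pchembl_value` column is one way to sanity-check the assumed relationship.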
Provider:
lukaskim