lukaskim/ChEMBL-36
Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/lukaskim/ChEMBL-36
Description:
---
license: cc-by-sa-4.0
pretty_name: ChEMBL 36
tags:
- chemistry
- drug-discovery
- biology
- smiles
- proteins
- bioactivity
size_categories:
- 1M<n<10M
configs:
- config_name: molecules
  data_files: molecules/train-*.parquet
- config_name: targets
  data_files: targets/train-*.parquet
- config_name: molecule_target_pairs
  data_files: molecule_target_pairs/train-*.parquet
---
# ChEMBL 36
ChEMBL 36 converted to the Hugging Face `datasets` format. The source data comes from the [ChEMBL database](https://www.ebi.ac.uk/chembl/) (EMBL-EBI), a manually curated database of bioactive molecules with drug-like properties.
## Dataset configs
### `molecules` (~2.4M rows)
All compounds in ChEMBL with canonical SMILES representations.
| Column | Type | Description |
|---|---|---|
| `chembl_id` | string | ChEMBL compound identifier |
| `canonical_smiles` | string | Canonical SMILES representation |
| `standard_inchi` | string | Standard InChI representation |
| `standard_inchi_key` | string | Standard InChI key (27-char hash; use for cross-database deduplication) |
| `molecule_type` | string | Compound type (e.g. `Small molecule`) |
| `max_phase` | float64 | Highest clinical trial phase reached (0–4; 4 = approved drug; null = not tested) |
| `first_approval` | float64 | Year of first regulatory approval, if approved |
| `oral` | int64 | Orally bioavailable flag (1/0) |
| `prodrug` | int64 | Prodrug flag (1/0) |
| `natural_product` | int64 | Natural product flag (1/0) |
| `black_box_warning` | int64 | FDA black-box warning flag (1/0) |
| `withdrawn_flag` | int64 | Withdrawn from market flag (1/0) |
| `therapeutic_flag` | int64 | Has a documented therapeutic use (1/0) |
| `mw_freebase` | float64 | Molecular weight of the free base form |
| `alogp` | float64 | Calculated lipophilicity (Wildman–Crippen LogP) |
| `hba` | float64 | Number of hydrogen bond acceptors |
| `hbd` | float64 | Number of hydrogen bond donors |
| `psa` | float64 | Polar surface area (Ų) |
| `rtb` | float64 | Number of rotatable bonds |
| `aromatic_rings` | float64 | Number of aromatic rings |
| `heavy_atoms` | float64 | Number of heavy atoms |
| `qed_weighted` | float64 | Quantitative Estimate of Drug-likeness (0–1) |
| `num_ro5_violations` | float64 | Number of Lipinski Rule-of-Five violations (0–4) |
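
The property columns above lend themselves to simple drug-likeness filters. The following is a minimal sketch, assuming rows arrive as plain dicts keyed by the column names in the table (which is how iterating a `datasets.Dataset` yields them) and that nulls appear as `None`. The helper `is_druglike`, the 0.5 QED cut-off, and the demo rows are illustrative assumptions, not part of the dataset.

```python
def is_druglike(row, min_qed=0.5):
    """Return True for rows with zero Rule-of-Five violations and QED >= min_qed.

    Rows where either property is null (None) are rejected.
    """
    violations = row.get("num_ro5_violations")
    qed = row.get("qed_weighted")
    if violations is None or qed is None:
        return False
    return violations == 0 and qed >= min_qed


# Made-up rows for illustration only; these are not real ChEMBL records.
demo_rows = [
    {"chembl_id": "DEMO1", "num_ro5_violations": 0.0, "qed_weighted": 0.8},
    {"chembl_id": "DEMO2", "num_ro5_violations": 2.0, "qed_weighted": 0.9},
    {"chembl_id": "DEMO3", "num_ro5_violations": None, "qed_weighted": 0.7},
]
druglike_ids = [r["chembl_id"] for r in demo_rows if is_druglike(r)]
print(druglike_ids)
```

With a loaded split, the same predicate could be passed to `Dataset.filter`, which calls it once per row.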
### `targets` (~15K rows)
Protein targets with amino-acid sequences. Multi-component targets (protein complexes) produce one row per component.
| Column | Type | Description |
|---|---|---|
| `target_chembl_id` | string | ChEMBL target identifier |
| `pref_name` | string | Preferred target name |
| `target_type` | string | Target type (e.g. `SINGLE PROTEIN`, `PROTEIN COMPLEX`) |
| `organism` | string | Source organism (e.g. `Homo sapiens`) |
| `tax_id` | int64 | NCBI taxonomy ID |
| `accession` | string | UniProt accession |
| `sequence` | string | Amino-acid sequence |
| `gene_names` | string | Comma-separated HGNC gene symbol(s) (e.g. `EGFR`); null for non-gene targets |
| `protein_class_l1` | string | Top-level protein family (e.g. `Kinase`, `GPCR`); null if unclassified |
| `protein_class_l2` | string | Second-level protein family; null if unclassified |
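
Because a multi-component target occupies one row per component, per-target analyses usually need the rows regrouped. The sketch below collects component sequences by `target_chembl_id` and joins them with `|`, the separator the `molecule_target_pairs` config is described as using. The helper name `join_component_sequences` and the demo rows are hypothetical, and the component order is simply input order, which is an assumption and may not match the order used in the pairs config.

```python
from collections import defaultdict


def join_component_sequences(rows, sep="|"):
    """Map each target_chembl_id to its component sequences joined by sep.

    Components are kept in the order they appear in rows; rows with a
    null sequence are skipped.
    """
    grouped = defaultdict(list)
    for row in rows:
        if row["sequence"] is not None:
            grouped[row["target_chembl_id"]].append(row["sequence"])
    return {target_id: sep.join(seqs) for target_id, seqs in grouped.items()}


# Made-up rows for illustration only; IDs and sequences are not real.
demo_targets = [
    {"target_chembl_id": "T_COMPLEX", "sequence": "MKT"},
    {"target_chembl_id": "T_SINGLE", "sequence": "MAA"},
    {"target_chembl_id": "T_COMPLEX", "sequence": "MGL"},
]
joined = join_component_sequences(demo_targets)
print(joined)
```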
### `molecule_target_pairs` (~1–5M rows)
Paired bioactivity records linking compounds to protein targets. Filtered to rows with non-null SMILES, sequence, and pChEMBL value. All included measurements use `standard_relation = '='` and `standard_units = 'nM'`.
| Column | Type | Description |
|---|---|---|
| `chembl_id` | string | ChEMBL compound identifier |
| `canonical_smiles` | string | Canonical SMILES representation |
| `target_chembl_id` | string | ChEMBL target identifier |
| `target_pref_name` | string | Preferred target name |
| `organism` | string | Target organism |
| `sequence` | string | Amino-acid sequence (`\|`-separated for multi-component targets) |
| `standard_type` | string | Activity type (e.g. `IC50`, `Ki`, `Kd`) |
| `standard_value` | float64 | Raw activity value (nM) |
| `pchembl_value` | float64 | Standardized −log₁₀ of the activity value expressed in molar units (e.g. 1 nM corresponds to 9.0) |
| `assay_type` | string | Assay classification (e.g. `B` for binding) |
| `confidence_score` | int64 | Target-assignment confidence 0–9 (9 = direct single-protein assay) |
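
Since every row in this config reports `standard_value` in nM, the relationship between `standard_value` and `pchembl_value` can be sketched as pChEMBL = 9 − log₁₀(value in nM). The example below shows that conversion and a simple filter for potent, high-confidence pairs. The helper names, the two-decimal rounding, the thresholds (pChEMBL ≥ 6, confidence ≥ 9), and the demo rows are illustrative assumptions rather than anything the dataset prescribes.

```python
import math


def pchembl_from_nm(value_nm):
    """Return -log10 of a concentration given in nM, rounded to two decimals."""
    if value_nm <= 0:
        raise ValueError("activity value must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return round(9.0 - math.log10(value_nm), 2)


def potent_direct_hits(rows, min_pchembl=6.0, min_confidence=9):
    """Keep rows at or above both the potency and the confidence threshold."""
    return [
        row for row in rows
        if row["pchembl_value"] >= min_pchembl
        and row["confidence_score"] >= min_confidence
    ]


# Made-up rows for illustration only; these are not real ChEMBL records.
demo_pairs = [
    {"chembl_id": "DEMO1", "standard_value": 10.0, "pchembl_value": 8.0, "confidence_score": 9},
    {"chembl_id": "DEMO2", "standard_value": 5000.0, "pchembl_value": 5.3, "confidence_score": 9},
    {"chembl_id": "DEMO3", "standard_value": 1.0, "pchembl_value": 9.0, "confidence_score": 7},
]
hits = [row["chembl_id"] for row in potent_direct_hits(demo_pairs)]
print(hits)
```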
## Usage
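```python
from datasets import load_dataset

# Repository ID as shown on the dataset page.
REPO_ID = "lukaskim/ChEMBL-36"

# Each call loads one config; the data files are named train-*.parquet,
# so the rows are expected under the "train" split.
molecules = load_dataset(REPO_ID, "molecules")
targets = load_dataset(REPO_ID, "targets")
pairs = load_dataset(REPO_ID, "molecule_target_pairs")
```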
## Source
Built from the [ChEMBL 36 SQLite release](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_36/) using [chembl-hf](https://github.com/your-org/chembl-hf).
## License
ChEMBL data is released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). If you use this dataset, please cite the ChEMBL publication:
> Zdrazil B, et al. (2024). The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. *Nucleic Acids Research*, 52(D1), D1180–D1192. https://doi.org/10.1093/nar/gkad1004
Provider:
lukaskim



