kohbanye/crossdocked2020
Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/kohbanye/crossdocked2020
Resource description:
---
license: cc0-1.0
task_categories:
- other
tags:
- drug-discovery
- molecular-generation
- protein-ligand
- structural-biology
size_categories:
- 1M<n<10M
---
# CrossDocked2020
Pre-processed CrossDocked2020 dataset containing raw receptor PDB and ligand
SDF.gz files, organized for efficient loading.
## Dataset Summary
- **Unique pairs**: 25,092,018
- **Unique receptor PDB files**: 24,533
- **Source types**: cdonly, it0, it2_redocked
- **Fold splits**: 3 folds (0, 1, 2) per source type category
## Repository Structure
```
receptors/ Unique receptor PDB files in tar.gz archives
ligands/ Ligand SDF.gz files in tar shards (WebDataset-compatible)
manifest.parquet Pair index with metadata and fold split info
```
## Ligand Tar Shard Format
Each shard is a tar file that stores two files per sample, both named by the zero-padded pair index:
- `{pair_idx:07d}.sdf.gz` — original ligand SDF.gz (all conformers)
- `{pair_idx:07d}.json` — metadata (receptor_path, complex_dir, source_type)
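The shard layout above can be read with the Python standard library alone. The following is a minimal sketch, not the dataset author's loader: it assumes the two files of a sample sit at the top level of the tar and that the SDF text is UTF-8. The names `iter_shard_samples` and `make_demo_shard` are hypothetical, and the demo shard holds placeholder text rather than real ligand data.

```python
import gzip
import io
import json
import posixpath
import tarfile


def iter_shard_samples(fileobj):
    """Yield (pair_idx, sdf_text, metadata) for each complete sample in a shard."""
    pending = {}  # sample key -> {extension: raw bytes}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # "0000042.sdf.gz" -> key "0000042", extension "sdf.gz"
            key, _, ext = posixpath.basename(member.name).partition(".")
            parts = pending.setdefault(key, {})
            parts[ext] = tar.extractfile(member).read()
            if "sdf.gz" in parts and "json" in parts:
                del pending[key]
                sdf_text = gzip.decompress(parts["sdf.gz"]).decode("utf-8")
                yield int(key), sdf_text, json.loads(parts["json"])


def make_demo_shard(samples):
    """Build a tiny in-memory shard (illustration only, not real data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for pair_idx, sdf_text, meta in samples:
            files = {
                f"{pair_idx:07d}.sdf.gz": gzip.compress(sdf_text.encode("utf-8")),
                f"{pair_idx:07d}.json": json.dumps(meta).encode("utf-8"),
            }
            for name, payload in files.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


# Placeholder samples; the metadata keys mirror those listed in the card.
demo = make_demo_shard([
    (7, "placeholder SDF text\n$$$$\n", {"source_type": "it0"}),
    (8, "another placeholder\n$$$$\n", {"source_type": "cdonly"}),
])
samples = list(iter_shard_samples(demo))
```

For a real shard, pass a file opened in binary mode, for example `open(path, "rb")`, in place of the demo buffer. Pairing files by key instead of by position keeps the reader tolerant of either file order within a sample.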
## Manifest Schema
| Column | Type | Description |
|--------|------|-------------|
| pair_idx | uint32 | Global unique pair ID |
| complex_dir | string | Complex directory name |
| receptor_pdb | string | Receptor PDB filename |
| ligand_sdf_gz | string | Ligand SDF.gz filename |
| source_type | string | cdonly / it0 / it2_redocked |
| shard_idx | uint16 | Ligand shard number |
| label | int8 | Types file label (0/1) |
| score1 | float32 | Types file score 1 |
| score2 | float32 | Types file score 2 |
| {cat}_fold{n} | string | "train" / "test" per category and fold |
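To show how the split columns might be used, the sketch below picks the pairs of one split and groups them by `shard_idx`, so that only the needed shards have to be opened. It rests on assumptions: the `{cat}` part of the column name is taken to be a source type such as `it0`, and manifest rows are represented as plain dictionaries (for instance as produced by converting the Parquet table to a list of records). The helper names `split_column` and `pairs_by_shard` and the demo rows are hypothetical.

```python
from collections import defaultdict


def split_column(category, fold):
    """Build a split column name following the `{cat}_fold{n}` pattern."""
    return f"{category}_fold{fold}"


def pairs_by_shard(rows, category, fold, split="train"):
    """Group the pair_idx values of one split by the shard that stores them."""
    column = split_column(category, fold)
    shards = defaultdict(list)
    for row in rows:
        if row.get(column) == split:
            shards[row["shard_idx"]].append(row["pair_idx"])
    return dict(shards)


# Made-up rows for illustration; ASSUMPTION: category names match source_type.
demo_rows = [
    {"pair_idx": 0, "shard_idx": 0, "it0_fold0": "train"},
    {"pair_idx": 1, "shard_idx": 0, "it0_fold0": "test"},
    {"pair_idx": 2, "shard_idx": 1, "it0_fold0": "train"},
    {"pair_idx": 3, "shard_idx": 1, "it0_fold0": "train"},
]
train_shards = pairs_by_shard(demo_rows, "it0", 0)
```

With a DataFrame library the same selection would be a boolean mask on the split column followed by a group-by on `shard_idx`; the dictionary version is used here only to keep the sketch dependency-free.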
## Original Source
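Paul G. Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B. Iovanisci, Ian Snyder, David R. Koes. "Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design." *J. Chem. Inf. Model.* 2020, 60(9), pp. 4200–4215.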
Provider:
kohbanye



