
RosettaCommons/NCI_Open_Compounds

Hugging Face | Updated 2026-03-05 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/RosettaCommons/NCI_Open_Compounds
Resource description:
---
language:
- en
tags:
- chemistry
- biology
pretty_name: >-
  Processed NCI Open Compounds Structures for Docking, Cofold, and Affinity
  Prediction
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: nci_compounds.tsv
  delimiter: "\t"
---

# Curated NCI Open Compounds dataset

A curated set of the NCI Open Compounds with compatible mol2 and pdbqt files that are safe for cofolding and docking applications.

## Quickstart Usage

### Install HuggingFace Datasets package

Each subset can be loaded into Python using the Hugging Face [datasets](https://huggingface.co/docs/datasets/index) library.

First, from the command line, install the `datasets` library:

    $ pip install datasets

Optionally set the cache directory, e.g.

    $ HF_HOME=${HOME}/.cache/huggingface/
    $ export HF_HOME

Then, from within Python, load the `datasets` library:

    >>> import datasets

### Load model datasets

To load one of the `NCI_Open_Compounds` model datasets, use `datasets.load_dataset(...)`:

    >>> dataset_tag = "train"
    >>> dataset_models = datasets.load_dataset(
    ...     path = "leebecca/NCI_Open_Compounds",
    ...     name = f"{dataset_tag}_models",
    ...     data_dir = f"{dataset_tag}")['train']

The dataset is loaded as a `datasets.arrow_dataset.Dataset`:

    >>> dataset_models
    Dataset({
        features: [
            'NSC', 'duplicate_idx', 'CID', 'SID', 'CAS', 'entry_id',
            'entry_name', 'name', 'formula', 'smiles', 'mw', 'tot_q',
            'tot_abs_q', 'chiralities_consistent', 'chiral_flag', 'flags',
            'charging_adjusted_penalty', 'ionization_penalty',
            'ionization_penalty_charging', 'ionization_penalty_neutral',
            'state_penalty', 'energy', 'tautomer_probability', 'input_file',
            'structure_evaluation', 'chemistry_notes', 'pka_notes'
        ],
        num_rows: 445794
    })

## Dataset Details

### Dataset Description

The set contains the LigPrep output of the minimized 3D structures, expanded to include possible protonation states and tautomers, capped at 3 per ligand.
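The cap of 3 states per ligand can be checked on a local copy of the table. The following is a minimal sketch, not part of the dataset's tooling. It assumes that each row is one protonation or tautomer state and that rows sharing an `NSC` value belong to the same ligand; the card does not state either explicitly, and the `duplicate_idx` column may also be needed to separate repeated entries. The sample rows are hypothetical placeholders, and the helper names `states_per_ligand` and `over_cap` are illustrative.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows for illustration only; real values would come
# from nci_compounds.tsv (tab-delimited, per the card's front matter).
SAMPLE_TSV = (
    "NSC\tsmiles\ttautomer_probability\n"
    "1\tCCO\t1.0\n"
    "2\tCC(=O)O\t0.7\n"
    "2\tCC(=O)[O-]\t0.3\n"
)

MAX_STATES = 3  # cap stated in the dataset description


def states_per_ligand(handle, key="NSC"):
    """Count rows per ligand identifier in a tab-delimited file.

    Assumption: one row per state, and `key` identifies the parent ligand.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[key] for row in reader)


def over_cap(counts, cap=MAX_STATES):
    """Return the identifiers that have more rows than the cap, sorted."""
    return sorted(k for k, n in counts.items() if n > cap)


counts = states_per_ligand(io.StringIO(SAMPLE_TSV))
violations = over_cap(counts)
print(dict(counts))
print(violations)
```

To run the same check on a downloaded file, pass `open("nci_compounds.tsv", newline="")` in place of the in-memory sample.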
- **Acknowledgements:** We kindly acknowledge RosettaCommons

### Dataset Sources

https://wiki.nci.nih.gov/spaces/NCIDTPdata/pages/155844992/Chemical+Data

## Uses

### Out-of-Scope Use

### Source Data

## Citation

## Dataset Card Authors

Becca Lee (beccalee5@g.ucla.edu)
Provider:
RosettaCommons