lianghsun/peptidomimetic-pretrain

Name: lianghsun/peptidomimetic-pretrain
Creator: lianghsun
Published: 2026-04-14 22:47:12
License: MIT

Source: Hugging Face. Updated: 2026-04-14. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/lianghsun/peptidomimetic-pretrain


Description:

---
license: mit
language:
- en
task_categories:
- text-generation
tags:
- chemistry
- drug-discovery
- selfies
- peptidomimetic
- molecular-generation
- psma
- pretrain
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: selfies
    dtype: string
  - name: input_ids
    sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1935737412
    num_examples: 1433044
  - name: validation
    num_bytes: 39499500
    num_examples: 29245
  download_size: 1643143156
  dataset_size: 1975236912
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# Dataset Card for peptidomimetic-pretrain

peptidomimetic-pretrain is a peptidomimetic pretraining corpus written in SELFIES (Self-Referencing Embedded Strings) molecular notation. It contains 1,433,044 training molecules and 29,245 validation molecules. It is intended for pretraining molecular generation models with Causal Language Modeling (CLM), and serves as the first-stage training material for a PSMA-targeted drug discovery model.

## Dataset Details

### Dataset Description

This dataset is designed for peptidomimetic drug discovery against the PSMA target (Prostate-Specific Membrane Antigen, also known as Glutamate Carboxypeptidase II, GCPII). It collects existing PSMA-related ligand molecules, converts them to SELFIES strings, and pre-tokenizes them in CLM format, producing a preprocessed corpus that can be fed directly into a language model training loop.

The advantage of SELFIES over SMILES is that any string of SELFIES symbols is guaranteed to decode to a valid molecule (100% validity). This makes it well suited to generative molecular design, because the model cannot emit an invalid compound.

- **Curated by:** [Liang Hsun Huang](https://www.linkedin.com/in/lianghsunhuang/?locale=en_US)
- **Language(s) (NLP):** English / SELFIES notation
- **License:** MIT
- **Purpose:** Phase 1 pretraining for the PSMA-targeted drug discovery model [lianghsun/peptide-dpt](https://github.com/lianghsun/peptide-dpt)

### Dataset Sources

- **Repository:** [lianghsun/peptidomimetic-pretrain](https://huggingface.co/datasets/lianghsun/peptidomimetic-pretrain/)
- **Downstream Project:** [lianghsun/peptide-dpt](https://github.com/lianghsun/peptide-dpt)
- **Original Sources:**
  - [ChEMBL GCPII (CHEMBL3231)](https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL3231/)
  - [RCSB PDB](https://www.rcsb.org/) (co-crystal ligands)
  - [BindingDB (UniProt Q04609)](https://www.bindingdb.org/)

## Uses

### Direct Use

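This dataset is primarily designed for:

* Pretraining molecular generation language models with CLM, so that they learn the scaffold distribution of PSMA-related peptidomimetic compounds;
* Serving as the base corpus for the Phase 1 pretraining stage of peptide-dpt;
* Studying molecular language modeling and tokenization strategies in the SELFIES format;
* Language-model-based generative molecular design tasks in drug discovery.

### Out-of-Scope Use

This dataset is not suitable for the following uses:

* Predicting molecular activity (binding affinity) or toxicity. The dataset contains only molecular structures and no quantitative activity annotations.
* Direct use in drug design for targets other than PSMA. The molecular distribution of this corpus is biased toward PSMA-related compounds.
* Serving as a basis for clinical use of a drug. Any generated molecule must go through experimental validation and clinical trials.
* SMILES-based training pipelines. This dataset uses the SELFIES format.

## Dataset Structure

```python
{
    "selfies": "[C][N][S][=Branch1][C][=O][=Branch1][C][=O][C][=C][C][=C]...",
    "input_ids": [1, 159, 206, 243, 90, 159, 105, ...],
    "labels": [159, 206, 243, 90, 159, 105, 90, ..., 2]
}
```

| Field | Type | Description |
|---|---|---|
| `selfies` | string | SELFIES string representation of the molecule |
| `input_ids` | list[int64] | Tokenized CLM input (starts with BOS) |
| `labels` | list[int64] | CLM labels (offset by one position relative to `input_ids`, so that each label is the next token; EOS appended at the end) |

| Split | Examples |
|---|---|
| train | 1,433,044 |
| validation | 29,245 |

## Dataset Creation

### Curation Rationale

PSMA is an important target in prostate cancer and several neuroendocrine tumors, and the design of peptidomimetic drugs against it has long been limited by the lack of usable training corpora. This dataset consolidates PSMA-related ligand molecules from public scientific databases such as ChEMBL, PDB, and BindingDB into a single SELFIES corpus and pre-tokenizes it in CLM format, so that it can go directly into the pretraining pipeline of a molecular generation language model.

### Source Data

#### Data Collection and Processing

Data was collected from the following public scientific databases:

1. **ChEMBL GCPII (CHEMBL3231)**: active ligand molecules for GCPII / PSMA.
2. **RCSB PDB co-crystal ligands**: ligands found in PSMA co-crystal structures.
3. **BindingDB (UniProt Q04609)**: binding affinity data for PSMA.

All molecules are converted to the SELFIES representation and encoded with a custom tokenizer to produce `input_ids` and `labels` (the labels are `input_ids` offset by one position, forming the CLM training target). The data is split into train and validation sets and stored in parquet format.

#### Who are the source data producers?

The original molecular data is maintained by scientific organizations including ChEMBL (EMBL-EBI), RCSB PDB, and BindingDB, and is contributed by the global medicinal chemistry and structural biology research community.

### Annotations

#### Annotation process

This dataset contains no additional annotations. `input_ids` and `labels` are the output of automatic tokenization.

#### Who are the annotators?
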
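Not applicable.

#### Personal and Sensitive Information

This dataset contains only molecular structure data. It involves no personal information, medical records, or sensitive data.

## Bias, Risks, and Limitations

* The molecular distribution is biased toward the existing ligand space of the PSMA target, so a model may struggle to produce novel molecules that differ substantially in structure from known compounds.
* SELFIES guarantees that generated molecules are valid, but it does not guarantee synthesizability or pharmacological activity.
* The tokenization depends on a specific tokenizer. Downstream users with a different tokenizer need to reprocess the data.
* The dataset size is bounded by the number of PSMA ligands covered by public databases, on the order of one million entries.
* The dataset contains no quantitative annotations for target affinity, ADMET, or toxicity, so it cannot be used on its own for activity prediction.

### Recommendations

Users are advised to:

* Pretrain on this dataset, then fine-tune downstream with PSMA activity data (for example, reward-based or DPO-based molecular optimization);
* Screen generated candidate molecules with molecular docking, synthetic feasibility assessment, and experimental validation;
* When applying the approach to other targets, collect the corresponding ligand data anew and perform domain-adaptive training.

## Citation

```bibtex
@misc{peptidomimetic-pretrain,
  title        = {peptidomimetic-pretrain: A SELFIES Corpus for PSMA-targeted Drug Discovery},
  author       = {Liang Hsun Huang},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/datasets/lianghsun/peptidomimetic-pretrain}},
  note         = {1.43M training + 29K validation SELFIES molecules for Phase 1 pretraining of the peptide-dpt model.}
}
```

## Dataset Card Authors

[Liang Hsun Huang](https://www.linkedin.com/in/lianghsunhuang/?locale=en_US)

## Dataset Card Contact

[Liang Hsun Huang](https://www.linkedin.com/in/lianghsunhuang/?locale=en_US)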
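
The relationship between the `selfies`, `input_ids`, and `labels` fields can be sketched in a few lines of plain Python. This is a minimal illustration, not the dataset's actual preprocessing code: the special-token ids `BOS_ID = 1` and `EOS_ID = 2` are assumptions inferred from the sample record (whose `input_ids` starts with 1 and whose `labels` ends with 2), and `split_selfies`, `build_clm_pair`, and the on-the-fly `vocab` are hypothetical helpers. The real custom tokenizer and its vocabulary are not described in the card.

```python
import re

# Assumed special-token ids, inferred from the sample record in the card
# (input_ids starts with 1, labels ends with 2); not confirmed by the source.
BOS_ID = 1
EOS_ID = 2


def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]+\]", selfies)


def build_clm_pair(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Build (input_ids, labels) so that labels[i] is the token after input_ids[i]."""
    input_ids = [BOS_ID] + token_ids  # BOS prepended to the input
    labels = token_ids + [EOS_ID]     # same tokens offset by one, EOS appended
    return input_ids, labels


symbols = split_selfies("[C][N][S][=Branch1][C][=O]")

# Hypothetical vocabulary built on the fly, with ids starting after the
# special tokens; the dataset's real tokenizer uses different ids.
vocab: dict[str, int] = {}
for symbol in symbols:
    vocab.setdefault(symbol, len(vocab) + 3)

token_ids = [vocab[symbol] for symbol in symbols]
input_ids, labels = build_clm_pair(token_ids)

print(symbols)
print(input_ids)
print(labels)
```

Both sequences have the same length, and `labels[:-1] == input_ids[1:]` holds by construction, which is the next-token layout shown in the sample record.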

Provided by:

lianghsun
