lianghsun/peptidomimetic-pretrain
---
license: mit
language:
- en
task_categories:
- text-generation
tags:
- chemistry
- drug-discovery
- selfies
- peptidomimetic
- molecular-generation
- psma
- pretrain
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: selfies
    dtype: string
  - name: input_ids
    sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1935737412
    num_examples: 1433044
  - name: validation
    num_bytes: 39499500
    num_examples: 29245
  download_size: 1643143156
  dataset_size: 1975236912
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---
# Dataset Card for peptidomimetic-pretrain
<!-- Provide a quick summary of the dataset. -->
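peptidomimetic-pretrain is a peptidomimetic pretraining corpus of molecules written in SELFIES (Self-Referencing Embedded Strings) notation. It contains 1,433,044 training molecules and 29,245 validation molecules and is intended for pretraining molecular generation models with causal language modeling (CLM), serving as the Phase 1 base training material for a PSMA-targeted drug discovery model.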
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
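This dataset is designed for peptidomimetic drug discovery against PSMA (Prostate-Specific Membrane Antigen, also known as Glutamate Carboxypeptidase II, GCPII). It collects known PSMA-related ligand molecules, converts them to SELFIES strings, and pre-tokenizes them in CLM format, yielding a preprocessed corpus that can be fed directly into a language model training loop.

The advantage of SELFIES over SMILES is that any string of SELFIES symbols is guaranteed to decode to a valid molecule (100% validity). This makes it well suited to generative molecular design, because the model cannot output a syntactically invalid compound.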
- **Curated by:** [Liang Hsun Huang](https://www.linkedin.com/in/lianghsunhuang/?locale=en_US)
- **Language(s) (NLP):** English / SELFIES notation
- **License:** MIT
- **Purpose:** Phase 1 pretraining for the PSMA-targeted drug discovery model [lianghsun/peptide-dpt](https://github.com/lianghsun/peptide-dpt)
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
- **Repository:** [lianghsun/peptidomimetic-pretrain](https://huggingface.co/datasets/lianghsun/peptidomimetic-pretrain/)
- **Downstream Project:** [lianghsun/peptide-dpt](https://github.com/lianghsun/peptide-dpt)
- **Original Sources:**
  - [ChEMBL GCPII (CHEMBL3231)](https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL3231/)
  - [RCSB PDB](https://www.rcsb.org/) (co-crystal ligands)
  - [BindingDB (UniProt Q04609)](https://www.bindingdb.org/)
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
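This dataset is primarily designed for:
* Pretraining molecular generation language models with CLM so that they learn the scaffold distribution of PSMA-related peptidomimetic compounds;
* Serving as the base corpus for the Phase 1 pretraining stage of peptide-dpt;
* Studying molecular language modeling and tokenization strategies for the SELFIES format;
* Language-model-based generative molecular design tasks in drug discovery.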
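
Since the records already carry `input_ids` and `labels`, a CLM training loop mainly needs to pad variable-length records into rectangular batches. The following is a minimal sketch in plain Python, not the peptide-dpt training code. `pad_batch` is an illustrative helper, `PAD_ID = 0` is an assumption (the card does not state a padding id), and `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_ID = 0           # assumption: the card does not document a padding id


def pad_batch(records, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Pad a list of {"input_ids", "labels"} records to a common length.

    Padded label positions are set to ignore_index so they do not
    contribute to the loss; attention_mask marks the real tokens.
    """
    max_len = max(len(r["input_ids"]) for r in records)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for r in records:
        n = len(r["input_ids"])
        gap = max_len - n
        batch["input_ids"].append(list(r["input_ids"]) + [pad_id] * gap)
        batch["labels"].append(list(r["labels"]) + [ignore_index] * gap)
        batch["attention_mask"].append([1] * n + [0] * gap)
    return batch


# Two toy records in the same layout as the dataset rows.
toy = [
    {"input_ids": [1, 159, 206], "labels": [159, 206, 2]},
    {"input_ids": [1, 159], "labels": [159, 2]},
]
batch = pad_batch(toy)
```

Because `labels` in this dataset are already offset by one position, the logits at position `i` would be compared with `labels[i]` directly. Model classes that shift labels internally, as the causal LM heads in Hugging Face Transformers do, would apply a second shift if given these `labels` unchanged.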
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
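This dataset is not suitable for:
* Predicting molecular activity (binding affinity) or toxicity. It contains molecular structures only, with no quantitative activity labels.
* Direct use in drug design for targets other than PSMA. The molecular distribution of the corpus is biased toward PSMA-related compounds.
* Serving as a basis for clinical use. Any generated molecule requires experimental validation and clinical trials.
* SMILES-based training pipelines. This dataset uses the SELFIES format.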
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```python
{
"selfies": "[C][N][S][=Branch1][C][=O][=Branch1][C][=O][C][=C][C][=C]...",
"input_ids": [1, 159, 206, 243, 90, 159, 105, ...],
"labels": [159, 206, 243, 90, 159, 105, 90, ..., 2]
}
```
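| Field | Type | Description |
|---|---|---|
| `selfies` | string | SELFIES string representation of the molecule |
| `input_ids` | list[int64] | Tokenized CLM input (starts with BOS) |
| `labels` | list[int64] | CLM next-token targets: `labels[i]` is the token that follows `input_ids[i]`, so the list is offset from `input_ids` by one position and ends with EOS |

| Split | Examples |
|---|---|
| train | 1,433,044 |
| validation | 29,245 |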
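
The relationship between the two token columns can be checked mechanically. The sketch below assumes that BOS has id 1 and EOS has id 2, as the sample record suggests, and that both lists have the same length. `is_consistent_clm_record` is an illustrative helper, not part of the dataset tooling.

```python
BOS_ID = 1  # assumption: read off the first input_ids entry in the sample record
EOS_ID = 2  # assumption: read off the last labels entry in the sample record


def is_consistent_clm_record(input_ids, labels, bos_id=BOS_ID, eos_id=EOS_ID):
    """Return True if labels are the next-token targets of input_ids.

    Expected layout for a token sequence t1..tn:
      input_ids = [BOS, t1, ..., tn]
      labels    = [t1, ..., tn, EOS]
    """
    if not input_ids or len(input_ids) != len(labels):
        return False
    return (
        input_ids[0] == bos_id
        and labels[-1] == eos_id
        and input_ids[1:] == labels[:-1]
    )


ok = is_consistent_clm_record([1, 159, 206, 243], [159, 206, 243, 2])
bad = is_consistent_clm_record([1, 159, 206], [1, 159, 206])
```

Running such a check over a sample of rows is a cheap way to confirm the column layout before training on it.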
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
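PSMA is an important target in prostate cancer and several neuroendocrine tumors, and the design of peptidomimetic drugs against it has long been limited by a shortage of usable training corpora. This dataset consolidates PSMA-related ligand molecules from public scientific databases such as ChEMBL, PDB, and BindingDB into a unified SELFIES corpus and pre-tokenizes it in CLM format, so that it can go straight into the pretraining pipeline of a molecular generation language model.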
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
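Data was collected from the following public scientific databases:
1. **ChEMBL GCPII (CHEMBL3231)**: active ligand molecules for GCPII / PSMA.
2. **RCSB PDB co-crystal ligands**: ligands found in PSMA co-crystal structures.
3. **BindingDB (UniProt Q04609)**: binding affinity data for PSMA.

All molecules were converted to SELFIES notation and encoded with a custom tokenizer to produce `input_ids` and `labels`. The `labels` are `input_ids` offset by one position, so that each label is the next token, which forms the CLM training target. The data was then split into train and validation sets and stored in parquet format.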
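
The encoding step described above can be illustrated as follows. This is a minimal sketch, not the custom tokenizer used to build the dataset. It splits a SELFIES string on its bracketed symbols with a regular expression, and `TOY_VOCAB` holds only the five ids that can be inferred by aligning the sample record's `selfies` string with its `input_ids`; the full vocabulary, like the BOS id 1 and EOS id 2, is an assumption.

```python
import re

# One SELFIES symbol is a bracketed token such as [C] or [=Branch1].
SYMBOL_RE = re.compile(r"\[[^\[\]]+\]")

# Assumption: ids inferred from the sample record only; not the real vocabulary.
TOY_VOCAB = {"[=Branch1]": 90, "[=O]": 105, "[C]": 159, "[N]": 206, "[S]": 243}
BOS_ID, EOS_ID = 1, 2  # assumption: taken from the sample record


def split_symbols(selfies_str):
    """Split a SELFIES string into its bracketed symbols."""
    symbols = SYMBOL_RE.findall(selfies_str)
    # Reject anything outside brackets (this sketch does not handle
    # '.'-separated fragments or malformed input).
    if "".join(symbols) != selfies_str:
        raise ValueError(f"not a plain bracketed SELFIES string: {selfies_str!r}")
    return symbols


def encode_clm(selfies_str, vocab=TOY_VOCAB, bos_id=BOS_ID, eos_id=EOS_ID):
    """Build one record with next-token labels from a SELFIES string."""
    ids = [vocab[sym] for sym in split_symbols(selfies_str)]
    full = [bos_id] + ids + [eos_id]
    return {
        "selfies": selfies_str,
        "input_ids": full[:-1],  # BOS + tokens
        "labels": full[1:],      # tokens + EOS
    }


record = encode_clm("[C][N][S][=Branch1][C][=O]")
```

For this prefix of the sample molecule, the resulting `input_ids` begin with the same ids as the sample record shown in the Dataset Structure section.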
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
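The original molecular data is maintained by scientific organizations including ChEMBL (EMBL-EBI), RCSB PDB, and BindingDB, and is contributed by the worldwide medicinal chemistry and structural biology research community.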
### Annotations
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
This dataset contains no additional annotations. `input_ids` and `labels` are the output of automatic tokenization.
#### Who are the annotators?
Not applicable.
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
This dataset contains molecular structure data only. It does not involve personal information, medical records, or sensitive data.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
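* The molecular distribution is biased toward the existing ligand space of the PSMA target, so a model may struggle to produce novel molecules that differ substantially in structure from known compounds.
* SELFIES guarantees that generated molecules are valid, but not that they are synthesizable or pharmacologically active.
* The tokenization depends on a specific tokenizer; a downstream pipeline that uses a different tokenizer must re-tokenize the `selfies` column.
* The corpus size is bounded by the number of PSMA ligands covered by public databases, on the order of one million records.
* The dataset has no quantitative labels for target affinity, ADMET, or toxicity, so it cannot be used on its own for activity prediction.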
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
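Users are advised to:
* Pretrain on this dataset, then fine-tune downstream with PSMA activity data (for example, reward-based or DPO-based molecular optimization);
* Screen generated candidate molecules with molecular docking, synthetic feasibility assessment, and experimental validation;
* Collect the corresponding ligand data and perform domain-adaptive training when applying the approach to other targets.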
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
```bibtex
@misc{peptidomimetic-pretrain,
title = {peptidomimetic-pretrain: A SELFIES Corpus for PSMA-targeted Drug Discovery},
author = {Liang Hsun Huang},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/lianghsun/peptidomimetic-pretrain}},
note = {1.43M training + 29K validation SELFIES molecules for Phase 1 pretraining of the peptide-dpt model.}
}
```
## Dataset Card Authors
[Liang Hsun Huang](https://www.linkedin.com/in/lianghsunhuang/?locale=en_US)
## Dataset Card Contact
[Liang Hsun Huang](https://www.linkedin.com/in/lianghsunhuang/?locale=en_US)