HakimT/peptideforge-dataset
Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/HakimT/peptideforge-dataset
Resource description:
---
pretty_name: PeptideForge
license: cc-by-4.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- text-classification
tags:
- biology
- peptides
- antimicrobial-peptides
- conditional-generation
- medical
- chemistry
configs:
- config_name: generator_text
  data_files:
  - split: train
    path: generator_text/train.csv
  - split: validation
    path: generator_text/validation.csv
  - split: test
    path: generator_text/test.csv
- config_name: generator_structured
  data_files:
  - split: train
    path: generator_structured/train.csv
  - split: validation
    path: generator_structured/validation.csv
  - split: test
    path: generator_structured/test.csv
- config_name: scorer
  data_files:
  - split: train
    path: scorer/train.csv
  - split: validation
    path: scorer/validation.csv
  - split: test
    path: scorer/test.csv
---
# PeptideForge Dataset
## Dataset Description
This dataset repo packages the processed training, validation, and test splits used by the
`PeptideForge` project for conditioned peptide generation and AMP scoring.
It exposes three Hub configs:
| config | purpose | splits |
| --- | --- | --- |
| generator_text | Conditioned text corpus exported as parsed CSV rows | train / validation / test |
| generator_structured | Structured generator table with features and conditioned prompts | train / validation / test |
| scorer | Structured AMP/non-AMP scorer dataset | train / validation / test |
## Dataset Summary
- Total rows across all published configs: `221424`
- Generator text corpus: prompt-style conditioned peptide records derived from the line-based
`train_core.csv` / `val_core.csv` / `test_core.csv` files
- Generator structured tables: per-sequence features plus `conditioned_text`
- Scorer tables: labeled AMP vs non-AMP examples with the features used by the AMP scorer
## Splits
| config | train | validation | test |
| --- | --- | --- | --- |
| generator_text | 23791 | 2974 | 2974 |
| generator_structured | 23791 | 2974 | 2974 |
| scorer | 129556 | 16195 | 16195 |
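As a quick sanity check, the split sizes in the table can be summed to confirm the total reported in the Dataset Summary. The snippet below is a minimal sketch that only redoes the arithmetic on the numbers listed above; the variable names are illustrative and not part of the PeptideForge codebase.

```python
# Split sizes copied from the Splits table above.
split_rows = {
    "generator_text": {"train": 23791, "validation": 2974, "test": 2974},
    "generator_structured": {"train": 23791, "validation": 2974, "test": 2974},
    "scorer": {"train": 129556, "validation": 16195, "test": 16195},
}

# Rows per config, then rows across all published configs.
config_totals = {name: sum(splits.values()) for name, splits in split_rows.items()}
total_rows = sum(config_totals.values())

# Share of each config that falls in the train split.
train_share = {
    name: splits["train"] / config_totals[name] for name, splits in split_rows.items()
}

print(config_totals)  # {'generator_text': 29739, 'generator_structured': 29739, 'scorer': 161946}
print(total_rows)     # 221424
```

The train share works out to roughly 80% for every config, with validation and test at about 10% each.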
## Feature Schemas
### `generator_text`
Columns: `conditioned_text`, `sequence`, `length_tag`, `charge_tag`, `hydro_tag`
Each example stores the raw conditioned line plus the parsed sequence and the three
conditioning tags.
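When working with a downloaded copy of the CSV files, it can help to confirm that the header matches the columns listed above before training. The following is a minimal sketch using only the standard library; `check_header` is a hypothetical helper, and the in-memory sample contains just a header line rather than real rows from the dataset.

```python
import csv
import io

# Column names as listed for the generator_text config.
GENERATOR_TEXT_COLUMNS = [
    "conditioned_text",
    "sequence",
    "length_tag",
    "charge_tag",
    "hydro_tag",
]


def check_header(csv_file, expected):
    """Return (missing, unexpected) column names for an open CSV file."""
    reader = csv.DictReader(csv_file)
    found = reader.fieldnames or []
    missing = [name for name in expected if name not in found]
    unexpected = [name for name in found if name not in expected]
    return missing, unexpected


# Header-only stand-in for generator_text/train.csv; no real rows are reproduced here.
sample = io.StringIO(",".join(GENERATOR_TEXT_COLUMNS) + "\n")
missing, unexpected = check_header(sample, GENERATOR_TEXT_COLUMNS)
print(missing, unexpected)  # [] []
```

For a real file, pass `open("generator_text/train.csv", newline="")` in place of the in-memory sample.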
### `generator_structured`
Columns: `id`, `sequence`, `label`, `length`, `unique_chars`, `is_standard`, `charge`, `hydrophobicity`, `frac_basic`, `frac_acidic`, `cysteine_count`, `length_bin`, `charge_bin`, `hydro_bin`, `condition_prefix`, `conditioned_text`
These rows are the structured generator tables used for analysis and for workflows that want
explicit feature columns in addition to the prompt-style conditioning text.
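To give a feel for what several of these columns might represent, the sketch below computes a subset of them from a raw sequence. The definitions are assumptions, not the PeptideForge implementation: basic residues are taken to be K, R, and H, acidic residues D and E, and "standard" means all 20 canonical amino acid letters. `charge`, `hydrophobicity`, and the `*_bin` columns are omitted because this card does not document their scales or bin edges.

```python
# Assumed definitions (not confirmed by the dataset card).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
BASIC_RESIDUES = set("KRH")
ACIDIC_RESIDUES = set("DE")


def simple_features(sequence):
    """Compute a few per-sequence features under the assumptions above."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    return {
        "length": n,
        "unique_chars": len(set(seq)),
        "is_standard": set(seq) <= STANDARD_AMINO_ACIDS,
        "frac_basic": sum(ch in BASIC_RESIDUES for ch in seq) / n,
        "frac_acidic": sum(ch in ACIDIC_RESIDUES for ch in seq) / n,
        "cysteine_count": seq.count("C"),
    }


# Toy sequence chosen for illustration only; it is not taken from the dataset.
features = simple_features("KKLLCDE")
print(features["length"], features["unique_chars"], features["cysteine_count"])  # 7 5 1
```

Values produced this way may differ from the published columns if the project uses other residue groupings, so treat the published CSV values as authoritative.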
### `scorer`
Columns: `id`, `sequence`, `label`, `length`, `unique_chars`, `is_standard`, `charge`, `hydrophobicity`, `frac_basic`, `frac_acidic`, `cysteine_count`, `length_bin`, `charge_bin`, `hydro_bin`, `condition_prefix`, `conditioned_text`
These rows contain AMP/non-AMP labels and the feature set used by `scorer/scorer.py`.
## Source Data Layout
The original code repository stores the processed files in:
- `data/generator/`
- `data/scorer/`
- `data/other/`
The original repo `data/` tree is mirrored under `source_data/` in this dataset repo so the exact CSV/text files used by the codebase stay inspectable.
## Loading The Data
```python
from datasets import load_dataset
generator_text = load_dataset("HakimT/peptideforge-dataset", "generator_text")
generator_structured = load_dataset("HakimT/peptideforge-dataset", "generator_structured")
scorer = load_dataset("HakimT/peptideforge-dataset", "scorer")
```
Load a single split directly:
```python
train_generator_text = load_dataset("HakimT/peptideforge-dataset", "generator_text", split="train")
```
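For workflows that prefer plain CSV files over the `datasets` API, the paths declared in the card metadata follow a `<config>/<split>.csv` pattern. The sketch below builds those paths and, optionally, fetches one file with `hf_hub_download` from `huggingface_hub`. `split_path` and `download_split` are hypothetical helper names introduced here for illustration.

```python
REPO_ID = "HakimT/peptideforge-dataset"
CONFIGS = ("generator_text", "generator_structured", "scorer")
SPLITS = ("train", "validation", "test")


def split_path(config, split):
    """Return the repo-relative CSV path for a config/split pair."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"{config}/{split}.csv"


def download_split(config, split):
    """Download one split CSV and return its local cache path (needs network access)."""
    # Imported lazily so split_path() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=split_path(config, split),
        repo_type="dataset",
    )


print(split_path("scorer", "test"))  # scorer/test.csv
```

The returned local path can then be opened with `csv` or `pandas.read_csv` as usual.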
## Data Provenance
The processed data in this project is derived from:
Peng, Shuang; Rajjou, Loïc, 2024, "Unifying Antimicrobial Peptide Datasets for Robust Deep
Learning-Based Classification", Recherche Data Gouv, V1,
https://doi.org/10.57745/NZ0IRX
This Hugging Face dataset repo contains processed and reformatted derivatives used by the
PeptideForge training, evaluation, and scoring pipelines.
## Preprocessing Notes
- Generator text examples encode peptide conditioning tags inline and are exported here as a
parsed CSV for easier loading on the Hub.
- Structured generator tables retain the same conditioning information in explicit feature
columns, including `condition_prefix` and `conditioned_text`.
- The scorer split preserves AMP labels and physicochemical features for classifier training
and offline evaluation.
## Intended Uses
- Reproducing the training and evaluation flows in the PeptideForge codebase
- Training conditioned peptide generators from prompt-style or tabular representations
- Training or benchmarking AMP scoring/classification pipelines
## Limitations
- This is a processed research dataset, not a clinical decision-making resource.
- The conditioning tags are coarse bins, not precise biophysical targets.
- Generated peptides still require downstream validation.
## Licensing
Unless otherwise noted, the processed data files distributed from the PeptideForge project are
intended to be shared under `CC BY 4.0`, consistent with the repository's top-level license
notice.
## Citation
If you use this dataset, cite both the upstream AMP dataset source and the PeptideForge
repository that produced these processed splits.
Provider:
HakimT



