HakimT/peptideforge-dataset

Name: HakimT/peptideforge-dataset
Creator: HakimT
Published: 2026-03-31 21:28:06
License: No description available

Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/HakimT/peptideforge-dataset


Resource overview:

---
pretty_name: PeptideForge
license: cc-by-4.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- text-classification
tags:
- biology
- peptides
- antimicrobial-peptides
- conditional-generation
- medical
- chemistry
configs:
- config_name: generator_text
  data_files:
  - split: train
    path: generator_text/train.csv
  - split: validation
    path: generator_text/validation.csv
  - split: test
    path: generator_text/test.csv
- config_name: generator_structured
  data_files:
  - split: train
    path: generator_structured/train.csv
  - split: validation
    path: generator_structured/validation.csv
  - split: test
    path: generator_structured/test.csv
- config_name: scorer
  data_files:
  - split: train
    path: scorer/train.csv
  - split: validation
    path: scorer/validation.csv
  - split: test
    path: scorer/test.csv
---

# PeptideForge Dataset

## Dataset Description

This dataset repo packages the processed training, validation, and test splits used by the `PeptideForge` project for conditioned peptide generation and AMP scoring.
It exposes three Hub configs:

| config | purpose | splits |
| --- | --- | --- |
| generator_text | Conditioned text corpus exported as parsed CSV rows | train / validation / test |
| generator_structured | Structured generator table with features and conditioned prompts | train / validation / test |
| scorer | Structured AMP/non-AMP scorer dataset | train / validation / test |

## Dataset Summary

- Total rows across all published configs: `221424`
- Generator text corpus: prompt-style conditioned peptide records derived from the line-based `train_core.csv` / `val_core.csv` / `test_core.csv` files
- Generator structured tables: per-sequence features plus `conditioned_text`
- Scorer tables: labeled AMP vs non-AMP examples with the features used by the AMP scorer

## Splits

| config | train | validation | test |
| --- | --- | --- | --- |
| generator_text | 23791 | 2974 | 2974 |
| generator_structured | 23791 | 2974 | 2974 |
| scorer | 129556 | 16195 | 16195 |

## Feature Schemas

### `generator_text`

Columns: `conditioned_text`, `sequence`, `length_tag`, `charge_tag`, `hydro_tag`

Each example stores the raw conditioned line plus the parsed sequence and the three conditioning tags.

### `generator_structured`

Columns: `id`, `sequence`, `label`, `length`, `unique_chars`, `is_standard`, `charge`, `hydrophobicity`, `frac_basic`, `frac_acidic`, `cysteine_count`, `length_bin`, `charge_bin`, `hydro_bin`, `condition_prefix`, `conditioned_text`

These rows are the structured generator tables used for analysis and for workflows that want explicit feature columns in addition to the prompt-style conditioning text.

### `scorer`

Columns: `id`, `sequence`, `label`, `length`, `unique_chars`, `is_standard`, `charge`, `hydrophobicity`, `frac_basic`, `frac_acidic`, `cysteine_count`, `length_bin`, `charge_bin`, `hydro_bin`, `condition_prefix`, `conditioned_text`

These rows contain AMP/non-AMP labels and the feature set used by `scorer/scorer.py`.

## Source Data Layout

The original code repository stores the processed files in:

- `data/generator/`
- `data/scorer/`
- `data/other/`

The original repo `data/` tree is mirrored under `source_data/` in this dataset repo so the exact CSV/text files used by the codebase stay inspectable.

## Loading The Data

```python
from datasets import load_dataset

generator_text = load_dataset("HakimT/peptideforge-dataset", "generator_text")
generator_structured = load_dataset("HakimT/peptideforge-dataset", "generator_structured")
scorer = load_dataset("HakimT/peptideforge-dataset", "scorer")
```

Load a single split directly:

```python
train_generator_text = load_dataset("HakimT/peptideforge-dataset", "generator_text", split="train")
```

## Data Provenance

The processed data in this project is derived from:

Peng, Shuang; Rajjou, Loïc, 2024, "Unifying Antimicrobial Peptide Datasets for Robust Deep Learning-Based Classification", Recherche Data Gouv, V1, https://doi.org/10.57745/NZ0IRX

This Hugging Face dataset repo contains processed and reformatted derivatives used by the PeptideForge training, evaluation, and scoring pipelines.

## Preprocessing Notes

- Generator text examples encode peptide conditioning tags inline and are exported here as a parsed CSV for easier loading on the Hub.
- Structured generator tables retain the same conditioning information in explicit feature columns, including `condition_prefix` and `conditioned_text`.
- The scorer config preserves AMP labels and physicochemical features for classifier training and offline evaluation.

## Intended Uses

- Reproducing the training and evaluation flows in the PeptideForge codebase
- Training conditioned peptide generators from prompt-style or tabular representations
- Training or benchmarking AMP scoring/classification pipelines

## Limitations

- This is a processed research dataset, not a clinical decision-making resource.
- The conditioning tags are coarse bins, not precise biophysical targets.
- Generated peptides still require downstream validation.

## Licensing

Unless otherwise noted, the processed data files distributed from the PeptideForge project are intended to be shared under `CC BY 4.0`, consistent with the repository's top-level license notice.

## Citation

If you use this dataset, cite both the upstream AMP dataset source and the PeptideForge repository that produced these processed splits.
