timodonnell/protein-docs

Name: timodonnell/protein-docs
Creator: timodonnell
Published: 2026-04-02 19:09:40
License: No description available

Hugging Face · Updated 2026-04-02 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/timodonnell/protein-docs


Resource description:

---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- protein-structure
- alphafold
- contact-map
- structural-biology
- protein-language-model
size_categories:
- 10M<n<100M
---

# Protein Documents (Parquet)

Structured text documents encoding protein residue sequences and 3D contact maps from [AlphaFold Database](https://alphafold.ebi.ac.uk/) v4 predicted structures, stored as Parquet files. Each row is one protein document with metadata.

Source structures: [timodonnell/afdb-24M](https://huggingface.co/datasets/timodonnell/afdb-24M) and [timodonnell/afdb-1.6M](https://huggingface.co/datasets/timodonnell/afdb-1.6M)

## Document Schemes

Each subdirectory contains documents generated with a different scheme. All schemes share leakage-resistant train/val/test splits based on structural cluster hashing (Foldseek AFDB v4, 98/1/1 split).

| Scheme | Source | Documents | Description |
|--------|--------|-----------|-------------|
| [`deterministic-positives-only`](deterministic-positives-only/) | afdb-24M | ~24M | Baseline: residue sequence + closest heavy-atom contact per residue pair within 4.0 Å, sorted by sequence separation |
| [`random-3-bins`](random-3-bins/) | afdb-1.6M | ~1.68M | Distance-binned 6-token contacts with false contact injection, corrections, long-range upsampling, and pLDDT bin tokens (1 entry per structural cluster) |
| [`random-3-bins-5x`](random-3-bins-5x/) | afdb-24M | ~5.39M | Same scheme as random-3-bins but with up to 5 entries per structural cluster. Documents are ordered in rounds: round 0 has one entry per cluster, round 1 has a second entry per cluster (where available), etc. Shuffled within each round. |
| [`contacts-and-distances-v1-5x`](contacts-and-distances-v1-5x/) | afdb-24M | ~5.39M | Two statement types: contact statements (CB-CB ≤ 8Å, categorized by sequence separation) and distance statements (0.5Å resolution, 64 bins, randomly sampled atom pairs). Contacts rank-ordered to appear earlier. Up to 5 entries per structural cluster, round-ordered. |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `document` | `string` | Full document text |
| `entry_id` | `string` | AFDB entry ID (e.g. `AF-A0A1C0V126-F1`) |
| `uniprot_accession` | `string` | UniProt accession |
| `tax_id` | `int64` | NCBI taxonomy ID |
| `organism_name` | `string` | Scientific name |
| `global_plddt` | `float32` | Global mean pLDDT confidence score |
| `seq_len` | `int32` | Sequence length in residues |
| `contacts_pre_filter` | `int32` | Contacts found before pLDDT filter |
| `contacts_emitted` | `int32` | Contacts in final document |
| `residues_passing_plddt` | `int32` | Residues above pLDDT threshold |
| `split` | `string` | `train`, `val`, or `test` |
| `seq_cluster_id` | `string` | AFDB50 sequence cluster representative |
| `struct_cluster_id` | `string` | Structural cluster representative |
| `split_cluster_id` | `string` | Cluster used for split assignment |
| `sha1` | `string` | SHA1 hash of document text |

## File Structure

```
deterministic-positives-only/
  train/
    shard_000000.parquet
    ...
  val/
    shard_000000.parquet
    ...
  test/
    shard_000000.parquet
    ...
random-3-bins/
  train/
    shard_000000.parquet
    ...
  val/
    shard_000000.parquet
    ...
  test/
    shard_000000.parquet
    ...
random-3-bins-5x/
  train/
    shard_000000.parquet
    ...  (round-ordered: round 0 shards first, then round 1, etc.)
  val/
    shard_000000.parquet
    ...
  test/
    shard_000000.parquet
    ...
contacts-and-distances-v1-5x/
  train/
    shard_000000.parquet
    ...  (round-ordered)
  val/
    shard_000000.parquet
    ...
  test/
    shard_000000.parquet
    ...
```

## Example Documents

### deterministic-positives-only

```
<deterministic-positives-only>
<begin_sequence>
<MET> <LYS> <PHE> <CYS> <ASP> <TYR> <GLY> <LEU>
<begin_contacts>
<p1> <p8> <SD> <CD1>
<p1> <p7> <CG> <CA>
<p2> <p8> <NZ> <O>
<p1> <p6> <CE> <OH>
<end_contacts>
<end>
```

Each contact is a 4-tuple: `<p_i> <p_j> <atom_i> <atom_j>`. Contacts sorted by decreasing sequence separation.

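The 4-tuple layout above can be read back with a few lines of Python. This is a minimal sketch rather than the dataset's own tooling: the function name `parse_deterministic_document` is hypothetical, and it assumes that tokens are whitespace-separated and that `<pN>` is a 1-based index into the residue sequence, which the example suggests but the text does not state.

```python
def parse_deterministic_document(text):
    """Split a deterministic-positives-only document into residues and contacts.

    Assumption (not confirmed by the dataset card): tokens are separated by
    whitespace and <pN> positions are 1-based.
    """
    tokens = text.split()
    seq_start = tokens.index("<begin_sequence>") + 1
    contacts_start = tokens.index("<begin_contacts>")
    contacts_end = tokens.index("<end_contacts>")

    # Residue tokens such as <MET> become plain three-letter codes.
    sequence = [t.strip("<>") for t in tokens[seq_start:contacts_start]]

    body = tokens[contacts_start + 1:contacts_end]
    if len(body) % 4 != 0:
        raise ValueError("contact section is not a whole number of 4-tuples")

    contacts = []
    for k in range(0, len(body), 4):
        p_i, p_j, atom_i, atom_j = (t.strip("<>") for t in body[k:k + 4])
        # "p8" -> 8
        contacts.append((int(p_i[1:]), int(p_j[1:]), atom_i, atom_j))
    return sequence, contacts


EXAMPLE = """
<deterministic-positives-only> <begin_sequence>
<MET> <LYS> <PHE> <CYS> <ASP> <TYR> <GLY> <LEU>
<begin_contacts>
<p1> <p8> <SD> <CD1> <p1> <p7> <CG> <CA>
<p2> <p8> <NZ> <O> <p1> <p6> <CE> <OH>
<end_contacts> <end>
"""

sequence, contacts = parse_deterministic_document(EXAMPLE)
separations = [j - i for i, j, _, _ in contacts]
print(sequence)
print(contacts)
print(separations)
```

For the example document, the separations come out in non-increasing order, which matches the stated sort.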
### random-3-bins

```
<random-3-bins>
<begin_sequence>
<MET> <LYS> <PHE> <CYS> <ASP> <TYR> <GLY> <LEU>
<begin_contacts>
<non-correction> <p1> <p5> <SD> <CD1> <bin_lt4>
<non-correction> <p3> <p7> <CA> <CB> <bin_4_12>
<non-correction> <p2> <p6> <NZ> <OH> <bin_gt12>
<non-correction> <p4> <p8> <CB> <O> <bin_lt4>
<correction> <p3> <p7> <CG> <CB> <bin_lt4>
<plddt_80_85>
<non-correction> <p1> <p6> <CE> <OH> <bin_lt4>
<end_contacts>
<end>
```

Each contact is a 6-token group: `<correction|non-correction> <p_i> <p_j> <atom_i> <atom_j> <distance_bin>`. Contacts are in random order. `<correction>` marks updates to previously stated contacts. Distance bins: `<bin_lt4>` (< 4 Å), `<bin_4_12>` (4–12 Å), `<bin_gt12>` (> 12 Å). A pLDDT bin token appears once per document (50% at end, 50% random position). See the [full specification](https://github.com/timodonnell/contactdoc/blob/main/docs/random-3-bins-scheme.md).

### contacts-and-distances-v1

```
<contacts-and-distances-v1>
<begin_sequence>
<MET> <LYS> <PHE> <CYS> <ASP> <TYR> <GLY> <LEU>
<begin_statements>
<long-range-contact> <p1> <p50>
<medium-range-contact> <p3> <p20>
<distance> <p10> <p45> <CA> <CB> <d4.5>
<short-range-contact> <p5> <p12>
<distance> <p2> <p80> <NZ> <O> <d15.0>
<plddt_80_85>
<end>
```

Two statement types: contact statements (3 tokens: `<mode> <p_i> <p_j>`) and distance statements (6 tokens: `<distance> <p_i> <p_j> <atom_i> <atom_j> <d_value>`). Contact modes: `<long-range-contact>` (sep ≥ 24), `<medium-range-contact>` (sep 12–24), `<short-range-contact>` (sep 6–12), defined by CB-CB distance ≤ 8 Å. Distance bins at 0.5 Å resolution from `<d0.5>` to `<d32.0>` (64 bins). Contacts are rank-ordered to appear earlier in the document. All statements are correct (no false contacts). See [prompts/contacts-and-distances-v1.txt](https://github.com/timodonnell/contactdoc/blob/main/prompts/contacts-and-distances-v1.txt).

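Because the two statement types have different lengths (3 and 6 tokens), a reader has to dispatch on the leading token. The sketch below shows one way to do that. It is illustrative only: `parse_statements` is a hypothetical name, whitespace tokenization is assumed, and the `<d...>` label is kept as a plain float because the card does not say whether it denotes a bin edge or a bin center.

```python
CONTACT_MODES = {
    "<long-range-contact>",
    "<medium-range-contact>",
    "<short-range-contact>",
}


def _position(token):
    """'<p45>' -> 45"""
    return int(token.strip("<>")[1:])


def parse_statements(text):
    """Parse the statement section of a contacts-and-distances-v1 document.

    Assumption: statements sit between <begin_statements> and <end>, and a
    single-token <plddt_...> marker may appear among them (as in the example).
    """
    tokens = text.split()
    body = tokens[tokens.index("<begin_statements>") + 1:tokens.index("<end>")]

    contacts, distances, plddt_bin = [], [], None
    k = 0
    while k < len(body):
        tok = body[k]
        if tok in CONTACT_MODES:
            # 3-token contact statement: <mode> <p_i> <p_j>
            p_i, p_j = body[k + 1:k + 3]
            contacts.append((tok.strip("<>"), _position(p_i), _position(p_j)))
            k += 3
        elif tok == "<distance>":
            # 6-token distance statement
            p_i, p_j, atom_i, atom_j, d_tok = body[k + 1:k + 6]
            d_label = float(d_tok.strip("<>")[1:])  # '<d4.5>' -> 4.5
            distances.append((_position(p_i), _position(p_j),
                              atom_i.strip("<>"), atom_j.strip("<>"), d_label))
            k += 6
        elif tok.startswith("<plddt_"):
            plddt_bin = tok.strip("<>")
            k += 1
        else:
            raise ValueError(f"unexpected token {tok!r} at offset {k}")
    return contacts, distances, plddt_bin


EXAMPLE = """
<contacts-and-distances-v1> <begin_sequence>
<MET> <LYS> <PHE> <CYS> <ASP> <TYR> <GLY> <LEU>
<begin_statements>
<long-range-contact> <p1> <p50> <medium-range-contact> <p3> <p20>
<distance> <p10> <p45> <CA> <CB> <d4.5> <short-range-contact> <p5> <p12>
<distance> <p2> <p80> <NZ> <O> <d15.0> <plddt_80_85> <end>
"""

contacts, distances, plddt_bin = parse_statements(EXAMPLE)
print(contacts)
print(distances)
print(plddt_bin)
```

The parser does not check positions against the sequence length, since the example above is schematic (it lists 8 residues but references positions up to 80).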
## Common Generation Parameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| Heavy atoms only | yes | Hydrogens excluded |
| Adjacent residue exclusion | yes | No contacts between residues i, i±1 |
| Global pLDDT filter | ≥ 70.0 | Entry-level confidence threshold |
| Per-residue pLDDT filter | ≥ 70.0 | Both residues in a contact must pass |
| Max sequence length | 2048 | Residues |
| Fragment filter | skip | Only full-length UniProt models |
| Non-canonical residues | map to `<UNK>` | |

## Splits

Split assignment uses **structural cluster representatives** as hash keys (SHA1-based), so all proteins sharing a 3D fold land in the same split.

| Split | Fraction |
|-------|----------|
| train | 98% |
| val | 1% |
| test | 1% |

## Usage

```python
import pyarrow.parquet as pq

# Read one shard into an Arrow table (one row per protein document).
table = pq.read_table("deterministic-positives-only/train/shard_000000.parquet")
print(f"{len(table)} documents")
# Select the column first, then the row: table[0] would return a column, not a row.
print(table["document"][0].as_py()[:200])
```

Or with HuggingFace datasets:

```python
from datasets import load_dataset

ds = load_dataset("timodonnell/protein-docs", data_dir="deterministic-positives-only")
print(ds["train"][0]["document"][:200])
```

## Data Source and License

Derived from [AlphaFold Database v4](https://alphafold.ebi.ac.uk/) (DeepMind / EMBL-EBI) under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). Cluster assignments from [Steinegger lab AFDB clusters](https://afdb-cluster.steineggerlab.workers.dev/) (Version 3).
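
To make the hash-based split assignment from the Splits section concrete, here is a rough sketch of how a SHA1 digest of a cluster representative could be mapped to a 98/1/1 split. The bucketing rule (digest modulo 100), the function name `assign_split`, and the cluster IDs are all assumptions for illustration; the card only states that the assignment is SHA1-based and keyed on structural cluster representatives, so the actual rule may differ.

```python
import hashlib
from collections import Counter


def assign_split(cluster_id: str) -> str:
    """Map a cluster representative ID to train/val/test deterministically.

    Hypothetical rule: interpret the SHA1 digest as an integer and take it
    modulo 100, giving 98 buckets for train and one each for val and test.
    """
    digest = hashlib.sha1(cluster_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 98:
        return "train"
    if bucket == 98:
        return "val"
    return "test"


# Every protein in a cluster shares the cluster's ID, so it shares the split.
# The IDs below are made up for the demonstration.
cluster_ids = [f"cluster_{n}" for n in range(10_000)]
counts = Counter(assign_split(cid) for cid in cluster_ids)
train_fraction = counts["train"] / len(cluster_ids)
print(counts)
print(f"train fraction: {train_fraction:.3f}")
```

Since the split depends only on the cluster ID, adding or removing individual proteins never moves a cluster between splits, which is the property that keeps structurally similar proteins from leaking across train, val, and test.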

Provider:

timodonnell
