opendatalab/SA-Prot-annot

Name: opendatalab/SA-Prot-annot
Creator: opendatalab
Published: 2026-04-02 12:03:10
License: no description provided

Hugging Face · Updated 2026-04-02 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/opendatalab/SA-Prot-annot

Resource description:

---
license: cc-by-4.0
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- biology
- protein
- bioinformatics
- uniprot
- protein-annotation
size_categories:
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
---

# SA-Prot-Annot Dataset (Sci-Align)

## 🌌 The Sciverse Data Foundation

[**Sciverse**](https://Sciverse.opendatalab.com/) is a comprehensive, multi-layered scientific data foundation designed to provide data infrastructure for the AI for Science (AI4S) community. As scientific research becomes increasingly data-driven, Sciverse supplies the high-quality data resources required to build robust scientific knowledge systems and accelerate research.

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/643e60d96db6ba8c5ee177ad/ugVRh4ckRm4a-fsc5k7n1.png" alt="Sciverse" width="700">
</p>

Sciverse consists of three core data pillars:

* **Sci-Base (Scientific Knowledge Base Data):** The massive-scale, purely objective scientific knowledge base. Comprising over 25 million deeply cleaned and parsed Open Access documents, it provides the factual scientific corpus that serves as the foundation for all downstream scientific applications.
* **Sci-Align (Scientific Multi-Alignment Data):** A highly curated, structured dataset mapping direct scientific relationships and precise factual alignments. It focuses on well-defined entity interactions, such as specific chemical reaction pathways (e.g., via SMILES strings), condition-to-result pairings, and standardized structural descriptions. This layer provides the structured factual alignment needed for models to accurately connect and ground foundational scientific concepts.
* **Sci-Evo (Scientific Evolution Data):** A multi-layered, high-density reasoning dataset designed for complex problem-solving and deep scientific evaluation. Going beyond basic facts, this layer captures deep, causal descriptions: not just the 'what', but the underlying reasoning for specific experimental designs, multi-step mathematical derivations, and the logic of how modifying specific conditions alters outcomes. It is constructed to rigorously measure a model's advanced scientific reasoning accuracy and logical depth.

---

## SA-Prot-Annot Dataset Overview (Sci-Align)

SA-Prot-Annot releases annotations for a UniProtKB-scale slice: about 1.2 million proteins spanning manually reviewed Swiss-Prot and computationally analyzed TrEMBL, in a single Parquet file at the repository root (`seqstudio_uniprot_1.2m.parquet`).

## Annotation content

SA-Prot-Annot is the protein function annotation data produced by SeqStudio, a generative protein functional annotation system. It is designed to approximate the integrative judgment of expert UniProt curators: orchestrating heterogeneous evidence, weighing reliability and specificity, reconciling cross-modal conflicts, and synthesizing mechanistic explanations, rather than treating annotation as a simple union of pattern-matching hits.

Evidence includes, in line with the manuscript: sequence homology (BLAST against reviewed UniProt), domain and motif architecture (InterProScan, together with rule-based context such as UniRule where used in the pipeline), three-dimensional fold similarity (Foldseek), and membrane topology (TMHMM). Evidence items are semantically enriched (e.g. GO definitions, domain descriptions) before large language model-based generative reasoning, so outputs are grounded in retrieved signals rather than unconstrained parametric guessing.

The pipeline produces a natural-language-style functional summary and structured predictions with per-field confidence (0–1) and explicit evidence provenance (`support`: motifs / GO terms / which tools contributed). Exact LLM and single- vs multi-turn configuration can differ between high-precision and high-throughput deployment modes; the on-disk schema is shared.

### What is stored in this dataset

| Column | Role |
|--------|------|
| `seqStudioSummary` | Functional summary: integrated narrative of molecular mechanism, biological role, localization, and major structural features. |
| `seqStudioComments` | Machine-readable JSON: `version`, `generatedAt`, and `predictions` over six functional dimensions (see below). Each dimension is typically an object with `value`, `confidence`, and `support` linking the claim to concrete evidence. |

Parse with `json.loads` and read `obj["predictions"]`.

Six prediction dimensions (manuscript / evaluation schema; JSON keys in current exports):

| Dimension | Typical JSON key | Notes |
|-----------|------------------|--------|
| Protein family | `proteinFamily` | Family or superfamily assignment. |
| Function | `function` | Molecular and biological role (text). *Some older records use `primaryFunction`.* |
| Enzyme information | `enzymeInfo` | Enzyme flag, EC, catalytic description (often nested JSON). *Legacy alias: `catalyticActivity`.* |
| Pathways | `pathways` | Pathway involvement (list or text). |
| Subcellular location | `subcellularLocation` | Predicted localization (topology-informed when TMHMM is used). |
| Structural class / architecture | `proteinStructure` | Domains, fold class, membrane protein flag, TM helix count, etc. *Legacy alias: `structuralClass`.* |

Top-level fields `version` and `generatedAt` record the pipeline build and generation time for traceability.

### Relation to `toolResult`

`toolResult` preserves raw outputs from the integrated bioinformatics tools (e.g. BLAST, InterProScan, Foldseek, TMHMM). SeqStudio consumes these as grounding; `seqStudioComments` holds evidence-conditioned structured predictions, auditable through `support` fields and side-by-side comparison with `toolResult`.
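Because some records use legacy keys, it can help to normalize them when reading `predictions`. The following is a minimal sketch, not part of the dataset's tooling: `parse_predictions` and `LEGACY_ALIASES` are hypothetical names, the alias pairs are taken from the dimension table above, and the example record (including the empty `support` objects) is invented for illustration.

```python
import json

# Legacy key -> current key, as listed in the dimension table above.
LEGACY_ALIASES = {
    "primaryFunction": "function",
    "catalyticActivity": "enzymeInfo",
    "structuralClass": "proteinStructure",
}


def parse_predictions(raw):
    """Parse a seqStudioComments value and return its predictions
    with legacy keys renamed to the current key names."""
    obj = json.loads(raw) if isinstance(raw, str) else raw
    predictions = obj.get("predictions") or {}
    normalized = {}
    for key, entry in predictions.items():
        canonical = LEGACY_ALIASES.get(key, key)
        # If a record carries both a current key and its legacy alias,
        # keep the entry stored under the current key.
        if key != canonical and canonical in normalized:
            continue
        normalized[canonical] = entry
    return normalized


# Invented record for illustration only; real values will differ.
example = json.dumps({
    "version": "example-version",
    "generatedAt": "2025-01-01T00:00:00Z",
    "predictions": {
        "proteinFamily": {"value": "Example family", "confidence": 0.9, "support": {}},
        "primaryFunction": {"value": "Example function", "confidence": 0.7, "support": {}},
    },
})

preds = parse_predictions(example)
print(sorted(preds))
```

With a loaded DataFrame, the same function could be applied per row, for instance `df["seqStudioComments"].map(parse_predictions)`, assuming the column holds JSON strings.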
## Data file

| File | Records (approx.) | Size (approx.) | Description |
|------|-------------------|----------------|-------------|
| `seqstudio_uniprot_1.2m.parquet` | 1,200,000 | 5.5 GB | UniProtKB mix: Swiss-Prot + TrEMBL; original UniProt fields, SeqStudio outputs, and `toolResult` |

Composition (same split as the main SeqStudio dataset card):

- Swiss-Prot: 573,661 (about 47.8%), manually reviewed UniProtKB entries
- TrEMBL: 626,339 (about 52.2%), computationally analyzed entries

Use column `data_source` to distinguish provenance labels such as `swiss`, `trembl5`, and `trembl4`.

## Quick start

```python
import pandas as pd

# Reading hf:// paths with pandas requires the huggingface_hub package.
# The file is about 5.5 GB, so the first load may take a while.
path = "hf://datasets/opendatalab/SA-Prot-annot/seqstudio_uniprot_1.2m.parquet"
df = pd.read_parquet(path)
print(len(df), df.columns.tolist()[:5])
```

Using `datasets`:

```python
from datasets import load_dataset

# A single data file is exposed as the default "train" split.
ds = load_dataset(
    "opendatalab/SA-Prot-annot",
    data_files="seqstudio_uniprot_1.2m.parquet",
)
print(ds["train"])
```

## Content summary

- Coverage: about 1.2M UniProtKB proteins (Swiss-Prot + TrEMBL), with `data_source` marking origin.
- Format: Parquet with 23 columns combining UniProt-style fields, SeqStudio prediction payloads, and bioinformatics tool results.
- Highlights: see "Annotation content" above for `seqStudioComments` / `seqStudioSummary`; `toolResult` aggregates supporting tool outputs.

## Column reference (23 columns)

1. `entryType`: entry type
2. `primaryAccession`: UniProt primary accession
3. `uniProtkbId`: UniProtKB ID
4. `entryAudit`: audit metadata (JSON string)
5. `annotationScore`: annotation score
6. `organism`: organism (JSON)
7. `proteinExistence`: protein existence evidence
8. `proteinDescription`: description (JSON)
9. `genes`: genes (JSON)
10. `comments`: comments (JSON)
11. `features`: features (JSON)
12. `keywords`: keywords (JSON)
13. `references`: references (JSON)
14. `uniProtKBCrossReferences`: cross-references (JSON)
15. `sequence`: sequence (JSON)
16. `extraAttributes`: extra attributes (JSON)
17. `seqStudioComments`: SeqStudio structured predictions (JSON: `predictions` with six dimensions, see "Annotation content"; keys may be `function` / `enzymeInfo` / `proteinStructure` or legacy `primaryFunction` / `catalyticActivity` / `structuralClass`)
18. `seqStudioSummary`: integrated functional summary (text or JSON string, depending on export)
19. `toolResult`: tool outputs, e.g. InterProScan, BLAST (JSON)
20. `data_source`: provenance label (`swiss` / `trembl5` / `trembl4`, etc.)
21. `secondaryAccessions`: secondary accessions (JSON)
22. `organismHosts`: organism hosts (JSON)
23. `geneLocations`: gene locations (JSON)

Example: `import json`, then `json.loads(row["seqStudioComments"])` and read `["predictions"]`.

## Citation

Please cite this dataset, UniProt, and the SeqStudio paper (Liu et al., *Generative reasoning emulating expert curation moves protein functional annotation beyond pattern matching at scale*) as appropriate once the reference is available. Example for the Hub release:

```bibtex
@dataset{saprotannot2025,
  title={SA-Prot-annot: SeqStudio Annotations for UniProt 1.2M (Swiss-Prot + TrEMBL)},
  author={OpenDataLab},
  year={2025},
  url={https://huggingface.co/datasets/opendatalab/SA-Prot-annot}
}
```

## License

This dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license: you may share and adapt the material, provided you give appropriate credit, indicate if changes were made, and do not add legal terms that restrict others from doing anything the license permits.

The underlying protein records and many raw fields originate from [UniProt](https://www.uniprot.org/); use of this dataset should remain consistent with UniProt's own terms and citation expectations in addition to CC BY 4.0.

Provider:

opendatalab

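Collected summary

Dataset introduction

Construction method

In protein function annotation, the SA-Prot-annot dataset is built by integrating bioinformatics evidence from multiple sources. The pipeline first selects about 1.2 million protein sequences from UniProtKB, covering manually reviewed Swiss-Prot entries and computationally analyzed TrEMBL entries. The system then combines multimodal evidence (sequence homology search, domain architecture analysis, three-dimensional fold similarity detection, and membrane topology prediction) and applies large language model generative reasoning to produce annotations that pair a natural-language functional summary with structured predictions. The construction process emphasizes semantic enrichment and traceability of evidence, so that annotation conclusions rest on retrieved signals rather than unconstrained parametric guessing.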

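Usage

Researchers can load the Parquet file in a Python environment with pandas or the Hugging Face datasets library. The core annotations are stored in the `seqStudioComments` column; parse it as JSON and read the `predictions` object to obtain the structured predictions. The `data_source` field distinguishes the origin of each protein entry, and the raw tool outputs in the `toolResult` column support deeper analysis. The dataset suits bioinformatics applications such as training protein function prediction models, evaluating annotation systems, and building biological knowledge graphs.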
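As a rough illustration of using `data_source` to separate entry origins, the sketch below tallies labels into Swiss-Prot and TrEMBL groups. `summarize_sources` is a hypothetical helper, and grouping every label that starts with `trembl` under TrEMBL is an assumption based on the labels named in the dataset card (`swiss`, `trembl5`, `trembl4`); any other label is counted as "other".

```python
from collections import Counter


def summarize_sources(labels):
    """Count entries per origin from an iterable of data_source labels."""
    counts = Counter()
    for label in labels:
        text = str(label).lower()
        if text == "swiss":
            counts["Swiss-Prot"] += 1
        elif text.startswith("trembl"):
            # Assumption: trembl4, trembl5, etc. all denote TrEMBL entries.
            counts["TrEMBL"] += 1
        else:
            counts["other"] += 1
    return dict(counts)


# Made-up labels for illustration; with a loaded DataFrame one could
# pass df["data_source"] instead.
sample = ["swiss", "trembl5", "trembl4", "swiss", "unknown"]
summary = summarize_sources(sample)
print(summary)
```

The helper accepts any iterable of labels, so it works equally on a plain list or a pandas column.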

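Background and challenges

Background overview

Protein function annotation is a core task in bioinformatics: it aims to reveal the biological function, molecular mechanism, and biological processes associated with a protein sequence. With the rapid development of high-throughput sequencing, protein sequence data has grown explosively, and traditional expert manual annotation can no longer keep up with the volume. SA-Prot-annot was built in this context as part of the Sciverse data foundation, within its Sci-Align (Scientific Multi-Alignment Data) pillar; its citation entry is dated 2025. The dataset integrates about 1.2 million protein records from UniProtKB, covering manually reviewed Swiss-Prot and computationally analyzed TrEMBL data. It relies on SeqStudio, a generative protein functional annotation system, to emulate how experts integrate heterogeneous evidence, weigh reliability, and synthesize mechanistic explanations. The goal is to give the AI for Science community a high-quality, structured protein function annotation resource and to move protein function prediction toward deeper reasoning and explanation.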

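Current challenges

SA-Prot-annot targets the core problem of automatic protein function annotation. The first challenge is to go beyond traditional pattern matching and reach integrative functional inference close to expert level. This requires coordinating heterogeneous evidence from sequence homology, domain architecture, three-dimensional fold similarity, and membrane topology, handling conflicts and uncertainty among those sources, and producing functional descriptions that are accurate, specific, and interpretable. Building the dataset poses its own challenges. First, a multi-step pipeline must semantically enrich the raw outputs of bioinformatics tools so that they serve as a reliable basis for large language model generative reasoning and avoid unconstrained parametric guessing. Second, the structured predictions and natural-language summaries must reflect the evidence faithfully while sharing a unified schema and remaining traceable. Finally, the scale of the data (about 1.2 million records and 5.5 GB of storage) places heavy demands on processing, storage, and distribution.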

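Common scenarios

Classic use cases

In protein function annotation, SA-Prot-annot offers researchers a large-scale, high-quality benchmark resource. The dataset integrates about 1.2 million protein records covering manually reviewed Swiss-Prot and computationally analyzed TrEMBL entries. Its core value lies in using the generative AI system SeqStudio to emulate the integrative judgment of expert curators. Classic use cases include training and evaluating protein function prediction models, especially machine learning methods that fuse multiple evidence sources such as homologous sequences, domain architecture, and three-dimensional fold similarity. Researchers can use the structured prediction fields, such as protein family, function description, and enzyme information, to build or validate computational frameworks that reason end to end from sequence to function.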

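Academic problems addressed

The dataset responds to a long-standing challenge in proteomics: how to link massive numbers of protein sequences to biological function efficiently and accurately. Traditional methods rely on pattern matching or a single line of evidence and struggle with conflicting evidence or complex mechanistic explanations. By integrating large language model generative reasoning, SA-Prot-annot semantically enriches and logically combines heterogeneous evidence, and so addresses evidence weighing, cross-modal conflict reconciliation, and the synthesis of mechanistic explanations in functional annotation. Its significance is in moving protein function prediction from simple pattern recognition toward evidence-based, interpretable generative reasoning, and in providing a data basis for more reliable computational biology knowledge systems.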

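Practical applications

In practice, SA-Prot-annot provides data support for drug target discovery, enzyme engineering, and synthetic biology research. Bioinformaticians and industrial R&D teams can use the functional summaries and structured predictions to quickly screen candidate proteins that have a specific catalytic activity, take part in a specific metabolic pathway, or reside in a specific subcellular compartment. In drug development, for example, predicted protein family and function information can help identify potential disease-related proteins; in industrial enzyme engineering, enzyme information and structural class predictions can guide rational design. The evidence provenance fields also make the predictions more trustworthy and auditable, which practical applications require.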
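The screening described above could be sketched as a confidence filter over one prediction dimension. This is only an illustration under stated assumptions: `confident_value` is a hypothetical helper, the 0.8 threshold is arbitrary, the two records are invented, and the code assumes each dimension is an object carrying `value` and `confidence` as the dataset card describes for typical records.

```python
import json


def confident_value(raw_comments, dimension, min_confidence=0.8):
    """Return the predicted value for `dimension` if its confidence
    meets the threshold; otherwise return None."""
    predictions = json.loads(raw_comments).get("predictions") or {}
    entry = predictions.get(dimension)
    if not isinstance(entry, dict):
        return None
    confidence = entry.get("confidence")
    if not isinstance(confidence, (int, float)) or confidence < min_confidence:
        return None
    return entry.get("value")


# Two invented records keyed by placeholder accessions.
records = {
    "ACC_A": json.dumps({"predictions": {
        "subcellularLocation": {"value": "Cell membrane", "confidence": 0.92}}}),
    "ACC_B": json.dumps({"predictions": {
        "subcellularLocation": {"value": "Cytoplasm", "confidence": 0.41}}}),
}

hits = {
    accession: confident_value(raw, "subcellularLocation")
    for accession, raw in records.items()
    if confident_value(raw, "subcellularLocation") is not None
}
print(hits)
```

A suitable threshold depends on the application; checking retained predictions against the `support` field and `toolResult` keeps the screening auditable.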

Recent research on the dataset