OpenRaiser/SeqStudio
Hugging Face · Updated 2026-04-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OpenRaiser/SeqStudio
Resource description:
---
license: mit
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- biology
- protein
- bioinformatics
- uniprot
- protein-annotation
- seqstudio
size_categories:
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
---
# SeqStudio: Protein Annotation Dataset
**SeqStudio** is an AI-powered protein annotation system that generates comprehensive functional predictions for protein sequences. This repository hosts **SeqStudio-generated annotations** at multiple scales: Swiss-Prot subsets, a **1.2M UniProt** mix (Swiss-Prot + TrEMBL), and a **~20M UniProt-scale** release split into six coarse **annotation-score tiers** as Parquet tables.
## Dataset Files
### Legacy single-file releases (repo root)
| File | Records | Size (approx.) | Description |
|------|---------|----------------|-------------|
| `seqstudio_swissprot_10k.parquet` | 10,000 | 58 MB | High-quality Swiss-Prot subset (evaluation set) |
| `seqstudio_swissprot_full.parquet` | 573,661 | 3.0 GB | Complete Swiss-Prot with SeqStudio annotations |
| `seqstudio_uniprot_1.2m.parquet` | 1,200,000 | 5.9 GB | Swiss-Prot + TrEMBL mix |
### UniProt ~20M release (`seqstudio_uniprot_20m/`)
Six Parquet files with the same **column layout** and JSON-string serialization as the files above (including `data_source`). Filenames follow a **10M-style naming convention**; the numeric suffix is a **scale label** (the `w` appears to count rows in units of 10,000, so `57w` is roughly 570,000 rows), not an exact row guarantee.
| File | Rows (this build) | Size (approx.) | Description |
|------|-------------------|----------------|-------------|
| `seqstudio_uniprot_20m/swiss_57w.parquet` | 575,661 | 2.8 GB | Swiss-Prot tier |
| `seqstudio_uniprot_20m/trembl5_32w.parquet` | 323,099 | 1.9 GB | TrEMBL, annotation score 5 |
| `seqstudio_uniprot_20m/trembl4_108w.parquet` | 1,084,340 | 6.3 GB | TrEMBL, score 4 |
| `seqstudio_uniprot_20m/trembl3_397w.parquet` | 3,970,975 | 17 GB | TrEMBL, score 3 |
| `seqstudio_uniprot_20m/trembl2_324w.parquet` | 3,238,000 | 12 GB | TrEMBL, score 2 |
| `seqstudio_uniprot_20m/trembl1_1081w.parquet` | 10,809,925 | 31 GB | TrEMBL, score 1 |
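The row counts in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes the `w` in each filename abbreviates units of 10,000 rows (an inference from the table, not something the card states), and the helper name `nominal_rows` is hypothetical.

```python
import re

# Row counts copied from the table above (this build).
ROWS = {
    "swiss_57w.parquet": 575_661,
    "trembl5_32w.parquet": 323_099,
    "trembl4_108w.parquet": 1_084_340,
    "trembl3_397w.parquet": 3_970_975,
    "trembl2_324w.parquet": 3_238_000,
    "trembl1_1081w.parquet": 10_809_925,
}


def nominal_rows(filename: str) -> int:
    """Read the scale label from a name like 'swiss_57w.parquet'.

    Assumption: 'w' stands for units of 10,000 rows.
    """
    match = re.search(r"_(\d+)w\.parquet$", filename)
    if match is None:
        raise ValueError(f"no scale label in {filename!r}")
    return int(match.group(1)) * 10_000


total = sum(ROWS.values())
print(f"total rows: {total:,}")
for name, actual in ROWS.items():
    print(f"{name}: label {nominal_rows(name):,}, actual {actual:,}")
```

Comparing the label with the actual count makes it easy to see that the suffix is only a rounded scale indicator.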
**Total rows (six files):** 20,002,000 (nominal **20M** scale). A small surplus can occur at shard boundaries; de-duplicate on `primaryAccession` if you need strictly unique entries.
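If strict uniqueness matters, one option is to keep the first row seen for each accession while streaming through the tiers. This is a minimal sketch over plain dictionaries; `dedupe_by_accession` is a hypothetical helper, and it assumes every record carries a `primaryAccession` key.

```python
from typing import Dict, Iterable, Iterator


def dedupe_by_accession(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield each record once, keeping the first occurrence per accession."""
    seen = set()
    for record in records:
        accession = record["primaryAccession"]
        if accession in seen:
            continue  # duplicate, e.g. a row repeated at a shard boundary
        seen.add(accession)
        yield record


# Synthetic rows for illustration only; the accessions are made up.
rows = [
    {"primaryAccession": "A00001", "data_source": "swiss"},
    {"primaryAccession": "A00002", "data_source": "trembl5"},
    {"primaryAccession": "A00001", "data_source": "trembl5"},
]
unique = list(dedupe_by_accession(rows))
print([r["primaryAccession"] for r in unique])
```

Within a single DataFrame, `df.drop_duplicates(subset="primaryAccession")` achieves the same result; the generator form is mainly useful when the tiers are processed one after another and only the set of accessions has to stay in memory.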
**Reading:** load one tier at a time to limit memory; use `columns=[...]` when only a subset of fields is needed.
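The reading advice above might look like the following sketch. `TIER_FILES`, `tier_url`, and `load_tier` are hypothetical names; the paths come from the table, and reading `hf://` URLs with pandas assumes the `huggingface_hub` package and a Parquet engine such as `pyarrow` are installed.

```python
REPO = "hf://datasets/OpenRaiser/SeqStudio"

# Tier label -> file name inside seqstudio_uniprot_20m/ (from the table above).
TIER_FILES = {
    "swiss": "swiss_57w.parquet",
    "trembl5": "trembl5_32w.parquet",
    "trembl4": "trembl4_108w.parquet",
    "trembl3": "trembl3_397w.parquet",
    "trembl2": "trembl2_324w.parquet",
    "trembl1": "trembl1_1081w.parquet",
}


def tier_url(tier: str) -> str:
    """Build the hf:// URL for one tier of the ~20M release."""
    return f"{REPO}/seqstudio_uniprot_20m/{TIER_FILES[tier]}"


def load_tier(tier: str, columns=None):
    """Load a single tier, optionally restricted to a subset of columns."""
    import pandas as pd  # imported lazily so tier_url works without pandas

    return pd.read_parquet(tier_url(tier), columns=columns)


print(tier_url("trembl5"))

# Example call (not executed here because it fetches remote data):
# df = load_tier("trembl5", columns=["primaryAccession", "sequence", "data_source"])
```

Processing one tier per call and releasing the DataFrame before loading the next keeps peak memory bounded by the largest single file rather than the whole release.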
### Data composition (1.2M file)
**UniProt 1.2M** (`seqstudio_uniprot_1.2m.parquet`):
- Swiss-Prot: 573,661 (47.8%) — manually reviewed
- TrEMBL: 626,339 (52.2%) — computationally analyzed
**Swiss-Prot full** (`seqstudio_swissprot_full.parquet`):
- 573,661 records, all with SeqStudio AI-generated annotations and confidence-style fields where applicable.
## Key Features
### SeqStudio AI-generated annotations
Each protein entry can include **SeqStudio predictions**, for example:
- **Protein family** classification with confidence
- **Primary biological function**
- **Catalytic activity** (EC, reaction, substrates/products, cofactors) where applicable
- **Pathways**, **subcellular localization**, **structural class** (`structuralClass` in `predictions`; older exports may still contain legacy keys such as `proteinStructure`)
### Additional fields
- **Original UniProt-style fields**: sequence, organism, descriptions, features, comments, cross-refs, etc.
- **`toolResult`**: InterProScan, BLAST, Foldseek, TMHMM payloads (JSON string where present)
- **`data_source`**: coarse tier label — for the 20M folder: `swiss`, `trembl5`, `trembl4`, `trembl3`, `trembl2`, `trembl1` (TrEMBL score 3 may also appear as `trembl3_gemini` / `trembl3_gpt` in some pipelines; the six Hub files bucket these under **`trembl3`**)
- **Legacy field names**: some JSONL sources may include `cokeComments` / `cokeSummary`; in Parquet these are normalized next to `seqStudioComments` / `seqStudioSummary` when present.
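When working with raw JSONL exports rather than the Parquet files, the legacy names mentioned above can be handled with a small fallback lookup. The sketch is one way to do this, not the project's own normalization code; `get_comments` and `get_structural_class` are hypothetical helpers, and the value is assumed to be either a JSON string or an already-parsed dict.

```python
import json
from typing import Any, Dict, Optional


def get_comments(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return parsed SeqStudio comments, falling back to the legacy key."""
    raw = record.get("seqStudioComments")
    if raw is None:
        raw = record.get("cokeComments")
    if raw is None:
        return None
    return json.loads(raw) if isinstance(raw, str) else raw


def get_structural_class(comments: Dict[str, Any]) -> Any:
    """Prefer `structuralClass`; fall back to the legacy `proteinStructure`."""
    predictions = comments.get("predictions", {})
    if "structuralClass" in predictions:
        return predictions["structuralClass"]
    return predictions.get("proteinStructure")


# Synthetic legacy-style record for illustration; the values are made up.
legacy = {
    "primaryAccession": "A00001",
    "cokeComments": json.dumps(
        {"predictions": {"proteinStructure": {"value": "example"}}}
    ),
}
comments = get_comments(legacy)
print(get_structural_class(comments))
```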
## Quick Start
```python
import json

import pandas as pd  # reading hf:// paths also requires the huggingface_hub package

# --- OpenRaiser/SeqStudio (this repo) ---
# Small Swiss-Prot subset (loaded by default; uncomment one alternative below to switch)
df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_swissprot_10k.parquet")
# Full Swiss-Prot
# df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_swissprot_full.parquet")
# 1.2M Swiss + TrEMBL
# df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_uniprot_1.2m.parquet")
# ~20M release: one tier (example: Swiss-Prot tier)
# df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_uniprot_20m/swiss_57w.parquet")

# seqStudioComments is stored as a JSON string, so decode it first
seqstudio_predictions = json.loads(df.iloc[0]["seqStudioComments"])
predictions = seqstudio_predictions["predictions"]
print(f"Protein family: {predictions['proteinFamily']['value']}")
print(f"Function: {predictions['primaryFunction']['value']}")
print(f"Confidence: {predictions['primaryFunction']['confidence']}")

# Catalytic activity is only filled in where applicable
if predictions.get("catalyticActivity", {}).get("value") not in (None, "Unknown", ""):
    cat = predictions["catalyticActivity"]["value"]
    if isinstance(cat, dict) and "ec_number" in cat:
        print(f"EC: {cat['ec_number']}")
```
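Since individual predictions may be missing or set to `Unknown`, a defensive accessor can keep batch loops from raising `KeyError`. The sketch below assumes the `{"value": ..., "confidence": ...}` shape used in the Quick Start; `prediction_value` and `ec_number` are hypothetical helpers.

```python
import json
from typing import Any, Optional

EMPTY = (None, "Unknown", "")


def prediction_value(comments_json: str, key: str) -> Optional[Any]:
    """Return predictions[key]['value'], or None if absent or uninformative."""
    predictions = json.loads(comments_json).get("predictions") or {}
    entry = predictions.get(key) or {}
    value = entry.get("value") if isinstance(entry, dict) else None
    return None if value in EMPTY else value


def ec_number(comments_json: str) -> Optional[str]:
    """Return the EC number when catalyticActivity carries one."""
    value = prediction_value(comments_json, "catalyticActivity")
    if isinstance(value, dict):
        return value.get("ec_number")
    return None


# Synthetic payload for illustration; the values and confidence type are made up.
sample = json.dumps({
    "predictions": {
        "proteinFamily": {"value": "Example family", "confidence": 0.9},
        "catalyticActivity": {"value": "Unknown"},
    }
})
print(prediction_value(sample, "proteinFamily"))
print(ec_number(sample))
```

Applied over a column, for example `df["seqStudioComments"].map(ec_number)`, this yields `None` for rows without a usable EC number instead of stopping at the first incomplete record.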
Historical mirrors may use the `opendatalab-raiser/SeqStudio` slug; **this card describes `OpenRaiser/SeqStudio`.**
## Data fields (high level)
- `primaryAccession`: UniProt accession
- `organism`, `sequence`, `proteinDescription`, `genes`, `comments`, `features`, …
- `seqStudioComments`: SeqStudio predictions (JSON string): `version`, `generatedAt`, `predictions` (family, function, catalytic activity, pathways, localization, structural class, …)
- `seqStudioSummary`: short narrative summary (JSON string)
- `toolResult`: tool payloads (JSON string)
- `data_source`: provenance / tier (`swiss`, `trembl5`, …, or finer labels in raw exports)
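Raw exports may carry finer `data_source` labels than the six Hub tiers. A sketch of collapsing them, assuming the only finer labels are suffixed variants such as the `trembl3_gemini` and `trembl3_gpt` values mentioned earlier (`tier_of` is a hypothetical helper):

```python
HUB_TIERS = ("swiss", "trembl5", "trembl4", "trembl3", "trembl2", "trembl1")


def tier_of(data_source: str) -> str:
    """Map a raw data_source label to one of the six Hub tier labels."""
    label = data_source.strip().lower()
    if label in HUB_TIERS:
        return label
    # Finer labels such as 'trembl3_gemini' carry the tier as a prefix.
    prefix = label.split("_", 1)[0]
    if prefix in HUB_TIERS:
        return prefix
    raise ValueError(f"unrecognized data_source: {data_source!r}")


print(tier_of("trembl3_gpt"))
```

Raising on unknown labels, rather than passing them through, surfaces unexpected provenance values early when mixing raw exports with the Hub files.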
## Citation
```bibtex
@dataset{seqstudio2025,
title={SeqStudio: AI-Powered Protein Annotation Datasets},
author={OpenRaiser / OpenDataLab RAISER},
year={2025},
note={Releases from 10k Swiss-Prot subset to ~20M UniProt-scale Parquet},
url={https://huggingface.co/datasets/OpenRaiser/SeqStudio}
}
```
## License
MIT License
Provider:
OpenRaiser



