
OpenRaiser/SeqStudio

Source: Hugging Face · Updated 2026-04-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OpenRaiser/SeqStudio
Resource description:
---
license: mit
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- biology
- protein
- bioinformatics
- uniprot
- protein-annotation
- seqstudio
size_categories:
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
---

# SeqStudio: Protein Annotation Dataset

**SeqStudio** is an AI-powered protein annotation system that generates comprehensive functional predictions for protein sequences. This repository hosts **SeqStudio-generated annotations** at multiple scales: Swiss-Prot subsets, a **1.2M UniProt** mix (Swiss-Prot + TrEMBL), and a **~20M UniProt-scale** release split into six coarse **annotation-score tiers** as Parquet tables.

## Dataset Files

### Legacy single-file releases (repo root)

| File | Records | Size (approx.) | Description |
|------|---------|----------------|-------------|
| `seqstudio_swissprot_10k.parquet` | 10,000 | 58 MB | High-quality Swiss-Prot subset (evaluation set) |
| `seqstudio_swissprot_full.parquet` | 573,661 | 3.0 GB | Complete Swiss-Prot with SeqStudio annotations |
| `seqstudio_uniprot_1.2m.parquet` | 1,200,000 | 5.9 GB | Swiss-Prot + TrEMBL mix |

### UniProt ~20M release (`seqstudio_uniprot_20m/`)

Six Parquet files with the same **column layout** and JSON-string serialization as the files above (including `data_source`). Filenames follow a **10M-style naming convention**; the numeric suffix is a **scale label**, not an exact row guarantee.

| File | Rows (this build) | Size (approx.) | Description |
|------|-------------------|----------------|-------------|
| `seqstudio_uniprot_20m/swiss_57w.parquet` | 575,661 | 2.8 GB | Swiss-Prot tier |
| `seqstudio_uniprot_20m/trembl5_32w.parquet` | 323,099 | 1.9 GB | TrEMBL, annotation score 5 |
| `seqstudio_uniprot_20m/trembl4_108w.parquet` | 1,084,340 | 6.3 GB | TrEMBL, score 4 |
| `seqstudio_uniprot_20m/trembl3_397w.parquet` | 3,970,975 | 17 GB | TrEMBL, score 3 |
| `seqstudio_uniprot_20m/trembl2_324w.parquet` | 3,238,000 | 12 GB | TrEMBL, score 2 |
| `seqstudio_uniprot_20m/trembl1_1081w.parquet` | 10,809,925 | 31 GB | TrEMBL, score 1 |

**Total rows (six files):** 20,002,000 (nominal **20M** scale; a small surplus can occur at shard boundaries, so filter by `primaryAccession` if you need strict de-duplication).

**Reading:** load one tier at a time to limit memory; pass `columns=[...]` when only a subset of fields is needed.

### Data composition (1.2M file)

**UniProt 1.2M** (`seqstudio_uniprot_1.2m.parquet`):

- Swiss-Prot: 573,661 (47.8%), manually reviewed
- TrEMBL: 626,339 (52.2%), computationally analyzed

**Swiss-Prot full** (`seqstudio_swissprot_full.parquet`):

- 573,661 records, all with SeqStudio AI-generated annotations and confidence-style fields where applicable.

## Key Features

### SeqStudio AI-generated annotations

Each protein entry can include **SeqStudio predictions**, for example:

- **Protein family** classification with confidence
- **Primary biological function**
- **Catalytic activity** (EC, reaction, substrates/products, cofactors) where applicable
- **Pathways**, **subcellular localization**, **structural class** (`structuralClass` in `predictions`; older exports may still contain legacy keys such as `proteinStructure`)

### Additional fields

- **Original UniProt-style fields**: sequence, organism, descriptions, features, comments, cross-refs, etc.
- **`toolResult`**: InterProScan, BLAST, Foldseek, TMHMM payloads (JSON string where present)
- **`data_source`**: coarse tier label. For the 20M folder the values are `swiss`, `trembl5`, `trembl4`, `trembl3`, `trembl2`, `trembl1` (TrEMBL score 3 may also appear as `trembl3_gemini` / `trembl3_gpt` in some pipelines; the six Hub files bucket these under **`trembl3`**)
- **Legacy field names**: some JSONL sources may include `cokeComments` / `cokeSummary`; in Parquet these are normalized next to `seqStudioComments` / `seqStudioSummary` when present.

## Quick Start

```python
import json

import pandas as pd

# --- OpenRaiser/SeqStudio (this repo) ---
# Small Swiss-Prot subset (about 58 MB)
df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_swissprot_10k.parquet")

# Full Swiss-Prot
# df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_swissprot_full.parquet")

# 1.2M Swiss + TrEMBL
# df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_uniprot_1.2m.parquet")

# ~20M release: one tier (example: Swiss-Prot tier, about 2.8 GB)
# df = pd.read_parquet("hf://datasets/OpenRaiser/SeqStudio/seqstudio_uniprot_20m/swiss_57w.parquet")

# seqStudioComments is stored as a JSON string, so decode it first
seqstudio_predictions = json.loads(df.iloc[0]["seqStudioComments"])
predictions = seqstudio_predictions["predictions"]

print(f"Protein family: {predictions['proteinFamily']['value']}")
print(f"Function: {predictions['primaryFunction']['value']}")
print(f"Confidence: {predictions['primaryFunction']['confidence']}")

# Catalytic activity is only present for some proteins
if predictions.get("catalyticActivity", {}).get("value") not in (None, "Unknown", ""):
    cat = predictions["catalyticActivity"]["value"]
    if isinstance(cat, dict) and "ec_number" in cat:
        print(f"EC: {cat['ec_number']}")
```

Historical mirrors may use the `opendatalab-raiser/SeqStudio` slug; **this card describes `OpenRaiser/SeqStudio`.**

## Data fields (high level)

- `primaryAccession`: UniProt accession
- `organism`, `sequence`, `proteinDescription`, `genes`, `comments`, `features`, …
- `seqStudioComments`: SeqStudio predictions (JSON string): `version`, `generatedAt`, `predictions` (family, function, catalytic activity, pathways, localization, structural class, …)
- `seqStudioSummary`: short narrative summary (JSON string)
- `toolResult`: tool payloads (JSON string)
- `data_source`: provenance / tier (`swiss`, `trembl5`, …, or finer labels in raw exports)

## Citation

```bibtex
@dataset{seqstudio2025,
  title={SeqStudio: AI-Powered Protein Annotation Datasets},
  author={OpenRaiser / OpenDataLab RAISER},
  year={2025},
  note={Releases from 10k Swiss-Prot subset to ~20M UniProt-scale Parquet},
  url={https://huggingface.co/datasets/OpenRaiser/SeqStudio}
}
```

## License

MIT License
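
## Example: tolerant parsing of `seqStudioComments`

The card notes that `seqStudioComments` is a JSON string, that some sources carry the legacy column `cokeComments`, and that older exports may use `proteinStructure` instead of `structuralClass`. The sketch below shows one way to read a record without failing on those variations. The helper names `parse_predictions` and `prediction_value` are hypothetical (not part of any SeqStudio API), the sample record is made up for illustration, and the code assumes that legacy payloads share the `predictions` layout of the current ones, which the card does not state explicitly.

```python
import json
from typing import Any

# Legacy prediction key -> current key, as mentioned in the dataset card.
LEGACY_PREDICTION_KEYS = {"proteinStructure": "structuralClass"}
# Values treated as "no prediction" (same set the Quick Start checks for).
EMPTY_VALUES = (None, "", "Unknown")


def parse_predictions(row: Any) -> dict:
    """Return the `predictions` dict of one record, or {} if absent or invalid.

    `row` can be a plain dict or a pandas Series; both support `.get`.
    """
    # Assumption: a legacy `cokeComments` column holds the same JSON layout.
    raw = row.get("seqStudioComments") or row.get("cokeComments")
    if not raw:
        return {}
    try:
        payload = json.loads(raw) if isinstance(raw, str) else raw
    except json.JSONDecodeError:
        return {}
    predictions = payload.get("predictions") if isinstance(payload, dict) else None
    if not isinstance(predictions, dict):
        return {}
    predictions = dict(predictions)  # copy so the parsed payload stays untouched
    for old, new in LEGACY_PREDICTION_KEYS.items():
        if new not in predictions and old in predictions:
            predictions[new] = predictions[old]
    return predictions


def prediction_value(predictions: dict, key: str) -> Any:
    """Return predictions[key]['value'], or None when missing or a placeholder."""
    entry = predictions.get(key)
    value = entry.get("value") if isinstance(entry, dict) else entry
    return None if value in EMPTY_VALUES else value


# Made-up record using the legacy column and the legacy prediction key.
example = {
    "cokeComments": json.dumps(
        {
            "predictions": {
                "proteinStructure": {"value": "alpha/beta"},
                "catalyticActivity": {"value": "Unknown"},
            }
        }
    )
}
preds = parse_predictions(example)
print(prediction_value(preds, "structuralClass"))    # alpha/beta
print(prediction_value(preds, "catalyticActivity"))  # None
```

With a loaded frame, the same helpers could be applied per row, for example `df.apply(parse_predictions, axis=1)`, at the cost of decoding every JSON string in that tier.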
Provided by:
OpenRaiser