alegendaryfish/CodonTranslator-data

Name: alegendaryfish/CodonTranslator-data
Creator: alegendaryfish
Published: 2026-04-08 01:44:23
License: cc-by-4.0

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/alegendaryfish/CodonTranslator-data


Description:

---
license: cc-by-4.0
pretty_name: CodonTranslator Data
task_categories:
- text-generation
tags:
- biology
- dna
- codon-optimization
- protein-conditioned-generation
size_categories:
- 10M<n<100M
---

# CodonTranslator Data

This repository contains the final public training-data release used for CodonTranslator.

## Contents

- `train/`: representative-only training shards
- `val/`: representative-only validation shards
- `test/`: representative-only held-out test shards
- `embeddings_v2/`: precomputed species conditioning embeddings used in training
- `_work/final_representative_counts.json`: final released split sizes
- `_work/split_report.json`: split audit report
- `_work/mmseqs_manifest.json`: MMseqs version and clustering parameters
- `_work/cluster_split.parquet`: cluster-level split assignments
- `_work/seq_cluster.parquet`: MMseqs cluster assignments
- `_work/seq_split.parquet`: split assignments before representative selection

## Split definition

The public `data_v3` split was rebuilt from `data_v2` with the following rules:

- MMseqs clustering in **protein space**
- test holdout by **binomial species**
- validation split from **seen species but unseen clusters**
- representative-only parquet outputs, one retained row per released representative sequence

Mixed seen/held-out protein clusters and exact-protein leakage cases are removed on the seen side before final release.

## Final released split sizes

- `train = 36,888,301`
- `val = 373,637`
- `test = 331,455`

All three released splits satisfy:

- exact protein overlap `train/val = 0`
- exact protein overlap `train/test = 0`
- test species not seen in train/val
- representatives-only rows, so `rows == unique_seq_id`

## Parquet schema

Each released parquet shard contains these columns:

- `RefseqID`
- `protein_refseq_id`
- `protein_seq`
- `cds_DNA`
- `taxon`
- `shard`

## Embeddings

`embeddings_v2/` is the exact species embedding store used in training. It contains:

- `species_vocab.json`
- `species_index.json`
- `species_tok_emb.bin`
- `metadata.json`
- `taxonomy_database.json`

The released `data_v3` taxa are fully covered by this embedding store.

## Model and code

The corresponding public model and code release is:

- `alegendaryfish/CodonTranslator`

## Attribution

This release redistributes processed training data and derived embeddings for reproducibility of the CodonTranslator experiments. See the audit files in `_work/` for construction details and final verification outputs.

The raw MMseqs working database is not included because it is large, machine-specific intermediate state. The released `_work/` files are the reproducibility artifacts needed to audit and reconstruct the clustering and split decisions.
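The split guarantees stated in the dataset card (zero exact-protein overlap between train and val/test, test species unseen in train/val, one row per sequence ID) can be spot-checked on loaded rows. The following is a minimal sketch, not the release's own audit code. It assumes that `taxon` identifies the species and that `RefseqID` plays the role of the unique sequence ID; neither is confirmed by the card. The function name `check_split_invariants` and the toy rows are hypothetical.

```python
from typing import Mapping, Sequence

Row = Mapping[str, str]


def check_split_invariants(
    train: Sequence[Row],
    val: Sequence[Row],
    test: Sequence[Row],
    id_col: str = "RefseqID",  # assumption: this column is the unique sequence ID
) -> dict:
    """Count violations of the split guarantees listed in the dataset card."""
    train_prot = {r["protein_seq"] for r in train}
    val_prot = {r["protein_seq"] for r in val}
    test_prot = {r["protein_seq"] for r in test}

    # assumption: `taxon` identifies the species used for the test holdout
    seen_taxa = {r["taxon"] for r in train} | {r["taxon"] for r in val}
    test_taxa = {r["taxon"] for r in test}

    return {
        # exact protein overlap train/val and train/test should both be 0
        "train_val_protein_overlap": len(train_prot & val_prot),
        "train_test_protein_overlap": len(train_prot & test_prot),
        # test species must not appear in train or val
        "test_taxa_seen_in_train_val": len(test_taxa & seen_taxa),
        # one row per sequence ID within each split
        "rows_equal_unique_ids": all(
            len(split) == len({r[id_col] for r in split})
            for split in (train, val, test)
        ),
    }


# Toy rows using column names from the parquet schema (values are made up).
train = [
    {"RefseqID": "id_a", "protein_seq": "MKT", "cds_DNA": "ATGAAAACC", "taxon": "species_one"},
    {"RefseqID": "id_b", "protein_seq": "MAL", "cds_DNA": "ATGGCTCTG", "taxon": "species_one"},
]
val = [
    {"RefseqID": "id_c", "protein_seq": "MGS", "cds_DNA": "ATGGGTAGC", "taxon": "species_one"},
]
test = [
    {"RefseqID": "id_d", "protein_seq": "MPV", "cds_DNA": "ATGCCGGTG", "taxon": "species_two"},
]

report = check_split_invariants(train, val, test)
print(report)
```

For real shards, rows could be obtained with something like `pandas.read_parquet(path).to_dict("records")`, although at tens of millions of rows a columnar or streaming approach would likely be more practical than Python sets of full sequences.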

Provider:

alegendaryfish
