
alegendaryfish/CodonTranslator-data

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/alegendaryfish/CodonTranslator-data
Resource description:
---
license: cc-by-4.0
pretty_name: CodonTranslator Data
task_categories:
- text-generation
tags:
- biology
- dna
- codon-optimization
- protein-conditioned-generation
size_categories:
- 10M<n<100M
---

# CodonTranslator Data

This repository contains the final public training-data release used for CodonTranslator.

## Contents

- `train/`: representative-only training shards
- `val/`: representative-only validation shards
- `test/`: representative-only held-out test shards
- `embeddings_v2/`: precomputed species conditioning embeddings used in training
- `_work/final_representative_counts.json`: final released split sizes
- `_work/split_report.json`: split audit report
- `_work/mmseqs_manifest.json`: MMseqs version and clustering parameters
- `_work/cluster_split.parquet`: cluster-level split assignments
- `_work/seq_cluster.parquet`: MMseqs cluster assignments
- `_work/seq_split.parquet`: split assignments before representative selection

## Split definition

The public `data_v3` split was rebuilt from `data_v2` with the following rules:

- MMseqs clustering in **protein space**
- test holdout by **binomial species**
- validation split from **seen species but unseen clusters**
- representative-only parquet outputs, one retained row per released representative sequence

Mixed seen/held-out protein clusters and exact-protein leakage cases are removed on the seen side before final release.

## Final released split sizes

- `train = 36,888,301`
- `val = 373,637`
- `test = 331,455`

All three released splits satisfy:

- exact protein overlap `train/val = 0`
- exact protein overlap `train/test = 0`
- test species not seen in train/val
- representatives-only rows, so `rows == unique_seq_id`

## Parquet schema

Each released parquet shard contains these columns:

- `RefseqID`
- `protein_refseq_id`
- `protein_seq`
- `cds_DNA`
- `taxon`
- `shard`

## Embeddings

`embeddings_v2/` is the exact species embedding store used in training. It contains:

- `species_vocab.json`
- `species_index.json`
- `species_tok_emb.bin`
- `metadata.json`
- `taxonomy_database.json`

The released `data_v3` taxa are fully covered by this embedding store.

## Model and code

The corresponding public model and code release is:

- `alegendaryfish/CodonTranslator`

## Attribution

This release redistributes processed training data and derived embeddings for reproducibility of the CodonTranslator experiments. See the audit files in `_work/` for construction details and final verification outputs.

The raw MMseqs working database is not included because it is large, machine-specific intermediate state. The released `_work/` files are the reproducibility artifacts needed to audit and reconstruct the clustering and split decisions.
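
The guarantees listed under "Final released split sizes" (zero exact protein overlap, unseen test species, one row per sequence) can be spot-checked with a few set operations. The following is a minimal sketch on toy rows, not the release's own audit code. It assumes that `RefseqID` is the per-row sequence identifier and that `taxon` carries the species label; neither is stated explicitly in the card. The function name `audit_splits` and the toy values are hypothetical.

```python
def audit_splits(train, val, test, seq_col="RefseqID"):
    """Count leakage between splits given as lists of row dicts.

    Assumption: `seq_col` identifies a sequence and `taxon` identifies a
    species. Both are guesses based on the column names in the schema.
    """
    train_proteins = {row["protein_seq"] for row in train}
    seen_taxa = {row["taxon"] for row in train} | {row["taxon"] for row in val}

    def unique_ids(rows):
        # True when every row has a distinct sequence identifier.
        return len(rows) == len({row[seq_col] for row in rows})

    return {
        "train_val_protein_overlap": len(
            train_proteins & {row["protein_seq"] for row in val}
        ),
        "train_test_protein_overlap": len(
            train_proteins & {row["protein_seq"] for row in test}
        ),
        "test_taxa_seen_in_train_val": len(
            {row["taxon"] for row in test} & seen_taxa
        ),
        "rows_equal_unique_ids": all(unique_ids(s) for s in (train, val, test)),
    }


def toy_row(seq_id, protein, taxon):
    # Hypothetical rows that carry only the columns the audit reads.
    return {"RefseqID": seq_id, "protein_seq": protein, "taxon": taxon}


train = [
    toy_row("id_1", "MKT", "Escherichia coli"),
    toy_row("id_2", "MAA", "Bacillus subtilis"),
]
val = [toy_row("id_3", "MGG", "Escherichia coli")]
test = [toy_row("id_4", "MLL", "Homo sapiens")]

report = audit_splits(train, val, test)
print(report)
```

On the real shards, the row dicts could come from any parquet reader, for example `pandas.read_parquet(path).to_dict("records")`. At the released train size, streaming only the needed columns would be more practical than loading full rows.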
Provided by:
alegendaryfish