cds-dataset
Dataset Overview
Basic Information
- Dataset name: jheuschkel/cds-dataset
- License: Apache-2.0
- Tags: codon, cds, CDS, mRNA, RNA, Codon
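For a quick look at the records, the sketch below streams the first few entries. It assumes the dataset is hosted on the Hugging Face Hub under the name listed above, that the `datasets` library is installed, and that a `train` split exists; none of this is confirmed by this card. `peek_records` is a hypothetical helper name, and the record fields are not documented here, so the sketch returns the records as-is.

```python
from itertools import islice

REPO_ID = "jheuschkel/cds-dataset"  # dataset name taken from this card


def peek_records(n=3, repo_id=REPO_ID, split="train"):
    """Return the first n records without downloading the whole dataset.

    The split name "train" is an assumption; adjust it if the repository
    exposes different splits.
    """
    # Imported lazily so this module loads even where `datasets` is missing.
    from datasets import load_dataset

    # streaming=True yields records one at a time instead of fetching all files.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_records(3)` needs network access and returns a list of up to three records; inspecting their keys is a reasonable first step before writing any processing code.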
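Dataset Contents
- Purpose: training the SynCodonLM model
- Composition: approximately 66 million coding sequences (CDS), with no introns
- Species coverage: drawn from approximately 35,000 species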
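Because the sequences are intron-free CDS, each one can be read in frame as consecutive three-base codons. The sketch below shows that split on a toy sequence. The sequence is made up for illustration, `split_codons` is a hypothetical helper, and the way sequences are actually stored in the dataset is not specified on this card.

```python
def split_codons(cds):
    """Split a coding sequence into codons, reading in frame from the first base."""
    # Accept either the RNA or DNA alphabet by normalising U to T.
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


# Toy example (not taken from the dataset): start codon, two codons, stop codon.
codons = split_codons("ATGGCTAAATGA")
print(codons)  # ['ATG', 'GCT', 'AAA', 'TGA']
```

The length check is a cheap sanity filter: a sequence whose length is not divisible by three cannot be a complete in-frame CDS.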
Related Resources
- Model GitHub repository: https://github.com/Boehringer-Ingelheim/SynCodonLM
- Model weights: https://huggingface.co/jheuschkel/SynCodonLM
Citation
@article{Heuschkel2025.08.19.671089,
  author = {Heuschkel, James and Kingsley, Laura and Pefaur, Noah and Nixon, Andrew and Cramer, Steven},
  title = {Advancing Codon Language Modeling with Synonymous Codon Constrained Masking},
  elocation-id = {2025.08.19.671089},
  year = {2025},
  doi = {10.1101/2025.08.19.671089},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Codon language models offer a promising framework for modeling protein-coding DNA sequences, yet current approaches often conflate codon usage with amino acid semantics, limiting their ability to capture DNA-level biology. We introduce SynCodonLM, a codon language model that enforces a biologically grounded constraint: masked codons are only predicted from synonymous options, guided by the known protein sequence. This design disentangles codon-level from protein-level semantics, enabling the model to learn nucleotide-specific patterns. The constraint is implemented by masking non-synonymous codons from the prediction space prior to softmax. Unlike existing models, which cluster codons by amino acid identity, SynCodonLM clusters by nucleotide properties, revealing structure aligned with DNA-level biology. Furthermore, SynCodonLM outperforms existing models on 6 of 7 benchmarks sensitive to DNA-level features, including mRNA and protein expression. Our approach advances domain-specific representation learning and opens avenues for sequence design in synthetic biology, as well as deeper insights into diverse bioprocesses. Competing Interest Statement: The authors have declared no competing interest.},
  URL = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089},
  eprint = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089.full.pdf},
  journal = {bioRxiv}
}
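The abstract in the citation describes the training constraint as masking non-synonymous codons out of the prediction space before the softmax. The sketch below illustrates that idea in plain Python and is not the authors' implementation. It assumes the standard genetic code, a 64-entry logit vector ordered by the `CODONS` list defined here, and a hypothetical function name `synonymous_softmax`; the real model's vocabulary and ordering may differ.

```python
import math

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def synonymous_softmax(logits, amino_acid):
    """Softmax over 64 codon logits, restricted to codons encoding amino_acid."""
    if amino_acid not in AMINO:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    # Non-synonymous codons get -inf, so they receive exactly zero probability.
    masked = [
        x if CODON_TABLE[c] == amino_acid else -math.inf
        for c, x in zip(CODONS, logits)
    ]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return {c: e / total for c, e in zip(CODONS, exps)}


# With uniform logits, lysine (K) splits evenly across its two codons.
probs = synonymous_softmax([0.0] * 64, "K")
print(probs["AAA"], probs["AAG"], probs["ATG"])  # 0.5 0.5 0.0
```

Masking before normalisation means the probability mass is shared only among codons that preserve the known protein sequence, so the prediction reflects codon choice and not amino acid identity.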




