cds-dataset

Source: Hugging Face. Updated 2025-08-24; indexed 2025-08-25.

Download link:

https://huggingface.co/datasets/jheuschkel/cds-dataset


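Summary:

This dataset was used to train the SynCodonLM model. It consists of roughly 66 million intron-free CDS (coding sequence) records drawn from roughly 35,000 species, and it is intended to support modeling of protein-coding DNA sequences.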

Created:

2025-08-19

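Summary of original information

Dataset overview

Basic information

Dataset name: jheuschkel/cds-dataset
License: Apache-2.0
Tags: codon, cds, CDS, mRNA, RNA, Codon

Dataset contents

Purpose: training the SynCodonLM model
Composition: about 66 million CDS sequences (no introns)
Species coverage: about 35,000 species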

Citation

@article{Heuschkel2025.08.19.671089,
  author = {Heuschkel, James and Kingsley, Laura and Pefaur, Noah and Nixon, Andrew and Cramer, Steven},
  title = {Advancing Codon Language Modeling with Synonymous Codon Constrained Masking},
  elocation-id = {2025.08.19.671089},
  year = {2025},
  doi = {10.1101/2025.08.19.671089},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Codon language models offer a promising framework for modeling protein-coding DNA sequences, yet current approaches often conflate codon usage with amino acid semantics, limiting their ability to capture DNA-level biology. We introduce SynCodonLM, a codon language model that enforces a biologically grounded constraint: masked codons are only predicted from synonymous options, guided by the known protein sequence. This design disentangles codon-level from protein-level semantics, enabling the model to learn nucleotide-specific patterns. The constraint is implemented by masking non-synonymous codons from the prediction space prior to softmax. Unlike existing models, which cluster codons by amino acid identity, SynCodonLM clusters by nucleotide properties, revealing structure aligned with DNA-level biology. Furthermore, SynCodonLM outperforms existing models on 6 of 7 benchmarks sensitive to DNA-level features, including mRNA and protein expression. Our approach advances domain-specific representation learning and opens avenues for sequence design in synthetic biology, as well as deeper insights into diverse bioprocesses. Competing Interest Statement: The authors have declared no competing interest.},
  URL = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089},
  eprint = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089.full.pdf},
  journal = {bioRxiv}
}

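Collected summary

Dataset introduction

Construction

High-quality datasets are essential to progress in computational biology. cds-dataset draws on a broad range of biological sequence sources and brings together about 66 million intron-free CDS sequences from about 35,000 species. The sequences are described as filtered and standardized for accuracy and consistency, which gives a foundation for training codon language models.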

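Features

The dataset stands out for its scale and diversity: it spans a wide range of species and focuses on codon-level biological features. The processed sequences foreground the synonymous-codon constraint, which helps a model separate codon usage from amino acid semantics and offers a distinct view of DNA-level biological patterns.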

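Usage

cds-dataset is used mainly to train and evaluate codon language models such as SynCodonLM. Researchers can obtain the pretrained model weights from Hugging Face and combine them with this dataset for downstream tasks such as mRNA expression prediction or sequence design in synthetic biology. Cite the associated paper when using the dataset.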
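As a minimal sketch of reading the data, the snippet below streams a few records with the Hugging Face `datasets` library. It assumes that `datasets` is installed and that the repository exposes a split named `train`; neither the split name nor the record fields are documented on this page, so the helper `describe` (a hypothetical name) only reports whatever fields it finds instead of hard-coding any.

```python
from itertools import islice


def peek_records(repo_id="jheuschkel/cds-dataset", split="train", n=3):
    """Return the first n records without downloading the whole dataset.

    Assumptions: the `datasets` package is installed and the repository has
    a split called `split`. The split name is a guess, not confirmed here.
    """
    from datasets import load_dataset  # third-party, imported lazily

    # streaming=True yields records one by one instead of fetching all files.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


def describe(record):
    """Summarize one record as {field name: Python type name}."""
    return {key: type(value).__name__ for key, value in record.items()}


# Example (requires network access):
#     for record in peek_records():
#         print(describe(record))
```

Streaming is a reasonable default here because the dataset holds roughly 66 million sequences, and inspecting the schema first avoids guessing column names.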

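Background and challenges

Background

As synthetic biology and computational biology converge, codon usage bias has become a key direction for understanding how gene expression is regulated. In 2025, James Heuschkel and colleagues at Boehringer Ingelheim built cds-dataset, which gathers about 66 million intron-free CDS sequences from about 35,000 species to support training of SynCodonLM. Its central scientific question is representation learning that disentangles codon-level semantics from amino acid semantics. The dataset therefore provides a data foundation for modeling DNA-level biological features and supports the use of codon language models in gene expression prediction and synthetic sequence design.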

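Current challenges

The dataset targets the problem of modeling codon usage bias, where separating synonymous codon choice from protein-level functional semantics is especially difficult. Building it requires standardizing CDS sequences across many species, controlling sequence quality, and keeping synonymous-codon annotation consistent, while avoiding systematic bias during large-scale integration. These requirements place high demands on data cleaning and biological validation.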
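The kind of sequence quality control mentioned above can be illustrated with a small sanity check. This is only a sketch under stated assumptions: it uses the standard genetic code, treats `ATG` as the expected start codon, and the function name `cds_problems` is hypothetical. It is not the authors' filtering pipeline, which this page does not describe.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def cds_problems(seq, require_start=True):
    """Return a list of reasons why `seq` fails a basic CDS sanity check.

    An empty list means the sequence passed every check.
    """
    seq = seq.upper().replace("U", "T")  # accept mRNA-style input as well
    if not seq:
        return ["empty sequence"]

    problems = []
    if set(seq) - set("ACGT"):
        problems.append("non-ACGT characters")
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    if require_start and codons and codons[0] != "ATG":
        problems.append("does not start with ATG")
    if codons and codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems
```

Some organisms use alternative start codons or non-standard genetic codes, so `require_start=False` (and a different stop set) may be more appropriate for a multi-species collection.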

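Common scenarios

Typical use case

In computational biology, cds-dataset supplies the core training data for codon language models. It gathers about 66 million intron-free CDS sequences from about 35,000 species and is aimed at analyzing the relationship between codon usage bias and protein-coding regularities. Large-scale sequence modeling on this data can expose how codon choice has evolved across species and how it affects translation efficiency.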
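Codon usage bias is often quantified with relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes RSCU for a list of CDS strings. It assumes the standard genetic code and is a generic illustration, not a method taken from the SynCodonLM paper.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(sequences):
    """Return {codon: RSCU} over all in-frame codons of `sequences`.

    Stop codons and amino acids that never occur are left out.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families, one per amino acid.
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":
            families[amino_acid].append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        for c in codons:
            # Observed count over the uniform expectation total / len(codons).
            result[c] = counts[c] * len(codons) / total
    return result
```

An RSCU of 1.0 means a codon is used exactly as often as expected under uniform synonymous usage; values above 1.0 indicate preference.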

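Derived work

SynCodonLM, the model trained on this dataset, is presented as a reference point for codon language modeling. Its synonymous-codon-constrained masking mechanism is described as having informed follow-up research, including improved clustering by nucleotide properties and new frameworks for modeling translation dynamics. Together, these efforts deepen the computational analysis of DNA-level biological features.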
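The paper's abstract states that the constraint is implemented by masking non-synonymous codons from the prediction space before the softmax. The sketch below illustrates that idea in plain Python. It assumes a 64-codon vocabulary in a fixed order and the standard genetic code; the function name `synonymous_softmax` is hypothetical, and the model's real vocabulary and tensor code are not described on this page.

```python
import math

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
CODONS = list(CODON_TABLE)  # assumed vocabulary order: TTT, TTC, TTA, ...


def synonymous_softmax(logits, amino_acid):
    """Softmax over codon logits restricted to codons encoding `amino_acid`.

    `logits` holds one score per entry of CODONS. Codons that encode a
    different amino acid get -inf before the softmax, so their probability
    is exactly zero.
    """
    if len(logits) != len(CODONS):
        raise ValueError("expected one logit per codon")
    allowed = [CODON_TABLE[c] == amino_acid for c in CODONS]
    if not any(allowed):
        raise ValueError(f"no codon encodes {amino_acid!r}")

    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return {c: e / total for c, e in zip(CODONS, exps)}
```

With uniform logits and alanine as the known amino acid, the probability mass is split evenly over the four alanine codons, and every other codon receives zero.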
