
Data from: Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation

Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-06-15
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.149m8
Description:
VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in such sequence analysis is to annotate each base as coming from a specific one of the V, D, or J genes, or from an N-addition (a.k.a. non-templated insertion). Previous work has used simple parametric distributions to model transitions from state to state in a hidden Markov model (HMM) of VDJ recombination, and assumed that mutations occur via the same process across sites. However, codon frame and other effects have been observed to violate these parametric assumptions for such coding sequences, suggesting that a non-parametric approach to modeling the recombination process could be useful. In our paper, we find that indeed large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for inference leads to significantly improved results. We present an accurate and efficient BCR sequence annotation software package using a novel HMM “factorization” strategy. This package, called partis (https://github.com/psathyrella/partis/), is built on a new general-purpose HMM compiler that can perform efficient inference given a simple text description of an HMM.
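
As a rough illustration of the annotation step described above, the following is a minimal sketch of Viterbi decoding on a toy HMM whose transitions are stored as explicit per-position categorical tables rather than a parametric form. It is not the partis implementation and does not show the "factorization" strategy: the germline strings, the probabilities, the single fixed mutation rate, and the omission of the D gene are all simplifying assumptions made for this example.

```python
import math

# Hypothetical toy germline segments (not real alleles) and an assumed
# mutation probability; partis uses per-allele-per-position values instead.
V_GENE = "ACGTAC"
J_GENE = "TTGG"
MUT = 0.05

STATES = ([("V", i) for i in range(len(V_GENE))]
          + [("N", 0)]
          + [("J", i) for i in range(len(J_GENE))])


def emission(state, base):
    """Probability that `state` emits `base`."""
    region, pos = state
    if region == "N":
        return 0.25  # non-templated insertion: uniform over A, C, G, T
    germline = (V_GENE if region == "V" else J_GENE)[pos]
    return 1 - MUT if base == germline else MUT / 3


def transitions(state):
    """Categorical distribution over next states, written out as a table.

    The numbers are invented for illustration only.
    """
    region, pos = state
    if region == "V":
        if pos + 1 < len(V_GENE):
            # stay in V, or stop early (3' deletion) and move to N or J
            return {("V", pos + 1): 0.8, ("N", 0): 0.15, ("J", 0): 0.05}
        return {("N", 0): 0.7, ("J", 0): 0.3}
    if region == "N":
        return {("N", 0): 0.5, ("J", 0): 0.5}
    if pos + 1 < len(J_GENE):
        return {("J", pos + 1): 1.0}
    return {}  # last J position: no outgoing transitions


def viterbi(seq):
    """Return the most likely region label (V, N, or J) for each base."""
    neg_inf = float("-inf")
    score = {s: neg_inf for s in STATES}
    score[("V", 0)] = math.log(emission(("V", 0), seq[0]))  # always start in V
    backpointers = []
    for base in seq[1:]:
        new_score = {s: neg_inf for s in STATES}
        pointer = {}
        for state, sc in score.items():
            if sc == neg_inf:
                continue
            for nxt, prob in transitions(state).items():
                cand = sc + math.log(prob) + math.log(emission(nxt, base))
                if cand > new_score[nxt]:
                    new_score[nxt] = cand
                    pointer[nxt] = state
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best-scoring final state.
    path = [max(score, key=score.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return "".join(region for region, _ in path)


print(viterbi("ACGTACCCTTGG"))
```

Because every transition distribution is a plain lookup table, replacing the invented numbers with frequencies estimated from data would not require any change to the decoding code, which is the practical appeal of the categorical approach the abstract describes.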

Provider:
Dryad
Created:
2016-01-27