Data from: Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation
Source: DataCite Commons; updated 2025-06-01; indexed 2025-06-15
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.149m8
Description:
VDJ rearrangement and somatic hypermutation work together to produce
antibody-coding B cell receptor (BCR) sequences for a remarkable diversity
of antigens. It is now possible to sequence these BCRs in high throughput;
analysis of these sequences is bringing new insight into how antibodies
develop, in particular for broadly-neutralizing antibodies against HIV and
influenza. A fundamental step in such sequence analysis is to annotate
each base as coming from a specific one of the V, D, or J genes, or from
an N-addition (a.k.a. non-templated insertion). Previous work has used
simple parametric distributions to model transitions from state to state
in a hidden Markov model (HMM) of VDJ recombination, and assumed that
mutations occur via the same process across sites. However, codon frame
and other effects have been observed to violate these parametric
assumptions for such coding sequences, suggesting that a non-parametric
approach to modeling the recombination process could be useful. In our
paper, we find that indeed large modern data sets suggest a model using
parameter-rich per-allele categorical distributions for HMM transition
probabilities and per-allele-per-position mutation probabilities, and that
using such a model for inference leads to significantly improved results.
We present an accurate and efficient BCR sequence annotation software
package using a novel HMM “factorization” strategy. This package, called
partis (https://github.com/psathyrella/partis/), is built on a new
general-purpose HMM compiler that can perform efficient inference given a
simple text description of an HMM.
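As a rough illustration of the annotation task described above, the following Python sketch labels each base of a read as V, N (non-templated insertion), or J using the Viterbi algorithm on a toy HMM. It is not the partis implementation and does not use the HMM compiler mentioned above. The function name `annotate`, the germline strings, and every probability value are hypothetical choices made for this example. The sketch also simplifies heavily: it omits the D gene and the second insertion, uses one V and one J template rather than many alleles, applies a single mutation rate instead of per-allele-per-position rates, and assumes the read starts at the first V base and ends at the last J base. The lists `v_exit` and `j_entry` stand in for the categorical transition distributions that the abstract describes as being estimated from data.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def annotate(read, v_gene, j_gene, v_exit=None, j_entry=None,
             mut_rate=0.05, p_insert=0.7, p_n_stay=0.5):
    """Label each base of `read` as 'V', 'N', or 'J' (toy Viterbi decoder)."""
    nv, nj = len(v_gene), len(j_gene)
    # Categorical distributions with toy defaults (hypothetical values).
    # v_exit[i]: probability of leaving V after germline position i.
    # j_entry[k]: probability of entering J at germline position k.
    v_exit = v_exit or [0.1] * (nv - 1) + [1.0]
    j_entry = j_entry or [1.0 / nj] * nj

    def emit(state, base):
        region, pos = state
        if region == "N":
            return _log(0.25)  # insertions emit bases uniformly
        germline = v_gene[pos] if region == "V" else j_gene[pos]
        return _log(1 - mut_rate) if base == germline else _log(mut_rate / 3)

    def transitions(state):
        """Yield (next_state, log_probability) pairs for one state."""
        region, pos = state
        if region == "V":
            if pos + 1 < nv:
                yield ("V", pos + 1), _log(1 - v_exit[pos])
            leave = v_exit[pos]
            yield ("N", 0), _log(leave * p_insert)
            for k in range(nj):
                yield ("J", k), _log(leave * (1 - p_insert) * j_entry[k])
        elif region == "N":
            yield ("N", 0), _log(p_n_stay)
            for k in range(nj):
                yield ("J", k), _log((1 - p_n_stay) * j_entry[k])
        elif pos + 1 < nj:
            yield ("J", pos + 1), 0.0  # J positions advance deterministically

    # Forward pass in log space, keeping only the best path into each state.
    score = {("V", 0): emit(("V", 0), read[0])}
    back = []
    for base in read[1:]:
        new_score, pointers = {}, {}
        for state, s in score.items():
            for nxt, log_p in transitions(state):
                cand = s + log_p + emit(nxt, base)
                if cand > new_score.get(nxt, NEG_INF):
                    new_score[nxt], pointers[nxt] = cand, state
        score = new_score
        back.append(pointers)

    # Traceback from the required end state (last J position).
    state = ("J", nj - 1)
    if state not in score:
        raise ValueError("no valid V-N-J path for this read")
    labels = [state[0]]
    for pointers in reversed(back):
        state = pointers[state]
        labels.append(state[0])
    return "".join(reversed(labels))


# Toy read: V with two 3' bases deleted, a 2-base insertion, J with two 5' bases deleted.
read = "GAGGTGCA" + "TT" + "GGGCCAAG"
print(annotate(read, "GAGGTGCAGC", "TGGGGCCAAG"))
```

With these toy defaults, the call should label the first eight bases as V, the two inserted T bases as N, and the remaining eight as J. Replacing the default `v_exit` and `j_entry` lists with empirical per-allele frequencies is where a parameter-rich model of the kind described in the abstract would differ from a simple parametric one.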
Provider:
Dryad
Created:
2016-01-27



