Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://figshare.com/articles/dataset/Statistical_Analysis_and_Tokenization_of_Epitopes_to_Construct_Artificial_Neoepitope_Libraries/24133036
Description:
Epitopes are specific regions on an antigen’s surface that the immune
system recognizes. Epitopes are usually protein regions
on foreign immune-stimulating entities such as viruses and bacteria,
and in some cases, endogenous proteins may act as antigens. Identifying
epitopes is crucial for accelerating the development of vaccines and
immunotherapies. However, mapping epitopes in pathogen proteomes is
challenging using conventional methods. Screening artificial neoepitope
libraries against antibodies can overcome this issue. Here, we applied
conventional sequence analysis and methods inspired by natural language
processing to reveal specific sequence patterns in the linear epitopes
deposited in the Immune Epitope Database (www.iedb.org) that can serve as building
blocks for the design of universal epitope libraries. Our results
reveal that amino acid frequency in annotated linear epitopes differs
from that in the human proteome. Aromatic residues are overrepresented,
while cysteines are practically absent from epitopes. Byte
pair encoding tokenization shows high frequencies of tryptophan in
tokens of 5, 6, and 7 amino acids, corroborating the findings of the
conventional sequence analysis. These results can be applied to reduce
the diversity of linear epitope libraries by orders of magnitude.
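
The two analyses described above (residue frequency comparison and byte pair encoding tokenization) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (residue_frequencies, frequency_ratio, learn_bpe, merge_pair), the toy peptides, and the tie-breaking rule are assumptions made for illustration. Real input would be linear epitope sequences exported from the IEDB together with a human proteome background.

```python
from collections import Counter


def residue_frequencies(sequences):
    """Relative frequency of each residue over all sequences combined."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def frequency_ratio(epitope_freqs, background_freqs):
    """Ratio of epitope frequency to background frequency per residue.

    Values above 1 suggest overrepresentation in epitopes, values near 0
    suggest near absence. Residues missing from the epitope set get 0.0.
    """
    return {
        aa: epitope_freqs.get(aa, 0.0) / bg
        for aa, bg in background_freqs.items()
        if bg > 0
    }


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single residues.

    Returns (merges, tokenized_sequences). Ties between equally frequent
    pairs are broken lexicographically (an assumption, for determinism).
    Learning stops early once no adjacent pair occurs at least twice.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, best_count = max(
            pair_counts.items(), key=lambda kv: (kv[1], kv[0])
        )
        if best_count < 2:
            break
        merges.append(best_pair)
        corpus = [merge_pair(tokens, best_pair) for tokens in corpus]
    return merges, corpus


# Toy peptides, hypothetical and NOT taken from the IEDB.
toy_epitopes = ["WWAK", "WWAG", "CWWA"]
toy_merges, toy_tokens = learn_bpe(toy_epitopes, num_merges=5)
print(toy_merges)
print(toy_tokens)
```

Counting how often each residue appears in the learned tokens of a given length (for example 5, 6, or 7 residues) is then a matter of applying residue_frequencies to the tokens of that length. A restricted token vocabulary of this kind is what would allow a library to be assembled from recurring building blocks rather than from all possible sequences.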
Created:
2023-09-13



