Supporting data for "simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods"
Source: DataCite Commons. Updated: 2025-05-26. Indexed: 2024-07-13.
Download link:
http://gigadb.org/dataset/102434
Description:
Interest is growing in using machine learning (ML) to classify immune states from adaptive immune receptor repertoires (AIRR), with the aim of aiding the development of immunodiagnostics and therapeutics. Simulated data are crucial for the development and comprehensive evaluation of AIRR-ML methods, e.g., through crowdsourced ML competitions.
We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which certain ML methods may exploit for undesired shortcut learning. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR constructs these baseline repertoires by introducing signals that follow the empirical relationship between VDJ generation probability and population incidence of public sequences, calibrated on real-world experimental datasets. By allowing users to provide a set of true immune state-associated sequences, simAIRR can be used to construct repertoire-level benchmarks based on a range of assumptions (or experimental data sources) about what constitutes receptor-level immune signals. Users may therefore either make or avoid prior assumptions about the similarity or commonality of the immune state-associated sequences used as true signals. We demonstrate the realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets.
This study not only sheds light on the shortcut learning opportunities for ML methods that can arise with the state-of-the-art way of simulating AIRR datasets, but also provides a solution through a simulation strategy implemented as a Python package: https://github.com/KanduriC/simAIRR.
Provider:
GigaScience Database
Created:
2023-08-30



