thematrixmaster/cosine

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/thematrixmaster/cosine
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- biology
- antibodies
size_categories:
- 100M<n<1B
---

# CoSiNE Training Dataset

Training data for the [CoSiNE](https://github.com/thematrixmaster/cosine) model, which simulates antibody affinity maturation. Please read our [paper](https://arxiv.org/abs/2602.18982) for more details!

## Dataset Summary

This dataset consists of approximately **2 million B-cell receptor (BCR) parent-child sequence transitions** derived from **120,000 clonal families** across **555 individual donors**.

The data were processed using a rigorous phylogenetic inference pipeline to capture the nuances of somatic hypermutation:

1. **Clonal Inference:** Sequences were clustered into clonal families, and naive germlines were inferred using `partis`.
2. **Quality Filtering:** We retained only productive sequences (no stop codons, conserved CDR3 anchors) and excluded sequences with mutations in conserved signature cysteines.
3. **Phylogenetic Reconstruction:** Phylogenetic trees and ancestral sequences were inferred using **IQ-TREE** under a K80 substitution model.
4. **Paired-Chain Modeling:** For paired heavy and light chain data, we used an edge-linked-proportional partition model to account for distinct evolutionary rates across chains.

The final training set consists of **Parent-Child Pairs (PCPs)** extracted from the edges of these phylogenetic trees, representing a comprehensive map of the evolutionary trajectories within the adaptive immune system.

For a detailed description of the processing pipeline, please refer to https://elifesciences.org/reviewed-preprints/109644v1.

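The final step of the pipeline, reading Parent-Child Pairs off the edges of a phylogenetic tree, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the toy tree, node names, sequences, and the helpers `iter_parent_child_pairs` and `hamming` are all hypothetical, and nothing here assumes the dataset's actual file format or column names.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy inputs: the topology, node names, and sequences are
# invented for illustration and are not taken from the dataset.
children: Dict[str, List[str]] = {
    "naive": ["anc1", "leafC"],   # root = inferred naive germline
    "anc1": ["leafA", "leafB"],   # internal node = inferred ancestor
}
sequences: Dict[str, str] = {
    "naive": "GAGGTGCAGCTG",
    "anc1":  "GAGGTGCAACTG",
    "leafA": "GAGGTGCAACTG",
    "leafB": "GAAGTGCAACTG",
    "leafC": "GAGGTGCAGCTC",
}


def iter_parent_child_pairs(
    children: Dict[str, List[str]],
    sequences: Dict[str, str],
    root: str,
) -> Iterator[Tuple[str, str]]:
    """Yield one (parent_sequence, child_sequence) tuple per tree edge."""
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            yield sequences[node], sequences[child]
            stack.append(child)


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


pairs = list(iter_parent_child_pairs(children, sequences, "naive"))
mutation_counts = [hamming(parent, child) for parent, child in pairs]

for (parent, child), n_mut in zip(pairs, mutation_counts):
    print(parent, "->", child, f"({n_mut} substitutions)")
```

Each edge contributes exactly one pair, so a tree with four edges yields four PCPs. An edge whose parent and child sequences are identical still appears here; whether such pairs are kept in the released data is not stated in this card.
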
## Dataset Sources

The dataset was compiled using B-cell receptor (BCR) sequencing datasets from five sources:

* [Jaffe-2022](https://www.nature.com/articles/s41586-022-05371-z)
* [Tang-2022](https://www.sciencedirect.com/science/article/pii/S2589004221016382)
* [Vergani-2017](https://pubmed.ncbi.nlm.nih.gov/28959265/)
* [Engelbrecht-2025](https://www.nature.com/articles/s41467-025-66759-9)
* [Rodriguez-2023](https://www.nature.com/articles/s41467-023-40070-x)

## Citation

Consider citing our paper if you use CoSiNE in your research!

```bibtex
@article{Lu2026ConditionallySN,
  title={Conditionally Site-Independent Neural Evolution of Antibody Sequences},
  author={Stephen Zhewen Lu and Aakarsh Vermani and Kohei Sanno and Jiarui Lu and Frederick A. Matsen IV and Milind Jagota and Yun S. Song},
  journal={ArXiv},
  year={2026},
  url={https://api.semanticscholar.org/CorpusID:285973749}
}
```

## Dataset Card Contact

For questions or issues, please contact:

- Stephen Z. Lu (stephen.lu@berkeley.edu)
- Aakarsh Vermani (aakarshv@berkeley.edu)
Provided by:
thematrixmaster