thematrixmaster/cosine
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/thematrixmaster/cosine
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- biology
- antibodies
size_categories:
- 100M<n<1B
---
# CoSiNE Training Dataset
Training data for the [CoSiNE](https://github.com/thematrixmaster/cosine) model, which simulates antibody affinity maturation. Please read our [paper](https://arxiv.org/abs/2602.18982) for more details!
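As a starting point, here is a minimal sketch of how one might peek at the data with the Hugging Face `datasets` library. It assumes the repository loads with its default configuration; the split names and column names are not documented in this card, so the helper (a hypothetical name, `peek_first_records`) simply returns the first record of whatever splits exist rather than assuming a schema.

```python
REPO_ID = "thematrixmaster/cosine"  # repository id taken from this card


def peek_first_records(repo_id: str = REPO_ID):
    """Stream the dataset and return {split_name: first_record}.

    Streaming avoids downloading the full dataset just to inspect its schema.
    Assumption: the repository loads with its default configuration.
    """
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    splits = load_dataset(repo_id, streaming=True)  # dict-like: split name -> iterable
    return {name: next(iter(split)) for name, split in splits.items()}


# Usage (requires network access and `pip install datasets`):
#     for name, record in peek_first_records().items():
#         print(name, sorted(record.keys()))
```

Printing the keys of the first record is a quick way to discover the actual column names before writing any training code against them.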
## Dataset Summary
This dataset consists of approximately **2 million B-cell receptor (BCR) parent-child sequence transitions** derived from **120,000 clonal families** across **555 individual donors**.
The data was processed with a four-step phylogenetic inference pipeline designed to capture somatic hypermutation:
1. **Clonal Inference:** Sequences were clustered into clonal families and naive germlines were inferred using `partis`.
2. **Quality Filtering:** We retained only productive sequences (no stop codons, conserved CDR3 anchors) and excluded sequences with mutations in conserved signature cysteines.
3. **Phylogenetic Reconstruction:** Phylogenetic trees and ancestral sequences were inferred using **IQ-TREE** under a K80 substitution model.
4. **Paired-Chain Modeling:** For paired heavy and light chain data, we used an edge-linked-proportional partition model to account for distinct evolutionary rates across chains.
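To make the quality-filtering step more concrete, the sketch below shows what two of the stated checks could look like in isolation: rejecting sequences with an in-frame stop codon, and rejecting sequences whose conserved cysteines are mutated. This is an illustration only, not the pipeline's code: the function names are hypothetical, the reading frame is assumed to start at the first nucleotide, and the cysteine positions are assumed to be supplied by the caller (in practice they would come from a sequence annotation).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(nt_seq: str) -> bool:
    """Return True if any in-frame codon is a stop codon.

    Assumption: the reading frame starts at index 0 of `nt_seq`.
    A trailing partial codon (1-2 leftover nucleotides) is ignored.
    """
    seq = nt_seq.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def cysteines_conserved(aa_seq: str, cys_positions) -> bool:
    """Return True if every given 0-based position holds a cysteine ('C').

    `cys_positions` is a caller-supplied (hypothetical) list of conserved sites.
    """
    return all(pos < len(aa_seq) and aa_seq[pos] == "C" for pos in cys_positions)


def passes_filters(nt_seq: str, aa_seq: str, cys_positions) -> bool:
    """Combine both checks: no in-frame stop codon and intact conserved cysteines."""
    return not has_stop_codon(nt_seq) and cysteines_conserved(aa_seq, cys_positions)
```

Note that a stop triplet that appears out of frame (for example, `TAA` spanning two codons) does not trigger the first check, since only codons in the assumed reading frame are inspected.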
The final training set consists of **Parent-Child Pairs (PCPs)** extracted from the edges of these phylogenetic trees. Each pair records one inferred evolutionary step from a parent (ancestral) sequence to its child (descendant) sequence. For a detailed description of the processing pipeline, please refer to https://elifesciences.org/reviewed-preprints/109644v1.
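The idea of reading parent-child pairs off tree edges can be pictured with a small sketch. It assumes a toy in-memory tree in which every node already carries its observed or reconstructed sequence; the `Node` class and `extract_pcps` function are hypothetical names for illustration and are not part of the CoSiNE codebase or of IQ-TREE's output format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """A tree node holding an observed or reconstructed sequence."""
    name: str
    sequence: str
    children: List["Node"] = field(default_factory=list)


def extract_pcps(root: Node) -> Iterator[Tuple[str, str]]:
    """Yield one (parent_sequence, child_sequence) pair per edge of the tree."""
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            yield node.sequence, child.sequence
            stack.append(child)


# Toy example: a naive root, one internal ancestor, and two leaves.
leaf_a = Node("leaf_a", "ACGTTA")
leaf_b = Node("leaf_b", "ACCTAA")
ancestor = Node("ancestor", "ACGTAA", [leaf_a, leaf_b])
naive = Node("naive", "ACGTAC", [ancestor])

pcps = list(extract_pcps(naive))
```

A tree with N nodes yields N - 1 pairs, one per edge. This sketch keeps every edge, including any where parent and child are identical; whether such edges are kept in the released data is not stated here.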
## Dataset Sources
The dataset was compiled using B-cell receptor (BCR) sequencing datasets from five sources:
* [Jaffe-2022](https://www.nature.com/articles/s41586-022-05371-z)
* [Tang-2022](https://www.sciencedirect.com/science/article/pii/S2589004221016382)
* [Vergani-2017](https://pubmed.ncbi.nlm.nih.gov/28959265/)
* [Engelbrecht-2025](https://www.nature.com/articles/s41467-025-66759-9)
* [Rodriguez-2023](https://www.nature.com/articles/s41467-023-40070-x)
## Citation
Consider citing our paper if you use CoSiNE in your research!
```bibtex
@article{Lu2026ConditionallySN,
title={Conditionally Site-Independent Neural Evolution of Antibody Sequences},
author={Stephen Zhewen Lu and Aakarsh Vermani and Kohei Sanno and Jiarui Lu and Frederick A. {Matsen IV} and Milind Jagota and Yun S. Song},
journal={ArXiv},
year={2026},
url={https://api.semanticscholar.org/CorpusID:285973749}
}
```
## Dataset Card Contact
For questions or issues, please contact:
- Stephen Z. Lu (stephen.lu@berkeley.edu)
- Aakarsh Vermani (aakarshv@berkeley.edu)
Provider: thematrixmaster