ConvergeBio/oas-unpaired
Source: Hugging Face. Updated 2026-03-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ConvergeBio/oas-unpaired
Resource description:
---
license: cc-by-4.0
language:
- en
tags:
- antibody
- immunology
- bcr
- airr
- oas
- proteomics
- proteins
- protein
size_categories:
- 1B<n<10B
configs:
- config_name: default
  data_files:
  - split: heavy
    path: data/unpaired_heavy/**/*.parquet
  - split: light
    path: data/unpaired_light/**/*.parquet
- config_name: heavy
  data_files:
  - split: train
    path: data/unpaired_heavy/**/*.parquet
- config_name: light
  data_files:
  - split: train
    path: data/unpaired_light/**/*.parquet
- config_name: Banerjee et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Banerjee et al., 2017/*.parquet
- config_name: Bashford et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bashford et al., 2013/*.parquet
- config_name: Bender et al., 2020
  data_files:
  - split: light
    path: data/unpaired_light/Bender et al., 2020/*.parquet
- config_name: Bernardes et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bernardes et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Bernardes et al., 2020/*.parquet
- config_name: Bernat et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bernat et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Bernat et al., 2019/*.parquet
- config_name: Bhiman et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bhiman et al., 2015/*.parquet
  - split: light
    path: data/unpaired_light/Bhiman et al., 2015/*.parquet
- config_name: Bolland et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bolland et al., 2016/*.parquet
- config_name: Bonsignori et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bonsignori et al., 2016/*.parquet
- config_name: Briney et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Briney et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Briney et al., 2019/*.parquet
- config_name: Buchheim et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Buchheim et al., 2020/*.parquet
- config_name: Chen et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Chen et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Chen et al., 2020/*.parquet
- config_name: Collins et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Collins et al., 2015/*.parquet
- config_name: Corcoran et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Corcoran et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Corcoran et al., 2016/*.parquet
- config_name: Cui et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Cui et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Cui et al., 2019/*.parquet
- config_name: Davis et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Davis et al., 2019/*.parquet
- config_name: Doria-Rose et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Doria-Rose et al., 2015/*.parquet
  - split: light
    path: data/unpaired_light/Doria-Rose et al., 2015/*.parquet
- config_name: Eccles et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Eccles et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Eccles et al., 2020/*.parquet
- config_name: Eliyahu et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Eliyahu et al., 2015/*.parquet
- config_name: Ellebedy et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Ellebedy et al., 2016/*.parquet
- config_name: Fisher et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Fisher et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Fisher et al., 2017/*.parquet
- config_name: Galson et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2015/*.parquet
- config_name: Galson et al., 2015a
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2015a/*.parquet
- config_name: Galson et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2016/*.parquet
- config_name: Galson et al., 2016a
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2016a/*.parquet
- config_name: Galson et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Galson et al., 2020/*.parquet
- config_name: Ghraichy et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Ghraichy et al., 2020/*.parquet
- config_name: Gidoni et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Gidoni et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Gidoni et al., 2019/*.parquet
- config_name: Greif et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Greif et al., 2015/*.parquet
- config_name: Greiff et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Greiff et al., 2014/*.parquet
- config_name: Greiff et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Greiff et al., 2017/*.parquet
- config_name: Gupta et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Gupta et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Gupta et al., 2017/*.parquet
- config_name: Halliley et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Halliley et al., 2015/*.parquet
- config_name: Huang et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Huang et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Huang et al., 2016/*.parquet
- config_name: Jaffe et al., 2022
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Jaffe et al., 2022/*.parquet
  - split: light
    path: data/unpaired_light/Jaffe et al., 2022/*.parquet
- config_name: Jiang et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Jiang et al., 2013/*.parquet
- config_name: Johnson et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Johnson et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Johnson et al., 2018/*.parquet
- config_name: Joyce et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Joyce et al., 2016/*.parquet
- config_name: Khan et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Khan et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Khan et al., 2016/*.parquet
- config_name: Kim et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Kim et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Kim et al., 2020/*.parquet
- config_name: King et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/King et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/King et al., 2020/*.parquet
- config_name: Kuri-Cervantes et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Kuri-Cervantes et al., 2020/*.parquet
- config_name: Levin et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Levin et al., 2016/*.parquet
- config_name: Levin et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Levin et al., 2017/*.parquet
- config_name: Li et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Li et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Li et al., 2017/*.parquet
- config_name: Liao et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Liao et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Liao et al., 2013/*.parquet
- config_name: Lindner et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Lindner et al., 2015/*.parquet
- config_name: Meng et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Meng et al., 2017/*.parquet
- config_name: Menzel et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Menzel et al., 2014/*.parquet
- config_name: Montague et al., 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Montague et al., 2021/*.parquet
- config_name: Mor et al., 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Mor et al., 2021/*.parquet
  - split: light
    path: data/unpaired_light/Mor et al., 2021/*.parquet
- config_name: Mroczek et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Mroczek et al., 2014/*.parquet
- config_name: Mukhamedova et al. 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Mukhamedova et al. 2021/*.parquet
  - split: light
    path: data/unpaired_light/Mukhamedova et al. 2021/*.parquet
- config_name: Nielsen et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Nielsen et al., 2020/*.parquet
- config_name: Ohm-Laursen et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Ohm-Laursen et al., 2018/*.parquet
- config_name: Ota et al., 2010
  data_files:
  - split: light
    path: data/unpaired_light/Ota et al., 2010/*.parquet
- config_name: Palanichamy et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Palanichamy et al., 2014/*.parquet
- config_name: Parameswaran et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Parameswaran et al., 2014/*.parquet
- config_name: Prohaska et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Prohaska et al., 2018/*.parquet
- config_name: Rettig et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Rettig et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Rettig et al., 2018/*.parquet
- config_name: Richardson et al., 2022
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Richardson et al., 2022/*.parquet
- config_name: Rubelt et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Rubelt et al., 2016/*.parquet
- config_name: Schanz et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Schanz et al., 2014/*.parquet
  - split: light
    path: data/unpaired_light/Schanz et al., 2014/*.parquet
- config_name: Schultheiss et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Schultheiss et al., 2020/*.parquet
- config_name: Setliff et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Setliff et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Setliff et al., 2018/*.parquet
- config_name: Sevy et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Sevy et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Sevy et al., 2019/*.parquet
- config_name: Sheng et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Sheng et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Sheng et al., 2017/*.parquet
- config_name: Simonich et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Simonich et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Simonich et al., 2020/*.parquet
- config_name: Soto et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Soto et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Soto et al., 2016/*.parquet
- config_name: Soto et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Soto et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Soto et al., 2019/*.parquet
- config_name: Stern et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Stern et al., 2014/*.parquet
  - split: light
    path: data/unpaired_light/Stern et al., 2014/*.parquet
- config_name: Sundling et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Sundling et al., 2014/*.parquet
- config_name: Tipton et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Tipton et al., 2015/*.parquet
- config_name: Tong et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Tong et al., 2017/*.parquet
- config_name: Turchaninova et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Turchaninova et al., 2015/*.parquet
- config_name: Turner et al., 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Turner et al., 2021/*.parquet
  - split: light
    path: data/unpaired_light/Turner et al., 2021/*.parquet
- config_name: VanDuijn et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/VanDuijn et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/VanDuijn et al., 2017/*.parquet
- config_name: Vander Heiden et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Vander Heiden et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Vander Heiden et al., 2017/*.parquet
- config_name: Vergani et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Vergani et al., 2017/*.parquet
- config_name: Waltari et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Waltari et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Waltari et al., 2018/*.parquet
- config_name: Wesemann et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wesemann et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Wesemann et al., 2013/*.parquet
- config_name: Woodruff et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Woodruff et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Woodruff et al., 2020/*.parquet
- config_name: Wu et al., 2011
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wu et al., 2011/*.parquet
  - split: light
    path: data/unpaired_light/Wu et al., 2011/*.parquet
- config_name: Wu et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wu et al., 2014/*.parquet
- config_name: Wu et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wu et al., 2015/*.parquet
  - split: light
    path: data/unpaired_light/Wu et al., 2015/*.parquet
- config_name: Zhou et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhou et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Zhou et al., 2013/*.parquet
- config_name: Zhou et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhou et al., 2015/*.parquet
- config_name: Zhu et al., 2012
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhu et al., 2012/*.parquet
  - split: light
    path: data/unpaired_light/Zhu et al., 2012/*.parquet
- config_name: Zhu et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhu et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Zhu et al., 2013/*.parquet
pretty_name: OAS Unpaired
---
# OAS Unpaired
The unpaired sequences from [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/), available as Parquet with content-defined chunking on Hugging Face.
## Configs and Splits
This dataset exposes **91 configs**:
| Config | Splits | Description |
|--------|--------|-------------|
| `default` | `heavy`, `light` | All sequences, split by chain |
| `heavy` | `train` | All heavy chain sequences |
| `light` | `train` | All light chain sequences |
| `{Author et al., YYYY}` | `heavy`, `light`, or both | Sequences from a single study |
```python
from datasets import load_dataset
# All heavy chain sequences
ds = load_dataset("ConvergeBio/oas-unpaired", "heavy", split="train", streaming=True)
# All light chain sequences
ds = load_dataset("ConvergeBio/oas-unpaired", "light", split="train", streaming=True)
# One author
ds = load_dataset("ConvergeBio/oas-unpaired", "Briney et al., 2019", split="heavy")
```
---
## Usage
### Load specific columns
```python
from datasets import load_dataset
ds = load_dataset(
    "ConvergeBio/oas-unpaired", "Briney et al., 2019",
    split="heavy",
    columns=["sequence_alignment_aa", "cdr3_aa", "v_call", "j_call", "meta_Subject"],
)
df = ds.to_pandas()
```
### Load a specific author
```python
from datasets import load_dataset
ds = load_dataset("ConvergeBio/oas-unpaired", "Gidoni et al., 2019", split="heavy")
print(ds)
# Dataset({features: [...], num_rows: ...})
```
### Stream and filter
```python
from datasets import load_dataset
ds = load_dataset(
    "ConvergeBio/oas-unpaired", "heavy",
    split="train",
    streaming=True,
)
productive = ds.filter(lambda x: x["productive"] == "T")
```
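A streaming dataset is consumed lazily, so a filter like the one above only does work as rows are iterated. The following is a minimal sketch of pulling a small sample of productive sequences, assuming each row is a dict whose `productive` field is `"T"` or `"F"` as in the filter above. The helper `take_productive` and the toy rows are hypothetical stand-ins so the snippet runs without network access.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_productive(rows: Iterable[dict], n: int) -> Iterator[dict]:
    """Lazily yield at most n rows whose `productive` flag is "T"."""
    productive = (row for row in rows if row.get("productive") == "T")
    return islice(productive, n)


# Toy rows standing in for a streamed dataset (values are made up).
toy_rows = [
    {"cdr3_aa": "ARDYW", "productive": "T"},
    {"cdr3_aa": "AR*YW", "productive": "F"},
    {"cdr3_aa": "AKGGFDYW", "productive": "T"},
    {"cdr3_aa": "ARWGQGTLV", "productive": "T"},
]

sample = list(take_productive(toy_rows, 2))
print([row["cdr3_aa"] for row in sample])
```

Since a streaming dataset iterates as dict rows, passing the streamed `ds` in place of `toy_rows` should behave the same way and stop reading once `n` matches are found.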
### With DuckDB
```python
import duckdb
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("CREATE SECRET (TYPE HUGGINGFACE, TOKEN 'hf_...');")
# CDR3 lengths for one author
df = con.sql("""
SELECT cdr3_aa, length(cdr3_aa) as cdr3_len, v_call, j_call, meta_Subject
FROM 'hf://datasets/ConvergeBio/oas-unpaired/data/unpaired_heavy/Briney et al., 2019/*.parquet'
WHERE productive = 'T'
""").df()
# Sequence counts per author across all heavy chains
df = con.sql("""
SELECT meta_Author, COUNT(*) as n_sequences
FROM 'hf://datasets/ConvergeBio/oas-unpaired/data/unpaired_heavy/**/*.parquet'
GROUP BY meta_Author
ORDER BY n_sequences DESC
""").df()
```
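The DuckDB queries above address Parquet files by path rather than by config name. Judging from the `configs` metadata of this card, each per-study config points at `data/unpaired_{chain}/{config name}/*.parquet`. The sketch below builds such paths and a query string under that assumption; `study_glob` and `count_query` are hypothetical helper names, not part of any library.

```python
REPO = "ConvergeBio/oas-unpaired"


def study_glob(study: str, chain: str) -> str:
    """Build the hf:// glob for one study and chain ("heavy" or "light")."""
    if chain not in ("heavy", "light"):
        raise ValueError(f"chain must be 'heavy' or 'light', got {chain!r}")
    return f"hf://datasets/{REPO}/data/unpaired_{chain}/{study}/*.parquet"


def count_query(study: str, chain: str) -> str:
    """Return a SQL string counting productive sequences for one study."""
    # Double any single quotes so the path is a valid SQL string literal.
    path = study_glob(study, chain).replace("'", "''")
    return f"SELECT COUNT(*) AS n FROM '{path}' WHERE productive = 'T'"


print(study_glob("Briney et al., 2019", "heavy"))
print(count_query("Gidoni et al., 2019", "light"))
```

The resulting string can be passed to `con.sql(...)` as in the block above. Not every study has both chains, so a glob for a missing chain would presumably match no files.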
---
## Studies
Each entry is a separate config. Sequence counts come from OAS source file headers. Duplicate rates are based on xxHash128 of `sequence_alignment_aa`. Click an author name to view the original publication.
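As a rough illustration of a hash-based duplicate rate, the sketch below hashes each sequence to 128 bits and reports the share of rows whose hash was already seen. Two things here are assumptions rather than the dataset's documented method: the rate is taken to be `(total - unique) / total`, and BLAKE2b with a 16-byte digest from the standard library is swapped in for xxHash128 so the snippet has no third-party dependency. The toy sequences are made up.

```python
import hashlib
from typing import Iterable


def hash128(sequence: str) -> bytes:
    """128-bit digest of a sequence (BLAKE2b stand-in for xxHash128)."""
    return hashlib.blake2b(sequence.encode("utf-8"), digest_size=16).digest()


def duplicate_rate(sequences: Iterable[str]) -> float:
    """Percent of sequences whose hash repeats an earlier one."""
    seen = set()
    total = 0
    for seq in sequences:
        total += 1
        seen.add(hash128(seq))
    if total == 0:
        return 0.0
    return 100.0 * (total - len(seen)) / total


# Made-up sequences, not real OAS data: one of four repeats an earlier one.
toy = ["EVQLVESGG", "QVQLQESGP", "EVQLVESGG", "DIQMTQSPS"]
print(f"{duplicate_rate(toy):.2f}%")  # prints 25.00%
```

Storing fixed-size digests instead of full sequences keeps memory per row constant, which matters at the scale of the larger studies listed here.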
| Author | Heavy | H Dup% | Light | L Dup% | Summary |
|--------|------:|-------:|------:|-------:|---------|
| [Banerjee et al., 2017](https://doi.org/10.1016/j.virol.2017.02.015) | 3.5M | 1.97% | — | — | Rabbit heavy chains from HIV-vaccinated animals; vaccine designed to elicit MPER 4E10/10E8 broadly neutralizing antibodies. |
| [Bashford et al., 2013](https://doi.org/10.1101/gr.154815.113) | 258.2K | 0.03% | — | — | Human heavy chains from chronic lymphocytic leukemia (CLL) patients; B-cell receptor network analysis via deep sequencing. |
| [Bender et al., 2020](https://doi.org/10.1182/blood.2019004197) | — | — | 1.1M | 0.43% | Human light chains from bone marrow of POEMS syndrome patients (rare plasma cell disorder with monoclonal immunoglobulin). |
| [Bernardes et al., 2020](https://doi.org/10.1016/j.immuni.2020.11.017) | 3.0M | 8.99% | 19 | — | Human heavy chains from COVID-19 patients; longitudinal multi-omics study of immune cell responses during SARS-CoV-2 infection. |
| [Bernat et al., 2019](https://doi.org/10.3389/fimmu.2019.00660) | 2.8M | 1.03% | 353.3K | 0.09% | Human heavy and light chains from healthy PBMC; optimized NGS library preparation for immunoglobulin germline gene inference. |
| [Bhiman et al., 2015](https://doi.org/10.1038/nm.3963) | 88.5K | 1.21% | 129.3K | 0.02% | Human heavy and light chains from HIV-infected donors tracking viral variants that initiate V1V2-directed broadly neutralizing antibody maturation. |
| [Bolland et al., 2016](https://doi.org/10.1016/j.celrep.2016.05.020) | 27.6K | 0.09% | — | — | Mouse pro-B cell heavy chains studying chromatin states that regulate efficient V(D)J recombination at the immunoglobulin locus. |
| [Bonsignori et al., 2016](https://doi.org/10.1016/j.cell.2016.02.022) | 202.7K | 1.6% | — | — | Human heavy chains from HIV-infected donors tracing the germline-to-neutralizer maturation pathway of a CD4-mimicking broadly neutralizing antibody. |
| [Briney et al., 2019](https://doi.org/10.1038/s41586-019-0879-y) | 940.3M | 7.23% | 18 | — | Very large human heavy chain dataset (~940M sequences) from 10 healthy donors revealing commonality despite exceptional diversity in baseline human antibody repertoires. |
| [Buchheim et al., 2020](https://doi.org/10.1096/fj.202001403RR) | 5.2M | 0.74% | — | — | Human IgM heavy chains from astronauts and ground controls; studies IgM repertoire plasticity during long-term spaceflight. |
| [Chen et al., 2020](https://doi.org/10.1371/journal.pone.0235713) | 776.4K | 6.9% | 2.1M | 8.11% | Human heavy and light chains from bone marrow of light chain amyloidosis patients; diverse patterns of antibody variable gene disruption. |
| [Collins et al., 2015](https://doi.org/10.1098/rstb.2014.0236) | 359.5K | 37.98% | — | — | Mouse heavy chains from spleen showing the murine antibody repertoire is germline-focused and highly variable across individuals. |
| [Corcoran et al., 2016](https://doi.org/10.1038/ncomms13642) | 3.7M | 0.16% | 1.2M | 0.62% | Cross-species (human, mouse C57BL/6, mouse BALB/c, rhesus) heavy and light chains; individualized V gene databases reveal high immunoglobulin gene polymorphism. |
| [Cui et al., 2019](https://doi.org/10.4049/jimmunol.1502263) | 5.5K | 11.91% | 1.0M | 8.76% | Mouse memory B cell heavy and light chains after NP-CGG immunization; models somatic hypermutation targeting in mice. |
| [Davis et al., 2019](https://doi.org/10.1016/j.cell.2019.04.036) | 11.9M | 0.95% | — | — | Human heavy chains from Ebola virus-infected donors across B cell subsets; longitudinal analysis of the B cell response to Ebola infection. |
| [Doria-Rose et al., 2015](https://doi.org/10.1038/nature13036) | 526.1K | 0.66% | 415.3K | 0.19% | Human heavy and light chains from HIV-infected donors tracking the developmental pathway of potent V1V2-directed broadly neutralizing antibodies. |
| [Eccles et al., 2020](https://doi.org/10.1016/j.celrep.2019.12.027) | 796 | — | 13.6K | 3.91% | Human heavy and light chains from rhinovirus-reactive (T-bet+) memory B cells linking local cross-reactive IgG to rhinovirus infection. |
| [Eliyahu et al., 2015](https://doi.org/10.3389/fimmu.2018.03004) | 1.6M | 0.42% | — | — | Human heavy chains from hepatitis C virus (HCV) infected donors identifying immune signatures and potential therapeutic antibody targets. |
| [Ellebedy et al., 2016](https://doi.org/10.1038/ni.3533) | 11.3M | 1.48% | — | — | Human heavy chains from multiple B cell subsets (naive, memory, plasmablasts) defining antigen-specific B cell responses after seasonal influenza vaccination. |
| [Fisher et al., 2017](https://doi.org/10.1371/journal.ppat.1006469) | 29.9K | 0.27% | 130.4K | 5.84% | Mouse heavy and light chains from Plasmodium-immunized BALB/c spleen; T-dependent B cell responses forming high-avidity anti-parasite antibodies. |
| [Galson et al., 2015](https://doi.org/10.1016/j.ebiom.2015.11.034) | 15.9M | 2.72% | — | — | Human heavy chains tracking B cell repertoire dynamics following hepatitis B vaccination. |
| [Galson et al., 2016](https://doi.org/10.1186/s13073-016-0322-z) | 16.1M | 1.54% | — | — | Human heavy chains studying B cell repertoire dynamics after sequential hepatitis B vaccination; evidence for clonal B cell persistence. |
| [Galson et al., 2016a](https://doi.org/10.1038/srep37229) | 4.4M | — | — | — | Human heavy chains from plasma cells after pandemic H1N1 influenza vaccination, investigating AS03 adjuvant effects on B cell repertoire. |
| [Galson et al., 2020](https://doi.org/10.3389/fimmu.2020.605170) | 4.6M | 2.57% | 46 | — | Human heavy and light chains from COVID-19 patients; one of the early deep BCR sequencing studies of SARS-CoV-2 infection. |
| [Galson et al., 2015a](https://doi.org/10.1038/icb.2015.57) | 3.9M | 2.23% | — | — | Human heavy chains across B cell subsets (naive, memory, plasma, unsorted) after meningococcal (MenACWY) conjugate and polysaccharide vaccination. |
| [Ghraichy et al., 2020](https://doi.org/10.3389/fimmu.2020.01734) | 8.3M | 2.95% | — | — | Human heavy chain repertoire from healthy donors across age groups (children to elderly); studies age-related maturation of immunoglobulin diversity. |
| [Gidoni et al., 2019](https://doi.org/10.1038/s41467-019-08489-3) | 13.6M | 0.86% | 12.7M | 28.45% | Human naive B cell heavy and light chains revealing mosaic deletion patterns in the antibody heavy chain gene locus; includes healthy and celiac disease subjects. |
| [Greif et al., 2015](https://doi.org/10.1186/s13073-015-0169-8) | 552.0K | 1.04% | — | — | Mouse heavy chains across B cell subsets (naive, ASC, plasma cells) after NP-CGG immunization; framework for immune repertoire diversity profiling. |
| [Greiff et al., 2014](https://doi.org/10.1186/s12865-014-0040-5) | 3.4M | 13.88% | — | — | Mouse plasmablast/plasma cell heavy chains after NP-CGG immunization; quantitative assessment of NGS-based antibody repertoire sequencing robustness. |
| [Greiff et al., 2017](https://doi.org/10.1016/j.celrep.2017.04.054) | 138.8M | 5.54% | — | — | Very large mouse heavy chain dataset (multiple strains, antigens, tissues); systems analysis of genetic and antigen-driven predetermination of antibody repertoire structure. |
| [Gupta et al., 2017](https://doi.org/10.4049/jimmunol.1601850) | 3.1M | 2.82% | 9.3M | 6.96% | Human naive B cell heavy and light chains from flu/HepB vaccinated donors; used to evaluate hierarchical clustering methods for identifying B cell clones. |
| [Halliley et al., 2015](https://doi.org/10.1016/j.immuni.2015.06.016) | 593.3K | 1.6% | — | — | Human bone marrow plasma cell heavy chains after tetanus/flu vaccination; identifies long-lived plasma cells within the CD19-CD38hiCD138+ subset. |
| [Huang et al., 2016](https://doi.org/10.1016/j.immuni.2016.10.027) | 3.6M | 5.57% | 3.5M | 11.33% | Human memory B cell heavy and light chains from HIV-infected donors; identifies CD4-binding-site antibodies that evolved near-pan HIV neutralization. |
| [Jaffe et al., 2022](https://doi.org/10.1038/s41586-022-05371-z) | 1.6M | 3.71% | 969.5K | 37.19% | Human heavy and light chains from COVID-19 and CMV donors; demonstrates that functional antibodies exhibit light chain coherence (light chain pairing bias). |
| [Jiang et al., 2013](https://doi.org/10.1126/scitranslmed.3004794) | 3.7M | 14.33% | — | — | Human heavy chains from influenza-vaccinated donors; examines lineage structure of the antibody repertoire in response to influenza vaccination. |
| [Johnson et al., 2018](https://doi.org/10.1038/s41467-018-06424-6) | 8.6M | 2.82% | 1.9M | 2.83% | Human heavy and light chains from HIV-infected donors; sequences broadly neutralizing antibody exons and introns revealing detailed aspects of antibody evolution. |
| [Joyce et al., 2016](https://doi.org/10.1016/j.cell.2016.06.043) | 1.6M | — | — | — | Human heavy chains used in a vaccine study identifying broadly protective antibodies that neutralize both group 1 and group 2 influenza A viruses. |
| [Khan et al., 2016](https://doi.org/10.1126/sciadv.1501371) | 12.0M | 17.76% | 5 | — | Mouse heavy chains from OVA-immunized BALB/c spleen; demonstrates accurate antibody repertoire profiling by molecular amplification fingerprinting. |
| [Kim et al., 2020](https://doi.org/10.1126/scitranslmed.abd6990) | 45.1M | 4.37% | 19.6M | 10.51% | Large human heavy and light chain dataset from COVID-19 patients; identifies stereotypic VH antibodies that neutralize SARS-CoV-2. |
| [King et al., 2020](https://doi.org/10.1126/sciimmunol.abe6291) | 13.7M | 0.7% | 45.9K | 14.88% | Human heavy and light chains from tonsillar B cells across subsets (GC, memory, naive, plasmablast); single-cell analysis predicts antibody class switching. |
| [Kuri-Cervantes et al., 2020](https://doi.org/10.1126/sciimmunol.abd7114) | 8.8M | 1.45% | — | — | Human heavy chains from COVID-19 patients; comprehensive mapping of immune perturbations associated with severe COVID-19. |
| [Levin et al., 2016](https://doi.org/10.1016/j.jaci.2015.09.027) | 675.9K | 1.49% | — | — | Human heavy chains from allergy patients with/without subcutaneous immunotherapy (SIT); persistence and evolution of allergen-specific IgE repertoires. |
| [Levin et al., 2017](https://doi.org/10.1016/j.jaci.2016.06.040) | 13.1M | 5.67% | — | — | Human heavy chains from bone marrow and blood of IgE allergy patients; focuses on bone marrow as an antibody-encoding IgE-producing niche. |
| [Li et al., 2017](https://doi.org/10.1371/journal.pone.0161801) | 1.6M | 0.34% | 355 | 0.56% | Heavy and light chain repertoire from Bactrian camels; comparative analysis of conventional and heavy-chain-only (nanobody precursor) antibody repertoires. |
| [Liao et al., 2013](https://doi.org/10.1038/nature12053) | 411.6K | 1.32% | 333.4K | 1.63% | Human heavy and light chains from an HIV-infected donor; tracks co-evolution of the broadly neutralizing antibody VRC01 and its founder virus. |
| [Lindner et al., 2015](https://doi.org/10.1038/ni.3213) | 741.5K | 3.7% | — | — | Mouse heavy chains from small intestinal B cells; microbial colonization drives diversification of memory B cells producing secretory IgA. |
| [Meng et al., 2017](https://doi.org/10.1038/nbt.3942) | 32.2M | 1.89% | — | — | Human heavy chain atlas across 8 tissues (blood, bone marrow, lung, gut, spleen, etc.); maps B cell clonal distribution throughout the human body. |
| [Menzel et al., 2014](https://doi.org/10.1371/journal.pone.0096727) | 8.3M | 22.07% | — | — | Mouse plasmablast/plasma cell heavy chains after NP-CGG immunization; comprehensive evaluation of amplicon library preparation methods for repertoire sequencing. |
| [Montague et al., 2021](https://doi.org/10.1016/j.celrep.2021.109173) | 10.2M | 7.03% | — | — | Human heavy chains from COVID-19 patients; studies dynamics of B cell repertoire and emergence of cross-reactive antibody responses. |
| [Mor et al., 2021](https://doi.org/10.1371/journal.ppat.1009165) | 81.1K | 7.15% | 139.5K | 4.37% | Human heavy and light chains from severe COVID-19 patients; identifies multi-clonal SARS-CoV-2 neutralizing antibodies. |
| [Mroczek et al., 2014](https://doi.org/10.3389/fimmu.2014.00096) | 121.5K | 0.17% | — | — | Human heavy chains across B cell subsets (immature, naive, memory, plasma) from healthy donors; analyzes repertoire composition by B cell subset. |
| [Mukhamedova et al. 2021](https://doi.org/10.1016/j.immuni.2021.03.004) | 447.9K | 0.59% | 452.4K | 5.93% | Human heavy and light chains from RSV prefusion-protein vaccinated donors; studies antibody responses to respiratory syncytial virus (RSV). |
| [Nielsen et al., 2020](https://doi.org/10.1101/2020.07.08.194456) | 12.1M | 3.61% | — | — | Human heavy chains from COVID-19 patients and nasopharyngeal swabs; studies clonal B cell expansion and convergent antibody responses to SARS-CoV-2. |
| [Ohm-Laursen et al., 2018](https://doi.org/10.3389/fimmu.2018.01976) | 7.2M | 5.43% | — | — | Human heavy chains from bronchial biopsies and blood of asthma patients; studies local clonal B cell diversification and dissemination in the airway. |
| [Ota et al., 2010](https://doi.org/10.4049/jimmunol.1002176) | — | — | 20.1K | 9.88% | Mouse light chains from healthy spleen; studies how BAFF regulates B cell receptor repertoire composition and self-reactivity. |
| [Palanichamy et al., 2014](https://doi.org/10.1126/scitranslmed.3008930) | 339.6K | 0.12% | — | — | Human heavy chains from cerebrospinal fluid and blood of multiple sclerosis patients; immunoglobulin class-switched B cells form a CNS-periphery immune axis. |
| [Parameswaran et al., 2014](https://doi.org/10.1016/j.chom.2013.05.008) | 314.1K | 4.52% | — | — | Human heavy chains from dengue fever and non-dengue febrile illness donors; identifies convergent antibody signatures across multiple individuals. |
| [Prohaska et al., 2018](https://doi.org/10.4049/jimmunol.1700568) | 255.0K | 1.37% | — | — | Mouse heavy chains from B cell subsets (B-1a, B-1b, B-2, follicular, marginal zone) in peritoneal cavity and spleen; highlights innate-like B cell repertoire differences. |
| [Rettig et al., 2018](https://doi.org/10.1371/journal.pone.0190982) | 27.8K | 2.67% | 30.8K | 17.56% | Mouse heavy and light chains from healthy spleen; naive repertoire characterization using unamplified (no PCR) high-throughput sequencing to minimize amplification bias. |
| [Richardson et al., 2022](https://doi.org/10.1101/2022.06.27.497709) | 406.9K | 3.16% | — | — | Heavy chains from Kymouse (humanized transgenic) naive splenic B cells; characterizes the human-like immune repertoire in this model organism. |
| [Rubelt et al., 2016](https://doi.org/10.1038/ncomms11112) | 2.2M | 0.87% | — | — | Human heavy chains from memory and naive B cells in twins; heritable individual differences drive unique B cell receptor repertoire formation. |
| [Schanz et al., 2014](https://doi.org/10.1371/journal.pone.0111726) | 4.3M | 3.36% | 1.7M | 0.84% | Human heavy and light chains from HIV-infected donors using isotype-specific (IgG, IgM) high-throughput immunoglobulin sequencing. |
| [Schultheiss et al., 2020](https://doi.org/10.1016/j.immuni.2020.06.024) | 4.7M | 0.21% | — | — | Human heavy chains from COVID-19 patients; next-generation sequencing of both T and B cell receptor repertoires from COVID-19 patients and healthy controls. |
| [Setliff et al., 2018](https://doi.org/10.1016/j.chom.2018.05.001) | 22.5M | 2.91% | 1.9M | 0.8% | Large longitudinal human heavy and light chain dataset from HIV-infected donors; reveals stable clonal memory B cell pools across multiple donors. |
| [Sevy et al., 2019](https://doi.org/10.1186/s12859-019-3281-8) | 18.7M | 0.4% | 74 | 1.35% | Human heavy chains from HIV-infected and flu-vaccinated donors; repertoire fingerprinting by PCA reveals shared clonotypes across individuals. |
| [Sheng et al., 2017](https://doi.org/10.3389/fimmu.2017.00537) | 541.8K | 2.6% | 755.9K | 13.83% | Human heavy and light chains from healthy PBMC; describes gene-specific amino acid substitution profiles quantifying somatic hypermutation type and frequency. |
| [Simonich et al., 2020](https://doi.org/10.1038/s41467-019-09481-7) | 847.0K | 0.05% | 1.2M | 1.42% | Human heavy and light chains from HIV-infected infants; kappa light chain maturation drives rapid development of broadly neutralizing antibodies. |
| [Soto et al., 2016](https://doi.org/10.1371/journal.pone.0157409) | 333.9K | 2.99% | 422.3K | — | Human heavy and light chains from HIV-infected donors; traces the developmental pathway of the MPER-directed broadly neutralizing antibody 10E8. |
| [Soto et al., 2019](https://doi.org/10.1038/s41586-019-0934-8) | 553.3M | 19.91% | 242.7M | 25.32% | Very large human heavy and light chain dataset (~796M sequences) from healthy donors; demonstrates high frequency of shared clonotypes in human B cell receptor repertoires. |
| [Stern et al., 2014](https://doi.org/10.1126/scitranslmed.3008879) | 10.1M | 8.62% | 207 | 0.48% | Human heavy chains from multiple sclerosis patient brain lesions and draining cervical lymph nodes; B cells populating the MS brain mature in the CNS. |
| [Sundling et al., 2014](https://doi.org/10.4049/jimmunol.1303334) | 130.2K | 0.23% | — | — | Rhesus macaque IgG-switched heavy chains from PBMC after HIV vaccination; single-cell and deep sequencing reveals diverse antibody responses. |
| [Tipton et al., 2015](https://doi.org/10.1038/ni.3175) | 15.7M | 1.58% | — | — | Human heavy chains from SLE patients and healthy controls; studies diversity, cellular origin, and autoreactivity of antibody-secreting cells. |
| [Tong et al., 2017](https://doi.org/10.1073/pnas.1704962114) | 59.3K | 0.21% | — | — | Mouse heavy chains from OVA-immunized spleen and bone marrow; studies how IgH isotype-specific B cell receptor expression influences B cell fate. |
| [Turchaninova et al., 2015](https://doi.org/10.1038/nprot.2016.093) | 201.4K | 0.01% | — | — | Human heavy chains from memory, naive, and plasma B cells; demonstrates high-quality full-length immunoglobulin profiling using unique molecular barcodes. |
| [Turner et al., 2021](https://doi.org/10.1038/s41586-021-03738-2) | 1.6M | 0.64% | 11.7K | 4.66% | Human heavy and light chains from germinal center B cells and plasmablasts after SARS-CoV-2 mRNA vaccination; persistent germinal center responses observed. |
| [VanDuijn et al., 2017](https://doi.org/10.3389/fimmu.2017.01286) | 5.2M | 1.5% | 9 | — | Rat heavy chains from DNP/HuD-immunized spleen; studies immune repertoire by combining next-generation sequencing with protein mass spectrometry. |
| [Vander Heiden et al., 2017](https://doi.org/10.4049/jimmunol.1601415) | 2.5M | 3.36% | 5.3M | 9.48% | Human heavy and light chains from myasthenia gravis (AChR-MG and MuSK-MG) patients; B cell repertoire dysregulation in autoimmune disease. |
| [Vergani et al., 2017](https://doi.org/10.3389/fimmu.2017.01157) | 13.5M | 5.94% | — | — | Human heavy chains from healthy naive and unsorted B cells; presents a novel high-throughput method for full-length IGHV-D-J sequencing. |
| [Waltari et al., 2018](https://doi.org/10.3389/fimmu.2018.00628) | 29.6M | 0.78% | 45.1M | 6.55% | Large heavy and light chain dataset from HIV-infected donors and humanized mice across multiple tissues; 5' RACE amplification maps B cell receptor features. |
| [Wesemann et al., 2013](https://doi.org/10.1038/nature12496) | 37.0K | 1.9% | 29.9K | 27.03% | Mouse heavy and light chains from gut lamina propria, bone marrow, and spleen; gut microbial colonization influences early B cell lineage development. |
| [Woodruff et al., 2020](https://doi.org/10.1038/s41590-020-00814-z) | 18.3K | 4.11% | 45.5K | 1.97% | Human heavy and light chains from antibody-secreting and naive B cells in COVID-19; extrafollicular B cell responses correlate with neutralizing antibodies and morbidity. |
| [Wu et al., 2011](https://doi.org/10.1126/science.1207532) | 271.5K | 2.83% | 37.8K | — | Human heavy and light chains from HIV-infected donors; focused evolution of broadly neutralizing antibodies revealed by structures and deep sequencing. |
| [Wu et al., 2014](https://doi.org/10.1016/j.jaci.2014.07.010) | 37.5K | 2.07% | — | — | Human heavy chains from allergic rhinitis patients in- and out-of-season; seasonal grass pollen exposure shapes local and peripheral blood IgE repertoires. |
| [Wu et al., 2015](https://doi.org/10.1016/j.cell.2015.03.004) | 1.4M | 15.47% | 827.2K | 5.74% | Human heavy and light chains from a single HIV-infected donor over 15 years; tracks maturation and diversification of the VRC01 broadly neutralizing antibody lineage. |
| [Zhou et al., 2013](https://doi.org/10.1016/j.immuni.2013.04.012) | 302.8K | 1.14% | 691.0K | 1.19% | Human heavy and light chains from multiple HIV-infected donors; multidonor analysis of structural elements, genetic determinants, and maturation of broadly neutralizing antibodies. |
| [Zhou et al., 2015](https://doi.org/10.1016/j.cell.2015.05.007) | 383.2K | 1.01% | — | — | Human heavy chains from HIV-infected donors; structural repertoire of antibodies targeting the CD4 supersite on HIV-1. |
| [Zhu et al., 2012](https://doi.org/10.3389/fmicb.2012.00315) | 200.0K | 0.51% | 115.1K | — | Human heavy and light chains from HIV-infected donors; identifies somatic populations of PGT135-137 broadly neutralizing antibodies by deep sequencing. |
| [Zhu et al., 2013](https://doi.org/10.1073/pnas.1306262110) | 533.7K | 5.1% | 478.8K | 9.19% | Human heavy and light chains from HIV-infected donors; de novo identification of VRC01-class HIV-1 broadly neutralizing antibodies by next-generation sequencing. |
*Total: 2,070,782,127 heavy + 356,864,753 light sequences across all studies.*
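The per-study counts in the table are abbreviated (`314.1K`, `553.3M`), while the totals are given exactly. A minimal Python sketch for turning those cells into integers, assuming `K` and `M` mean thousands and millions and that a dash means no sequences for that chain; `parse_count` is a hypothetical helper, not part of any dataset tooling:

```python
from decimal import Decimal

# Assumed multipliers for the abbreviated counts used in the table.
_SUFFIXES = {"K": 1_000, "M": 1_000_000}

def parse_count(text: str) -> int:
    """Convert a table cell such as '314.1K', '74', or '—' to an integer."""
    text = text.strip().replace(",", "")
    if text in {"", "—", "-"}:
        return 0  # assumption: a dash means no sequences for that chain
    suffix = text[-1].upper()
    if suffix in _SUFFIXES:
        # Decimal avoids binary floating-point rounding on values like 314.1
        return int(Decimal(text[:-1]) * _SUFFIXES[suffix])
    return int(text)

# Exact totals quoted in the summary line above.
heavy_total = parse_count("2,070,782,127")
light_total = parse_count("356,864,753")
combined = heavy_total + light_total
print(f"{combined:,} sequences in total")
```

Because the abbreviated values are rounded, summing parsed per-study counts will only approximate the exact totals.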
---
## Schema
Each row is one antibody sequence. Fields follow the [AIRR Community rearrangement schema](https://docs.airr-community.org/en/stable/datarep/rearrangements.html), supplemented by OAS annotation and study metadata columns.
### Core AIRR fields
| Column | Type | Description |
|--------|------|-------------|
| `sequence` | string | Raw nucleotide sequence |
| `locus` | string | `IGH`, `IGK`, or `IGL` |
| `v_call`, `d_call`, `j_call` | string | V/D/J gene assignments |
| `sequence_alignment` | string | Aligned nucleotide sequence |
| `sequence_alignment_aa` | string | Aligned amino acid sequence |
| `junction` | string | Junction nucleotides |
| `junction_aa` | string | Junction amino acids |
| `cdr1_aa`, `cdr2_aa`, `cdr3_aa` | string | CDR amino acid sequences |
| `fwr1_aa` … `fwr4_aa` | string | Framework amino acid sequences |
| `v_identity`, `d_identity`, `j_identity` | double | Alignment identity scores |
| `productive` | string | Whether the sequence is productive |
| `stop_codon`, `vj_in_frame`, `v_frameshift` | string | QC flags |
| `Redundancy` | int64 | Copy count in original OAS study |
| `ANARCI_numbering` | string | ANARCI antibody numbering |
| `ANARCI_status` | string | ANARCI annotation status |
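As a sketch of how these fields might be combined for basic quality filtering, the helpers below keep rows that are productive and in frame, with no stop codon or V frameshift, and then sum their `Redundancy` copy counts. They operate on plain dictionaries keyed by the column names above. The flag encoding is an assumption: the QC columns are typed as strings, and the code assumes true values are stored as `T` (or `TRUE`), so check the actual values in your split before relying on it.

```python
from typing import Iterable, Mapping

# Assumption: true flag values are stored as strings like "T"; verify on real rows.
TRUE_VALUES = {"T", "TRUE"}

def flag(row: Mapping[str, object], column: str) -> bool:
    """Interpret a string QC flag column as a boolean."""
    return str(row.get(column, "")).strip().upper() in TRUE_VALUES

def passes_qc(row: Mapping[str, object]) -> bool:
    """Keep productive, in-frame sequences without stop codons or V frameshifts."""
    return (
        flag(row, "productive")
        and flag(row, "vj_in_frame")
        and not flag(row, "stop_codon")
        and not flag(row, "v_frameshift")
    )

def weighted_count(rows: Iterable[Mapping[str, object]]) -> int:
    """Sum the `Redundancy` copy counts of rows that pass QC."""
    return sum(int(row["Redundancy"]) for row in rows if passes_qc(row))

# Hypothetical rows for illustration only; not taken from the dataset.
rows = [
    {"productive": "T", "vj_in_frame": "T", "stop_codon": "F",
     "v_frameshift": "F", "Redundancy": 3},
    {"productive": "F", "vj_in_frame": "T", "stop_codon": "T",
     "v_frameshift": "F", "Redundancy": 5},
]
print(weighted_count(rows))
```

The same predicate can be passed to a streaming filter so that the full split never has to be held in memory.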
### OAS metadata columns (`meta_*`)
| Column | Description |
|--------|-------------|
| `meta_Run` | SRA run accession |
| `meta_Author` | Author label (matches config name) |
| `meta_Species` | Donor species |
| `meta_Age` | Donor age |
| `meta_BSource` | B-cell source tissue |
| `meta_BType` | B-cell type |
| `meta_Vaccine` | Vaccine/antigen if applicable |
| `meta_Disease` | Disease condition |
| `meta_Subject` | Subject identifier |
| `meta_Longitudinal` | Whether study is longitudinal |
| `meta_Isotype` | Isotype |
| `meta_Chain` | `Heavy` or `Light` |
| `meta_Link` | URL to original OAS study page |
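Because `meta_Author` matches the config name, a row can be traced back to its per-study files. The sketch below rebuilds the repo-relative Parquet glob for a study from `meta_Author` and `meta_Chain`, assuming the `data/unpaired_<chain>/<author>/*.parquet` layout declared in this card's `configs` section; `study_glob` is a hypothetical helper.

```python
def study_glob(meta_author: str, meta_chain: str) -> str:
    """Build the repo-relative Parquet glob for one study and chain."""
    chain = meta_chain.strip().lower()
    if chain not in {"heavy", "light"}:
        raise ValueError(f"meta_Chain must be 'Heavy' or 'Light', got {meta_chain!r}")
    # Layout assumed from the card's `configs` entries.
    return f"data/unpaired_{chain}/{meta_author}/*.parquet"

print(study_glob("Wu et al., 2015", "Heavy"))
```

Not every study provides both chains, so a glob for the missing chain may match no files.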
### Hash columns
| Column | Type | Description |
|--------|------|-------------|
| `aa_hash_hi` | uint64 | High 64 bits of xxh128(`sequence_alignment_aa`) |
| `aa_hash_lo` | uint64 | Low 64 bits of xxh128(`sequence_alignment_aa`) |
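The two hash columns together carry one 128-bit digest. The sketch below splits and recombines the halves and uses the pair as a deduplication key. The split is plain integer arithmetic; how the digest itself was produced (string encoding, seed, library) is not stated in this card, so the commented `xxhash` call is an assumption rather than a documented recipe.

```python
from typing import Dict, Iterable, Iterator, Tuple

MASK64 = (1 << 64) - 1

def split_u128(digest: int) -> Tuple[int, int]:
    """Split a 128-bit integer into (high 64 bits, low 64 bits)."""
    if not 0 <= digest < (1 << 128):
        raise ValueError("digest must fit in 128 bits")
    return digest >> 64, digest & MASK64

def join_u128(hi: int, lo: int) -> int:
    """Recombine `aa_hash_hi` and `aa_hash_lo` into one 128-bit integer."""
    return (hi << 64) | lo

def dedup_by_hash(rows: Iterable[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Yield the first row seen for each (aa_hash_hi, aa_hash_lo) pair."""
    seen = set()
    for row in rows:
        key = (row["aa_hash_hi"], row["aa_hash_lo"])
        if key not in seen:
            seen.add(key)
            yield row

# Assumption (not confirmed by this card): UTF-8 bytes, default seed,
# and the third-party python-xxhash package:
# import xxhash
# hi, lo = split_u128(xxhash.xxh128(aa.encode("utf-8")).intdigest())
```

Comparing both halves matters: rows that agree on only one of the two columns are not the same sequence.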
---
## Citation
If you use this dataset, please cite the original OAS publication:
```bibtex
@article{Olsen2022,
author = {Olsen, Tobias H. and Boyles, Fergus and Deane, Charlotte M.},
title = {Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences},
journal = {Protein Science},
year = {2022},
volume = {31},
number = {1},
pages = {141--146},
doi = {10.1002/pro.4205}
}
```
Please also cite the individual studies whose data you use; links are available in the `meta_Link` column and on the [OAS website](https://opig.stats.ox.ac.uk/webapps/oas/oas_unpaired/).
## About
Built by [Converge Bio](https://converge-bio.com) — accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.
## License
OAS data is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Provider: ConvergeBio



