ConvergeBio/oas-unpaired

Source: Hugging Face. Updated 2026-03-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ConvergeBio/oas-unpaired
Description:
---
license: cc-by-4.0
language:
- en
tags:
- antibody
- immunology
- bcr
- airr
- oas
- proteomics
- proteins
- protein
size_categories:
- 1B<n<10B
configs:
- config_name: default
  data_files:
  - split: heavy
    path: data/unpaired_heavy/**/*.parquet
  - split: light
    path: data/unpaired_light/**/*.parquet
- config_name: heavy
  data_files:
  - split: train
    path: data/unpaired_heavy/**/*.parquet
- config_name: light
  data_files:
  - split: train
    path: data/unpaired_light/**/*.parquet
- config_name: Banerjee et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Banerjee et al., 2017/*.parquet
- config_name: Bashford et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bashford et al., 2013/*.parquet
- config_name: Bender et al., 2020
  data_files:
  - split: light
    path: data/unpaired_light/Bender et al., 2020/*.parquet
- config_name: Bernardes et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bernardes et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Bernardes et al., 2020/*.parquet
- config_name: Bernat et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bernat et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Bernat et al., 2019/*.parquet
- config_name: Bhiman et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bhiman et al., 2015/*.parquet
  - split: light
    path: data/unpaired_light/Bhiman et al., 2015/*.parquet
- config_name: Bolland et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bolland et al., 2016/*.parquet
- config_name: Bonsignori et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Bonsignori et al., 2016/*.parquet
- config_name: Briney et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Briney et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Briney et al., 2019/*.parquet
- config_name: Buchheim et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Buchheim et al., 2020/*.parquet
- config_name: Chen et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Chen et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Chen et al., 2020/*.parquet
- config_name: Collins et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Collins et al., 2015/*.parquet
- config_name: Corcoran et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Corcoran et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Corcoran et al., 2016/*.parquet
- config_name: Cui et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Cui et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Cui et al., 2019/*.parquet
- config_name: Davis et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Davis et al., 2019/*.parquet
- config_name: Doria-Rose et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Doria-Rose et al., 2015/*.parquet
  - split: light
    path: data/unpaired_light/Doria-Rose et al., 2015/*.parquet
- config_name: Eccles et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Eccles et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Eccles et al., 2020/*.parquet
- config_name: Eliyahu et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Eliyahu et al., 2015/*.parquet
- config_name: Ellebedy et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Ellebedy et al., 2016/*.parquet
- config_name: Fisher et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Fisher et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Fisher et al., 2017/*.parquet
- config_name: Galson et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2015/*.parquet
- config_name: Galson et al., 2015a
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2015a/*.parquet
- config_name: Galson et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2016/*.parquet
- config_name: Galson et al., 2016a
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2016a/*.parquet
- config_name: Galson et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Galson et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Galson et al., 2020/*.parquet
- config_name: Ghraichy et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Ghraichy et al., 2020/*.parquet
- config_name: Gidoni et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Gidoni et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Gidoni et al., 2019/*.parquet
- config_name: Greif et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Greif et al., 2015/*.parquet
- config_name: Greiff et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Greiff et al., 2014/*.parquet
- config_name: Greiff et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Greiff et al., 2017/*.parquet
- config_name: Gupta et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Gupta et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Gupta et al., 2017/*.parquet
- config_name: Halliley et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Halliley et al., 2015/*.parquet
- config_name: Huang et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Huang et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Huang et al., 2016/*.parquet
- config_name: Jaffe et al., 2022
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Jaffe et al., 2022/*.parquet
  - split: light
    path: data/unpaired_light/Jaffe et al., 2022/*.parquet
- config_name: Jiang et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Jiang et al., 2013/*.parquet
- config_name: Johnson et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Johnson et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Johnson et al., 2018/*.parquet
- config_name: Joyce et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Joyce et al., 2016/*.parquet
- config_name: Khan et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Khan et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Khan et al., 2016/*.parquet
- config_name: Kim et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Kim et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Kim et al., 2020/*.parquet
- config_name: King et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/King et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/King et al., 2020/*.parquet
- config_name: Kuri-Cervantes et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Kuri-Cervantes et al., 2020/*.parquet
- config_name: Levin et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Levin et al., 2016/*.parquet
- config_name: Levin et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Levin et al., 2017/*.parquet
- config_name: Li et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Li et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Li et al., 2017/*.parquet
- config_name: Liao et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Liao et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Liao et al., 2013/*.parquet
- config_name: Lindner et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Lindner et al., 2015/*.parquet
- config_name: Meng et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Meng et al., 2017/*.parquet
- config_name: Menzel et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Menzel et al., 2014/*.parquet
- config_name: Montague et al., 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Montague et al., 2021/*.parquet
- config_name: Mor et al., 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Mor et al., 2021/*.parquet
  - split: light
    path: data/unpaired_light/Mor et al., 2021/*.parquet
- config_name: Mroczek et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Mroczek et al., 2014/*.parquet
- config_name: Mukhamedova et al. 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Mukhamedova et al. 2021/*.parquet
  - split: light
    path: data/unpaired_light/Mukhamedova et al. 2021/*.parquet
- config_name: Nielsen et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Nielsen et al., 2020/*.parquet
- config_name: Ohm-Laursen et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Ohm-Laursen et al., 2018/*.parquet
- config_name: Ota et al., 2010
  data_files:
  - split: light
    path: data/unpaired_light/Ota et al., 2010/*.parquet
- config_name: Palanichamy et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Palanichamy et al., 2014/*.parquet
- config_name: Parameswaran et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Parameswaran et al., 2014/*.parquet
- config_name: Prohaska et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Prohaska et al., 2018/*.parquet
- config_name: Rettig et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Rettig et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Rettig et al., 2018/*.parquet
- config_name: Richardson et al., 2022
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Richardson et al., 2022/*.parquet
- config_name: Rubelt et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Rubelt et al., 2016/*.parquet
- config_name: Schanz et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Schanz et al., 2014/*.parquet
  - split: light
    path: data/unpaired_light/Schanz et al., 2014/*.parquet
- config_name: Schultheiss et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Schultheiss et al., 2020/*.parquet
- config_name: Setliff et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Setliff et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Setliff et al., 2018/*.parquet
- config_name: Sevy et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Sevy et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Sevy et al., 2019/*.parquet
- config_name: Sheng et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Sheng et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Sheng et al., 2017/*.parquet
- config_name: Simonich et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Simonich et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Simonich et al., 2020/*.parquet
- config_name: Soto et al., 2016
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Soto et al., 2016/*.parquet
  - split: light
    path: data/unpaired_light/Soto et al., 2016/*.parquet
- config_name: Soto et al., 2019
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Soto et al., 2019/*.parquet
  - split: light
    path: data/unpaired_light/Soto et al., 2019/*.parquet
- config_name: Stern et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Stern et al., 2014/*.parquet
  - split: light
    path: data/unpaired_light/Stern et al., 2014/*.parquet
- config_name: Sundling et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Sundling et al., 2014/*.parquet
- config_name: Tipton et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Tipton et al., 2015/*.parquet
- config_name: Tong et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Tong et al., 2017/*.parquet
- config_name: Turchaninova et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Turchaninova et al., 2015/*.parquet
- config_name: Turner et al., 2021
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Turner et al., 2021/*.parquet
  - split: light
    path: data/unpaired_light/Turner et al., 2021/*.parquet
- config_name: VanDuijn et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/VanDuijn et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/VanDuijn et al., 2017/*.parquet
- config_name: Vander Heiden et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Vander Heiden et al., 2017/*.parquet
  - split: light
    path: data/unpaired_light/Vander Heiden et al., 2017/*.parquet
- config_name: Vergani et al., 2017
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Vergani et al., 2017/*.parquet
- config_name: Waltari et al., 2018
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Waltari et al., 2018/*.parquet
  - split: light
    path: data/unpaired_light/Waltari et al., 2018/*.parquet
- config_name: Wesemann et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wesemann et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Wesemann et al., 2013/*.parquet
- config_name: Woodruff et al., 2020
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Woodruff et al., 2020/*.parquet
  - split: light
    path: data/unpaired_light/Woodruff et al., 2020/*.parquet
- config_name: Wu et al., 2011
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wu et al., 2011/*.parquet
  - split: light
    path: data/unpaired_light/Wu et al., 2011/*.parquet
- config_name: Wu et al., 2014
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wu et al., 2014/*.parquet
- config_name: Wu et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Wu et al., 2015/*.parquet
  - split: light
    path: data/unpaired_light/Wu et al., 2015/*.parquet
- config_name: Zhou et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhou et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Zhou et al., 2013/*.parquet
- config_name: Zhou et al., 2015
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhou et al., 2015/*.parquet
- config_name: Zhu et al., 2012
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhu et al., 2012/*.parquet
  - split: light
    path: data/unpaired_light/Zhu et al., 2012/*.parquet
- config_name: Zhu et al., 2013
  data_files:
  - split: heavy
    path: data/unpaired_heavy/Zhu et al., 2013/*.parquet
  - split: light
    path: data/unpaired_light/Zhu et al., 2013/*.parquet
pretty_name: OAS Unpaired
---

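Every `path` in the configuration metadata follows one pattern: `data/unpaired_<chain>/<study folder>/*.parquet`, with `**` in place of the study folder for the all-study configs. The sketch below rebuilds those globs from a chain and an optional study name. The helper names `parquet_glob` and `hf_url` are hypothetical and not part of the dataset or of any library; the `hf://datasets/...` prefix is the one used by the dataset card's DuckDB example.

```python
from typing import Optional

REPO_ID = "ConvergeBio/oas-unpaired"


def parquet_glob(chain: str, study: Optional[str] = None) -> str:
    """Build the repo-relative Parquet glob for a chain ("heavy" or "light").

    Hypothetical helper: it only mirrors the path pattern visible in the
    card's `configs` metadata. It does not check that the study exists
    or that the study has data for the requested chain.
    """
    if chain not in ("heavy", "light"):
        raise ValueError(f"chain must be 'heavy' or 'light', got {chain!r}")
    # "**" matches every study folder, as in the all-study configs.
    folder = study if study is not None else "**"
    return f"data/unpaired_{chain}/{folder}/*.parquet"


def hf_url(chain: str, study: Optional[str] = None) -> str:
    """Prefix the glob with the hf:// scheme for tools that read hf:// paths."""
    return f"hf://datasets/{REPO_ID}/{parquet_glob(chain, study)}"


print(parquet_glob("heavy"))
print(hf_url("light", "Bender et al., 2020"))
```

Study names must be passed exactly as they appear in the config list, including irregular ones such as `Mukhamedova et al. 2021` (no comma) and `Greif et al., 2015`.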
# OAS Unpaired

The unpaired dataset from [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/), available as Parquet with content-defined chunking on Hugging Face.

## Configs and Splits

This dataset exposes **91 configs**:

| Config | Splits | Description |
|--------|--------|-------------|
| `default` | `heavy`, `light` | All sequences, split by chain |
| `heavy` | `train` | All heavy chain sequences |
| `light` | `train` | All light chain sequences |
| `{Author et al., YYYY}` | `heavy`, `light`, or both | One author's sequences |

```python
from datasets import load_dataset

# All heavy chain sequences
ds = load_dataset("ConvergeBio/oas-unpaired", "heavy", split="train", streaming=True)

# All light chain sequences
ds = load_dataset("ConvergeBio/oas-unpaired", "light", split="train", streaming=True)

# One author
ds = load_dataset("ConvergeBio/oas-unpaired", "Briney et al., 2019", split="heavy")
```

---

## Usage

### Load specific columns

```python
from datasets import load_dataset

ds = load_dataset(
    "ConvergeBio/oas-unpaired",
    "Briney et al., 2019",
    split="heavy",
    columns=["sequence_alignment_aa", "cdr3_aa", "v_call", "j_call", "meta_Subject"],
)
df = ds.to_pandas()
```

### Load a specific author

```python
from datasets import load_dataset

ds = load_dataset("ConvergeBio/oas-unpaired", "Gidoni et al., 2019", split="heavy")
print(ds)  # Dataset({features: [...], num_rows: ...})
```

### Stream and filter

```python
from datasets import load_dataset

ds = load_dataset(
    "ConvergeBio/oas-unpaired",
    "heavy",
    split="train",
    streaming=True,
)
# Keep only rows flagged as productive ("T")
productive = ds.filter(lambda x: x["productive"] == "T")
```

### With DuckDB

```python
import duckdb

con = duckdb.connect()
# httpfs provides access to hf:// paths
con.sql("INSTALL httpfs; LOAD httpfs;")
# Replace 'hf_...' with your own Hugging Face token
con.sql("CREATE SECRET (TYPE HUGGINGFACE, TOKEN 'hf_...');")

# CDR3 lengths for one author
df = con.sql("""
    SELECT cdr3_aa, length(cdr3_aa) as cdr3_len, v_call, j_call, meta_Subject
    FROM 'hf://datasets/ConvergeBio/oas-unpaired/data/unpaired_heavy/Briney et al., 2019/*.parquet'
    WHERE productive = 'T'
""").df()

# Sequence counts per author across all heavy chains
df = con.sql("""
    SELECT meta_Author, COUNT(*) as n_sequences
    FROM 'hf://datasets/ConvergeBio/oas-unpaired/data/unpaired_heavy/**/*.parquet'
    GROUP BY meta_Author
    ORDER BY n_sequences DESC
""").df()
```

---

## Studies

Each entry is a separate config. Sequence counts come from OAS source file headers. Duplicate rates are based on xxHash128 of `sequence_alignment_aa`. Click an author name to view the original publication.

| Author | Heavy | H Dup% | Light | L Dup% | Summary |
|--------|------:|-------:|------:|-------:|---------|
| [Banerjee et al., 2017](https://doi.org/10.1016/j.virol.2017.02.015) | 3.5M | 1.97% | — | — | Rabbit heavy chains from HIV-vaccinated animals; vaccine designed to elicit MPER 4E10/10E8 broadly neutralizing antibodies. |
| [Bashford et al., 2013](https://doi.org/10.1101/gr.154815.113) | 258.2K | 0.03% | — | — | Human heavy chains from chronic lymphocytic leukemia (CLL) patients; B-cell receptor network analysis via deep sequencing. |
| [Bender et al., 2020](https://doi.org/10.1182/blood.2019004197) | — | — | 1.1M | 0.43% | Human light chains from bone marrow of POEMS syndrome patients (rare plasma cell disorder with monoclonal immunoglobulin). |
| [Bernardes et al., 2020](https://doi.org/10.1016/j.immuni.2020.11.017) | 3.0M | 8.99% | 19 | — | Human heavy chains from COVID-19 patients; longitudinal multi-omics study of immune cell responses during SARS-CoV-2 infection. |
| [Bernat et al., 2019](https://doi.org/10.3389/fimmu.2019.00660) | 2.8M | 1.03% | 353.3K | 0.09% | Human heavy and light chains from healthy PBMC; optimized NGS library preparation for immunoglobulin germline gene inference. |
| [Bhiman et al., 2015](https://doi.org/10.1038/nm.3963) | 88.5K | 1.21% | 129.3K | 0.02% | Human heavy and light chains from HIV-infected donors tracking viral variants that initiate V1V2-directed broadly neutralizing antibody maturation. |
| [Bolland et al., 2016](https://doi.org/10.1016/j.celrep.2016.05.020) | 27.6K | 0.09% | — | — | Mouse pro-B cell heavy chains studying chromatin states that regulate efficient V(D)J recombination at the immunoglobulin locus. |
| [Bonsignori et al., 2016](https://doi.org/10.1016/j.cell.2016.02.022) | 202.7K | 1.6% | — | — | Human heavy chains from HIV-infected donors tracing the germline-to-neutralizer maturation pathway of a CD4-mimicking broadly neutralizing antibody. |
| [Briney et al., 2019](https://doi.org/10.1038/s41586-019-0879-y) | 940.3M | 7.23% | 18 | — | Very large human heavy chain dataset (~940M sequences) from 10 healthy donors revealing commonality despite exceptional diversity in baseline human antibody repertoires. |
| [Buchheim et al., 2020](https://doi.org/10.1096/fj.202001403RR) | 5.2M | 0.74% | — | — | Human IgM heavy chains from astronauts and ground controls; studies IgM repertoire plasticity during long-term spaceflight. |
| [Chen et al., 2020](https://doi.org/10.1371/journal.pone.0235713) | 776.4K | 6.9% | 2.1M | 8.11% | Human heavy and light chains from bone marrow of light chain amyloidosis patients; diverse patterns of antibody variable gene disruption. |
| [Collins et al., 2015](https://doi.org/10.1098/rstb.2014.0236) | 359.5K | 37.98% | — | — | Mouse heavy chains from spleen showing the murine antibody repertoire is germline-focused and highly variable across individuals. |
| [Corcoran et al., 2016](https://doi.org/10.1038/ncomms13642) | 3.7M | 0.16% | 1.2M | 0.62% | Cross-species (human, mouse C57BL/6, mouse BALB/c, rhesus) heavy and light chains; individualized V gene databases reveal high immunoglobulin gene polymorphism. |
| [Cui et al., 2019](https://doi.org/10.4049/jimmunol.1502263) | 5.5K | 11.91% | 1.0M | 8.76% | Mouse memory B cell heavy and light chains after NP-CGG immunization; models somatic hypermutation targeting in mice. |
| [Davis et al., 2019](https://doi.org/10.1016/j.cell.2019.04.036) | 11.9M | 0.95% | — | — | Human heavy chains from Ebola virus-infected donors across B cell subsets; longitudinal analysis of the B cell response to Ebola infection. |
| [Doria-Rose et al., 2015](https://doi.org/10.1038/nature13036) | 526.1K | 0.66% | 415.3K | 0.19% | Human heavy and light chains from HIV-infected donors tracking the developmental pathway of potent V1V2-directed broadly neutralizing antibodies. |
| [Eccles et al., 2020](https://doi.org/10.1016/j.celrep.2019.12.027) | 796 | — | 13.6K | 3.91% | Human heavy and light chains from rhinovirus-reactive (T-bet+) memory B cells linking local cross-reactive IgG to rhinovirus infection. |
| [Eliyahu et al., 2015](https://doi.org/10.3389/fimmu.2018.03004) | 1.6M | 0.42% | — | — | Human heavy chains from hepatitis C virus (HCV) infected donors identifying immune signatures and potential therapeutic antibody targets. |
| [Ellebedy et al., 2016](https://doi.org/10.1038/ni.3533) | 11.3M | 1.48% | — | — | Human heavy chains from multiple B cell subsets (naive, memory, plasmablasts) defining antigen-specific B cell responses after seasonal influenza vaccination. |
| [Fisher et al., 2017](https://doi.org/10.1371/journal.ppat.1006469) | 29.9K | 0.27% | 130.4K | 5.84% | Mouse heavy and light chains from Plasmodium-immunized BALB/c spleen; T-dependent B cell responses forming high-avidity anti-parasite antibodies. |
| [Galson et al., 2015](https://doi.org/10.1016/j.ebiom.2015.11.034) | 15.9M | 2.72% | — | — | Human heavy chains tracking B cell repertoire dynamics following hepatitis B vaccination. |
| [Galson et al., 2015a](https://doi.org/10.1038/icb.2015.57) | 3.9M | 2.23% | — | — | Human heavy chains across B cell subsets (naive, memory, plasma, unsorted) after meningococcal (MenACWY) conjugate and polysaccharide vaccination. |
| [Galson et al., 2016](https://doi.org/10.1186/s13073-016-0322-z) | 16.1M | 1.54% | — | — | Human heavy chains studying B cell repertoire dynamics after sequential hepatitis B vaccination; evidence for clonal B cell persistence. |
| [Galson et al., 2016a](https://doi.org/10.1038/srep37229) | 4.4M | — | — | — | Human heavy chains from plasma cells after pandemic H1N1 influenza vaccination, investigating AS03 adjuvant effects on B cell repertoire. |
| [Galson et al., 2020](https://doi.org/10.3389/fimmu.2020.605170) | 4.6M | 2.57% | 46 | — | Human heavy and light chains from COVID-19 patients; one of the early deep BCR sequencing studies of SARS-CoV-2 infection. |
| [Ghraichy et al., 2020](https://doi.org/10.3389/fimmu.2020.01734) | 8.3M | 2.95% | — | — | Human heavy chain repertoire from healthy donors across age groups (children to elderly); studies age-related maturation of immunoglobulin diversity. |
| [Gidoni et al., 2019](https://doi.org/10.1038/s41467-019-08489-3) | 13.6M | 0.86% | 12.7M | 28.45% | Human naive B cell heavy and light chains revealing mosaic deletion patterns in the antibody heavy chain gene locus; includes healthy and celiac disease subjects. |
| [Greif et al., 2015](https://doi.org/10.1186/s13073-015-0169-8) | 552.0K | 1.04% | — | — | Mouse heavy chains across B cell subsets (naive, ASC, plasma cells) after NP-CGG immunization; framework for immune repertoire diversity profiling. |
| [Greiff et al., 2014](https://doi.org/10.1186/s12865-014-0040-5) | 3.4M | 13.88% | — | — | Mouse plasmablast/plasma cell heavy chains after NP-CGG immunization; quantitative assessment of NGS-based antibody repertoire sequencing robustness. |
| [Greiff et al., 2017](https://doi.org/10.1016/j.celrep.2017.04.054) | 138.8M | 5.54% | — | — | Very large mouse heavy chain dataset (multiple strains, antigens, tissues); systems analysis of genetic and antigen-driven predetermination of antibody repertoire structure. |
| [Gupta et al., 2017](https://doi.org/10.4049/jimmunol.1601850) | 3.1M | 2.82% | 9.3M | 6.96% | Human naive B cell heavy and light chains from flu/HepB vaccinated donors; used to evaluate hierarchical clustering methods for identifying B cell clones. |
| [Halliley et al., 2015](https://doi.org/10.1016/j.immuni.2015.06.016) | 593.3K | 1.6% | — | — | Human bone marrow plasma cell heavy chains after tetanus/flu vaccination; identifies long-lived plasma cells within the CD19-CD38hiCD138+ subset. |
| [Huang et al., 2016](https://doi.org/10.1016/j.immuni.2016.10.027) | 3.6M | 5.57% | 3.5M | 11.33% | Human memory B cell heavy and light chains from HIV-infected donors; identifies CD4-binding-site antibodies that evolved near-pan HIV neutralization. |
| [Jaffe et al., 2022](https://doi.org/10.1038/s41586-022-05371-z) | 1.6M | 3.71% | 969.5K | 37.19% | Human heavy and light chains from COVID-19 and CMV donors; demonstrates that functional antibodies exhibit light chain coherence (light chain pairing bias). |
| [Jiang et al., 2013](https://doi.org/10.1126/scitranslmed.3004794) | 3.7M | 14.33% | — | — | Human heavy chains from influenza-vaccinated donors; examines lineage structure of the antibody repertoire in response to influenza vaccination. |
| [Johnson et al., 2018](https://doi.org/10.1038/s41467-018-06424-6) | 8.6M | 2.82% | 1.9M | 2.83% | Human heavy and light chains from HIV-infected donors; sequences broadly neutralizing antibody exons and introns revealing detailed aspects of antibody evolution. |
| [Joyce et al., 2016](https://doi.org/10.1016/j.cell.2016.06.043) | 1.6M | — | — | — | Human heavy chains used in a vaccine study identifying broadly protective antibodies that neutralize both group 1 and group 2 influenza A viruses. |
| [Khan et al., 2016](https://doi.org/10.1126/sciadv.1501371) | 12.0M | 17.76% | 5 | — | Mouse heavy chains from OVA-immunized BALB/c spleen; demonstrates accurate antibody repertoire profiling by molecular amplification fingerprinting. |
| [Kim et al., 2020](https://doi.org/10.1126/scitranslmed.abd6990) | 45.1M | 4.37% | 19.6M | 10.51% | Large human heavy and light chain dataset from COVID-19 patients; identifies stereotypic VH antibodies that neutralize SARS-CoV-2. |
| [King et al., 2020](https://doi.org/10.1126/sciimmunol.abe6291) | 13.7M | 0.7% | 45.9K | 14.88% | Human heavy and light chains from tonsillar B cells across subsets (GC, memory, naive, plasmablast); single-cell analysis predicts antibody class switching. |
| [Kuri-Cervantes et al., 2020](https://doi.org/10.1126/sciimmunol.abd7114) | 8.8M | 1.45% | — | — | Human heavy chains from COVID-19 patients; comprehensive mapping of immune perturbations associated with severe COVID-19. |
| [Levin et al., 2016](https://doi.org/10.1016/j.jaci.2015.09.027) | 675.9K | 1.49% | — | — | Human heavy chains from allergy patients with/without subcutaneous immunotherapy (SIT); persistence and evolution of allergen-specific IgE repertoires. |
| [Levin et al., 2017](https://doi.org/10.1016/j.jaci.2016.06.040) | 13.1M | 5.67% | — | — | Human heavy chains from bone marrow and blood of IgE allergy patients; focuses on bone marrow as an antibody-encoding IgE-producing niche. |
| [Li et al., 2017](https://doi.org/10.1371/journal.pone.0161801) | 1.6M | 0.34% | 355 | 0.56% | Heavy and light chain repertoire from Bactrian camels; comparative analysis of conventional and heavy-chain-only (nanobody precursor) antibody repertoires. |
| [Liao et al., 2013](https://doi.org/10.1038/nature12053) | 411.6K | 1.32% | 333.4K | 1.63% | Human heavy and light chains from an HIV-infected donor; tracks co-evolution of the broadly neutralizing antibody VRC01 and its founder virus. |
| [Lindner et al., 2015](https://doi.org/10.1038/ni.3213) | 741.5K | 3.7% | — | — | Mouse heavy chains from small intestinal B cells; microbial colonization drives diversification of memory B cells producing secretory IgA. |
| [Meng et al., 2017](https://doi.org/10.1038/nbt.3942) | 32.2M | 1.89% | — | — | Human heavy chain atlas across 8 tissues (blood, bone marrow, lung, gut, spleen, etc.); maps B cell clonal distribution throughout the human body. |
| [Menzel et al., 2014](https://doi.org/10.1371/journal.pone.0096727) | 8.3M | 22.07% | — | — | Mouse plasmablast/plasma cell heavy chains after NP-CGG immunization; comprehensive evaluation of amplicon library preparation methods for repertoire sequencing. |
| [Montague et al., 2021](https://doi.org/10.1016/j.celrep.2021.109173) | 10.2M | 7.03% | — | — | Human heavy chains from COVID-19 patients; studies dynamics of B cell repertoire and emergence of cross-reactive antibody responses. |
| [Mor et al., 2021](https://doi.org/10.1371/journal.ppat.1009165) | 81.1K | 7.15% | 139.5K | 4.37% | Human heavy and light chains from severe COVID-19 patients; identifies multi-clonal SARS-CoV-2 neutralizing antibodies. |
| [Mroczek et al., 2014](https://doi.org/10.3389/fimmu.2014.00096) | 121.5K | 0.17% | — | — | Human heavy chains across B cell subsets (immature, naive, memory, plasma) from healthy donors; analyzes repertoire composition by B cell subset. |
| [Mukhamedova et al. 2021](https://doi.org/10.1016/j.immuni.2021.03.004) | 447.9K | 0.59% | 452.4K | 5.93% | Human heavy and light chains from RSV prefusion-protein vaccinated donors; studies antibody responses to respiratory syncytial virus (RSV). |
| [Nielsen et al., 2020](https://doi.org/10.1101/2020.07.08.194456) | 12.1M | 3.61% | — | — | Human heavy chains from COVID-19 patients and nasopharyngeal swabs; studies clonal B cell expansion and convergent antibody responses to SARS-CoV-2. |
| [Ohm-Laursen et al., 2018](https://doi.org/10.3389/fimmu.2018.01976) | 7.2M | 5.43% | — | — | Human heavy chains from bronchial biopsies and blood of asthma patients; studies local clonal B cell diversification and dissemination in the airway. |
| [Ota et al., 2010](https://doi.org/10.4049/jimmunol.1002176) | — | — | 20.1K | 9.88% | Mouse light chains from healthy spleen; studies how BAFF regulates B cell receptor repertoire composition and self-reactivity. |
| [Palanichamy et al., 2014](https://doi.org/10.1126/scitranslmed.3008930) | 339.6K | 0.12% | — | — | Human heavy chains from cerebrospinal fluid and blood of multiple sclerosis patients; immunoglobulin class-switched B cells form a CNS-periphery immune axis. |
| [Parameswaran et al., 2014](https://doi.org/10.1016/j.chom.2013.05.008) | 314.1K | 4.52% | — | — | Human heavy chains from dengue fever and non-dengue febrile illness donors; identifies convergent antibody signatures across multiple individuals. |
| [Prohaska et al., 2018](https://doi.org/10.4049/jimmunol.1700568) | 255.0K | 1.37% | — | — | Mouse heavy chains from B cell subsets (B-1a, B-1b, B-2, follicular, marginal zone) in peritoneal cavity and spleen; highlights innate-like B cell repertoire differences. |
| [Rettig et al., 2018](https://doi.org/10.1371/journal.pone.0190982) | 27.8K | 2.67% | 30.8K | 17.56% | Mouse heavy and light chains from healthy spleen; naive repertoire characterization using unamplified (no PCR) high-throughput sequencing to minimize amplification bias. |
| [Richardson et al., 2022](https://doi.org/10.1101/2022.06.27.497709) | 406.9K | 3.16% | — | — | Heavy chains from Kymouse (humanized transgenic) naive splenic B cells; characterizes the human-like immune repertoire in this model organism. |
| [Rubelt et al., 2016](https://doi.org/10.1038/ncomms11112) | 2.2M | 0.87% | — | — | Human heavy chains from memory and naive B cells in twins; heritable individual differences drive unique B cell receptor repertoire formation. |
| [Schanz et al., 2014](https://doi.org/10.1371/journal.pone.0111726) | 4.3M | 3.36% | 1.7M | 0.84% | Human heavy and light chains from HIV-infected donors using isotype-specific (IgG, IgM) high-throughput immunoglobulin sequencing. |
| [Schultheiss et al., 2020](https://doi.org/10.1016/j.immuni.2020.06.024) | 4.7M | 0.21% | — | — | Human heavy chains from COVID-19 patients; next-generation sequencing of both T and B cell receptor repertoires from COVID-19 patients and healthy controls. |
| [Setliff et al., 2018](https://doi.org/10.1016/j.chom.2018.05.001) | 22.5M | 2.91% | 1.9M | 0.8% | Large longitudinal human heavy and light chain dataset from HIV-infected donors; reveals stable clonal memory B cell pools across multiple donors. |
| [Sevy et al., 2019](https://doi.org/10.1186/s12859-019-3281-8) | 18.7M | 0.4% | 74 | 1.35% | Human heavy chains from HIV-infected and flu-vaccinated donors; repertoire fingerprinting by PCA reveals shared clonotypes across individuals. |
| [Sheng et al., 2017](https://doi.org/10.3389/fimmu.2017.00537) | 541.8K | 2.6% | 755.9K | 13.83% | Human heavy and light chains from healthy PBMC; describes gene-specific amino acid substitution profiles quantifying somatic hypermutation type and frequency. |
| [Simonich et al., 2020](https://doi.org/10.1038/s41467-019-09481-7) | 847.0K | 0.05% | 1.2M | 1.42% | Human heavy and light chains from HIV-infected infants; kappa light chain maturation drives rapid development of broadly neutralizing antibodies. |
| [Soto et al., 2016](https://doi.org/10.1371/journal.pone.0157409) | 333.9K | 2.99% | 422.3K | — | Human heavy and light chains from HIV-infected donors; traces the developmental pathway of the MPER-directed broadly neutralizing antibody 10E8. |
| [Soto et al., 2019](https://doi.org/10.1038/s41586-019-0934-8) | 553.3M | 19.91% | 242.7M | 25.32% | Very large human heavy and light chain dataset (~796M sequences) from healthy donors; demonstrates high frequency of shared clonotypes in human B cell receptor repertoires. |
| [Stern et al., 2014](https://doi.org/10.1126/scitranslmed.3008879) | 10.1M | 8.62% | 207 | 0.48% | Human heavy chains from multiple sclerosis patient brain lesions and draining cervical lymph nodes; B cells populating the MS brain mature in the CNS. |
| [Sundling et al., 2014](https://doi.org/10.4049/jimmunol.1303334) | 130.2K | 0.23% | — | — | Rhesus macaque IgG-switched heavy chains from PBMC after HIV vaccination; single-cell and deep sequencing reveals diverse antibody responses. |
| [Tipton et al., 2015](https://doi.org/10.1038/ni.3175) | 15.7M | 1.58% | — | — | Human heavy chains from SLE patients and healthy controls; studies diversity, cellular origin, and autoreactivity of antibody-secreting cells. |
| [Tong et al., 2017](https://doi.org/10.1073/pnas.1704962114) | 59.3K | 0.21% | — | — | Mouse heavy chains from OVA-immunized spleen and bone marrow; studies how IgH isotype-specific B cell receptor expression influences B cell fate. |
| [Turchaninova et al., 2015](https://doi.org/10.1038/nprot.2016.093) | 201.4K | 0.01% | — | — | Human heavy chains from memory, naive, and plasma B cells; demonstrates high-quality full-length immunoglobulin profiling using unique molecular barcodes. |
| [Turner et al., 2021](https://doi.org/10.1038/s41586-021-03738-2) | 1.6M | 0.64% | 11.7K | 4.66% | Human heavy and light chains from germinal center B cells and plasmablasts after SARS-CoV-2 mRNA vaccination; persistent germinal center responses observed. |
| [VanDuijn et al., 2017](https://doi.org/10.3389/fimmu.2017.01286) | 5.2M | 1.5% | 9 | — | Rat heavy chains from DNP/HuD-immunized spleen; studies immune repertoire by combining next-generation sequencing with protein mass spectrometry. |
| [Vander Heiden et al., 2017](https://doi.org/10.4049/jimmunol.1601415) | 2.5M | 3.36% | 5.3M | 9.48% | Human heavy and light chains from myasthenia gravis (AChR-MG and MuSK-MG) patients; B cell repertoire dysregulation in autoimmune disease. |
| [Vergani et al., 2017](https://doi.org/10.3389/fimmu.2017.01157) | 13.5M | 5.94% | — | — | Human heavy chains from healthy naive and unsorted B cells; presents a novel high-throughput method for full-length IGHV-D-J sequencing. |
| [Waltari et al., 2018](https://doi.org/10.3389/fimmu.2018.00628) | 29.6M | 0.78% | 45.1M | 6.55% | Large heavy and light chain dataset from HIV-infected donors and humanized mice across multiple tissues; 5' RACE amplification maps B cell receptor features. |
| [Wesemann et al., 2013](https://doi.org/10.1038/nature12496) | 37.0K | 1.9% | 29.9K | 27.03% | Mouse heavy and light chains from gut lamina propria, bone marrow, and spleen; gut microbial colonization influences early B cell lineage development. |
| [Woodruff et al., 2020](https://doi.org/10.1038/s41590-020-00814-z) | 18.3K | 4.11% | 45.5K | 1.97% | Human heavy and light chains from antibody-secreting and naive B cells in COVID-19; extrafollicular B cell responses correlate with neutralizing antibodies and morbidity. |
| [Wu et al., 2011](https://doi.org/10.1126/science.1207532) | 271.5K | 2.83% | 37.8K | — | Human heavy and light chains from HIV-infected donors; focused evolution of broadly neutralizing antibodies revealed by structures and deep sequencing. |
| [Wu et al., 2014](https://doi.org/10.1016/j.jaci.2014.07.010) | 37.5K | 2.07% | — | — | Human heavy chains from allergic rhinitis patients in- and out-of-season; seasonal grass pollen exposure shapes local and peripheral blood IgE repertoires. |
| [Wu et al., 2015](https://doi.org/10.1016/j.cell.2015.03.004) | 1.4M | 15.47% | 827.2K | 5.74% | Human heavy and light chains from a single HIV-infected donor over 15 years; tracks maturation and diversification of the VRC01 broadly neutralizing antibody lineage. |
| | [Zhou et al., 2013](https://doi.org/10.1016/j.immuni.2013.04.012) | 302.8K | 1.14% | 691.0K | 1.19% | Human heavy and light chains from multiple HIV-infected donors; multidonor analysis of structural elements, genetic determinants, and maturation of broadly neutralizing antibodies. | | [Zhou et al., 2015](https://doi.org/10.1016/j.cell.2015.05.007) | 383.2K | 1.01% | — | — | Human heavy chains from HIV-infected donors; structural repertoire of antibodies targeting the CD4 supersite on HIV-1. | | [Zhu et al., 2012](https://doi.org/10.3389/fmicb.2012.00315) | 200.0K | 0.51% | 115.1K | — | Human heavy and light chains from HIV-infected donors; identifies somatic populations of PGT135-137 broadly neutralizing antibodies by deep sequencing. | | [Zhu et al., 2013](https://doi.org/10.1073/pnas.1306262110) | 533.7K | 5.1% | 478.8K | 9.19% | Human heavy and light chains from HIV-infected donors; de novo identification of VRC01-class HIV-1 broadly neutralizing antibodies by next-generation sequencing. | *Total: 2,070,782,127 heavy + 356,864,753 light sequences across all studies.* --- ## Schema Each row is one antibody sequence. Fields follow the [AIRR Community standard](https://docs.airr-community.org/en/stable/datarep/rearrangements.html), with OAS study metadata. 
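
The `aa_hash_hi` and `aa_hash_lo` columns listed under "Hash columns" store one 128-bit hash as two 64-bit halves, so the pair can serve as a key for exact amino-acid deduplication. The following is a minimal sketch of that idea, not the dataset authors' pipeline: the helper names `split_u128`, `join_u128`, and `dedupe_by_aa_hash` are hypothetical, and the toy rows use made-up hash values rather than real xxh128 digests.

```python
from typing import Dict, Iterable, Iterator, Tuple

MASK64 = (1 << 64) - 1  # selects the low 64 bits of an integer


def split_u128(value: int) -> Tuple[int, int]:
    """Split a 128-bit unsigned integer into (high 64 bits, low 64 bits)."""
    if not 0 <= value < (1 << 128):
        raise ValueError("value must fit in 128 bits")
    return value >> 64, value & MASK64


def join_u128(hi: int, lo: int) -> int:
    """Recombine the two 64-bit halves into the original 128-bit integer."""
    return (hi << 64) | lo


def dedupe_by_aa_hash(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Yield the first row seen for each (aa_hash_hi, aa_hash_lo) pair."""
    seen = set()
    for row in rows:
        key = (row["aa_hash_hi"], row["aa_hash_lo"])
        if key not in seen:
            seen.add(key)
            yield row


# Toy rows with invented hash values (assumption: real rows carry the
# xxh128 digest of `sequence_alignment_aa` in these two columns).
rows = [
    {"sequence_alignment_aa": "EVQLVESGG", "aa_hash_hi": 1, "aa_hash_lo": 7, "Redundancy": 3},
    {"sequence_alignment_aa": "EVQLVESGG", "aa_hash_hi": 1, "aa_hash_lo": 7, "Redundancy": 2},
    {"sequence_alignment_aa": "QVQLQESGP", "aa_hash_hi": 1, "aa_hash_lo": 9, "Redundancy": 1},
]
unique_rows = list(dedupe_by_aa_hash(rows))
```

To reproduce the hashes themselves, one option is the third-party `xxhash` package (`xxhash.xxh128(text.encode()).intdigest()` followed by `split_u128`). The seed and text encoding used to build this dataset are not stated here, so check a few rows against the stored columns before relying on recomputed values.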
### Core AIRR fields

| Column | Type | Description |
|--------|------|-------------|
| `sequence` | string | Raw nucleotide sequence |
| `locus` | string | `IGH`, `IGK`, or `IGL` |
| `v_call`, `d_call`, `j_call` | string | V/D/J gene assignments |
| `sequence_alignment` | string | Aligned nucleotide sequence |
| `sequence_alignment_aa` | string | Aligned amino acid sequence |
| `junction` | string | Junction nucleotides |
| `junction_aa` | string | Junction amino acids |
| `cdr1_aa`, `cdr2_aa`, `cdr3_aa` | string | CDR amino acid sequences |
| `fwr1_aa` … `fwr4_aa` | string | Framework amino acid sequences |
| `v_identity`, `d_identity`, `j_identity` | double | Alignment identity scores |
| `productive` | string | Whether the sequence is productive |
| `stop_codon`, `vj_in_frame`, `v_frameshift` | string | QC flags |
| `Redundancy` | int64 | Copy count in original OAS study |
| `ANARCI_numbering` | string | ANARCI antibody numbering |
| `ANARCI_status` | string | ANARCI annotation status |

### OAS metadata columns (`meta_*`)

| Column | Description |
|--------|-------------|
| `meta_Run` | SRA run accession |
| `meta_Author` | Author label (matches config name) |
| `meta_Species` | Donor species |
| `meta_Age` | Donor age |
| `meta_BSource` | B-cell source tissue |
| `meta_BType` | B-cell type |
| `meta_Vaccine` | Vaccine/antigen if applicable |
| `meta_Disease` | Disease condition |
| `meta_Subject` | Subject identifier |
| `meta_Longitudinal` | Whether study is longitudinal |
| `meta_Isotype` | Isotype |
| `meta_Chain` | `Heavy` or `Light` |
| `meta_Link` | URL to original OAS study page |

### Hash columns

| Column | Type | Description |
|--------|------|-------------|
| `aa_hash_hi` | uint64 | High 64 bits of xxh128(`sequence_alignment_aa`) |
| `aa_hash_lo` | uint64 | Low 64 bits of xxh128(`sequence_alignment_aa`) |

---

## Citation

If you use this dataset, please cite the original OAS publication:

```bibtex
@article{Olsen2022,
  author  = {Olsen, Tobias H. and Boyles, Fergus and Deane, Charlotte M.},
  title   = {Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences},
  journal = {Protein Science},
  year    = {2022},
  volume  = {31},
  number  = {1},
  pages   = {141--146},
  doi     = {10.1002/pro.4205}
}
```

Please also cite the individual studies whose data you use; links are available in the `meta_Link` column and on the [OAS website](https://opig.stats.ox.ac.uk/webapps/oas/oas_unpaired/).

## About

Built by [Converge Bio](https://converge-bio.com): accelerating drug discovery with generative AI. Converge Bio develops foundation models for protein engineering, antibody design, and gene expression optimization, powering its computational lab products ConvergeAB, ConvergeGEO, and ConvergeCELL.

## License

OAS data is available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Provider:
ConvergeBio