MHuangX/LAION-Beyond
---
license: cc-by-sa-4.0
task_categories:
- image-classification
- zero-shot-classification
language:
- en
tags:
- vision-language
- CLIP
- out-of-pre-training
- OOP
- benchmark
- multimodal
- few-shot
- zero-shot
pretty_name: LAION-Beyond
size_categories:
- 100K<n<1M
---
# LAION-Beyond: Reproducible Vision-Language Models Meet Concepts Out of Pre-Training
<p align="center">
📄 <a href="https://openaccess.thecvf.com/content/CVPR2025/papers/Chen_Reproducible_Vision-Language_Models_Meet_Concepts_Out_of_Pre-Training_CVPR_2025_paper.pdf">Paper (CVPR 2025)</a> |
💻 <a href="https://github.com/M-HuangX/LAION-Beyond">Code</a> |
🌐 <a href="https://github.com/M-HuangX/laion_beyond">Project Page</a>
</p>
## Dataset Summary
LAION-Beyond is the **first multi-domain benchmark** specifically designed to evaluate the Out-of-Pre-training (OOP) generalization of vision-language models (e.g., CLIP, OpenCLIP, EVA-CLIP).
We distinguish two types of visual concepts:
- **IP (In-Pre-training)**: concepts that appear in the pre-training data (e.g., LAION-400M / 2B / 5B)
- **OOP (Out-of-Pre-training)**: concepts entirely absent from the pre-training data
<p align="center">
<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure1_OOP_IP_difference.jpg" alt="IP vs OOP Difference" width="80%">
<br>
<em>Figure 1: Comparison between IP and OOP generalization. The former evaluates generalization within seen visual concepts, while the latter tests concepts absent during pre-training.</em>
</p>
The paper's key finding is that although OpenCLIP's image encoder forms well-separated clusters for OOP concepts, **zero-shot transfer fails significantly** because of poor image-text alignment: the token embeddings for OOP class names were never aligned with visual features during pre-training.
---
## Dataset Statistics
| Split | Images | Concepts |
| --------- | ----------- | -------- |
| OOP | 106,052 | 674 |
| IP | 51,330 | 324 |
| **Total** | **157,382** | **998** |
<p align="center">
<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure2a_LAION_Beyond_Distribution.png" width="48%">
<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure2b_Image_Counts_per_category.png" width="48%">
<br>
<em>Figure 2: (Left) Statistics of OOP/IP concepts across different LAION scales; (Right) Detailed train/val/test split in LAION-Beyond (400M).</em>
</p>
### Domains Covered
- 🐾 **Animals** | 🏛️ **Architecture** | 👘 **Attire**
- 🎨 **FolkArt** | 🍜 **Food** | 🦋 **Insects & Spiders**
- 🗺️ **Landmark** | 🌿 **Plants & Fungi** | 🎮 **Pokemon**
Each domain contains an IP subset and an OOP subset, covering LAION-400M, LAION-2B, and LAION-5B scales to support neural scaling law research.
---
## Dataset Structure
Each domain folder is named `{Domain}{NumClasses}_{IP/OOP}`, e.g., `Animals42_IP`, `Animals92_OOP`.
```
LAION_Beyond/
├── Animals42_IP/
│   ├── images/                      # JPG images organized by class
│   ├── label2name.json              # label index → class name
│   ├── name2label.json              # class name → label index
│   ├── merged_mapping.json          # merged label mapping
│   └── split_Xin_Animals42_IP.json  # train/val/test split info
├── Animals92_OOP/
│   └── ...
├── Architecture23_IP/
├── Architecture50_OOP/
├── Attire28_IP/
├── Attire54_OOP/
├── FolkArt27_IP/
├── FolkArt59_OOP/
├── Food27_IP/
├── Food53_OOP/
├── Insects_Spiders52_IP/
├── Insects_Spiders106_OOP/
├── Landmark30_IP/
├── Landmark59_OOP/
├── Plants_Fugi56_IP/
├── Plants_Fugi113_OOP/
├── Pokemon39_IP/
└── Pokemon89_OOP/
```
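The naming scheme above can be parsed programmatically. The following is a minimal sketch; the helper name `parse_domain_folder` and the regular expression are illustrative assumptions, not part of the released code.

```python
import re

# Matches "{Domain}{NumClasses}_{IP/OOP}", e.g. "Insects_Spiders106_OOP".
# The pattern is an assumption derived from the folder names listed above.
FOLDER_RE = re.compile(r"^(?P<domain>[A-Za-z_]+)(?P<num>\d+)_(?P<subset>IP|OOP)$")


def parse_domain_folder(name):
    """Split a folder name into (domain, number of classes, subset)."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return match["domain"], int(match["num"]), match["subset"]


if __name__ == "__main__":
    for folder in ["Animals42_IP", "Animals92_OOP", "Insects_Spiders106_OOP"]:
        print(parse_domain_folder(folder))
```

This makes it easy to group folders by domain or to select only the OOP subsets when iterating over a local copy.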
### File Descriptions
| File | Description |
| --------------------- | --------------------------------------------------- |
| `images/` | Raw image files (JPG), organized by class subfolder |
| `label2name.json` | Mapping from integer label to class name string |
| `name2label.json` | Mapping from class name string to integer label |
| `merged_mapping.json` | Combined label mapping across splits |
| `split_Xin_*.json` | Train / val / test split assignments per image |
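
As a rough illustration of how these files might be read, the sketch below loads the label mapping and locates the split file of one domain folder. It assumes that `label2name.json` is a flat JSON object keyed by label; the internal schema of the split file is not described here, so the sketch only locates it. Both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_label_names(domain_dir):
    """Read label2name.json and return {int label: class name}.

    Assumes the file is a flat JSON object such as {"0": "name", ...}.
    """
    path = Path(domain_dir) / "label2name.json"
    with path.open(encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so cast them back to integers.
    return {int(label): name for label, name in raw.items()}


def find_split_file(domain_dir):
    """Return the single split_Xin_*.json file inside a domain folder."""
    matches = sorted(Path(domain_dir).glob("split_Xin_*.json"))
    if len(matches) != 1:
        raise FileNotFoundError(f"expected one split file, found {len(matches)}")
    return matches[0]
```

For example, `load_label_names("./LAION_Beyond/Animals42_IP")` would return the label dictionary for that subset once the folder has been downloaded.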
---
## Loading the Dataset
### Option 1: Download full dataset (recommended)
```python
from huggingface_hub import snapshot_download

# Download every file of the dataset repository into ./LAION_Beyond
local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond",
)
```
### Option 2: Download a single domain only
```python
from huggingface_hub import snapshot_download

# Restrict the download to files under the Animals42_IP/ folder
local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond",
    allow_patterns="Animals42_IP/**",
)
```
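
Since `allow_patterns` also accepts a list of patterns, the IP and OOP subsets of a domain can be fetched in one call. The sketch below wraps this in two hypothetical helpers, `domain_patterns` and `download_domains`; only `snapshot_download` itself comes from `huggingface_hub`.

```python
def domain_patterns(folders):
    """Turn folder names into glob patterns matching everything inside each folder."""
    return [f"{name}/**" for name in folders]


def download_domains(folders, local_dir="./LAION_Beyond"):
    """Download only the listed domain folders (requires network access)."""
    # Imported lazily so domain_patterns() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="MHuangX/LAION-Beyond",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=domain_patterns(folders),
    )
```

Calling `download_domains(["Animals42_IP", "Animals92_OOP"])` would then fetch both Animals subsets and skip the other domains.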
---
## Key Findings
1. **Strong image features for OOP concepts**: OpenCLIP's image encoder forms well-separated clusters for OOP concepts (clustering accuracy gap < 3% on most domains vs. IP concepts).
2. **Image-text alignment failure**: Zero-shot accuracy on OOP concepts is significantly lower than on IP concepts, and the gap persists even as the pre-training data scales from 400M to 5B.
3. **Name-tuning is the key**: Our proposed FSNL and ZSNL algorithms, which fine-tune only the name (token) embeddings of OOP concepts, efficiently restore OOP generalization without degrading IP performance.
---
## Algorithms
### FSNL — Few-Shot Name Learning
Optimizes only the name embeddings of OOP concepts using a few image-text pairs, with context augmentation via shuffling of similar concepts. It achieves state-of-the-art results on 8 of the 9 domains.
### ZSNL — Zero-Shot Name Learning
Requires no image-text pairs. It uses Novel Class Discovery (NCD) and image-text bipartite graph matching to optimize OOP name embeddings from unlabeled images alone.
---
## Benchmark Results (400M split)
### OOP Few-Shot Learning (4-shot, H-mean of OOP & IP accuracy)
| Method | Animals | Architecture | Attire | FolkArt | Food | Insects | Landmark | Plants | Pokemon | Avg |
| --------------- | --------- | ------------ | --------- | --------- | -------- | --------- | --------- | --------- | --------- | --------- |
| OpenCLIP | 26.75 | 30.75 | 25.88 | 35.04 | 15.36 | 22.38 | 40.25 | 21.43 | 24.48 | 26.92 |
| CoOp | 31.37 | 57.80 | 50.39 | 52.06 | 42.55 | 25.73 | 85.89 | 24.78 | 35.52 | 45.12 |
| CLIP-Adapter | 38.98 | 59.27 | 64.56 | 56.32 | 64.32 | 32.51 | 90.82 | 31.97 | 54.99 | 54.86 |
| **FSNL (ours)** | **46.17** | **62.63** | **71.65** | **63.03** | **70.00** | **44.03** | **94.48** | **44.12** | **68.87** | **62.55** |
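
The scores above are reported as the H-mean of OOP and IP accuracy. Assuming this denotes the standard harmonic mean of the two accuracies (the exact formula is not spelled out here), it can be computed as in the sketch below; the function name `h_mean` and the example values are illustrative only.

```python
def h_mean(oop_acc, ip_acc):
    """Harmonic mean of two accuracies given in percent."""
    if oop_acc + ip_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * oop_acc * ip_acc / (oop_acc + ip_acc)


if __name__ == "__main__":
    # Hypothetical accuracies, not taken from the paper.
    print(round(h_mean(40.0, 80.0), 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, a method cannot score well by improving OOP accuracy at the expense of IP accuracy, or vice versa.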
---
## Citation
If you use LAION-Beyond in your research, please cite:
```bibtex
@inproceedings{chen2025reproducible,
title={Reproducible vision-language models meet concepts out of pre-training},
author={Chen, Ziliang and Huang, Xin and Fan, Xiaoxuan and Wang, Keze and Zhou, Yuyu and Guan, Quanlong and Lin, Liang},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={14701--14711},
year={2025}
}
```
---
## License
This dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](http://creativecommons.org/licenses/by-sa/4.0/).
---
## Authors
[Xin Huang](https://www.linkedin.com/in/mhuangx/)†, [Ziliang Chen](https://scholar.google.com/citations?user=RC-LN4QAAAAJ&hl=en)†, Xiaoxuan Fan, [Keze Wang](https://kezewang.com/), Yuyu Zhou, [Quanlong Guan](https://scholar.google.com/citations?user=v4JiSqsAAAAJ&hl=en), [Liang Lin](http://www.linliang.net/)*
Affiliations: Peng Cheng Laboratory, Sun Yat-sen University, EPFL, Jinan University
†Equal Contribution · *Corresponding Author