MHuangX/LAION-Beyond

Name: MHuangX/LAION-Beyond
Creator: MHuangX
Published: 2026-04-09 23:03:05
License: CC BY-SA 4.0

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/MHuangX/LAION-Beyond

Resource description:

---
license: cc-by-sa-4.0
task_categories:
- image-classification
- zero-shot-classification
language:
- en
tags:
- vision-language
- CLIP
- out-of-pre-training
- OOP
- benchmark
- multimodal
- few-shot
- zero-shot
pretty_name: LAION-Beyond
size_categories:
- 100K<n<1M
---

# LAION-Beyond: Reproducible Vision-Language Models Meet Concepts Out of Pre-Training

📄 <a href="https://openaccess.thecvf.com/content/CVPR2025/papers/Chen_Reproducible_Vision-Language_Models_Meet_Concepts_Out_of_Pre-Training_CVPR_2025_paper.pdf">Paper (CVPR 2025)</a> | 💻 <a href="https://github.com/M-HuangX/LAION-Beyond">Code</a> | 🌐 <a href="https://github.com/M-HuangX/laion_beyond">Project Page</a>

## Dataset Summary

LAION-Beyond is the **first multi-domain benchmark** specifically designed to evaluate the Out-of-Pre-training (OOP) generalization of vision-language models (e.g., CLIP, OpenCLIP, EVA-CLIP). We distinguish two types of visual concepts:

- **IP (In-Pre-training)**: concepts that appear in the pre-training data (e.g., LAION-400M / 2B / 5B)
- **OOP (Out-of-Pre-training)**: concepts entirely absent from the pre-training data

<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure1_OOP_IP_difference.jpg" alt="IP vs OOP Difference" width="80%">

Figure 1: Comparison between IP and OOP generalization. The former evaluates generalization within seen visual concepts, while the latter tests concepts absent during pre-training.

The key finding of our paper is that despite OpenCLIP's image encoder forming well-separated clusters for OOP concepts, **zero-shot transfer fails significantly** due to poor image-text alignment: the token embeddings for OOP class names were never aligned with visual features during pre-training.
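To make the alignment failure concrete, here is a minimal toy sketch of CLIP-style zero-shot classification, where each image is assigned to the class whose name embedding has the highest cosine similarity with the image embedding. All vectors are made-up 2-D values chosen for illustration (they are not OpenCLIP outputs), and `zero_shot_predict` and `accuracy` are hypothetical helpers. The image clusters are identical in both runs; only the name embeddings differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, name_embs):
    """Return the class name whose embedding is most similar to the image."""
    return max(name_embs, key=lambda name: cosine(image_emb, name_embs[name]))

# Made-up image embeddings: two tight, well-separated clusters.
images = {
    "concept_a": [[1.0, 0.1], [0.9, 0.0]],
    "concept_b": [[0.1, 1.0], [0.0, 0.9]],
}

# Aligned name embeddings (IP-like): each name points at its own cluster.
aligned_names = {"concept_a": [1.0, 0.0], "concept_b": [0.0, 1.0]}

# Misaligned name embeddings (OOP-like): names bear no relation to the clusters.
misaligned_names = {"concept_a": [1.0, 1.0], "concept_b": [-1.0, -1.0]}

def accuracy(name_embs):
    """Fraction of toy images assigned to their true class."""
    pairs = [(label, emb) for label, embs in images.items() for emb in embs]
    hits = sum(zero_shot_predict(emb, name_embs) == label for label, emb in pairs)
    return hits / len(pairs)

aligned_acc = accuracy(aligned_names)        # every image matched correctly
misaligned_acc = accuracy(misaligned_names)  # all images collapse onto one class
```

With aligned names every image is classified correctly. With misaligned names all four images are assigned to `concept_a`, so accuracy falls to chance even though the image features are unchanged.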
---

## Dataset Statistics

| Split     | Images      | Concepts |
| --------- | ----------- | -------- |
| OOP       | 106,052     | 674      |
| IP        | 51,330      | 324      |
| **Total** | **157,382** | **998**  |

<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure2a_LAION_Beyond_Distribution.png" width="48%"> <img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure2b_Image_Counts_per_category.png" width="48%">

Figure 2: (Left) Statistics of OOP/IP concepts across different LAION scales; (Right) Detailed train/val/test split in LAION-Beyond (400M).

### Domains Covered:

- 🐾 **Animals** | 🏛️ **Architecture** | 👘 **Attire**
- 🎨 **FolkArt** | 🍜 **Food** | 🦋 **Insects & Spiders**
- 🗺️ **Landmark** | 🌿 **Plants & Fungi** | 🎮 **Pokemon**

Each domain contains an IP subset and an OOP subset, covering LAION-400M, LAION-2B, and LAION-5B scales to support neural scaling law research.

---

## Dataset Structure

Each domain folder is named `{Domain}{NumClasses}_{IP/OOP}`, e.g., `Animals42_IP`, `Animals92_OOP`.

```
LAION_Beyond/
├── Animals42_IP/
│   ├── images/                      # jpg images organized by class
│   ├── label2name.json              # label index → class name
│   ├── name2label.json              # class name → label index
│   ├── merged_mapping.json          # merged label mapping
│   └── split_Xin_Animals42_IP.json  # train/val/test split info
├── Animals92_OOP/
│   └── ...
├── Architecture23_IP/
├── Architecture50_OOP/
├── Attire28_IP/
├── Attire54_OOP/
├── FolkArt27_IP/
├── FolkArt59_OOP/
├── Food27_IP/
├── Food53_OOP/
├── Insects_Spiders52_IP/
├── Insects_Spiders106_OOP/
├── Landmark30_IP/
├── Landmark59_OOP/
├── Plants_Fugi56_IP/
├── Plants_Fugi113_OOP/
├── Pokemon39_IP/
└── Pokemon89_OOP/
```

### File Descriptions

| File                  | Description                                         |
| --------------------- | --------------------------------------------------- |
| `images/`             | Raw image files (JPG), organized by class subfolder |
| `label2name.json`     | Mapping from integer label to class name string     |
| `name2label.json`     | Mapping from class name string to integer label     |
| `merged_mapping.json` | Combined label mapping across splits                |
| `split_Xin_*.json`    | Train / val / test split assignments per image      |

---

## Loading the Dataset

### Option 1: Download full dataset (recommended)

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond"
)
```

### Option 2: Download a single domain only

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond",
    allow_patterns="Animals42_IP/**"
)
```

---

## Key Findings

1. **Strong image features for OOP concepts**: OpenCLIP's image encoder forms well-separated clusters for OOP concepts (clustering accuracy gap < 3% on most domains vs. IP concepts).
2. **Image-text alignment failure**: Zero-shot accuracy on OOP concepts is significantly lower than IP concepts, persisting even as pre-training data scales from 400M to 5B.
3. **Name-tuning is the key**: Our proposed FSNL and ZSNL algorithms, which fine-tune only the name (token) embeddings of OOP concepts, efficiently restore OOP generalization without degrading IP performance.
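Returning to the on-disk layout from the Dataset Structure section: the `{Domain}{NumClasses}_{IP/OOP}` naming convention can be parsed programmatically, which is handy when iterating over domains after a download. The following is a minimal sketch. The folder list is copied from the directory tree above, while `FOLDER_RE`, `parse_folder`, and `classes_per_subset` are illustrative names and not part of the dataset's own tooling.

```python
import re
from collections import defaultdict

# Folder names copied from the directory tree in the dataset card.
FOLDERS = [
    "Animals42_IP", "Animals92_OOP",
    "Architecture23_IP", "Architecture50_OOP",
    "Attire28_IP", "Attire54_OOP",
    "FolkArt27_IP", "FolkArt59_OOP",
    "Food27_IP", "Food53_OOP",
    "Insects_Spiders52_IP", "Insects_Spiders106_OOP",
    "Landmark30_IP", "Landmark59_OOP",
    "Plants_Fugi56_IP", "Plants_Fugi113_OOP",
    "Pokemon39_IP", "Pokemon89_OOP",
]

# {Domain}{NumClasses}_{IP/OOP}; the domain part may itself contain underscores.
FOLDER_RE = re.compile(
    r"^(?P<domain>[A-Za-z_]+?)(?P<num_classes>\d+)_(?P<subset>IP|OOP)$"
)

def parse_folder(name):
    """Split a folder name into (domain, num_classes, subset)."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return match["domain"], int(match["num_classes"]), match["subset"]

# Sum the class counts encoded in the folder names, per subset.
classes_per_subset = defaultdict(int)
for folder in FOLDERS:
    _domain, num_classes, subset = parse_folder(folder)
    classes_per_subset[subset] += num_classes
```

Summing the counts this way gives 324 classes for IP, matching the statistics table, and 675 for OOP, one more than the 674 listed there. The card does not explain the difference.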
---

## Algorithms

### FSNL: Few-Shot Name Learning

Optimizes only OOP concept name embeddings using a few image-text pairs, with context augmentation via similar concept shuffling. Achieves state-of-the-art on 8/9 domains.

### ZSNL: Zero-Shot Name Learning

Requires no image-text pairs. Uses Novel Class Discovery (NCD) and image-text bipartite graph matching to optimize OOP name embeddings from unlabeled images only.

---

## Benchmark Results (400M split)

### OOP Few-Shot Learning (4-shot, H-mean of OOP & IP accuracy)

| Method          | Animals   | Architecture | Attire    | FolkArt   | Food     | Insects   | Landmark  | Plants    | Pokemon   | Avg       |
| --------------- | --------- | ------------ | --------- | --------- | -------- | --------- | --------- | --------- | --------- | --------- |
| OpenCLIP        | 26.75     | 30.75        | 25.88     | 35.04     | 15.36    | 22.38     | 40.25     | 21.43     | 24.48     | 26.92     |
| CoOp            | 31.37     | 57.8         | 50.39     | 52.06     | 42.55    | 25.73     | 85.89     | 24.78     | 35.52     | 45.12     |
| CLIP-Adapter    | 38.98     | 59.27        | 64.56     | 56.32     | 64.32    | 32.51     | 90.82     | 31.97     | 54.99     | 54.86     |
| **FSNL (ours)** | **46.17** | **62.63**    | **71.65** | **63.03** | **70.0** | **44.03** | **94.48** | **44.12** | **68.87** | **62.55** |

---

## Citation

If you use LAION-Beyond in your research, please cite:

```bibtex
@inproceedings{chen2025reproducible,
  title={Reproducible vision-language models meet concepts out of pre-training},
  author={Chen, Ziliang and Huang, Xin and Fan, Xiaoxuan and Wang, Keze and Zhou, Yuyu and Guan, Quanlong and Lin, Liang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={14701--14711},
  year={2025}
}
```

---

## License

This dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](http://creativecommons.org/licenses/by-sa/4.0/).
---

## Authors

[Xin Huang](https://www.linkedin.com/in/mhuangx/)†, [Ziliang Chen](https://scholar.google.com/citations?user=RC-LN4QAAAAJ&hl=en)†, Xiaoxuan Fan, [Keze Wang](https://kezewang.com/), Yuyu Zhou, [Quanlong Guan](https://scholar.google.com/citations?user=v4JiSqsAAAAJ&hl=en), [Liang Lin](http://www.linliang.net/)*

Affiliations: Peng Cheng Laboratory, Sun Yat-sen University, EPFL, Jinan University

†Equal Contribution · *Corresponding Author
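A note on the metric in the Benchmark Results table: scores are reported as the H-mean of OOP and IP accuracy. The card does not spell out the formula, so the sketch below assumes the standard harmonic mean, H = 2ab / (a + b). The function name and the accuracy values are hypothetical and are not taken from the paper.

```python
def harmonic_mean(oop_acc, ip_acc):
    """Harmonic mean of two accuracies (in percent); 0.0 if either is non-positive."""
    if oop_acc <= 0 or ip_acc <= 0:
        return 0.0
    return 2 * oop_acc * ip_acc / (oop_acc + ip_acc)

# Hypothetical accuracies for illustration only.
balanced = harmonic_mean(60.0, 60.0)  # equal inputs: H-mean equals the inputs
lopsided = harmonic_mean(90.0, 30.0)  # same arithmetic mean (60), lower H-mean
```

Under this definition, a method that gains OOP accuracy at the expense of IP accuracy (or the reverse) scores lower than a balanced one with the same arithmetic mean. That fits the card's emphasis on restoring OOP generalization without degrading IP performance.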

Provider:

MHuangX
