MHuangX/LAION-Beyond

Hugging Face · updated 2026-04-09 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/MHuangX/LAION-Beyond
---
license: cc-by-sa-4.0
task_categories:
- image-classification
- zero-shot-classification
language:
- en
tags:
- vision-language
- CLIP
- out-of-pre-training
- OOP
- benchmark
- multimodal
- few-shot
- zero-shot
pretty_name: LAION-Beyond
size_categories:
- 100K<n<1M
---

# LAION-Beyond: Reproducible Vision-Language Models Meet Concepts Out of Pre-Training

<p align="center">
📄 <a href="https://openaccess.thecvf.com/content/CVPR2025/papers/Chen_Reproducible_Vision-Language_Models_Meet_Concepts_Out_of_Pre-Training_CVPR_2025_paper.pdf">Paper (CVPR 2025)</a> |
💻 <a href="https://github.com/M-HuangX/LAION-Beyond">Code</a> |
🌐 <a href="https://github.com/M-HuangX/laion_beyond">Project Page</a>
</p>

## Dataset Summary

LAION-Beyond is the **first multi-domain benchmark** specifically designed to evaluate the Out-of-Pre-training (OOP) generalization of vision-language models (e.g., CLIP, OpenCLIP, EVA-CLIP).

We distinguish two types of visual concepts:

- **IP (In-Pre-training)**: concepts that appear in the pre-training data (e.g., LAION-400M / 2B / 5B)
- **OOP (Out-of-Pre-training)**: concepts entirely absent from the pre-training data

<p align="center">
<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure1_OOP_IP_difference.jpg" alt="IP vs OOP Difference" width="80%">
<br>
<em>Figure 1: Comparison between IP and OOP generalization. The former evaluates generalization within seen visual concepts, while the latter tests concepts absent during pre-training.</em>
</p>

The key finding of our paper is that despite OpenCLIP's image encoder forming well-separated clusters for OOP concepts, **zero-shot transfer fails significantly** due to poor image-text alignment: the token embeddings for OOP class names were never aligned with visual features during pre-training.
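To make the alignment failure described above concrete, here is a minimal sketch of zero-shot classification as nearest-text-embedding lookup by cosine similarity. The 2-D vectors and the class names `ip_concept` and `oop_concept` are invented for illustration; they are not taken from OpenCLIP or from this dataset. The image embeddings of the two concepts are well separated, but the OOP name embedding does not point toward its images, so the OOP image is assigned to the wrong class.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_predict(image_emb, text_embs):
    """Return the class name whose text embedding is most similar to the image."""
    return max(text_embs, key=lambda name: cosine(image_emb, text_embs[name]))


# Toy image embeddings (hypothetical): the two concepts are clearly separated.
image_embs = {
    "ip_concept": [1.0, 0.1],
    "oop_concept": [0.1, 1.0],
}

# Toy text (class-name) embeddings (hypothetical): the IP name is aligned with
# its images, while the OOP name was never aligned and points elsewhere.
text_embs = {
    "ip_concept": [0.9, 0.2],
    "oop_concept": [0.8, -0.3],
}

predictions = {
    true_name: zero_shot_predict(emb, text_embs)
    for true_name, emb in image_embs.items()
}
print(predictions)
```

In this toy setup the IP image is classified correctly while the OOP image is not, even though a clustering algorithm would separate the two image embeddings without difficulty.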

---

## Dataset Statistics

| Split     | Images      | Concepts |
| --------- | ----------- | -------- |
| OOP       | 106,052     | 674      |
| IP        | 51,330      | 324      |
| **Total** | **157,382** | **998**  |

<p align="center">
<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure2a_LAION_Beyond_Distribution.png" width="48%">
<img src="https://raw.githubusercontent.com/M-HuangX/laion_beyond/master/static/images/Figure2b_Image_Counts_per_category.png" width="48%">
<br>
<em>Figure 2: (Left) Statistics of OOP/IP concepts across different LAION scales; (Right) Detailed train/val/test split in LAION-Beyond (400M).</em>
</p>

### Domains Covered:

- 🐾 **Animals** | 🏛️ **Architecture** | 👘 **Attire**
- 🎨 **FolkArt** | 🍜 **Food** | 🦋 **Insects & Spiders**
- 🗺️ **Landmark** | 🌿 **Plants & Fungi** | 🎮 **Pokemon**

Each domain contains an IP subset and an OOP subset, covering LAION-400M, LAION-2B, and LAION-5B scales to support neural scaling law research.

---

## Dataset Structure

Each domain folder is named `{Domain}{NumClasses}_{IP/OOP}`, e.g., `Animals42_IP`, `Animals92_OOP`.

```
LAION_Beyond/
├── Animals42_IP/
│   ├── images/                      # jpg images organized by class
│   ├── label2name.json              # label index → class name
│   ├── name2label.json              # class name → label index
│   ├── merged_mapping.json          # merged label mapping
│   └── split_Xin_Animals42_IP.json  # train/val/test split info
├── Animals92_OOP/
│   └── ...
├── Architecture23_IP/
├── Architecture50_OOP/
├── Attire28_IP/
├── Attire54_OOP/
├── FolkArt27_IP/
├── FolkArt59_OOP/
├── Food27_IP/
├── Food53_OOP/
├── Insects_Spiders52_IP/
├── Insects_Spiders106_OOP/
├── Landmark30_IP/
├── Landmark59_OOP/
├── Plants_Fugi56_IP/
├── Plants_Fugi113_OOP/
├── Pokemon39_IP/
└── Pokemon89_OOP/
```

### File Descriptions

| File                  | Description                                         |
| --------------------- | --------------------------------------------------- |
| `images/`             | Raw image files (JPG), organized by class subfolder |
| `label2name.json`     | Mapping from integer label to class name string     |
| `name2label.json`     | Mapping from class name string to integer label     |
| `merged_mapping.json` | Combined label mapping across splits                |
| `split_Xin_*.json`    | Train / val / test split assignments per image      |

---

## Loading the Dataset

### Option 1: Download full dataset (recommended)

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond"
)
```

### Option 2: Download a single domain only

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond",
    allow_patterns="Animals42_IP/**"
)
```

---

## Key Findings

1. **Strong image features for OOP concepts**: OpenCLIP's image encoder forms well-separated clusters for OOP concepts (clustering accuracy gap < 3% on most domains vs. IP concepts).
2. **Image-text alignment failure**: Zero-shot accuracy on OOP concepts is significantly lower than IP concepts, persisting even as pre-training data scales from 400M to 5B.
3. **Name-tuning is the key**: Our proposed FSNL and ZSNL algorithms, which fine-tune only the name (token) embeddings of OOP concepts, efficiently restore OOP generalization without degrading IP performance.
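Once a domain has been downloaded, the `{Domain}{NumClasses}_{IP/OOP}` folder convention and the `label2name.json` file described under Dataset Structure can be handled with a few lines of standard-library Python. The sketch below is illustrative only: the helper names `parse_domain_folder` and `load_label2name` are hypothetical, and it assumes `label2name.json` is a flat JSON object whose keys are label indices stored as strings, which the card does not confirm.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Matches names such as "Animals42_IP" or "Insects_Spiders106_OOP".
FOLDER_RE = re.compile(r"^(?P<domain>[A-Za-z_]+)(?P<num_classes>\d+)_(?P<subset>IP|OOP)$")

# Folder names copied from the directory tree in this card.
FOLDERS = [
    "Animals42_IP", "Animals92_OOP",
    "Architecture23_IP", "Architecture50_OOP",
    "Attire28_IP", "Attire54_OOP",
    "FolkArt27_IP", "FolkArt59_OOP",
    "Food27_IP", "Food53_OOP",
    "Insects_Spiders52_IP", "Insects_Spiders106_OOP",
    "Landmark30_IP", "Landmark59_OOP",
    "Plants_Fugi56_IP", "Plants_Fugi113_OOP",
    "Pokemon39_IP", "Pokemon89_OOP",
]


def parse_domain_folder(name):
    """Split a folder name into (domain, number of classes, 'IP' or 'OOP')."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return match["domain"], int(match["num_classes"]), match["subset"]


def load_label2name(domain_dir):
    """Load label2name.json, converting keys to int.

    Assumption: the file is a flat JSON object such as {"0": "class a", ...}.
    """
    with open(Path(domain_dir) / "label2name.json", encoding="utf-8") as f:
        raw = json.load(f)
    return {int(label): class_name for label, class_name in raw.items()}


# Sum the class counts encoded in the folder names, per subset.
totals = defaultdict(int)
for folder in FOLDERS:
    _, num_classes, subset = parse_domain_folder(folder)
    totals[subset] += num_classes
totals = dict(totals)
print(totals)
```

Summed over the folder names listed in the tree, the class counts come to 324 for IP, matching the statistics table, and 675 for OOP, one more than the 674 the table reports; the card does not say which figure is authoritative.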

---

## Algorithms

### FSNL: Few-Shot Name Learning

Optimizes only OOP concept name embeddings using a few image-text pairs, with context augmentation via similar concept shuffling. Achieves state-of-the-art on 8/9 domains.

### ZSNL: Zero-Shot Name Learning

Requires no image-text pairs. Uses Novel Class Discovery (NCD) and image-text bipartite graph matching to optimize OOP name embeddings from unlabeled images only.

---

## Benchmark Results (400M split)

### OOP Few-Shot Learning (4-shot, H-mean of OOP & IP accuracy)

| Method          | Animals   | Architecture | Attire    | FolkArt   | Food     | Insects   | Landmark  | Plants    | Pokemon   | Avg       |
| --------------- | --------- | ------------ | --------- | --------- | -------- | --------- | --------- | --------- | --------- | --------- |
| OpenCLIP        | 26.75     | 30.75        | 25.88     | 35.04     | 15.36    | 22.38     | 40.25     | 21.43     | 24.48     | 26.92     |
| CoOp            | 31.37     | 57.8         | 50.39     | 52.06     | 42.55    | 25.73     | 85.89     | 24.78     | 35.52     | 45.12     |
| CLIP-Adapter    | 38.98     | 59.27        | 64.56     | 56.32     | 64.32    | 32.51     | 90.82     | 31.97     | 54.99     | 54.86     |
| **FSNL (ours)** | **46.17** | **62.63**    | **71.65** | **63.03** | **70.0** | **44.03** | **94.48** | **44.12** | **68.87** | **62.55** |

---

## Citation

If you use LAION-Beyond in your research, please cite:

```bibtex
@inproceedings{chen2025reproducible,
  title={Reproducible vision-language models meet concepts out of pre-training},
  author={Chen, Ziliang and Huang, Xin and Fan, Xiaoxuan and Wang, Keze and Zhou, Yuyu and Guan, Quanlong and Lin, Liang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={14701--14711},
  year={2025}
}
```

---

## License

This dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](http://creativecommons.org/licenses/by-sa/4.0/).
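A note on the metric used in the benchmark results: the table reports an H-mean of OOP and IP accuracy. Assuming this denotes the standard harmonic mean (the card does not spell out the formula), it can be computed as in the sketch below. The function name `harmonic_mean` and the input accuracies are hypothetical; the card lists only the combined scores, not the per-subset accuracies.

```python
def harmonic_mean(oop_acc, ip_acc):
    """Harmonic mean of two accuracies: 2ab / (a + b).

    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    total = oop_acc + ip_acc
    if total == 0:
        return 0.0
    return 2.0 * oop_acc * ip_acc / total


# Hypothetical accuracies (percent), not taken from the paper.
example = harmonic_mean(20.0, 80.0)
print(example)
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the smaller of the two values, so a method scores well only if it does well on both the OOP and the IP subset.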

---

## Authors

[Xin Huang](https://www.linkedin.com/in/mhuangx/)†, [Ziliang Chen](https://scholar.google.com/citations?user=RC-LN4QAAAAJ&hl=en)†, Xiaoxuan Fan, [Keze Wang](https://kezewang.com/), Yuyu Zhou, [Quanlong Guan](https://scholar.google.com/citations?user=v4JiSqsAAAAJ&hl=en), [Liang Lin](http://www.linliang.net/)*

Affiliations: Peng Cheng Laboratory, Sun Yat-sen University, EPFL, Jinan University

†Equal Contribution · *Corresponding Author

Provider:
MHuangX