Felldude/LLM-PictureThis-22K
Source: Hugging Face. Updated 2026-05-10; indexed 2026-05-31.
Download link:
https://hf-mirror.com/datasets/Felldude/LLM-PictureThis-22K
Resource description:
---
license: cc-by-sa-4.0
language:
- en
tags:
- qwen
- LLM
size_categories:
- 10K<n<100K
---
# Picture This
## Dataset Description
Picture This is an image-text dataset covering a 22k-term vocabulary, created from a large-scale web image search crawl and refined using CLIP cosine similarity filtering.
For each vocabulary term, multiple images were retrieved from image search results, embedded using CLIP, filtered by semantic similarity, captioned, and exported into parquet format for multimodal and LLM training workflows.
The goal of the dataset is to provide visually consistent concept clusters for language-to-image representation learning.
---
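For readers who want to inspect the parquet data, a minimal loading sketch with the Hugging Face `datasets` library might look like the following. The repository ID comes from this page; the helper name `load_picture_this`, the `"train"` split name, and the use of streaming are assumptions, and the column names are not documented here, so the usage comments only inspect the keys.

```python
REPO_ID = "Felldude/LLM-PictureThis-22K"  # repository ID as listed on this page


def load_picture_this(split="train", streaming=True):
    """Open the dataset from the Hugging Face Hub.

    The split name "train" is an assumption; check the repository if it fails.
    Streaming avoids downloading every parquet shard up front.
    """
    # Imported lazily so the module can be read without the dependency installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


# Usage (requires network access and `pip install datasets`):
# ds = load_picture_this()
# first = next(iter(ds))
# print(first.keys())  # inspect the actual column names before relying on them
```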
## Dataset Statistics
| Metric | Value |
|---|---|
| Vocabulary Size | 22,000 |
| Initial Crawl Size | ~200,000 images |
| Filtered + Captioned Images | ~180,000 |
| Format | Parquet |
| Data Type | Image + Caption Pairs |
---
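The approximate figures in the table imply that filtering kept about 90% of the crawl, or roughly 9 crawled and 8 retained images per vocabulary term. A small arithmetic check, using only the rounded table values (so the results are approximate too):

```python
# Approximate values taken from the statistics table above.
vocabulary_size = 22_000
initial_images = 200_000
retained_images = 180_000

retention_rate = retained_images / initial_images
crawled_per_term = initial_images / vocabulary_size
retained_per_term = retained_images / vocabulary_size

print(f"retention rate:    {retention_rate:.0%}")     # 90%
print(f"crawled per term:  {crawled_per_term:.1f}")   # 9.1
print(f"retained per term: {retained_per_term:.1f}")  # 8.2
```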
## Dataset Creation
### Pipeline
```text
Image Search Web Crawl
↓
CLIP Embedding Extraction
↓
Cosine Similarity Filtering
↓
Low Similarity Removal
↓
Image Captioning
↓
Parquet Export
```

### Collection Process

For each vocabulary term:

1. Retrieve approximately 10 images from a web image search crawl.
2. Generate CLIP embeddings for all images.
3. Compare embeddings using cosine similarity.
4. Remove visually inconsistent or low-similarity samples.
5. Caption the remaining images.
6. Export image-caption pairs into Parquet format.
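
The card does not say what each embedding was compared against or which cutoff was used. As an illustration only, the sketch below assumes each image is scored against the mean embedding of its term's images and dropped below a fixed threshold. The function names, the `0.8` threshold, and the toy 2-D vectors are hypothetical; real CLIP embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_centroid(embeddings, threshold=0.8):
    """Return indices of embeddings whose similarity to the mean embedding
    is at least `threshold`. Centroid scoring is an assumption, not the
    author's documented method."""
    if not embeddings:
        return []
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return [
        i for i, e in enumerate(embeddings)
        if cosine_similarity(e, centroid) >= threshold
    ]


# Toy stand-ins for one term's image embeddings: three similar, one outlier.
toy = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
kept = filter_by_centroid(toy, threshold=0.8)
print(kept)  # [0, 1, 2]
```

Under this assumption, a tight majority cluster pulls the mean toward itself, so minority images score low and are removed even when they are valid depictions of the term.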
## Intended Uses

- Multimodal LLM training
- CLIP-style contrastive learning
- Visual grounding research
- Semantic clustering experiments
- Synthetic caption training
- Vocabulary visualization studies
## Known Issues

### Public Figure Dominance

Names of people often collapse into a single highly represented identity.

Example:

- `Aaron`

Frequently resolved into images of:

- Aaron Carter

Even after cosine similarity filtering, the retained images for this term remained highly concentrated around one person.
### Semantic Convergence

Closely related vocabulary terms sometimes converge into nearly identical visual outputs.

Example:

- `draw`
- `drawing`
- `draws`

All three frequently resolved into sketch artwork of female subjects.
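
One way a downstream user might look for this kind of convergence is to compare the mean embedding of each term's images and flag term pairs that are nearly identical. The sketch below is a hypothetical check, not part of the published pipeline; the function names, the `0.95` threshold, and the toy 2-D vectors are assumptions.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def converging_terms(term_embeddings, threshold=0.95):
    """Return (term_a, term_b, similarity) for term pairs whose image
    centroids have cosine similarity of at least `threshold`."""
    centroids = {term: centroid(vecs) for term, vecs in term_embeddings.items()}
    flagged = []
    for a, b in combinations(sorted(centroids), 2):
        sim = cosine(centroids[a], centroids[b])
        if sim >= threshold:
            flagged.append((a, b, round(sim, 3)))
    return flagged


# Toy stand-ins for per-term image embeddings.
toy = {
    "draw": [[1.0, 0.1], [0.9, 0.2]],
    "drawing": [[0.95, 0.15], [1.0, 0.2]],
    "harbor": [[0.1, 1.0], [0.2, 0.9]],
}
pairs = [(a, b) for a, b, _ in converging_terms(toy)]
print(pairs)  # [('draw', 'drawing')]
```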
### Geographic Representation Bias

Town and city names are usually visually accurate, but image search results heavily favor:

- Overhead views
- Skylines
- Distant photography

This may bias downstream models toward those visual representations.
### Commercial Product Dominance

Certain historical or cultural terms become dominated by commercial products.

Example:

- `Akbar`

Frequently resolved into:

- Akbar Tea

instead of historical figures or historical imagery.

Because the product images were highly visually consistent, cosine similarity filtering reinforced this behavior.
## Limitations

- Web-scale image search bias
- Public figure overrepresentation
- Commercial brand dominance
- Geographic and cultural imbalance
- Caption quality depends on captioning model quality
- Cosine similarity filtering may reinforce dominant visual concepts rather than semantic diversity

This dataset should not be considered a balanced representation of concepts or language.
## Planned Releases

Future releases may include:

- High-similarity subsets
- Low-similarity discarded samples
- Raw pre-filter image sets
- CLIP similarity metadata

Currently, only the captioned Parquet dataset is public.
## Citation

```bibtex
@dataset{picturethis2026,
  title={Picture This},
  author={Felldude},
  year={2026},
  description={A 22k vocabulary image-text dataset refined using CLIP cosine similarity filtering.}
}
```
Provider: Felldude



