AliN96/midjourney-prompts-embeddings
Source: Hugging Face (updated 2026-04-06, indexed 2026-04-12)
Download link:
https://hf-mirror.com/datasets/AliN96/midjourney-prompts-embeddings
Description:
---
license: mit
---
# Midjourney Prompt–Embedding Dataset
This dataset is derived from our COLM 2024 paper, *[Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images](https://openreview.net/pdf?id=SwUsFTtM9h)*. The paper studies whether multimodal language models can infer prompts that generate images visually similar to target images produced by text-to-image systems or found in stock image collections, highlighting the relationship between real-world prompts and generated images as well as broader economic and security implications.
The dataset consists of a processed subset of approximately 4.5 million samples collected from publicly accessible Midjourney Discord channels. Each sample includes an anonymized prompt, the corresponding Midjourney model version, and a mapping to a precomputed image embedding. Embeddings are computed using CLIP (`laion/CLIP-ViT-g-14-laion2B-s12B-b42K`) and stored separately in shard files.
⚠️ Raw images are not redistributed due to platform terms of service, copyright considerations, and potential misuse risks.
## Accessing Image Embeddings
The dataset provides a mapping to precomputed image embeddings stored in shard files (`.npy`). These embeddings are not loaded automatically with `load_dataset` and should be accessed separately using the `EmbeddingShard` and `EmbeddingIndex` fields.
### Option 1: Download a single embedding shard
```python
import numpy as np
from datasets import load_dataset
from huggingface_hub import hf_hub_download

# Load the prompt metadata (embeddings are not included here).
ds = load_dataset("AliN96/midjourney-prompts-embeddings", split="train")
sample = ds[0]

# Download only the shard that holds this sample's embedding.
file_path = hf_hub_download(
    repo_id="AliN96/midjourney-prompts-embeddings",
    filename=sample["EmbeddingShard"],  # e.g., embeddings_shard_00000_start_0_end_200000.npy
    repo_type="dataset",
    cache_dir="./hf_cache",  # optional
)

# Index into the shard to get the sample's embedding vector.
embeddings = np.load(file_path)
vector = embeddings[sample["EmbeddingIndex"]]
```
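Loading a whole shard with `np.load` reads every row into memory even when only one vector is needed. The sketch below memory-maps the file instead. It assumes the shards are plain numeric NumPy arrays with one row per sample; `load_embedding` is a hypothetical helper name, not part of the dataset's tooling.

```python
import numpy as np


def load_embedding(shard_path, index):
    """Return one embedding row from a .npy shard without loading the whole file.

    Assumes the shard is a plain (non-pickled) numeric array, one row per sample.
    """
    # mmap_mode="r" maps the file read-only; only the accessed rows are read from disk.
    shard = np.load(shard_path, mmap_mode="r")
    # Copy the row into a regular in-memory array so it no longer depends on the map.
    return np.array(shard[index])
```

With the variables from Option 1, this would be called as `load_embedding(file_path, sample["EmbeddingIndex"])`.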
### Option 2: Download all embedding shards
```python
from huggingface_hub import snapshot_download

# Fetch every .npy shard from the dataset repository.
snapshot_download(
    repo_id="AliN96/midjourney-prompts-embeddings",
    repo_type="dataset",
    allow_patterns="*.npy",
    local_dir="./embeddings",
)
```
You can change `local_dir` to control where the embedding files are stored locally.
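Once the shards are stored locally, embeddings for many samples can be gathered by grouping rows by shard so that each file is opened only once. This is a minimal sketch, not the authors' tooling: `gather_embeddings` is a hypothetical helper, and it assumes that each `EmbeddingShard` value is the shard's path relative to `local_dir` and that shards are plain numeric arrays.

```python
from collections import defaultdict
from pathlib import Path

import numpy as np


def gather_embeddings(rows, embeddings_dir):
    """Return an array whose i-th row is the embedding for rows[i].

    `rows` is a non-empty sequence of dict-like samples that carry the
    `EmbeddingShard` and `EmbeddingIndex` fields.
    """
    # Group row positions by shard so each shard file is opened only once.
    by_shard = defaultdict(list)
    for position, row in enumerate(rows):
        by_shard[row["EmbeddingShard"]].append((position, row["EmbeddingIndex"]))

    vectors = [None] * len(rows)
    for shard_name, entries in by_shard.items():
        # Memory-map the shard so only the requested rows are read from disk.
        shard = np.load(Path(embeddings_dir) / shard_name, mmap_mode="r")
        for position, index in entries:
            vectors[position] = np.array(shard[index])
    return np.stack(vectors)
```

For example, `gather_embeddings(ds.select(range(1000)), "./embeddings")` would return the embeddings for the first 1,000 samples in dataset order.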
👉 For full details on data collection, preprocessing, and usage, please refer to the GitHub repository:
[GitHub Repository](https://github.com/SPIN-UMass/MidjourneyDataset)
---
## Citation
If you use this dataset, please cite:
```bibtex
@inproceedings{
naseh2024iteratively,
title={Iteratively Prompting Multimodal {LLM}s to Reproduce Natural and {AI}-Generated Images},
author={Ali Naseh and Katherine Thai and Mohit Iyyer and Amir Houmansadr},
booktitle={First Conference on Language Modeling},
year={2024},
url={https://openreview.net/forum?id=SwUsFTtM9h}
}
```
Provider:
AliN96



