AliN96/midjourney-prompts-embeddings

Name: AliN96/midjourney-prompts-embeddings
Creator: AliN96
Published: 2026-04-06 03:47:18
License: MIT

Source: Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/AliN96/midjourney-prompts-embeddings

Description:

---
license: mit
---

# Midjourney Prompt–Embedding Dataset

This dataset is derived from our COLM 2024 paper, *[Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images](https://openreview.net/pdf?id=SwUsFTtM9h)*. The paper studies whether multimodal language models can infer prompts that generate images visually similar to target images produced by text-to-image systems or found in stock image collections, highlighting the relationship between real-world prompts and generated images as well as broader economic and security implications.

The dataset consists of a processed subset of approximately 4.5 million samples collected from publicly accessible Midjourney Discord channels. Each sample includes an anonymized prompt, the corresponding Midjourney model version, and a mapping to a precomputed image embedding. Embeddings are computed using CLIP (`laion/CLIP-ViT-g-14-laion2B-s12B-b42K`) and stored separately in shard files.

⚠️ Raw images are not redistributed due to platform terms of service, copyright considerations, and potential misuse risks.

## Accessing Image Embeddings

The dataset provides a mapping to precomputed image embeddings stored in shard files (`.npy`). These embeddings are not loaded automatically with `load_dataset` and should be accessed separately using the `EmbeddingShard` and `EmbeddingIndex` fields.
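Because `EmbeddingIndex` is a row offset within the file named by `EmbeddingShard`, looking up many samples is cheaper when rows are grouped by shard so that each file is opened only once. The following is a minimal sketch of that idea, assuming the shard files are already on disk in a local directory and that each shard is a 2-D array with one embedding per row. The helper name `gather_embeddings` and its `shard_dir` argument are illustrative and not part of the dataset or any library.

```python
from collections import defaultdict
from pathlib import Path

import numpy as np


def gather_embeddings(rows, shard_dir):
    """Return one embedding per row, in the same order as `rows`.

    `rows` is any sequence of dicts with "EmbeddingShard" and
    "EmbeddingIndex" keys (the two fields named in the dataset card).
    `shard_dir` is a local directory holding the .npy shard files.
    Both the function name and `shard_dir` are hypothetical.
    """
    if len(rows) == 0:
        return np.empty((0, 0), dtype=np.float32)

    # Group row positions by shard so each file is opened only once.
    by_shard = defaultdict(list)
    for pos, row in enumerate(rows):
        by_shard[row["EmbeddingShard"]].append((pos, int(row["EmbeddingIndex"])))

    out = [None] * len(rows)
    for shard_name, items in by_shard.items():
        # mmap_mode="r" maps the file instead of reading it all into memory.
        shard = np.load(Path(shard_dir) / shard_name, mmap_mode="r")
        for pos, idx in items:
            out[pos] = np.array(shard[idx])  # copy the row out of the memmap
    return np.stack(out)
```

For example, `gather_embeddings(ds.select(range(1000)), "./embeddings")` would return a `(1000, d)` array if the matching shards have been downloaded to `./embeddings`. Memory-mapping is a design choice for the case where only a few rows per shard are needed. When most of a shard is read, a plain `np.load` may be simpler.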
### Option 1: Download a single embedding shard

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
import numpy as np

ds = load_dataset("AliN96/midjourney-prompts-embeddings", split="train")
sample = ds[0]

file_path = hf_hub_download(
    repo_id="AliN96/midjourney-prompts-embeddings",
    filename=sample["EmbeddingShard"],  # e.g., embeddings_shard_00000_start_0_end_200000.npy
    repo_type="dataset",
    cache_dir="./hf_cache"  # optional
)

embeddings = np.load(file_path)
vector = embeddings[sample["EmbeddingIndex"]]
```

### Option 2: Download all embedding shards

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AliN96/midjourney-prompts-embeddings",
    repo_type="dataset",
    allow_patterns="*.npy",
    local_dir="./embeddings"
)
```

You can change `local_dir` to control where the embedding files are stored locally.

👉 For full details on data collection, preprocessing, and usage, please refer to the GitHub repository: [GitHub Repository](https://github.com/SPIN-UMass/MidjourneyDataset)

---

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{
naseh2024iteratively,
title={Iteratively Prompting Multimodal {LLM}s to Reproduce Natural and {AI}-Generated Images},
author={Ali Naseh and Katherine Thai and Mohit Iyyer and Amir Houmansadr},
booktitle={First Conference on Language Modeling},
year={2024},
url={https://openreview.net/forum?id=SwUsFTtM9h}
}
```

Provider:

AliN96
