
AliN96/midjourney-prompts-embeddings

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AliN96/midjourney-prompts-embeddings
Resource description:

License: MIT

# Midjourney Prompt–Embedding Dataset

This dataset is derived from our COLM 2024 paper, *[Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images](https://openreview.net/pdf?id=SwUsFTtM9h)*. The paper studies whether multimodal language models can infer prompts that generate images visually similar to target images produced by text-to-image systems or found in stock image collections, highlighting the relationship between real-world prompts and generated images as well as broader economic and security implications.

The dataset consists of a processed subset of approximately 4.5 million samples collected from publicly accessible Midjourney Discord channels. Each sample includes an anonymized prompt, the corresponding Midjourney model version, and a mapping to a precomputed image embedding. Embeddings are computed using CLIP (`laion/CLIP-ViT-g-14-laion2B-s12B-b42K`) and stored separately in shard files.

⚠️ Raw images are not redistributed due to platform terms of service, copyright considerations, and potential misuse risks.

## Accessing Image Embeddings

The dataset provides a mapping to precomputed image embeddings stored in shard files (`.npy`). These embeddings are not loaded automatically with `load_dataset` and should be accessed separately using the `EmbeddingShard` and `EmbeddingIndex` fields.

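
As a rough illustration of how the `EmbeddingShard` and `EmbeddingIndex` fields fit together, the sketch below resolves a batch of rows to their vectors while opening each shard only once. It assumes the shard files are already on disk in a single directory and that each shard is a 2-D array with one embedding per row. The names `gather_embeddings` and `shard_dir` are hypothetical and are not part of the dataset or its repository.

```python
from collections import defaultdict
from pathlib import Path

import numpy as np


def gather_embeddings(rows, shard_dir):
    """Return one embedding per row, in the same order as `rows`.

    Each row is a mapping with "EmbeddingShard" (shard file name) and
    "EmbeddingIndex" (row offset inside that shard).
    Assumes `rows` is non-empty and all shards share one embedding size.
    """
    # Group row positions by shard so each file is opened only once.
    by_shard = defaultdict(list)
    for pos, row in enumerate(rows):
        by_shard[row["EmbeddingShard"]].append(pos)

    out = [None] * len(rows)
    for shard_name, positions in by_shard.items():
        # mmap_mode="r" maps the file instead of reading it all into memory.
        shard = np.load(Path(shard_dir) / shard_name, mmap_mode="r")
        for pos in positions:
            # Copy the selected row out of the memory map.
            out[pos] = np.array(shard[rows[pos]["EmbeddingIndex"]])
    return np.stack(out)
```

Memory-mapping is a design choice for this sketch: it keeps memory use low when only a few vectors per shard are needed, at the cost of slower random access than a fully loaded array.
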
### Option 1: Download a single embedding shard

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
import numpy as np

# Load the prompt metadata; embeddings are not included in these rows.
ds = load_dataset("AliN96/midjourney-prompts-embeddings", split="train")
sample = ds[0]

# Fetch only the shard file that holds this sample's embedding.
file_path = hf_hub_download(
    repo_id="AliN96/midjourney-prompts-embeddings",
    filename=sample["EmbeddingShard"],  # e.g., embeddings_shard_00000_start_0_end_200000.npy
    repo_type="dataset",
    cache_dir="./hf_cache",  # optional
)

# Index into the shard to get this sample's embedding vector.
embeddings = np.load(file_path)
vector = embeddings[sample["EmbeddingIndex"]]
```

### Option 2: Download all embedding shards

```python
from huggingface_hub import snapshot_download

# Download every .npy shard in the dataset repository.
snapshot_download(
    repo_id="AliN96/midjourney-prompts-embeddings",
    repo_type="dataset",
    allow_patterns="*.npy",
    local_dir="./embeddings",
)
```

You can change `local_dir` to control where the embedding files are stored locally.

👉 For full details on data collection, preprocessing, and usage, please refer to the GitHub repository: [GitHub Repository](https://github.com/SPIN-UMass/MidjourneyDataset)

---

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{
naseh2024iteratively,
title={Iteratively Prompting Multimodal {LLM}s to Reproduce Natural and {AI}-Generated Images},
author={Ali Naseh and Katherine Thai and Mohit Iyyer and Amir Houmansadr},
booktitle={First Conference on Language Modeling},
year={2024},
url={https://openreview.net/forum?id=SwUsFTtM9h}
}
```

Provider:
AliN96