shachardon/midjourney-threads

Name: shachardon/midjourney-threads
Creator: shachardon
Published: 2023-12-10 11:26:11
License: No description available

Hugging Face, updated 2023-12-10, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/shachardon/midjourney-threads

Resource description:

---
task_categories:
- text-to-image
language:
- en
pretty_name: Midjourney-Threads
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "threads_0.csv"
    - "threads_20000.csv"
    - "threads_40000.csv"
    - "threads_60000.csv"
    - "threads_80000.csv"
    - "threads_100000.csv"
    - "threads_120000.csv"
    - "threads_140000.csv"
    - "threads_160000.csv"
---

# Dataset Card for Midjourney-Threads 🧵💬

This dataset contains user prompts from the Midjourney Discord channel, organized into "threads of interaction". Each thread contains a user's trials to create one target image.
The dataset was introduced as part of the paper: [Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney][ourpaper].

[ourpaper]: https://aclanthology.org/2023.emnlp-main.253/ "markdown our paper"

### Dataset Sources

- **Repository:** https://github.com/shachardon/Mid-Journey-to-alignment
- **Paper:** https://aclanthology.org/2023.emnlp-main.253/

## Dataset Structure

Main Columns:
- 'text' - the original prompt
- 'args' - predefined parameters (such as the aspect ratio, chaos and [more][myexample])
- 'channel_id' - the Discord channel
- 'userid' - an anonymous user id
- 'timestamp' - a timestamp of the prompt creation
- 'label' - True if an image generated from that prompt was upscaled, otherwise False.
- 'id' - unique id of the prompt
- 'url_png' - link to the generated images (a 4-grid version)
- 'main_content' - prefix of the prompt, without trailing magic-words
- 'concreteness' - concreteness score, based on [this paper][concpaper]
- 'word_len' - the number of words
- 'repeat_words' - the occurrences of each word that appears more than once in the prompt, excluding stop words.
- 'reapeat_words_ratio' - repeat_words / word_len
- 'perplexity' - the perplexity GPT-2 assigns to each prompt.
- 'caption_0-3' - captions that were generated by the BLIP-2 model, with the 4 created images as its inputs.
- 'phase' - train/test split, as was used to train image/text classifiers
- 'magic_ratio' - the percentage of words that were recognized as magic words in the prompt
- 'thread_id' - the id of the thread
- 'depth' - the max depth of a constituency parse tree of the prompt.
- 'num_sent_parser' - the number of sentences in the prompt.
- 'num_sent_parser_ratio' - num_sent_parser / word_len
- 'words_per_sent' - word_len / num_sent_parser

[myexample]: https://docs.midjourney.com/docs/parameter-list "markdown more"
[concpaper]: https://link.springer.com/article/10.3758/s13428-013-0403-5 "markdown this paper"

## Dataset Creation

### Source Data

We construct the dataset by scraping user-generated prompts from the Midjourney Discord server.
The server contains channels in which a user can type a prompt and arguments, and the Midjourney bot then replies with 4 generated images, combined into a grid. If the user is satisfied with one of the 4 images, they can send an 'upscale' command to the bot to get an upscaled version of the desired image.
We randomly choose one of the 'newbies' channels, where both new and experienced users experiment with general-domain prompts. We collect 693,528 prompts (from 23 January to 1 March 2023), together with their matching images and meta-data such as timestamps and user ids (which we anonymize).

#### Data Collection and Processing

We split the prompts into threads automatically; see the paper for more details.
In addition, we extract features (perplexity, sentence length, and more).

#### Personal and Sensitive Information

We fully anonymize the data by removing user names and other user-specific meta-data. If you recognize your prompts here and want to remove them, please send us an [email](mailto:shachar.don-yehiya@mail.huji.ac.il).
The Midjourney Discord is an open community that allows others to use images and prompts whenever they are posted in a public setting.
Paying users do own all assets they create, and therefore we do not include the image files in our dataset, but only links to them.

### Recommendations, Risks, and Limitations

We split the prompts into threads automatically, and therefore there are some mistakes. For more about our annotation method, please see the paper.
Our manual sample did not find any offensive content in the prompts.

## Citation

**BibTeX:**

```
@inproceedings{don-yehiya-etal-2023-human,
    title = "Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney",
    author = "Don-Yehiya, Shachar and Choshen, Leshem and Abend, Omri",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.253",
    pages = "4146--4161",
    abstract = "Generating images with a Text-to-Image model often requires multiple trials, where human users iteratively update their prompt based on feedback, namely the output image. Taking inspiration from cognitive work on reference games and dialogue alignment, this paper analyzes the dynamics of the user prompts along such iterations. We compile a dataset of iterative interactions of human users with Midjourney. Our analysis then reveals that prompts predictably converge toward specific traits along these iterations. We further study whether this convergence is due to human users, realizing they missed important details, or due to adaptation to the model{'}s {``}preferences{''}, producing better images for a specific language style. We show initial evidence that both possibilities are at play. The possibility that users adapt to the model{'}s preference raises concerns about reusing user data for further training. The prompts may be biased towards the preferences of a specific model, rather than align with human intentions and natural manner of expression.",
}
```

Provider:

shachardon

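Summary of original information

Dataset Card for Midjourney-Threads 🧵💬

Dataset overview

This dataset contains user prompts from the Midjourney Discord channel, organized into "threads of interaction". Each thread contains one user's attempts to create a single target image.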
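The card's configuration lists the training split as nine CSV shards named `threads_0.csv` through `threads_160000.csv`. As a minimal sketch of how those shards could be read into one list of rows, the snippet below uses only the standard library. The `open_shard` callable and the in-memory demo contents are hypothetical and stand in for however the files are actually stored locally.

```python
import csv
import io

# Shard names as listed under `configs` in the dataset card.
SHARD_NAMES = [f"threads_{start}.csv" for start in range(0, 180000, 20000)]


def read_shards(open_shard, names=SHARD_NAMES):
    """Concatenate the rows of every shard into one list of dicts.

    `open_shard` is a hypothetical callable: it takes a shard name and
    returns an open text file, e.g. `lambda n: open(n, newline="")`.
    """
    rows = []
    for name in names:
        with open_shard(name) as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Demo with two in-memory stand-ins for shards (made-up contents).
fake = {
    "a.csv": "text,thread_id\nhello,1\n",
    "b.csv": "text,thread_id\nworld,2\n",
}
rows = read_shards(lambda name: io.StringIO(fake[name]), names=list(fake))
print(len(rows), rows[0]["text"])
```

Every value comes back as a string, so numeric columns such as `word_len` would still need converting. With the Hugging Face `datasets` library installed, `load_dataset("shachardon/midjourney-threads")` should resolve the same shards through the card's configuration, although that path is not exercised here.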

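Dataset structure

Main columns

text: the original prompt
args: predefined parameters (such as aspect ratio, chaos, and so on)
channel_id: the Discord channel ID
userid: an anonymous user ID
timestamp: timestamp of the prompt creation
label: whether an image generated from the prompt was upscaled
id: unique ID of the prompt
url_png: link to the generated images (4-grid version)
main_content: prefix of the prompt, without trailing magic words
concreteness: concreteness score
word_len: the number of words
repeat_words: the occurrences of each word that appears more than once in the prompt (excluding stop words)
reapeat_words_ratio: repeat_words / word_len
perplexity: the perplexity GPT-2 assigns to each prompt
caption_0-3: captions generated by the BLIP-2 model
phase: train/test split
magic_ratio: the percentage of words recognized as magic words
thread_id: the ID of the thread
depth: the maximum depth of the prompt's constituency parse tree
num_sent_parser: the number of sentences in the prompt
num_sent_parser_ratio: num_sent_parser / word_len
words_per_sent: word_len / num_sent_parser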
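Several of these columns are simple ratios of other columns. The sketch below illustrates how such values could be derived from a raw prompt. It is not the authors' pipeline: the stop-word list is a tiny made-up one, sentences are counted with a regular expression rather than a parser, and the numerator of `reapeat_words_ratio` is assumed to be the summed occurrences of the repeated words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the list used by the authors is not given.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "with"}


def prompt_features(text):
    """Approximate a few of the card's length and repetition features."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Regex sentence split: a stand-in for the parser the card mentions.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    counts = Counter(w for w in words if w not in STOP_WORDS)
    repeated = {w: c for w, c in counts.items() if c > 1}

    word_len = len(words)
    num_sent = len(sentences)
    return {
        "word_len": word_len,
        "repeat_words": repeated,
        # Assumption: the ratio sums the occurrences of repeated words.
        "reapeat_words_ratio": sum(repeated.values()) / word_len if word_len else 0.0,
        "num_sent_parser": num_sent,
        "num_sent_parser_ratio": num_sent / word_len if word_len else 0.0,
        "words_per_sent": word_len / num_sent if num_sent else 0.0,
    }


features = prompt_features("A red cat and a red dog. The cat sleeps.")
print(features)
```

The column name `reapeat_words_ratio` is spelled as it appears in the dataset, so code that reads the CSV files needs the same spelling.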

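Dataset creation

Source data

The dataset was built by scraping user-generated prompts from the Midjourney Discord server. The server contains channels in which a user can type a prompt and arguments; the Midjourney bot replies with 4 generated images combined into a grid. If the user is satisfied with one of the images, they can send an "upscale" command to get an upscaled version of it.

Data collection and processing

We split the prompts into threads automatically and extract features (such as perplexity and sentence length).

Personal and sensitive information

We fully anonymize the data by removing user names and other user-specific metadata. If you recognize your prompts here and want them removed, please send an email.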
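The thread-splitting procedure itself is described in the paper and is not reproduced here. The sketch below only shows how rows that already carry a `thread_id` could be regrouped into ordered threads, and how the `label` column could be summarized per thread. It assumes `timestamp` values sort chronologically (for example ISO-8601 strings) and that `label` has already been converted to a boolean. The example rows and the `upscale_rate` helper are made up for illustration.

```python
from collections import defaultdict


def group_threads(rows):
    """Group prompt rows by `thread_id`, each thread ordered by `timestamp`."""
    threads = defaultdict(list)
    for row in rows:
        threads[row["thread_id"]].append(row)
    for prompts in threads.values():
        # Assumes timestamps compare in chronological order.
        prompts.sort(key=lambda r: r["timestamp"])
    return dict(threads)


def upscale_rate(prompts):
    """Fraction of prompts in one thread whose `label` is true."""
    return sum(1 for p in prompts if p["label"]) / len(prompts)


# Hypothetical rows, not taken from the dataset.
example_rows = [
    {"thread_id": 1, "timestamp": "2023-02-01T10:05:00", "text": "a cat, 4k", "label": True},
    {"thread_id": 1, "timestamp": "2023-02-01T10:00:00", "text": "a cat", "label": False},
    {"thread_id": 2, "timestamp": "2023-02-02T09:00:00", "text": "a castle", "label": False},
]
threads = group_threads(example_rows)
for thread_id, prompts in threads.items():
    print(thread_id, [p["text"] for p in prompts], upscale_rate(prompts))
```

When rows are read straight from CSV, `label` arrives as the string "True" or "False", and a non-empty string is always truthy in Python, so it has to be converted before calling `upscale_rate`.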
