
shachardon/midjourney-threads

Hugging Face · Updated 2023-12-10 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/shachardon/midjourney-threads
Resource description:
---
task_categories:
- text-to-image
language:
- en
pretty_name: Midjourney-Threads
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "threads_0.csv"
    - "threads_20000.csv"
    - "threads_40000.csv"
    - "threads_60000.csv"
    - "threads_80000.csv"
    - "threads_100000.csv"
    - "threads_120000.csv"
    - "threads_140000.csv"
    - "threads_160000.csv"
---

# Dataset Card for Midjourney-Threads 🧵💬

This dataset contains user prompts from the Midjourney Discord channel, organized into "threads of interaction". Each thread contains a user’s trials to create one target image.
The dataset was introduced as part of the paper: [Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney][ourpaper].

[ourpaper]: https://aclanthology.org/2023.emnlp-main.253/ "markdown our paper"

### Dataset Sources

- **Repository:** https://github.com/shachardon/Mid-Journey-to-alignment
- **Paper:** https://aclanthology.org/2023.emnlp-main.253/

## Dataset Structure

Main Columns:
- 'text' - the original prompt
- 'args' - predefined parameters (such as the aspect ratio, chaos and [more][myexample])
- 'channel_id' - the Discord channel
- 'userid' - an anonymous user id
- 'timestamp' - a timestamp of the prompt creation
- 'label' - True if an image generated from that prompt was upscaled, otherwise False
- 'id' - unique id of the prompt
- 'url_png' - link to the generated images (a 4-grid version)
- 'main_content' - prefix of the prompt, without trailing magic-words
- 'concreteness' - concreteness score, based on [this paper][concpaper]
- 'word_len' - the number of words
- 'repeat_words' - the occurrences of each word that appears more than once in the prompt, excluding stop words
- 'reapeat_words_ratio' - repeat_words / word_len
- 'perplexity' - the perplexity GPT-2 assigns to each prompt
- 'caption_0-3' - captions generated by the BLIP-2 model, with the 4 created images as its inputs
- 'phase' - the train/test split used to train the image/text classifiers
- 'magic_ratio' - the percentage of words that were recognized as magic words in the prompt
- 'thread_id' - the id of the thread
- 'depth' - the max depth of a constituency parse tree of the prompt
- 'num_sent_parser' - the number of sentences in the prompt
- 'num_sent_parser_ratio' - num_sent_parser / word_len
- 'words_per_sent' - word_len / num_sent_parser

[myexample]: https://docs.midjourney.com/docs/parameter-list "markdown more"
[concpaper]: https://link.springer.com/article/10.3758/s13428-013-0403-5 "markdown this paper"

## Dataset Creation

### Source Data

We construct the dataset by scraping user-generated prompts from the Midjourney Discord server.
The server contains channels in which a user can type a prompt and arguments, and the Midjourney bot then replies with 4 generated images, combined into a grid. If the user is satisfied with one of the 4 images, they can send an 'upscale' command to the bot to get an upscaled version of the desired image.
We randomly choose one of the 'newbies' channels, where both new and experienced users experiment with general-domain prompts. We collect 693,528 prompts (from 23 January to 1 March 2023), together with their matching images and meta-data such as timestamps and user ids (which we anonymize).

#### Data Collection and Processing

We split the prompts into threads automatically; see the paper for more details.
In addition, we extract features (perplexity, sentence length, and more).

#### Personal and Sensitive Information

We fully anonymize the data by removing user names and other user-specific meta-data. If you recognize your prompts here and want to remove them, please send us an [email](mailto:shachar.don-yehiya@mail.huji.ac.il).

The Midjourney Discord is an open community that allows others to use images and prompts whenever they are posted in a public setting.
Paying users do own all assets they create, and therefore we do not include the image files in our dataset, but only links to them.

### Recommendations, Risks, and Limitations

We split the prompts into threads automatically, so the thread boundaries contain some mistakes. For more about our annotation method, please see the paper.
In a manually inspected sample, we did not find any offensive content in the prompts.

## Citation

**BibTeX:**

```
@inproceedings{don-yehiya-etal-2023-human,
    title = "Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney",
    author = "Don-Yehiya, Shachar and
      Choshen, Leshem and
      Abend, Omri",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.253",
    pages = "4146--4161",
    abstract = "Generating images with a Text-to-Image model often requires multiple trials, where human users iteratively update their prompt based on feedback, namely the output image. Taking inspiration from cognitive work on reference games and dialogue alignment, this paper analyzes the dynamics of the user prompts along such iterations. We compile a dataset of iterative interactions of human users with Midjourney. Our analysis then reveals that prompts predictably converge toward specific traits along these iterations. We further study whether this convergence is due to human users, realizing they missed important details, or due to adaptation to the model{'}s {``}preferences{''}, producing better images for a specific language style. We show initial evidence that both possibilities are at play. The possibility that users adapt to the model{'}s preference raises concerns about reusing user data for further training. The prompts may be biased towards the preferences of a specific model, rather than align with human intentions and natural manner of expression.",
}
```
Provider:
shachardon
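Summary of original information

Dataset Card for Midjourney-Threads 🧵💬

Dataset overview

This dataset contains user prompts from the Midjourney Discord channel, organized into "threads of interaction". Each thread contains one user's attempts to create a single target image.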

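Dataset structure

Main columns

  • text: the original prompt
  • args: predefined parameters (such as aspect ratio, chaos, and others)
  • channel_id: the Discord channel ID
  • userid: an anonymous user ID
  • timestamp: timestamp of the prompt creation
  • label: whether an image generated from the prompt was upscaled
  • id: unique ID of the prompt
  • url_png: link to the generated images (4-grid version)
  • main_content: prefix of the prompt, without trailing magic words
  • concreteness: concreteness score
  • word_len: number of words
  • repeat_words: occurrences of each word that appears more than once in the prompt (excluding stop words)
  • reapeat_words_ratio: repeat_words / word_len
  • perplexity: the perplexity GPT-2 assigns to each prompt
  • caption_0-3: captions generated by the BLIP-2 model
  • phase: train/test split
  • magic_ratio: percentage of words recognized as magic words
  • thread_id: the thread ID
  • depth: maximum depth of the prompt's constituency parse tree
  • num_sent_parser: number of sentences in the prompt
  • num_sent_parser_ratio: num_sent_parser / word_len
  • words_per_sent: word_len / num_sent_parser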
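
The three ratio columns in the list above are defined directly in terms of other columns, so they can be recomputed as a sanity check. The following is a minimal sketch, not the authors' feature-extraction code: it assumes `repeat_words` is available as a single numeric count, and the helper name `derived_ratios` and the `nan` fallback for zero denominators are illustrative choices.

```python
def derived_ratios(word_len: int, repeat_words: int, num_sent_parser: int) -> dict:
    """Recompute the ratio columns from their documented definitions."""

    def safe_div(num: float, den: float) -> float:
        # Assumption: an empty prompt or zero parsed sentences yields NaN
        # instead of raising ZeroDivisionError.
        return num / den if den else float("nan")

    return {
        # Column name is spelled "reapeat" in the dataset itself.
        "reapeat_words_ratio": safe_div(repeat_words, word_len),
        "num_sent_parser_ratio": safe_div(num_sent_parser, word_len),
        "words_per_sent": safe_div(word_len, num_sent_parser),
    }


# Hypothetical values for one prompt: 12 words, 3 repeated-word
# occurrences, 2 parsed sentences.
ratios = derived_ratios(word_len=12, repeat_words=3, num_sent_parser=2)
print(ratios)
```

Comparing these values against the stored columns is one way to confirm how `repeat_words` is encoded before relying on it.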

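Dataset creation

Source data

The dataset was built by scraping user-generated prompts from the Midjourney Discord server. The server contains channels in which a user can type a prompt and arguments; the Midjourney bot replies with 4 generated images, combined into a grid. If the user is satisfied with one of the images, they can send an "upscale" command to get an upscaled version of it.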
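
A prompt typed in the channel carries both free text and arguments, which the dataset stores separately as `text` and `args`. The sketch below illustrates one way such a split could work, assuming parameters follow Midjourney's `--name value` convention (for example `--ar 16:9`). The function `split_prompt` is hypothetical and is not the parsing used to build the dataset; the stored `args` format may differ.

```python
import re


def split_prompt(raw: str):
    """Split a raw prompt into its text and a dict of `--name value` parameters."""
    # A leading space is added so a prompt that starts with "--" is still split.
    head, *tail = re.split(r"\s--(?=[a-zA-Z])", " " + raw.strip())
    args = {}
    for chunk in tail:
        name, _, value = chunk.strip().partition(" ")
        # Flags given without a value are recorded as True.
        args[name] = value.strip() or True
    return head.strip(), args


# Hypothetical prompt, not taken from the dataset.
text, args = split_prompt("a lighthouse in a storm, oil painting --ar 16:9 --chaos 20")
print(text)
print(args)
```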

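Data collection and processing

The prompts were split into threads automatically, and features (such as perplexity and sentence length) were extracted.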
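
Because every row carries a `thread_id` and a `timestamp`, the threads can be rebuilt from the flat CSV files by grouping and sorting. The sketch below uses only the standard library and a small in-memory CSV. The sample rows are invented, and the ISO 8601 timestamp format and `True`/`False` label strings are assumptions; check the real files before reusing the parsing.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical rows; the real files have many more columns.
SAMPLE_CSV = """thread_id,timestamp,text,label
t1,2023-02-01T10:00:00,a cat,False
t2,2023-02-01T10:01:00,castle at dusk,True
t1,2023-02-01T10:02:00,a cat wearing a hat,False
t1,2023-02-01T10:05:00,"a cat wearing a red hat, studio photo",True
"""


def group_threads(csv_file):
    """Group rows by thread_id and order each thread chronologically."""
    threads = defaultdict(list)
    for row in csv.DictReader(csv_file):
        threads[row["thread_id"]].append(row)
    for rows in threads.values():
        # Assumption: timestamps are ISO 8601 strings.
        rows.sort(key=lambda r: datetime.fromisoformat(r["timestamp"]))
    return dict(threads)


threads = group_threads(io.StringIO(SAMPLE_CSV))
for thread_id, rows in threads.items():
    print(thread_id, [r["text"] for r in rows])
```

To run this over the published files, the `io.StringIO` object could be replaced with an opened `threads_*.csv` file.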

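Personal and sensitive information

The data is fully anonymized by removing user names and other user-specific metadata. Users who recognize their own prompts here and want them removed can request this by email.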

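Recommendations, risks, and limitations

The prompts were split into threads automatically, so the thread boundaries contain some mistakes. A manually inspected sample did not reveal any offensive content in the prompts.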

