midjourney-v6-520k-raw
ModelScope community | Updated 2026-05-22 | Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/OmniData/midjourney-v6-520k-raw
Description:
# Synthetic Dataset: MJv6-520k
Pulled from Midjourney on 19 Jun 2024 and filtered down to single images only.
Japanese captions were translated into English with GPT-3.5 and stored in the `gpt_caption` column.
Original captions are available in the `original_text` column.
Each image file has a metadata JSON file and a text file with the same name. The metadata matches the parquet table. The text file is for use in SimpleTuner or Kohya for training.
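As an illustration of the same-name layout, here is a minimal sketch of reading both sidecar files for one image. It assumes the metadata file uses a `.json` suffix, the caption file a `.txt` suffix, and that both sit next to the image; `load_sidecars` is a hypothetical helper, not part of the dataset tooling.

```py
import json
from pathlib import Path


def load_sidecars(image_path):
    """Return (metadata dict, caption text) for the files sharing the image's stem."""
    image_path = Path(image_path)
    # Assumption: sidecars sit next to the image as <stem>.json and <stem>.txt
    meta_path = image_path.with_suffix(".json")
    text_path = image_path.with_suffix(".txt")
    with open(meta_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    # Strip surrounding whitespace so a trailing newline does not end up in the caption
    caption = text_path.read_text(encoding="utf-8").strip()
    return metadata, caption
```

Calling `load_sidecars("some_image.png")` would read `some_image.json` and `some_image.txt` from the same directory; the image file itself is not opened.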
**This dataset contains the full images.**
Code to compile parquet:
```py
"""
Compile all .json files in the current working directory into a parquet file.
"""
import json
import os

import pandas as pd

# Target pandas dtype for each column
column_types = {
    "id": "int64",
    "version": "str",
    "arguments": "str",
    "original_text": "str",
    "caption": "str",
    "gpt_caption": "str",
    "width": "int",
    "height": "int",
    "reactions": "object",  # "dict" is not a pandas dtype; dicts stay as Python objects
}

# Read all .json files in the pwd
data = []
for file in sorted(os.listdir()):
    if file.endswith(".json"):
        with open(file, "r", encoding="utf-8") as f:
            data.append(json.load(f))

# Convert the data to a DataFrame
df = pd.DataFrame(data)

# Convert the columns to the correct types
for col, dtype in column_types.items():
    df[col] = df[col].astype(dtype)

# Save the DataFrame to a parquet file
df.to_parquet("train.parquet")

# Print the first few rows of the DataFrame
print(df.head())
```
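The type-conversion loop in the script raises a `KeyError` if a column is absent from every record, and the integer casts fail if a value is missing in only some records. A minimal sketch of a pre-check, using only the standard library, could look like the following. The column names are taken from the script above; `find_incomplete_records` is a hypothetical helper, and it assumes each JSON file holds a single object.

```py
import json
from pathlib import Path

# Column names taken from the compile script
EXPECTED_COLUMNS = {
    "id", "version", "arguments", "original_text", "caption",
    "gpt_caption", "width", "height", "reactions",
}


def find_incomplete_records(directory="."):
    """Return {filename: sorted missing keys} for JSON files lacking expected columns."""
    problems = {}
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            record = json.load(f)
        # Iterating a dict yields its keys, so this is the set of absent columns
        missing = EXPECTED_COLUMNS.difference(record)
        if missing:
            problems[path.name] = sorted(missing)
    return problems
```

Running `find_incomplete_records()` in the directory before compiling returns an empty dict when every file carries all nine columns.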
Provider:
maas
Created:
2024-07-27
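Overview:
The dataset contains 520,000 single images obtained from Midjourney. The original Japanese captions have been translated into English with GPT-3.5, and metadata JSON files and text files are provided, making the dataset suitable for training frameworks such as SimpleTuner or Kohya.
The summary above was collected and generated by 遇见数据集.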



