high-quality-midjouney-srefs
ModelScope community, updated 2026-01-06, indexed 2025-07-26
Download link:
https://modelscope.cn/datasets/AI-ModelScope/high-quality-midjouney-srefs
Resource description:
# Midjourney Image Scraper & Dataset Creator
A complete toolkit for scraping Midjourney images, generating captions, and creating HuggingFace datasets with optional automatic upload to HuggingFace Hub.
## 🌟 Features
- **🔍 Web Scraping**: Download images from midjourneysref.com with comprehensive error handling
- **🤖 AI Captioning**: Automatic image captioning using Moondream API with **auto-resume capability**
- **✂️ Smart Cropping**: AI-powered image cropping using OpenAI to optimize aspect ratios
- **🔄 Interrupt-Safe**: Stop and restart any process - it'll pick up where you left off
- **📦 HuggingFace Datasets**: Create professional HF-compatible datasets with metadata
- **☁️ Hub Upload**: Direct upload to HuggingFace Hub with one command
- **📊 Detailed Logging**: Comprehensive feedback and statistics for all operations
## 🚀 Quick Start
### 1. Installation
```bash
# Clone or download this repository
git clone <your-repo-url>
cd style_scraper
# Install all dependencies
pip install -r requirements.txt
```
### 2. Environment Setup
Create a `.env` file in the project root:
```bash
# Required for image captioning
MOONDREAM_API_KEY=your_moondream_api_key_here
# Required for AI-powered cropping
OPENAI_API_KEY=your_openai_api_key_here
# Required for HuggingFace Hub upload
HF_TOKEN=your_huggingface_token_here
```
**Get your tokens:**
- **Moondream API**: https://moondream.ai/api
- **OpenAI API**: https://platform.openai.com/api-keys
- **HuggingFace Hub**: https://huggingface.co/settings/tokens (select "Write" permissions)
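Before running the pipeline it can help to confirm which keys are actually visible to Python. The following is a minimal sketch, not part of the toolkit: it assumes the variables have already been loaded into the process environment (for example by exporting them in the shell), and the `REQUIRED_KEYS` mapping and `missing_keys` helper are hypothetical names introduced only for illustration.

```python
import os

# Variable names come from the .env example; the feature labels are illustrative.
REQUIRED_KEYS = {
    "captioning": "MOONDREAM_API_KEY",
    "cropping": "OPENAI_API_KEY",
    "upload": "HF_TOKEN",
}


def missing_keys(env=None):
    """Return {feature: variable_name} for every key that is unset or blank."""
    env = os.environ if env is None else env
    return {
        feature: name
        for feature, name in REQUIRED_KEYS.items()
        if not env.get(name, "").strip()
    }


if __name__ == "__main__":
    for feature, name in missing_keys().items():
        print(f"{name} is not set; {feature} will not work")
```

Only the steps you intend to run need their key; a scrape-only session needs none of them.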
### 3. Basic Usage
```bash
# 1. Scrape images
python scraper.py
# 2. Generate captions (optional but recommended)
python caption_images.py
# 3. Crop images with AI analysis (optional)
python crop_dataset.py
# 4. Create HuggingFace dataset
python create_hf_dataset.py --prompts prompts.csv
# 5. Or create and upload in one step
python create_hf_dataset.py --prompts prompts.csv --upload --repo-id your-username/my-dataset
```
## 📖 Detailed Usage
### Step 1: Image Scraping (`scraper.py`)
Downloads images from midjourneysref.com with comprehensive error handling.
```bash
python scraper.py
```
**Features:**
- Downloads images from pages 6-20 by default (configurable in script)
- Pre-flight connectivity testing
- Duplicate detection and skipping
- Detailed error categorization and statistics
- Comprehensive logging to `scraper.log`
**Output:** Images saved to `midjourney_images/` folder
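The duplicate detection mentioned above can be pictured as a filename check against the download folder. This is a rough sketch under the assumption that each image is saved under the last segment of its URL path; `target_path` and `needs_download` are hypothetical helpers, and `scraper.py` may detect duplicates differently.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def target_path(image_url, folder="midjourney_images"):
    """Map an image URL to a local path using the last URL path segment."""
    name = PurePosixPath(urlparse(image_url).path).name
    return Path(folder) / name


def needs_download(image_url, folder="midjourney_images"):
    """Return True unless a non-empty file for this URL already exists."""
    path = target_path(image_url, folder)
    return not (path.exists() and path.stat().st_size > 0)
```

Treating zero-byte files as missing means a download that was interrupted before any data arrived is retried on the next run.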
### Step 2: Image Captioning (`caption_images.py`)
Generates captions for uncaptioned images using the Moondream API. **Automatically resumes** from where you left off!
```bash
# Basic usage - automatically resumes from existing captions
python caption_images.py
# First run captions 50 images, then gets interrupted...
# Second run automatically continues with remaining images
python caption_images.py # No extra flags needed!
# Force recaption all images (ignoring existing captions)
python caption_images.py --recaption
# Custom options
python caption_images.py --images my_images/ --output captions.csv --delay 0.5
# Caption length control
python caption_images.py --prompt-length short # Concise captions
python caption_images.py --prompt-length normal # Detailed captions
python caption_images.py --prompt-length mix # Random mix (default)
```
**Options:**
- `--images, -i`: Input folder (default: `midjourney_images`)
- `--output, -o`: Output CSV file (default: `prompts.csv`)
- `--existing, -e`: Custom existing prompts file (optional, by default uses output file)
- `--recaption, -r`: Force recaption all images, ignoring existing captions
- `--prompt-length, -l`: Caption length - `short`, `normal`, or `mix` (default: `mix`)
- `--delay, -d`: API rate limiting delay (default: 1.0 seconds)
- `--verbose, -v`: Enable debug logging
**Features:**
- **🔄 Auto-resume**: Automatically skips already-captioned images by default
- **🚀 Interrupt-safe**: Can safely stop and restart the process anytime
- Rate limiting to respect API limits
- API connection testing before processing
- Comprehensive error handling and statistics
- Compatible CSV output for dataset creation
**Output:** CSV file with `filename,prompt` columns
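Auto-resume can be understood as a set difference between the images on disk and the filenames already present in the CSV. The sketch below illustrates that idea only; `captioned_filenames`, `pending_images`, and the `IMAGE_SUFFIXES` set are assumptions made for the example, not the script's actual internals.

```python
import csv
from pathlib import Path

# Assumed set of image extensions; adjust to match your download folder.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def captioned_filenames(csv_path):
    """Return filenames that already have a non-empty prompt in the CSV."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {
            row["filename"]
            for row in csv.DictReader(handle)
            if row.get("filename") and (row.get("prompt") or "").strip()
        }


def pending_images(image_dir, csv_path):
    """Return the sorted image filenames that still need a caption."""
    done = captioned_filenames(csv_path)
    return sorted(
        p.name
        for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES and p.name not in done
    )
```

Rows with an empty prompt count as uncaptioned here, so a failed caption attempt that left a blank cell would be retried.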
### Step 3: AI-Powered Image Cropping (`crop_dataset.py`)
Analyzes images using OpenAI's vision API to determine optimal crop ratios and creates cropped versions.
```bash
# Basic usage - processes all images in prompts.csv
python crop_dataset.py
# Test with a single image first
python test_crop.py
```
**Features:**
- **🤖 AI Analysis**: Uses OpenAI GPT-4 Vision to analyze each image for optimal cropping
- **📐 Smart Ratios**: Supports 16:9, 9:16, 4:3, 3:4, 1:1, or no cropping
- **🎯 Content-Aware**: Considers image content, composition, and aesthetics
- **📊 Metadata Tracking**: Saves crop ratios and dimensions to updated CSV
- **🔒 Fallback Safe**: Defaults to 1:1 (square) for invalid responses or errors
**Crop Ratios:**
- `16:9` - Wide scenes, landscapes, group shots
- `9:16` - Tall subjects, full-body portraits, buildings
- `4:3` - Balanced framing, single subjects, general scenes
- `3:4` - Portraits, vertical objects, close-ups
- `1:1` - Square format, where entire image is important
- `no` - Keep original aspect ratio (treated as 1:1)
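To make the ratio handling concrete, here is a small sketch of how a recommended ratio could be turned into a crop box. It assumes a centered crop, which the README does not confirm; `normalize_ratio` and `center_crop_box` are hypothetical names. The fallback mirrors the documented behavior: anything outside the supported ratios, including `no`, is treated as `1:1`.

```python
ALLOWED_RATIOS = {
    "16:9": (16, 9),
    "9:16": (9, 16),
    "4:3": (4, 3),
    "3:4": (3, 4),
    "1:1": (1, 1),
}


def normalize_ratio(answer):
    """Map a model answer to a supported ratio, falling back to 1:1."""
    cleaned = (answer or "").strip().lower()
    return cleaned if cleaned in ALLOWED_RATIOS else "1:1"


def center_crop_box(width, height, ratio):
    """Return (left, top, right, bottom) of the largest centered box with the ratio."""
    rw, rh = ALLOWED_RATIOS[normalize_ratio(ratio)]
    if width * rh > height * rw:
        # Image is wider than the target ratio: keep full height, trim the sides.
        new_w, new_h = height * rw // rh, height
    else:
        # Image is taller than (or equal to) the target: keep full width, trim top and bottom.
        new_w, new_h = width, width * rh // rw
    left, top = (width - new_w) // 2, (height - new_h) // 2
    return (left, top, left + new_w, top + new_h)
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed to that method directly.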
**Output:**
- `cropped_images/` - Folder with all cropped images
- `prompts_with_crops.csv` - Updated CSV with crop metadata
- `crop_dataset.log` - Detailed processing log
**CSV Columns Added:**
- `crop_ratio` - AI-recommended aspect ratio
- `cropped_width` - Width of cropped image in pixels
- `cropped_height` - Height of cropped image in pixels
- `cropped_filename` - Filename of the cropped version
### Step 4: Dataset Creation (`create_hf_dataset.py`)
Creates a professional HuggingFace-compatible dataset with metadata extraction.
```bash
# Local dataset only
python create_hf_dataset.py --prompts prompts.csv
# Using cropped images and metadata
python create_hf_dataset.py --prompts prompts_with_crops.csv --input cropped_images
# With HuggingFace Hub upload
python create_hf_dataset.py --prompts prompts.csv --upload --repo-id username/dataset-name
```
**Options:**
- `--input, -i`: Input images folder (default: `midjourney_images`)
- `--output, -o`: Output dataset folder (default: `midjourney_hf_dataset`)
- `--name, -n`: Dataset name (default: `midjourney-images`)
- `--prompts, -p`: Path to prompts CSV file
- `--upload, -u`: Upload to HuggingFace Hub
- `--repo-id, -r`: HuggingFace repository ID (e.g., `username/dataset-name`)
- `--verbose, -v`: Enable debug logging
**Features:**
- Extracts comprehensive image metadata (resolution, file size, orientation, etc.)
- Style reference (`sref`) extraction from filenames
- HuggingFace-compatible structure and configuration
- Automatic README generation with usage examples
- Dataset loading scripts for seamless integration
- Optional direct upload to HuggingFace Hub
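The `sref` extraction can be illustrated with the filename pattern used elsewhere in this README (`4160600070-1-d9409ee5.png`). The sketch below assumes the style reference is the leading numeric token before the first hyphen; `extract_sref` is a hypothetical helper, and the real script may apply a different rule.

```python
from pathlib import Path


def extract_sref(filename):
    """Return the leading numeric token of a filename, or None if there is none."""
    head = Path(filename).stem.split("-", 1)[0]
    return head if head.isdigit() else None
```

Returning the ID as a string keeps any leading zeros intact and avoids accidental integer formatting later on.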
## 📊 Output Structure
### Dataset Directory Structure
```
midjourney_hf_dataset/
├── images/                     # All image files
├── metadata/
│   ├── metadata.csv            # Main HuggingFace metadata
│   ├── metadata_detailed.json  # Detailed metadata
│   └── dataset_summary.json    # Dataset statistics
├── dataset_config/             # HuggingFace configuration
│   ├── dataset_infos.json
│   └── midjourney-images.py    # Dataset loading script
└── README.md                   # Generated documentation
```
### Metadata Fields
Each image includes comprehensive metadata:
- `filename`: Original image filename
- `file_path`: Relative path within dataset
- `sref`: Style reference ID (extracted from filename)
- `prompt`: AI-generated or provided caption
- `width`: Image width in pixels
- `height`: Image height in pixels
- `file_size_mb`: File size in megabytes
- `size_category`: Resolution category (high/medium/low)
- `orientation`: Image orientation (landscape/portrait/square)
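Two of these fields are simple to derive by hand, which is useful when checking a generated `metadata.csv`. The following is a sketch under stated assumptions: a megabyte is taken as 1024 × 1024 bytes and sizes are rounded to two decimals, neither of which the README specifies. Both function names are illustrative.

```python
import os


def orientation(width, height):
    """Classify an image as landscape, portrait, or square from its dimensions."""
    if width > height:
        return "landscape"
    if height > width:
        return "portrait"
    return "square"


def file_size_mb(path):
    """Return the file size in megabytes (assumed 1024 * 1024 bytes), rounded to 2 places."""
    return round(os.path.getsize(path) / (1024 * 1024), 2)
```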
## 🔧 Configuration
### Scraper Configuration
Edit variables in `scraper.py`:
```python
# Folder where you want to save the images
DOWNLOAD_FOLDER = "midjourney_images"
# Page range to scrape
start_page = 6
end_page = 20
```
### Caption Configuration
The captioning script supports rate limiting, automatic resuming, and caption length control:
```bash
# Caption length options
python caption_images.py --prompt-length short # Short, concise captions
python caption_images.py --prompt-length normal # Detailed descriptions
python caption_images.py --prompt-length mix # Random mix (default - adds variety)
# Rate limiting for API stability
python caption_images.py --delay 1.0
# Auto-resume from previous run (default behavior)
python caption_images.py
# Force recaption everything from scratch
python caption_images.py --recaption
# Use a different existing file for comparison
python caption_images.py --existing other_captions.csv
```
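The effect of `--delay` can be sketched as a fixed pause between consecutive API calls. This is an illustration only: `caption_all` is a hypothetical function, `caption_fn` stands in for whatever call the script makes to the Moondream API, and the injectable `sleep` parameter exists purely so the pacing can be tested without waiting.

```python
import time


def caption_all(filenames, caption_fn, delay=1.0, sleep=time.sleep):
    """Call caption_fn on each filename, pausing `delay` seconds between calls."""
    results = {}
    for index, name in enumerate(filenames):
        if index:
            # No pause is needed before the very first request.
            sleep(delay)
        try:
            results[name] = caption_fn(name)
        except Exception as exc:
            # Keep going; images that failed can be picked up on the next run.
            print(f"Failed to caption {name}: {exc}")
    return results
```

A larger delay lowers the request rate, which is the direction to move in if the API starts rejecting calls.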
### Dataset Configuration
Customize dataset creation:
```bash
# Custom dataset name and structure
python create_hf_dataset.py \
  --input my_images/ \
  --output my_custom_dataset/ \
  --name "my-custom-midjourney-dataset" \
  --prompts captions.csv
```
## 🤗 Using Your Dataset
### Loading from HuggingFace Hub
```python
from datasets import load_dataset
# Load your uploaded dataset
dataset = load_dataset("username/your-dataset-name")
# Access data
for example in dataset["train"]:
    image = example["image"]    # PIL Image
    prompt = example["prompt"]  # Caption text
    sref = example["sref"]      # Style reference
    width = example["width"]    # Image width
    # ... other metadata
```
### Loading Locally
```python
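import pandas as pd
from PIL import Image

# Load metadata; read sref as text so style IDs compare as strings
metadata = pd.read_csv("midjourney_hf_dataset/metadata/metadata.csv", dtype={"sref": str})

# Load specific image
def load_image(filename):
    return Image.open(f"midjourney_hf_dataset/images/{filename}")

# Filter by criteria
# (assumes the category label is 'high_resolution'; check metadata.csv for the actual values)
high_res = metadata[metadata['size_category'] == 'high_resolution']
# Empty CSV cells load as NaN, so check for missing values as well as empty strings
has_prompts = metadata[metadata['prompt'].notna() & (metadata['prompt'] != "")]
same_style = metadata[metadata['sref'] == '4160600070']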
```
## 📈 Advanced Usage
### Batch Processing
```bash
# Process multiple scraping sessions - auto-resumes captioning!
for i in {1..5}; do
  python scraper.py
  python caption_images.py  # Automatically skips existing captions
done
# Create final dataset
python create_hf_dataset.py --prompts prompts.csv --upload --repo-id username/large-midjourney-dataset
# Alternative: Force fresh captions for each batch
for i in {1..5}; do
  python scraper.py
  python caption_images.py --recaption
done
```
### Custom Prompts
You can provide your own prompts instead of using AI captioning:
```csv
filename,prompt
4160600070-1-d9409ee5.png,"A majestic dragon soaring over snow-capped mountains"
4160600070-2-a8b7c9d2.png,"Cyberpunk cityscape with neon reflections in rain"
```
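A hand-written prompts file is easy to get subtly wrong, so a quick check of the header before building the dataset can save a run. The sketch below uses only the standard `csv` module; `read_prompts` is a hypothetical helper, and it accepts extra columns so that a file such as `prompts_with_crops.csv` would also pass.

```python
import csv
import io


def read_prompts(text):
    """Parse prompts CSV text into {filename: prompt}, rejecting a wrong header."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or not {"filename", "prompt"} <= set(reader.fieldnames):
        raise ValueError("expected a header with 'filename' and 'prompt' columns")
    return {row["filename"]: row["prompt"] for row in reader}
```

Quoting the prompt, as in the example rows, matters whenever the caption itself contains a comma.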
### Style-Based Datasets
```bash
# Filter by style reference before creating dataset
python -c "
import pandas as pd
df = pd.read_csv('prompts.csv')
style_df = df[df['filename'].str.startswith('4160600070')]
style_df.to_csv('style_specific_prompts.csv', index=False)
"
python create_hf_dataset.py --prompts style_specific_prompts.csv --name "style-4160600070"
```
## 🐛 Troubleshooting
### Common Issues
**1. Missing API Keys**
```
Error: MOONDREAM_API_KEY not found
```
- Ensure `.env` file exists with valid API key
- Check API key has sufficient credits
**2. HuggingFace Upload Fails**
```
Error: HF_TOKEN not found
```
- Create token at https://huggingface.co/settings/tokens
- Ensure "Write" permissions are selected
- Check repository name is available
**3. No Images Found**
```
Warning: No images found with primary selector
```
- Website structure may have changed
- Check internet connection
- Verify target pages exist
**4. Caption Generation Fails**
```
Failed to caption image: API error
```
- Check Moondream API status
- Verify API key and credits
- Lower the request rate by increasing `--delay`
- The script auto-resumes, so you can safely restart after fixing the issue
**5. Want to Recaption Existing Images**
```
Images already have captions but I want to regenerate them
```
- Use `--recaption` flag to ignore existing captions
- Or delete the existing CSV file to start fresh
### Log Files
Check these log files for detailed debugging:
- `scraper.log`: Web scraping logs
- `caption_images.log`: Captioning process logs
- `hf_dataset_creation.log`: Dataset creation logs
## 📄 License
This project is for educational and research purposes. Please respect the terms of service of the source website and API providers.
## 🤝 Contributing
Feel free to submit issues and enhancement requests!
## 🙏 Acknowledgments
- Images sourced from [midjourneysref.com](https://midjourneysref.com)
- Captioning powered by [Moondream API](https://moondream.ai)
- Dataset hosting by [🤗 Hugging Face](https://huggingface.co)
Provider:
maas
Created:
2025-07-22



