alchemist
Source: ModelScope community. Updated 2025-12-04; indexed 2025-05-24.
Download link: https://modelscope.cn/datasets/AI-ModelScope/alchemist
# Alchemist 👨‍🔬
## Dataset Description
**Alchemist** is a compact, high-quality dataset comprising 3,350 image-text pairs, meticulously curated for supervised fine-tuning (SFT) of pre-trained text-to-image (T2I) generative models. The primary goal of Alchemist is to significantly enhance the generative quality (particularly aesthetic appeal and image complexity) of T2I models while preserving their inherent diversity in content, composition, and style.
This dataset and its creation methodology are introduced in the research paper: \
**"[Alchemist: Turning Public Text-to-Image Data into Generative Gold](https://huggingface.co/papers/2505.19297)"**
## Dataset Creation
### Curation Rationale
Existing methods for creating SFT datasets often rely on very large-scale data or filtering techniques that may not optimally select for samples that provide the maximum boost in SFT performance. The Alchemist dataset was created to address the need for a smaller, yet highly effective, general-purpose SFT resource.
Our methodology is detailed in the associated paper and involves a multi-stage filtering pipeline:
1. **Source Data:** We started with an initial pool of approximately 10 billion web-scraped images.
2. **Image-Centric Pre-filtering:** Unlike approaches that perform early text-based filtering (which can discard valuable visual content), our initial stages focused on image quality. This involved:
    * Filtering for safety (NSFW removal) and resolution (retaining images > 1MPx).
    * Coarse-grained quality assessment using lightweight binary classifiers to remove images with severe degradations, watermarks, blur, or low aesthetics.
    * Image deduplication using SIFT-like features and fine-grained perceptual quality filtering using the TOPIQ no-reference IQA model. This resulted in ~300 million high-quality images.
3. **Diffusion Model-Guided Quality Estimation (Core Novelty):** The cornerstone of our pipeline is the use of a pre-trained diffusion model as a sophisticated quality estimator. This model identifies image-text pair candidates (after a preliminary captioning of the 300M images) that possess a rare combination of visual appeal characteristics crucial for maximizing SFT performance. This involves extracting cross-attention activations with respect to a multi-keyword prompt designed to evoke these desired qualities.
4. **Final Selection & Re-captioning:** The top 3,350 images selected by the diffusion-based scorer were then **re-captioned**. Critically, this re-captioning aimed to generate **moderately descriptive, user-like prompts** rather than exhaustively detailed descriptions, as our preliminary experiments showed this style yields optimal SFT outcomes.
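The funnel shape of the steps above (cheap image-only checks first, an expensive score last, then a top-k cut) can be sketched as follows. This only illustrates the staged design under stated assumptions and is not the pipeline from the paper: `Candidate`, `passes_prefilter`, `select_top_k`, and the `is_safe` and `quality_score` fields are hypothetical names, and the score is a placeholder for whatever the diffusion-based estimator would produce.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_PIXELS = 1_000_000  # "retaining images > 1MPx" from step 2


@dataclass
class Candidate:
    """Hypothetical record for one image moving through the funnel."""
    url: str
    width: int
    height: int
    is_safe: bool          # placeholder for the safety filter's verdict
    quality_score: float   # placeholder for the diffusion-based score


def passes_prefilter(c: Candidate) -> bool:
    """Image-centric checks that need no caption: safety and resolution."""
    return c.is_safe and c.width * c.height > MIN_PIXELS


def select_top_k(candidates: Iterable[Candidate], k: int) -> List[Candidate]:
    """Keep the k highest-scoring candidates that survive pre-filtering."""
    survivors = [c for c in candidates if passes_prefilter(c)]
    return sorted(survivors, key=lambda c: c.quality_score, reverse=True)[:k]


if __name__ == "__main__":
    # Toy pool with made-up values, purely for illustration.
    pool = [
        Candidate("https://example.com/a.jpg", 2000, 1000, True, 0.90),
        Candidate("https://example.com/b.jpg", 800, 600, True, 0.99),
        Candidate("https://example.com/c.jpg", 2000, 2000, False, 0.95),
    ]
    for kept in select_top_k(pool, k=2):
        print(kept.url, kept.quality_score)
```

In the dataset's case the final cut corresponds to `k = 3350`; ordering the stages from cheapest to most expensive means the costly scorer only runs on images that already passed the simple checks.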
### Data Fields
Each instance in the dataset consists of:
* `img_key`: A hash that uniquely identifies a text-image pair.
* `url`: A URL from which the image can be downloaded.
* `prompt`: A synthetic, user-like prompt associated with the corresponding image.
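Because each instance carries a `url` rather than the image bytes, the images have to be fetched separately before training. A minimal sketch using only the standard library, assuming the URLs are still reachable and that a record is a dict with the three fields above; `target_path` and `download_record` are hypothetical helpers, and the `.jpg` fallback extension is an assumption rather than something the dataset guarantees.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def target_path(record: dict, out_dir: Path) -> Path:
    """Build a local file path from `img_key`, reusing the URL's extension if it has one."""
    suffix = Path(urlparse(record["url"]).path).suffix or ".jpg"  # assumed default
    return out_dir / f"{record['img_key']}{suffix}"


def download_record(record: dict, out_dir: Path, timeout: float = 10.0) -> Path:
    """Fetch the image behind `url` and save it under its key.

    Network and HTTP errors are not handled here and propagate to the caller.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = target_path(record, out_dir)
    with urlopen(record["url"], timeout=timeout) as response:
        path.write_bytes(response.read())
    return path
```

Naming files by `img_key` keeps the link between each saved image and its `prompt`, so the pairs can be reassembled later without relying on the URL.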
### Data Splits
The dataset contains a single split:
* `train`: 3,350 samples.
## Usage
The Alchemist dataset is designed for supervised fine-tuning of text-to-image models. Our paper demonstrates its effectiveness across five different Stable Diffusion architectures (SD1.5, SD2.1, SDXL, SD3.5 M, SD3.5 L).
### Getting Started with the `datasets` library
To load the dataset:
```python
from datasets import load_dataset
dataset = load_dataset("yandex/alchemist", split="train")
# Example: Accessing the first sample
print(dataset[0]['prompt'])
```
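Since the prompts are meant to be moderately descriptive rather than exhaustive, one quick sanity check after loading is to look at their length distribution. A minimal sketch, assuming each row is a dict-like object with a `prompt` string; `prompt_length_stats` is a hypothetical helper, and whitespace word counts are only a rough proxy for the token counts a specific text encoder would see.

```python
from statistics import mean, median
from typing import Dict, Iterable, Mapping


def prompt_length_stats(rows: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Summarize prompt lengths in whitespace-separated words."""
    lengths = [len(row["prompt"].split()) for row in rows]
    if not lengths:
        raise ValueError("no rows given")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }
```

With the dataset loaded as above, `prompt_length_stats(dataset)` should work directly, since iterating over a `datasets.Dataset` yields one dict per row.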

Provider: maas
Created: 2025-05-31



