terminusresearch/ideogram-75k
Source: Hugging Face. Updated 2024-07-12. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/terminusresearch/ideogram-75k
Description:
---
license: agpl-3.0
---
# Ideogram-75k
## Dataset Details
This dataset is not authorised by, curated by, or related to Ideogram.
#### This dataset contains the `ideogram-25k` dataset contents. Do not use both!
### Dataset Description
- **Curated by:** @pseudoterminalx
- **License:** AGPLv3.
**Note**: All models created using this dataset are derivatives of it, and must have an open release under a permissive or copyleft license.
### Dataset Sources
Pulled ~75,000 images from Ideogram, a proprietary image generation service that excels at typography.
## Uses
- Fine-tuning or training text-to-image models and classifiers
- Analysis of Ideogram user bias
## Dataset Structure
- Each filename is the SHA256 hash of the image data, and can be used to verify the file's integrity.
- The `caption` column was obtained by asking Microsoft Florence2 (ft) to accurately describe what it sees.
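Because filenames are derived from the image bytes, a downloaded file can be checked against its own name. The following is a minimal sketch, assuming the filename stem is the lowercase hexadecimal SHA256 digest followed by an extension; the card does not state the exact encoding, and the helper names `sha256_of_file` and `filename_matches_content` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large images are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def filename_matches_content(path: Path) -> bool:
    """Check that the filename stem equals the digest of the file contents.

    Assumption: the stem is a hex digest (case-insensitive comparison).
    """
    return path.stem.lower() == sha256_of_file(path)
```

A file for which `filename_matches_content` returns `False` was either corrupted in transit or does not follow the naming scheme assumed here.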
## Dataset Creation
### Curation Rationale
Ideogram's users focus on typography generations, which makes it a suitable source for a large amount of high-quality typography data.
As a synthetic data source, its outputs are free of copyright concerns.
#### Data Collection and Processing
Used a custom Selenium application in Python that monitors the Ideogram service for posts and immediately saves them to disk.
Data is deduplicated by its SHA256 hash.
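Hash-based deduplication of this kind can be sketched as follows. This is an illustration only, not the curator's Selenium application: it assumes the images are available as in-memory byte strings, and the function name `deduplicate_by_sha256` is hypothetical.

```python
import hashlib
from typing import Dict, Iterable


def deduplicate_by_sha256(blobs: Iterable[bytes]) -> Dict[str, bytes]:
    """Keep the first occurrence of each distinct byte string.

    Returns a mapping from hex SHA256 digest to the image bytes, so the
    keys can double as content-derived filenames.
    """
    unique: Dict[str, bytes] = {}
    for blob in blobs:
        key = hashlib.sha256(blob).hexdigest()
        # setdefault leaves an existing entry untouched, dropping repeats.
        unique.setdefault(key, blob)
    return unique
```

Note that this only removes byte-identical files; two visually identical images with different encodings or metadata would hash differently and both be kept.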
## Bias, Risks, and Limitations
As the captions all currently come from a single synthetic source, the bias of that single captioning model is present throughout this dataset.
More captions will be added.
## Citation
If any model is built using this dataset, or any further augmentations (e.g. new captions) are added, this page and Terminus Research should be cited.
Provider:
terminusresearch



