edgarcancinoe/celebahq_512_id_clusters
Source: Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/edgarcancinoe/celebahq_512_id_clusters
Resource description:
---
pretty_name: "celebahq_512 with SRK identity labels"
task_categories:
- image-classification
tags:
- faces
- celeba-hq
- clustering
- identity-labels
source_datasets:
- jxie/celeba-hq
size_categories:
- 10K<n<100K
---
# celebahq_512 with SRK identity labels
## Summary
This dataset is a derived version of [jxie/celeba-hq](https://huggingface.co/datasets/jxie/celeba-hq/tree/7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b). It keeps the original image set and adds automatically generated identity-group labels derived from face-embedding clustering.
As explained in our experimental setup, we use CelebA-HQ from Karras et al. (2018), specifically the Hugging Face snapshot at revision `7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b`. The referenced CelebA-HQ version provides gender labels but no identity annotations. To support identity unlearning, we therefore construct identity labels automatically by clustering the embedding space.
## How identity labels were created
We cluster the face-embedding space with DBSCAN, a density-based method that groups samples by local similarity and does not require a predefined number of clusters. We use the scikit-learn implementation (`sklearn.cluster.DBSCAN`) with cosine distance.
An analysis of each sample's nearest-neighbor cosine similarity reveals two clear modes, with peaks around `s ~= 0.8` and `s ~= 0.25`. The high-similarity peak corresponds to images of the same identity, while the lower peak corresponds to distinct individuals whose ArcFace embeddings are nonetheless similar. The minimum between the two modes, around `s ~= 0.4`, provides a natural separation threshold.
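The nearest-neighbor analysis above can be sketched as follows. This is a minimal illustration on synthetic vectors, not the pipeline used for this release: the function name `nearest_neighbor_similarity`, the toy data, and the histogram binning are all assumptions made for the example.

```python
import numpy as np


def nearest_neighbor_similarity(embeddings: np.ndarray) -> np.ndarray:
    """For each row, return the cosine similarity to its most similar other row."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Mask self-matches, which would otherwise always win with similarity 1.
    np.fill_diagonal(sims, -np.inf)
    return sims.max(axis=1)


# Synthetic stand-in for face embeddings: 3 "identities" with 2 noisy views each.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64))
embeddings = np.repeat(centers, 2, axis=0) + 0.1 * rng.normal(size=(6, 64))

nn_sim = nearest_neighbor_similarity(embeddings)

# A histogram of these values is where the two modes would show up on real data.
counts, bin_edges = np.histogram(nn_sim, bins=20, range=(-1.0, 1.0))
```

The full pairwise matrix grows quadratically with the number of images, so at the scale of this dataset the similarities would likely need to be computed in chunks or with a nearest-neighbor index.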
For this release, we use:
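- cosine distance threshold `eps = 0.35`, which corresponds to a cosine similarity of `s = 0.65` (cosine distance is `1 - s`) and is therefore stricter than the `s ~= 0.4` minimum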
- minimum samples per cluster `min_samples = 2`
This procedure produces `9,683` clusters over `27,996` images, which we use as identity labels during training. Manual inspection indicates that the resulting clusters are visually consistent. Empirically, pairs with similarity `s > 0.6` almost always show the same individual; the rare exceptions arise from lighting changes or strong facial occlusions.
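A minimal sketch of the clustering step, using the parameters stated above with `sklearn.cluster.DBSCAN`. The embeddings here are synthetic placeholders (assumed for illustration only); the real inputs would be the face embeddings of the CelebA-HQ images.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for face embeddings (hypothetical data, not from the dataset):
# 3 "identities" with 2 noisy views each, plus 1 unrelated face.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64))
views = np.repeat(centers, 2, axis=0) + 0.1 * rng.normal(size=(6, 64))
singleton = rng.normal(size=(1, 64))
embeddings = np.vstack([views, singleton])

# Parameters stated in this card: cosine distance, eps = 0.35, min_samples = 2.
clusterer = DBSCAN(eps=0.35, min_samples=2, metric="cosine")
labels = clusterer.fit_predict(embeddings)

# DBSCAN marks samples that belong to no cluster with the label -1.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print(f"{n_clusters} clusters, {n_noise} noise samples")
```

With `min_samples = 2`, any two images within `eps` of each other already form a cluster, so only images with no sufficiently close neighbor receive the noise label `-1`. This card does not state how such samples were handled in the release.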
## Columns
- `image`: image file
- `file_name`: original file name
- `cluster_id`: DBSCAN-derived identity label
- `cluster_size`: number of images assigned to that identity cluster
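The relationship between `cluster_id` and `cluster_size` can be illustrated with a few hand-written rows. The file names and ids below are hypothetical, and the `image` column is omitted; the sketch only shows how a per-row size follows from counting ids.

```python
from collections import Counter

# Hypothetical rows mirroring the `file_name` and `cluster_id` columns.
rows = [
    {"file_name": "a.jpg", "cluster_id": 0},
    {"file_name": "b.jpg", "cluster_id": 0},
    {"file_name": "c.jpg", "cluster_id": 1},
    {"file_name": "d.jpg", "cluster_id": 0},
    {"file_name": "e.jpg", "cluster_id": 1},
]

# Count how many images share each identity label.
sizes = Counter(row["cluster_id"] for row in rows)

# Attach the count to every row, as the `cluster_size` column does.
for row in rows:
    row["cluster_size"] = sizes[row["cluster_id"]]

# Example use: keep only identities with at least 3 images.
large_identities = [row for row in rows if row["cluster_size"] >= 3]
```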
## Notes
- `cluster_id` values are automatically generated labels, not official CelebA-HQ person identifiers.
- The identity annotations are derived from embedding clustering rather than provided by the source dataset.
- By default, this export keeps only the image and cluster-related columns needed for downstream identity-unlearning experiments.
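For downstream use, one plausible way to consume the labels is to split records into a forget set and a retain set by `cluster_id`. This split is an assumption about a typical identity-unlearning setup, not a procedure described in this card; the records and the chosen `forget_ids` are hypothetical.

```python
# Hypothetical records using the `file_name` and `cluster_id` columns.
records = [
    {"file_name": "a.jpg", "cluster_id": 0},
    {"file_name": "b.jpg", "cluster_id": 1},
    {"file_name": "c.jpg", "cluster_id": 1},
    {"file_name": "d.jpg", "cluster_id": 2},
]

# Identities selected for unlearning (assumed for illustration).
forget_ids = {1}

# Every image of a selected identity goes to the forget set; the rest is retained.
forget_set = [r for r in records if r["cluster_id"] in forget_ids]
retain_set = [r for r in records if r["cluster_id"] not in forget_ids]
```

Splitting on `cluster_id` rather than on individual files keeps all images of one identity on the same side of the split.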
## Source references
- Original dataset snapshot: [jxie/celeba-hq @ 7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b](https://huggingface.co/datasets/jxie/celeba-hq/tree/7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b)
- Original paper: [Karras et al., 2018, "Progressive Growing of GANs for Improved Quality, Stability, and Variation"](https://arxiv.org/abs/1710.10196)
Provided by:
edgarcancinoe



