edgarcancinoe/celebahq_512_id_clusters

Name: edgarcancinoe/celebahq_512_id_clusters
Creator: edgarcancinoe
Published: 2026-04-02 17:44:23
License: No description available

Source: Hugging Face. Updated: 2026-04-02. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/edgarcancinoe/celebahq_512_id_clusters

Resource description:

---
pretty_name: "celebahq_512 with SRK identity labels"
task_categories:
- image-classification
tags:
- faces
- celeba-hq
- clustering
- identity-labels
source_datasets:
- jxie/celeba-hq
size_categories:
- 10K<n<100K
---

# celebahq_512 with SRK identity labels

## Summary

This dataset is a derived version of [jxie/celeba-hq](https://huggingface.co/datasets/jxie/celeba-hq/tree/7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b). It keeps the original image set and adds automatically generated identity-group labels derived from face-embedding clustering.

As explained in our experimental setup, we use CelebA-HQ from Karras et al. (2018), specifically the Hugging Face snapshot at revision `7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b`. The referenced CelebA-HQ version provides gender labels but no identity annotations. To support identity unlearning, we therefore construct identity labels automatically by clustering the embedding space.

## How identity labels were created

We cluster the embedding space using DBSCAN, a density-based method that groups samples according to local similarity without requiring a predefined number of clusters. We use the scikit-learn implementation with cosine distance.

Nearest-neighbor cosine-similarity analysis reveals two clear modes, with peaks around `s ~= 0.8` and `s ~= 0.25`. The high-similarity peak corresponds to samples of the same identity, while the lower peak captures ArcFace-similar but distinct individuals. Between these modes, a minimum appears around `s ~= 0.4`, providing a natural separation threshold.

For this release, we use:

- cosine distance threshold `eps = 0.35` (cosine distance is `1 - s`, so two samples are neighbors when `s >= 0.65`)
- minimum samples per cluster `min_samples = 2`

This procedure produces `9,683` clusters over `27,996` images, which we use as identity labels during training. Manual inspection confirms that the resulting clusters are visually consistent.
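
The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the exact pipeline used for this release: the helper names `cluster_identities` and `nearest_neighbor_similarity` are hypothetical, and the embeddings are a small synthetic stand-in rather than real face embeddings. Only the `eps`, `min_samples`, and cosine-distance settings come from the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_identities(embeddings, eps=0.35, min_samples=2):
    """Cluster embeddings with DBSCAN under cosine distance.

    Returns one integer label per row; scikit-learn uses -1 for
    samples that do not belong to any cluster.
    """
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return model.fit_predict(embeddings)


def nearest_neighbor_similarity(embeddings):
    """Cosine similarity of each embedding to its closest other embedding."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
    return sims.max(axis=1)


# Synthetic stand-in (assumption): 3 "identities" with 4 noisy samples each.
rng = np.random.default_rng(0)
centers = np.eye(3, 16)  # three orthogonal directions in 16-D
toy = np.repeat(centers, 4, axis=0) + 0.05 * rng.normal(size=(12, 16))

labels = cluster_identities(toy)
nn_sims = nearest_neighbor_similarity(toy)
print(labels)
print(nn_sims.round(2))
```

A histogram of `nn_sims` over real embeddings is one way to look for the two similarity modes mentioned above before settling on `eps`.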
Empirically, similarities above `s = 0.6` almost always correspond to the same individual, with only rare exceptions arising from lighting changes or strong facial occlusions.

## Columns

- `image`: image file
- `file_name`: original file name
- `cluster_id`: DBSCAN-derived identity label
- `cluster_size`: number of images assigned to that identity cluster

## Notes

- `cluster_id` values are automatically generated labels, not official CelebA-HQ person identifiers.
- The identity annotations are derived from embedding clustering rather than provided by the source dataset.
- By default, this export keeps only the image and cluster-related columns needed for downstream identity-unlearning experiments.

## Source references

- Original dataset snapshot: [jxie/celeba-hq @ 7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b](https://huggingface.co/datasets/jxie/celeba-hq/tree/7ecc6a45edfb5483ccf2f7df1035d298ffe7c76b)
- Original paper: [Karras et al., 2018, "A Style-Based Generator Architecture for Generative Adversarial Networks"](https://arxiv.org/abs/1812.04948)
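
As a usage illustration for the `cluster_id` and `cluster_size` columns, the sketch below recomputes `cluster_size` from `cluster_id` and splits rows into forget and retain sets by identity. It is only a sketch under stated assumptions: the rows are hypothetical plain dictionaries mimicking the documented columns (the `image` column is omitted), and the helper names `add_cluster_size` and `split_by_identity` are not part of this dataset or any library.

```python
from collections import Counter


def add_cluster_size(rows):
    """Return copies of rows with `cluster_size` recomputed from `cluster_id`."""
    counts = Counter(row["cluster_id"] for row in rows)
    return [{**row, "cluster_size": counts[row["cluster_id"]]} for row in rows]


def split_by_identity(rows, forget_ids):
    """Split rows into (forget, retain) lists by `cluster_id` membership."""
    forget_ids = set(forget_ids)
    forget = [r for r in rows if r["cluster_id"] in forget_ids]
    retain = [r for r in rows if r["cluster_id"] not in forget_ids]
    return forget, retain


# Hypothetical rows (assumption): real rows would also carry an `image` field.
rows = [
    {"file_name": "a.png", "cluster_id": 0},
    {"file_name": "b.png", "cluster_id": 0},
    {"file_name": "c.png", "cluster_id": 1},
    {"file_name": "d.png", "cluster_id": 1},
    {"file_name": "e.png", "cluster_id": 1},
]
rows = add_cluster_size(rows)
forget, retain = split_by_identity(rows, forget_ids=[1])
print(len(forget), len(retain))
```

Splitting on `cluster_id` rather than on individual files keeps all images of one automatically labeled identity on the same side of the split.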

Provider:

edgarcancinoe
