
mikelmh025/ClothingADC

Source: Hugging Face | Updated: 2026-03-23 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/mikelmh025/ClothingADC
Resource description:
---
license: cc-by-4.0
pretty_name: Clothing-ADC
task_categories:
- image-classification
- image-feature-extraction
language:
- en
source_datasets:
- original
size_categories:
- 1M<n<10M
paperswithcode_id: clothing-adc
arxiv: 2408.11338
dataset_info:
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: class
    dtype: string
  - name: color
    dtype: string
  - name: material
    dtype: string
  - name: pattern
    dtype: string
  splits:
  - name: train
    num_examples: 1036738
  - name: validation
    num_examples: 20000
  - name: test
    num_examples: 20000
tags:
- clothing
- fashion
- computer-vision
- fine-grained-recognition
- label-noise
- noisy-labels
- long-tail
- class-imbalance
- benchmark
- web-crawled
- data-curation
- robustness
---

# Clothing-ADC Dataset

**Paper:** [Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond](https://arxiv.org/abs/2408.11338)

**Authors:** Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhu, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu

**Institutions:** UC Santa Cruz, HKUST(GZ), UC Davis, SUSTech, Zhejiang University, Yale University, Carnegie Mellon University, Nanyang Technological University, Microsoft

---

## Dataset Summary

**Clothing-ADC** is a large-scale clothing image classification dataset built with the **Automatic Dataset Construction (ADC)** pipeline. Instead of the traditional approach of collecting images first and then annotating them, ADC reverses this process: it uses GPT-4 to design fine-grained class hierarchies, then automatically collects labeled images from Google Image Search using the class descriptions as queries.
The result is a dataset with **over 1 million images**, **12 main clothing classes**, and **12,000 fine-grained subclasses** defined by combinations of color, material, and pattern attributes, all without requiring domain expertise or manual annotation of individual samples.

The dataset also serves as a benchmark platform for three real-world data challenges that arise during automatic dataset construction:

1. **Label Noise Detection**
2. **Learning with Noisy Labels**
3. **Class-Imbalanced Learning**

---

## Dataset Statistics

| Property | Value |
|---|---|
| Total samples | 1,076,738 |
| Image resolution | 256 × 256 |
| Main classes | 12 |
| Total subclasses | 12,000 |
| Avg. samples per subclass | ~89.73 |
| Label noise rate (train) | 22.2% – 32.7% |

### Dataset Splits

| Split | Size | Label Quality |
|---|---|---|
| Train | 1,036,738 | Web-collected (noisy) |
| Validation | 20,000 | Human-verified (clean) |
| Test | 20,000 | Human-verified (clean) |

### Main Classes (12)

Sweater, Windbreaker, T-shirt, Shirt, Knitwear, Hoodie, Jacket, Suit, Shawl, Dress, Vest, Underwear

### Subclass Structure

Each main class has **1,000 subclasses** defined by three attributes with 10 options each:

| Attribute | # Options | Examples |
|---|---|---|
| Color | 10 | white, black, red, navy, grey, … |
| Material | 10 | cotton, wool, polyester, denim, … |
| Pattern | 10 | solid, striped, plaid, floral, … |

Search queries are formed as `"<Color> <Material> <Pattern> <Clothing Type>"` (e.g., `"white cotton fisherman sweater"`), which serve simultaneously as the image search query and the sample's fine-grained label.

---

## ADC Pipeline

The ADC pipeline consists of three steps:

**Step 1: Dataset Design with LLMs**

GPT-4 is prompted to enumerate attribute options for each clothing category (`"Show me <30–80> ways to describe <Attribute> of <Class>"`). The resulting categories are reviewed iteratively, avoiding the need for human domain expertise.

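
The subclass structure and query format described above can be sketched in a few lines. This is only an illustration, not the authors' collection code: the function name `build_queries` is hypothetical, and the attribute lists contain just the example values quoted in the card, so they are shorter than the 10 options per attribute used in the real dataset.

```python
from itertools import product

# The 12 main classes listed in the dataset card.
MAIN_CLASSES = [
    "Sweater", "Windbreaker", "T-shirt", "Shirt", "Knitwear", "Hoodie",
    "Jacket", "Suit", "Shawl", "Dress", "Vest", "Underwear",
]

# Assumption: only the example values from the card, not the full
# 10-option lists, which the card does not enumerate.
COLORS = ["white", "black", "red", "navy", "grey"]
MATERIALS = ["cotton", "wool", "polyester", "denim"]
PATTERNS = ["solid", "striped", "plaid", "floral"]


def build_queries(classes, colors, materials, patterns):
    """Return one '<Color> <Material> <Pattern> <Clothing Type>' string per subclass."""
    return [
        f"{color} {material} {pattern} {cls.lower()}"
        for cls, color, material, pattern in product(classes, colors, materials, patterns)
    ]


queries = build_queries(MAIN_CLASSES, COLORS, MATERIALS, PATTERNS)
print(len(queries))  # 12 * 5 * 4 * 4 = 960 with the shortened lists
print(queries[0])    # white cotton solid sweater

# With the full 10 options per attribute, the count matches the card.
assert 12 * 10 * 10 * 10 == 12_000
```

Because each query string doubles as the label, the list returned here would also be the label vocabulary for the fine-grained task.
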
**Step 2: Automated Labeling**

The Google Image API is queried with composite search strings. The top ~100 results per query are collected and labeled automatically. Each query string is the sample's label, eliminating manual annotation entirely.

**Step 3: Data Curation and Cleaning**

- **Algorithmic curation:** Label noise detection methods (e.g., Simi-Feat / Docta) automatically filter mislabeled samples, reducing noise from ~22.2% to ~10.7%.
- **Human-in-the-loop (clean splits):** For the validation and test sets, human annotators on Amazon MTurk verified labels by selecting correct samples from machine-labeled batches (minimum 4 of 20 per query). Only samples with full human–machine agreement are included in the clean splits.

---

## Benchmark Tasks

### 1. Label Noise Detection (`Clothing-ADC-Detection`)

A 20,000-sample subset with both noisy and clean labels, annotated by 3 Amazon MTurk workers per image (correct / unsure / incorrect). Used to benchmark noise detection algorithms.

**Metric:** F1-score of detected corrupted instances

| Method | F1-Score |
|---|---|
| CORES | 0.4793 |
| Confident Learning (CL) | 0.4352 |
| Deep k-NN | 0.3991 |
| Simi-Feat | **0.5721** |

---

### 2. Label Noise Learning (`Clothing-ADC` / `Clothing-ADC-Tiny`)

Train on the full noisy training set; evaluate on the clean held-out test set. A tiny version (~50K train images) is also provided for fast experimentation.

**Metric:** Classification accuracy on clean test set (12-class)

Selected results (ResNet-50, 20 epochs):

| Method | Full | Tiny |
|---|---|---|
| Cross-Entropy (baseline) | — | — |
| Positive Label Smoothing | ↑ | ↑ |
| Taylor CE | **best** | **best** |
| DivideMix | competitive | competitive |

---

### 3. Class-Imbalanced Learning (`Clothing-ADC-CLT`)

A class-level long-tail version of the dataset, with imbalance ratios ρ ∈ {10, 50, 100}.
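
The card does not say how per-class sizes are derived from ρ. A common convention in the long-tail literature takes ρ to be the ratio between the largest and smallest class and decays the class sizes exponentially in between. The sketch below assumes that convention; the function name `long_tail_counts` and the head-class size of 1,000 are hypothetical and not taken from the dataset.

```python
def long_tail_counts(n_max: int, num_classes: int, rho: float) -> list[int]:
    """Per-class sample counts decaying exponentially from n_max to n_max / rho.

    Assumption: rho = (largest class size) / (smallest class size).
    """
    if num_classes < 2:
        return [n_max] * num_classes
    return [
        round(n_max * rho ** (-i / (num_classes - 1)))
        for i in range(num_classes)
    ]


# The three imbalance ratios named in the card, applied to 12 classes.
for rho in (10, 50, 100):
    counts = long_tail_counts(n_max=1000, num_classes=12, rho=rho)
    print(f"rho={rho}: head={counts[0]}, tail={counts[-1]}")
```

Under this assumption a larger ρ leaves the head class untouched and shrinks the tail classes, which is consistent with the benchmark becoming harder as ρ grows.
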
Noisy samples are removed prior to constructing this benchmark using algorithmic curation (Docta + learning-centric curation), yielding ~562,263 clean images.

**Metric:** δ-worst accuracy (interpolates between mean accuracy at δ=0 and worst-class accuracy at δ→∞)

| Method | ρ=10 (δ=0) | ρ=100 (δ=0) | ρ=10 (δ=∞) | ρ=100 (δ=∞) |
|---|---|---|---|---|
| Cross-Entropy | 57.80 | 30.10 | 0.96 | 0.00 |
| Focal Loss | 72.70 | 62.28 | 38.12 | 13.44 |
| LDAM | 72.50 | 63.25 | 40.90 | 15.69 |
| Balanced Softmax | 74.18 | 69.47 | 48.54 | 50.60 |
| Logit-Adjust | **74.08** | **69.44** | 47.45 | 43.26 |
| Drops | 73.66 | 67.15 | **50.85** | 32.43 |

---

## Comparison with Existing Datasets

| Dataset | # Train/Test | # Classes | Noise Rate (%) | Has Attributes | Auto Annotation | Requires Expert? |
|---|---|---|---|---|---|---|
| iNaturalist | 579k/279k | 54k | ~0 | ✗ | ✗ | ✓ |
| WebVision | 2.4M/100k | 1000 | 20 | ✗ | ✓ | ✓ |
| ANIMAL-10N | 50k/10k | 10 | 8 | ✗ | ✗ | ✗ |
| CIFAR-10N | 50k/10k | 10 | 9–40 | ✗ | ✗ | ✗ |
| Food-101N | 75.75k/25.25k | 101 | 18.4 | ✗ | ✗ | ✓ |
| Clothing1M | 1M total | 14 | 38.5 | ✗ | ✗ | ✓ |
| **Clothing-ADC (Ours)** | **1M/20k** | **12** | **22.2–32.7** | **12k** | **✓** | **✗** |

---

## Data Fields

Each sample contains:

- `id`: unique image identifier string
- `image`: PIL image (256×256 RGB)
- `class`: main clothing category string (e.g., `"Sweater"`)
- `color`: color attribute label (e.g., `"white"`)
- `material`: material attribute label (e.g., `"cotton"`)
- `pattern`: pattern attribute label (e.g., `"fisherman"`)

---

## Usage

```python
from datasets import load_dataset

# Full dataset
ds = load_dataset("mikelmh025/ClothingADC")

# Access splits
train = ds["train"]
val = ds["validation"]
test = ds["test"]

# Example: iterate over test set
for sample in test:
    image = sample["image"]
    category = sample["class"]
    color = sample["color"]
    material = sample["material"]
    pattern = sample["pattern"]
```

---

## Citation

If you use Clothing-ADC in your
research, please cite:

```bibtex
@article{liu2024adc,
  title   = {Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond},
  author  = {Minghao Liu and Zonglin Di and Jiaheng Wei and Zhongruo Wang and Hengxiang Zhang and Ruixuan Xiao and Haoyu Wang and Jinlong Pang and Hao Chen and Ankit Shah and Hongxin Wei and Xinlei He and Zhaowei Zhu and Haobo Wang and Lei Feng and Jindong Wang and James Davis and Yang Liu},
  journal = {arXiv preprint arXiv:2408.11338},
  year    = {2024}
}
```

---

## License

This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). Images are collected from Google Image Search and remain subject to their original source licenses. This dataset is intended for research purposes only.
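
As a supplementary note on the Label Noise Detection benchmark, the metric "F1-score of detected corrupted instances" can be computed as sketched below. This is a minimal illustration, not the authors' evaluation script: the function name `detection_f1` and the toy inputs are hypothetical, and it assumes each sample already has a boolean ground-truth flag (the card does not specify how the three MTurk votes are aggregated into one).

```python
def detection_f1(is_corrupted, is_flagged):
    """F1-score where the positive class is 'this label is corrupted'.

    is_corrupted: ground-truth booleans (True = label is wrong).
    is_flagged:   detector output booleans (True = detector flags the label).
    """
    pairs = list(zip(is_corrupted, is_flagged))
    tp = sum(1 for c, f in pairs if c and f)        # corrupted and flagged
    fp = sum(1 for c, f in pairs if not c and f)    # clean but flagged
    fn = sum(1 for c, f in pairs if c and not f)    # corrupted but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with six samples (made-up values, for illustration only).
truth = [True, True, False, False, True, False]
flagged = [True, False, False, True, True, False]
print(round(detection_f1(truth, flagged), 4))
```

Treating "corrupted" as the positive class means a detector that flags nothing scores 0, regardless of how many clean labels it leaves alone.
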
Provided by:
mikelmh025