mikelmh025/ClothingADC
Source: Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mikelmh025/ClothingADC
Resource description:
---
license: cc-by-4.0
pretty_name: Clothing-ADC
task_categories:
- image-classification
- image-feature-extraction
language:
- en
source_datasets:
- original
size_categories:
- 1M<n<10M
paperswithcode_id: clothing-adc
arxiv: 2408.11338
dataset_info:
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: class
    dtype: string
  - name: color
    dtype: string
  - name: material
    dtype: string
  - name: pattern
    dtype: string
  splits:
  - name: train
    num_examples: 1036738
  - name: validation
    num_examples: 20000
  - name: test
    num_examples: 20000
tags:
- clothing
- fashion
- computer-vision
- fine-grained-recognition
- label-noise
- noisy-labels
- long-tail
- class-imbalance
- benchmark
- web-crawled
- data-curation
- robustness
---
# Clothing-ADC Dataset
**Paper:** [Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond](https://arxiv.org/abs/2408.11338)
**Authors:** Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhu, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu
**Institutions:** UC Santa Cruz, HKUST(GZ), UC Davis, SUSTech, Zhejiang University, Yale University, Carnegie Mellon University, Nanyang Technological University, Microsoft
---
## Dataset Summary
**Clothing-ADC** is a large-scale clothing image classification dataset built with the **Automatic Dataset Construction (ADC)** pipeline. Instead of the traditional approach of collecting images first and then annotating them, ADC reverses this process: it uses GPT-4 to design fine-grained class hierarchies, then automatically collects labeled images from Google Image Search using the class descriptions as queries.
The result is a dataset with **over 1 million images**, **12 main clothing classes**, and **12,000 fine-grained subclasses** defined by combinations of color, material, and pattern attributes — all without requiring domain expertise or manual annotation of individual samples.
The dataset also serves as a benchmark platform for three real-world data challenges that arise during automatic dataset construction:
1. **Label Noise Detection**
2. **Learning with Noisy Labels**
3. **Class-Imbalanced Learning**
---
## Dataset Statistics
| Property | Value |
|---|---|
| Total samples | 1,076,738 |
| Image resolution | 256 × 256 |
| Main classes | 12 |
| Total subclasses | 12,000 |
| Avg. samples per subclass | ~89.73 |
| Label noise rate (train) | 22.2% – 32.7% |
### Dataset Splits
| Split | Size | Label Quality |
|---|---|---|
| Train | 1,036,738 | Web-collected (noisy) |
| Validation | 20,000 | Human-verified (clean) |
| Test | 20,000 | Human-verified (clean) |
### Main Classes (12)
Sweater, Windbreaker, T-shirt, Shirt, Knitwear, Hoodie, Jacket, Suit, Shawl, Dress, Vest, Underwear
### Subclass Structure
Each main class has **1,000 subclasses** defined by three attributes with 10 options each:
| Attribute | # Options | Examples |
|---|---|---|
| Color | 10 | white, black, red, navy, grey, … |
| Material | 10 | cotton, wool, polyester, denim, … |
| Pattern | 10 | solid, striped, plaid, floral, … |
Each search query takes the form `"<Color> <Material> <Pattern> <Clothing Type>"` (e.g., `"white cotton fisherman sweater"`) and serves as both the image search query and the sample's fine-grained label.
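
As a rough illustration of how the label space is composed, the sketch below enumerates query strings for one class and re-derives the counts quoted in this card. The attribute lists are short hypothetical placeholders built from the examples in the table above (the real lists have 10 options each), and `build_queries` is an illustrative helper, not part of any released tooling.

```python
from itertools import product

def build_queries(colors, materials, patterns, clothing_type):
    """Return one '<color> <material> <pattern> <type>' string per combination."""
    return [
        f"{c} {m} {p} {clothing_type}"
        for c, m, p in product(colors, materials, patterns)
    ]

# Hypothetical placeholder options, taken from the examples in the table above.
colors = ["white", "black", "red"]
materials = ["cotton", "wool"]
patterns = ["solid", "striped"]

queries = build_queries(colors, materials, patterns, "sweater")
print(len(queries))   # 12 = 3 * 2 * 2
print(queries[0])     # white cotton solid sweater

# With the full 10 options per attribute, the counts in this card follow:
subclasses_per_class = 10 * 10 * 10            # 1000
total_subclasses = 12 * subclasses_per_class   # 12000
avg_per_subclass = 1_076_738 / total_subclasses
print(round(avg_per_subclass, 2))              # 89.73
```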
---
## ADC Pipeline
The ADC pipeline consists of three steps:
**Step 1 — Dataset Design with LLMs**
GPT-4 is prompted to enumerate attribute options for each clothing category (`"Show me <30–80> ways to describe <Attribute> of <Class>"`). The resulting categories are reviewed iteratively, avoiding the need for human domain expertise.
**Step 2 — Automated Labeling**
The Google Image API is queried with composite search strings. The top ~100 results per query are collected and labeled automatically. Each query string is the sample's label, eliminating manual annotation entirely.
**Step 3 — Data Curation and Cleaning**
- **Algorithmic curation:** Label noise detection methods (e.g., Simi-Feat / Docta) automatically filter mislabeled samples, reducing noise from ~22.2% to ~10.7%.
- **Human-in-the-loop (clean splits):** For the validation and test sets, human annotators on Amazon MTurk verified labels by selecting correct samples from machine-labeled batches (minimum 4 of 20 per query). Only samples with full human–machine agreement are included in the clean splits.
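
To give a feel for what algorithmic curation does, here is a deliberately simplified k-nearest-neighbour label-voting detector. It is not the Simi-Feat or Docta implementation; it only illustrates the general idea of flagging a sample whose label disagrees with most of its neighbours in feature space. The function name, the cosine-similarity choice, and the `k` and `threshold` parameters are all assumptions made for this sketch.

```python
import numpy as np

def flag_suspicious_labels(features, labels, k=3, threshold=0.5):
    """Flag samples whose label disagrees with most of their k nearest neighbours.

    features: (n, d) float array of image embeddings (assumed precomputed).
    labels:   (n,) int array of possibly noisy labels.
    Returns a boolean array, True where the label looks suspicious.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Cosine similarity between all pairs of samples.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbour
    # Indices of the k most similar samples for each row.
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    # Fraction of neighbours that share the sample's label.
    agreement = (labels[neighbours] == labels[:, None]).mean(axis=1)
    return agreement < threshold

# Toy data: two tight clusters, with one label flipped in the first cluster.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05],
                  [0.0, 1.0], [0.1, 0.9], [0.1, 1.0], [0.05, 0.95]])
noisy = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # index 3 is mislabeled
flags = flag_suspicious_labels(feats, noisy, k=3)
print(np.flatnonzero(flags))  # [3]
```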
---
## Benchmark Tasks
### 1. Label Noise Detection (`Clothing-ADC-Detection`)
A 20,000-sample subset with both noisy and clean labels, annotated by 3 Amazon MTurk workers per image (correct / unsure / incorrect). Used to benchmark noise detection algorithms.
**Metric:** F1-score of detected corrupted instances
| Method | F1-Score |
|---|---|
| CORES | 0.4793 |
| Confident Learning (CL) | 0.4352 |
| Deep k-NN | 0.3991 |
| Simi-Feat | **0.5721** |
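
The metric above can be sketched in a few lines, assuming the standard precision/recall F1 with "label is corrupted" as the positive class. The card does not spell out the exact evaluation protocol, so treat `detection_f1` and the toy inputs as illustrative only.

```python
def detection_f1(predicted_noisy, actually_noisy):
    """F1-score where the positive class is 'label is corrupted'."""
    pairs = list(zip(predicted_noisy, actually_noisy))
    tp = sum(p and a for p, a in pairs)      # flagged and truly corrupted
    fp = sum(p and not a for p, a in pairs)  # flagged but actually clean
    fn = sum(a and not p for p, a in pairs)  # corrupted but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 6 samples, 3 truly corrupted, detector flags 3 (2 correctly).
predicted = [True, True, False, True, False, False]
truth     = [True, True, True, False, False, False]
print(round(detection_f1(predicted, truth), 4))  # 0.6667
```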
---
### 2. Label Noise Learning (`Clothing-ADC` / `Clothing-ADC-Tiny`)
Train on the full noisy training set; evaluate on the clean held-out test set. A tiny version (~50K train images) is also provided for fast experimentation.
**Metric:** Classification accuracy on clean test set (12-class)
Selected results (ResNet-50, 20 epochs), shown qualitatively relative to the cross-entropy baseline; see the paper for exact accuracies:
| Method | Full | Tiny |
|---|---|---|
| Cross-Entropy (baseline) | — | — |
| Positive Label Smoothing | ↑ | ↑ |
| Taylor CE | **best** | **best** |
| DivideMix | competitive | competitive |
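
For readers unfamiliar with the robust losses named in the table, the sketch below gives NumPy versions of two of them for a single prediction: cross-entropy against a smoothed target, and Taylor cross-entropy as a truncated series of -log p. These follow the commonly cited formulations and are not taken from the benchmark code; the smoothing rate `r` and series order `t` are illustrative values.

```python
import numpy as np

def smoothed_cross_entropy(probs, target, r=0.1):
    """Cross-entropy against a smoothed target: (1 - r) * one_hot + r / K."""
    probs = np.asarray(probs, dtype=float)
    k = probs.shape[0]
    soft = np.full(k, r / k)
    soft[target] += 1.0 - r
    return float(-(soft * np.log(probs)).sum())

def taylor_cross_entropy(probs, target, t=2):
    """Order-t Taylor expansion of -log(p_y): sum_{i=1..t} (1 - p_y)^i / i."""
    p_y = float(np.asarray(probs, dtype=float)[target])
    return sum((1.0 - p_y) ** i / i for i in range(1, t + 1))

probs = np.array([0.7, 0.2, 0.1])
print(smoothed_cross_entropy(probs, target=0, r=0.0))  # plain CE, i.e. -log(0.7)
print(taylor_cross_entropy(probs, target=0, t=1))      # 1 - p_y, an MAE-like loss
print(taylor_cross_entropy(probs, target=0, t=50))     # approaches -log(0.7)
```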
---
### 3. Class-Imbalanced Learning (`Clothing-ADC-CLT`)
A class-level long-tail version of the dataset, with imbalance ratios ρ ∈ {10, 50, 100}. Noisy samples are removed prior to constructing this benchmark using algorithmic curation (Docta + learning-centric curation), yielding ~562,263 clean images.
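
The card does not state the exact class-size profile used to build the long-tailed splits. A common construction, assumed here purely for illustration, shrinks class sizes exponentially so that the largest class has ρ times as many samples as the smallest. `long_tail_counts` and the `n_max` value are hypothetical.

```python
def long_tail_counts(n_max, num_classes, rho):
    """Per-class sample counts decaying exponentially from n_max to n_max / rho."""
    if num_classes < 2:
        return [n_max]
    return [
        round(n_max * rho ** (-i / (num_classes - 1)))
        for i in range(num_classes)
    ]

# Assumed example: 12 classes, largest class has 10,000 samples, rho = 100.
counts = long_tail_counts(n_max=10_000, num_classes=12, rho=100)
print(counts[0], counts[-1])   # 10000 100
print(counts[0] / counts[-1])  # 100.0
```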
**Metric:** δ-worst accuracy (interpolates between mean accuracy at δ=0 and worst-class accuracy at δ→∞)
| Method | ρ=10 (δ=0) | ρ=100 (δ=0) | ρ=10 (δ=∞) | ρ=100 (δ=∞) |
|---|---|---|---|---|
| Cross-Entropy | 57.80 | 30.10 | 0.96 | 0.00 |
| Focal Loss | 72.70 | 62.28 | 38.12 | 13.44 |
| LDAM | 72.50 | 63.25 | 40.90 | 15.69 |
| Balanced Softmax | **74.18** | **69.47** | 48.54 | 50.60 |
| Logit-Adjust | 74.08 | 69.44 | 47.45 | 43.26 |
| Drops | 73.66 | 67.15 | **50.85** | 32.43 |
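
The two extremes of the metric reported in the table are easy to compute from per-class accuracies: δ=0 is the mean of the per-class accuracies and δ→∞ is the accuracy of the worst class. The sketch below covers only these endpoints, not intermediate δ values, and the helper name and toy data are made up for illustration.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy computed separately for each class present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accs = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            accs.append(float((y_pred[mask] == c).mean()))
    return np.array(accs)

# Toy predictions for 3 classes (class 2 is the hard, rare class).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0, 0, 2]

accs = per_class_accuracy(y_true, y_pred, num_classes=3)
print(accs.tolist())  # [1.0, 0.75, 0.5]
print(accs.mean())    # delta = 0 endpoint: 0.75
print(accs.min())     # delta -> infinity endpoint: 0.5
```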
---
## Comparison with Existing Datasets
| Dataset | # Train/Test | # Classes | Noise Rate (%) | Has Attributes | Auto Annotation | Requires Expert? |
|---|---|---|---|---|---|---|
| iNaturalist | 579k/279k | 54k | ~0 | ✗ | ✗ | ✓ |
| WebVision | 2.4M/100k | 1000 | 20 | ✗ | ✓ | ✓ |
| ANIMAL-10N | 50k/10k | 10 | 8 | ✗ | ✗ | ✗ |
| CIFAR-10N | 50k/10k | 10 | 9–40 | ✗ | ✗ | ✗ |
| Food-101N | 75.75k/25.25k | 101 | 18.4 | ✗ | ✗ | ✓ |
| Clothing1M | 1M total | 14 | 38.5 | ✗ | ✗ | ✓ |
| **Clothing-ADC (Ours)** | **1M/20k** | **12** | **22.2–32.7** | **✓ (12k)** | **✓** | **✗** |
---
## Data Fields
Each sample contains:
- `id`: unique image identifier string
- `image`: PIL image (256×256 RGB)
- `class`: main clothing category string (e.g., `"Sweater"`)
- `color`: color attribute label (e.g., `"white"`)
- `material`: material attribute label (e.g., `"cotton"`)
- `pattern`: pattern attribute label (e.g., `"fisherman"`)
---
## Usage
```python
from datasets import load_dataset

# Load all splits (downloads the full dataset on first use)
ds = load_dataset("mikelmh025/ClothingADC")

# Access splits
train = ds["train"]
val = ds["validation"]
test = ds["test"]

# Example: iterate over the test set
for sample in test:
    image = sample["image"]        # PIL image
    category = sample["class"]     # main clothing category
    color = sample["color"]
    material = sample["material"]
    pattern = sample["pattern"]
```
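
If downloading every split up front is inconvenient, the `datasets` library also supports streaming. The sketch below assumes the split and field names listed in this card; `fine_grained_label` and `preview` are illustrative helpers, and rebuilding the label as a lowercase string is an assumption about how the query text maps onto the fields. Calling `preview()` needs network access, so the example at the bottom exercises only the pure helper.

```python
def fine_grained_label(sample):
    """Rebuild a '<color> <material> <pattern> <class>' string from one record."""
    parts = (sample["color"], sample["material"], sample["pattern"], sample["class"])
    return " ".join(str(p).strip().lower() for p in parts)

def preview(n=5, split="test"):
    """Stream the first n records of a split without downloading the whole set."""
    from datasets import load_dataset  # imported lazily; requires network access
    stream = load_dataset("mikelmh025/ClothingADC", split=split, streaming=True)
    for sample in stream.take(n):
        print(sample["id"], fine_grained_label(sample), sample["image"].size)

# Works on any dict with the documented fields, e.g. the card's own example values:
example = {"class": "Sweater", "color": "white",
           "material": "cotton", "pattern": "fisherman"}
print(fine_grained_label(example))  # white cotton fisherman sweater
```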
---
## Citation
If you use Clothing-ADC in your research, please cite:
```bibtex
@article{liu2024adc,
  title   = {Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond},
  author  = {Minghao Liu and Zonglin Di and Jiaheng Wei and Zhongruo Wang and
             Hengxiang Zhang and Ruixuan Xiao and Haoyu Wang and Jinlong Pang and
             Hao Chen and Ankit Shah and Hongxin Wei and Xinlei He and
             Zhaowei Zhu and Haobo Wang and Lei Feng and Jindong Wang and
             James Davis and Yang Liu},
  journal = {arXiv preprint arXiv:2408.11338},
  year    = {2024}
}
```
---
## License
This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). Images are collected from Google Image Search and remain subject to their original source licenses. This dataset is intended for research purposes only.
Provider:
mikelmh025



