Tellurio/LF-Amazon-131K
Hugging Face | Updated 2026-03-29 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Tellurio/LF-Amazon-131K
Resource description:
---
license: other
license_name: extreme-classification-repository
license_link: https://manikvarma.org/downloads/XC/XMLRepository.html
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- extreme-multi-label
- amazon
- product-categorization
- nlp
language:
- en
pretty_name: "LF-Amazon-131K"
size_categories:
- 100K<n<1M
---
# LF-Amazon-131K
An extreme multi-label text classification dataset from the
[Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html).
Amazon products are labeled with fine-grained category tags drawn from a
vocabulary of **131,073 labels**.
## Dataset Description
Each example is an Amazon product with a title and description, tagged with
one or more labels from a massive label space. Labels themselves have rich
text metadata (title and description), making this dataset suitable for
both traditional extreme classification and zero-/few-shot approaches.
| Property | Value |
|---|---|
| Train examples | 294,805 |
| Test examples | 134,835 |
| Total labels | 131,073 |
| Mean labels per example | ~2.3 |
| Median label frequency | 3 |
### Label distribution
The label distribution is extremely long-tailed:
| Training examples per label | # Labels | % of total |
|---|---|---|
| 1 | 25,253 | 19.3% |
| 2–5 | 67,439 | 51.5% |
| 6–10 | 26,080 | 19.9% |
| 11–50 | 11,926 | 9.1% |
| 51+ | 375 | 0.3% |
Over 70% of labels (92,692 of 131,073) have 5 or fewer training examples,
making this dataset a challenging benchmark that blurs the line between
supervised classification and zero-/few-shot retrieval.
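The bucket counts in the table above can be recomputed from the `label_ids` field. The following is a minimal sketch rather than the script used to produce the card: the function name `bucket_label_frequencies` is hypothetical, and the toy input stands in for the real training split.

```python
from collections import Counter


def bucket_label_frequencies(label_id_lists):
    """Count training examples per label, then bin labels into the card's buckets."""
    freq = Counter()
    for ids in label_id_lists:
        freq.update(set(ids))  # count each label at most once per example

    buckets = {"1": 0, "2-5": 0, "6-10": 0, "11-50": 0, "51+": 0}
    for count in freq.values():
        if count == 1:
            buckets["1"] += 1
        elif count <= 5:
            buckets["2-5"] += 1
        elif count <= 10:
            buckets["6-10"] += 1
        elif count <= 50:
            buckets["11-50"] += 1
        else:
            buckets["51+"] += 1
    return buckets


# Toy stand-in for the `label_ids` column of the training split (not real data).
toy_label_ids = [[0, 1], [1], [1, 2], [1], [1], [1]]
print(bucket_label_frequencies(toy_label_ids))
```

On the real data, the same function could be called on the `label_ids` column of the train split, assuming that column iterates as one list of integers per example.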
## Fields
| Field | Type | Description |
|---|---|---|
| `uid` | string | Amazon ASIN |
| `title` | string | Product title |
| `content` | string | Product description |
| `label_ids` | list[int] | Label indices (into the 131K label vocabulary) |
| `label_titles` | list[string] | Human-readable label titles |
| `relevance_scores` | list[float] | Relevance score per label (all 1.0 in this dataset) |
| `filter_label_ids` | list[int] | Curated/filtered label subset indices |
| `filter_label_titles` | list[string] | Human-readable filtered label titles |
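As a small illustration of how these fields fit together, the sketch below zips the per-label lists of one example into records. It assumes that `label_ids`, `label_titles`, and `relevance_scores` are parallel lists in the same order, which the table suggests but does not state outright. The helper name `label_records` and the toy example (including its ASIN) are hypothetical.

```python
def label_records(example):
    """Pair each label ID with its title and relevance score.

    Assumes the three list fields are aligned position by position.
    """
    ids = example["label_ids"]
    titles = example["label_titles"]
    scores = example["relevance_scores"]
    if not (len(ids) == len(titles) == len(scores)):
        raise ValueError("label fields have mismatched lengths")
    return [
        {"id": i, "title": t, "score": s}
        for i, t, s in zip(ids, titles, scores)
    ]


# Hypothetical example shaped like one row of the dataset (values are made up).
toy_example = {
    "uid": "B000000000",
    "title": "Toy product",
    "content": "Toy description",
    "label_ids": [7, 42],
    "label_titles": ["Toy label A", "Toy label B"],
    "relevance_scores": [1.0, 1.0],
}
for record in label_records(toy_example):
    print(record)
```

The length check is there so that a misaligned row fails loudly instead of silently pairing the wrong title with an ID.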
## Additional files
- **`label_vocab.json`** — ordered list of all 131,073 label titles (index = label ID)
- **`label_metadata.jsonl`** — full label metadata (id, uid, title, content) for
zero-shot / label-text-aware approaches
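Because `label_vocab.json` is described as an ordered list whose index is the label ID, mapping IDs back to titles is a plain list lookup. The sketch below assumes the file is a single JSON array of strings. The helper names `load_label_vocab` and `titles_for` are hypothetical, and a temporary toy file stands in for the real one so the snippet runs offline.

```python
import json
import os
import tempfile


def load_label_vocab(path):
    """Load the vocabulary, assumed to be a JSON array where index = label ID."""
    with open(path, encoding="utf-8") as f:
        vocab = json.load(f)
    if not isinstance(vocab, list):
        raise TypeError("expected a JSON array of label titles")
    return vocab


def titles_for(vocab, label_ids):
    """Map label IDs to their titles by list index."""
    return [vocab[i] for i in label_ids]


# Toy stand-in for label_vocab.json (three made-up titles, not real labels).
toy_vocab = ["Label A", "Label B", "Label C"]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "label_vocab.json")
    with open(toy_path, "w", encoding="utf-8") as f:
        json.dump(toy_vocab, f)
    vocab = load_label_vocab(toy_path)

print(titles_for(vocab, [2, 0]))
```

For the real file, the path would presumably come from `hf_hub_download` with `filename="label_vocab.json"` and `repo_type="dataset"`, in the same way the label metadata file is fetched in the Usage section.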
## Usage
```python
from datasets import load_dataset
ds = load_dataset("Tellurio/LF-Amazon-131K")
example = ds["train"][0]
print(example["title"])
print(example["label_titles"]) # e.g. ["Methodical Bible Study"]
# Label IDs are also available for efficient encoding
print(example["label_ids"]) # e.g. [4315]
```
### Loading label metadata for zero-shot approaches
Each of the 131,073 labels has rich text metadata (title and description)
stored in `label_metadata.jsonl`. This is useful for zero-shot or
label-text-aware approaches where you match product text against label text.
```python
import json

from huggingface_hub import hf_hub_download

# Download the label metadata file from the repo
path = hf_hub_download(
    repo_id="Tellurio/LF-Amazon-131K",
    filename="label_metadata.jsonl",
    repo_type="dataset",
)

# Load into a list (index = label ID)
labels = []
with open(path) as f:
    for line in f:
        labels.append(json.loads(line))

# Look up a label by ID
print(labels[4315])
# {"id": 4315, "uid": "...", "title": "How to Read the Bible as Literature", "content": "..."}
```
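To make the idea of matching product text against label text concrete, here is a deliberately simple sketch that ranks labels by Jaccard token overlap. It is a toy stand-in, not a method recommended by the dataset authors; a real system would more likely use learned embeddings. The function names and the two toy labels are hypothetical, and the records only mimic the `id`/`title`/`content` shape of `label_metadata.jsonl`.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_labels(product_text, labels, top_k=3):
    """Rank labels by Jaccard overlap between product tokens and label tokens."""
    query = tokens(product_text)
    scored = []
    for label in labels:
        label_tokens = tokens(label["title"] + " " + label.get("content", ""))
        union = query | label_tokens
        score = len(query & label_tokens) / len(union) if union else 0.0
        scored.append((score, label["id"]))
    # Highest score first; ties broken by smaller label ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(label_id, score) for score, label_id in scored[:top_k]]


# Toy label records (made up, shaped like lines of label_metadata.jsonl).
toy_labels = [
    {"id": 0, "title": "Bible Study Guide", "content": "A guide to studying the Bible"},
    {"id": 1, "title": "Garden Hose", "content": "Flexible hose for watering"},
]
ranking = rank_labels("Methodical Bible Study", toy_labels)
print(ranking[0][0])  # label 0 shares "bible" and "study" with the query
```

Scoring every label this way is linear in the size of the label set, so with 131,073 labels an index or an approximate nearest-neighbour search would likely be needed in practice.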
## Citation
If you use this dataset, please cite the Extreme Classification Repository:
```bibtex
@misc{xmlrepo,
  author = {Manik Varma},
  title = {Extreme Classification Repository},
  url = {http://manikvarma.org/downloads/XC/XMLRepository.html}
}
```
## License
This dataset is redistributed from the
[Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html).
Please refer to the original source for licensing terms.
Provider:
Tellurio



