Tellurio/LF-Amazon-131K

Name: Tellurio/LF-Amazon-131K
Creator: Tellurio
Published: 2026-03-29 18:40:45
License: No description provided

Hugging Face | Updated 2026-03-29 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Tellurio/LF-Amazon-131K


Resource description:

---
license: other
license_name: extreme-classification-repository
license_link: https://manikvarma.org/downloads/XC/XMLRepository.html
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- extreme-multi-label
- amazon
- product-categorization
- nlp
language:
- en
pretty_name: "LF-Amazon-131K"
size_categories:
- 100K<n<1M
---

# LF-Amazon-131K

An extreme multi-label text classification dataset from the [Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html). Amazon products are labeled with fine-grained category tags drawn from a vocabulary of **131,073 labels**.

## Dataset Description

Each example is an Amazon product with a title and description, tagged with one or more labels from a massive label space. Labels themselves have rich text metadata (title and description), making this dataset suitable for both traditional extreme classification and zero-/few-shot approaches.

| Property | Value |
|---|---|
| Train examples | 294,805 |
| Test examples | 134,835 |
| Total labels | 131,073 |
| Mean labels per example | ~2.3 |
| Median label frequency | 3 |

### Label distribution

The label distribution is extremely long-tailed:

| Training examples per label | # Labels | % of total |
|---|---|---|
| 1 | 25,253 | 19.3% |
| 2–5 | 67,439 | 51.5% |
| 6–10 | 26,080 | 19.9% |
| 11–50 | 11,926 | 9.1% |
| 51+ | 375 | 0.3% |

Over 70% of labels have 5 or fewer training examples, making this dataset a challenging benchmark that blurs the line between supervised classification and zero-/few-shot retrieval.

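The bucket counts in the label distribution table can in principle be recomputed from the `label_ids` field. The following is a minimal sketch, not the dataset authors' script: the helper name `label_frequency_buckets` and the toy input are assumptions made for illustration, and the bucket edges simply mirror the table.

```python
from collections import Counter

# Bucket edges mirror the label distribution table: 1, 2-5, 6-10, 11-50, 51+
BUCKETS = [("1", 1, 1), ("2-5", 2, 5), ("6-10", 6, 10), ("11-50", 11, 50), ("51+", 51, None)]


def label_frequency_buckets(label_id_lists):
    """Count how many distinct labels fall into each frequency bucket.

    `label_id_lists` is an iterable of per-example label ID lists,
    e.g. the `label_ids` column of the training split.
    """
    # Number of examples each label ID appears in
    freq = Counter(label_id for ids in label_id_lists for label_id in ids)
    counts = {name: 0 for name, _, _ in BUCKETS}
    for n in freq.values():
        for name, lo, hi in BUCKETS:
            if n >= lo and (hi is None or n <= hi):
                counts[name] += 1
                break
    return counts


# Toy input standing in for the real `label_ids` column (made-up values)
toy = [[0, 1], [0], [0, 2], [0], [0], [0, 1]]
buckets = label_frequency_buckets(toy)
print(buckets)
```

On the real data, the same helper could be called on the training split's `label_ids` column after loading the dataset. Labels that never occur in the input are not counted by this sketch.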
## Fields

| Field | Type | Description |
|---|---|---|
| `uid` | string | Amazon ASIN |
| `title` | string | Product title |
| `content` | string | Product description |
| `label_ids` | list[int] | Label indices (into the 131K label vocabulary) |
| `label_titles` | list[string] | Human-readable label titles |
| `relevance_scores` | list[float] | Relevance score per label (all 1.0 in this dataset) |
| `filter_label_ids` | list[int] | Curated/filtered label subset indices |
| `filter_label_titles` | list[string] | Human-readable filtered label titles |

## Additional files

- **`label_vocab.json`**: ordered list of all 131,073 label titles (index = label ID)
- **`label_metadata.jsonl`**: full label metadata (id, uid, title, content) for zero-shot / label-text-aware approaches

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Tellurio/LF-Amazon-131K")

example = ds["train"][0]
print(example["title"])
print(example["label_titles"])  # e.g. ["Methodical Bible Study"]

# Label IDs are also available for efficient encoding
print(example["label_ids"])  # e.g. [4315]
```

### Loading label metadata for zero-shot approaches

Each of the 131,073 labels has rich text metadata (title and description) stored in `label_metadata.jsonl`. This is useful for zero-shot or label-text-aware approaches where you match product text against label text.

```python
import json
from huggingface_hub import hf_hub_download

# Download the label metadata file from the repo
path = hf_hub_download(
    repo_id="Tellurio/LF-Amazon-131K",
    filename="label_metadata.jsonl",
    repo_type="dataset",
)

# Load into a list (index = label ID)
labels = []
with open(path) as f:
    for line in f:
        labels.append(json.loads(line))

# Look up a label by ID
print(labels[4315])
# {"id": 4315, "uid": "...", "title": "How to Read the Bible as Literature", "content": "..."}
```

## Citation

If you use this dataset, please cite the Extreme Classification Repository:

```bibtex
@misc{xmlrepo,
  author = {Manik Varma},
  title = {Extreme Classification Repository},
  url = {http://manikvarma.org/downloads/XC/XMLRepository.html}
}
```

## License

This dataset is redistributed from the [Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html). Please refer to the original source for licensing terms.
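
To make the idea of matching product text against label text (from the zero-shot section) more concrete, here is a deliberately simple sketch that ranks labels by Jaccard token overlap. This is a stand-in technique chosen for brevity, not a method described by the dataset or the Extreme Classification Repository. The helper names `tokenize` and `rank_labels` and the toy records are assumptions; only the record keys (`id`, `uid`, `title`, `content`) come from the description of `label_metadata.jsonl`.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_labels(product_text, labels, top_k=3):
    """Rank labels by Jaccard overlap between product tokens and label tokens."""
    query = tokenize(product_text)
    scored = []
    for label in labels:
        label_tokens = tokenize(label["title"] + " " + label.get("content", ""))
        union = query | label_tokens
        score = len(query & label_tokens) / len(union) if union else 0.0
        scored.append((score, label["id"]))
    # Highest score first; ties are broken by the lower label ID
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label_id for _, label_id in scored[:top_k]]


# Toy records shaped like rows of label_metadata.jsonl (made-up values)
toy_labels = [
    {"id": 0, "uid": "A", "title": "Stainless Steel Kettle", "content": "Electric kettle"},
    {"id": 1, "uid": "B", "title": "Bible Study Guide", "content": "Methods for reading scripture"},
    {"id": 2, "uid": "C", "title": "Garden Hose", "content": "Flexible hose for watering"},
]
ranked = rank_labels("A guide to methodical Bible study", toy_labels, top_k=2)
print(ranked)
```

With 131,073 labels, a practical system would more likely use dense embeddings and an approximate nearest-neighbour index; the token-overlap score here only illustrates the matching step.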

Provider:

Tellurio
