thepian/amazon-esci-data

Name: thepian/amazon-esci-data
Creator: thepian
Published: 2026-04-12 15:21:54
License: No description available

Hugging Face · Updated 2026-04-12 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/thepian/amazon-esci-data


Resource description:

---
dataset_info:
- config_name: products
  features:
  - name: product_id
    dtype: string
  - name: product_title
    dtype: string
  - name: product_description
    dtype: string
  - name: product_bullet_point
    dtype: string
  - name: product_brand
    dtype: string
  - name: product_color
    dtype: string
  - name: product_locale
    dtype: string
  - name: split
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 1650407845
    num_examples: 1371823
  - name: test
    num_bytes: 537176847
    num_examples: 443101
  download_size: 1149707182
  dataset_size: 2187584692
- config_name: queries
  features:
  - name: example_id
    dtype: int64
  - name: query
    dtype: string
  - name: query_id
    dtype: int64
  - name: product_id
    dtype: string
  - name: product_locale
    dtype: string
  - name: esci_label
    dtype: string
  - name: small_version
    dtype: int64
  - name: large_version
    dtype: int64
  - name: split
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 198670365
    num_examples: 1983272
  - name: test
    num_bytes: 63544917
    num_examples: 638016
  download_size: 63596052
  dataset_size: 262215282
- config_name: sources
  features:
  - name: query_id
    dtype: int64
  - name: source
    dtype: string
  - name: split
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 3458419
    num_examples: 99683
  - name: test
    num_bytes: 1048200
    num_examples: 30969
  download_size: 1510331
  dataset_size: 4506619
configs:
- config_name: products
  data_files:
  - split: train
    path: products/train-*
  - split: test
    path: products/test-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
  - split: test
    path: queries/test-*
- config_name: sources
  data_files:
  - split: train
    path: sources/train-*
  - split: test
    path: sources/test-*
license: apache-2.0
task_categories:
- text-classification
- token-classification
- text-generation
- sentence-similarity
language:
- en
- ja
- es
tags:
- amazon
- retrieval
- search
- ecommerce
- ranking
- reranking
size_categories:
- 1M<n<10M
---

# Amazon Shopping Queries Dataset

A dataset for improving product search, ranking, and recommendations, featuring query-product pairs with detailed relevance labels.

## Overview

The dataset contains search queries, each paired with up to 40 potentially relevant products. Every query-product pair is labeled using the ESCI system:

- **E**xact match: Products that perfectly match the customer's search intent (e.g., searching "iPhone 13" and finding "Apple iPhone 13 128GB")
- **S**ubstitute product: Alternative products that could satisfy the same need (e.g., searching "iPhone 13" and finding "iPhone 14" or "Samsung Galaxy S23")
- **C**omplement product: Products commonly bought together with the searched item (e.g., searching "iPhone 13" and finding "iPhone 13 case" or "screen protector")
- **I**rrelevant result: Products that don't match the customer's search intent (e.g., searching "iPhone 13" and finding "laptop charger")

## Dataset Statistics

### Reduced Version (Task 1)

- 48,300 unique queries
- 1,118,011 query-product pairs
- **Focus**: Filtered to exclude "easy" queries, making it more challenging
- Language distribution:
  - English (US): 29,844 queries
  - Spanish (ES): 8,049 queries
  - Japanese (JP): 10,407 queries

### Full Version (Tasks 2 & 3)

- 130,652 unique queries
- 2,621,738 query-product pairs
- **Focus**: Includes both easy and challenging queries
- Language distribution:
  - English (US): 97,345 queries
  - Spanish (ES): 15,180 queries
  - Japanese (JP): 18,127 queries

## Features

- Rich product metadata including:
  - Product title
  - Product description
  - Product bullet points
  - Brand information
  - Color information
- Multilingual support (English, Japanese, Spanish)
- Train/test splits for each task

## Download

Install the `datasets` library:

```bash
pip install datasets
```

Download the files:

```python
from datasets import load_dataset

# Because `split` is a list, each call returns a list of two datasets: [train, test].
queries = load_dataset(path="milistu/amazon-esci-data", name="queries", split=["train", "test"])
products = load_dataset(path="milistu/amazon-esci-data", name="products", split=["train", "test"])
sources = load_dataset(path="milistu/amazon-esci-data", name="sources", split=["train", "test"])
```

## Use Cases

1. **Product Ranking**: Develop algorithms to rank relevant products higher in search results
2. **Relevance Classification**: Build models to classify products as Exact, Substitute, Complement, or Irrelevant
3. **Substitute Detection**: Identify substitute products for improved product recommendations
4. **Semantic Search**: Train embedding models (like BERT, sentence-transformers) to:
   - Capture semantic similarity between queries and products
   - Handle long-tail queries with no exact keyword matches
   - Understand product relationships across categories
   - Example: Query "comfortable running shoes for marathon" can match with "Nike Air Zoom Alphafly" even without exact keyword overlap

## Citation

Originally sourced from ["Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search"](https://github.com/amazon-science/esci-data?tab=readme-ov-file), this version is optimized for machine learning applications and semantic search research.

```
@article{reddy2022shopping,
  title={Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search},
  author={Chandan K. Reddy and Lluís Màrquez and Fran Valero and Nikhil Rao and Hugo Zaragoza and Sambaran Bandyopadhyay and Arnab Biswas and Anlu Xing and Karthik Subbian},
  year={2022},
  eprint={2206.06588},
  archivePrefix={arXiv}
}
```
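For the Product Ranking use case on the reduced version, one possible way to score a ranker is NDCG over each query's judged products. The sketch below is illustrative only and makes several assumptions that this card does not state: that `esci_label` holds the single letters `E`, `S`, `C`, `I`; that `small_version == 1` marks rows belonging to the reduced version; and that the gains 1.0, 0.1, 0.01, 0.0 are a reasonable label-to-relevance mapping. The `scores` dictionary and the toy rows are hypothetical stand-ins for a model's output and for rows of the `queries` config.

```python
import math
from collections import defaultdict

# Assumed gain per ESCI label (an illustrative choice, not defined by this card).
ESCI_GAIN = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}


def dcg(gains):
    """Discounted cumulative gain of gains listed in ranked order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_labels):
    """NDCG of one query's ranked ESCI labels; 0.0 if no product has gain."""
    gains = [ESCI_GAIN[label] for label in ranked_labels]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def mean_ndcg(rows, scores):
    """Average NDCG over queries whose rows have small_version == 1.

    rows:   dicts carrying the `queries` columns used here.
    scores: hypothetical model scores keyed by example_id (higher ranks first).
    """
    by_query = defaultdict(list)
    for row in rows:
        if row["small_version"] == 1:  # assumed flag for the reduced version
            by_query[row["query_id"]].append(row)

    per_query = []
    for judged in by_query.values():
        ranked = sorted(judged, key=lambda r: scores[r["example_id"]], reverse=True)
        per_query.append(ndcg([r["esci_label"] for r in ranked]))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy rows standing in for the real `queries` config.
rows = [
    {"example_id": 0, "query_id": 1, "esci_label": "E", "small_version": 1},
    {"example_id": 1, "query_id": 1, "esci_label": "I", "small_version": 1},
    {"example_id": 2, "query_id": 2, "esci_label": "S", "small_version": 0},
]
scores = {0: 0.9, 1: 0.2, 2: 0.5}
print(mean_ndcg(rows, scores))
```

With real data, the same function could be fed rows iterated from the `queries` test split, provided the label encoding and the `small_version` flag match the assumptions above.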

Provider:

thepian
