apexlearningcurve/Amazon-Search-Benchmark

Source: Hugging Face. Updated 2024-11-02; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/apexlearningcurve/Amazon-Search-Benchmark
Resource description:

---
language:
- en
size_categories:
- 100K<n<1M
task_categories:
- sentence-similarity
tags:
- recommendation
- product search
- benchmark
- e-commerce
- amazon
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: item_id
    dtype: string
  - name: queries_old
    sequence: string
  - name: short_query
    dtype: string
  - name: long_query
    dtype: string
  - name: product_text
    dtype: string
  splits:
  - name: test
    num_bytes: 17954974
    num_examples: 20373
  download_size: 11207049
  dataset_size: 17954974
---

## 📃 Dataset Summary

This dataset is a benchmark for query-product retrieval, designed to evaluate how well algorithms retrieve relevant products for user search queries. It is built from two Hugging Face datasets:

- [Amazon-C4](https://huggingface.co/datasets/McAuley-Lab/Amazon-C4)
- [Amazon Reviews 2023](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023)

The dataset uses structured product information (title, description, category) to generate search queries and assess relevance.

**The dataset contains two main parts**:

1. **Query-Product Pairs**: This subset includes around 20,000 query-product pairs generated using an LLM, specifically the **OpenAI `GPT-4o-2024-08-06`** model. Queries were generated from product metadata, in particular the `product_text` field, which is a concatenation of the product title, description, and category. This approach gives broader applicability because title and description are almost always available, unlike user reviews. Two types of queries are included: **short_query** (1-3 words) and **long_query** (4-6 words), making the dataset suitable for testing both short- and long-form search query performance.
2. **Product Metadata**: The data is based on the existing 1 million product sample from the Amazon-C4 dataset, which was supplemented with metadata (title, description, category) from the Amazon Reviews 2023 dataset. Products that lacked both title and description were filtered out.

## 🔑 Key Features and Enhancements

- **Query Generation Approach**
  - **Product-listing-based generation**: Unlike the original query set, which relies on user reviews, this benchmark uses product title, description, and category to generate queries. This method provides greater flexibility and broader applicability, since metadata is typically available for all products, even in the absence of reviews.
  - **Search queries, not questions**: In the original set, queries were phrased as questions, which is not how customers typically search on e-commerce sites. Our queries are phrased as search queries, often consisting of keywords and features that identify a product.
  - **Long and short queries**: We provide both short (1-3 words) and long (4-5 words) queries to cover different search cases.
- **Product Text Embeddings**
  - Each product’s text (title + description + category) was embedded using the **OpenAI `text-embedding-3-small`** model, enabling embedding-based retrieval tasks.
- **L2 Similarity Search and Mapping**:
  - Using a FAISS index, we performed an L2 similarity search to retrieve the top 100 similar products for each item in the dataset. A mapping of each product to its most similar products is provided to support product recommendation and search relevance tasks.

## 🔮 Artifacts:

- A filtered, transformed 20k test set with added columns, including `product_text`, `queries_old`, `short_query`, and `long_query`.
- The full 1 million product sample with supplemented metadata.
- Embeddings for all 1 million products.
- A mapping of the top 100 similar products for each product, based on L2 similarity search.
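
The similarity mapping described under Key Features can be illustrated with a minimal sketch. It assumes embeddings are available as plain lists of floats keyed by item ID, and it uses a brute-force scan in pure Python as a stand-in for the FAISS index mentioned above. The names `l2_distance` and `top_k_similar`, the toy vectors, and the value of `k` are hypothetical and are not part of the dataset's tooling.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_similar(embeddings, k):
    """Map each item ID to its k nearest other items by L2 distance.

    `embeddings` is a hypothetical dict of item_id -> list of floats.
    This is an O(n^2) scan, only suitable for small illustrative inputs.
    """
    mapping = {}
    for item_id, vector in embeddings.items():
        candidates = (
            (l2_distance(vector, other_vector), other_id)
            for other_id, other_vector in embeddings.items()
            if other_id != item_id  # an item is not its own neighbour
        )
        # nsmallest keeps the k closest items, ordered nearest first.
        mapping[item_id] = [
            other_id for _, other_id in heapq.nsmallest(k, candidates)
        ]
    return mapping


# Toy 2-dimensional vectors; real embeddings have far more dimensions.
toy_embeddings = {
    "A": [0.0, 0.0],
    "B": [1.0, 0.0],
    "C": [0.0, 3.0],
    "D": [5.0, 5.0],
}
similar = top_k_similar(toy_embeddings, k=2)
print(similar["A"])  # prints ['B', 'C']
```

At the scale of 1 million products a quadratic scan like this would be impractical, which is presumably why an index structure was used to build the released top-100 mapping; the sketch only shows what the mapping contains.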

This dataset is a resource for testing and refining query-product retrieval algorithms, offering a benchmark for e-commerce search and recommendation systems.

## 🔧 Usage

Downloading queries:

```Python
from datasets import load_dataset
queries = load_dataset("apexlearningcurve/Amazon-Search-Benchmark", split="test")
```

Downloading 1M Data rows:

```Python
import json

from huggingface_hub import hf_hub_download

# Fetch the JSON Lines file from the dataset repo (cached locally after the first call).
filepath = hf_hub_download(
    repo_id="apexlearningcurve/Amazon-Search-Benchmark",
    filename='sampled_item_metadata_1M_filtered.jsonl',
    repo_type='dataset'
)

# Parse one JSON object per line into a list of product records.
item_pool = []
with open(filepath, 'r') as file:
    for line in file:
        item_pool.append(json.loads(line.strip()))
```

Downloading raw parquet file(s):

```Python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download a single raw parquet file into a local cache directory.
filepath = hf_hub_download(
    repo_id="apexlearningcurve/Amazon-Search-Benchmark",
    filename="raw_data/raw_meta_Appliances_c4.parquet",
    repo_type="dataset",
    cache_dir="./cache",
)
df = pd.read_parquet(path=filepath)
```

## 🧑‍💻 Authors

- [Milutin Studen](https://www.linkedin.com/in/milutin-studen/)
- [Andrej Botka](https://www.linkedin.com/in/andrej-botka-313080245/)

Provider:
apexlearningcurve