thepian/product-query-benchmark

Name: thepian/product-query-benchmark
Creator: thepian
Published: 2026-04-24 18:43:11
License: cc-by-nc-4.0

Source: Hugging Face. Updated 2026-04-24. Indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/thepian/product-query-benchmark


Description:

---
license: cc-by-nc-4.0
language:
- en
- de
- fr
- ca
- es
- nl
- uk
- pt
- it
- pl
- sv
- sr
- cs
- fi
- el
- bg
- hu
- sl
- ro
- lv
- da
- lt
- et
- eu
- sk
- ga
- hr
- is
- lb
- mt
tags:
- product-search
- search intent
- information-retrieval
- e-commerce
- ner
- relevance
- brand-recognition
version: 1.0.0
configs:
- config_name: products
  data_files: products.parquet
- config_name: examples
  data_files: examples.parquet
- config_name: brand_aliases
  data_files: brand_aliases.parquet
- config_name: brand_examples
  data_files: brand_examples.parquet
---

# Product Query Benchmark

A quality-filtered product search benchmark derived from the [Amazon ESCI dataset](https://github.com/amazon-science/esci-data), enriched with brand metadata from EUIPO (EU trademark registry) and Wikidata. In addition, ESCI queries are extended with variants that reference origins, exclusions, and certifications.

Only **Tier-1 brands** are included: brands with a confirmed EUIPO trademark registration, which gives each product a stable, verifiable brand identity (`brand_euipo_id`).

## Files

| File | Rows | Description |
|---|---|---|
| `products.parquet` | ~260k | Products with ≥1 Exact relevance judgment, with brand enrichment |
| `examples.parquet` | ~504k | Query-product relevance pairs (E/S/C/I), query text inlined |
| `brand_aliases.parquet` | ~21k | Multilingual brand name variants (30 European languages, Wikidata-sourced) |
| `brand_examples.parquet` | ~269k | Query-brand relevance pairs derived by aggregating product labels |

## Schema

### `products.parquet`

| Column | Type | Description |
|---|---|---|
| `product_id` | string | ESCI product_id (matches original Amazon ESCI dataset) |
| `product_title` | string | English product title |
| `product_description` | string | Long prose description (NULL for ~64% of products) |
| `product_bullet_point` | string | Feature bullet points (NULL for ~51% of products) |
| `product_brand` | string | Raw brand string from ESCI |
| `brand_euipo_id` | string | Stable EU trademark ID; use as brand key |
| `brand_country` | string | ISO 3166-1 alpha-2 country code (NULL for ~74% of brands) |
| `brand_sector` | string | Coarse sector derived from EUIPO Nice classes (see the Sector Distribution table) |
| `euipo_nice_classes` | string | JSON array of EUIPO Nice classification numbers, e.g. `[3, 5]` |

### `examples.parquet`

| Column | Type | Description |
|---|---|---|
| `query_id` | string | Stable MD5-derived hex ID for grouping all pairs from one query |
| `query` | string | Raw search query text |
| `product_id` | string | Joins to `products.product_id`. Either solely implied or directly referenced |
| `esci_label` | string | `Exact` / `Substitute` / `Complement` / `Irrelevant` |
| `product_brand` | string | Raw brand string in search |
| `origin` | string | Country of origin in search |
| `certification` | string | Certification referenced in search |
| `exclusions` | string array | Exclusions referenced in search |
| `split` | string | `train` / `test` (ESCI's original split) |

### `brand_aliases.parquet`

| Column | Type | Description |
|---|---|---|
| `brand_euipo_id` | string | Joins to `products.brand_euipo_id` |
| `brand_name` | string | Canonical brand name |
| `alias` | string | Alternate name (translated, abbreviated, legal variant, etc.) |
| `language` | string | ISO 639-1 language code; NULL = language-agnostic |
| `source` | string | `wikidata` / `euipo` / `manual` |

### `brand_examples.parquet`

Brand-level relevance derived by aggregating product labels. For each (query, brand) pair, `brand_label` is the highest-priority label among all products from that brand judged for that query: `Exact` > `Substitute` > `Complement` > `Irrelevant`.

| Column | Type | Description |
|---|---|---|
| `query_id` | string | Joins to `examples.query_id` |
| `query` | string | Raw search query text |
| `brand_euipo_id` | string | Joins to `products.brand_euipo_id` |
| `brand_label` | string | `Exact` / `Substitute` / `Complement` / `Irrelevant` |
| `brand_origin` | string | `eu` / `non-eu` / `unknown`; derived from `brand_country` (~26% of brands have country data) |
| `split` | string | `train` / `test` (inherited from query's examples) |

## Sector Distribution

Brand sector is derived from EUIPO Nice Classification classes; goods classes (1–34) take priority over service classes (35–45).
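
The goods-before-services rule can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: it assumes "take priority" means service classes are consulted only when a brand has no goods classes at all. The helper name `classes_for_sector` is hypothetical, and the mapping from Nice classes to sector names is not given in this card, so it is left out.

```python
import json

GOODS_CLASSES = range(1, 35)     # Nice classes 1-34 cover goods
SERVICE_CLASSES = range(35, 46)  # Nice classes 35-45 cover services


def classes_for_sector(euipo_nice_classes):
    """Return the Nice classes a sector lookup would consider (assumed rule).

    `euipo_nice_classes` is the JSON-array string stored in products.parquet,
    e.g. "[3, 5]". Goods classes win; services are a fallback only.
    """
    classes = json.loads(euipo_nice_classes) if euipo_nice_classes else []
    goods = sorted(c for c in classes if c in GOODS_CLASSES)
    services = sorted(c for c in classes if c in SERVICE_CLASSES)
    return goods if goods else services


mixed = classes_for_sector("[41, 3, 5]")       # goods present, services ignored
services_only = classes_for_sector("[35, 42]")  # no goods, fall back to services
print(mixed, services_only)
```

A brand registered in both goods and service classes is therefore categorised by its goods classes alone under this assumption.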

| Sector | Brands | Products |
|---|---|---|
| electronics | 3,138 | 85,890 |
| clothing | 1,204 | 37,992 |
| bags_luggage | 548 | 20,481 |
| home_living | 1,448 | 27,821 |
| sports_toys | 475 | 9,793 |
| personal_care | 1,817 | 46,875 |
| office_media | 656 | 15,935 |
| food | 594 | 12,076 |
| beverages | 166 | 2,747 |
| hardware | 1,527 | 36,321 |
| jewelry | 380 | 9,837 |
| medical | 326 | 7,538 |
| vehicles | 335 | 6,096 |
| other | 545 | 12,408 |
| **TOTAL** | **13,159** | **331,810** |

## Usage

```python
from datasets import load_dataset

REPO = "thepian/product-query-benchmark"

# Pin to a specific version for reproducible training runs.
# Each parquet file is its own config, so the config name must be passed;
# a config with a single data file is exposed as the "train" split.
products = load_dataset(REPO, "products", revision="v1.0.0", split="train").to_pandas()
examples = load_dataset(REPO, "examples", revision="v1.0.0", split="train").to_pandas()
brand_aliases = load_dataset(REPO, "brand_aliases", revision="v1.0.0", split="train").to_pandas()
```

Or load individual files:

```python
import pandas as pd

# hf:// paths need the huggingface_hub package installed (it provides the fsspec filesystem)
products = pd.read_parquet("hf://datasets/thepian/product-query-benchmark/products.parquet")
examples = pd.read_parquet("hf://datasets/thepian/product-query-benchmark/examples.parquet")
```

## Versioning

This dataset uses semantic versioning via git tags. Always load with `revision=` for reproducible results; the default (`HEAD`) may change between runs.

| Version | Date | Notes |
|---|---|---|
| v1.0.0 | 2026-04-24 | Initial release: 13,159 EUIPO-verified brands, US locale |

**Schema stability**: `brand_country` coverage is ~26% in v1 (Wikidata-sourced only). Product-level category inference and NACE codes are planned for v2.0.0.

## License

- **Relevance judgments and product metadata**: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) (inherited from [Amazon ESCI](https://github.com/amazon-science/esci-data))
- **Brand enrichment** (EUIPO, Wikidata): open data, compatible with CC BY-NC 4.0

Commercial use requires a separate license from Amazon for the underlying ESCI data.

## Citation

If you use this dataset, please cite the original Amazon ESCI paper:

```bibtex
@article{reddy2022shopping,
  title={Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search},
  author={Reddy, Chandan K. and M{\`a}rquez, Llu{\'i}s and Valero, Fran and Rao, Nikhil and Zaragoza, Hugo and Bandyopadhyay, Sumit and Biswas, Arnab and Xing, Anlu and Subbian, Karthik},
  journal={arXiv preprint arXiv:2206.06588},
  year={2022}
}
```
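
As a worked illustration of the `brand_examples.parquet` aggregation rule from the Schema section (`Exact` > `Substitute` > `Complement` > `Irrelevant`), the sketch below collapses product-level labels into one label per (query, brand) pair with pandas. It is an assumed reconstruction, not the code that built the file: the function name `aggregate_brand_labels` is hypothetical and the rows are toy data made up for the example.

```python
import pandas as pd

# Priority order stated in the schema, highest priority first.
LABEL_PRIORITY = ["Exact", "Substitute", "Complement", "Irrelevant"]
RANK = {label: rank for rank, label in enumerate(LABEL_PRIORITY)}
LABEL = dict(enumerate(LABEL_PRIORITY))


def aggregate_brand_labels(examples, products):
    """Keep the highest-priority ESCI label per (query_id, brand_euipo_id)."""
    joined = examples.merge(
        products[["product_id", "brand_euipo_id"]], on="product_id", how="inner"
    )
    joined["rank"] = joined["esci_label"].map(RANK)
    # The smallest rank is the highest-priority label.
    best = joined.groupby(["query_id", "brand_euipo_id"], as_index=False)["rank"].min()
    best["brand_label"] = best["rank"].map(LABEL)
    return best.drop(columns="rank")


# Toy rows for illustration only (not taken from the dataset).
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "brand_euipo_id": ["b1", "b1", "b2"],
})
examples = pd.DataFrame({
    "query_id": ["q1", "q1", "q1"],
    "product_id": ["p1", "p2", "p3"],
    "esci_label": ["Substitute", "Exact", "Irrelevant"],
})

brand_labels = aggregate_brand_labels(examples, products)
print(brand_labels)
```

Brand `b1` has one `Substitute` and one `Exact` product for query `q1`, so it receives `Exact`; brand `b2` has only an `Irrelevant` product and keeps that label.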

Provider:

thepian
