thepian/product-query-benchmark
Source: Hugging Face. Updated 2026-04-24; indexed 2026-05-03.
Download link: https://hf-mirror.com/datasets/thepian/product-query-benchmark
Resource description:
---
license: cc-by-nc-4.0
language:
- en
- de
- fr
- ca
- es
- nl
- uk
- pt
- it
- pl
- sv
- sr
- cs
- fi
- el
- bg
- hu
- sl
- ro
- lv
- da
- lt
- et
- eu
- sk
- ga
- hr
- is
- lb
- mt
tags:
- product-search
- search intent
- information-retrieval
- e-commerce
- ner
- relevance
- brand-recognition
version: 1.0.0
configs:
- config_name: products
  data_files: products.parquet
- config_name: examples
  data_files: examples.parquet
- config_name: brand_aliases
  data_files: brand_aliases.parquet
- config_name: brand_examples
  data_files: brand_examples.parquet
---
# Product Query Benchmark
A quality-filtered product search benchmark derived from the
[Amazon ESCI dataset](https://github.com/amazon-science/esci-data), enriched with brand
metadata from EUIPO (the EU trademark registry) and Wikidata. In addition, ESCI queries are extended with
variants that reference origins, exclusions, and certifications.
Only **Tier-1 brands** are included: brands that have a confirmed EUIPO trademark registration,
which gives each product a stable, verifiable brand identity (`brand_euipo_id`).
## Files
| File | Rows | Description |
|---|---|---|
| `products.parquet` | ~260k | Products with ≥1 Exact relevance judgment, with brand enrichment |
| `examples.parquet` | ~504k | Query-product relevance pairs (E/S/C/I), query text inlined |
| `brand_aliases.parquet` | ~21k | Multilingual brand name variants (30 European languages, Wikidata-sourced) |
| `brand_examples.parquet` | ~269k | Query-brand relevance pairs derived by aggregating product labels |
## Schema
### `products.parquet`
| Column | Type | Description |
|---|---|---|
| `product_id` | string | ESCI product_id (matches original Amazon ESCI dataset) |
| `product_title` | string | English product title |
| `product_description` | string | Long prose description (NULL for ~64% of products) |
| `product_bullet_point` | string | Feature bullet points (NULL for ~51% of products) |
| `product_brand` | string | Raw brand string from ESCI |
| `brand_euipo_id` | string | Stable EU trademark ID — use as brand key |
| `brand_country` | string | ISO 3166-1 alpha-2 country code (NULL for ~74% of brands) |
| `brand_sector` | string | Coarse sector derived from EUIPO Nice classes (see table below) |
| `euipo_nice_classes` | string | JSON array of EUIPO Nice classification numbers, e.g. `[3, 5]` |
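
Because `euipo_nice_classes` is stored as a JSON array inside a string column, it has to be decoded before filtering on class numbers. The snippet below is a minimal sketch on invented toy rows; the helper name `parse_nice_classes` and the choice to treat missing values as an empty list are assumptions, not part of the dataset.

```python
import json

import pandas as pd

# Toy rows standing in for products.parquet; all values are invented.
products = pd.DataFrame(
    {
        "product_id": ["P1", "P2", "P3"],
        "brand_euipo_id": ["B1", "B2", "B3"],
        "euipo_nice_classes": ["[3, 5]", "[9]", None],
    }
)


def parse_nice_classes(raw):
    """Decode the JSON array string; treat a missing value as an empty list."""
    if raw is None or pd.isna(raw):
        return []
    return json.loads(raw)


products["nice_classes"] = products["euipo_nice_classes"].map(parse_nice_classes)

# Example filter: products whose brand is registered in Nice class 3.
in_class_3 = products[products["nice_classes"].map(lambda classes: 3 in classes)]
```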
### `examples.parquet`
| Column | Type | Description |
|---|---|---|
| `query_id` | string | Stable MD5-derived hex ID for grouping all pairs from one query |
| `query` | string | Raw search query text |
| `product_id` | string | Joins to `products.product_id`. Either solely implied or directly referenced |
| `esci_label` | string | `Exact` / `Substitute` / `Complement` / `Irrelevant` |
| `product_brand` | string | Raw brand string referenced in the query |
| `origin` | string | Country of origin referenced in the query |
| `certification` | string | Certification referenced in the query |
| `exclusions` | string array | Exclusions referenced in the query |
| `split` | string | `train` / `test` (ESCI's original split) |
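
The join described in the schema can be sketched as follows: restrict `examples` to one split, attach product metadata through `product_id`, and collect the `Exact` products per `query_id`. The frames are invented toy data. Selecting only a few columns from `products` is a choice made here so that the merge does not produce suffixed duplicates of `product_brand`, which exists in both tables.

```python
import pandas as pd

# Toy stand-ins for products.parquet and examples.parquet; all values are invented.
products = pd.DataFrame(
    {
        "product_id": ["P1", "P2"],
        "product_title": ["Acme wireless mouse", "Globex wired mouse"],
        "brand_euipo_id": ["B1", "B2"],
    }
)
examples = pd.DataFrame(
    {
        "query_id": ["q1", "q1", "q2"],
        "query": ["acme mouse", "acme mouse", "wired mouse"],
        "product_id": ["P1", "P2", "P2"],
        "esci_label": ["Exact", "Substitute", "Exact"],
        "split": ["train", "train", "test"],
    }
)

# Keep the training split, then attach product metadata via product_id.
train = examples[examples["split"] == "train"]
joined = train.merge(
    products[["product_id", "product_title", "brand_euipo_id"]],
    on="product_id",
    how="left",
    validate="many_to_one",  # raises if product_id is not unique in products
)

# Titles of the Exact-labelled products for each query.
exact = joined[joined["esci_label"] == "Exact"]
exact_titles_by_query = exact.groupby("query_id")["product_title"].agg(list).to_dict()
```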
### `brand_aliases.parquet`
| Column | Type | Description |
|---|---|---|
| `brand_euipo_id` | string | Joins to `products.brand_euipo_id` |
| `brand_name` | string | Canonical brand name |
| `alias` | string | Alternate name (translated, abbreviated, legal variant, etc.) |
| `language` | string | ISO 639-1 language code; NULL = language-agnostic |
| `source` | string | `wikidata` / `euipo` / `manual` |
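
One possible use of this table is a case-insensitive lookup from any known brand name to its `brand_euipo_id`. The sketch below uses invented rows. The helpers `build_alias_index` and `resolve_brand` are hypothetical names, and whole-string matching with a first-match-wins rule is an assumption chosen to keep the example short.

```python
import pandas as pd

# Toy rows standing in for brand_aliases.parquet; all values are invented.
brand_aliases = pd.DataFrame(
    {
        "brand_euipo_id": ["B1", "B1", "B2"],
        "brand_name": ["Acme", "Acme", "Globex"],
        "alias": ["ACME Corp", "Acme S.A.", "Globex GmbH"],
        "language": [None, "es", "de"],
        "source": ["euipo", "wikidata", "wikidata"],
    }
)


def build_alias_index(aliases):
    """Map each case-folded canonical name and alias to its brand_euipo_id."""
    index = {}
    for row in aliases.itertuples(index=False):
        for name in (row.brand_name, row.alias):
            # First entry wins, so an ambiguous name keeps the first brand seen.
            index.setdefault(name.casefold(), row.brand_euipo_id)
    return index


alias_index = build_alias_index(brand_aliases)


def resolve_brand(mention):
    """Return the brand_euipo_id for a brand mention, or None if unknown."""
    return alias_index.get(mention.strip().casefold())
```

Filtering `brand_aliases` on `language` before building the index would give a per-language lookup; rows with a NULL `language` apply to every language.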
### `brand_examples.parquet`
Brand-level relevance derived by aggregating product labels. For each (query, brand) pair,
`brand_label` is the highest-priority label among all products from that brand judged for
that query: `Exact` > `Substitute` > `Complement` > `Irrelevant`.
| Column | Type | Description |
|---|---|---|
| `query_id` | string | Joins to `examples.query_id` |
| `query` | string | Raw search query text |
| `brand_euipo_id` | string | Joins to `products.brand_euipo_id` |
| `brand_label` | string | `Exact` / `Substitute` / `Complement` / `Irrelevant` |
| `brand_origin` | string | `eu` / `non-eu` / `unknown` — derived from `brand_country`; ~26% of brands have country data |
| `split` | string | `train` / `test` (inherited from query's examples) |
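
The aggregation rule above (`Exact` > `Substitute` > `Complement` > `Irrelevant`) can be sketched by ranking labels and keeping the best-ranked row per (query, brand) pair. This is a reconstruction on invented toy data under that stated rule, not the pipeline used to build the file; `LABEL_PRIORITY` is a name introduced here.

```python
import pandas as pd

# Lower rank means higher priority.
LABEL_PRIORITY = {"Exact": 0, "Substitute": 1, "Complement": 2, "Irrelevant": 3}

# Toy stand-ins for examples.parquet and products.parquet; all values are invented.
examples = pd.DataFrame(
    {
        "query_id": ["q1", "q1", "q1", "q2"],
        "product_id": ["P1", "P2", "P3", "P3"],
        "esci_label": ["Substitute", "Exact", "Irrelevant", "Complement"],
    }
)
products = pd.DataFrame(
    {
        "product_id": ["P1", "P2", "P3"],
        "brand_euipo_id": ["B1", "B1", "B2"],
    }
)

# Attach the brand key to every judged pair, then rank the labels.
pairs = examples.merge(products, on="product_id", how="inner")
pairs["rank"] = pairs["esci_label"].map(LABEL_PRIORITY)

# For each (query, brand) keep the row with the highest-priority label.
best_rows = pairs.groupby(["query_id", "brand_euipo_id"])["rank"].idxmin()
brand_examples = (
    pairs.loc[best_rows, ["query_id", "brand_euipo_id", "esci_label"]]
    .rename(columns={"esci_label": "brand_label"})
    .reset_index(drop=True)
)
```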
## Sector Distribution
Brand sector is derived from the brand's EUIPO Nice Classification classes; goods classes
(1–34) take priority over service classes (35–45).
| Sector | Brands | Products |
|---|---|---|
| electronics | 3,138 | 85,890 |
| clothing | 1,204 | 37,992 |
| bags_luggage | 548 | 20,481 |
| home_living | 1,448 | 27,821 |
| sports_toys | 475 | 9,793 |
| personal_care | 1,817 | 46,875 |
| office_media | 656 | 15,935 |
| food | 594 | 12,076 |
| beverages | 166 | 2,747 |
| hardware | 1,527 | 36,321 |
| jewelry | 380 | 9,837 |
| medical | 326 | 7,538 |
| vehicles | 335 | 6,096 |
| other | 545 | 12,408 |
| **TOTAL** | **13,159** | **331,810** |
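
The goods-before-services priority can be illustrated with a small sketch. The class-to-sector mapping below is hypothetical and partial, since the dataset's actual mapping is not listed in this card. Checking classes in ascending order within each group, and falling back to `other`, are likewise assumptions.

```python
import json

# Hypothetical, partial mapping for illustration only; not the dataset's mapping.
CLASS_TO_SECTOR = {
    3: "personal_care",
    9: "electronics",
    18: "bags_luggage",
    25: "clothing",
    43: "food",  # a service class, used only when no goods class maps
}


def pick_sector(nice_classes_json):
    """Pick a sector, consulting goods classes (1-34) before services (35-45)."""
    classes = json.loads(nice_classes_json)
    goods = sorted(c for c in classes if 1 <= c <= 34)
    services = sorted(c for c in classes if 35 <= c <= 45)
    for nice_class in goods + services:
        sector = CLASS_TO_SECTOR.get(nice_class)
        if sector is not None:
            return sector
    return "other"
```

For example, `pick_sector("[43, 9]")` resolves to `electronics` because goods class 9 is consulted before service class 43.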
## Usage
```python
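from datasets import load_dataset

REPO = "thepian/product-query-benchmark"
# Pin to a specific version for reproducible training runs.
# Each parquet file is its own config, so pass the config name explicitly;
# a config backed by a single data file is exposed as the "train" split.
products = load_dataset(REPO, "products", revision="v1.0.0", split="train").to_pandas()
examples = load_dataset(REPO, "examples", revision="v1.0.0", split="train").to_pandas()
brand_aliases = load_dataset(REPO, "brand_aliases", revision="v1.0.0", split="train").to_pandas()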
```
Or load individual files:
```python
import pandas as pd
products = pd.read_parquet("hf://datasets/thepian/product-query-benchmark/products.parquet")
examples = pd.read_parquet("hf://datasets/thepian/product-query-benchmark/examples.parquet")
```
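
The direct parquet reads above follow the default branch. A pinned variant can be sketched by placing the revision after the repository id in the `hf://` path, using the `repo_id@revision` form accepted by the `huggingface_hub` filesystem. The helper `pinned_parquet_url` is a hypothetical name, and the tag `v1.0.0` is taken from the versioning table.

```python
REPO_ID = "thepian/product-query-benchmark"


def pinned_parquet_url(filename, revision="v1.0.0"):
    """Build an hf:// path that targets one tagged revision of the dataset."""
    return f"hf://datasets/{REPO_ID}@{revision}/{filename}"


products_url = pinned_parquet_url("products.parquet")
# pd.read_parquet(products_url) would then read the file from the tagged snapshot.
```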
## Versioning
This dataset uses semantic versioning via git tags. Always load with `revision=` for
reproducible results; the default branch (`main`) may change between runs.
| Version | Date | Notes |
|---|---|---|
| v1.0.0 | 2026-04-24 | Initial release: 13,159 EUIPO-verified brands, US locale |
**Schema stability**: `brand_country` coverage is ~26% in v1 (Wikidata-sourced only).
Product-level category inference and NACE codes are planned for v2.0.0.
## License
- **Relevance judgments and product metadata**: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
  (inherited from [Amazon ESCI](https://github.com/amazon-science/esci-data))
- **Brand enrichment** (EUIPO, Wikidata): open data, compatible with CC BY-NC 4.0

Commercial use requires a separate license from Amazon for the underlying ESCI data.
## Citation
If you use this dataset, please cite the original Amazon ESCI paper:
```bibtex
@article{reddy2022shopping,
title={Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search},
author={Reddy, Chandan K. and Màrquez, Lluís and Valero, Fran and Rao, Nikhil and Zaragoza, Hugo and Bandyopadhyay, Sumit and Biswas, Arnab and Xing, Anlu and Subbian, Karthik},
journal={arXiv preprint arXiv:2206.06588},
year={2022}
}
```
Provider: thepian



