prestoai/arabic-ecom-search-bench
Hugging Face · Updated 2026-04-05 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/prestoai/arabic-ecom-search-bench
---
language:
- ar
task_categories:
- text-retrieval
tags:
- e-commerce
- arabic
- search
- retrieval
- benchmark
- libyan-dialect
- msa
- catalog-search
- ndcg
pretty_name: ArabicEcomSearchBench
size_categories:
- 10K<n<100K
---
# ArabicEcomSearchBench
<p align="center">
<img src="./Gemini_Generated_Image_svft5tsvft5tsvft.png" width="600">
</p>
**Benchmark for end-to-end Arabic e-commerce retrieval systems, covering lexical, dense, hybrid, and multi-stage retrieval pipelines.**
## Why This Benchmark?
Existing Arabic NLP benchmarks and MTEB focus heavily on **embedding-level evaluation tasks** — semantic similarity, classification, or general-purpose retrieval. These benchmarks:
- Evaluate **components in isolation** (embeddings, rerankers) rather than the full search pipeline a customer actually experiences.
- Use **generic domains** (news, Wikipedia, QA) that do not reflect the vocabulary, intent patterns, or relevance expectations of **e-commerce catalog search**.
- Offer little or no coverage of the **Arabic dialects** customers actually type in, such as Libyan: real customers in Libya, Egypt, or the Gulf don't search in formal MSA.
ArabicEcomSearchBench fills this gap by evaluating **end-to-end retrieval quality** on real e-commerce queries in **Modern Standard Arabic (MSA) and Libyan dialect**, with graded relevance judgments designed for catalog search.
### How does it compare to STS17, STS22-v2, and ArabicMTEB?
| Aspect | STS17 | STS22-v2 | ArabicMTEB | ArabicEcomSearchBench |
|--------|-------|----------|------------|-----------------------|
| **Task** | Sentence-pair similarity | Cross-lingual sentence similarity | Embedding evaluation (94 datasets, 8 task types) | End-to-end retrieval system evaluation |
| **Domain** | Generic / news (translated from English) | News headlines, captions | News, legal, medical, finance, Wikipedia, cultural | E-commerce catalog search |
| **Arabic data scale** | ~250 sentence pairs | ~250 Arabic pairs | 94 datasets, includes domain-specific retrieval | 29K queries, 107K products, 262K judgments |
| **Language variety** | MSA only (translated) | MSA only (formal news) | MSA + dialects (Egyptian, Gulf, Moroccan, Levantine) | MSA + Libyan dialect (organic queries) |
| **Relevance scheme** | Continuous similarity (0-5) | Continuous similarity (0-5) | Binary or continuous | 5-level graded relevance + 96K hard negatives |
| **What it evaluates** | Embedding meaning similarity | Cross-lingual embedding alignment | Individual components (embeddings) | Full pipeline: indexing -> retrieval -> ranking |
| **Metric focus** | Spearman / Pearson correlation | Spearman / Pearson correlation | nDCG, MAP, Recall (per task type) | nDCG, Recall, MRR, Precision, Success rate, ERR |
| **Hard negatives** | None | None | None | 96K explicitly labeled hard negatives |
| **E-commerce** | No | No | No | Yes |
**In short:**
- **STS17 / STS22-v2** tell you whether your embeddings understand that two Arabic sentences mean similar things — using a few hundred translated/formal sentence pairs from news domains.
- **ArabicMTEB** is the most comprehensive Arabic embedding benchmark — it covers dialects and multiple domains (news, legal, medical, finance) but has **no e-commerce data** and still evaluates **embeddings in isolation**, not end-to-end search systems.
- **ArabicEcomSearchBench** tells you whether your **search system** actually helps Arabic-speaking customers find the right product — in their own dialect, at e-commerce scale, with graded relevance and hard negatives that catch the mistakes that matter in catalog search.
## Dataset Overview
| Statistic | Value |
| ------------------------------ | -------------------- |
| Queries | 29,014 |
| Corpus items | 107,041 |
| Total relevance judgments | 262,599 |
| Hard negatives (score=-1) | 96,510 |
| Positive judgments (score 1-3) | 162,549 |
| Languages | MSA + Libyan dialect |
### Relevance Scale
| Score | Description |
| ----- | ---------------------------------------------------------------------------------- |
| 3 | **Fully matched** — text or semantic match to query intent |
| 2 | **Relevant** — related but not an exact match |
| 1 | **Somewhat relevant** — tangentially related |
| 0 | **Irrelevant** — no meaningful relation to the query |
| -1 | **Hard negative** — visually/textually similar but not relevant (diagnostic only) |
> Hard negatives (score -1) are **excluded** from primary metrics (nDCG, Recall, etc.) and reported separately as diagnostic metrics.
## Files
```
arabic-ecom-search-bench/
├── README.md                  # This file
├── evaluate.py                # Evaluation script (system-agnostic)
├── data/
│   ├── candidates.jsonl       # Full candidate set with relevance judgments
│   ├── queries.jsonl          # Query ID → query text
│   ├── corpus.jsonl           # Item ID → product_name_ar, category
│   ├── qrels.tsv              # TREC-format qrels
│   ├── meta.json              # Dataset statistics
│   └── convert.py             # Script used to generate data files
└── examples/
    └── meilisearch/
        └── sync_meilisearch_documents.py  # Sync corpus into Meilisearch
```
### Data Formats
**candidates.jsonl** — one JSON object per line:
```json
{
  "query_id": "2",
  "query": "كابل شحن 3 امبير",
  "candidates": [
    {"item_id": 262261, "product_name_ar": "كابل شحن 3 امبير -MOXOM", "relevance": 3, "category_id": "2", "category_name_ar": "كابل شحن"},
    {"item_id": 14, "product_name_ar": "كابل شحن ميكرو 5امبير -DBRUI", "relevance": 2, "category_id": "2", "category_name_ar": "كابل شحن"}
  ]
}
```
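As a rough illustration of how this file can be consumed, the sketch below splits the judgments into graded labels (0 to 3) and hard negatives (-1). The function name `load_judgments` and the sample item `99` are hypothetical and not part of the dataset; the in-memory sample stands in for `data/candidates.jsonl`.

```python
import json
from collections import defaultdict


def load_judgments(lines):
    """Split candidate judgments into graded qrels (0..3) and hard negatives (-1)."""
    qrels = defaultdict(dict)          # query_id -> {item_id: relevance}
    hard_negatives = defaultdict(set)  # query_id -> {item_id, ...}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        qid = record["query_id"]
        for cand in record["candidates"]:
            if cand["relevance"] == -1:
                hard_negatives[qid].add(cand["item_id"])
            else:
                qrels[qid][cand["item_id"]] = cand["relevance"]
    return dict(qrels), dict(hard_negatives)


# In practice, pass an open file instead:
#   with open("data/candidates.jsonl", encoding="utf-8") as f: load_judgments(f)
# Item 99 below is a made-up hard negative used only for illustration.
sample = [
    '{"query_id": "2", "query": "كابل شحن 3 امبير", "candidates": ['
    '{"item_id": 262261, "relevance": 3}, {"item_id": 14, "relevance": 2}, '
    '{"item_id": 99, "relevance": -1}]}'
]
qrels, hard_negatives = load_judgments(sample)
print(qrels)
print(hard_negatives)
```

Keeping the two structures apart mirrors the split between primary metrics and hard-negative diagnostics described in this README.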
**qrels.tsv** — TREC-style, compatible with [trec_eval](https://github.com/usnistgov/trec_eval) and [pytrec_eval](https://github.com/cvangysel/pytrec_eval):
```
query_id iter item_id relevance
2 0 262261 3
2 0 14 2
```
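If you prefer to read the qrels directly, a minimal loader might look like the sketch below. It assumes whitespace- or tab-separated columns and skips the header row shown above, since TREC tools normally expect data rows only. The name `load_qrels` is hypothetical, and item IDs are kept as strings here, as is usual for TREC-style files.

```python
import io


def load_qrels(handle):
    """Read TREC-style qrels into {query_id: {item_id: relevance}}."""
    qrels = {}
    for line in handle:
        parts = line.split()
        if len(parts) != 4 or parts[0] == "query_id":
            continue  # skip blank or malformed lines and the header row
        query_id, _iteration, item_id, relevance = parts
        qrels.setdefault(query_id, {})[item_id] = int(relevance)
    return qrels


# In practice: with open("data/qrels.tsv", encoding="utf-8") as f: load_qrels(f)
sample = io.StringIO("query_id\titer\titem_id\trelevance\n2\t0\t262261\t3\n2\t0\t14\t2\n")
qrels = load_qrels(sample)
print(qrels)
```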
**queries.jsonl**:
```json
{"query_id": "2", "query": "كابل شحن 3 امبير"}
```
**corpus.jsonl**:
```json
{"item_id": 262261, "product_name_ar": "كابل شحن 3 امبير -MOXOM", "category_id": "2", "category_name_ar": "كابل شحن"}
```
## Syncing the Corpus to Your Search Engine
Before you can run queries and evaluate, you need to **index the benchmark corpus** into whatever search engine or retrieval system you are testing. The corpus is provided as `data/corpus.jsonl` — each line is a JSON document:
```json
{"item_id": 262261, "product_name_ar": "كابل شحن 3 امبير -MOXOM", "category_id": "2", "category_name_ar": "كابل شحن"}
```
### General steps (any engine)
1. **Create an index/collection** in your search engine with `item_id` as the primary key.
2. **Load `data/corpus.jsonl`** — read line by line, parse JSON, and upload in batches.
3. **Mark `product_name_ar` and `category_name_ar` as searchable** — these are the fields your engine should search against.
4. **Optionally make `category_id` / `category_name_ar` filterable** — useful if your engine supports filtered search.
5. **Run your queries** from `data/queries.jsonl` against the index and collect the results.
Below is a generic Python loader you can adapt to any engine:
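```python
import json
from itertools import islice


def load_corpus(path="data/corpus.jsonl"):
    """Yield documents from the benchmark corpus."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(iterable, size):
    """Yield lists of up to `size` items (itertools.batched needs Python 3.12+)."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


# Upload to your engine (`your_engine` is a placeholder for your client)
for batch in batched(load_corpus(), size=500):
    your_engine.index(documents=batch, primary_key="item_id")
```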
### Meilisearch
A ready-to-use sync script is provided in [`examples/meilisearch/`](examples/meilisearch/sync_meilisearch_documents.py).
```bash
pip install meilisearch

# Option 1: env vars
export MEILI_URL=http://localhost:7700
export MEILI_API_KEY=your_master_key
python examples/meilisearch/sync_meilisearch_documents.py

# Option 2: explicit flags
python examples/meilisearch/sync_meilisearch_documents.py \
  --url http://localhost:7700 \
  --api-key your_master_key \
  --index arabic_ecom_bench

# Custom settings
python examples/meilisearch/sync_meilisearch_documents.py \
  --chunk-size 1000 \
  --searchable-attributes product_name_ar category_name_ar \
  --filterable-attributes category_id category_name_ar
```
The script will:
- Create the index (or skip if it already exists)
- Configure searchable and filterable attributes
- Upload all 107K documents in batches
### Elasticsearch / OpenSearch
```python
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Create index with Arabic analyzer
es.indices.create(index="arabic_ecom_bench", body={
    "settings": {"analysis": {"analyzer": {"default": {"type": "arabic"}}}},
    "mappings": {
        "properties": {
            "item_id": {"type": "keyword"},
            "product_name_ar": {"type": "text", "analyzer": "arabic"},
            "category_id": {"type": "keyword"},
            # keyword = exact match / filtering only; map as "text" to search it with the analyzer
            "category_name_ar": {"type": "keyword"},
        }
    }
})


# Bulk index
def gen_actions():
    with open("data/corpus.jsonl", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            doc = json.loads(line)
            yield {"_index": "arabic_ecom_bench", "_id": doc["item_id"], "_source": doc}


helpers.bulk(es, gen_actions(), chunk_size=500)
```
### Typesense
```python
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "your_api_key",
})

# Create collection
# Note: item_id is an integer in corpus.jsonl but declared as a string here,
# so this relies on Typesense coercing the value on import (see `dirty_values`).
client.collections.create({
    "name": "arabic_ecom_bench",
    "fields": [
        {"name": "item_id", "type": "string", "facet": False},
        {"name": "product_name_ar", "type": "string"},
        {"name": "category_id", "type": "string", "facet": True},
        {"name": "category_name_ar", "type": "string", "facet": True},
    ],
})

# Import via JSONL (Typesense supports direct JSONL import)
with open("data/corpus.jsonl", encoding="utf-8") as f:
    jsonl = f.read()

client.collections["arabic_ecom_bench"].documents.import_(jsonl, {"action": "create"})
```
### After syncing — generate results
Once your corpus is indexed, run all benchmark queries and write results:
```python
import json

# `your_engine` is a placeholder for your search client
with open("data/queries.jsonl", encoding="utf-8") as qf, \
        open("my_results.jsonl", "w", encoding="utf-8") as out:
    for line in qf:
        if not line.strip():
            continue
        q = json.loads(line)
        hits = your_engine.search(q["query"], limit=50)
        out.write(json.dumps({
            "query_id": q["query_id"],
            "retrieved": [{"item_id": h["item_id"]} for h in hits],
        }, ensure_ascii=False) + "\n")
```
Then evaluate:
```bash
python evaluate.py --run my_results.jsonl --k 10 20 50 --output report.json
```
## Evaluation
### Metrics
**Primary metrics** (computed on relevance 0..3 only):
| Metric | Description |
| ----------------------- | --------------------------------------------------------------------------------------------- |
| **nDCG@k** | Normalized Discounted Cumulative Gain — primary metric, rewards relevant items ranked higher |
| **Recall@k** | Fraction of all relevant items found in top-k |
| **MRR@k** | Mean Reciprocal Rank — rank of first relevant result |
| **Success@k** (HitRate) | Binary: did any relevant item appear in top-k? |
| **Precision@k** | Fraction of top-k items that are relevant |
| **ERR@k** | Expected Reciprocal Rank — models user stopping behavior |
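
For orientation, the primary metrics can be sketched for a single query as below. This is not the code in `evaluate.py`. It assumes an exponential gain (`2^rel - 1`) with a log2 discount, treats scores 1 to 3 as "relevant", and counts hard negatives (-1) and unjudged items as gain 0. The official script may make different choices, so use it for reported numbers. Item `35` is a hypothetical hard negative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def gain(relevance):
    # Assumption: exponential gain; hard negatives (-1) are clamped to 0.
    return 2 ** max(relevance, 0) - 1


def ndcg_at_k(ranked_ids, judgments, k):
    gains = [gain(judgments.get(item, 0)) for item in ranked_ids[:k]]
    ideal = sorted((gain(r) for r in judgments.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, judgments, k):
    relevant = {item for item, r in judgments.items() if r >= 1}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)


def mrr_at_k(ranked_ids, judgments, k):
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if judgments.get(item, 0) >= 1:
            return 1.0 / rank
    return 0.0


def err_at_k(ranked_ids, judgments, k, max_grade=3):
    # Cascade model: the user stops at each rank with probability gain / 2^max_grade.
    score, p_continue = 0.0, 1.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        p_stop = gain(judgments.get(item, 0)) / 2 ** max_grade
        score += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return score


judgments = {262261: 3, 14: 2, 35: -1}  # item 35: hypothetical hard negative
ranking = [262261, 35, 14]
for name, fn in [("nDCG", ndcg_at_k), ("Recall", recall_at_k),
                 ("MRR", mrr_at_k), ("ERR", err_at_k)]:
    print(f"{name}@10 = {fn(ranking, judgments, 10):.4f}")
```

Benchmark-level scores would then be the mean of these per-query values over all queries.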
**Hard-negative diagnostics** (score -1):
| Metric | Description |
| ------------------ | ------------------------------------------- |
| **HardNegative@k** | Count of hard negatives in top-k |
| **HN-rate@k** | Fraction of top-k that are hard negatives |
| **HN-first-rank** | First rank at which a hard negative appears |
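
The hard-negative diagnostics can be sketched per query in the same spirit. The function name is hypothetical, and two choices here are assumptions rather than documented behavior of `evaluate.py`: the rate is divided by the number of results actually returned within the top-k, and the first rank is only searched within the top-k.

```python
def hard_negative_diagnostics(ranked_ids, hard_negative_ids, k):
    """Count, rate, and first rank of hard negatives in the top-k of one query."""
    top_k = ranked_ids[:k]
    hit_ranks = [rank for rank, item in enumerate(top_k, start=1)
                 if item in hard_negative_ids]
    return {
        "hn_count": len(hit_ranks),
        # Assumption: divide by the results actually returned, not by k.
        "hn_rate": len(hit_ranks) / len(top_k) if top_k else 0.0,
        # None when no hard negative was surfaced for this query.
        "hn_first_rank": hit_ranks[0] if hit_ranks else None,
    }


# Item 35 is a hypothetical hard negative for illustration.
diagnostics = hard_negative_diagnostics([262261, 35, 14], {35}, k=10)
print(diagnostics)
```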
### Running the Evaluation
**Step 1:** Generate results from your search system in JSONL format:
```json
{"query_id": "2", "retrieved": [{"item_id": 262261}, {"item_id": 35}, {"item_id": 14}]}
{"query_id": "3", "retrieved": [{"item_id": 100}, {"item_id": 200}]}
```
Each line must have `query_id` and `retrieved` (ordered list of results, best first). Each entry in `retrieved` needs at minimum an `item_id`.
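Before running the evaluation, it can help to sanity-check the results file against the format above. The checker below is a small sketch, not part of the benchmark tooling: `validate_run_line` is a hypothetical name, and the duplicate-ID check is an extra precaution rather than a documented requirement.

```python
import json


def validate_run_line(line):
    """Return a list of problems found in one line of a results file."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = []
    if "query_id" not in record:
        problems.append("missing query_id")
    retrieved = record.get("retrieved")
    if not isinstance(retrieved, list):
        problems.append("retrieved must be a list")
        return problems

    seen = set()
    for pos, entry in enumerate(retrieved, start=1):
        if not isinstance(entry, dict) or "item_id" not in entry:
            problems.append(f"entry {pos} has no item_id")
            continue
        if entry["item_id"] in seen:
            problems.append(f"duplicate item_id {entry['item_id']} at position {pos}")
        seen.add(entry["item_id"])
    return problems


good = '{"query_id": "2", "retrieved": [{"item_id": 262261}, {"item_id": 35}, {"item_id": 14}]}'
bad = '{"retrieved": [{"id": 1}]}'
print(validate_run_line(good))
print(validate_run_line(bad))
```

An empty list means the line matches the expected shape; anything else names what to fix.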
**Step 2:** Run evaluation:
```bash
# Basic evaluation
python evaluate.py --run my_results.jsonl
# Custom k values + JSON report
python evaluate.py --run my_results.jsonl --k 10 20 50 --output report.json
# Include per-query breakdown
python evaluate.py --run my_results.jsonl --k 10 20 --output report.json --per-query
```
### Example: Adapting for Your Search Engine
```python
import json

from your_search_client import SearchClient  # placeholder for your own client

client = SearchClient(...)

# Load queries
queries = []
with open("data/queries.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            queries.append(json.loads(line))

# Run searches and collect results
with open("my_results.jsonl", "w", encoding="utf-8") as out:
    for q in queries:
        hits = client.search(q["query"], limit=50)
        result = {
            "query_id": q["query_id"],
            # Adjust "id" to whichever key your client uses for the corpus item_id
            "retrieved": [{"item_id": hit["id"]} for hit in hits],
        }
        out.write(json.dumps(result, ensure_ascii=False) + "\n")
```
Then evaluate:
```bash
python evaluate.py --run my_results.jsonl --k 10 20 50 --output report.json
```
## Intended Use
This benchmark evaluates **end-to-end, customer-facing search and retrieval systems** for Arabic e-commerce — regardless of the underlying technology (lexical, dense, hybrid, multi-stage, or any combination). It doesn't matter how your system retrieves and ranks results; what matters is the **final ranked list the customer sees**.
## Baseline Results
### Primary Metrics
| Metric | @10 | @20 | @50 |
|--------|-----|-----|-----|
| **nDCG** | 0.6241 | 0.6415 | 0.6504 |
| **Recall** | 0.4825 | 0.5367 | 0.5747 |
| **MRR** | 0.7563 | 0.7575 | 0.7577 |
| **Success (HitRate)** | 0.8537 | 0.8691 | 0.8776 |
| **Precision** | 0.3304 | 0.2441 | 0.1962 |
| **ERR** | 0.6552 | 0.6566 | 0.6570 |
### Hard-Negative Diagnostics
| Metric | @10 | @20 | @50 |
|--------|-----|-----|-----|
| **HN count** | 0.85 | 1.09 | 1.26 |
| **HN rate** | 9.4% | 7.3% | 5.7% |
- **HN first rank (mean):** 8.1 (across 13,881 queries that surfaced at least one hard negative)
## Reporting Results (System Card)
Since this benchmark evaluates **end-to-end systems** — not isolated components — results can change with any configuration update, version upgrade, or pipeline change. To make results reproducible and comparable, we recommend including a **system card** alongside your results.
### System card format
Include a `system_card.json` alongside your results file. All fields are **optional** — share as much or as little as you want:
#### Example
```json
{
  "system_name": "Name of the search system, or a codename (e.g. 'Elasticsearch', 'ProjectAlpha-v2')",
  "system_version": "1.12.0",
  "retrieval_method": "Hybrid (BM25 + semantic)",
  "query_preprocessing": "Default Arabic tokenizer, no custom stemmer, dialect normalization via synonym list",
  "ranking_rules": "words, typo, proximity, attribute, sort, exactness",
  "results_limit_per_query": 50,
  "embedding_model": "Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2",
  "reranker": null,
  "notes": "Any additional context about the setup"
}
```
**For proprietary or closed-source systems:** you can use a codename with a version (e.g. `"system_name": "InternalSearch"`, `"system_version": "v3.2"`) instead of disclosing the actual system. Describe *what* the system does at a high level, not *how* — e.g. "proprietary hybrid retrieval with Arabic language support" is a valid `retrieval_method`.
## Limitations
- **Product catalog:** Based on a single e-commerce platform's catalog; category distribution may not generalize to all Arabic markets.
- **Dialect coverage:** Currently covers MSA + Libyan dialect. Egyptian, Gulf, Levantine, and other Maghrebi dialects are planned for future versions.
- **Relevance judgments:** Generated via a combination of heuristic and LLM-based labeling, then partially verified by humans, so some judgments, especially borderline cases, may be noisy.
## Citation
If you use this benchmark, please cite:
```bibtex
@misc{arabicecomsearchbench2025,
  title={ArabicEcomSearchBench: A Benchmark for End-to-End Arabic E-Commerce Retrieval},
  author={Mohamed Okasha and AbuBaker Naji and Talal Badi},
  year={2025},
  url={https://huggingface.co/datasets/presto-ai/ArabicEcomSearchBench}
}
```
## License
The benchmark data and evaluation code are released for research and evaluation purposes.
Provider: prestoai



