prestoai/arabic-ecom-search-bench

Name: prestoai/arabic-ecom-search-bench
Creator: prestoai
Published: 2026-04-05 14:31:08
License: No description available

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/prestoai/arabic-ecom-search-bench

Description:

---
language:
- ar
task_categories:
- text-retrieval
tags:
- e-commerce
- arabic
- search
- retrieval
- benchmark
- libyan-dialect
- msa
- catalog-search
- ndcg
pretty_name: ArabicEcomSearchBench
size_categories:
- 10K<n<100K
---

# ArabicEcomSearchBench

<p align="center">
  <img src="./Gemini_Generated_Image_svft5tsvft5tsvft.png" width="600">
</p>

**Benchmark for end-to-end Arabic e-commerce retrieval systems, covering lexical, dense, hybrid, and multi-stage retrieval pipelines.**

## Why This Benchmark?

Existing Arabic NLP benchmarks and MTEB focus heavily on **embedding-level evaluation tasks**: semantic similarity, classification, or general-purpose retrieval. These benchmarks:

- Evaluate **components in isolation** (embeddings, rerankers) rather than the full search pipeline a customer actually experiences.
- Use **generic domains** (news, Wikipedia, QA) that do not reflect the vocabulary, intent patterns, or relevance expectations of **e-commerce catalog search**.
- Often lack coverage of the **Arabic dialects** customers actually type in: real customers in Libya, Egypt, or the Gulf don't search in formal MSA.

ArabicEcomSearchBench fills this gap by evaluating **end-to-end retrieval quality** on real e-commerce queries in **Modern Standard Arabic (MSA) and Libyan dialect**, with graded relevance judgments designed for catalog search.

### How does it compare to STS17, STS22-v2, and MTEB?

| Aspect | STS17 | STS22-v2 | ArabicMTEB | ArabicEcomSearchBench |
|--------|-------|----------|------------|-----------------------|
| **Task** | Sentence-pair similarity | Cross-lingual sentence similarity | Embedding evaluation (94 datasets, 8 task types) | End-to-end retrieval system evaluation |
| **Domain** | Generic / news (translated from English) | News headlines, captions | News, legal, medical, finance, Wikipedia, cultural | E-commerce catalog search |
| **Arabic data scale** | ~250 sentence pairs | ~250 Arabic pairs | 94 datasets, includes domain-specific retrieval | 29K queries, 107K products, 262K judgments |
| **Language variety** | MSA only (translated) | MSA only (formal news) | MSA + dialects (Egyptian, Gulf, Moroccan, Levantine) | MSA + Libyan dialect (organic queries) |
| **Relevance scheme** | Continuous similarity (0-5) | Continuous similarity (0-5) | Binary or continuous | 5-level graded relevance + 96K hard negatives |
| **What it evaluates** | Embedding meaning similarity | Cross-lingual embedding alignment | Individual components (embeddings) | Full pipeline: indexing -> retrieval -> ranking |
| **Metric focus** | Spearman / Pearson correlation | Spearman / Pearson correlation | nDCG, MAP, Recall (per task type) | nDCG, Recall, MRR, Precision, Success rate, ERR |
| **Hard negatives** | None | None | None | 96K explicitly labeled hard negatives |
| **E-commerce** | No | No | No | Yes |

**In short:**

- **STS17 / STS22-v2** tell you whether your embeddings understand that two Arabic sentences mean similar things, using a few hundred translated or formal sentence pairs from news domains.
- **ArabicMTEB** is the most comprehensive Arabic embedding benchmark. It covers dialects and multiple domains (news, legal, medical, finance) but has **no e-commerce data** and still evaluates **embeddings in isolation**, not end-to-end search systems.
- **ArabicEcomSearchBench** tells you whether your **search system** actually helps Arabic-speaking customers find the right product: in their own dialect, at e-commerce scale, with graded relevance and hard negatives that catch the mistakes that matter in catalog search.

## Dataset Overview

| Statistic                      | Value                |
| ------------------------------ | -------------------- |
| Queries                        | 29,014               |
| Corpus items                   | 107,041              |
| Total relevance judgments      | 262,599              |
| Hard negatives (score=-1)      | 96,510               |
| Positive judgments (score 1-3) | 162,549              |
| Languages                      | MSA + Libyan dialect |

### Relevance Scale

| Score | Description                                                                        |
| ----- | ---------------------------------------------------------------------------------- |
| 3     | **Fully matched**: text or semantic match to query intent                          |
| 2     | **Relevant**: related but not an exact match                                       |
| 1     | **Somewhat relevant**: tangentially related                                        |
| 0     | **Irrelevant**: no meaningful relation to the query                                |
| -1    | **Hard negative**: visually/textually similar but not relevant (diagnostic only)   |

> Hard negatives (score -1) are **excluded** from primary metrics (nDCG, Recall, etc.) and reported separately as diagnostic metrics.
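
To make the split between primary metrics and hard-negative diagnostics concrete, here is a minimal sketch of how the relevance scale above could feed an nDCG@k computation. It is not taken from `evaluate.py`; it assumes linear gains (the score itself), treats hard negatives and unjudged items as gain 0 in nDCG, and counts hard negatives separately. The function names and the sample judgments are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, judgments, k):
    """nDCG@k with linear gains (assumption: gain == relevance score).

    Hard negatives (-1) and unjudged items contribute a gain of 0,
    so they never raise or lower the ideal ranking.
    """
    gains = [max(judgments.get(item_id, 0), 0) for item_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted((g for g in judgments.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def hard_negatives_at_k(ranked_ids, judgments, k):
    """Diagnostic: number of hard negatives (score -1) in the top k."""
    return sum(1 for item_id in ranked_ids[:k] if judgments.get(item_id) == -1)


# Hypothetical judgments for one query: item_id -> relevance score.
example_judgments = {101: 3, 102: 2, 103: -1}
example_ranking = [101, 103, 102]

print("nDCG@3:", round(ndcg_at_k(example_ranking, example_judgments, 3), 4))
print("HardNegative@3:", hard_negatives_at_k(example_ranking, example_judgments, 3))
```

Keeping the two functions separate mirrors the reporting split described above: a system can score well on nDCG while still surfacing hard negatives, and the diagnostic makes that visible.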
## Files

```
arabic-ecom-search-bench/
├── README.md                  # This file
├── evaluate.py                # Evaluation script (system-agnostic)
├── data/
│   ├── candidates.jsonl       # Full candidate set with relevance judgments
│   ├── queries.jsonl          # Query ID → query text
│   ├── corpus.jsonl           # Item ID → product_name_ar, category
│   ├── qrels.tsv              # TREC-format qrels
│   ├── meta.json              # Dataset statistics
│   └── convert.py             # Script used to generate data files
└── examples/
    └── meilisearch/
        └── sync_meilisearch_documents.py  # Sync corpus into Meilisearch
```

### Data Formats

**candidates.jsonl**: one JSON object per line (pretty-printed here for readability):

```json
{
  "query_id": "2",
  "query": "كابل شحن 3 امبير",
  "candidates": [
    {"item_id": 262261, "product_name_ar": "كابل شحن 3 امبير -MOXOM", "relevance": 3, "category_id": "2", "category_name_ar": "كابل شحن"},
    {"item_id": 14, "product_name_ar": "كابل شحن ميكرو 5امبير -DBRUI", "relevance": 2, "category_id": "2", "category_name_ar": "كابل شحن"}
  ]
}
```

**qrels.tsv**: TREC-style, compatible with [trec_eval](https://github.com/usnistgov/trec_eval) and [pytrec_eval](https://github.com/cvangysel/pytrec_eval):

```
query_id  iter  item_id  relevance
2         0     262261   3
2         0     14       2
```

**queries.jsonl**:

```json
{"query_id": "2", "query": "كابل شحن 3 امبير"}
```

**corpus.jsonl**:

```json
{"item_id": 262261, "product_name_ar": "كابل شحن 3 امبير -MOXOM", "category_id": "2", "category_name_ar": "كابل شحن"}
```

## Syncing the Corpus to Your Search Engine

Before you can run queries and evaluate, you need to **index the benchmark corpus** into whatever search engine or retrieval system you are testing.

The corpus is provided as `data/corpus.jsonl`. Each line is a JSON document:

```json
{"item_id": 262261, "product_name_ar": "كابل شحن 3 امبير -MOXOM", "category_id": "2", "category_name_ar": "كابل شحن"}
```

### General steps (any engine)

1. **Create an index/collection** in your search engine with `item_id` as the primary key.
2. **Load `data/corpus.jsonl`**: read line by line, parse JSON, and upload in batches.
3. **Mark `product_name_ar` and `category_name_ar` as searchable**: these are the fields your engine should search against.
4. **Optionally make `category_id` / `category_name_ar` filterable**: useful if your engine supports filtered search.
5. **Run your queries** from `data/queries.jsonl` against the index and collect the results.

Below is a generic Python loader you can adapt to any engine:

```python
import json
from itertools import islice


def load_corpus(path="data/corpus.jsonl"):
    """Yield documents from the benchmark corpus."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(iterable, size):
    """Yield lists of up to `size` items (itertools.batched needs Python 3.12+)."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


# Upload to your engine (`your_engine` is a placeholder for your own client)
for batch in batched(load_corpus(), size=500):
    your_engine.index(documents=batch, primary_key="item_id")
```

### Meilisearch

A ready-to-use sync script is provided in [`examples/meilisearch/`](examples/meilisearch/sync_meilisearch_documents.py).

```bash
pip install meilisearch

# Option 1: env vars
export MEILI_URL=http://localhost:7700
export MEILI_API_KEY=your_master_key
python examples/meilisearch/sync_meilisearch_documents.py

# Option 2: explicit flags
python examples/meilisearch/sync_meilisearch_documents.py \
  --url http://localhost:7700 \
  --api-key your_master_key \
  --index arabic_ecom_bench

# Custom settings
python examples/meilisearch/sync_meilisearch_documents.py \
  --chunk-size 1000 \
  --searchable-attributes product_name_ar category_name_ar \
  --filterable-attributes category_id category_name_ar
```

The script will:

- Create the index (or skip if it already exists)
- Configure searchable and filterable attributes
- Upload all 107K documents in batches

### Elasticsearch / OpenSearch

```python
from elasticsearch import Elasticsearch, helpers
import json

es = Elasticsearch("http://localhost:9200")

# Create index with Arabic analyzer.
# elasticsearch-py 8.x also accepts settings= and mappings= keyword
# arguments in place of body=.
es.indices.create(index="arabic_ecom_bench", body={
    "settings": {"analysis": {"analyzer": {"default": {"type": "arabic"}}}},
    "mappings": {
        "properties": {
            "item_id": {"type": "keyword"},
            "product_name_ar": {"type": "text", "analyzer": "arabic"},
            "category_id": {"type": "keyword"},
            "category_name_ar": {"type": "keyword"},
        }
    }
})


# Bulk index
def gen_actions():
    with open("data/corpus.jsonl", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            yield {"_index": "arabic_ecom_bench", "_id": doc["item_id"], "_source": doc}


helpers.bulk(es, gen_actions(), chunk_size=500)
```

### Typesense

```python
import typesense
import json

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "your_api_key",
})

# Create collection
client.collections.create({
    "name": "arabic_ecom_bench",
    "fields": [
        # item_id is an integer in corpus.jsonl, so it is declared as int64
        {"name": "item_id", "type": "int64", "facet": False},
        {"name": "product_name_ar", "type": "string"},
        {"name": "category_id", "type": "string", "facet": True},
        {"name": "category_name_ar", "type": "string", "facet": True},
    ],
})

# Import via JSONL (Typesense supports direct JSONL import)
with open("data/corpus.jsonl", encoding="utf-8") as f:
    jsonl = f.read()

client.collections["arabic_ecom_bench"].documents.import_(jsonl, {"action": "create"})
```

### After syncing: generate results

Once your corpus is indexed, run all benchmark queries and write results:

```python
import json

# `your_engine` is a placeholder for your own search client
with open("data/queries.jsonl", encoding="utf-8") as qf, \
        open("my_results.jsonl", "w", encoding="utf-8") as out:
    for line in qf:
        q = json.loads(line)
        hits = your_engine.search(q["query"], limit=50)
        out.write(json.dumps({
            "query_id": q["query_id"],
            "retrieved": [{"item_id": h["item_id"]} for h in hits],
        }, ensure_ascii=False) + "\n")
```

Then evaluate:

```bash
python evaluate.py --run my_results.jsonl --k 10 20 50 --output report.json
```

## Evaluation

### Metrics

**Primary metrics** (computed on relevance 0..3 only):

| Metric                  | Description                                                                                    |
| ----------------------- | ---------------------------------------------------------------------------------------------- |
| **nDCG@k**              | Normalized Discounted Cumulative Gain: the primary metric, rewards relevant items ranked higher |
| **Recall@k**            | Fraction of all relevant items found in top-k                                                  |
| **MRR@k**               | Mean Reciprocal Rank: reciprocal rank of the first relevant result, averaged over queries      |
| **Success@k** (HitRate) | Binary: did any relevant item appear in top-k?                                                 |
| **Precision@k**         | Fraction of top-k items that are relevant                                                      |
| **ERR@k**               | Expected Reciprocal Rank: models user stopping behavior                                        |

**Hard-negative diagnostics** (score -1):

| Metric             | Description                                 |
| ------------------ | ------------------------------------------- |
| **HardNegative@k** | Count of hard negatives in top-k            |
| **HN-rate@k**      | Fraction of top-k that are hard negatives   |
| **HN-first-rank**  | First rank at which a hard negative appears |

### Running the Evaluation

**Step 1:** Generate results from your search system in JSONL format:

```json
{"query_id": "2", "retrieved": [{"item_id": 262261}, {"item_id": 35}, {"item_id": 14}]}
{"query_id": "3", "retrieved": [{"item_id": 100}, {"item_id": 200}]}
```

Each line must have `query_id` and `retrieved` (an ordered list of results, best first). Each entry in `retrieved` needs at minimum an `item_id`.

**Step 2:** Run evaluation:

```bash
# Basic evaluation
python evaluate.py --run my_results.jsonl

# Custom k values + JSON report
python evaluate.py --run my_results.jsonl --k 10 20 50 --output report.json

# Include per-query breakdown
python evaluate.py --run my_results.jsonl --k 10 20 --output report.json --per-query
```

### Example: Adapting for Your Search Engine

```python
import json
from your_search_client import SearchClient  # placeholder for your own client

client = SearchClient(...)

# Load queries
queries = []
with open("data/queries.jsonl", encoding="utf-8") as f:
    for line in f:
        queries.append(json.loads(line))

# Run searches and collect results
with open("my_results.jsonl", "w", encoding="utf-8") as out:
    for q in queries:
        hits = client.search(q["query"], limit=50)
        result = {
            "query_id": q["query_id"],
            # Map your engine's document ID field (here "id") back to item_id
            "retrieved": [{"item_id": hit["id"]} for hit in hits],
        }
        out.write(json.dumps(result, ensure_ascii=False) + "\n")
```

Then evaluate:

```bash
python evaluate.py --run my_results.jsonl --k 10 20 50 --output report.json
```

## Intended Use

This benchmark evaluates **end-to-end, customer-facing search and retrieval systems** for Arabic e-commerce, regardless of the underlying technology (lexical, dense, hybrid, multi-stage, or any combination). It doesn't matter how your system retrieves and ranks results; what matters is the **final ranked list the customer sees**.

## Baseline Results

### Primary Metrics

| Metric | @10 | @20 | @50 |
|--------|-----|-----|-----|
| **nDCG** | 0.6241 | 0.6415 | 0.6504 |
| **Recall** | 0.4825 | 0.5367 | 0.5747 |
| **MRR** | 0.7563 | 0.7575 | 0.7577 |
| **Success (HitRate)** | 0.8537 | 0.8691 | 0.8776 |
| **Precision** | 0.3304 | 0.2441 | 0.1962 |
| **ERR** | 0.6552 | 0.6566 | 0.6570 |

### Hard-Negative Diagnostics

| Metric | @10 | @20 | @50 |
|--------|-----|-----|-----|
| **HN count** | 0.85 | 1.09 | 1.26 |
| **HN rate** | 9.4% | 7.3% | 5.7% |

- **HN first rank (mean):** 8.1 (across 13,881 queries that surfaced at least one hard negative)

## Reporting Results (System Card)

Since this benchmark evaluates **end-to-end systems**, not isolated components, results can change with any configuration update, version upgrade, or pipeline change. To make results reproducible and comparable, we recommend including a **system card** alongside your results.

### System card format

Include a `system_card.json` alongside your results file. All fields are **optional**; share as much or as little as you want:

#### Example

```json
{
  "system_name": "Name of the search system, or a codename (e.g. 'Elasticsearch', 'ProjectAlpha-v2')",
  "system_version": "1.12.0",
  "retrieval_method": "Hybrid (BM25 + semantic)",
  "query_preprocessing": "Default Arabic tokenizer, no custom stemmer, dialect normalization via synonym list",
  "ranking_rules": "words, typo, proximity, attribute, sort, exactness",
  "results_limit_per_query": 50,
  "embedding_model": "Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2",
  "reranker": null,
  "notes": "Any additional context about the setup"
}
```

**For proprietary or closed-source systems:** you can use a codename with a version (e.g. `"system_name": "InternalSearch"`, `"system_version": "v3.2"`) instead of disclosing the actual system. Describe *what* the system does at a high level, not *how*; for example, "proprietary hybrid retrieval with Arabic language support" is a valid `retrieval_method`.

## Limitations

- **Product catalog:** Based on a single e-commerce platform's catalog; category distribution may not generalize to all Arabic markets.
- **Dialect coverage:** Currently covers MSA + Libyan dialect. Egyptian, Gulf, Levantine, and Maghreb dialects are planned for future versions.
- **Relevance judgments:** Generated via a combination of heuristic and LLM-based labeling, then partially verified by humans. Some edge cases may be mislabeled.

## Citation

If you use this benchmark, please cite:

```bibtex
@misc{arabicecomsearchbench2025,
  title={ArabicEcomSearchBench: A Benchmark for End-to-End Arabic E-Commerce Retrieval},
  author={Mohamed Okasha and AbuBaker Naji and Talal Badi},
  year={2025},
  url={https://huggingface.co/datasets/presto-ai/ArabicEcomSearchBench}
}
```

## License

The benchmark data and evaluation code are released for research and evaluation purposes.
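
As a closing aid tied to the results format described under Running the Evaluation, the sketch below checks a results file before it is passed to `evaluate.py`. It only encodes the two stated requirements (each line has `query_id` and a `retrieved` list, and each entry has an `item_id`); the function name `check_results_lines` is hypothetical and the check is not part of the benchmark's own tooling.

```python
import json


def check_results_lines(lines):
    """Return (line_number, problem) pairs for malformed result records.

    `lines` is any iterable of strings, for example an open file handle.
    Blank lines are skipped.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        if "query_id" not in record:
            problems.append((number, "missing query_id"))
        retrieved = record.get("retrieved")
        if not isinstance(retrieved, list):
            problems.append((number, "retrieved is missing or not a list"))
        elif any(not isinstance(e, dict) or "item_id" not in e for e in retrieved):
            problems.append((number, "an entry in retrieved lacks item_id"))
    return problems


# Example with in-memory lines; the second record has no `retrieved` list.
sample = [
    '{"query_id": "2", "retrieved": [{"item_id": 262261}, {"item_id": 14}]}',
    '{"query_id": "3"}',
]
for number, problem in check_results_lines(sample):
    print(f"line {number}: {problem}")
```

To check a real run, pass an open file instead of a list, for instance `check_results_lines(open("my_results.jsonl", encoding="utf-8"))`.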

Provider:

prestoai
