Haakkim/Haakkim-1.0v

Name: Haakkim/Haakkim-1.0v
Creator: Haakkim
Published: 2026-04-09 05:15:23
License: 暂无描述

Hugging Face2026-04-09 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/Haakkim/Haakkim-1.0v

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-4.0 task_categories: - text-generation - question-answering language: - ar tags: - arabic - llm-evaluation - human-preference - arena - bradley-terry - dialects - leaderboard - pairwise-comparison size_categories: - 1K<n<10K pretty_name: Haakkim Arabic LLM Arena Battles v1.0 configs: - config_name: default data_files: - split: train path: data/haakkim_full_battles.parquet dataset_info: features: - name: id dtype: string - name: model_a dtype: string - name: model_b dtype: string - name: winner dtype: string - name: evaluation_session_id dtype: string - name: evaluation_order dtype: int64 - name: conversation_a dtype: string - name: conversation_b dtype: string - name: full_conversation dtype: string - name: conv_metadata dtype: string - name: timestamp dtype: string splits: - name: train num_examples: 1274 --- # Haakkim — Arabic LLM Arena Battles v1.0 **Haakkim** (حَكِّم, "judge") is an open arena-style human preference evaluation platform for Arabic large language models. This dataset contains all battle records collected during the first release snapshot, including voted comparisons and skipped battles across 11 Arabic dialect varieties. > 🌐 **Platform:** [https://haakkim.tech](https://haakkim.tech) > 🏆 **Live Leaderboard:** [https://haakkim.tech/#leaderboard](https://haakkim.tech/#leaderboard) --- ## Dataset Summary | Statistic | Value | |---|---| | **Total battle records** | 1,273 | | — Voted comparisons | 1,130 | | — Skipped (no vote cast) | 143 | | **Ranked-eligible battles (BT scoring)** | 831 | | **Models evaluated** | 67 (ranked MSA) | | **Dialects covered** | 11 | | **Evaluation modes** | Ranked Arena, Side-by-Side, 10 Questions | | **Scoring method** | Bradley–Terry with IPW & bootstrap CIs | | **Release date** | April 2026 | --- ## Vote Outcome Distribution (Voted Battles) | Outcome | Count | Share | |---|---|---| | Model A wins | 360 | 31.9% | | Model B wins | 364 | 32.2% | | Tie | 342 | 30.3% | | Both Bad | 64 | 5.7% | | **Total voted** | **1,130** | | --- ## Dialect Coverage | Dialect | Battles | Share | |---|---|---| | Modern Standard Arabic (MSA) | 986 | 77.5% | | Tunisian | 115 | 9.0% | | Saudi | 83 | 6.5% | | Egyptian | 45 | 3.5% | | Levantine | 22 | 1.7% | | Sudanese | 12 | 0.9% | | Omani | 5 | 0.4% | | Iraqi | 2 | 0.2% | | Moroccan | 1 | <0.1% | | Libyan | 1 | <0.1% | | Algerian | 1 | <0.1% | --- ## Evaluation Mode Breakdown | Mode | Battles | Used for Ranking | |---|---|---| | Ranked Arena (`arena_random`) | 952 | ✅ Yes (BT scoring) | | Side-by-Side (`side_by_side`) | 292 | ❌ Win-rate only | | 10 Questions (`ten_questions`) | 29 | ❌ Win-rate only | --- ## MSA Official Leaderboard (Top 10) Scores are Bradley–Terry scores centered at 1000. A 1-point gap corresponds to win odds of e¹ ≈ 2.7:1. The full 67-model leaderboard is available at [haakkim.tech/leaderboard](https://haakkim.tech/#leaderboard). | Rank | Model | BT Score | 95% CI | Battles | |---|---|---|---|---| | 1 | mistralai/ministral-3b-2512 | 1001.75 | [1001.20, 1002.93] | 40 | | 2 | mistralai/ministral-8b-2512 | 1001.61 | [1000.72, 1002.97] | 43 | | 3 | Qwen/Qwen3-235B-A22B-Thinking-2507 | 1001.21 | [1000.47, 1002.00] | 38 | | 4 | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1001.14 | [999.96, 1002.83] | 31 | | 5 | deepseek/deepseek-v3.2-exp | 1001.13 | [1000.27, 1002.16] | 38 | | 6 | deepseek/deepseek-v3.1 | 1000.99 | [999.81, 1002.07] | 29 | | 7 | Qwen/Qwen3-235B-A22B-Instruct-2507 | 1000.98 | [1000.12, 1002.08] | 39 | | 8 | deepseek/deepseek-r1-0528 | 1000.93 | [1000.10, 1002.14] | 38 | | 9 | openai/gpt-oss-120b | 1000.93 | [1000.04, 1002.58] | 25 | | 10 | deepseek/deepseek-v3.2 | 1000.89 | [999.86, 1002.25] | 31 | > **Note on score scale:** Haakkim uses unscaled log-odds units (centered at 1000), not Elo-style scaling. This produces a ~4-point spread across 67 models. Chatbot Arena-style scores (scaled by 400/ln10 ≈ 173.7) would show the same ranking with hundreds-of-points spreads. Both conventions encode identical win probabilities. --- ## Data Schema Each row is a JSON object with the following fields: ```json { "id": "<UUID>", "model_a": "<model name>", "model_b": "<model name>", "winner": "model_a" | "model_b" | "tie" | "both_bad" | "skipped", "evaluation_session_id": "<UUID>", "evaluation_order": <int>, "conversation_a": [{"role": "...", "content": "..."}, ...], "conversation_b": [{"role": "...", "content": "..."}, ...], "full_conversation": [{"role": "...", "content": "..."}, ...], "conv_metadata": { "source": "arena_random" | "side_by_side" | "ten_questions", "dialect": "msa" | "saudi" | "egyptian" | "levantine" | ..., "is_ranked_arena": true | false, "sampling_weight": <float>, "category_tag": {"domain": "...", "tags": [...], "difficulty": "..."}, "input_tokens_a": <int>, "output_tokens_a": <int>, "input_tokens_b": <int>, "output_tokens_b": <int> }, "timestamp": "<ISO 8601>" } ``` ### Field Notes - **model_a / model_b**: Reflect the exact presentation order the evaluator saw (A/B swap already applied). Model identities are not anonymised. - **winner**: `"skipped"` means the user did not cast a vote. These rows are included for transparency but are **excluded from BT scoring**. - **conversation_a / conversation_b**: Turn-by-turn transcripts (system prompt, user message, assistant response) for each model side. - **full_conversation**: Interleaved transcript combining both sides as the evaluator saw them. - **conv_metadata.sampling_weight**: Inverse-probability weight used for BT estimation (clamped to [P1, P99]). - **conv_metadata.category_tag**: Automated domain/difficulty annotation (does not affect scoring). --- ## Preprocessing & Scoring Pipeline Before BT scoring, all battles pass through a fixed preprocessing pipeline: | Step | Drop Reason | Dropped | Remaining | |---|---|---|---| | Raw joined battles | — | — | 1,133 | | Structural validation | Invalid winner, missing model/prompt/response | 37 | 1,096 | | Source filter | Non-rankable source (Side-by-Side, 10Q) | 265 | 831 | | Exact deduplication | Duplicate battle IDs | 0 | 831 | | Repeat cap | >1 vote per (session, prompt, model pair) | 0 | 831 | | Failed comparisons | Empty or failed model generation | 0 | 831 | | **Ranked-eligible** | | | **831** | **Effective Sample Size (ESS):** 465 (clamped) — indicates limited variance inflation despite exposure imbalance. **Graph:** 67 nodes, 774 edges, density 0.35, fully connected (1 component). --- ## Privacy & PII All text fields are scrubbed before release. Masked patterns include: email addresses, phone numbers, national IDs, passport numbers, credit card numbers, dates of birth, usernames, and IBANs. Rows containing high-risk PII are dropped entirely. The export never includes device identifiers, IP addresses, or browser fingerprints. --- ## Team This dataset was created by researchers at the **College of Computing, Umm Al-Qura University**, Mecca, Saudi Arabia. | Name | Role | HuggingFace | |---|---|---| | **Mourad Mars** | Principal Investigator | [@mouradmars](https://huggingface.co/mouradmars) | | **Hassan Barmandah** | AI Researcher | [@HassanB4](https://huggingface.co/HassanB4) | | **Abdulrhman Alassaf** | Software Engineer | — | --- ## Citation If you use this dataset in your research, please cite: ```bibtex @misc{mars2026haakkim, title = {Haakkim: An Arena-Style Human Preference Evaluation Platform for Arabic {LLMs}}, author = {Mars, Mourad and Barmandah, Hassan and Alassaf, Abdulrhman}, year = {2026}, howpublished = {\url{https://huggingface.co/datasets/Haakkim/Haakkim-1.0v}}, note = {College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia} } ``` --- ## License This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). Model responses remain subject to the terms of the respective model providers. --- ## Contact - Platform: [https://haakkim.tech](https://haakkim.tech) - Institution: College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia

提供机构：

Haakkim

5,000+

优质数据集

54 个

任务类型

进入经典数据集