Haakkim/Haakkim-1.0v
收藏Hugging Face2026-04-09 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/Haakkim/Haakkim-1.0v
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- ar
tags:
- arabic
- llm-evaluation
- human-preference
- arena
- bradley-terry
- dialects
- leaderboard
- pairwise-comparison
size_categories:
- 1K<n<10K
pretty_name: Haakkim Arabic LLM Arena Battles v1.0
configs:
- config_name: default
data_files:
- split: train
path: data/haakkim_full_battles.parquet
dataset_info:
features:
- name: id
dtype: string
- name: model_a
dtype: string
- name: model_b
dtype: string
- name: winner
dtype: string
- name: evaluation_session_id
dtype: string
- name: evaluation_order
dtype: int64
- name: conversation_a
dtype: string
- name: conversation_b
dtype: string
- name: full_conversation
dtype: string
- name: conv_metadata
dtype: string
- name: timestamp
dtype: string
splits:
- name: train
num_examples: 1274
---
# Haakkim — Arabic LLM Arena Battles v1.0
**Haakkim** (حَكِّم, "judge") is an open arena-style human preference evaluation platform for Arabic large language models. This dataset contains all battle records collected during the first release snapshot, including voted comparisons and skipped battles across 11 Arabic dialect varieties.
> 🌐 **Platform:** [https://haakkim.tech](https://haakkim.tech)
> 🏆 **Live Leaderboard:** [https://haakkim.tech/#leaderboard](https://haakkim.tech/#leaderboard)
---
## Dataset Summary
| Statistic | Value |
|---|---|
| **Total battle records** | 1,273 |
| — Voted comparisons | 1,130 |
| — Skipped (no vote cast) | 143 |
| **Ranked-eligible battles (BT scoring)** | 831 |
| **Models evaluated** | 67 (ranked MSA) |
| **Dialects covered** | 11 |
| **Evaluation modes** | Ranked Arena, Side-by-Side, 10 Questions |
| **Scoring method** | Bradley–Terry with IPW & bootstrap CIs |
| **Release date** | April 2026 |
---
## Vote Outcome Distribution (Voted Battles)
| Outcome | Count | Share |
|---|---|---|
| Model A wins | 360 | 31.9% |
| Model B wins | 364 | 32.2% |
| Tie | 342 | 30.3% |
| Both Bad | 64 | 5.7% |
| **Total voted** | **1,130** | |
---
## Dialect Coverage
| Dialect | Battles | Share |
|---|---|---|
| Modern Standard Arabic (MSA) | 986 | 77.5% |
| Tunisian | 115 | 9.0% |
| Saudi | 83 | 6.5% |
| Egyptian | 45 | 3.5% |
| Levantine | 22 | 1.7% |
| Sudanese | 12 | 0.9% |
| Omani | 5 | 0.4% |
| Iraqi | 2 | 0.2% |
| Moroccan | 1 | <0.1% |
| Libyan | 1 | <0.1% |
| Algerian | 1 | <0.1% |
---
## Evaluation Mode Breakdown
| Mode | Battles | Used for Ranking |
|---|---|---|
| Ranked Arena (`arena_random`) | 952 | ✅ Yes (BT scoring) |
| Side-by-Side (`side_by_side`) | 292 | ❌ Win-rate only |
| 10 Questions (`ten_questions`) | 29 | ❌ Win-rate only |
---
## MSA Official Leaderboard (Top 10)
Scores are Bradley–Terry scores centered at 1000. A 1-point gap corresponds to win odds of e¹ ≈ 2.7:1. The full 67-model leaderboard is available at [haakkim.tech/leaderboard](https://haakkim.tech/#leaderboard).
| Rank | Model | BT Score | 95% CI | Battles |
|---|---|---|---|---|
| 1 | mistralai/ministral-3b-2512 | 1001.75 | [1001.20, 1002.93] | 40 |
| 2 | mistralai/ministral-8b-2512 | 1001.61 | [1000.72, 1002.97] | 43 |
| 3 | Qwen/Qwen3-235B-A22B-Thinking-2507 | 1001.21 | [1000.47, 1002.00] | 38 |
| 4 | Qwen/Qwen3-30B-A3B-Instruct-2507 | 1001.14 | [999.96, 1002.83] | 31 |
| 5 | deepseek/deepseek-v3.2-exp | 1001.13 | [1000.27, 1002.16] | 38 |
| 6 | deepseek/deepseek-v3.1 | 1000.99 | [999.81, 1002.07] | 29 |
| 7 | Qwen/Qwen3-235B-A22B-Instruct-2507 | 1000.98 | [1000.12, 1002.08] | 39 |
| 8 | deepseek/deepseek-r1-0528 | 1000.93 | [1000.10, 1002.14] | 38 |
| 9 | openai/gpt-oss-120b | 1000.93 | [1000.04, 1002.58] | 25 |
| 10 | deepseek/deepseek-v3.2 | 1000.89 | [999.86, 1002.25] | 31 |
> **Note on score scale:** Haakkim uses unscaled log-odds units (centered at 1000), not Elo-style scaling. This produces a ~4-point spread across 67 models. Chatbot Arena-style scores (scaled by 400/ln10 ≈ 173.7) would show the same ranking with hundreds-of-points spreads. Both conventions encode identical win probabilities.
---
## Data Schema
Each row is a JSON object with the following fields:
```json
{
"id": "<UUID>",
"model_a": "<model name>",
"model_b": "<model name>",
"winner": "model_a" | "model_b" | "tie" | "both_bad" | "skipped",
"evaluation_session_id": "<UUID>",
"evaluation_order": <int>,
"conversation_a": [{"role": "...", "content": "..."}, ...],
"conversation_b": [{"role": "...", "content": "..."}, ...],
"full_conversation": [{"role": "...", "content": "..."}, ...],
"conv_metadata": {
"source": "arena_random" | "side_by_side" | "ten_questions",
"dialect": "msa" | "saudi" | "egyptian" | "levantine" | ...,
"is_ranked_arena": true | false,
"sampling_weight": <float>,
"category_tag": {"domain": "...", "tags": [...], "difficulty": "..."},
"input_tokens_a": <int>,
"output_tokens_a": <int>,
"input_tokens_b": <int>,
"output_tokens_b": <int>
},
"timestamp": "<ISO 8601>"
}
```
### Field Notes
- **model_a / model_b**: Reflect the exact presentation order the evaluator saw (A/B swap already applied). Model identities are not anonymised.
- **winner**: `"skipped"` means the user did not cast a vote. These rows are included for transparency but are **excluded from BT scoring**.
- **conversation_a / conversation_b**: Turn-by-turn transcripts (system prompt, user message, assistant response) for each model side.
- **full_conversation**: Interleaved transcript combining both sides as the evaluator saw them.
- **conv_metadata.sampling_weight**: Inverse-probability weight used for BT estimation (clamped to [P1, P99]).
- **conv_metadata.category_tag**: Automated domain/difficulty annotation (does not affect scoring).
---
## Preprocessing & Scoring Pipeline
Before BT scoring, all battles pass through a fixed preprocessing pipeline:
| Step | Drop Reason | Dropped | Remaining |
|---|---|---|---|
| Raw joined battles | — | — | 1,133 |
| Structural validation | Invalid winner, missing model/prompt/response | 37 | 1,096 |
| Source filter | Non-rankable source (Side-by-Side, 10Q) | 265 | 831 |
| Exact deduplication | Duplicate battle IDs | 0 | 831 |
| Repeat cap | >1 vote per (session, prompt, model pair) | 0 | 831 |
| Failed comparisons | Empty or failed model generation | 0 | 831 |
| **Ranked-eligible** | | | **831** |
**Effective Sample Size (ESS):** 465 (clamped) — indicates limited variance inflation despite exposure imbalance.
**Graph:** 67 nodes, 774 edges, density 0.35, fully connected (1 component).
---
## Privacy & PII
All text fields are scrubbed before release. Masked patterns include: email addresses, phone numbers, national IDs, passport numbers, credit card numbers, dates of birth, usernames, and IBANs. Rows containing high-risk PII are dropped entirely. The export never includes device identifiers, IP addresses, or browser fingerprints.
---
## Team
This dataset was created by researchers at the **College of Computing, Umm Al-Qura University**, Mecca, Saudi Arabia.
| Name | Role | HuggingFace |
|---|---|---|
| **Mourad Mars** | Principal Investigator | [@mouradmars](https://huggingface.co/mouradmars) |
| **Hassan Barmandah** | AI Researcher | [@HassanB4](https://huggingface.co/HassanB4) |
| **Abdulrhman Alassaf** | Software Engineer | — |
---
## Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{mars2026haakkim,
title = {Haakkim: An Arena-Style Human Preference Evaluation Platform for Arabic {LLMs}},
author = {Mars, Mourad and Barmandah, Hassan and Alassaf, Abdulrhman},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/Haakkim/Haakkim-1.0v}},
note = {College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia}
}
```
---
## License
This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). Model responses remain subject to the terms of the respective model providers.
---
## Contact
- Platform: [https://haakkim.tech](https://haakkim.tech)
- Institution: College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia
提供机构:
Haakkim



