razsarusi/open-llm-leaderboard-eda
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/razsarusi/open-llm-leaderboard-eda
Resource description:
---
license: mit
---
# 🤖 Open LLM Leaderboard – Exploratory Data Analysis
## Dataset Overview
**Source:** [HuggingFace – open-llm-leaderboard/contents](https://huggingface.co/datasets/open-llm-leaderboard/contents)

**Size:** 4,575 rows × 35 columns

**What is this dataset?**
Each row represents one open-source Large Language Model (LLM) evaluated on six standardized benchmarks. The scores below are means across all models in the dataset.

| Benchmark | What it tests | Mean score |
|-----------|--------------|---------------|
| IFEval | Ability to follow instructions | 45.6 |
| BBH | Complex reasoning | 27.6 |
| MMLU-PRO | Professional general knowledge | 25.4 |
| MATH Lvl 5 | Advanced mathematics | 15.5 |
| MUSR | Multi-step reasoning | 10.0 |
| GPQA | Graduate-level science questions | 6.7 |

**Target Variable:** `Average` – each model's overall average score across all 6 benchmarks
---
## Main Question
> "What factors predict an LLM's overall benchmark performance?"
---
## Data Cleaning
| Step | Action | Result |
|------|--------|--------|
| 1 | Removed Flagged models | 4,576 -> 4,575 rows |
| 2 | Checked for duplicate rows | 0 duplicates found |
| 3 | Replaced #Params = -1 (sentinel for unknown size) with NaN | 3 hidden missing values exposed |
| 4 | Filled empty Hub License with "Unknown" | 1,752 values filled |
| 5 | Dropped Model column (HTML) | 36 -> 35 columns |
| 6 | Cleaned Type column (emojis + mapping) | 7 clean categories |
| 7 | Converted 6 boolean columns to 0/1 | Ready for analysis |
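
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`Flagged`, `#Params`, `Hub License`, `Model`, `Type`) are assumed from the table and may be spelled differently in the raw data, and the mapping of `Type` to 7 categories is not given here, so step 6 only strips emojis.

```python
import pandas as pd


def clean_leaderboard(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the table to a raw leaderboard frame.

    Column names are assumptions taken from the table, not verified
    against the parquet file.
    """
    # Steps 1-2: drop flagged models, then exact duplicate rows.
    out = df.loc[~df["Flagged"].astype(bool)].drop_duplicates().copy()

    # Step 3: -1 is a sentinel for "unknown size"; make it a real missing value.
    out["#Params"] = out["#Params"].mask(out["#Params"] == -1)

    # Step 4: make empty or missing licenses explicit.
    lic = out["Hub License"]
    out["Hub License"] = lic.where(lic.notna() & (lic != ""), "Unknown")

    # Step 5: the Model column holds HTML markup and is not needed for analysis.
    out = out.drop(columns=["Model"])

    # Step 6 (partial): strip emojis by discarding non-ASCII characters.
    # The final category mapping is not specified here, so it is omitted.
    out["Type"] = (
        out["Type"].str.encode("ascii", "ignore").str.decode("ascii").str.strip()
    )

    # Step 7: encode boolean columns as 0/1 integers.
    bool_cols = out.select_dtypes(include="bool").columns
    out = out.astype({col: int for col in bool_cols})

    return out.reset_index(drop=True)
```

Assuming the file listed under Files, a call might look like `clean_leaderboard(pd.read_parquet("llm_data.parquet"))`, although that file may already be the cleaned version.
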
---
## Outlier Detection
| Column | Outlier | Model | Decision |
|--------|---------|-------|----------|
| Hub likes | 6,093 likes | meta-llama/Meta-Llama-3-8B | KEEP |
| CO2 cost | 186.61 kg | alpindale/WizardLM-2-8x22B | KEEP |
| #Params | 140.63B | mistral-community/mixtral-8x22B-v0.3 | KEEP |

All three outliers correspond to real, legitimate models, so none were removed.

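
The detection method is not stated in this section. One common choice is the interquartile-range (IQR) rule, sketched below as an assumption rather than as the notebook's actual approach; the function name `iqr_outliers` and the multiplier `k` are illustrative.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[is_outlier]
```

Applied to a column such as `Hub likes` (name assumed from the table), `iqr_outliers(df["Hub likes"]).idxmax()` would give the row label of the most extreme value, which can then be inspected by hand before deciding whether to keep it.
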
---
## Research Questions & Findings
### Q1: Does model size predict performance?
Larger models do NOT consistently outperform smaller ones.
Small models (0-10B) can achieve scores as high as 40-50.
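
One way to check this claim is to bin models by parameter count and compare score statistics per bin. The sketch below assumes columns named `#Params` (in billions) and `Average`; only the 0-10B bin comes from the text, and the other bin edges are hypothetical.

```python
import pandas as pd


def score_by_size(df: pd.DataFrame, edges=(0, 10, 40, 150)) -> pd.DataFrame:
    """Summarize the Average score within parameter-count bins (in billions)."""
    # Models with unknown size cannot be binned.
    sized = df.dropna(subset=["#Params"])
    bins = pd.cut(sized["#Params"], bins=list(edges), include_lowest=True)
    # observed=True drops bins that contain no models.
    return sized.groupby(bins, observed=True)["Average"].agg(
        ["count", "median", "max"]
    )
```

A high `max` in the smallest bin alongside a lower `median` would be consistent with the finding that some small models score well even though most do not.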
### Q2: Which model type performs best?
- multimodal: highest median (~27)
- merged: high median (~25) but high variance
- chat: consistent performance (~23)
- pretrained: lowest median (~8)
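
The per-type comparison above can be reproduced with a grouped summary along these lines. This is a sketch assuming the cleaned `Type` column and the `Average` target; using the IQR as the measure of spread is an illustrative choice, not necessarily what the notebook used.

```python
import pandas as pd


def summarize_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Median and spread of the Average score for each model type."""
    grouped = df.groupby("Type")["Average"]
    summary = pd.DataFrame({
        "n": grouped.size(),
        "median": grouped.median(),
        # Interquartile range as a robust measure of within-type variance.
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
    })
    return summary.sort_values("median", ascending=False)
```

Including the group size `n` helps when reading the result, since a high median for a type with only a handful of models is weaker evidence than the same median over hundreds.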
### Q3: Which benchmark is the hardest?
By mean score, GPQA (6.7) is the hardest benchmark and IFEval (45.6) is the easiest.
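
Treating a lower mean score as "harder", as this section does, the ranking can be sketched as below. The benchmark column names are assumed to match the labels in the overview table.

```python
import pandas as pd

# Assumed column names, copied from the benchmark table.
BENCHMARKS = ["IFEval", "BBH", "MMLU-PRO", "MATH Lvl 5", "MUSR", "GPQA"]


def rank_benchmarks(df: pd.DataFrame, columns=BENCHMARKS) -> pd.Series:
    """Mean score per benchmark, sorted from hardest (lowest) to easiest."""
    return df[list(columns)].mean().sort_values()
```

The first entry of the returned series is the hardest benchmark under this definition, and the last is the easiest.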
### Q4: Is there a CO2 vs performance tradeoff?
No clear relationship. Efficient small models can outperform expensive large ones.
### Q5: What is the correlation between variables?
BBH and MMLU-PRO are highly correlated (0.96).
Model size has only moderate correlation with performance (0.43).
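
Pairwise correlations like these can be listed with a small helper. The sketch below uses pandas' default Pearson correlation, which is an assumption since the coefficient type is not stated here; the function name `top_correlations` is illustrative.

```python
from itertools import combinations

import pandas as pd


def top_correlations(df: pd.DataFrame, columns, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    columns = list(columns)
    corr = df[columns].corr()  # Pearson by default
    # Visit each unordered pair once, skipping self-correlations.
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(columns, 2)}
    ranked = pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)
    return ranked.head(n)
```

For example, `top_correlations(df, ["Average", "#Params", "BBH", "MMLU-PRO"])` would be expected to put the BBH and MMLU-PRO pair near the top if the column names match those assumed here.
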
---
## Key Conclusions
1. Model size alone is NOT a reliable predictor of performance (correlation of only 0.43)
2. Training type matters more than size
3. GPQA is the hardest benchmark (mean = 6.7)
4. Popularity does not equal performance
5. Benchmark scores are highly interconnected
---
## Files
- `Copy_of_Assignment_1_EDA_&_Dataset_raz_sarusi.ipynb` – Full EDA notebook
- `llm_data.parquet` – The dataset
Provided by:
razsarusi



