
razsarusi/open-llm-leaderboard-eda

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/razsarusi/open-llm-leaderboard-eda
Resource description:
---
license: mit
---

# 🤖 Open LLM Leaderboard – Exploratory Data Analysis

<!-- VIDEO WILL GO HERE AFTER RECORDING -->

## Dataset Overview

**Source:** [HuggingFace – open-llm-leaderboard/contents](https://huggingface.co/datasets/open-llm-leaderboard/contents)

**Size:** 4,575 rows × 35 columns

**What is this dataset?** Each row represents one open-source Large Language Model (LLM) evaluated on 6 standardized benchmarks.

| Benchmark | What it tests | Average Score |
|-----------|--------------|---------------|
| IFEval | Ability to follow instructions | 45.6 |
| BBH | Complex reasoning | 27.6 |
| MMLU-PRO | Professional general knowledge | 25.4 |
| MATH Lvl 5 | Advanced mathematics | 15.5 |
| MUSR | Multi-step reasoning | 10.0 |
| GPQA | Graduate-level science questions | 6.7 |

**Target Variable:** `Average` – overall average score across all 6 benchmarks

---

## Main Question

> "What factors predict an LLM's overall benchmark performance?"

---

## Data Cleaning

| Step | Action | Result |
|------|--------|--------|
| 1 | Removed Flagged models | 4,576 -> 4,575 rows |
| 2 | Checked for duplicate rows | 0 duplicates found |
| 3 | Replaced #Params = -1 with NaN | 3 hidden values fixed |
| 4 | Filled empty Hub License with "Unknown" | 1,752 values filled |
| 5 | Dropped Model column (HTML) | 36 -> 35 columns |
| 6 | Cleaned Type column (emojis + mapping) | 7 clean categories |
| 7 | Converted 6 boolean columns to 0/1 | Ready for analysis |

---

## Outlier Detection

| Column | Outlier | Model | Decision |
|--------|---------|-------|----------|
| Hub likes | 6,093 likes | meta-llama/Meta-Llama-3-8B | KEEP |
| CO2 cost | 186.61 kg | alpindale/WizardLM-2-8x22B | KEEP |
| #Params | 140.63B | mistral-community/mixtral-8x22B-v0.3 | KEEP |

All outliers represent real and legitimate models.

---

## Research Questions & Findings

### Q1: Does model size predict performance?

Larger models do NOT consistently outperform smaller ones. Small models (0-10B) can achieve scores as high as 40-50.

### Q2: Which model type performs best?

- multimodal: highest median (~27)
- merged: high median (~25) but high variance
- chat: consistent performance (~23)
- pretrained: lowest median (~8)

### Q3: Which benchmark is the hardest?

GPQA (6.7) is the hardest. IFEval (45.6) is the easiest.

### Q4: Is there a CO2 vs performance tradeoff?

No clear relationship. Efficient small models can outperform expensive large ones.

### Q5: What is the correlation between variables?

BBH and MMLU-PRO are highly correlated (0.96). Model size has only moderate correlation with performance (0.43).

---

## Key Conclusions

1. Model size alone does NOT predict performance
2. Training type matters more than size
3. GPQA is the hardest benchmark (mean = 6.7)
4. Popularity does not equal performance
5. Benchmark scores are highly interconnected

---

## Files

- `Copy_of_Assignment_1_EDA_&_Dataset_raz_sarusi.ipynb` – Full EDA notebook
- `llm_data.parquet` – The dataset
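
---

## Example: Cleaning Steps in pandas

The Data Cleaning table can be sketched in pandas roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the column names `Flagged`, `#Params`, `Hub License`, and `Model` are taken from the wording of the table and may differ from the real column labels, the helper `clean_leaderboard` is hypothetical, and the toy DataFrame stands in for `llm_data.parquet`. Step 6 (cleaning the Type column) is omitted because the emoji-to-category mapping is not given here.

```python
import numpy as np
import pandas as pd


def clean_leaderboard(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of cleaning steps 1-5 and 7 (column names are assumptions)."""
    # Steps 1-2: drop flagged models, then drop exact duplicate rows.
    out = df.loc[~df["Flagged"]].drop_duplicates().copy()
    # Step 3: -1 is treated as a placeholder for an unknown parameter count.
    out["#Params"] = out["#Params"].replace(-1, np.nan)
    # Step 4: label missing licenses explicitly.
    out["Hub License"] = out["Hub License"].fillna("Unknown")
    # Step 5: drop the HTML-formatted model column.
    out = out.drop(columns=["Model"])
    # Step 7: convert every boolean column to 0/1.
    for col in out.select_dtypes(include="bool").columns:
        out[col] = out[col].astype(int)
    return out.reset_index(drop=True)


# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Model": ["<a>m1</a>", "<a>m2</a>", "<a>m3</a>"],
    "#Params": [8.0, -1.0, 70.0],
    "Hub License": ["mit", None, "apache-2.0"],
    "Flagged": [False, False, True],
    "Average": [25.0, 10.0, 40.0],
})
cleaned = clean_leaderboard(toy)
print(cleaned)
```

On the real file, the same function could be applied to the result of `pd.read_parquet("llm_data.parquet")`, after adjusting the column names to match the actual schema.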
Provider:
razsarusi