enescingoz/humaneval-apple-silicon
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/enescingoz/humaneval-apple-silicon
Resource description:
---
language: en
license: mit
tags:
- benchmark
- apple-silicon
- coding
- llm-evaluation
- macos
- humaneval
task_categories:
- text-generation
size_categories:
- n<1K
---
# Mac Coding Bench Results v1 — Speed + Code Quality Benchmarks on Apple Silicon
Speed and code quality benchmarks for quantized LLMs running locally on Apple Silicon Macs. The dataset pairs inference speed measurements (tokens/sec) across three hardware configurations (M1, M2 Max, M5), 123 benchmark results in total, with HumanEval+ functional correctness scores for 21 models.
## Key Highlights
- **Qwen 3.6 35B-A3B** achieves 89.6% HumanEval+ pass@1 at 16.7 tok/s: the best quality score in the dataset, and fast thanks to its MoE architecture
- **Qwen 2.5 Coder 7B** hits 84.2% at 11.3 tok/s: the best quality-to-speed trade-off for a dense model
- **Phi 4 Mini 3.8B** reaches 70.7% at 19.6 tok/s: strong for its size
- **Gemma 4 family** scores 9-31% on HumanEval+, well below the Gemma 3 family (34-79%)
- 21 models evaluated for code quality, 123 total speed benchmark results across 3 chips
- All models quantized to Q4_K_M (GGUF) or 4-bit (MLX)
## Dataset Description
Each row represents one model benchmarked on one hardware configuration. Speed metrics come from `llama-bench` (GGUF) or `mlx_lm.benchmark` (MLX). Code quality metrics come from EvalPlus HumanEval+ evaluation (164 problems). Quality scores are available for 21 models on the M5 configuration; the remaining 102 rows have speed-only data across M1, M2 Max, and M5.
## Data Fields
| Field | Type | Description |
|---|---|---|
| `model_name` | string | Human-readable model name |
| `model_id` | string | Slug identifier for the model |
| `params` | string | Parameter count (e.g., "7B", "35B") |
| `quant` | string | Quantization method (Q4_K_M or 4bit) |
| `runtime` | string | Inference runtime (llama.cpp or mlx-lm) |
| `chip` | string | Apple Silicon chip (M1, M2 Max, M5) |
| `cpu_cores` | int | Number of CPU cores |
| `gpu_cores` | int | Number of GPU cores |
| `ram_gb` | int | Total system RAM in GB |
| `os_version` | string | macOS kernel version |
| `pp128_toks` | float | Prompt processing speed, 128 tokens (tok/s) |
| `pp256_toks` | float | Prompt processing speed, 256 tokens (tok/s) |
| `pp512_toks` | float | Prompt processing speed, 512 tokens (tok/s) |
| `tg128_toks` | float | Text generation speed, 128 tokens (tok/s) |
| `tg256_toks` | float | Text generation speed, 256 tokens (tok/s) |
| `peak_memory_gb` | float | Peak RSS memory usage in GB |
| `humaneval_plus_pass1` | float | HumanEval+ pass@1 score (0-1), null if not evaluated |
| `humaneval_base_pass1` | float | HumanEval base pass@1 score (0-1), null if not evaluated |
| `perplexity` | float | Perplexity score, null if not evaluated |
| `eval_framework_version` | string | EvalPlus version used |
| `timestamp` | string | ISO 8601 timestamp of the benchmark run |
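Because the quality fields are null for speed-only rows, consumers will usually want to split rows before working with scores. The following is a minimal sketch in plain Python, assuming the rows are already loaded as dictionaries keyed by the field names above. `BenchRow`, `split_rows`, and `as_percent` are hypothetical helpers, only a subset of fields is shown, and the exact `chip` strings are an assumption. The first row uses values from the quality table in this card; the second is a labeled placeholder, not dataset data.

```python
from typing import Optional, TypedDict


class BenchRow(TypedDict):
    """Subset of the documented fields (hypothetical container type)."""
    model_name: str
    chip: str
    tg128_toks: float
    humaneval_plus_pass1: Optional[float]  # 0-1 fraction, None if not evaluated


def split_rows(rows):
    """Return (rows with a HumanEval+ score, speed-only rows)."""
    scored = [r for r in rows if r["humaneval_plus_pass1"] is not None]
    speed_only = [r for r in rows if r["humaneval_plus_pass1"] is None]
    return scored, speed_only


def as_percent(fraction: float) -> str:
    """Format a 0-1 pass@1 fraction as a percentage with one decimal."""
    return f"{fraction * 100:.1f}%"


rows = [
    # Values taken from the quality table in this card.
    {"model_name": "Qwen 2.5 Coder 7B", "chip": "M5",
     "tg128_toks": 11.3, "humaneval_plus_pass1": 0.842},
    # Placeholder speed-only row: illustrative values, NOT from the dataset.
    {"model_name": "example-model", "chip": "M1",
     "tg128_toks": 1.0, "humaneval_plus_pass1": None},
]

scored, speed_only = split_rows(rows)
for r in scored:
    print(r["model_name"], as_percent(r["humaneval_plus_pass1"]))
```

Testing for `None` explicitly, rather than relying on truthiness, keeps a legitimate score of 0.0 from being treated as missing.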
## Hardware Configurations
| Chip | CPU Cores | GPU Cores | RAM | Rows |
|---|---|---|---|---|
| Apple M1 | 8 | 7 | 16 GB | 20 |
| Apple M2 Max | 12 | 38 | 32 GB | 39 |
| Apple M5 | 10 | 10 | 32 GB | 64 |
Code quality evaluations (HumanEval+) were run on the M5 configuration only.
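The row counts in the table can be cross-checked against the totals quoted elsewhere in this card. This is a small arithmetic sketch; the last figure assumes that each of the 21 scored models corresponds to exactly one M5 row, which the card implies but does not state outright.

```python
# Row counts per chip, copied from the hardware table above.
rows_per_chip = {"Apple M1": 20, "Apple M2 Max": 39, "Apple M5": 64}
models_with_quality = 21  # all evaluated on the M5 configuration

total_rows = sum(rows_per_chip.values())
speed_only_rows = total_rows - models_with_quality
# Assumption: one scored row per evaluated model, all on M5.
m5_speed_only_rows = rows_per_chip["Apple M5"] - models_with_quality

print(total_rows, speed_only_rows, m5_speed_only_rows)
```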
## Benchmark Methodology
**Speed benchmarks:**
- GGUF models: `llama-bench` with flash attention enabled, all layers offloaded to GPU (`-ngl 99`)
- MLX models: `mlx_lm.benchmark`
- Prompt processing measured at 128, 256, and 512 input tokens
- Text generation measured at 128 and 256 output tokens
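The GGUF settings above map onto a `llama-bench` invocation roughly as sketched below. The author's exact command line is not given in this card, so treat this as an assumption: the model path is hypothetical, `build_llama_bench_cmd` is an illustrative helper, and flag spellings (in particular `-fa`) can differ between llama.cpp versions. The sketch only assembles the argument list; it does not run the tool.

```python
import shlex


def build_llama_bench_cmd(model_path: str) -> list[str]:
    """Assemble a llama-bench argument list matching the settings listed above."""
    return [
        "llama-bench",
        "-m", model_path,      # path to the GGUF file (hypothetical here)
        "-p", "128,256,512",   # prompt-processing sizes in tokens
        "-n", "128,256",       # text-generation lengths in tokens
        "-ngl", "99",          # offload all layers to the GPU
        "-fa", "1",            # enable flash attention
    ]


cmd = build_llama_bench_cmd("models/example-q4_k_m.gguf")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the benchmark on a machine where `llama-bench` is on the `PATH`.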
**Code quality benchmarks:**
- Framework: [EvalPlus](https://github.com/evalplus/evalplus) HumanEval+
- 164 problems, with about 80x more tests than the original HumanEval test suite
- Greedy decoding (temperature=0)
- Reasoning models evaluated with `--no-think` flag
- Quantization: Q4_K_M (GGUF), 4-bit (MLX)
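With greedy decoding there is one sample per problem, so pass@1 reduces to the fraction of the 164 problems whose single completion passes every test. A minimal sketch of that calculation follows; `pass_at_1` is a hypothetical helper rather than the EvalPlus API, and the 147-of-164 outcome is an illustrative count chosen for the example, not a figure reported by the dataset.

```python
def pass_at_1(results: dict[str, bool]) -> float:
    """Fraction of problems whose single greedy sample passed every test."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


# Illustrative outcome: the first 147 of 164 problems pass.
results = {f"HumanEval/{i}": i < 147 for i in range(164)}
score = pass_at_1(results)
print(f"{score:.3f}")  # 0.896
```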
## Models with Code Quality Scores
All results on Apple M5, 32 GB, Q4_K_M quantization.
| Model | Params | HumanEval+ pass@1 | tg128 (tok/s) |
|---|---|---|---|
| Qwen 3.6 35B-A3B | 35B | 89.6% | 16.7 |
| Qwen 2.5 Coder 32B | 32B | 87.2% | 2.5 |
| Qwen 2.5 Coder 14B | 14B | 86.6% | 5.9 |
| Qwen 2.5 Coder 7B | 7B | 84.2% | 11.3 |
| Phi 4 14B | 14B | 82.3% | 5.3 |
| Devstral Small 24B | 24B | 81.7% | 3.6 |
| Gemma 3 27B | 27B | 78.7% | 3.1 |
| Gemma 3 12B | 12B | 75.6% | 5.7 |
| Mistral Small 3.1 24B | 24B | 75.6% | 3.6 |
| Phi 4 Mini 3.8B | 3.8B | 70.7% | 19.6 |
| Mistral Nemo 12B | 12B | 64.6% | 6.9 |
| Gemma 3 4B | 4B | 64.6% | 16.5 |
| Llama 3.1 8B Instruct | 8B | 61.0% | 10.8 |
| Llama 3.2 3B Instruct | 3B | 60.4% | 24.1 |
| Mistral 7B Instruct v0.3 | 7B | 37.2% | 11.5 |
| Gemma 3 1B | 1B | 34.2% | 46.6 |
| Llama 3.2 1B Instruct | 1B | 32.9% | 59.4 |
| Gemma 4 31B | 31B | 31.1% | 5.5 |
| Gemma 4 E4B | 4B | 14.6% | 36.7 |
| Gemma 4 26B-A4B MoE | 26B | 12.2% | 16.2 |
| Gemma 4 E2B | 2B | 9.2% | 29.2 |
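One way to read the table is to ask which models are not beaten on both quality and speed by some other model. The sketch below computes that Pareto frontier from the numbers above. `pareto_frontier` is a hypothetical helper written for this illustration, not part of the dataset tooling.

```python
# (model, HumanEval+ pass@1 in %, tg128 in tok/s), copied from the table above.
RESULTS = [
    ("Qwen 3.6 35B-A3B", 89.6, 16.7),
    ("Qwen 2.5 Coder 32B", 87.2, 2.5),
    ("Qwen 2.5 Coder 14B", 86.6, 5.9),
    ("Qwen 2.5 Coder 7B", 84.2, 11.3),
    ("Phi 4 14B", 82.3, 5.3),
    ("Devstral Small 24B", 81.7, 3.6),
    ("Gemma 3 27B", 78.7, 3.1),
    ("Gemma 3 12B", 75.6, 5.7),
    ("Mistral Small 3.1 24B", 75.6, 3.6),
    ("Phi 4 Mini 3.8B", 70.7, 19.6),
    ("Mistral Nemo 12B", 64.6, 6.9),
    ("Gemma 3 4B", 64.6, 16.5),
    ("Llama 3.1 8B Instruct", 61.0, 10.8),
    ("Llama 3.2 3B Instruct", 60.4, 24.1),
    ("Mistral 7B Instruct v0.3", 37.2, 11.5),
    ("Gemma 3 1B", 34.2, 46.6),
    ("Llama 3.2 1B Instruct", 32.9, 59.4),
    ("Gemma 4 31B", 31.1, 5.5),
    ("Gemma 4 E4B", 14.6, 36.7),
    ("Gemma 4 26B-A4B MoE", 12.2, 16.2),
    ("Gemma 4 E2B", 9.2, 29.2),
]


def pareto_frontier(results):
    """Return names of models that no other model matches or beats on both axes.

    Exact duplicates (same quality and speed) are reported once.
    """
    frontier = []
    best_speed = float("-inf")
    # Walk from highest to lowest quality; ties are ordered fastest first.
    for name, quality, speed in sorted(results, key=lambda r: (-r[1], -r[2])):
        if speed > best_speed:
            frontier.append(name)
            best_speed = speed
    return frontier


for name in pareto_frontier(RESULTS):
    print(name)
```

On these numbers the frontier is Qwen 3.6 35B-A3B, Phi 4 Mini 3.8B, Llama 3.2 3B Instruct, Gemma 3 1B, and Llama 3.2 1B Instruct. Qwen 2.5 Coder 7B is absent because Qwen 3.6 35B-A3B is both faster and more accurate in this table.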
## Limitations
- **Single quantization level**: All GGUF models use Q4_K_M; only one MLX model included. Results may differ at other quantization levels.
- **Limited hardware configs**: Three Apple Silicon chips (M1 16GB, M2 Max 32GB, M5 32GB). No data for M3 or M4 chips, or for Pro and Ultra variants.
- **Greedy decoding only**: All code quality evaluations use temperature=0. Sampling-based pass@k scores would differ.
- **Reasoning models**: Models like DeepSeek R1 Distill were tested with `--no-think`, which disables their chain-of-thought reasoning and may understate their capability.
- **Quality scores on M5 only**: HumanEval+ evaluations were run on a single hardware config. Scores should be hardware-independent in principle, but quantized inference may behave slightly differently across hardware and memory configurations.
- **Gemma 4 scores**: The Gemma 4 models score unusually low. This may reflect early quantization issues, prompt template incompatibilities, or model behavior at Q4_K_M precision.
## Links
- **GitHub**: [enescingoz/mac-llm-bench](https://github.com/enescingoz/mac-llm-bench)
- **HF Dataset**: [enescingoz/humaneval-apple-silicon](https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon)
## Citation
```bibtex
@dataset{mac_coding_bench_v1,
title = {Mac Coding Bench Results v1},
author = {Enes Cingoz},
year = {2026},
url = {https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon},
note = {Speed and code quality benchmarks for quantized LLMs on Apple Silicon}
}
```
## License
MIT
Provider:
enescingoz



