enescingoz/humaneval-apple-silicon

Name: enescingoz/humaneval-apple-silicon
Creator: enescingoz
Published: 2026-04-21 08:40:05
License: not listed (the dataset card specifies MIT)

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/enescingoz/humaneval-apple-silicon

Description:

---
language: en
license: mit
tags:
- benchmark
- apple-silicon
- coding
- llm-evaluation
- macos
- humaneval
task_categories:
- text-generation
size_categories:
- n<1K
---

# Mac Coding Bench Results v1: Speed + Code Quality Benchmarks on Apple Silicon

Speed and code quality benchmarks for quantized LLMs running locally on Apple Silicon Macs. The dataset pairs inference speed measurements (tokens/sec) with HumanEval+ functional correctness scores for 21 models. Speed was measured on three hardware configurations (M1, M2 Max, M5), for a total of 123 benchmark results.

## Key Highlights

- **Qwen 3.6 35B-A3B** achieves 89.6% HumanEval+ pass@1 at 16.7 tok/s, the best quality score in the dataset, and is fast thanks to its MoE architecture
- **Qwen 2.5 Coder 7B** hits 84.2% at 11.3 tok/s, the best quality-to-speed trade-off for a dense model
- **Phi 4 Mini 3.8B** reaches 70.7% at 19.6 tok/s, strong for its size
- **Gemma 4 family** scores 9-31% on HumanEval+, significantly underperforming Gemma 3 (34-79%)
- 21 models evaluated for code quality, 123 total speed benchmark results across 3 chips
- All models quantized to Q4_K_M (GGUF) or 4-bit (MLX)

## Dataset Description

Each row represents one model benchmarked on one hardware configuration. Speed metrics come from `llama-bench` (GGUF) or `mlx_lm.benchmark` (MLX). Code quality metrics come from EvalPlus HumanEval+ evaluation (164 problems). Quality scores are available for 21 models on the M5 configuration; the remaining 102 rows have speed-only data across M1, M2 Max, and M5.
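
The split between scored rows and speed-only rows described above can be sketched as follows. This is a minimal illustration, assuming each row is available as a plain dict and that a missing quality score is represented as `None`. The `Row` type, the `split_by_quality` helper, and the `example-model-*` slugs are hypothetical names introduced here, not part of the dataset or its tooling.

```python
from typing import Optional, TypedDict


class Row(TypedDict):
    # Only the fields needed for this sketch; real rows carry more columns.
    model_id: str
    chip: str
    humaneval_plus_pass1: Optional[float]


def split_by_quality(rows: list[Row]) -> tuple[list[Row], list[Row]]:
    """Return (rows with a HumanEval+ score, speed-only rows)."""
    scored = [r for r in rows if r["humaneval_plus_pass1"] is not None]
    speed_only = [r for r in rows if r["humaneval_plus_pass1"] is None]
    return scored, speed_only


# Hypothetical sample rows: the slugs are placeholders. The 0.842 score is the
# Qwen 2.5 Coder 7B value reported in this card (84.2% on M5).
sample_rows: list[Row] = [
    {"model_id": "example-model-a", "chip": "M5", "humaneval_plus_pass1": 0.842},
    {"model_id": "example-model-a", "chip": "M1", "humaneval_plus_pass1": None},
    {"model_id": "example-model-b", "chip": "M2 Max", "humaneval_plus_pass1": None},
]

scored, speed_only = split_by_quality(sample_rows)
print(len(scored), len(speed_only))  # 1 2

# Row counts stated in the card: 123 rows in total, 21 of them with scores.
TOTAL_ROWS = 123
SCORED_ROWS = 21
print(TOTAL_ROWS - SCORED_ROWS)  # 102 speed-only rows
```

Applied to the full dataset, the same filter should yield 21 scored rows, all on M5, if the counts in the description hold.
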
## Data Fields

| Field | Type | Description |
|---|---|---|
| `model_name` | string | Human-readable model name |
| `model_id` | string | Slug identifier for the model |
| `params` | string | Parameter count (e.g., "7B", "35B") |
| `quant` | string | Quantization method (Q4_K_M or 4bit) |
| `runtime` | string | Inference runtime (llama.cpp or mlx-lm) |
| `chip` | string | Apple Silicon chip (M1, M2 Max, M5) |
| `cpu_cores` | int | Number of CPU cores |
| `gpu_cores` | int | Number of GPU cores |
| `ram_gb` | int | Total system RAM in GB |
| `os_version` | string | macOS kernel version |
| `pp128_toks` | float | Prompt processing speed, 128 tokens (tok/s) |
| `pp256_toks` | float | Prompt processing speed, 256 tokens (tok/s) |
| `pp512_toks` | float | Prompt processing speed, 512 tokens (tok/s) |
| `tg128_toks` | float | Text generation speed, 128 tokens (tok/s) |
| `tg256_toks` | float | Text generation speed, 256 tokens (tok/s) |
| `peak_memory_gb` | float | Peak RSS memory usage in GB |
| `humaneval_plus_pass1` | float | HumanEval+ pass@1 score (0-1), null if not evaluated |
| `humaneval_base_pass1` | float | HumanEval base pass@1 score (0-1), null if not evaluated |
| `perplexity` | float | Perplexity score, null if not evaluated |
| `eval_framework_version` | string | EvalPlus version used |
| `timestamp` | string | ISO 8601 timestamp of the benchmark run |

## Hardware Configurations

| Chip | CPU Cores | GPU Cores | RAM | Rows |
|---|---|---|---|---|
| Apple M1 | 8 | 7 | 16 GB | 20 |
| Apple M2 Max | 12 | 38 | 32 GB | 39 |
| Apple M5 | 10 | 10 | 32 GB | 64 |

Code quality evaluations (HumanEval+) were run on the M5 configuration only.
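
One way a consumer might model these fields is a typed record checked against the hardware table. The sketch below is an assumption about usage, not the author's tooling: `BenchmarkRow`, `HARDWARE`, and `matches_hardware_table` are hypothetical names, only a subset of fields is modeled, and the `chip` strings are assumed to be stored as "M1", "M2 Max", and "M5".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkRow:
    # Identification and configuration (subset of the documented fields)
    model_name: str
    model_id: str
    params: str
    quant: str
    runtime: str
    chip: str
    cpu_cores: int
    gpu_cores: int
    ram_gb: int
    # Metrics; None stands in for a null value in the dataset
    tg128_toks: Optional[float] = None
    humaneval_plus_pass1: Optional[float] = None


# Hardware table from this card: chip -> (cpu_cores, gpu_cores, ram_gb, rows)
HARDWARE = {
    "M1": (8, 7, 16, 20),
    "M2 Max": (12, 38, 32, 39),
    "M5": (10, 10, 32, 64),
}


def matches_hardware_table(row: BenchmarkRow) -> bool:
    """Check that a row's core counts and RAM agree with the table for its chip."""
    cpu, gpu, ram, _rows = HARDWARE[row.chip]
    return (row.cpu_cores, row.gpu_cores, row.ram_gb) == (cpu, gpu, ram)


# Values taken from this card, except model_id, which is a placeholder slug.
row = BenchmarkRow(
    model_name="Qwen 2.5 Coder 7B",
    model_id="example-slug",
    params="7B",
    quant="Q4_K_M",
    runtime="llama.cpp",
    chip="M5",
    cpu_cores=10,
    gpu_cores=10,
    ram_gb=32,
    tg128_toks=11.3,
    humaneval_plus_pass1=0.842,
)

print(matches_hardware_table(row))  # True

# The per-chip row counts add up to the dataset total stated in the card.
total_rows = sum(spec[3] for spec in HARDWARE.values())
print(total_rows)  # 123
```
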
## Benchmark Methodology

**Speed benchmarks:**

- GGUF models: `llama-bench` with flash attention enabled, all layers offloaded to GPU (`-ngl 99`)
- MLX models: `mlx_lm.benchmark`
- Prompt processing measured at 128, 256, and 512 input tokens
- Text generation measured at 128 and 256 output tokens

**Code quality benchmarks:**

- Framework: [EvalPlus](https://github.com/evalplus/evalplus) HumanEval+
- 164 problems, with roughly 80x as many test cases as the original HumanEval test suite
- Greedy decoding (temperature=0)
- Reasoning models evaluated with `--no-think` flag
- Quantization: Q4_K_M (GGUF), 4-bit (MLX)

## Models with Code Quality Scores

All results on Apple M5, 32 GB, Q4_K_M quantization.

| Model | Params | HumanEval+ pass@1 | tg128 (tok/s) |
|---|---|---|---|
| Qwen 3.6 35B-A3B | 35B | 89.6% | 16.7 |
| Qwen 2.5 Coder 32B | 32B | 87.2% | 2.5 |
| Qwen 2.5 Coder 14B | 14B | 86.6% | 5.9 |
| Qwen 2.5 Coder 7B | 7B | 84.2% | 11.3 |
| Phi 4 14B | 14B | 82.3% | 5.3 |
| Devstral Small 24B | 24B | 81.7% | 3.6 |
| Gemma 3 27B | 27B | 78.7% | 3.1 |
| Gemma 3 12B | 12B | 75.6% | 5.7 |
| Mistral Small 3.1 24B | 24B | 75.6% | 3.6 |
| Phi 4 Mini 3.8B | 3.8B | 70.7% | 19.6 |
| Mistral Nemo 12B | 12B | 64.6% | 6.9 |
| Gemma 3 4B | 4B | 64.6% | 16.5 |
| Llama 3.1 8B Instruct | 8B | 61.0% | 10.8 |
| Llama 3.2 3B Instruct | 3B | 60.4% | 24.1 |
| Mistral 7B Instruct v0.3 | 7B | 37.2% | 11.5 |
| Gemma 3 1B | 1B | 34.2% | 46.6 |
| Llama 3.2 1B Instruct | 1B | 32.9% | 59.4 |
| Gemma 4 31B | 31B | 31.1% | 5.5 |
| Gemma 4 E4B | 4B | 14.6% | 36.7 |
| Gemma 4 26B-A4B MoE | 26B | 12.2% | 16.2 |
| Gemma 4 E2B | 2B | 9.2% | 29.2 |

## Limitations

- **Single quantization level**: All GGUF models use Q4_K_M; only one MLX model included. Results may differ at other quantization levels.
- **Limited hardware configs**: Three Apple Silicon chips (M1 16GB, M2 Max 32GB, M5 32GB). No M3/M4 Pro/Ultra data.
- **Greedy decoding only**: All code quality evaluations use temperature=0.
  Sampling-based pass@k scores would differ.
- **Reasoning models**: Models like DeepSeek R1 Distill were tested with `--no-think`, which disables their chain-of-thought reasoning and may understate their capability.
- **Quality scores on M5 only**: HumanEval+ evaluations were run on a single hardware config. Scores should be hardware-independent, but quantized inference could behave slightly differently on other memory configurations.
- **Gemma 4 scores**: The Gemma 4 models score unusually low. This may reflect early quantization issues, prompt template incompatibilities, or model behavior at Q4_K_M precision.

## Links

- **GitHub**: [enescingoz/mac-llm-bench](https://github.com/enescingoz/mac-llm-bench)
- **Hugging Face dataset**: [enescingoz/humaneval-apple-silicon](https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon)

## Citation

```bibtex
@dataset{mac_coding_bench_v1,
  title = {Mac Coding Bench Results v1},
  author = {Enes Cingoz},
  year = {2026},
  url = {https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon},
  note = {Speed and code quality benchmarks for quantized LLMs on Apple Silicon}
}
```

## License

MIT
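
As a follow-up to the "Models with Code Quality Scores" table, the quality-versus-speed trade-off can be explored by listing the models that no other model beats on both axes. The Pareto-front framing and the `pareto_front` helper are assumptions introduced here for illustration, not the author's method; the numbers are copied from the table (Apple M5, 32 GB).

```python
# (model, HumanEval+ pass@1 in percent, tg128 in tok/s), copied from the table
RESULTS = [
    ("Qwen 3.6 35B-A3B", 89.6, 16.7),
    ("Qwen 2.5 Coder 32B", 87.2, 2.5),
    ("Qwen 2.5 Coder 14B", 86.6, 5.9),
    ("Qwen 2.5 Coder 7B", 84.2, 11.3),
    ("Phi 4 14B", 82.3, 5.3),
    ("Devstral Small 24B", 81.7, 3.6),
    ("Gemma 3 27B", 78.7, 3.1),
    ("Gemma 3 12B", 75.6, 5.7),
    ("Mistral Small 3.1 24B", 75.6, 3.6),
    ("Phi 4 Mini 3.8B", 70.7, 19.6),
    ("Mistral Nemo 12B", 64.6, 6.9),
    ("Gemma 3 4B", 64.6, 16.5),
    ("Llama 3.1 8B Instruct", 61.0, 10.8),
    ("Llama 3.2 3B Instruct", 60.4, 24.1),
    ("Mistral 7B Instruct v0.3", 37.2, 11.5),
    ("Gemma 3 1B", 34.2, 46.6),
    ("Llama 3.2 1B Instruct", 32.9, 59.4),
    ("Gemma 4 31B", 31.1, 5.5),
    ("Gemma 4 E4B", 14.6, 36.7),
    ("Gemma 4 26B-A4B MoE", 12.2, 16.2),
    ("Gemma 4 E2B", 9.2, 29.2),
]


def pareto_front(results):
    """Keep models that no other model matches or beats on both quality and speed."""
    front = []
    for name, quality, speed in results:
        dominated = any(
            q >= quality and s >= speed and (q > quality or s > speed)
            for _, q, s in results
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(RESULTS)
print(front)
# ['Qwen 3.6 35B-A3B', 'Phi 4 Mini 3.8B', 'Llama 3.2 3B Instruct',
#  'Gemma 3 1B', 'Llama 3.2 1B Instruct']
```

A model off this list is not necessarily a poor choice. Memory footprint is not considered here, and a 7B dense model can fit on machines where a 35B MoE model cannot.
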

Provider:

enescingoz
