
aigencydev/aigency-v4-evaluation

Source: Hugging Face · Updated: 2026-04-27 · Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/aigencydev/aigency-v4-evaluation
Resource description:
---
license: mit
language:
- tr
- en
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- multiple-choice
- question-answering
- image-text-to-text
tags:
- aigency
- benchmark
- evaluation
- turkish
- frontier-comparison
- reproducibility
pretty_name: AIGENCY V4 Benchmark Evaluation Results
configs:
- config_name: summary
  data_files: summary.json
- config_name: humaneval
  data_files: humaneval/scored.jsonl
- config_name: ifeval
  data_files: ifeval/scored.jsonl
- config_name: gpqa_diamond
  data_files: gpqa_diamond/scored.jsonl
- config_name: belebele_tr
  data_files: belebele_tr/scored.jsonl
- config_name: arc_challenge
  data_files: arc_challenge/scored.jsonl
- config_name: truthfulqa_mc1
  data_files: truthfulqa_mc1/scored.jsonl
- config_name: gsm8k
  data_files: gsm8k/scored.jsonl
- config_name: mmlu
  data_files: mmlu/scored.jsonl
- config_name: mmlu_pro
  data_files: mmlu_pro/scored.jsonl
- config_name: hellaswag
  data_files: hellaswag/scored.jsonl
- config_name: winogrande
  data_files: winogrande/scored.jsonl
- config_name: humaneval_plus
  data_files: humaneval_plus/scored.jsonl
- config_name: mbpp
  data_files: mbpp/scored.jsonl
- config_name: mbpp_plus
  data_files: mbpp_plus/scored.jsonl
- config_name: tr_mmlu
  data_files: tr_mmlu/scored.jsonl
- config_name: xnli_tr
  data_files: xnli_tr/scored.jsonl
- config_name: tquad
  data_files: tquad/scored.jsonl
- config_name: tr_grammar
  data_files: tr_grammar/scored.jsonl
- config_name: chartqa
  data_files: chartqa/scored.jsonl
- config_name: mathvista
  data_files: mathvista/scored.jsonl
- config_name: docvqa
  data_files: docvqa/scored.jsonl
- config_name: mmmu
  data_files: mmmu/scored.jsonl
---

# AIGENCY V4: Benchmark Evaluation Results

> **Reproducibility capsule** for the AIGENCY V4 whitepaper.
> 13,344 real API calls · 22 benchmarks · Wilson 95% CI · seed=42.

This dataset is the verifiable evidence behind the [AIGENCY V4 model card](https://huggingface.co/aigencydev/AIGENCY-V4) and the [AIGENCY V4 whitepaper](https://github.com/ecloud-bh/aigency-v4-whitepaper). Every benchmark folder contains one `scored.jsonl` (per-item predictions, gold answers, scores) and a `summary.json` (aggregate accuracy with Wilson 95% CI).

## What's in this dataset

For each of the 22 benchmarks:

```
{benchmark}/
├── summary.json   # accuracy, ci_low, ci_high, n_total, n_scored, errors,
│                  # avg_latency_s, p95_latency_s, timestamp_utc
└── scored.jsonl   # one line per item: {item_id, prompt_excerpt, gold,
                   # pred, correct, latency_s, ...}
```

Top-level files:

- **`summary.json`**: combined summary across all 22 benchmarks (also includes operational telemetry: total_api_calls, latency_avg_s, latency_p50_s, latency_p95_s, latency_p99_s).
- **`README.md`**: this file.

## Benchmarks included

| Benchmark | Tier | Accuracy | Wilson 95% CI | n | Errors |
|---|---|---|---|---|---|
| HumanEval | 1 | 0.8415 | [0.778, 0.889] | 164/164 | 0 |
| IFEval (strict) | 1 | 0.8022 | [0.767, 0.834] | 541/541 | 1 |
| GPQA Diamond | 1 | 0.3788 | [0.314, 0.448] | 198/198 | 0 |
| Belebele-TR | 1 | 0.8733 | [0.850, 0.893] | 900/900 | 0 |
| ARC-Challenge | 1 | 0.9488 | [0.935, 0.960] | 1172/1172 | 0 |
| TruthfulQA MC1 | 1 | 0.7638 | [0.734, 0.792] | 817/817 | 0 |
| GSM8K | 1 | 0.9462 | [0.933, 0.957] | 1319/1319 | 0 |
| MMLU | 2 | 0.8010 | [0.775, 0.825] | 1000/1000 | 0 |
| MMLU-Pro | 2 | 0.5020 | [0.471, 0.533] | 1000/1000 | 0 |
| HellaSwag | 2 | 0.8860 | [0.865, 0.904] | 1000/1000 | 0 |
| WinoGrande | 2 | 0.7466 | [0.722, 0.770] | 1267/1267 | 0 |
| HumanEval+ | 2 | 0.7988 | [0.731, 0.853] | 164/164 | 0 |
| MBPP | 2 | 0.8482 | [0.799, 0.887] | 257/257 | 0 |
| MBPP+ | 2 | 0.7804 | [0.736, 0.819] | 378/378 | 0 |
| TR-MMLU | 3 | 0.7080 | [0.667, 0.746] | 500/500 | 2 |
| XNLI-TR | 3 | 0.7340 | [0.694, 0.771] | 500/500 | 2 |
| TQuAD | 3 | 0.8240 | [0.788, 0.855] | 500/500 | 0 |
| TR Grammar | 3 | 0.7900 | [0.700, 0.858] | 100/100 | 5 |
| ChartQA | 3 | 0.6768 | [0.634, 0.717] | 492/500 | 22 |
| MathVista | 3 | 0.3413 | [0.280, 0.408] | 208 | 45 |
| DocVQA | 3 | 0.7917 | [0.595, 0.908] | 24 | 5 |
| MMMU | 3 | 0.5333 | [0.361, 0.698] | 30/30 | 0 |

## Methodology

- **Endpoint**: `https://aigency.dev/api/v2` (production)
- **Assistant**: `alparslan-v4` (assistant_id = 277)
- **Temperature**: 0.0 (deterministic)
- **Top-p**: disabled (greedy decoding)
- **Concurrency**: 4–10 parallel workers
- **Backoff**: 1s → 2s → 4s → 8s → 16s, 6 attempts
- **Subsample seed**: 42
- **Confidence interval**: Wilson 95% (more robust than normal approximation for binomials)
- **Date**: 27 April 2026 (single session)

## How to use

```python
from datasets import load_dataset

# Load the high-level summary
summary = load_dataset("aigencydev/aigency-v4-evaluation", "summary")

# Load per-item scored items for a specific benchmark
gsm8k = load_dataset("aigencydev/aigency-v4-evaluation", "gsm8k")
print(gsm8k["train"][0])
# {"item_id": "...", "gold": "...", "pred": "...", "correct": True, ...}
```

## Citation

```bibtex
@misc{aigency-v4-evaluation-2026,
  title  = {AIGENCY V4 Benchmark Evaluation Results},
  author = {{eCloud Yaz{\i}l{\i}m Teknolojileri}},
  year   = {2026},
  month  = apr,
  url    = {https://huggingface.co/datasets/aigencydev/aigency-v4-evaluation},
  note   = {Reproducibility capsule for the AIGENCY V4 whitepaper}
}
```

## License

MIT (data and runner code). The underlying benchmark datasets retain their original licences (MMLU, GSM8K, HumanEval, MMLU-Pro, ARC, HellaSwag, WinoGrande, TruthfulQA, IFEval, GPQA, Belebele, XNLI, TQuAD, MMMU, ChartQA, DocVQA, MathVista; see each benchmark's source for details).

© 2026 eCloud Yazılım Teknolojileri · info@e-cloud.web.tr · ai@aigency.dev
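
The Wilson 95% intervals reported in the benchmark table can be checked from the counts alone. The sketch below is not the runner code from this repository. It is a minimal, standard-library implementation of the textbook Wilson score interval. The function name `wilson_interval` and the constant `z = 1.96` are illustrative assumptions, and the count of 138 correct answers is inferred from the reported HumanEval accuracy (0.8415 × 164), not read from the data files.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 approximates the two-sided 95% normal quantile (assumed here;
    the original runner may use a slightly more precise value).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# HumanEval row: 138 correct out of 164 (count inferred from accuracy 0.8415).
low, high = wilson_interval(138, 164)
print(f"[{low:.3f}, {high:.3f}]")  # prints [0.778, 0.889], as in the table
```

The same function can be applied to any row by passing the number of correct items and the number of scored items; for rows with errors, which denominator the published figures use is not stated here and should be checked against each benchmark's `summary.json`.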
Provider:
aigencydev