h3rb3rn/moe-sovereign-benchmarks
Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/h3rb3rn/moe-sovereign-benchmarks
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- question-answering
- text-generation
language:
- de
- en
tags:
- moe
- mixture-of-experts
- benchmark
- knowledge-graph
- graphrag
- self-hosted
- digital-sovereignty
- langgraph
- neo4j
- mcp
- tool-use
- multi-agent
- orchestration
- evaluation
pretty_name: "MoE Sovereign Benchmark Results"
size_categories:
- n<1K
configs:
- config_name: moe_eval_v1
  data_files: "results/eval_*.json"
- config_name: gaia_l1
  data_files: "results/gaia_*.json"
- config_name: longmemeval
  data_files: "results/longmemeval_*.json"
- config_name: runs
  data_files: "results/run_*.json"
---
# MoE Sovereign — Benchmark Results
Benchmark results from **[MoE Sovereign](https://moe-sovereign.org)**, an open-source Mixture-of-Experts orchestrator for self-hosted LLM inference with deterministic routing, GraphRAG knowledge accumulation, and MCP precision tools.
## Project Overview
MoE Sovereign is a **locally hosted, non-commercial** AI orchestration platform that routes requests through specialized expert LLMs using deterministic template-based routing (not learned routers). It is designed for **digital sovereignty** — all data stays on-premise, all decisions are auditable.
### Architecture
| Component | Technology |
|-----------|-----------|
| Orchestrator | FastAPI + LangGraph (Python) |
| Knowledge Graph | Neo4j (GraphRAG with domain-scoped entity filters) |
| Semantic Cache | ChromaDB (L1, cosine < 0.15) + Valkey (L2-L4) |
| MCP Tools | 23 deterministic tools (AST-whitelisted evaluator) |
| Event Streaming | Apache Kafka (ingest, feedback, linting, audit) |
| Deployment | Single OCI image, 3 profiles (solo/team/enterprise) |
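The "AST-whitelisted evaluator" behind the MCP tools is not shown in this card. As a rough illustration of the general technique only, the following minimal sketch parses an arithmetic expression with Python's `ast` module and refuses any node type that is not explicitly allowed. The function name `safe_eval` and the chosen whitelist are assumptions for this example, not the project's code.

```python
import ast
import operator

# Hypothetical whitelist: only these operator node types are evaluated.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression: str) -> float:
    """Evaluate plain arithmetic; raise ValueError on any non-whitelisted node."""
    tree = ast.parse(expression, mode="eval")

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            # Accept numeric literals only (bool is excluded on purpose).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, strings, etc. all end up here.
        raise ValueError(f"disallowed expression element: {type(node).__name__}")

    return _eval(tree)
```

Because function calls and attribute access are never on the whitelist, an input such as `__import__('os').system('id')` is rejected before anything is executed.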
### Hardware
All benchmarks were run on a personally built 48U server rack:
| Node | GPUs | VRAM | RAM | Role |
|------|------|------|-----|------|
| N04-RTX | 5x RTX 3060 | 60 GB | 64 GB | Primary inference (>30B models) |
| N06-M10 | 4x Tesla M10 | 32 GB | 128 GB | Secondary inference (<=20B) |
| N07-GT | 2x GPU | 16 GB | 32 GB | Overflow |
| N09-M60 | 2x Tesla M60 | 16 GB | 14 GB | Lightweight tasks |
No sponsored hardware, no cloud credits, no institutional funding. Every benchmark value was measured on servers purchased with personal funds.
## Benchmark Results Summary
### MoE-Eval v1 (Internal Cognitive Benchmark)
9 test cases across 4 categories, scored via deterministic checks + LLM-as-Judge (gpt-oss:20b, direct Ollama call bypassing pipeline).
| Test | Category | Det. | LLM | Combined |
|------|----------|-----:|----:|---------:|
| Subnet calculation | MCP Precision | 7.0 | 10.0 | **8.8** |
| Arithmetic + units | MCP Precision | 10.0 | 9.0 | **9.4** |
| Date (leap year) | MCP Precision | 10.0 | 9.0 | **9.4** |
| 3-turn memory | Compounding Knowledge | 5.5 | 9.0 | **7.6** |
| 5-turn memory | Compounding Knowledge | 0.0 | 0.0 | 0.0 |
| Legal routing (BGB) | Domain Routing | 0.0 | 8.0 | **4.8** |
| Medical (Hashimoto) | Domain Routing | 9.4 | 0.0 | 3.8 |
| Code review (SQL inj.) | Domain Routing | 8.8 | 9.0 | **8.9** |
| Multi-expert synthesis | Multi-Expert | 2.3 | 0.0 | 0.9 |
| **Average** | | | | **6.0/10** |
**Scoring**: Combined = 0.4 x Deterministic + 0.6 x LLM Judge. Some LLM judge scores are 0.0 due to model unloading between calls (gpt-oss:20b on shared Tesla GPU).
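The weighting can be checked against the table with a few lines of Python. This is a sketch for verification, not the project's `evaluator.py`; the names `combined`, `average_combined`, and `SCORES` are illustrative, and the numbers are copied from the table above.

```python
DET_WEIGHT = 0.4  # weight of the deterministic checks
LLM_WEIGHT = 0.6  # weight of the LLM judge

# (test, deterministic score, LLM judge score), copied from the results table
SCORES = [
    ("Subnet calculation", 7.0, 10.0),
    ("Arithmetic + units", 10.0, 9.0),
    ("Date (leap year)", 10.0, 9.0),
    ("3-turn memory", 5.5, 9.0),
    ("5-turn memory", 0.0, 0.0),
    ("Legal routing (BGB)", 0.0, 8.0),
    ("Medical (Hashimoto)", 9.4, 0.0),
    ("Code review (SQL inj.)", 8.8, 9.0),
    ("Multi-expert synthesis", 2.3, 0.0),
]


def combined(det: float, llm: float) -> float:
    """Combined = 0.4 x Deterministic + 0.6 x LLM Judge."""
    return DET_WEIGHT * det + LLM_WEIGHT * llm


def average_combined(rows) -> float:
    """Unweighted mean of the combined scores over all test cases."""
    return sum(combined(det, llm) for _, det, llm in rows) / len(rows)


if __name__ == "__main__":
    for name, det, llm in SCORES:
        print(f"{name:<24} {combined(det, llm):.1f}")
    print(f"{'Average':<24} {average_combined(SCORES):.1f}")
```

Rounded to one decimal, each row matches the Combined column and the mean matches the reported 6.0/10.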
### GAIA Level 1 (External Benchmark)
10 questions from the GAIA validation set (General AI Assistants, multi-step reasoning + tool use).
| # | Question Topic | Expected | Correct | Time |
|---|---------------|----------|---------|-----:|
| 1 | Kipchoge marathon pace calculation | 17 | Yes | 267s |
| 2 | Mercedes Sosa discography (2000-2009) | 3 | Yes | 57s |
| 3 | Game show probability | 3 | Yes | 397s |
| 4 | Fish bag volume (Leicester paper) | 0.1777 | No | 368s |
| 5 | Bird species in YouTube video | 3 | Yes | 389s |
| 6 | Pie Menus paper author's work | (long text) | No | 455s |
| 7 | Doctor Who maze location | THE CASTLE | Yes | 90s |
| 8 | Secret Santa logic puzzle | Fred | No | 180s |
| 9 | Reversed text comprehension | Right | No | 1243s |
| 10 | Spreadsheet land plot connectivity | No | Yes | 941s |
**Result: 6/10 = 60.0%**
GAIA Leaderboard context (April 2026):
- GPT-5 Mini: 44.8%
- Claude 3.7 Sonnet Thinking: 43.9%
- Gemini 2.5 Pro: 33.3%
- **MoE Sovereign (30b-balanced): 60.0%** (Level 1 only, 10 questions)
- Qwen3 32B Thinking: 12.3%
**Note**: Our result covers a 10-question subset of Level 1 only, whereas the leaderboard scores span all levels, so the numbers are not directly comparable. We read the score as an indication of the value that orchestration adds over the individual backbone models (11-28%), not as a leaderboard ranking.
### LongMemEval (External Benchmark)
8 curated multi-turn tests across 5 memory ability categories.
| Category | Passed | Avg. Score |
|----------|--------|-----------|
| Multi-session reasoning | 2/2 | 66.7% |
| Temporal reasoning | 1/1 | 66.7% |
| Information extraction | 1/2 | 50.0% |
| Knowledge update | 1/2 | 50.0% |
| Abstention | 0/1 | 20.0% |
| **Overall** | **5/8** | **52.5%** |
Reference: EverMemOS = 83%, TiMem = 76.9%
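The card does not say how the overall score is aggregated. One reading that reproduces the reported 52.5% is a mean of the category averages weighted by the number of tests per category. The sketch below works under that assumption; the names `CATEGORIES` and `overall` are illustrative.

```python
# (category, tests passed, tests total, average score in percent), from the table
CATEGORIES = [
    ("Multi-session reasoning", 2, 2, 66.7),
    ("Temporal reasoning", 1, 1, 66.7),
    ("Information extraction", 1, 2, 50.0),
    ("Knowledge update", 1, 2, 50.0),
    ("Abstention", 0, 1, 20.0),
]


def overall(categories):
    """Return (passed, total, test-count-weighted mean score in percent)."""
    total_tests = sum(total for _, _, total, _ in categories)
    total_passed = sum(passed for _, passed, _, _ in categories)
    # Assumption: each category average counts once per test in that category.
    weighted = sum(total * avg for _, _, total, avg in categories) / total_tests
    return total_passed, total_tests, weighted


if __name__ == "__main__":
    passed, total, score = overall(CATEGORIES)
    print(f"{passed}/{total} passed, {score:.1f}% average")
```

If the project aggregates differently (for example, per-test scores that are not shown here), the intermediate numbers may differ even though the totals agree.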
### Compounding Effect (Run 1 vs Run 2)
A key architectural claim: the system improves over time through knowledge accumulation.
| Metric | Run 1 | Run 2 | Change |
|--------|------:|------:|--------|
| 70b template latency | 1414s | 641s | **-55%** |
| Neo4j entities | ~400 | 1610 | +302% |
| Neo4j relations | ~200 | 1317 | +559% |
| Ontology gaps | 0 | 171 | (tracked) |
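The Change column is a relative change against Run 1. A minimal sketch of that arithmetic, using the table's values (the helper name `percent_change` is illustrative):

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) * 100.0 / before


if __name__ == "__main__":
    rows = [
        ("70b template latency (s)", 1414, 641),
        ("Neo4j entities", 400, 1610),   # Run 1 value is approximate (~400)
        ("Neo4j relations", 200, 1317),  # Run 1 value is approximate (~200)
    ]
    for label, run1, run2 in rows:
        print(f"{label:<26} {percent_change(run1, run2):+.1f}%")
```

The latency change comes out at roughly -54.7%, which the table rounds to -55%. The entity and relation baselines are approximate, so their percentages are approximate as well, and the ontology-gap row has a zero baseline, which is why the table lists it as "(tracked)" instead of a percentage.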
## Template Configuration
All benchmarks use the `moe-reference-30b-balanced` template:
| Expert | Model | Node |
|--------|-------|------|
| Planner | phi4:14b | N06-M10 |
| Judge/Merger | gpt-oss:20b | N04-RTX |
| reasoning | deepseek-r1:32b | N04-RTX |
| research | gemma3:27b | N04-RTX |
| technical_support | qwen3:32b | N04-RTX |
| code_reviewer | devstral:24b | N06-M10 |
| math | phi4:14b | N06-M10 |
| legal_advisor | gpt-oss:20b | N06-M10 |
| medical_consult | gpt-oss:20b | N09-M60 |
| creative_writer | gemma3:12b | N07-GT |
| translation | phi4:14b | N07-GT |
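To illustrate what deterministic template-based routing means in practice, here is a minimal sketch that treats the template as a static lookup table. The expert, model, and node names are copied from the table above; the `ExpertSlot` class, the `resolve_expert` function, and the fallback to `reasoning` are assumptions for this example and do not describe the project's actual template format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertSlot:
    model: str  # Ollama model tag
    node: str   # inference node hosting the model


# Expert assignments from the moe-reference-30b-balanced table above.
TEMPLATE_30B_BALANCED = {
    "reasoning": ExpertSlot("deepseek-r1:32b", "N04-RTX"),
    "research": ExpertSlot("gemma3:27b", "N04-RTX"),
    "technical_support": ExpertSlot("qwen3:32b", "N04-RTX"),
    "code_reviewer": ExpertSlot("devstral:24b", "N06-M10"),
    "math": ExpertSlot("phi4:14b", "N06-M10"),
    "legal_advisor": ExpertSlot("gpt-oss:20b", "N06-M10"),
    "medical_consult": ExpertSlot("gpt-oss:20b", "N09-M60"),
    "creative_writer": ExpertSlot("gemma3:12b", "N07-GT"),
    "translation": ExpertSlot("phi4:14b", "N07-GT"),
}


def resolve_expert(template, category, fallback="reasoning"):
    """Pure lookup: the same category always resolves to the same slot.

    The fallback category is a hypothetical choice for this sketch.
    """
    return template.get(category, template[fallback])


if __name__ == "__main__":
    slot = resolve_expert(TEMPLATE_30B_BALANCED, "legal_advisor")
    print(f"legal_advisor -> {slot.model} on {slot.node}")
```

Because the mapping is a plain dictionary and not a learned router, every routing decision can be reproduced and audited from the template alone.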
## Enterprise Features (validated, no regression)
After implementing four enterprise architecture features inspired by Palantir AIP, Databricks Mosaic AI, and Glean, a validation benchmark confirmed **6.0/10 — identical to baseline**:
1. **Confidence Decay & Self-Healing**: Trust-score computation with automatic removal of decayed unverified triples
2. **Multi-Tenant RBAC**: Graph-level tenant isolation via Neo4j `tenant_id` filtering
3. **Inline Provenance Tags**: `[REF:entity]` source attribution in merger responses
4. **Blast-Radius Estimation**: Quarantine for high-impact triples (>20 connected entities)
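Two of these features lend themselves to a short illustration. The sketch below extracts `[REF:entity]` tags from a merged response and applies the ">20 connected entities" quarantine threshold. Only the tag format and the threshold come from the list above; the function names, the regular expression, and the sample sentence are assumptions, and how the project counts connected entities in Neo4j is not shown here.

```python
import re

# Matches inline provenance tags of the form [REF:entity].
REF_TAG = re.compile(r"\[REF:([^\]]+)\]")

# Threshold quoted above: triples touching more than 20 entities are quarantined.
BLAST_RADIUS_LIMIT = 20


def extract_refs(text: str) -> list[str]:
    """Return the entity names cited via [REF:...] tags, in order of appearance."""
    return REF_TAG.findall(text)


def needs_quarantine(connected_entities: int) -> bool:
    """True if a triple's blast radius exceeds the limit."""
    return connected_entities > BLAST_RADIUS_LIMIT


if __name__ == "__main__":
    # Made-up sample response, for illustration only.
    answer = "The limitation period is governed by [REF:BGB] and [REF:Verjaehrung]."
    print(extract_refs(answer))
    print(needs_quarantine(21), needs_quarantine(20))
```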
## File Structure
```
results/
  eval_*.json          # MoE-Eval scored results (deterministic + LLM judge)
  run_*.json           # MoE-Eval raw pipeline outputs
  gaia_*.json          # GAIA Level 1 results
  longmemeval_*.json   # LongMemEval results
datasets/
  moe_eval_v1.json     # MoE-Eval v1 test case definitions (9 tests, 4 categories)
```
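Given this layout, the result files for one benchmark can be collected with a glob that mirrors the `data_files` patterns in the card's front matter. The following is a minimal standard-library sketch; the helper name `load_results` is illustrative, and it makes no assumption about the JSON schema inside each file.

```python
import json
from pathlib import Path


def load_results(root, pattern):
    """Load every JSON file under `root` that matches `pattern`.

    Returns a dict mapping file name to the parsed JSON content.
    """
    results = {}
    for path in sorted(Path(root).glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            results[path.name] = json.load(fh)
    return results


if __name__ == "__main__":
    # Assumes the dataset repository has been downloaded to the current directory.
    gaia = load_results("results", "gaia_*.json")
    print(f"loaded {len(gaia)} GAIA result file(s): {list(gaia)}")
```

The same call with `eval_*.json`, `run_*.json`, or `longmemeval_*.json` selects the other result sets.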
## Reproducibility
All benchmarks can be reproduced:
```bash
# Clone the main project
git clone https://github.com/h3rb3rn/moe-sovereign
cd moe-sovereign
# Run MoE-Eval
MOE_API_KEY=<your-key> MOE_TEMPLATE=<your-template> python3 benchmarks/runner.py
MOE_API_KEY=<your-key> python3 benchmarks/evaluator.py
# Run GAIA Level 1
HF_TOKEN=<your-token> MOE_API_KEY=<your-key> python3 benchmarks/gaia_runner.py
# Run LongMemEval
MOE_API_KEY=<your-key> python3 benchmarks/longmemeval_runner.py
```
## Citation
```bibtex
@misc{horn2026moesovereign,
title = {Sovereign Mixture-of-Experts: A Locally Hosted, Deterministically Routed LLM Orchestrator with Compounding Knowledge},
author = {Philipp Horn},
year = {2026},
url = {https://moe-sovereign.org},
note = {Non-commercial, CC BY-SA 4.0}
}
```
## Author
**Philipp Horn** — kontakt@philipp-horn.dev
Built entirely on personally purchased consumer hardware. Digital sovereignty lived, not preached.
## License
CC BY-SA 4.0
Provider:
h3rb3rn
Collected summary
Dataset introduction

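Construction
The dataset was built through a systematic benchmarking process on a heterogeneous compute cluster. The research team deployed a deterministic orchestration pipeline based on LangGraph that integrates multiple local Ollama model instances, with parameter counts ranging from 7B to 70B and Q4_K_M quantization. The knowledge base uses a Neo4j graph database to implement a GraphRAG architecture and contains more than 3240 entities and 2745 relations. The evaluation setup incorporates 27 MCP tools executed through an abstract-syntax-tree whitelist mechanism, which supports reliable and reproducible testing in a controlled environment.
Characteristics
The core characteristic of this dataset is its multidimensional evaluation framework. It includes a systematic analysis of how well 69 large language models fit the roles in a mixture-of-experts system, and it integrates several performance measures: the GAIA General AI Assistants benchmark, long-term memory evaluation, and compounding-effect analysis. The dataset places particular emphasis on adversarial security testing, covering validation of defenses against nine code-injection attack vectors. All evaluation results come with detailed latency metrics and model assignment records, providing empirical evidence for the engineering deployment of mixture-of-experts systems.
Usage
Researchers can parse the structured JSON files in the dataset to obtain performance baselines for the mixture-of-experts system in regulated environments. The evaluation files record the score for each test question, the response latency, and the expert-model assignment path, which supports side-by-side comparison of different architectures. The security testing module provides a reference library of attack vectors for developing safety-critical AI systems, and the cache performance data provides a quantitative basis for optimizing distributed systems. The dataset can be used directly as a benchmark suite for evaluating mixture-of-experts orchestration algorithms, knowledge-graph-augmented architectures, and security mechanisms.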
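Background and challenges
Background overview
The MoE Sovereign Benchmark Dataset was released in 2026 by researcher Philipp Horn as part of the MoE Sovereign project, which aims to build a sovereign mixture-of-experts AI infrastructure for regulated environments. Its core research question is how suitable local large language models are for the different roles within a mixture-of-experts architecture, and how well such systems perform overall on complex tasks. By systematically testing model capabilities in task planning, output judging, and expert execution, the dataset provides a benchmark for developing reliable, efficient AI systems in domains with strict data-sovereignty and compliance requirements, and it advances the practical use of mixture-of-experts systems in specialized settings.
Current challenges
The dataset addresses the core challenge that mixture-of-experts systems face when deployed in regulated industries: decomposing, coordinating, and executing complex tasks efficiently while remaining secure and compliant. Specific construction challenges include designing a comprehensive set of evaluation questions spanning mathematics, code, law, medicine, and other domains to test cross-domain capability; developing an adversarial security testing framework to verify robustness against nine categories of security threats, such as code injection and prompt attacks; and coordinating dozens of local models of varying parameter counts on a heterogeneous compute cluster, integrated with a graph retrieval-augmented generation knowledge base, to evaluate latency and knowledge-update performance in a realistic deployment.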
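Common scenarios
Classic use cases
In artificial intelligence, and particularly for mixture-of-experts systems in regulated environments, the MoE Sovereign Benchmark dataset provides a baseline for evaluating how well local large language models adapt to roles within an MoE architecture. Through systematic testing, it measures model capabilities in task decomposition, output evaluation and merging, and adherence to expert system prompts, giving researchers empirical evidence for selecting models suited to the planner, judge, and expert roles. Its classic use case centers on deterministic orchestration on heterogeneous GPU clusters, combined with a GraphRAG knowledge base and multi-tool integration, to support efficient and secure AI system deployment.
Academic problems addressed
The dataset addresses key research questions about mixture-of-experts systems, including how to quantify model performance on complex orchestration tasks and how to achieve accuracy comparable to frontier models in resource-constrained local environments. By providing detailed benchmark results, such as role-suitability analysis, GAIA benchmark comparisons, and long-term memory evaluation, it helps researchers understand model behavior in knowledge updating, temporal reasoning, and context reuse. This supports combining theoretical frameworks with empirical validation for deterministic AI systems and offers a reference for designing AI infrastructure in regulated industries.
Derived related work
Work derived from this dataset includes research on optimizing orchestration strategies for mixture-of-experts systems and methods for enhancing long-term memory by combining local models with a GraphRAG knowledge base. Related studies further explore how deterministic template routing reduces latency and extend the adversarial testing framework to cover more attack vectors. This work deepens the understanding of performance bottlenecks of MoE architectures in controlled environments and has driven technical development such as the LangGraph pipeline and AST-whitelisted tools, providing a foundation for later system design and evaluation in sovereign AI infrastructure.
The above content was collected and summarized by 遇见数据集.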



