Custom Mil Dataset
Source: DataCite Commons | Updated: 2026-05-03 | Indexed: 2026-05-07
Download link:
https://zenodo.org/doi/10.5281/zenodo.20005783
Description:
As large language models advance toward deployment in military mission-critical information systems, existing evaluation frameworks prove inadequate for assessing their suitability in defence contexts. Traditional benchmarks such as MMLU-Pro and HELM, while effective for general capabilities, do not account for the special needs of military operations: domain-specific knowledge, operational constraints in air-gapped environments, adversarial robustness under attack, and resource limitations in tactical edge deployments. This paper presents MEBL (Military Evaluation Benchmarking for LLM), a comprehensive evaluation framework designed specifically for defence applications. Our methodology evaluates models across four critical axes (foundational capabilities, mission-specific performance, operational robustness, and resource efficiency) using Multi-Criteria Decision Analysis (MCDA) with hardware-specific normalization. We assessed ten prominent open-source models using our large, carefully curated Military Golden Dataset, derived from several publicly available defence documents and military scenarios. Evaluations were conducted on two representative military hardware configurations: an RTX 3060 GPU workstation for strategic deployments and Intel Core i7 CPU systems for tactical edge scenarios. Results show that model size is only weakly correlated with military utility, highlighting the importance of efficiency, mission-specific performance, and operational robustness over raw scale. MEBL further assigns deployment-readiness categories (Strategic, Tactical, and Limited Use) to guide procurement and fielding decisions. The findings show notable disparities in deployment preparedness: 60% of the assessed models need considerable improvements prior to military deployment, while only the Gemma 12B and Mistral 7B models achieved operational readiness. These findings suggest that existing LLM evaluation benchmarks inadequately predict performance in military contexts, where operational constraints often outweigh raw accuracy. MEBL addresses this evaluation gap by providing defence organisations with a framework tailored to their unique deployment requirements and constraints.
Datasets:
1. MEBL Custom Military Dataset
2. Mil Law Dataset
3. Fine-tuning IFT Dataset
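The abstract describes scoring models on four axes with MCDA and hardware-specific normalization, but it does not give the weights or the aggregation rule. The following is a minimal sketch of one common MCDA variant (a weighted sum over min-max normalized scores, computed separately per hardware profile). The equal weights, the function names, and the input layout are all assumptions made for illustration and are not taken from MEBL.

```python
# Hypothetical equal weights for the four axes named in the abstract.
# MEBL's actual weights are not stated in this record.
WEIGHTS = {
    "foundational": 0.25,
    "mission": 0.25,
    "robustness": 0.25,
    "efficiency": 0.25,
}


def min_max_normalize(values):
    """Scale raw scores from one hardware profile to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A tied axis cannot discriminate between models.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mcda_scores(raw, weights=WEIGHTS):
    """Weighted-sum MCDA for a single hardware profile.

    raw maps model name -> {axis name -> raw score}, with higher being
    better on every axis. Normalizing within one hardware profile is the
    assumed meaning of "hardware-specific normalization" here.
    """
    models = list(raw)
    normalized = {m: {} for m in models}
    for axis in weights:
        column = min_max_normalize([raw[m][axis] for m in models])
        for model, value in zip(models, column):
            normalized[model][axis] = value
    return {
        m: sum(weights[a] * normalized[m][a] for a in weights)
        for m in models
    }
```

Calling `mcda_scores` once with GPU-workstation measurements and once with CPU measurements would give two separate rankings, so a model is only compared against others running on the same hardware.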
Provider:
Zenodo
Created:
2026-05-03



