Custom Mil Dataset
Source: DataCite Commons | Updated: 2026-05-03 | Indexed: 2026-05-07
Download link:
https://zenodo.org/doi/10.5281/zenodo.20005784
Description:
As large language models advance toward deployment in military mission-critical information system applications, existing evaluation frameworks prove inadequate for assessing their suitability in defence contexts. Traditional benchmarks such as MMLU-Pro and HELM, while effective for general capabilities, do not account for the special needs of military operations: domain-specific knowledge, operational constraints in air-gapped environments, adversarial robustness under attack, and resource limitations in tactical edge deployments. This paper presents MEBL (Military Evaluation Benchmarking for LLM), a comprehensive evaluation framework designed specifically for defence applications. Our methodology evaluates models across four critical axes (foundational capabilities, mission-specific performance, operational robustness, and resource efficiency) using Multi-Criteria Decision Analysis (MCDA) with hardware-specific normalization. We assessed ten prominent open-source models using our large, carefully curated Military Golden Dataset, derived from several publicly available defence documents and military scenarios, with evaluations conducted on representative military hardware configurations: an RTX 3060 GPU workstation for strategic deployments and Intel Core i7 CPU systems for tactical edge scenarios. Results show that model size is only weakly correlated with military utility, highlighting the importance of efficiency, mission-specific performance, and operational robustness over raw scale. MEBL further assigns deployment-readiness categories (Strategic, Tactical, and Limited Use) to guide procurement and fielding decisions. The findings show notable disparities in deployment preparedness: 60% of the assessed models need considerable improvements prior to military deployment, while only the Gemma 12B and Mistral 7B models achieved operational readiness.
These findings suggest that existing LLM evaluation benchmarks inadequately predict performance in military contexts, where operational constraints often outweigh raw accuracy. MEBL addresses this evaluation gap by providing defence organisations with a framework tailored to their unique deployment requirements and constraints.

Datasets:
1. MEBL Custom Military Dataset
2. Mil Law Dataset
3. Fine-tuning IFT Dataset
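
As a rough illustration of the scoring approach the description mentions (MCDA over four axes with hardware-specific normalization, followed by a deployment-readiness category), here is a minimal weighted-sum sketch. The description does not give MEBL's weights, normalization formula, or category thresholds, so the equal weights, the min-max normalization within a single hardware profile, the threshold values, and all function and model names below are assumptions made only for illustration.

```python
# Hypothetical axis weights; the description does not state the values MEBL uses.
WEIGHTS = {
    "foundational": 0.25,
    "mission": 0.25,
    "robustness": 0.25,
    "efficiency": 0.25,
}


def min_max_normalize(values):
    """Scale raw scores to [0, 1] across the models of one hardware profile."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this axis, so it cannot separate them.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mcda_scores(raw, weights=WEIGHTS):
    """Weighted-sum MCDA.

    raw maps model name -> {axis: raw score}, all measured on the same
    hardware profile (assumed reading of "hardware-specific normalization").
    """
    models = list(raw)
    totals = {m: 0.0 for m in models}
    for axis, weight in weights.items():
        normalized = min_max_normalize([raw[m][axis] for m in models])
        for model, value in zip(models, normalized):
            totals[model] += weight * value
    return totals


def readiness(score, strategic=0.75, tactical=0.5):
    """Map a composite score to a category; thresholds are hypothetical."""
    if score >= strategic:
        return "Strategic"
    if score >= tactical:
        return "Tactical"
    return "Limited Use"
```

To use it, pass one dictionary of raw per-axis scores per hardware profile (for example, one for the GPU workstation and one for the CPU system) and categorize each model's composite score with `readiness`. A weighted sum is only one of several MCDA methods; the description does not say which one MEBL applies.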
Provider:
Zenodo
Created:
2026-05-03



