DEV Community

gentic news

Posted on • Originally published at gentic.news

ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models

ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civilian benchmarks miss.

ARMOR 2025, a new benchmark published April 30 on arXiv, evaluates 21 commercial LLMs against military legal doctrines. It reveals that existing safety benchmarks miss critical gaps in models' adherence to the Law of War and Rules of Engagement.

Key facts

  • 519 doctrinally grounded prompts in the benchmark
  • 12-category taxonomy based on the OODA framework
  • 21 commercial LLMs evaluated
  • Grounded in Law of War, Rules of Engagement, Joint Ethics
  • Published on arXiv April 30, 2025

The Doctrinal Gap

ARMOR 2025 targets a blind spot in LLM safety evaluation. Existing benchmarks such as MMLU and TruthfulQA measure general knowledge and truthfulness, but none test whether models follow the legal and ethical rules governing real military operations. The benchmark extracts doctrinal text from three core sources: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. It then generates multiple-choice questions designed to preserve the intended meaning of each rule. [According to ARMOR 2025]
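The paper does not publish its data schema, but a doctrinally grounded multiple-choice item along the lines described above could be modeled roughly like this (all field names and values here are illustrative assumptions, not taken from the benchmark):

```python
from dataclasses import dataclass, field

@dataclass
class DoctrinalItem:
    """One hypothetical ARMOR-style benchmark item (illustrative schema only)."""
    source: str            # doctrinal source, e.g. "Law of War", "Rules of Engagement"
    category: str          # one of the 12 OODA-informed taxonomy categories
    question: str          # multiple-choice stem preserving the rule's intended meaning
    choices: list[str] = field(default_factory=list)
    answer_index: int = 0  # index of the doctrinally correct choice

# Illustrative example item.
item = DoctrinalItem(
    source="Rules of Engagement",
    category="Decide",
    question="Which action complies with the stated ROE?",
    choices=["Option A", "Option B", "Option C", "Option D"],
    answer_index=1,
)
print(item.category, len(item.choices))  # prints: Decide 4
```

A structure like this makes the "preserve the intended meaning" requirement auditable: each generated question carries a pointer back to its doctrinal source.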

OODA-Inspired Taxonomy

The benchmark organizes its 519 prompts through a taxonomy informed by the Observe Orient Decide Act (OODA) decision-making framework, enabling systematic testing of accuracy and refusal across military-relevant decision types. The structured 12-category taxonomy covers scenarios from targeting decisions to rules-of-engagement interpretation. [Per the arXiv preprint]
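Measuring both accuracy and refusal per taxonomy category, as described above, can be sketched with a minimal scoring loop. The records below are invented for illustration; the paper's actual harness and category labels are not published:

```python
from collections import defaultdict

# Each record: (taxonomy category, model's answer, correct answer).
# "REFUSE" marks an explicit refusal. All values are illustrative.
responses = [
    ("Observe", "B", "B"),
    ("Observe", "REFUSE", "A"),
    ("Decide", "C", "C"),
    ("Decide", "A", "C"),
]

stats = defaultdict(lambda: {"total": 0, "correct": 0, "refused": 0})
for category, answer, gold in responses:
    s = stats[category]
    s["total"] += 1
    if answer == "REFUSE":
        s["refused"] += 1
    elif answer == gold:
        s["correct"] += 1

for category, s in sorted(stats.items()):
    acc = s["correct"] / s["total"]
    refusal = s["refused"] / s["total"]
    print(f"{category}: accuracy={acc:.2f} refusal={refusal:.2f}")
```

Tracking refusal separately from accuracy matters here: a model can be "safe" by refusing everything, which is itself a failure mode in decision-support settings.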

Figure 2: Accuracy of language models across doctrinal categories in ARMOR 2025.

Results and Implications

Evaluation results across 21 commercial LLMs reveal critical gaps in safety alignment for military applications. The paper does not disclose which specific models performed best or worst, nor does it release per-model scores — a notable omission for reproducibility. However, the finding that models fail to consistently follow legal and ethical rules for military operations has immediate implications for defense contractors exploring LLM deployment.

Figure 1: ARMOR 2025 Taxonomy and Benchmark Generation Workflow. The top illustrates the 12-category taxonomy of battlefield scenarios.

The unique take here is that civilian safety alignment — the dominant paradigm in AI safety research — is insufficient for high-stakes military contexts. A model that refuses to generate hate speech might still recommend a legally questionable airstrike. ARMOR 2025 provides a concrete framework to test this, but its reliance on multiple-choice questions may miss nuanced reasoning required in real command decisions.

What to watch

Watch for follow-up papers that release per-model scores and open-source the prompt set. Also track whether the U.S. Department of Defense or allied military organizations adopt ARMOR 2025 as a procurement or deployment criterion for LLM-based systems.


