ICRA 2026

Why Cognitive Robotics Matters: Lessons from OntoAgent and LLM Deployment in HARMONIC for Safety-Critical Robot Teaming

Sanjay Oruganti, Sergei Nirenburg, Marjorie McShane, Jesse English, Michael Roberts, Christian Arndt, Ramviyas Parasuraman, Luis Sentis

AI summary

LLMs lack inherent metacognitive and diagnostic capabilities required for safety-critical robotics, making knowledge-grounded cognitive architectures essential for reliable human-robot teaming.

Cognitive Robotics, LLM Safety, OntoAgent, HARMONIC, Metacognition, Human-Robot Teaming

Problem

Safety-critical human-robot teaming demands reliable metacognitive self-monitoring, domain-grounded diagnosis, and consequence-based action selection, yet it remains untested whether LLMs can reliably provide these capabilities in embodied settings.

Approach

The authors ran a controlled comparison in the HARMONIC robotic framework, swapping the OntoAgent cognitive architecture for each of six LLMs, spanning frontier and efficient tiers, as a drop-in strategic layer. Each LLM was evaluated under internal-knowledge and knowledge-equalized conditions.

Key results

LLMs fail to verify preconditions before acting, causing unrecoverable cascade failures
Diagnostic reasoning improves with external knowledge but hallucination rates remain unchanged
OntoAgent achieves 100% task completion with full precondition verification
Metacognition and diagnosis are architectural properties, not emergent LLM scaling effects

Why it matters

The findings establish that decision authority in safety-critical embodied AI must remain with verifiable cognitive architectures, guiding the safe integration of LLMs in human-robot teaming.

Abstract

Robots operating alongside humans must recognize what they do not know before acting, diagnose problems from domain knowledge, and reason about action consequences. These capabilities are operational requirements, not optimization targets, and their absence produces silent and unrecoverable failures. We present a first-of-its-kind controlled comparison between OntoAgent, our content-centric cognitive architecture, and six LLMs spanning frontier and efficient tiers as drop-in replacements at the strategic layer of the same robotic system in HARMONIC. LLMs fail to verify their knowledge state before acting, even when given equivalent procedural knowledge. The deficit is architectural, not knowledge-based. Knowledge-grounded architectures must retain decision authority; LLMs contribute where their strengths apply.

I. MOTIVATION

Large language models are increasingly deployed as the strategic reasoning layer for robotic systems [1]–[3]. For conversational applications, stochastic errors are tolerable because humans remain in the loop to correct and regenerate. Physical embodiment removes that safety net entirely. In safety-critical human-robot teaming, a hallucinated fact becomes a wrong action, a wrong action becomes an unrecoverable failure, and that failure unfolds alongside humans who depend on the robot’s judgment.

A growing body of evidence documents systematic reasoning failures in LLMs that persist across model scale and prompting strategies [4], [5]. Whether LLMs can reliably provide the cognitive capabilities safety-critical settings demand has not been tested within a controlled embodied comparison. We identify three capabilities as critical: metacognitive self-monitoring, domain-grounded diagnosis, and consequence-based action selection. These are not emergent properties of scale but architectural commitments, provided by construction in cognitive architectures such as Soar [6], ACT-R/E [7], and OntoAgent [8], [10].

II. THE HARMONIC FRAMEWORK

HARMONIC is a dual-control cognitive-robotic architecture separating strategic (System 2) deliberative reasoning from tactical (System 1) reactive control [11] through a bidirectional interface.

The strategic layer instantiates OntoAgent [8]–[10], whose reasoning operates over four interconnected knowledge resources: an ontological world model, procedural scripts and metascripts with explicit preconditions, episodic memory, and a continuously updated situation model. Prior to any action dispatch, OntoAgent inspects the situation model to verify preconditions. Unsatisfied preconditions trigger metascript activation, such as requesting information from a teammate. Diagnostic hypotheses are generated by traversing causal relations in the ontology, and action selection includes an actionability assessment before any command issues. A single verified execution trace fully characterizes system behavior, yielding the inspectability and traceability required for safety-critical deployment.

The tactical layer executes real-time motor control through Behavior Trees [12] and a shared blackboard, engaging skills from a modular library that includes state machines, classical controllers, learned policies, and vision-language-action models.

Crucially, the strategic layer is interchangeable by design. Any reasoning system that processes timed perception frames and produces parameterized action commands can replace OntoAgent while the tactical infrastructure, perception pipeline, and task environment remain invariant. This modularity enables the controlled comparison reported here.

III. EXPERIMENTAL DESIGN

We evaluate six LLMs as drop-in replacements for OntoAgent at the strategic layer: Claude Opus 4.6 and Haiku 4.5 (Anthropic), GPT-5.2 and GPT-5 Mini (OpenAI), Gemini 3 Pro and Gemini 3 Flash (Google).
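To make the idea of a drop-in strategic layer concrete, the following is a minimal Python sketch of a strategic-layer contract (timed perception frames in, parameterized action commands out) together with a toy agent that verifies preconditions against a situation model before dispatching. It is not the HARMONIC or OntoAgent implementation: every name here (`PerceptionFrame`, `ActionCommand`, `StrategicLayer`, `Script`, `PreconditionCheckingAgent`, the `request_information` command, and the `storage locker` value) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class PerceptionFrame:
    """Timed perception input (field names are illustrative assumptions)."""
    timestamp: float
    facts: dict[str, object]


@dataclass
class ActionCommand:
    """Parameterized command for the tactical layer (hypothetical shape)."""
    name: str
    params: dict[str, object] = field(default_factory=dict)


class StrategicLayer(Protocol):
    """Anything with this method could sit at the strategic layer."""
    def step(self, frame: PerceptionFrame) -> ActionCommand: ...


@dataclass
class Script:
    """Toy procedural script: an action plus the facts it requires."""
    action: str
    preconditions: list[str]


class PreconditionCheckingAgent:
    """Toy knowledge-grounded agent: checks preconditions before dispatch."""

    def __init__(self, script: Script) -> None:
        self.script = script
        self.situation_model: dict[str, object] = {}

    def step(self, frame: PerceptionFrame) -> ActionCommand:
        # Fold the new frame into the situation model.
        self.situation_model.update(frame.facts)
        missing = [p for p in self.script.preconditions
                   if self.situation_model.get(p) is None]
        if missing:
            # Stand-in for metascript activation: ask a teammate.
            return ActionCommand("request_information", {"missing": missing})
        params = {p: self.situation_model[p] for p in self.script.preconditions}
        return ActionCommand(self.script.action, params)


fetch = Script(action="fetch", preconditions=["part", "part_location"])
agent: StrategicLayer = PreconditionCheckingAgent(fetch)

# The location is unknown in the first frame, so no fetch is dispatched.
first = agent.step(PerceptionFrame(0.0, {"part": "thermostat"}))
# "storage locker" is an invented value for illustration only.
second = agent.step(PerceptionFrame(1.0, {"part_location": "storage locker"}))

print(first.name, first.params)    # request_information {'missing': ['part_location']}
print(second.name, second.params)  # fetch {'part': 'thermostat', 'part_location': 'storage locker'}
```

Under this assumed contract, an LLM-backed agent would implement the same `step` method, which is what lets the rest of the system stay fixed while the strategic layer changes.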
Each model operates through an LLMAgent module comprising a context manager, system prompt builder, LLM provider, and action parser that translates model outputs into the standardized command format.

The evaluation scenario is a collaborative shipboard maintenance task in which the robot assists a mechanic in diagnosing an engine overheating issue and retrieving a replacement thermostat. The scenario imposes three cognitive demands that parallel the measurement targets: generating diagnostic hypotheses from domain knowledge, detecting missing information before executing a fetch plan, and selecting action primitives whose execution requirements match the deliberative-reactive timing constraints.

Each model runs five trials under two conditions, yielding N = 60 trials total. Under Internal Knowledge (IK), the LLM relies entirely on its pretrained knowledge.

ICRA 2026 Late Breaking Results Poster, presented at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), June 1-5, 2026, Vienna, Austria.
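The trial count in the experimental design follows from a full crossing of models, conditions, and repetitions. The short Python sketch below enumerates that grid as a sanity check. The model names come from the text; the label "knowledge-equalized" for the second condition is taken from the summary above, and the tuple layout and variable names are assumptions made for illustration.

```python
from itertools import product

MODELS = [
    "Claude Opus 4.6", "Haiku 4.5",
    "GPT-5.2", "GPT-5 Mini",
    "Gemini 3 Pro", "Gemini 3 Flash",
]
# Second label is taken from the summary; the layout is an assumption.
CONDITIONS = ["internal knowledge", "knowledge-equalized"]
TRIALS_PER_CELL = 5

# One (model, condition, trial index) tuple per run.
trials = list(product(MODELS, CONDITIONS, range(1, TRIALS_PER_CELL + 1)))

print(len(trials))  # 60 = 6 models x 2 conditions x 5 trials
```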

Index terms

Cognitive Control Architectures, Embodied Cognitive Science, Safety in HRI