ICRA 2026

Hierarchical LLM-VLA-Controller Integration for Task Generalization

Inhyuk Choi, Sangmoon Lee

AI summary

Integrating an LLM planner with a VLA executor and a Home Pose Controller boosts robot manipulation success rates from 9% to 90% on decomposable tasks.

LLM-VLA Integration, Robot Manipulation, Task Generalization, Hierarchical Planning, Home Pose Controller, Embodied AI

Problem

Standalone Vision-Language-Action (VLA) models struggle to generalize to complex, multi-step instructions because they rely on memorizing training trajectories rather than understanding task semantics.

Approach

A hierarchical framework uses GPT-4o to decompose high-level instructions into atomic sub-tasks. A fine-tuned OpenVLA-oft model executes each sub-task, and a dedicated Home Pose Controller runs between steps to ensure physical stability.
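
The decompose-then-execute loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `enforce_home_pose`, `run_plan`, and the two callables standing in for the VLA executor and the Home Pose Controller are hypothetical names, and the plan is assumed to arrive as a list of plain-text sub-task strings.

```python
from typing import Callable, List

HOME = "Move to Home Pose"  # command string used in the planner prompt


def enforce_home_pose(plan: List[str]) -> List[str]:
    """Return a copy of the plan with HOME between any two manipulation steps.

    Assumption: every step that is not HOME counts as a manipulation sub-task.
    """
    fixed: List[str] = []
    for raw in plan:
        step = raw.strip()
        if not step:
            continue  # ignore empty lines in the planner output
        is_home = step.lower() == HOME.lower()
        if is_home and fixed and fixed[-1] == HOME:
            continue  # drop duplicate consecutive resets
        if not is_home and fixed and fixed[-1] != HOME:
            fixed.append(HOME)  # two manipulation steps were adjacent
        fixed.append(HOME if is_home else step)
    return fixed


def run_plan(
    plan: List[str],
    execute_subtask: Callable[[str], bool],  # hypothetical VLA wrapper
    go_home: Callable[[], None],             # hypothetical controller wrapper
) -> bool:
    """Execute steps in order; stop and report failure if a sub-task fails."""
    for step in enforce_home_pose(plan):
        if step == HOME:
            go_home()
        elif not execute_subtask(step):
            return False
    return True


if __name__ == "__main__":
    log: List[str] = []
    ok = run_plan(
        ["Put the alphabet soup in the basket",
         "Put the tomato sauce in the basket"],
        execute_subtask=lambda s: log.append(s) is None,  # stub: always succeeds
        go_home=lambda: log.append(HOME),
    )
    print(ok, log)
```

Repairing the plan in code, rather than trusting the prompt alone, is a design choice of this sketch: it keeps the reset between sub-tasks even if the planner omits one.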

Key results

90% success rate on decomposable tasks vs. 9% baseline
63% overall success rate across the LIBERO-10 benchmark
Home Pose Controller prevents execution failures during sub-task transitions
Effective mapping of abstract instructions to known primitive actions

Why it matters

Demonstrates that hierarchical planning with explicit state resetting is crucial for generalizing VLA models to complex robotic manipulation, guiding future research in embodied AI.

Abstract

Vision-Language-Action (VLA) models often struggle with generalization because they tend to memorize training data rather than understand task semantics. This paper proposes a hierarchical framework that integrates Large Language Models (LLMs) with VLA models to overcome these limitations. By leveraging GPT-4o as a high-level planner, our system decomposes complex instructions into atomic sub-tasks executable by a low-level VLA. We introduce a “Home Pose Controller” between sub-tasks to ensure physical stability. Experimental results on the LIBERO-10 benchmark demonstrate that our approach achieves a 90% success rate on decomposable tasks, significantly outperforming the 9% baseline of the standalone VLA model.

I. OVERVIEW

Current VLA models, such as OpenVLA, map visual and textual inputs directly to actions. However, they lack the reasoning capabilities required for complex, multi-step instructions or abstract goals. This leads to a sharp performance drop when they encounter task combinations not seen during training. In this work, we argue that a hierarchical structure, with an LLM for reasoning and a VLA for execution, can bridge this gap. Our framework interprets high-level intent, plans a sequence of atomic actions, and maintains stability through a dedicated pose controller, resulting in robust generalization across diverse manipulation tasks.

II. METHODOLOGY

Our system employs a two-tier hierarchy: a high-level LLM agent (GPT-4o) and a low-level VLA executor (OpenVLA-oft).

A. LLM Agent (GPT-4o)

The LLM acts as a semantic bridge with two key roles: (1) decomposing complex instructions into atomic sub-tasks, and (2) interpreting abstract instructions into structured task plans. As shown in Fig. 1, the agent reasons through complex or ambiguous goals to generate a sequential list of executable primitive tasks.

B. Prompting Strategy

To effectively leverage GPT-4o as a high-level planner, we designed a structured system prompt that defines the agent’s role and constraints. The LLM is instructed to act as a robot task planner that coordinates a low-level VLA controller. A key constraint in the prompt is the mandatory insertion of a “Move to Home Pose” command between any two manipulation sub-tasks to ensure transition stability. An example of the prompt structure and the resulting decomposition is shown in Table I.

Fig. 1. LLM Agent reasoning: Decomposing high-level instructions into atomic sub-tasks (Case 1: multi-object, Case 2: abstract goals).

TABLE I
EXAMPLE OF LLM PROMPT AND SUB-TASK DECOMPOSITION

Component: Content
System Role: You are an intelligent robot task planner. Your goal is to decompose complex user instructions into atomic sub-tasks.
Tool Access: You have access to a VLA controller capable of executing primitive tasks: [Put X in basket, Close X, Pick up X, etc.].
Constraint: Between every physical manipulation task, you MUST insert a “Move to Home Pose” command to reset the robot state.
User Input: “Put both the alphabet soup and the tomato sauce in the basket.”
Output List:
(1) Put the alphabet soup in the basket
(2) Move to Home Pose
(3) Put the tomato sauce in the basket
(4) Move to Home Pose

C. Vision-Language-Action (OpenVLA-oft)

The VLA, fine-tuned on 15 primitive tasks from the LIBERO-90 dataset, executes the sub-tasks planned by the LLM Agent. The sub-tasks are executed sequentially in the order specified by the LLM.

D. Home Pose Controller

The Home Pose Controller stores the robot’s initial joint configuration and returns the robot to it between consecutive sub-tasks. Without this module, physical discontinuities at sub-task transitions cause execution failures. The LLM Agent explicitly incorporates home pose commands into the task plan to ensure smooth and stable sequential execution (Fig. 2).
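
One way to picture the Home Pose Controller is as a small joint-space routine that records the initial configuration and steps back toward it with a bounded per-step change. The sketch below is an assumption-laden illustration only: the text does not specify the control law, and the class name `HomePoseController`, the `max_step` and `tol` parameters, and the clipped-step rule are all hypothetical choices made for this example.

```python
from typing import List, Sequence


class HomePoseController:
    """Stores an initial joint configuration and steps back toward it.

    Hypothetical sketch: the clipped-step rule is an assumption, not the
    control law used in the paper.
    """

    def __init__(self, initial_joints: Sequence[float],
                 max_step: float = 0.05, tol: float = 1e-3) -> None:
        self.home = list(initial_joints)  # joint angles recorded at start
        self.max_step = max_step          # assumed per-step joint limit (rad)
        self.tol = tol                    # assumed "at home" tolerance (rad)

    def at_home(self, joints: Sequence[float]) -> bool:
        """True when every joint is within tol of its home value."""
        return all(abs(h - q) <= self.tol for h, q in zip(self.home, joints))

    def step(self, joints: Sequence[float]) -> List[float]:
        """Next joint target: each joint moves at most max_step toward home."""
        nxt = []
        for h, q in zip(self.home, joints):
            delta = max(-self.max_step, min(self.max_step, h - q))
            nxt.append(q + delta)
        return nxt

    def return_home(self, joints: Sequence[float],
                    max_iters: int = 1000) -> List[float]:
        """Iterate step() until the home pose is reached or max_iters is hit."""
        q = list(joints)
        for _ in range(max_iters):
            if self.at_home(q):
                break
            q = self.step(q)
        return q


if __name__ == "__main__":
    ctrl = HomePoseController([0.0, 0.5])
    final = ctrl.return_home([0.3, -0.2])  # pose left over from a sub-task
    print(ctrl.at_home(final), final)
```

In a real system the returned targets would be sent to the robot's joint controller between sub-tasks; here they are only computed, so the example runs without any robot or simulator.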
ICRA 2026 Late Breaking Results Poster, presented at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), June 1-5, 2026, Vienna, Austria.

Index terms

Task Planning Imitation Learning Autonomous Agents