ICRA 2026

KAN We Flow? Advancing Robotic Manipulation with 3D Flow Matching Via KAN & RWKV

Zhihao Chen, Yiyuan Ge, Ziyang Wang

AI summary

A lightweight flow-matching policy combining RWKV and Kolmogorov-Arnold Networks achieves state-of-the-art robotic manipulation success rates while cutting parameters by 86.8%.

Flow matching, RWKV, Kolmogorov-Arnold Networks, Visuomotor policy, Robotic manipulation, Parameter-efficient learning

Problem

Current generative visuomotor policies rely on heavy UNet-style backbones, causing high latency and compute costs that prevent deployment on resource-constrained robots.

Approach

KAN-We-Flow replaces large UNets with an efficient RWKV-KAN backbone for linear-time temporal mixing and spline-based feature calibration, augmented by an Action Consistency Regularization loss to stabilize one-step action generation.
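As a rough illustration of what "one-step action generation" means in a flow-matching policy, the sketch below uses the common straight-line (rectified-flow style) formulation. This formulation is an assumption here, as the summary does not state the exact probability path the paper uses. The function names, the action shape, and the `oracle` stand-in for a trained network are all hypothetical.

```python
import numpy as np

def interpolate(noise, action, t):
    """Point on the straight-line path from noise (t=0) to action (t=1)."""
    return (1.0 - t) * noise + t * action

def target_velocity(noise, action):
    """Regression target for the vector field on a straight-line path."""
    return action - noise

def one_step_sample(velocity_fn, noise):
    """Generate an action with a single Euler step from t=0 to t=1."""
    t0 = 0.0
    return noise + (1.0 - t0) * velocity_fn(noise, t0)

rng = np.random.default_rng(0)
# Hypothetical shapes: a horizon of 4 steps with 7-dimensional actions.
expert_action = rng.normal(size=(4, 7))
noise = rng.normal(size=expert_action.shape)

# Stand-in for a trained network: an oracle that returns the exact target
# velocity. A real policy would condition on observations instead.
oracle = lambda x, t: target_velocity(noise, expert_action)
sample = one_step_sample(oracle, noise)
```

With a perfect vector field, the single Euler step lands exactly on the expert action. A learned field only approximates this, which is the gap an auxiliary consistency loss would aim to narrow.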

Key results

Introduces an RWKV-KAN U-shaped backbone for efficient sequence modeling
Proposes Action Consistency Regularization (ACR) to anchor predictions to expert demonstrations
Reduces model parameters by 86.8% while enabling real-time one-step inference
Achieves state-of-the-art success rates across Adroit, Meta-World, and DexArt benchmarks

Why it matters

Enables high-performance, real-time robotic manipulation on edge devices by drastically reducing compute overhead without sacrificing accuracy.

Abstract

Diffusion-based visuomotor policies excel at modeling action distributions but are inference-inefficient, since recursively denoising from noise to policy requires many steps and heavy UNet backbones, which hinders deployment on resource-constrained robots. Flow matching alleviates the sampling burden by learning a one-step vector field, yet prior implementations still inherit large UNet-style architectures. In this work, we present KAN-We-Flow, a flow-matching policy that draws on recent advances in Receptance Weighted Key Value (RWKV) and Kolmogorov-Arnold Networks (KAN) from vision to build a lightweight and highly expressive backbone for 3D manipulation. Concretely, we introduce an RWKV-KAN block: an RWKV first performs efficient time/channel mixing to propagate task context, and a subsequent GroupKAN layer applies learnable spline-based, groupwise functional mappings to perform feature-wise nonlinear calibration of the action mapping on RWKV outputs. Moreover, we introduce an Action Consistency Regularization (ACR), a lightweight auxiliary loss that enforces alignment between predicted action trajectories and expert demonstrations via Euler extrapolation, providing additional supervision to stabilize training and improve policy precision. Without resorting to large UNets, our design reduces parameters by 86.8%, maintains fast runtime, and achieves state-of-the-art success rates on Adroit, Meta-World, and DexArt benchmarks. Our project page can be viewed in link.
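The abstract describes ACR only at a high level, so the following is a minimal sketch of one plausible reading, not the paper's definition. It assumes a straight-line flow path, Euler-extrapolates the predicted velocity from time t to t=1, and penalizes the squared distance to the expert action. The function names, the `acr_weight` value, and the tensor shapes are all hypothetical.

```python
import numpy as np

def euler_extrapolate(x_t, v_pred, t):
    """Extrapolate from the intermediate point x_t to t=1 in one Euler step."""
    return x_t + (1.0 - t) * v_pred

def acr_loss(x_t, v_pred, t, expert_action):
    """Assumed ACR form: MSE between the extrapolated and expert actions."""
    a_hat = euler_extrapolate(x_t, v_pred, t)
    return float(np.mean((a_hat - expert_action) ** 2))

def total_loss(v_pred, v_target, x_t, t, expert_action, acr_weight=0.1):
    """Flow-matching MSE plus the weighted auxiliary term (weight is hypothetical)."""
    fm = float(np.mean((v_pred - v_target) ** 2))
    return fm + acr_weight * acr_loss(x_t, v_pred, t, expert_action)

rng = np.random.default_rng(0)
expert = rng.normal(size=(4, 7))      # hypothetical: horizon 4, 7-dim actions
noise = rng.normal(size=expert.shape)
t = 0.25
x_t = (1.0 - t) * noise + t * expert  # point on the assumed straight-line path
v_target = expert - noise             # exact velocity of that path

exact = acr_loss(x_t, v_target, t, expert)        # perfect prediction
off = acr_loss(x_t, v_target + 0.2, t, expert)    # velocity off by a constant 0.2
combined = total_loss(v_target + 0.2, v_target, x_t, t, expert)
```

Under these assumptions a velocity error is scaled by (1 - t) before it reaches the action space, so the auxiliary term weights errors made early on the path more heavily than errors near t=1.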

Index terms

AI-Enabled Robotics, Learning from Experience, Embodied Cognitive Science