LLaKey: Follow My Basic Action Instructions to Your Next Key State
Zheyi Zhao, Ying He, Fei Yu, Pengteng Li, Fan Zhuo, Xilong Sun
Abstract
In 3D object manipulation, collecting expert data for end-to-end imitation learning has become a mainstream approach. Although successful, previous works neglect the guiding role of language in action execution. These methods lack an understanding of action semantics: a single category of instructions guides multiple action sequences, resulting in overlearned object semantics and vague action semantics. To address this limitation, we introduce a novel framework named LLaKey, which breaks down skill commands into more detailed action instructions based on key states for fine-grained action control. Specifically, LLaKey first leverages the knowledge encoded in pre-trained large-scale models to fine-tune an action instruction conductor. The instructions it produces are then executed by a downstream action model. Comprehensive experiments show that LLaKey significantly surpasses baselines, with a relative improvement of 15% across nine complex and varied skill tasks, demonstrating the superiority of our method.