ICRA 2024

FBPT: A Fully Binary Point Transformer

Zhixing Hou, Yuzhang Shang, Yan Yan

Abstract

This paper presents a novel Fully Binary Point Cloud Transformer (FBPT) model with the potential to be widely applied and extended in robotics and mobile devices. By compressing the weights and activations of a 32-bit full-precision network to 1-bit binary values, the proposed binary point cloud Transformer network significantly reduces the storage footprint and computational resource requirements of neural network models for point cloud processing tasks, compared to full-precision point cloud networks. However, achieving a fully binary point cloud Transformer network, in which all parts except the task-specific modules are binary, poses challenges and bottlenecks in quantizing the activations of Q, K, V and self-attention in the attention module, as they do not adhere to simple probability distributions and can vary with the input data. Furthermore, in our network, the binary attention module suffers a degradation of self-attention due to the uniform distribution that occurs after the softmax operation. The primary focus of this paper is on addressing the performance degradation caused by binary point cloud Transformer modules. We propose a novel binarization mechanism called dynamic-static hybridization. Specifically, our approach combines static binarization of the overall network model with fine-grained dynamic binarization of data-sensitive components. Furthermore, we use a novel hierarchical training scheme to obtain the optimal model and binarization parameters. These improvements allow the proposed binarization method to outperform binarization methods applied to convolutional neural networks when used in point cloud Transformer structures. To demonstrate the superiority of our algorithm, we conducted experiments on two different tasks: point cloud classification and place recognition.
In point cloud classification, our model achieved an accuracy of 90.9%, only 2.3% below the full-precision network. For the place recognition task, on the Oxford RobotCar dataset we achieved an average recall of 91.02% in the top@1% metric and 82.87% in the top@1 metric. Moreover, our model reduces model size and FLOPs (floating-point operations) by over 80% compared to the baseline.
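
The distinction between static and dynamic binarization drawn in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the function names, the choice of the mean as the dynamic threshold, and the mean absolute deviation as the scale are all assumptions made for illustration only.

```python
from statistics import mean


def sign(x):
    """Map a real value to {-1, +1}; zero is sent to +1 by convention here."""
    return 1 if x >= 0 else -1


def binarize_static(values, threshold=0.0, scale=1.0):
    """Static binarization (hypothetical helper): threshold and scale are
    fixed parameters, e.g. learned in training, independent of the input."""
    return [scale * sign(v - threshold) for v in values]


def binarize_dynamic(values):
    """Dynamic binarization (hypothetical helper): threshold and scale are
    recomputed from each input. Assumed statistics: the mean as threshold,
    the mean absolute deviation as scale."""
    threshold = mean(values)
    scale = mean(abs(v - threshold) for v in values)
    return [scale * sign(v - threshold) for v in values]


# Made-up activations whose distribution is shifted away from zero.
acts = [2.0, 3.0, 5.0, 6.0]

static_out = binarize_static(acts)    # [1.0, 1.0, 1.0, 1.0]
dynamic_out = binarize_dynamic(acts)  # [-1.5, -1.5, 1.5, 1.5]
```

With a fixed threshold of zero, every entry of the shifted input collapses to the same binary value, whereas the data-dependent threshold keeps both signs. This is one plausible reading of why input-sensitive components such as Q, K, V might benefit from dynamic treatment.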

Index terms

Deep Learning for Visual Perception; Deep Learning Methods; Recognition