IROS 2024

FogROS2-FT: Fault Tolerant Cloud Robotics

Kaiyuan Chen, Kush Hari, Trinity Chung, Michael Wang, Nan Tian, Christian Juette, Jeffrey Ichnowski, Liu Ren, John Kubiatowicz, Ion Stoica, Ken Goldberg

Abstract

Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and cloud can be prone to variations in network Quality-of-Service (QoS). We present FogROS2-FT (Fault Tolerant) to mitigate these issues by introducing a multi-cloud extension that automatically replicates independent stateless robotic services, routes requests to these replicas, and directs the first response back. With replication, robots can still benefit from cloud computations even when a cloud service provider is down or there is low QoS. Additionally, many cloud computing providers offer low-cost "spot" computing instances that may shut down unpredictably. Normally, these low-cost instances would be inappropriate for cloud robotics, but the fault-tolerant nature of FogROS2-FT allows them to be used reliably. We demonstrate FogROS2-FT's fault-tolerance capabilities in 3 cloud-robotics scenarios in simulation (visual object detection, semantic segmentation, motion planning) and 1 physical robot experiment (scan-pick-and-place). Running on the same hardware specification, FogROS2-FT achieves motion planning with up to 2.2x cost reduction and up to a 5.53x reduction in 99th-percentile (P99) long-tail latency. FogROS2-FT reduces the P99 long-tail latency of object detection and semantic segmentation by 2.0x and 2.1x, respectively, under network slowdown and resource contention. Videos and code are available at https://sites.google.com/view/fogros2-ft.

I. INTRODUCTION

The complexity of foundational models [1], [2], [3] and sophisticated robot algorithms [4], [5] exceeds most onboard robot computing capabilities.
Cloud robotics provides shared access to on-demand resources and services with boosted performance and simplified management, enabling the deployment of compute-intensive algorithms on low-cost, mobile robots without powerful on-board hardware, such as GPUs, TPUs, and high-performance CPUs. In previous research, we developed FogROS2, which enables unmodified robotics code in Robot Operating System 2 (ROS2) to offload heavy computing modules to an independent set of cloud hardware resources and accelerators. FogROS2 used on-demand servers that guarantee dedicated computing resources with high uptime (e.g., 99.99% [6]). However, the network quality of service (QoS) between robots and the cloud can vary, and during rare cloud outages, robots lose all cloud-computing benefits. Additionally, as on-demand instances can be expensive, many cloud providers offer spot VMs¹ at a significantly reduced price with the caveat that they can shut down unpredictably, making them (without fault tolerance) unsuitable for many robotics applications.

¹ Department of Electrical Engineering and Computer Science
² Robert Bosch Research and Technology Center North America, Sunnyvale, CA, USA
³ Robotics Institute, Carnegie Mellon University
⁴ Department of Industrial Engineering and Operations Research
¹,⁴ University of California, Berkeley, CA, USA
† For correspondence and questions: kych@berkeley.edu

Fig. 1: FogROS2-FT Overview. (Top) Cloud robotics applications, such as grasp planning, when deployed on a single cloud server become a single point of failure. (Bottom) Instead, FogROS2-FT provides a cost-efficient and fault-tolerant solution that deploys unmodified ROS2 applications to multiple low-cost cloud servers, making cloud-robotics applications resilient to individual server termination and network slowdowns.
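The replicate-and-take-the-first-response idea stated in the abstract can be illustrated with a small, self-contained sketch. This is not the FogROS2-FT implementation, which operates on ROS2 services. It is a plain Python `asyncio` simulation, and every name in it (`call_replica`, `first_response`, the replica labels, the latencies) is a hypothetical stand-in chosen for illustration.

```python
import asyncio


async def call_replica(name, request, latency_s, fail=False):
    """Hypothetical stand-in for one stateless cloud service replica."""
    await asyncio.sleep(latency_s)  # simulated network + compute time
    if fail:
        # Simulates a preempted spot VM or an unreachable provider.
        raise ConnectionError(f"{name} unavailable")
    return name, f"result for {request}"


async def first_response(request, replicas):
    """Send the same request to every replica; return the first successful reply.

    `replicas` is a list of (name, latency_s, fail) tuples (assumed format).
    """
    tasks = [
        asyncio.create_task(call_replica(name, request, latency_s, fail))
        for name, latency_s, fail in replicas
    ]
    try:
        # as_completed yields awaitables in the order the replicas finish.
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except ConnectionError:
                continue  # this replica failed; wait for the next one
        raise RuntimeError("all replicas failed")
    finally:
        # Late responses are no longer needed once a reply has been chosen.
        for task in tasks:
            task.cancel()


if __name__ == "__main__":
    replicas = [
        ("spot-a", 0.05, True),      # fastest, but fails
        ("spot-b", 0.10, False),     # first successful reply
        ("on-demand", 0.30, False),  # slower reply, cancelled
    ]
    print(asyncio.run(first_response("detect", replicas)))
```

Because the services are stateless and independent, any replica's answer is acceptable, so the caller only pays the latency of the fastest healthy replica and tolerates the loss of the others.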
In this work, we introduce FogROS2-FT, a fault-tolerant extension to FogROS2 [7] that provides robust performance against variable network QoS, infrastructure unavailability, and stochasticity of the robotic algorithms, increasing the reliability and responsiveness of cloud robotics. By adding redundancy to cloud computation, we enable cloud-robotics tasks to continue operating effectively despite the following failures:

(A) Resource Unavailability: Although cloud services have high uptime and are managed by dedicated experts, outages can still occur. For example, an AWS outage affected the availability of the iRobot applications [8]. In addition, spot VMs may shut down unpredictably.

(B) Resource Oversubscription: The cloud enables flexible usage of computational resources. For example, one can oversubscribe to a system by allocating fewer resources than the sum of resources required by all robots, based on the

¹ In Google Cloud Platform (GCP) and Microsoft Azure, these are called Spot Virtual Machines. In Amazon Web Services, these are called Spot Instances. More generally, they are also known as preemptible or transient machines.

2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 14-18, 2024, Abu Dhabi, UAE. 979-8-3503-7769-9/24/$31.00 ©2024 IEEE

Index terms

Distributed Robot Systems; Networked Robots; Multi-Robot Systems