Towards Grey Fault Tolerant Cloud Systems
- Speaker: Ryan Huang
- Date & Time: Thursday 01 April 2021, 15:00 - 16:00
- Venue: meet.google.com/rvx-amuv-xrh
Abstract
Building robust, large-scale distributed systems is notoriously challenging. Decades of research have made significant advances in tackling this challenge with mature techniques such as state-machine replication. These techniques usually assume a fail-stop model. Ample real-world evidence, however, suggests that faults in modern cloud infrastructure are often “grey”, in which a component is severely impaired but still appears to be working. These grey failures cannot be effectively detected or handled by existing solutions.
In this talk, I will discuss the grey failure problem. Using real-world examples, I will argue that a key trait of the subtle grey failure mode is a form of differential observability. Based on this insight, I will present Panorama, a solution that harnesses observability in large systems to detect grey failures by using instrumentation to convert any system component into an in-situ observer. To further enhance the inherent system observability, I will propose an intrinsic software watchdog abstraction and a tool called OmegaGen that automatically generates customized watchdogs for a given program by using a program reduction technique. I will conclude by outlining some open challenges in making cloud systems grey-fault-tolerant.
Bio:
Ryan Huang is an Assistant Professor in the Department of Computer Science at Johns Hopkins University. He leads the Ordered Systems Lab at JHU, which conducts research broadly in distributed systems, operating systems, and cloud and mobile computing. His work received best paper awards at OSDI 2016, ASPLOS 2019, and NSDI 2020, and a best paper award nomination at MICRO 2018. He is a recipient of the NSF CAREER Award (2020). Dr. Huang received a B.S. degree in Computer Science (Economics minor) from Peking University (2010) and a Ph.D. degree from UC San Diego (2016).
Series
This talk is part of the Computer Laboratory Systems Research Group Seminar series.
Included in Lists
- All Talks (aka the CURE list)
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge talks
- Chris Davis' list
- CL's SRG seminar
- Computer Laboratory Systems Research Group Seminar
- Department of Computer Science and Technology talks and seminars
- Interested Talks
- meet.google.com/rvx-amuv-xrh
- ndk22's list
- ob366-ai4er
- rp587
- School of Technology
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.