Reliability and Security

Our research focuses on the evaluation and design of dependable cyber-physical systems, extreme-scale systems, and cloud computing infrastructures. We tackle security and reliability problems using field data collected from real-world computing environments. Such data are an invaluable resource that allows us to analyze and accurately understand trusted system operation and failure conditions that cannot be recreated in synthetic environments.

Our projects span probabilistic graphical models and game theory for the detection of and response to attacks against large-scale computer networks, the combined analysis of cyber and physical domain data for preemptive detection of unreliable operation in cyber-physical systems, new methods to characterize and measure failures in extreme-scale systems from large volumes of heterogeneous data, fault injection campaigns on real systems, and the formal verification of dependability in cloud infrastructures using traces from virtualized environments.

We test methods for increasing dependability on production systems, such as a Raven surgical robot, Cray supercomputers, and the computing infrastructure of the National Center for Supercomputing Applications (NCSA). We collaborate with industry partners and national laboratories, including Los Alamos National Laboratory, NERSC, Sandia National Laboratories, Cray, IBM, Microsoft, and Nokia-Bell Labs.


Data-driven Reliability:

This branch of research focuses on improving the reliability of extreme-scale and cyber-physical systems. Our methods include extensive measurements of large volumes of failure data (e.g., event logs, performance counters, metrics) and incident reports from (i) supercomputers at national labs and (ii) surgical robot operations. For both surgical robots and supercomputers, we run extensive fault injection campaigns to validate and improve our understanding of failure modes in the real systems.


Data-driven Security:

This branch of research focuses on the preemptive detection and mitigation of suspicious operations in large-scale interconnected networks and critical systems. Our case studies cover large-scale computing systems, cloud computing infrastructures, and cyber-physical systems, including power grids and industrial control systems.


Dependability Assessment Tools: