Our research focuses on the evaluation and design of dependable cyber-physical systems, extreme-scale systems, and cloud computing infrastructures. We tackle security and reliability problems using field data collected from real-world computing environments. Such data are an invaluable resource that allows us to analyze and accurately understand trusted system operation and failure conditions that cannot be recreated in synthetic environments.
Our projects span:
- probabilistic graphical models and game theory for the detection of, and response to, attacks against large-scale computer networks;
- combined analysis of cyber- and physical-domain data for the preemptive detection of unreliable operation in cyber-physical systems;
- new methods to characterize and measure failures in extreme-scale systems from large volumes of heterogeneous data;
- fault injection campaigns on real systems; and
- formal verification of dependability in cloud infrastructures using traces from virtualized environments.
We test methods for increased dependability on production systems, such as a Raven surgical robot, Cray supercomputers, and the National Center for Supercomputing Applications (NCSA) computing infrastructures. We collaborate with industry and national labs, including Los Alamos National Lab, NERSC, Sandia National Labs, Cray, IBM, Microsoft, and Nokia Bell Labs.
This branch of research focuses on improving the reliability of extreme-scale systems and cyber-physical systems. Our methods include extensive measurements of large volumes of failure data (e.g., event logs, performance counters, metrics) and incident reports (i) from supercomputers in national labs and (ii) from surgical robot operations. For both surgical robots and supercomputers, we execute extensive fault injection campaigns aimed at validating and improving our understanding of failure modes in real systems.
- Resiliency For eXtreme Scale Systems
- Dependability of Surgical Robots
- Safety and Reliability of Autonomous Vehicles
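To illustrate the fault-injection side of this work, the sketch below runs a toy bit-flip injection campaign in Python. The function names (`inject_bit_flip`, `run_campaign`, `dot_product`), the synthetic workload, and the outcome taxonomy (masked / silent data corruption / crash) are illustrative assumptions, not the group's actual tooling; real campaigns target production robot and supercomputer software.

```python
import random
import struct

def inject_bit_flip(value, rng):
    """Flip one random bit in the IEEE-754 encoding of a float,
    a common fault model for transient hardware errors."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    flipped = bits ^ (1 << rng.randrange(64))
    return struct.unpack("<d", struct.pack("<Q", flipped))[0]

def dot_product(inject=None):
    """Toy workload: a dot product whose third partial product
    can be corrupted by an injected fault."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [0.5, 0.25, 0.125, 0.0625]
    acc = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        term = x * y
        if inject is not None and i == 2:
            term = inject(term)          # fault injected here
        acc += term
    return acc

def run_campaign(workload, n_trials=200, seed=0):
    """Inject one fault per trial and classify each outcome as
    masked, silent data corruption (SDC), or crash."""
    rng = random.Random(seed)
    outcomes = {"masked": 0, "sdc": 0, "crash": 0}
    golden = workload()                  # fault-free reference result
    for _ in range(n_trials):
        try:
            result = workload(lambda v: inject_bit_flip(v, rng))
            if result == golden:
                outcomes["masked"] += 1  # fault had no visible effect
            else:
                outcomes["sdc"] += 1     # wrong answer, no error raised
        except ArithmeticError:
            outcomes["crash"] += 1       # fault raised a detectable error
    return outcomes

outcomes = run_campaign(dot_product)
print(outcomes)
```

The masked/SDC/crash split mirrors the standard outcome classification used in fault-injection studies; comparing these distributions against field failure data is what lets an injection campaign validate an assumed failure model.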
This branch of research focuses on the preemptive detection and mitigation of suspicious operations in large-scale interconnected networks and critical systems. The case studies we consider are large-scale computing systems, cloud computing infrastructures, and cyber-physical systems, including power grids and industrial control systems.
Dependability Assessment Tools:
Machine learning has been studied and applied in various domains, solving problems with intelligence derived from data. The cybersecurity domain has adopted this data-driven technology to advance intrusion detection methods, and various aspects of machine learning for cybersecurity have been studied. However, little work has considered the possibility that the same technology can be turned against security. How can malware evolve with support from machine learning algorithms? Are we prepared for such an advanced threat? In preparation for such threats, we investigate the possibility of smart malware: advanced malware embedded with machine learning algorithms.
For a range of computing applications (e.g., large computing infrastructures, cyber-physical systems, or robots), we characterize the computational workload. Based on the distinct characteristics of different systems, we determine which procedures in a security attack can be replaced by data-driven algorithms. For those procedures, we design and implement potential threats that use machine learning algorithms to minimize human (attacker) intervention. The new threats are tested and evaluated in simulated environments (if not in the physical systems). Beyond demonstrating the potential threat, we evaluate and address the limitations of current detection methods against such advanced threats.
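The last step, evaluating the limitations of current detection methods, can be illustrated with a deliberately simple sketch: a threshold detector fit on benign activity rates, tested against a noisy attacker and a rate-limited ("low-and-slow") one that stays under the threshold. All numbers and names here are synthetic assumptions chosen for illustration, not data from our studies.

```python
import statistics

def fit_threshold(normal_rates, k=3.0):
    """Fit a simple mean + k*stddev alarm threshold on benign rates."""
    mu = statistics.mean(normal_rates)
    sigma = statistics.stdev(normal_rates)
    return mu + k * sigma

def detection_rate(threshold, attack_rates):
    """Fraction of attack observations that trip the alarm."""
    flagged = sum(1 for r in attack_rates if r > threshold)
    return flagged / len(attack_rates)

# Benign traffic: roughly 100 events/s with modest variance.
normal = [100 + (i % 7) - 3 for i in range(200)]
threshold = fit_threshold(normal)

# A noisy attacker bursts far above the baseline and is easy to flag;
# a rate-limited attacker hugs the baseline and evades the detector.
noisy_attack = [180, 200, 190, 210, 175]
stealthy_attack = [104, 103, 105, 102, 104]

print(detection_rate(threshold, noisy_attack))
print(detection_rate(threshold, stealthy_attack))
```

The point of the sketch is the gap between the two detection rates: a fixed statistical threshold that catches a naive attacker can be evaded entirely by an adversary that adapts its rate to the defender's baseline, which is exactly the kind of limitation an ML-assisted threat can exploit automatically.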
Kaleidoscope: Bringing AI techniques to system resilience
Highly complex distributed and parallel computing systems and services experience frequent reliability failures and performance anomalies (e.g., tail latency). These events can cause loss of availability, application failures, and performance loss, in turn wasting computational resources and creating public-relations nightmares for cloud providers. Such issues can stem from random component failures, resource contention, design bugs, or configuration problems. To address these challenges, we have built a suite of tools under the umbrella of the Kaleidoscope project that combines probabilistic machine learning methods with a deep understanding of the design and implementation of the underlying computing systems and applications to enable the following:
- Design of resilience and performance metrics that enable field-failure studies of error logs and performance data, to understand the failures and resource contention that cause outages or performance anomalies in systems and applications.
- Design and deployment of smart monitors on large-scale production systems that proactively expose system and application health issues.
- Detection and localization of performance anomalies and failures at runtime.
- Mining of failure scenarios, systematic issues (including bugs and design gaps), and configuration issues using guided fault injection methods, with the objective of breaching the system's key performance indicators (such as availability, reliability, or safety).
- Design of response and mitigation systems to handle user traffic under failures and performance anomalies.
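The detection and localization item above can be sketched with a minimal probabilistic baseline: fit a per-node Gaussian model of latency from a healthy period, score live samples by their standard (z) score, and rank nodes by how often they exceed a threshold. The node names, latency values, and the z-score heuristic are illustrative assumptions; Kaleidoscope's actual models are richer than this sketch.

```python
import statistics

def fit_baseline(samples):
    """Fit a per-node Gaussian baseline from healthy-period latencies."""
    return statistics.mean(samples), statistics.stdev(samples)

def anomaly_score(x, mu, sigma):
    """Standard score: how many stddevs a sample sits above baseline."""
    return (x - mu) / sigma

def localize(baselines, window, z_thresh=4.0):
    """Rank nodes by how many recent samples exceed the z-threshold,
    a crude heuristic for localizing the degraded component."""
    counts = {}
    for node, samples in window.items():
        mu, sigma = baselines[node]
        counts[node] = sum(1 for x in samples
                           if anomaly_score(x, mu, sigma) > z_thresh)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Healthy training window: ~10 ms latency per node.
healthy = {f"node{i}": [10.0 + 0.1 * (j % 5) for j in range(50)]
           for i in range(3)}
baselines = {n: fit_baseline(s) for n, s in healthy.items()}

# Live window: node1 suffers tail-latency inflation.
live = {
    "node0": [10.1, 10.2, 10.0, 10.3],
    "node1": [10.2, 48.0, 52.0, 47.5],   # contended / degraded node
    "node2": [10.0, 10.1, 10.2, 10.1],
}
ranking = localize(baselines, live)
print(ranking[0][0])  # node1 ranks as the most anomalous node
```

In production, the same idea runs continuously over streamed telemetry: the baseline is what the smart monitors learn during healthy operation, and the ranking feeds the response and mitigation layer that reroutes traffic away from the flagged component.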