The DEPEND group has a long history of developing fault-injection tools, frameworks, and methodologies to evaluate the fault tolerance, reliability, availability, and failure characteristics of high-performance computing (HPC) systems. These tools have been applied to evaluate production systems by IBM, Cray, Huawei, Honeywell, and Infosys, among others.
Research Staff:
- Daniel Chen – Research Programmer
Project Description
Modern computing systems are expected to operate reliably and continuously, without downtime. This level of reliability is achieved through continuous improvement of the hardware and software, as well as of the failure prediction, prevention, detection, and recovery techniques employed on the system. Supporting that improvement requires tools, methods, and metrics that can create realistic failure conditions. This research thrust aims to keep developing and improving state-of-the-art fault-injection tools and methodology for experimentally evaluating new generations of computing systems. The fault-injection tools and framework we developed have been used to evaluate telecommunication systems, enterprise servers, space mission apparatus, and cloud-based business applications. We have fault-injection tools for various layers of a modern computing platform, as illustrated in the figure above.
Fault-Injection Tools Developed to Date
- NFTAPE – Fault Injection Framework
  - Facilitates automated fault-injection studies
- PFI – Ptrace-based Fault Injector
  - Linux based
  - User-level fault injector
  - x86, PowerPC
- BFI – Breakpoint-based Fault Injector
  - Linux based
  - User-level fault injector
  - x86, PowerPC
- WFI – Windows Fault Injector
  - Based on WinDbg
  - User-level fault injector
  - Windows system
- GDBFI – GDB-based Fault Injector
  - Based on GDB
  - Any system that supports GDB
  - User-, kernel-, and UEFI-level fault injector
  - Linux system
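
The tools listed above are not described in detail here, so the following is only a generic illustration of what an automated fault-injection study involves, not the API or behavior of NFTAPE or any of the injectors. It is a minimal sketch that assumes a single-bit-flip fault model and a toy target, a CRC32 check over an in-memory buffer. The names `flip_bit` and `run_campaign` are hypothetical. Each trial corrupts a fresh copy of the data at a random location and records whether the target noticed the corruption.

```python
import random
import zlib


def flip_bit(data: bytearray, byte_index: int, bit_index: int) -> None:
    """Flip one bit in place (single-bit-flip fault model)."""
    data[byte_index] ^= 1 << bit_index


def run_campaign(payload: bytes, trials: int, seed: int = 0) -> dict:
    """Inject one random bit flip per trial and classify the outcome.

    Hypothetical harness: the "system under test" is just a CRC32
    check over a buffer, standing in for a real error detector.
    """
    rng = random.Random(seed)             # seeded so a campaign is repeatable
    golden_crc = zlib.crc32(payload)      # fault-free reference value
    outcomes = {"detected": 0, "undetected": 0}

    for _ in range(trials):
        target = bytearray(payload)       # fresh, uncorrupted copy per trial
        byte_index = rng.randrange(len(target))
        bit_index = rng.randrange(8)
        flip_bit(target, byte_index, bit_index)

        if zlib.crc32(target) != golden_crc:
            outcomes["detected"] += 1     # detector noticed the corruption
        else:
            outcomes["undetected"] += 1   # silent data corruption
    return outcomes


if __name__ == "__main__":
    print(run_campaign(b"fault injection", trials=50, seed=1))
```

Because a CRC catches every single-bit error, all trials in this toy campaign end up in the `detected` bucket. A real study would inject into the registers or memory of a running process, for example through ptrace, breakpoints, or a debugger as the tools above do, and would also track outcomes such as crashes and hangs.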