ARMORs: A Software-Implemented Fault Tolerance Environment
Introduction
Computers and computer networks provide critical services in areas such as electronic commerce, financing, health, and telecommunication.
High Availability and Security Problems. When the systems providing vital services to an organization are down (due to either random errors or malicious attacks), the cost can be devastating: lost opportunities, lost revenues, noncompliance penalties, and high maintenance costs. More importantly, partners, customers, and suppliers affected by the system shutdowns perceive the organization as poorly run and not quite suited to meet their needs. If the organization is concerned with public health or safety, lives can be lost. Consequently, any discussion of availability and security has to start from the cost of information system downtime and of information loss or unauthorized disclosure.
High Availability. The telecom industry and circuit-switched networks have historically treated high availability as a fundamental requirement for services. For example, a switching component has less than 5 minutes of downtime per year and provides better than 99.999% availability. There is a growing need for other high-availability systems, especially in enterprise and corporate IT networks and services (e.g., email, payroll, e-commerce, and directories). Moreover, the current trend towards providing critical services over the Web and the growth in e-commerce drive the need for high availability. Increasing globalization requires that corporate IT and data networks provide service 24 hours a day, 7 days a week. Worldwide customers want access to product information, and corporate road warriors need access to data and services 24/7. There is no good time for a network outage or downtime.
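To make the availability figures concrete, the downtime budget implied by an availability fraction can be worked out directly. The snippet below is a small illustrative calculation, not part of ARMOR; the function name `downtime_minutes_per_year` is chosen here for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the downtime budget, in minutes per year, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, availability in [("three nines", 0.999), ("five nines", 0.99999)]:
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: about {minutes:.1f} minutes of downtime per year")
```

Five nines leaves roughly 5.3 minutes of downtime per year, so a component with under 5 minutes of annual downtime is consistent with "better than 99.999%". Three nines, by comparison, leaves about 8.8 hours per year.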
Security. In addition, because malicious attacks have grown in both intensity and number, security has become an issue of primary importance in designing robust systems. Attacks in which malicious code is exported to a host computer and accesses secure information there are commonplace. The extent of attacks ranges from exhaustion of system resources to seizing of root privileges and, ultimately, unrecoverable damage.
Achieving high availability and security calls for open, flexible, and configurable solutions that can address, in a cost-effective way, a variety of dependability requirements, including availability, maintainability, integrity, and security. Such solutions can be based on prebuilt modules that are adaptable to a wide range of applications and reusable in different environments.
ARMOR High Availability and Security Infrastructure
High availability is achieved by detecting that a failure has occurred and recovering from it rapidly to minimize loss of service. Security protection is attained through compiler-assisted memory layout randomization and by guarding function return addresses against malicious tampering.
Rapid response to random errors and malicious attacks entails that the system must make correct decisions in an automated and autonomous manner. Consequently, the integrity of the reliability and security infrastructure becomes paramount. It is also typically the most difficult quality to assure, because the precise conditions of field failures and security threats are difficult to anticipate or to reproduce with enough realism to verify the capabilities of the infrastructure. Thus, the fault tolerance of the infrastructure and its resilience to malicious attacks are as important as the robustness of the application itself.
The ARMOR infrastructure is designed to manage redundant resources across interconnected nodes, foil security threats, detect errors in both the user applications and the infrastructure components, and recover quickly from failures when they occur. The infrastructure components are multithreaded processes that can be distributed across the network to provide customizable services to the application. The ARMOR technology holds several advantages over its competitors in the domain of high availability and security software infrastructures:
- Library of fundamental high availability and security software modules, which provide security protection, error detection, error recovery, and management services to the user application.
- Core ARMOR runtime infrastructure for integrating high availability and security software modules into a solution that meets dependability and performance requirements of the application.
- Unique process architecture for rapidly developing new high availability and security software modules, including those customized for a particular application domain.
- Dependable runtime infrastructure ensures that failures in the ARMOR processes are fully recoverable and that errors do not create security holes in the system.
- Minimal performance and memory footprint versions of the high availability and security software modules and core ARMOR runtime infrastructure are available for resource-constrained execution environments.
Integration with Application
Although a significant amount of protection can be provided by generic, external solutions that are transparent to the application, tighter integration with the application usually allows for more sophisticated error detection and recovery. Three levels of application support are offered by the ARMOR infrastructure.
- Level 1: transparent and external support. This approach offers external fault management solutions that are largely independent of the application. It provides widely applicable techniques for masking faults and security vulnerabilities that require no modification of the application. Example capabilities of this level include: (1) reliability support: detection of hardware failures, detection of application process failures, and restart of failed application processes on the same or a different machine (node); this technology provides availability of more than 0.999 with an overhead of about 5%; and (2) security support: masking of security vulnerabilities via randomization and control information encoding; this technology has been proven to protect against 60% of attacks with no runtime overhead.
- Level 2: transparent extension of standard libraries. In this approach, standard programming interfaces (e.g., operating system calls or standard C library calls) are hardened with additional capabilities. Most of the changes at this level are made to libraries in user space; only the more advanced fault tolerance and security protection mechanisms require extensions to the operating system. Example capabilities of this level include: (1) reliability support: detection of application hangs, protection against data errors on disks through duplication of file writes, and automatic re-establishment of broken TCP/IP socket connections upon recovery, transparent to both client and server; and (2) security support: protection of function return addresses.
- Level 3: instrumentation with ARMOR APIs. The infrastructure defines an API for the application to interact with the ARMOR processes. Fault tolerance and security protection mechanisms can be tightly integrated within the application processes themselves, permitting a greater degree of customization than is available through the other two approaches. Example internal mechanisms include: (1) reliability support: checkpointing of memory state, adaptive reconfiguration of the error detection and recovery services based on the phases of the application’s execution, and application-specific self-tests that external ARMOR processes can invoke to assess the health of the application; and (2) security support: automated recovery from security attacks with minimal performance loss. This level also supports automatic extraction of program invariants to form separate intrusion and error detectors as ARMOR building blocks.
Because the ARMOR infrastructure is flexible, security protection, detection, and recovery services (embodied as reconfigurable high availability and security modules) can be added to or removed from the infrastructure, depending on application requirements. The modular design ensures that customers pay only for the strategies they use, in terms of both cost and hardware resources. In addition, a clear upgrade path exists through which further protection capabilities can be added to the ARMOR infrastructure in the future.
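The Level 1 idea of detecting a failed application process and restarting it without modifying the application can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ARMOR code: the function name `supervise`, the `max_restarts` budget, and the use of a nonzero exit code as the only failure signal are all choices made here. It handles crash failures on a single node only; hang detection and migration to another node are out of scope.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def supervise(command: List[str], max_restarts: int = 3) -> int:
    """Run `command`, restarting it after a nonzero exit up to `max_restarts` times.

    Returns the number of restarts that were performed. Raises RuntimeError if
    the process still fails once the restart budget is used up.
    """
    restarts = 0
    while True:
        # Launch the unmodified application as a child process and wait for it.
        exit_code = subprocess.run(command).returncode
        if exit_code == 0:
            return restarts  # clean exit: nothing left to recover
        if restarts == max_restarts:
            raise RuntimeError(
                f"giving up after {restarts} restarts (exit code {exit_code})"
            )
        restarts += 1  # crash detected: restart on the same node


if __name__ == "__main__":
    marker = os.path.join(tempfile.mkdtemp(), "started-once")
    # Hypothetical flaky application: fails on its first run, succeeds afterwards.
    flaky = (
        "import os, sys\n"
        f"marker = {marker!r}\n"
        "if os.path.exists(marker):\n"
        "    sys.exit(0)\n"
        "open(marker, 'w').close()\n"
        "sys.exit(1)\n"
    )
    print("restarts needed:", supervise([sys.executable, "-c", flaky]))
```

The restart budget is a deliberate design choice in this sketch: a process that fails repeatedly is reported upward rather than restarted forever, which is where a higher-level recovery policy would take over.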
ARMOR Architecture
Understanding the ARMOR-based SIFT environment requires thinking about the ARMORs at two levels:
- ARMORs make up the SIFT environment. Several different ARMOR processes execute in the SIFT environment. Some of these, such as Execution ARMORs, directly provide services to the user applications such as process launching and monitoring. Others, such as manager ARMORs, implement specific recovery policies (i.e., upon detecting an error, the manager chooses the appropriate set of actions to take to recover from the detected error). Manager ARMORs are arranged in a hierarchy, with the fault tolerance manager (FTM) serving as the highest-ranking manager. Finally, daemon ARMORs reside on each node. They launch other ARMOR processes on the node, provide the communication infrastructure for ARMOR-to-ARMOR messages, and detect failures in locally-installed ARMORs.
Most ARMORs can migrate throughout the network as they execute (with the exception of ARMORs such as the daemons and Execution ARMORs that are tied to a particular local resource). The state within the ARMOR processes is protected through microcheckpointing, an incremental checkpointing technique that captures an ARMOR’s state on an element-by-element basis. The hierarchical organization of ARMOR processes in the SIFT environment not only protects against single points of failure but also tolerates several multiple-failure scenarios. When multiple failures exist simultaneously, recovery occurs in a recursive fashion (e.g., a manager recovers, which allows the manager to recover its subordinate ARMORs, which allows the subordinate managers to recover their subordinates, and so on until all processes are recovered).
- Elements make up the ARMORs. All functionality and behavior of the ARMOR process are encapsulated in element modules. These elements provide both the core functionality of an ARMOR (e.g., each daemon has a set of elements to launch child ARMOR processes, to communicate with its child ARMORs, to communicate with remote daemons on other nodes, and to detect ARMOR failures) and nonfunctional services such as internal fault tolerance mechanisms that make the ARMOR more self-checking. Because an ARMOR’s functionality is embodied within the reconfigurable element-based architecture, additional services, both functional and nonfunctional, can be added to the ARMOR processes, even during runtime. Thus, a base ARMOR process can be outfitted with additional error detection and recovery support such as microcheckpointing, assertion checks, backup elements, and recovery blocks through reconfiguration of the original ARMOR’s design.
Figure 1: ARMORs Architecture
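The two ideas above, elements as pluggable units of functionality and checkpointing on an element-by-element basis, can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the class names `Element`, `LaunchCounter`, and `Armor`, the message-type dispatch, and the choice to save an element's state right after it handles a message. The actual ARMOR interfaces and checkpoint format are not described in this text.

```python
import copy
from typing import Any, Callable, Dict


class Element:
    """A unit of functionality with private state and per-message handlers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, Any] = {}
        self.handlers: Dict[str, Callable[[Any], None]] = {}


class LaunchCounter(Element):
    """Hypothetical element that counts the launch requests it has handled."""

    def __init__(self) -> None:
        super().__init__("launch_counter")
        self.state["launched"] = 0
        self.handlers["launch"] = self.on_launch

    def on_launch(self, payload: Any) -> None:
        self.state["launched"] += 1


class Armor:
    """Container that routes messages to elements and checkpoints their state."""

    def __init__(self) -> None:
        self.elements: Dict[str, Element] = {}
        self.checkpoint: Dict[str, Dict[str, Any]] = {}

    def add_element(self, element: Element) -> None:
        # Elements can be plugged in at any time, including while running.
        self.elements[element.name] = element
        self.checkpoint[element.name] = copy.deepcopy(element.state)

    def remove_element(self, name: str) -> None:
        del self.elements[name]
        del self.checkpoint[name]

    def deliver(self, message_type: str, payload: Any = None) -> None:
        for element in list(self.elements.values()):
            handler = element.handlers.get(message_type)
            if handler is None:
                continue  # this element does not subscribe to the message
            handler(payload)
            # Incremental step: save only the state of the element that just ran.
            self.checkpoint[element.name] = copy.deepcopy(element.state)

    def restore(self) -> None:
        # Roll every element back to its most recently saved state.
        for name, element in self.elements.items():
            element.state = copy.deepcopy(self.checkpoint[name])


if __name__ == "__main__":
    armor = Armor()
    counter = LaunchCounter()
    armor.add_element(counter)
    armor.deliver("launch")
    armor.deliver("launch")
    counter.state["launched"] = -99  # simulate state corruption
    armor.restore()
    print(counter.state)
```

Because each checkpoint update touches only the element that handled the message, the cost of saving state scales with the size of that element's state rather than with the whole process, which is the intuition behind an incremental, element-by-element scheme.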
Success Stories
The reconfigurable design of ARMOR processes benefits not only the ARMOR processes themselves but also non-ARMOR applications. Non-ARMOR applications benefit from the SIFT environment’s ability to host a wide variety of fault tolerance mechanisms, which makes it easy to customize the SIFT environment for a particular application’s set of dependability requirements:
- JPL/REE Applications Manager. The ARMOR-based SIFT environment has been used to protect spaceborne scientific MPI applications as part of the Remote Exploration and Experimentation (REE) project at the Jet Propulsion Laboratory. ARMOR processes detect application crash failures, ARMOR crash failures, application hang failures through progress indicators sent by the application, ARMOR hang failures, and node failures. The REE configuration of the SIFT environment has been experimentally evaluated through error injections to stress the error detection and recovery mechanisms of the ARMOR processes and to determine the overhead of the SIFT environment as seen by the application.
- Wireless Telephone Network Controller. A database server for a wireless telephone network controller has been outfitted with elements that provide a data auditing framework for its in-memory database tables. In addition to the data auditing checks embedded into the database server, process-level detection and recovery provided by the external ARMORs are used to tolerate failures in the controller application.
- DHCP Server. The core functionality of the publicly available DHCP (Dynamic Host Configuration Protocol) server has been implemented as a set of elements to demonstrate the design of applications around the ARMOR architecture. Once designed as an ARMOR, the DHCP server is able to take advantage of several internal fault tolerance mechanisms that leverage the reconfigurable structure of ARMOR processes, including microcheckpointing, coarse- and fine-grained signature checks, assertion checks, backup elements, and recovery blocks.
- High Availability Framework for Wireless Client-Server Applications. Standard socket function calls have been overridden to invoke TCP proxy elements incorporated either within the application process or in a local ARMOR process. These proxy elements shield the application from the occasional disconnection expected when using a lossy wireless medium. In addition to transparent recovery of the application’s TCP connections, the ARMOR-based SIFT environment provides the baseline suite of error detection and recovery services to tolerate failures in the application processes.
- Telecommunications Middleware. Existing middleware processes for a telecommunications application have been extended with elements to implement server failover policies. This application requires two fault tolerance mechanisms: (1) a mechanism to ensure that the backup node has access to all data written to the primary node’s local disk, and (2) a mechanism for migrating the IP address of the primary node to the backup node to provide client transparency. Both of these requirements are satisfied through elements that plug into either the middleware processes or external ARMOR processes in the SIFT environment.
Figure 2: ARMOR Applications
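Several of the applications above rely on hang detection through progress indicators sent by the application. A minimal sketch of that mechanism follows; it is an illustration under assumptions, not the ARMOR implementation. The class name `ProgressMonitor`, the fixed timeout, and the injectable `clock` parameter (used so the example runs deterministically) are all introduced here.

```python
import time
from typing import Callable, Dict, List


class ProgressMonitor:
    """Flags applications that have not reported progress within a timeout."""

    def __init__(
        self, timeout_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def register(self, app_id: str) -> None:
        # Start tracking an application from the moment it is registered.
        self.last_seen[app_id] = self.clock()

    def progress(self, app_id: str) -> None:
        # Record a progress indicator sent by the application.
        self.last_seen[app_id] = self.clock()

    def hung(self) -> List[str]:
        # An application counts as hung if it has been silent longer than the timeout.
        now = self.clock()
        return sorted(
            app_id
            for app_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example does not need to sleep
    monitor = ProgressMonitor(timeout_s=5.0, clock=lambda: now[0])
    monitor.register("solver")
    monitor.register("logger")
    now[0] = 4.0
    monitor.progress("solver")  # only the solver reports progress
    now[0] = 8.0
    print(monitor.hung())  # logger has been silent for 8 s, solver for 4 s
```

A process that has crashed is easy to notice because it exits; a hung process stays alive but does nothing, which is why the application must actively signal that it is still advancing. Choosing the timeout is application-specific: too short and slow phases of execution are misreported as hangs, too long and recovery is delayed.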