1 Introduction

In this release, Oracle Application Server 10g (9.0.4), the high availability solutions for Oracle Application Server have been improved and extended. Several new solutions for the Oracle Application Server 10g Infrastructure have been tested and are described in this book. All of these solutions seek to ensure that the applications you deploy on Oracle Application Server 10g meet the availability required to achieve your business goals. The solutions and procedures described in this book seek to eliminate single points of failure in all Oracle Application Server components, with little or no interruption of service.

This chapter explains high availability and its importance from the perspective of Oracle Application Server.

1.1 What is High Availability

The availability of a system or any component in that system is defined by the percentage of time that it works normally. A system works normally when it meets its correctness and performance specifications. For example, a system that works normally for twelve hours per day is 50% available. A system that has 99% availability is down 3.65 days per year on average. Critical systems are commonly required to achieve 99.99% or even 99.999% availability, which corresponds to about 53 minutes or just over 5 minutes of downtime per year, respectively.
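
The percentages above translate directly into expected downtime. The following minimal Python sketch converts an availability percentage into minutes of downtime per year. It assumes a 365-day year and ignores leap years; the function name `downtime_minutes_per_year` is illustrative only.

```python
# Convert an availability percentage into expected downtime per year.
# Assumption: a 365-day year (leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the expected minutes of downtime per year at the given availability."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (50.0, 99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        days = minutes / (24 * 60)
        print(f"{pct:>7}% -> {minutes:10.1f} min/year ({days:.2f} days)")
```

At 99% the sketch yields 5,256 minutes (3.65 days), matching the figure above; at 50% it yields 182.5 days, the twelve-hours-per-day case.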

Availability may not be constant over time. For example, availability may be higher during the daytime, when most transactions occur, and lower during the night and on weekends. In the event of an unexpected disaster, such as a fire or earthquake, a system may go down suddenly for a period of time. However, because the Internet gives systems a global user base, systems are commonly required to be available at all times.

Redundant components can improve availability, but only if a spare component takes over immediately for a failed component. If it takes ten minutes to detect a component failure and twenty additional minutes to start the spare component, then the system experiences a 50% reduction in availability for that hour of service.
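
The arithmetic in this example can be sketched as follows. The sketch assumes a single failure at the start of the service window and that the service stays down until the failure is detected and the spare has fully started; `window_availability` is a hypothetical helper, not part of Oracle Application Server.

```python
def window_availability(window_min: float, detect_min: float, start_min: float) -> float:
    """Return the fraction of a service window during which the service is up.

    Assumes one failure at the start of the window and that the service stays
    down until the failure is detected and the spare component has started.
    """
    if window_min <= 0:
        raise ValueError("window must be positive")
    outage_min = detect_min + start_min
    # Clamp at zero: an outage longer than the window leaves no uptime at all.
    return max(0.0, window_min - outage_min) / window_min


if __name__ == "__main__":
    print(window_availability(60, 10, 20))  # 0.5: the 50% reduction described above
    print(window_availability(60, 1, 2))    # faster detection and restart
```

With detection cut to one minute and startup to two minutes, availability for the same hour rises to 95%, which is why fast death detection matters as much as having a spare.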

Oracle Application Server is designed to provide a wide variety of high availability solutions, ranging from load balancing and basic clustering to providing maximum system availability during catastrophic hardware and software failures.

1.2 High Availability in Oracle Application Server 10g

Oracle Application Server consists of many components that can be deployed in distributed topologies. The underlying paradigm used to enable high availability for Oracle Application Server is clustering, which unites various Oracle Application Server components in certain permutations to offer scalable and unified functionality, and redundancy should any of the individual components fail.

Before you continue, we recommend that you read the book Oracle Application Server 10g Concepts to gain an understanding of the different components in Oracle Application Server. The descriptions there will make the rest of this guide easier to follow.

Oracle Application Server has several solutions and techniques to achieve high availability, which are all described in this guide. They allow you to achieve the following goals:

Redundancy

A highly available system requires its sub-systems to be redundant. All Oracle Application Server components can be deployed redundantly using the procedures and solutions described in this book. Depending on the type of components, they can be deployed in an active-active configuration or active-passive configuration.

In an active-active configuration, multiple instances of a component service client requests at the same time. If one instance fails, the requests it was servicing can be fulfilled by the other active instances, so the failure and failover of that instance are transparent to clients. An active-active configuration can usually be achieved by clustering instances of a component together.
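
The active-active behavior can be illustrated with a small dispatcher. This is a minimal sketch, not how Oracle Application Server routes requests: the class `ActiveActivePool`, the exception `InstanceDown`, and the round-robin policy are all hypothetical, and each instance is modeled as a plain function that either returns a response or raises `InstanceDown`.

```python
from typing import Callable, List


class InstanceDown(Exception):
    """Raised by a (hypothetical) component instance that has failed."""


Handler = Callable[[str], str]


class ActiveActivePool:
    """Round-robin dispatcher over instances that are all active at once.

    If the chosen instance fails, the same request is retried on the next
    instance, so the client sees no error unless every instance is down.
    """

    def __init__(self, instances: List[Handler]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = instances
        self._next = 0

    def handle(self, request: str) -> str:
        count = len(self._instances)
        for attempt in range(count):
            idx = (self._next + attempt) % count
            try:
                response = self._instances[idx](request)
            except InstanceDown:
                continue  # transparent failover to the next active instance
            self._next = (idx + 1) % count  # spread the next request round robin
            return response
        raise InstanceDown("all instances are down")


if __name__ == "__main__":
    alive = {"a": True, "b": True}

    def make_instance(name: str) -> Handler:
        def serve(request: str) -> str:
            if not alive[name]:
                raise InstanceDown(name)
            return f"{name} served {request}"
        return serve

    pool = ActiveActivePool([make_instance("a"), make_instance("b")])
    print(pool.handle("r1"))  # a served r1
    alive["a"] = False
    print(pool.handle("r2"))  # b served r2
    print(pool.handle("r3"))  # b served r3 (a is skipped)
```

Because every instance is already running, failover here costs only one extra attempt, rather than the detection and startup time discussed earlier.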

In an active-passive configuration, requests are usually serviced by one instance of a component. When that instance fails, another instance is made active to take over the request workload.
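
An active-passive pair can be sketched in the same style. Again this is a hypothetical illustration, not an Oracle API: `ActivePassivePair` and `InstanceDown` are invented names, and instances are modeled as functions that raise `InstanceDown` when they have failed.

```python
from typing import Callable


class InstanceDown(Exception):
    """Raised by a (hypothetical) component instance that has failed."""


Handler = Callable[[str], str]


class ActivePassivePair:
    """One active instance serves every request; the passive standby takes
    over only after the active instance fails."""

    def __init__(self, primary: Handler, standby: Handler) -> None:
        self.active = primary
        self.passive = standby
        self.failovers = 0

    def handle(self, request: str) -> str:
        try:
            return self.active(request)
        except InstanceDown:
            # Promote the standby; the failed instance becomes passive until repaired.
            self.active, self.passive = self.passive, self.active
            self.failovers += 1
            # If the newly active instance is also down, the error propagates.
            return self.active(request)
```

In a real active-passive deployment, such as a cold failover cluster, promoting the standby takes time (processes must be started on the standby), so clients see a brief outage; the sketch ignores that delay.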

Death Detection and Auto Restart

Software processes belonging to Oracle Application Server components, local or distributed, are managed by a central process management system. This system is able to detect the death of processes and restart them even if they are distributed over multiple machines. The system allows customization of parameter values that define process death and restart (such as number of heartbeats). The processes implementing the process management system are themselves redundant as each has a shadow process.
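
Heartbeat-based death detection can be illustrated with a small simulation. This is a minimal sketch, not the Oracle Process Manager and Notification Server implementation or its configuration syntax: the class `HeartbeatMonitor` and the parameters `max_missed` (consecutive missed heartbeats before a process is declared dead) and `max_restarts` are hypothetical stand-ins for the customizable parameters described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ProcessRecord:
    name: str
    missed: int = 0      # consecutive check intervals without a heartbeat
    restarts: int = 0    # restarts performed so far
    alive: bool = True   # False once the monitor has given up on the process


class HeartbeatMonitor:
    """Hypothetical death detector: declares a process dead after
    `max_missed` consecutive missed heartbeats and restarts it, at most
    `max_restarts` times."""

    def __init__(self, max_missed: int = 3, max_restarts: int = 2) -> None:
        self.max_missed = max_missed
        self.max_restarts = max_restarts
        self.procs: Dict[str, ProcessRecord] = {}

    def register(self, name: str) -> None:
        self.procs[name] = ProcessRecord(name)

    def tick(self, heartbeats: Set[str]) -> List[str]:
        """Evaluate one check interval; `heartbeats` names the processes
        that reported in. Returns the actions taken."""
        actions: List[str] = []
        for rec in self.procs.values():
            if not rec.alive:
                continue
            if rec.name in heartbeats:
                rec.missed = 0
                continue
            rec.missed += 1
            if rec.missed < self.max_missed:
                continue
            if rec.restarts < self.max_restarts:
                rec.restarts += 1
                rec.missed = 0  # give the restarted process a fresh window
                actions.append(f"restart {rec.name}")
            else:
                rec.alive = False
                actions.append(f"give up on {rec.name}")
        return actions
```

Raising `max_missed` tolerates slow but healthy processes at the cost of slower detection; the restart limit keeps a process that fails repeatedly from being restarted forever. The sketch omits the shadow process that protects the monitor itself.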

Clustering

Clustering components of a system together allows the components to be viewed functionally as a single entity from the perspective of a client. A cluster increases the scalability, availability, and manageability of the components.

Several types of clusters can exist with Oracle Application Server components. Procedures to create and configure these clusters are comprehensively documented in this book.

State Replication and Routing

For stateful client requests, Oracle Application Server can replicate client state so that requests can fail over statefully if the processes servicing them fail. For J2EE applications, client state can be replicated declaratively or programmatically, depending on the mechanism used. For most other components, state-based routing using cookies is available.
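
Cookie-based routing combined with state replication can be sketched as follows. This is a hypothetical illustration rather than the OC4J or Oracle HTTP Server mechanism: `StickyRouter`, its method names, and the policy of copying each update synchronously to every live peer are assumptions chosen for simplicity.

```python
from typing import Dict, List, Optional


class StickyRouter:
    """Hypothetical sketch of cookie-based routing with replicated session state.

    Requests carrying a routing cookie go back to the instance that issued it.
    Every session update is copied to the other live instances, so if the
    cookie's instance fails, a peer can continue the session with its state.
    """

    def __init__(self, instances: List[str]) -> None:
        self.instances = instances
        self.up: Dict[str, bool] = {name: True for name in instances}
        # state[instance][session_id] -> that instance's copy of the session
        self.state: Dict[str, Dict[str, dict]] = {name: {} for name in instances}

    def _pick(self, cookie: Optional[str]) -> str:
        if cookie is not None and self.up.get(cookie, False):
            return cookie  # stick to the instance named in the cookie
        live = [name for name in self.instances if self.up[name]]
        if not live:
            raise RuntimeError("no live instances")
        return live[0]  # failover: any live peer holds a replica

    def handle(self, cookie: Optional[str], session_id: str, key: str, value: object) -> str:
        """Apply one session update; return the instance name for the new cookie."""
        target = self._pick(cookie)
        session = self.state[target].setdefault(session_id, {})
        session[key] = value
        # Replicate the updated session to every other live instance.
        for name in self.instances:
            if name != target and self.up[name]:
                self.state[name][session_id] = dict(session)
        return target
```

Replicating to every peer on every update is the easiest policy to reason about; a production system might replicate to fewer peers or asynchronously to reduce overhead.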

Connection Failure Management

Clients often connect to services on the server and reuse these connections. When a process implementing one of these services on the server is restarted, the connection may need to be re-established.

Oracle Application Server components ensure that if a reused connection fails, the connection is retried before a failure condition is propagated to the rest of the system. This makes such failures transparent to clients.
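
The retry-before-propagate behavior can be sketched generically. This is a minimal sketch under stated assumptions, not an Oracle API: `call_with_reconnect` and `ConnectionLost` are hypothetical names, the operation and the connection factory are passed in as plain functions, and the retry count and optional delay are illustrative parameters.

```python
import time
from typing import Callable, Tuple, TypeVar

Conn = TypeVar("Conn")
Result = TypeVar("Result")


class ConnectionLost(Exception):
    """Hypothetical error raised when a reused connection is found to be dead."""


def call_with_reconnect(
    op: Callable[[Conn], Result],
    connect: Callable[[], Conn],
    conn: Conn,
    retries: int = 1,
    delay_s: float = 0.0,
) -> Tuple[Result, Conn]:
    """Run op(conn). If the reused connection has failed, open a new one and
    retry, up to `retries` times, before propagating the failure.

    Returns (result, connection) so the caller can keep reusing the
    connection that actually worked.
    """
    attempts = 0
    while True:
        try:
            return op(conn), conn
        except ConnectionLost:
            if attempts >= retries:
                raise  # retries exhausted: let the failure propagate
            attempts += 1
            if delay_s > 0:
                time.sleep(delay_s)  # optional pause while the server restarts
            conn = connect()
```

Returning the new connection alongside the result lets the caller replace its cached connection, so later calls do not hit the same stale connection again.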

Backup and Recovery

Oracle Application Server provides facilities for backing up system state and using this backup to recover from failures. In certain circumstances, a component or system failure may not be repairable. The Oracle Application Server Backup and Recovery Tool can be used to back up the system at certain intervals and restore a backup when an unrepairable failure occurs.

For specific problems localized to the HTTP listener and J2EE container, a runtime configuration management system allows these components to be checkpointed quickly and supports undo operations to reverse configuration errors.
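
The checkpoint-and-undo idea can be illustrated with an in-memory configuration store. This sketch does not reflect the actual Distributed Configuration Management commands or file formats: `ConfigStore`, its methods, and the example keys are hypothetical, and each checkpoint is simply a deep copy of the configuration dictionary.

```python
import copy
from typing import Any, Dict, List


class ConfigStore:
    """Hypothetical configuration holder with cheap checkpoints and undo."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self.current: Dict[str, Any] = copy.deepcopy(initial)
        self._checkpoints: List[Dict[str, Any]] = []

    def checkpoint(self) -> None:
        """Save a snapshot of the current configuration."""
        self._checkpoints.append(copy.deepcopy(self.current))

    def set(self, key: str, value: Any) -> None:
        self.current[key] = value

    def undo(self) -> None:
        """Roll back to the most recent checkpoint, discarding later changes."""
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.current = self._checkpoints.pop()


if __name__ == "__main__":
    cfg = ConfigStore({"port": 8080, "workers": 10})
    cfg.checkpoint()            # take a checkpoint before a risky change
    cfg.set("port", 0)          # a configuration error
    cfg.undo()                  # restore the checkpointed configuration
    print(cfg.current)          # {'port': 8080, 'workers': 10}
```

Snapshots are stored as a stack, so each `undo` restores the most recent checkpoint and consumes it; a second `undo` goes back to the checkpoint before that.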

Disaster Recovery

Natural and physical disasters can strike the location where an Oracle Application Server site hosting critical applications resides. This guide documents a solution for recovering from such disasters: a site-to-site recovery solution that backs up the state of an entire Oracle Application Server site and recovers it to another site that is physically distant from the first.

1.3 Types of Failures

Table 1-1 lists the types of failures that can occur in an Oracle Application Server system and the strategies used to prevent or recover from them. For the purpose of discussion, maintenance activities during planned downtime are also included.

Table 1-1 System downtime, failures, and availability solutions

Downtime Type	Failure Type	Solution
Unplanned Downtime	System Failure	Load balancers, Farm, Oracle Process Management and Notification, Oracle Application Server Active Failover Cluster, Oracle Application Server Cold Failover Clusters
	Data Failure and Disaster	Remote Site, Backup and Recovery, Oracle Data Guard
	Human Error	Backup and Recovery, Oracle Data Guard
Planned Downtime	System Maintenance	Distributed and Dynamic Configuration
	Data Maintenance	No downtime required as data is stored in Oracle database. Backup and Recovery tool for configuration files in filesystem.

As the table shows, solutions exist to prevent or recover from failures ranging from unplanned system failures to unintentional human errors. These solutions make Oracle Application Server robust and reliable, and provide high availability to the applications it hosts.

1.4 Organization of this Guide

This guide is organized into chapters according to the layers of the middle tier and the Oracle Application Server Infrastructure. When this book uses the term "middle tier", it refers generically to the Oracle Application Server middle tier installation types. However, where Oracle Application Server Clusters are discussed, only the J2EE and Web Cache installation type is implied, because it is the only middle tier installation type that can be part of an Oracle Application Server Cluster.

Chapters 2 and 3 contain, respectively, the description and the configuration of the middle tier for high availability. Chapters 4 and 5 follow the same organization for the Infrastructure. Chapter 6 contains the setup and operational information for the site-to-site Oracle Application Server Disaster Recovery solution.

1.5 High Availability Information in Other Documentation

The following table provides a list of cross-references to high availability information in other documents in the Oracle library. This information mostly pertains to high availability of various Oracle Application Server components.

Table 1-2 Cross-references to high availability information in Oracle documentation

Component	Location of Information
Overall high availability concepts	In the high availability chapter of Oracle Application Server 10g Concepts.
Oracle installer	In the chapter for installing in a high availability environment in Oracle Application Server 10g Installation Guide.
Oracle Application Server Backup and Recovery Tool	In the backup and restore part of Oracle Application Server 10g Administrator's Guide.
Oracle Application Server Web Cache	Oracle Application Server Web Cache Administrator's Guide
Identity Management service replication	In the "Advanced Configurations" chapter of Oracle Application Server Single Sign-On Administrator's Guide.
Identity Management high availability deployment	In the "Directory Replication and High Availability" chapter of Oracle Internet Directory Administrator's Guide. In the "Oracle Identity Management Deployment Planning" chapter of Oracle Identity Management Concepts and Deployment Planning Guide.
Database high availability	Oracle High Availability Architecture and Best Practices
Distributed Configuration Management commands	Distributed Configuration Management Reference Guide
Oracle Process Management and Notification commands	Oracle Process Manager and Notification Server Administrator's Guide
OC4J high availability	Oracle Application Server Containers for J2EE Services Guide; Oracle Application Server Containers for J2EE User's Guide; Oracle Application Server Containers for J2EE Enterprise JavaBeans Developer's Guide
Java Object Cache	Oracle Application Server Web Services Developer's Guide
Load balancing to OC4J processes	Oracle HTTP Server Administrator's Guide
Oracle Application Server Wireless high availability	Oracle Application Server Wireless Administrator's Guide
Oracle Application Server Reports Services high availability	Oracle Application Server Reports Services Publishing Reports to the Web
Oracle Application Server Discoverer high availability	Oracle Application Server Discoverer Configuration Guide
Oracle Application Server Forms Services high availability	Oracle Application Server Forms Services Deployment Guide
Oracle Application Server InterConnect ini file information	Oracle Application Server InterConnect User's Guide

In addition, references to these and other documentation are noted in the text of this guide, where applicable.