Thursday Jan 10, 2008

Defining a Data Grid

I saw an article by Robert Cline on Coherence and TimesTen. I've heard some of these questions from customers, since TimesTen and Coherence have both been used to solve some extremely challenging scalability and performance problems in almost every major financial services institution, telco and e-commerce site. Most people know what a database is, but not everyone knows what an In-Memory Data Grid is, so we set out to define it in a concise manner. Here's what we came up with:

The Oracle Coherence In-Memory Data Grid is a data management system for application objects that are shared across multiple servers, require low response time, very high throughput, predictable scalability, continuous availability and information reliability. For clarity, each of these terms and claims is explained:

  • A Data Grid is a system composed of multiple servers that work together to manage information and related operations – such as computations – in a distributed environment.

  • An In-Memory Data Grid is a Data Grid that stores the information in memory in order to achieve very high performance, and uses redundancy – by keeping copies of that information synchronized across multiple servers – in order to ensure the resiliency of the system and the availability of the data in the event of server failure.

  • The application objects are the actual components of the application that contain information that is shared across multiple servers and that must survive server failure in order for the application to be continuously available. These objects are typically built in an object-oriented language such as Java (e.g. POJOs), C++, C#, VB.NET or Ruby. Unlike a relational schema, application objects are often hierarchical in nature, and may contain information that is pulled from many database tables.

  • The application objects must be shared across multiple servers because a middleware application (such as eBay and Amazon.com) is horizontally scaled by adding servers, with each server running an instance of that application. Since the application instance running on one server may read and write some of the same information that an application instance running on another server reads and writes, that information must be shared. The alternative is to always access that information from a shared resource, such as a database, which will lower performance by requiring both remote coordinated access and Object/Relational Mapping (ORM), and decrease scalability by making that shared resource a bottleneck.

  • Because an application object is not relational, in order to retrieve it from a relational database the information must be mapped from a relational query into the object; this is known as Object/Relational Mapping (ORM). Examples include Java’s EJB 3.0 and JPA, and ADO.NET. The same technology allows that object to be stored in a relational database by deconstructing the object (or changes to the object) into a series of SQL inserts, updates and deletes. Since a single object may be composed of information from many tables, the cost of accessing objects from a database using Object/Relational Mapping can be significant, both in terms of the load on the database and the latency of the data access.

  • An In-Memory Data Grid achieves low response times for data access by keeping the information in-memory and in the application object form, and by sharing that information across multiple servers. In other words, applications may be able to access the information that they require without any network communication and without any data transformation step such as ORM. In cases where network communication is required, the Oracle Coherence avoids introducing a Single Point of Bottleneck (SPOB) by partitioning – spreading out – information across the grid, with each server being responsible for managing its own fair share of the total set of information.

  • High throughput of information access and change is achieved through four different aspects of the In-Memory Data Grid. First, Oracle Coherence employs an extremely sophisticated clustering protocol that can achieve wire speed throughput of information on each server, allowing the aggregate flow of information to increase linearly with the number of servers. Second, by partitioning the information, as servers are added each one assumes responsibility for its fair share of the total set of information, thus load-balancing the data management responsibilities into smaller and smaller portions. Third, by combining the wire speed throughput and the partitioning with automatic knowledge of the location of information within the Data Grid, Oracle Coherence routes all read and write requests directly to the servers that manage the targeted information, resulting in true linear scalability of both read and write operations; in other words, high throughput of information access and change. Fourth, for queries, transactions and calculations, particularly those that operate against large sets of data, Oracle Coherence can route those operations to the servers that manage the target data and execute them in parallel.

  • By using dynamic partitioning to eliminate bottlenecks and achieving predictably low latency regardless of the number of servers in the Data Grid, Oracle Coherence provides predictable scalability of applications. While certain applications can use Coherence to achieve linear scalability, that is largely determined by the nature of the application, and thus varies from application to application. More important is the ability of a customer to examine the nature of their application and to be able to predict how many servers will be required in order to achieve a certain level of scale, such as supporting a specified number of concurrent users on a system or completing a complex financial calculation within a certain number of minutes. One way that Coherence accomplishes this is by executing large-scale operations – such as queries, transactions and calculations – in parallel using all of the servers in the Data Grid.

  • One of the ways that Coherence can eliminate bottlenecks is to queue up transactions that have occurred in memory, and asynchronously write the end result to a system of record, such as an Oracle database. This is particularly appropriate in systems that have extremely high rates of change due to the processing of many small transactions, particularly when only the end result needs to be made persistent. Coherence both (i) coalesces multiple changes to a single application object and (ii) batches multiple modified application objects into a single database transaction, meaning that a hundred different changes to each of a hundred different application objects could be persisted to a database in a single, large – and thus highly efficient – transaction. Application objects pending to be written are safeguarded from loss by being managed in a continuously available manner.

  • Continuous availability is achieved by a combination of four capabilities. First, the clustering protocol used by Oracle Coherence can rapidly detect server failure and achieve consensus across all the surviving servers about the detected failure. Second, information is synchronously replicated across multiple servers, so no Single Point of Failure (SPOF) exists. Third, each server knows where the synchronous replicas of each piece of information are located, and automatically re-routes information access and change operations to those replicas. Fourth, Oracle Coherence ensures that each operation executes in a Once-and-Only-Once manner, so that operations that are being executed when a server fails do not accidentally corrupt information during failover.

  • Failover is the process of switching over automatically to a redundant or standby computer server, system, or network upon the failure or abnormal termination of the previously active server, system, or network. Failover happens without human intervention and generally without warning. [1]

  • Information reliability is achieved through a combination of four capabilities. First, Oracle Coherence uses cluster consensus to achieve unambiguous ownership of information within the Data Grid; in other words, at all times exactly one server is responsible for managing the master copy of each piece of information in the Data Grid. Second, because that master copy is owned by a specific server, that server can order the operations that are occurring to that information and synchronize the results of those operations with other servers. Third, because the information is continuously available, these qualities of service exist even during and after the failure of a server. Fourth, by ensuring Once-and-Only-Once operations, no operations are lost or accidentally repeated when server failure does occur. The combination of these four capabilities results is the information within the Data Grid being reliable for use by transactional applications.

As a result of these capabilities, Oracle Coherence is ideally suited for use in computationally intensive, stateful middle-tier applications. Coherence is targeted to run in the application tier, and is often run in-process with the application itself, for example in an Application Server Cluster.

[1] As defined by Wikipedia: http://en.wikipedia.org/wiki/Failover

Posted at 04:45PM Jan 10, 2008 by Cameron Purdy in Technology