Transforming Chaos into Order: Transaction Service

A typical distributed application consists of computations, each of which requires coordinating software components that run across a network of computers. Each computation may need a different mix of components residing on different machines. The pure software coordination problem is a daunting task by itself. Even if we manage to write perfect code and all the components reside on the same machine, the combinations of lower-level software failures, hardware failures, exhausted resource limits and so on create a very large number of 'error states' that our computing system can fall into. Now compound this with a distributed computing system, where network failures, delays, and unavailable systems add still more possible error conditions.

Even though you have probably voiced this complaint to yourself already while walking through the Aberdeen & Wilshire code samples, I'll just repeat it here. There's no error handling code! Besides the normal excuse of 'the error handling code complicates the sample', I must admit that writing error handling code for DCOM applications is a science in its own right.

The problem is that the very same component may be running either locally or remotely, and the computational requirements of each case of reuse are not predictable at component design time (i.e. does this component work by itself, or does it feed into other components at every input/output junction?). Even if you know that your application spans, say, 6 computers in a network and uses a total of 30 components during its lifetime, writing a good application requires you to identify and recover from a tremendous number of combinations of failure states. It's very hard to be sure that a system will detect or catch all software/hardware failure scenarios. It's easy to predict, though, that a distributed system consisting of a large number of software components will occasionally fail.
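To get a rough feel for the numbers, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each of the 6 machines and 30 components is either working or failed, and that the failures are independent; real systems have many more states than that, so this is a lower bound rather than a measurement. The function name is hypothetical.

```python
def count_failure_states(machines, components):
    """Count the combinations in which at least one element has failed.

    Assumption (for illustration only): every machine and every component
    is either 'up' or 'down', independently of all the others.
    """
    elements = machines + components
    all_states = 2 ** elements     # every up/down combination
    return all_states - 1          # exclude the single 'everything works' state


if __name__ == "__main__":
    print(count_failure_states(6, 30))
```

Even under this crude two-state model, the example system has tens of billions of distinct failure combinations, far more than anyone could enumerate and handle by hand.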

Users on the Internet are notorious for tolerating alpha or beta software that contains many errors, so perhaps we should learn from the Internet phenomenon and simply ignore network errors and carry on. While that may be acceptable for someone casually surfing the web for interesting information, it's definitely not an acceptable practice in the business world. What we need is a facility, or a service, which will greatly simplify the handling of errors and greatly reduce the number of failure permutations we have to consider.

One such service is called a transaction service; Microsoft's version of it is the Microsoft Transaction Server. A transaction service for distributed components allows the application writer to adopt a very simple view of error handling. A piece of work, whether performed by a component in isolation or by a large network of distributed components, can have only one of two outcomes: it either completes or fails. If the operation fails, the state of the system is guaranteed to be the same as it was before the attempt to perform the piece of work. While the distributed system is carrying out its piece of work, the intermediate changes to the state of the system aren't visible to any other concurrently computing system until a success or failure state for the entire piece of work is reached. Note how this greatly simplifies failure cases. There may have been a million different reasons why a piece of work failed, but the application designer doesn't have to deal with each of them individually.
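The two-outcome view can be illustrated with a minimal, single-process sketch. This is not the Microsoft Transaction Server API; the names run_transaction, TransactionFailed and transfer are hypothetical, and the 'system state' is assumed to be a plain in-memory dictionary. The work is applied to a private copy, which is published only if every step succeeds.

```python
import copy


class TransactionFailed(Exception):
    """The single failure outcome reported to the caller (hypothetical name)."""


def run_transaction(state, work):
    """Run work(draft) on a private copy of state; publish it only on success."""
    draft = copy.deepcopy(state)   # intermediate changes are made here, out of sight
    try:
        work(draft)
    except Exception as exc:
        # Whatever the cause, the shared state was never touched.
        raise TransactionFailed(str(exc)) from exc
    # Success: publish all the changes together.
    # (A real service would make this step atomic; this sketch is single-threaded.)
    state.clear()
    state.update(draft)


def transfer(amount):
    """Build a two-step piece of work that moves money between two accounts."""
    def work(accounts):
        accounts["checking"] -= amount
        if accounts["checking"] < 0:
            raise ValueError("insufficient funds")
        accounts["savings"] += amount
    return work
```

A caller only ever handles one failure type. For example, with accounts = {"checking": 100, "savings": 0}, calling run_transaction(accounts, transfer(500)) raises TransactionFailed and leaves accounts exactly as it was, even though the first step of the transfer had already modified the private copy.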

With this simple premise, the work of application design is reduced to transforming the application logic into an orchestration of a set of concurrent transactions. Each transaction can contain multiple computational steps involving many distributed components, and it can also include other nested transactions. A successful transaction is completed by a commit action, and an unsuccessful transaction is completed by a rollback action. A commit moves the system of distributed objects on to the next predictable state, while a rollback returns the system to its previous stable state. While a transaction is in progress and the system of distributed objects is computing, it may go through an unpredictable series of state changes, including partial or complete software/network failure. None of these state changes, however, will be visible outside of the system. The other concurrently executing systems will only see a state change when a successful commit is performed on a transaction.
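The commit and rollback actions, including nesting, can be sketched in a few lines. Again, this is an illustrative model and not the API of any real transaction service: the Transaction class and its method names are hypothetical, the shared state is assumed to be a dictionary of simple values, and everything runs in one process.

```python
class Transaction:
    """A toy transaction over a dict; nested transactions commit into their parent."""

    def __init__(self, store=None, parent=None):
        self.store = store
        self.parent = parent
        source = parent.working if parent is not None else store
        self.working = dict(source)   # private working copy (values assumed immutable)
        self.done = False

    def nested(self):
        """Start a child transaction that sees this transaction's working state."""
        self._check_open()
        return Transaction(parent=self)

    def commit(self):
        """Publish changes: to the parent if nested, otherwise to the shared store."""
        self._check_open()
        target = self.parent.working if self.parent is not None else self.store
        target.clear()
        target.update(self.working)
        self.done = True

    def rollback(self):
        """Discard the working copy; nothing outside this transaction changes."""
        self._check_open()
        self.working = None
        self.done = True

    def _check_open(self):
        if self.done:
            raise RuntimeError("transaction already completed")
```

In use, an outer transaction might record an order, start a nested transaction to adjust stock, and roll back just the nested part if that step fails. The shared store changes only when the outermost transaction commits; until then, other readers of the store see the previous stable state.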

Before a robust and reliable transaction service was available, design presentations for component-based distributed applications were full of frantic hand-waving and crossed fingers among the architects and engineers. Notice that designing an application around a transactional model of computing is significantly different from designing for, say, a C++ or Java implementation. It requires a new way of thinking about the problem.