Design and Implementation of System Management Services: Australian Experiences

Complexity is a large number of simplicities.

Presented by Garth L. Wolfendale

Garth is a Senior Consultant with Microsoft® Consulting Services in Australia. His background is primarily in systems development, but more recently he has focused on applying Microsoft's Systems Management Server and Windows NT to the provision of systems management services for large enterprises.

1.0 Overview

This paper focuses on the process of applying Microsoft® Systems Management Server, from initial vision and scoping through to production rollout. It is based on experience designing and implementing systems management services for organisations in Australia. My goal is to cover the key design, architectural and technical issues encountered and resolved, as well as the details of SMS 1.1 that apply to the points under discussion.

Designing and implementing a Microsoft Systems Management Server (SMS) site for a large, distributed organisation is a great challenge, because it spans every aspect of the organisation, from business service issues to the most fine-grained technical issues.

It is like developing a transport system such as an airline, because the service requirements and specifications are complex and often conflicting. An airline is required to carry as many passengers as possible, quickly, to as many places as possible, at the lowest prices and with maximum levels of safety and reliability. Similarly, a systems management service is required to deliver software and packages, collect inventory, provide on-line help and remote control, alert support and operations staff when things go wrong, and roll back distributions when they fail, all over large distributed systems on various networks, with maximum reliability and speed and minimum bandwidth utilisation. Microsoft's Systems Management Server is one tool amongst others in this endeavour. In the case of an airline, a single faulty component on an aircraft can ultimately ruin the corporation. Similarly, for a systems management service, failed system components can produce "show stopper" problems that bring the whole service down or threaten its viability.

It is essential, therefore, to specify the service requirements of the system thoroughly and then match them in the logical and physical design of the system (in fact, to over-match them with contingency systems, rollback capabilities, redundancy and disaster recovery systems).

Quite often the new system must also fit in with existing systems, so it is necessary to extend the capabilities of SMS to support, say, SNMP alerting and reporting, or NetView in an SNA environment.

It is assumed that attendees and readers are familiar with SMS 1.1, since this paper is not an introductory tutorial. Reading TechNet articles on SMS beforehand would be an advantage, for example:

"MS Systems Management Server Reviewer's Guide"
"Architectural Design and Implementation Plan: MS Systems Management Server," by Paul Bethany
"Systems Management Server Administrator's Guide" Product Documentation.

2.0 Designing an Organisation's Systems Management Service

In most cases, I have found that difficulties in implementing an enterprise systems management service arise from misunderstanding the following:

Scope of the task
Business service requirements
Training and expertise requirements
Staffing requirements
Resource and hardware scaling requirements
Network utilisation and requirements
Available systems management tools
Software development and scripting requirements
Problem escalation procedures

There is a tendency to seriously underestimate all of the above in terms of resources, time and cost.

To deal with these issues, I have used the Microsoft Solutions Framework (MSF) and Solutions Development Discipline (SDD), adapted for systems management service development. Microsoft developed MSF/SDD from the way the corporation itself successfully builds software products. It provides a framework for an organisation's development programme that breaks with the traditional task-oriented approach in favour of one that is iterative and risk-based, relying on small, product-oriented teams of peers.

The key elements of this approach are as follows:

The Architectural Model
The Team Model
The Process Model

2.1 The Architectural Model

This model analyses the system in terms of three views:

User View, describing the Services provided by the system
Logical View, specifying the system Objects that provide those services
Physical View, specifying the implementation of the system Objects

Service specifications

Before a system can be defined completely, the details of the services it provides must be identified. Without these, there is no basis for making decisions about the system, the resources it requires or its schedule. On a large system I was recently involved in, a great deal of time was devoted to defining the system's services in detail. This phase of the project had a surprisingly large impact on the design and implementation details of the system being developed.

More than 20 main elements of service were identified, along with dependency diagrams showing how they connected to the rest of the organisation's services. The personnel required to manage and support the system were quantified, and their skill and training requirements identified. It is worth noting that the system was to be used primarily for package distribution and installation; even with this restricted use, there was a significant number of service elements.

Some of these key services were:

Develop SMS package distribution and installation scripts to provide:
- Installation
- Backout
- Validation
- Scheduling
- Cleanup
Detect errors in distribution and installation and:
- Take corrective actions if feasible
- Log appropriate events
- Raise appropriate alerts
Audit and reconcile distributions:
- Produce reports on distributions including:
  - Timings and other performance measures
  - Errors and successful retries
  - Complete failures
Activate software when installed and confirm that it runs correctly. If not, back out and alert.
Automatically detect (and remedy if necessary) whether target sites are ready to receive distribution
Distribute a 4 MB compressed package to over 1000 sites, spread across Australia, overnight
Provide complete disaster recovery

These were then followed by Service Interface and Reports specifications.

The Reports specifications covered areas such as Operations, Reconciliation (success and failure statistics and details), Measurement and Service Levels, Inventory Details, Performance Data and Management.
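Once each site's outcome has been collected, the Reconciliation reports reduce to simple aggregation. The following Python sketch assumes a hypothetical per-site record (`SiteOutcome`, holding a success flag, a retry count and an elapsed time); it is not an SMS data structure, and in practice the records would be gathered from whatever job status and log data the installation scripts produce. It derives the figures named in the service list: first-time successes, successes after retries, complete failures and timings.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class SiteOutcome:
    """Hypothetical per-site record collected after a distribution."""
    site: str
    succeeded: bool
    retries: int    # number of retries before success or final failure
    minutes: float  # elapsed time from job start to completion or failure


def reconcile(outcomes: List[SiteOutcome]) -> Dict[str, object]:
    """Summarise one distribution into reconciliation report figures."""
    first_time = [o for o in outcomes if o.succeeded and o.retries == 0]
    after_retry = [o for o in outcomes if o.succeeded and o.retries > 0]
    failures = [o for o in outcomes if not o.succeeded]
    durations = [o.minutes for o in outcomes]
    total = len(outcomes)
    return {
        "sites": total,
        "succeeded_first_time": len(first_time),
        "succeeded_after_retry": len(after_retry),
        "complete_failures": len(failures),
        "failed_sites": sorted(o.site for o in failures),
        "success_rate_pct": 100.0 * (total - len(failures)) / total if total else 0.0,
        "mean_minutes": mean(durations) if durations else 0.0,
        "max_minutes": max(durations, default=0.0),
    }


if __name__ == "__main__":
    report = reconcile([
        SiteOutcome("branch-001", True, 0, 12.5),
        SiteOutcome("branch-002", True, 2, 41.0),
        SiteOutcome("branch-003", False, 3, 55.0),
    ])
    for key, value in report.items():
        print(f"{key}: {value}")
```

Keeping successes after retries separate from first-time successes matters for service level reporting: a distribution that only succeeded on retry consumed extra bandwidth and time, even though it counts as delivered.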

These services cannot be provided by SMS alone, or by any other single product. The main challenge was to combine various existing systems and products (e.g., SMS) and to develop software (e.g., using the BackOffice® SDK) and scripts (e.g., MSTest) to provide the required services.
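The scripted services in the list above (readiness check, installation, validation, backout, cleanup, logging and alerting) combine into a single control flow per target site. The Python sketch below shows one way to structure that flow. The step callables, the `InstallResult` type and the `alert` hook are hypothetical stand-ins for what, in the system described here, would be MSTest script actions and SMS or SNMP alerts; this is an illustration of the control flow, not the scripts used on the project.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical step signature: each step returns True on success.
Step = Callable[[], bool]


@dataclass
class InstallResult:
    site: str
    succeeded: bool = False
    retries: int = 0
    log: List[str] = field(default_factory=list)


def run_package_job(site: str, ready: Step, install: Step, validate: Step,
                    backout: Step, cleanup: Step,
                    alert: Callable[[str], None],
                    max_retries: int = 1) -> InstallResult:
    """Install a package at one site: check readiness, install and validate,
    back out on failure, retry, alert on final failure, always clean up."""
    result = InstallResult(site=site)
    if not ready():
        # Target not ready (e.g. insufficient disk space): do not attempt install.
        result.log.append("target not ready")
        alert(f"{site}: target not ready for distribution")
        return result
    try:
        for attempt in range(max_retries + 1):
            if attempt > 0:
                result.retries += 1
                result.log.append(f"retry {attempt}")
            # validate() runs only if install() succeeded.
            if install() and validate():
                result.succeeded = True
                result.log.append("installed and validated")
                return result
            # Corrective action: remove any partial installation before retrying.
            backout()
            result.log.append("backed out")
        result.log.append("failed")
        alert(f"{site}: installation failed after {result.retries} retries")
        return result
    finally:
        # Runs on every exit from the try block, including the early returns.
        cleanup()
        result.log.append("cleanup done")
```

Putting cleanup in a `finally` clause guarantees it runs whether the installation succeeds, fails after all retries, or raises an unexpected error, which is the property the Cleanup service requires.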

The service details were then used as the basis for the system's logical design.

Logical design

The Logical Design specifies the enterprise's Domain Model, the SMS Site Model, the Network Architecture, the Data Flow Model, the Job, Distribution and Support Processes, Management Procedures, SNMP MIBs, and Events (their meaning and triggers).

In essence, the key business objects are identified (e.g., Central Site, Regional Site, Branch Site, Administrator, Support Technician, Package and Job), and the services that each provides and uses are detailed.
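To make the idea concrete, the following Python sketch writes down three of the business objects named above (sites, packages and jobs) and one relationship between them: a job sent to a site reaches every site beneath it in the hierarchy. The class names and the "Central", "Regional" and "Branch" labels are illustrative assumptions; they model the logical design only, not SMS's own site database or object model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Site:
    """A logical site: kind is "Central", "Regional" or "Branch"."""
    name: str
    kind: str
    children: List["Site"] = field(default_factory=list)

    def add(self, child: "Site") -> "Site":
        """Attach a subordinate site and return it for further building."""
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Site"]:
        """Yield this site and every site beneath it, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Package:
    name: str
    size_mb: float


@dataclass
class Job:
    """A request to deliver a package to a site and all sites below it."""
    package: Package
    target: Site

    def target_sites(self, kind: str = "Branch") -> List[str]:
        """Names of the sites of the given kind that this job will reach."""
        return [s.name for s in self.target.walk() if s.kind == kind]


if __name__ == "__main__":
    central = Site("Central", "Central")
    vic = central.add(Site("VIC", "Regional"))
    vic.add(Site("VIC-Branch-01", "Branch"))
    vic.add(Site("VIC-Branch-02", "Branch"))
    nsw = central.add(Site("NSW", "Regional"))
    nsw.add(Site("NSW-Branch-01", "Branch"))
    print(Job(Package("App", 4.0), central).target_sites())
    print(Job(Package("App", 4.0), vic).target_sites())
```

Writing the objects down at this level, independent of any product, is what lets Business, Logistics and Development agree on what a "job to the VIC region" means before any physical design decisions are made.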

This design level forms the basis of the system's functional specifications. At this level, all parties involved in implementing the system can understand its details, whereas the physical design is targeted at, and understood only by, the specialists involved.

This level of specification is unusual for most organisations, which tend to move from service specifications (requirements) to physical design in a rather haphazard way. As will be described in the section on the MSF/SDD team model, Logical Design is a very important deliverable: it provides a basis for agreement on what the system is and does, and it is understood by all parties (Business, Logistics, User Help, Development). It is the level at which the system can be checked against service requirements, and it provides a description against which the Physical Design can be validated.

A schematic such as the following is a good example of a logical-design-level description of a site architecture:

[Figure: logical design schematic of a site architecture]

Numbers are identified (e.g., 20,000 workstations, with 15 primary sites and one central site), and logistics and user training implications are spelled out. The network architecture is also included in the logical design, identifying network connections, routers and protocols in enough detail to provide preliminary timing and delivery estimates.
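The kind of preliminary timing estimate the logical design should support can be worked through with simple arithmetic. The Python sketch below estimates delivery of the 4 MB package to 1000 sites from the service specifications, assuming for illustration that those sites sit evenly beneath 15 regional sites. The link speeds, the number of concurrent transfers per regional site and the 70% protocol efficiency are hypothetical parameters, not measured figures. The model further assumes that the central site's link is shared across its sends, while in the second stage each branch's own link is the bottleneck.

```python
import math


def transfer_seconds(size_mb: float, link_kbps: float, efficiency: float) -> float:
    """Seconds to send one copy of a package over one link.

    size_mb uses 1 MB = 2**20 bytes; link_kbps uses 1 kbit/s = 1000 bit/s;
    efficiency is the assumed fraction of raw bandwidth left after protocol overhead.
    """
    bits = size_mb * 1024 * 1024 * 8
    return bits / (link_kbps * 1000 * efficiency)


def distribution_hours(size_mb: float, sites: int, regions: int,
                       central_kbps: float, branch_kbps: float,
                       concurrency: int, efficiency: float = 0.7) -> float:
    """Two-stage estimate: central -> regional sites, then regional -> branches.

    Stage 1: the central site's single link carries one copy per region.
    Stage 2: every region forwards to its branches at the same time as the
    other regions, `concurrency` branches at a time, each transfer limited
    by the branch link speed.
    """
    stage1 = regions * transfer_seconds(size_mb, central_kbps, efficiency)
    branches_per_region = math.ceil(sites / regions)
    waves = math.ceil(branches_per_region / concurrency)
    stage2 = waves * transfer_seconds(size_mb, branch_kbps, efficiency)
    return (stage1 + stage2) / 3600


if __name__ == "__main__":
    # Hypothetical links: 2 Mbit/s at the central site, 64 kbit/s to each branch.
    hours = distribution_hours(size_mb=4, sites=1000, regions=15,
                               central_kbps=2048, branch_kbps=64, concurrency=4)
    print(f"Estimated delivery time: {hours:.1f} hours")
```

With these assumed figures the estimate comes to about 3.6 hours, comfortably inside an overnight window; reducing concurrency to 2 roughly doubles the second stage and pushes the estimate past seven hours. This is why concurrency and branch link speeds belong in the logical design rather than being left to tuning.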

This logical design is verified against the services design, and forms the basis of the physical design of the system.

Physical design

At this level of design, the details of the system are identified: platform specifications are in place, network details are completed, and the SMS components (Site Servers, SQL Servers) are identified and scaled, along with their positions on the network.

As with the other levels of design, this process is iterative, and may take some test runs and tuning exercises before the physical design is complete.

2.2 The Team Model

The key concept underlying the MSF/SDD team model is that of a "Team of Peers," as opposed to a hierarchy of specialists or career systems management. The following diagram shows the team of peers concept:

[Figure: the MSF/SDD team of peers]

In this paradigm there is no fixed hierarchy, and full interaction between the roles is an essential component of the model. The Program Management role coordinates the team's activities and resolves deadlocked issues. The Product Management role ensures that the system being developed provides the required business services; it also ensures that those services are what the business actually requires, are achievable, and are fully understood by all parties. User Education and Logistics/Planning are brought into the project at a very early stage to ensure that the system is rolled out effectively and that its users are properly trained and supported. Development is the technical role responsible for actually building the system.

Each role contributes to, or is responsible for, deliverables as the system design and development process unfolds. At the Services specification phase, Product Management is responsible for the deliverables. Program Management is responsible for ensuring that the detailed functional and design specifications are delivered, along with project schedules and milestones. QA Testing, Development and Logistics are responsible for completing and delivering the implemented system.

2.3 The Process Model

This model is an iterative approach to systems development that is also milestone based. It is illustrated as follows:

[Figure: the MSF/SDD process model]

The main elements of this model are:

Rather than attempting to design and implement the "complete" system in one step, identify releases that can be implemented in a short time frame (6 months), followed by a "next" release. This allows for changes in business requirements, updates due to new technology, and rapid revision of system requirements in response to competitive initiatives. Organisations I have dealt with have embarked on multi-year projects whose schedules have blown out and which are already falling behind emerging, relevant and competitive technology.
The first quadrant lays down the scope of the system and usually entails a proof of concept to validate the key elements of the target system.
The second phase involves the team in developing the detailed functional specifications and design of the system. This requires producing the user education, logistics, test and quality assurance plans, as well as the system's functional and detailed design. At this stage, the project schedule is finalised and agreed by all parties.
The third phase is development, leading to the "scope complete" milestone.
The final quadrant involves prototyping, beta testing and the logistics of rollout.

During development and testing, issues and problems will be identified, and some of them will be incorporated, along with other services, into the next release. For each problem, a decision has to be made whether to fix it in the current release or deal with it in the next one.

A guiding principle throughout this process is "Specify early but freeze late."

3.0 Key Issues Arising During the Development of Systems

The following sections will be presented as walkthroughs and demonstrations.

3.1 Automated installation of a large number of secondary sites over a routed TCP/IP WAN
3.2 Tuning SMS and Windows NT for optimal performance on a routed TCP/IP WAN
3.3 A "shadow" install technique for secondary site installation when WAN connectivity is down
3.4 Generating custom MIFs and triggering SMS alerts
3.5 Using RAS to support remote administration over routed TCP/IP WAN

THESE MATERIALS ARE PROVIDED "AS-IS," FOR INFORMATIONAL PURPOSES ONLY.

NEITHER MICROSOFT NOR ITS SUPPLIERS MAKES ANY WARRANTY, EXPRESS OR IMPLIED WITH RESPECT TO THE CONTENT OF THESE MATERIALS OR THE ACCURACY OF ANY INFORMATION CONTAINED HEREIN, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. BECAUSE SOME STATES/JURISDICTIONS DO NOT ALLOW EXCLUSIONS OF IMPLIED WARRANTIES, THE ABOVE LIMITATION MAY NOT APPLY TO YOU.

NEITHER MICROSOFT NOR ITS SUPPLIERS SHALL HAVE ANY LIABILITY FOR ANY DAMAGES WHATSOEVER INCLUDING CONSEQUENTIAL INCIDENTAL, DIRECT, INDIRECT, SPECIAL, AND LOSS PROFITS. BECAUSE SOME STATES/JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF CONSEQUENTIAL OR INCIDENTAL DAMAGES THE ABOVE LIMITATION MAY NOT APPLY TO YOU. IN ANY EVENT, MICROSOFT'S AND ITS SUPPLIERS' ENTIRE LIABILITY IN ANY MANNER ARISING OUT OF THESE MATERIALS, WHETHER BY TORT, CONTRACT, OR OTHERWISE SHALL NOT EXCEED THE SUGGESTED RETAIL PRICE OF THESE MATERIALS.