First fault software problem solving (FFSPS) is an old mainframe approach that calls for solving problems as soon as they occur. It’s an approach that has gone out of favor except in classic mainframe data centers, but it may be worth reviving as the IT industry moves toward cloud computing and especially private clouds, for which the zEnterprise (z196 and zBX) is particularly well suited.
The point of Dan Skwire’s book, First Fault Software Problem Solving: Guide for Engineers, Managers, and Users, is that FFSPS is an effective approach even today. Troubleshooting after a problem has occurred is time-consuming, costly, inefficient, and often unsuccessful. What typically complicates it is a lack of information. As Skwire notes: if you have to start troubleshooting after the problem occurs, the odds indicate you will not solve the problem, and along the way, you consume valuable time, extra hardware and software, and other measurable resources.
The FFSPS trick is to capture problem-solving data from the start. This is what mainframe data centers did routinely. Specifically, they used trace tables and included recovery routines. This continues to be the case with modern z/OS today.
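To make the idea concrete, here is a minimal sketch of a trace table paired with a recovery routine. It is not taken from Skwire’s book or from z/OS; the names TraceTable, run_with_recovery, and divide are hypothetical, and the design (a fixed-size in-memory buffer that overwrites its oldest entries) is an assumption chosen for illustration.

```python
from collections import deque
import time


class TraceTable:
    """Hypothetical fixed-size in-memory trace table.

    Once capacity is reached, the oldest entries are overwritten,
    so tracing can stay switched on all the time.
    """

    def __init__(self, capacity=256):
        self._entries = deque(maxlen=capacity)

    def record(self, event, **details):
        # Each entry is (timestamp, event name, details dict).
        self._entries.append((time.time(), event, details))

    def snapshot(self):
        return list(self._entries)


def run_with_recovery(trace, operation, *args):
    """Run an operation; if it fails, return the error plus the trace
    that was already being collected before the fault occurred."""
    trace.record("enter", op=operation.__name__, args=args)
    try:
        result = operation(*args)
    except Exception as exc:
        # Recovery routine: capture diagnostic data at the first fault.
        trace.record("fault", op=operation.__name__, error=repr(exc))
        return {"ok": False, "error": repr(exc), "trace": trace.snapshot()}
    trace.record("exit", op=operation.__name__, result=result)
    return {"ok": True, "result": result, "trace": None}


def divide(a, b):
    return a / b


trace = TraceTable(capacity=4)
ok = run_with_recovery(trace, divide, 10, 2)
failed = run_with_recovery(trace, divide, 1, 0)
```

In this sketch the failing call comes back with the events that led up to it, so nobody has to re-create the failure just to find out what happened.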
So why should IT managers today care about mainframe disciplines like FFSPS? Skwire’s answer: there surely will be greater customer satisfaction if you solve and repair the customer’s problem, or if he is empowered to solve and repair his own problem rapidly.
Another reason is risk minimization. As classic mainframe shops have become increasingly heterogeneous, the mainframe disciplines that kept the mainframe rock solid have not been enforced across the new platforms.
Skwire also likes to talk about System Yuk. You probably have a few System Yuks in your shop. What’s System Yuk? As Skwire explains, System Yuk is very complex. It makes many decisions, and analyzes much data. However, the only means it has of conveying an error is the single message to the operator console: SYSTEM HAS DETECTED AN ERROR, which is not particularly helpful.
System Yuk has no trace table or FFSPS tools. To diagnose problems in Yuk, you must re-create the environment in your Yuk test-bed and add instrumentation (write statements, traces, etc.) and various tools to get a decent explanation of the problem. The alternative is to set up some second-fault tool to capture more and better data on the production System Yuk, which is high risk.
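For contrast with System Yuk’s single console message, the sketch below shows an error that carries its own context at the moment of the first fault. It is an illustration only: DiagnosticError, apply_discount, and the pricing scenario are hypothetical and do not come from the book.

```python
class DiagnosticError(Exception):
    """Hypothetical error type that records which component failed,
    why, and the values involved."""

    def __init__(self, component, reason, **context):
        self.component = component
        self.reason = reason
        self.context = context
        # Sort the context keys so the message is stable and easy to search.
        details = ", ".join(f"{k}={v!r}" for k, v in sorted(context.items()))
        super().__init__(f"{component}: {reason} ({details})")


def apply_discount(price, percent):
    # Check the input up front and report the offending values,
    # instead of failing later with a generic message.
    if not 0 <= percent <= 100:
        raise DiagnosticError(
            "pricing", "discount percent out of range",
            price=price, percent=percent,
        )
    return price * (100 - percent) / 100


try:
    apply_discount(80.0, 120)
except DiagnosticError as err:
    message = str(err)
```

Here the message names the component, the reason, and the values involved, which is the kind of information an operator would need to act without re-creating the problem in a test-bed.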
Toward the end of the book, Skwire gets into what you can do about System Yuk. It amounts to a call for defensive programming. He then introduces a variety of tools to troubleshoot and fix software problems. These include ServiceLink by Axeda, AlarmPoint Systems, and LogLogic. Of course, mainframe shops have long relied on management tools from IBM, CA, BMC, and others to enable FFSPS.
With the industry gravitating toward private clouds as a way to efficiently deliver IT as a flexible service, the disciplined methodologies that keep the mainframe a critical platform in large enterprises will be worth adopting. FFSPS is one in particular to keep in mind.