A major System z crash shocked a lot of mainframe folks. As the Register put it so elegantly this past December, “EDS mainframe goes titsup, crashes RBS cheque system.”
RBS is the Royal Bank of Scotland. EDS, as a result of a nearly $14 billion acquisition in 2008, now operates as HP Enterprise Services. As one of those services, it runs mainframe data centers for outsourcing customers.
The System z10 crash apparently resulted from the failure of a staff stretched too thin to correctly implement a microcode fix. Mainframes are legendary for their resistance to crashes, but they do need to be maintained. Incorrectly applying a microcode fix is a monumental mistake.
The failure reportedly brought down the bank’s check clearing system for about 12 hours. Finextra.com passed along an unconfirmed UK Techwire report that just added to the speculation: HP’s “disaster recovery plan saw processes switched to a z10 in Mitcheldean, Gloucestershire, but this machine also failed to work.”
An HP response to Finextra confirms the failure and adds only that operations were restored quickly.
So, what went wrong? The mainframe community probably won’t ever know for sure. The Register attributes the problem to cost-cutting measures that led to the layoffs of the skilled mainframe staff needed to recognize the importance of the microcode update and implement it correctly.
That makes sense given the general history of big acquisitions, which often rely on deep layoffs to cut costs and boost the bottom line. The mainframe community on LinkedIn didn’t hesitate to voice outrage. A few felt the published accounts didn’t provide sufficient data to determine what went wrong, and I wholeheartedly agree.
Others were quick to finger layoffs as the primary culprit: “Ever since HP took over at EDS, a lot of good mainframe talent [has] hit the streets. Some because HP saw no value in them as they were primarily mainframe-oriented personnel. Others got smart and found other opportunities, or retired, before the economy tanked.”
An online search weeks later failed to turn up any more data, except for this passage from a Register article published months earlier, which lends credence to the suspicion that layoffs or work cutbacks were the problem: “HP has emailed EDS staff offering unpaid leave and temporary cuts in hours and wages in an attempt to cut costs. Before the end of October staff can take either unpaid leave or a reduction in working hours.”
The LinkedIn thread morphed into a discussion of layoffs and training. That is probably the real message from the HP-RBS z10 crash. Although the z10 is very efficient to operate and manage in terms of manpower compared to distributed platforms, it still requires at least a few skilled individuals. The best way to save money in a mainframe data center is by taking advantage of mainframe efficiency and System z specialty processors to consolidate more workloads on the z, not by laying off people.