In a previous post, DancingDinosaur reported on IBM’s initial announcement of Hadoop and other analytic products, like InfoSphere BigInsights, coming to the z. The IBM announcement itself can be found here.
Subsequent sessions at IBM Enterprise2014 delved more deeply into big data, analytics, and real-time analytics. A particularly good series of sessions was offered by Karen Durward, an IBM InfoSphere software product manager specializing in System z data integration. As Durward noted, BigInsights is Apache Hadoop wrapped up to make it easier to use for general IT and business managers.
Specifically, the real-time analytics package for z includes IBM InfoSphere BigInsights for Linux on System z, which combines open-source Apache Hadoop with enhancements that make Hadoop enterprise-ready on System z. The solution also includes IBM DB2 Analytics Accelerator (IDAA), which improves data security while delivering a 2000x faster response time for complex data queries.
In her Hadoop on z session, Durward started with the Hadoop framework, which consists of four components:
- Common Core: the basic modules (libraries and utilities) on which all other components are built
- Hadoop Distributed File System (HDFS): stores data across multiple machines to provide very high aggregate bandwidth across the cluster
- MapReduce: the programming model that supports high-volume data processing across the cluster
- YARN (Yet Another Resource Negotiator): the platform that manages the cluster’s compute resources, including scheduling users’ applications. In effect, YARN decouples Hadoop workload management from resource management.
The typical Hadoop process sounds deceptively straightforward. Simply load data into an HDFS cluster, analyze the data in the cluster using MapReduce, and write the resulting analysis back into the HDFS cluster. Then just read it.
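To make the map, shuffle, and reduce steps concrete, here is a minimal sketch in plain Python. It does not use Hadoop or HDFS at all. The list of text lines is only a stand-in for data loaded into a cluster, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical names chosen for illustration, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over in-memory records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Stand-in for data loaded into HDFS: a plain list of text lines (an assumption).
input_lines = ["big data on the z", "hadoop on the z", "big insights"]
word_counts = run_job(input_lines)
```

On a real cluster, the framework distributes the map and reduce work across nodes and handles the shuffle itself. This sketch only shows the shape of the computation.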
Sounds easy enough until you try it. Then you need to deal with client nodes and name nodes, exchange metadata, and more. In addition, Hadoop is an evolving technology. Apache continues to add pieces to the environment in an effort to simplify it. For instance, Hive provides the Apache data warehouse framework, accessible using HiveQL, and HBase brings Apache’s Hadoop database. Writing MapReduce code is a challenge, so there is Pig, Apache’s platform for creating long and deep Hadoop source programs, and the list goes on. In short, Hadoop is not easy, especially for IT groups accustomed to relational databases and SQL. That’s why you need tools like BigInsights. The table below shows how Durward sees the Hadoop tool landscape.
|Software Needs|Other Hadoop Products|BigInsights|
|---|---|---|
|Open Source Apache Hadoop|Y|Y|
|Rich SQL on Hadoop (Big SQL)|some|Y|
|Tools for Business Users (BigSheets)|NA|Y|
|Advanced text analytics|NA|Y|
|Rich developer tools|NA|Y|
|Enterprise workload & storage mgt.|NA|Y|
In fact, you need more than BigInsights. “We don’t know how to look at unstructured data,” said Durward. That’s why IBM layers on tools like Big SQL, which helps you query Hadoop’s HBase using industry-standard SQL. You can migrate a relational table to HBase using Big SQL or connect Big SQL via JDBC to run business intelligence and reporting tools, such as Cognos, which also runs on Linux on z. Similarly, IBM offers BigSheets, a cloud application that performs ad hoc analytics at web scale on unstructured and structured content using the familiar spreadsheet format.
Lastly, Hadoop queries often produce free-form text, which requires text analytics to make sense of the results. Not surprisingly, IBM offers BigInsights Text Analytics, a fast, declarative rule-based information extraction (IE) system that extracts insights from unstructured content. This system consists of a fast, efficient runtime that exploits numerous optimization techniques across extraction programs written in Annotation Query Language (AQL), an English-like declarative language for rule-based information extraction.
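The idea of declarative, rule-based extraction can be sketched in a few lines of Python. This is not AQL and does not use BigInsights Text Analytics. It uses Python regular expressions as a stand-in, and the rule labels, patterns, and sample sentence are all hypothetical, made up for illustration.

```python
import re

# Each rule pairs a label with a pattern. Both rules are hypothetical examples.
RULES = [
    ("amount", re.compile(r"\$\d+(?:\.\d{2})?")),
    ("account", re.compile(r"\bACCT-\d{4}\b")),
]


def extract(text):
    """Apply every rule to the text and return (label, matched_text) pairs."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


# Made-up free-form text of the kind a query might return.
sample = "Customer on ACCT-1234 disputed a $49.99 charge and a $5 fee."
entities = extract(sample)
```

The rules describe what to find, not how to scan the text. That separation is the sense in which such a system is declarative, and it is what leaves a runtime free to optimize how the rules are executed.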
Hadoop for the z is more flexible than z data center managers may think. You can merge Hadoop data with z transactional data sources and analyze it all together through BigInsights.
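As a rough illustration of analyzing both kinds of data together, the sketch below joins a transactional table with a summary table of the sort a Hadoop job might produce. It uses Python's built-in `sqlite3` module purely as a stand-in. It is not BigInsights or Big SQL, and the table names, column names, and values are hypothetical.

```python
import sqlite3

# In-memory database standing in for the combined data sources (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for z transactional data.
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
    -- Stand-in for a summary produced by a Hadoop job.
    CREATE TABLE web_activity (customer_id INTEGER, page_views INTEGER);
    INSERT INTO transactions VALUES (1, 120.0), (1, 30.0), (2, 75.0);
    INSERT INTO web_activity VALUES (1, 40), (2, 5);
""")

# One standard SQL query spans both sources.
rows = conn.execute("""
    SELECT t.customer_id, SUM(t.amount) AS total_spend, w.page_views
    FROM transactions AS t
    JOIN web_activity AS w ON w.customer_id = t.customer_id
    GROUP BY t.customer_id, w.page_views
    ORDER BY t.customer_id
""").fetchall()
conn.close()
```

Here the query relates each customer's total spend to their page views. The same pattern would apply to whatever keys the transactional and Hadoop-derived data actually share.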
So how big will big data be on the z? DancingDinosaur thought it could scale to hundreds of terabytes, even petabytes. Not so. You should limit Hadoop on the z to moderate volumes—from hundreds of gigabytes to tens of terabytes, Durward advises, adding “after that it gets expensive.”
Still, there are many advantages to running Hadoop on the z. To begin, the z brings rock-solid security, is fast to deploy, and, through BigInsights, brings an easy-to-use data ingestion process. It also has proven to be easy to set up and run, taking just a few hours, with conversions handled automatically. Lastly, the data never leaves the platform, which avoids the expense and delay of moving data between platforms. But maybe most importantly, by wrapping Hadoop in a set of familiar, comfortable tools and burying its awkwardness out of sight, Hadoop now becomes something every z shop can leverage.
DancingDinosaur is Alan Radding. Follow this blog on Twitter, @mainframeblog. Check out my work at Technologywriter.com