Posts Tagged ‘InfoSphere’

Hadoop Brings Big Data Analytics to the IBM System z

October 16, 2014

In a previous blog, DancingDinosaur reported on IBM’s initial announcement of Hadoop and other analytic products, like InfoSphere BigInsights, coming to the z. The IBM announcement itself can be found here.

Subsequent sessions at IBM Enterprise2014 delved more deeply into big data, analytics, and real-time analytics. A particularly good series of sessions was offered by Karen Durward, an IBM InfoSphere software product manager specializing in System z data integration. As Durward noted, BigInsights is Apache Hadoop wrapped up to make it easier to use for general IT and business managers.

Specifically, the real-time analytics package for z includes IBM InfoSphere BigInsights for Linux on System z, which combines open-source Apache Hadoop with enhancements to make Hadoop System z enterprise-ready. The solution also includes IBM DB2 Analytics Accelerator (IDAA), which improves data security while delivering a 2000x faster response time for complex data queries.

In her Hadoop on z session, Durward started with the Hadoop framework, which consists of four components:

  1. Common Core—the basic modules (libraries and utilities) on which all components are built
  2. Hadoop Distributed File System (HDFS)—stores data on multiple machines to provide very high aggregate bandwidth across a cluster of machines
  3. MapReduce—the programming model that supports high-volume data processing by the cluster
  4. YARN (Yet Another Resource Negotiator)—the platform used to manage the cluster’s compute resources including scheduling users’ applications. In effect, YARN decouples Hadoop workload and resource management.

The typical Hadoop process sounds deceptively straightforward. Simply load data into an HDFS cluster, analyze the data in the cluster using MapReduce, and write the resulting analysis back into the HDFS cluster. Then just read it.
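To make the analyze step a little more concrete, here is a minimal sketch of the MapReduce pattern in plain Python. It is not Hadoop code and touches no HDFS cluster; the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """Chain the three steps, as a real cluster would do across many nodes."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical in-memory input standing in for files loaded into HDFS.
    sample = ["big data on the z", "hadoop on the z"]
    print(run_job(sample))
```

In a real cluster the map and reduce steps run in parallel on the nodes that hold the data, which is where the complexity described next comes from.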

Sounds easy enough until you try it. Then you need to deal with client nodes and name nodes, exchange metadata, and more. In addition, Hadoop is an evolving technology. Apache continues to add pieces to the environment in an effort to simplify it. For instance, Hive provides the Apache data warehouse framework, accessible using HiveQL, and HBase brings Apache’s Hadoop database. Writing MapReduce code is a challenge, so there is Pig, Apache’s platform for creating long and deep Hadoop source programs, and the list goes on. In short, Hadoop is not easy, especially for IT groups accustomed to relational databases and SQL. That’s why you need tools like BigInsights. The table below is how Durward sees the Hadoop tool landscape.

Software Needs                          Other Hadoop Products   BigInsights
Open Source Apache Hadoop               Y                       Y
Rich SQL on Hadoop (Big SQL)            some                    Y
Tools for Business Users (BigSheets)    NA                      Y
Advanced text analytics                 NA                      Y
In-Hadoop analytics                     NA                      Y
Rich developer tools                    NA                      Y
Enterprise workload & storage mgt.      NA                      Y
Comprehensive suite                     NA                      Y

In fact, you need more than BigInsights. “We don’t know how to look at unstructured data,” said Durward. That’s why IBM layers on tools like Big SQL, which helps you query Hadoop’s HBase using industry-standard SQL. You can migrate a relational table to HBase using Big SQL or connect Big SQL via JDBC to run business intelligence and reporting tools, such as Cognos, which also runs on Linux on z. Similarly, IBM offers BigSheets, a cloud application that performs ad hoc analytics at web scale on unstructured and structured content using the familiar spreadsheet format.

Lastly, Hadoop queries often produce free-form text, which requires text analytics to make sense of the results. Not surprisingly, IBM offers BigInsights Text Analytics, a fast, declarative rule-based information extraction (IE) system that extracts insights from unstructured content. This system consists of a fast, efficient runtime that exploits numerous optimization techniques across extraction programs written in Annotation Query Language (AQL), an English-like declarative language for rule-based information extraction.

Hadoop for the z is more flexible than z data center managers may think. You can merge Hadoop data with z transactional data sources and analyze it all together through BigInsights.

So how big will big data be on the z? DancingDinosaur thought it could scale to hundreds of terabytes, even petabytes. Not so. You should limit Hadoop on the z to moderate volumes—from hundreds of gigabytes to tens of terabytes, Durward advises, adding “after that it gets expensive.”

Still, there are many advantages to running Hadoop on the z. To begin, the z brings rock solid security, is fast to deploy, and, through BigInsights, brings an easy-to-use data ingestion process. It also has proven to be easy to set up and run, taking just a few hours, with conversions handled automatically. Lastly, the data never leaves the platform, which avoids the expense and delay of moving data between platforms. But maybe most importantly, by wrapping Hadoop in a set of familiar, comfortable tools and burying its awkwardness out of sight, Hadoop now becomes something every z shop can leverage.

DancingDinosaur is Alan Radding. Follow this blog on Twitter, @mainframeblog. Check out my work at Technologywriter.com

Software Licensing for IBM System z Distributed Linux Middleware

October 10, 2014

DancingDinosaur can’t attend a mainframe conference without checking out at least one session on mainframe software pricing by David Chase, IBM’s mainframe pricing guru. At IBM Enterprise2014, which wraps up today, the topic of choice was software licensing for Linux middleware. It’s sufficiently complicated to merit an entire session.

In case you think Linux on z is not in your future, maybe you should think again. Linux is gaining momentum in even the largest z data centers. Start with IBM bringing new apps like InfoSphere, BigInsights (Hadoop), and OpenStack to z. Then there are apps from ISVs that just weren’t going to get their offerings to z/OS. Together, these are a telltale sign that something is happening with Linux on z. And the queasiness managers used to have about the open source nature of Linux has long been put to rest.

At some point, you will need to think about IBM’s software pricing for Linux middleware. Should you find yourself getting too lost in the topic, check out these links recommended by Chase:

To begin, software for Linux on z is treated differently than traditional mainframe software in terms of pricing. With Linux on z you think in terms of IFLs. The quantity of IFLs represents the number of Linux engines subject to IBM’s IPLA-based pricing.

Also think in terms of Processor Value Units (PVUs) rather than MSUs. For pricing purposes, PVUs are analogous to MSUs although the values are different. A key point to keep in mind: distributed PVUs for Linux are not related to System z IPLA value units used for z/VM products. As is typical of IBM, those two different kinds of value units are NOT interchangeable.

Chase, however, provides a few ground rules:

  • Dedicated partition:
    • Processors are always allocated in whole increments
    • Resources are only moved between partitions “explicitly” (e.g. by an operator or a scheduled job)
  • Shared pool:
    • Pool of processors shared by partitions (including virtual machines)
    • System automatically dispatches processor resources between partitions as needed
  • Maximum license requirements:
    • Customer does not have to purchase more licenses for a product than the number of processors on the machine (e.g. maximum DB2 UDB licenses on a 12-way machine is 12)
    • Customer does not have to purchase more “shared pool” licenses for a product than the number of processors assigned to the shared pool (e.g. maximum of 7 MQSeries licenses for a shared pool with 7 processors). Note: This limit does not affect the additional licenses that might be required for dedicated partitions.
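The two caps in the ground rules above can be sketched as a small calculation. This is only an illustration under stated assumptions: the function name licenses_required and its parameters are hypothetical, and the idea that the uncapped shared-pool count comes from the virtual processors defined to partitions running the product is an assumption, not something Chase spelled out.

```python
def licenses_required(machine_processors, dedicated, shared_pool_size, shared_virtual):
    """Estimate processor licenses for one product on one machine.

    machine_processors: physical processors on the machine
    dedicated: whole-processor counts of dedicated partitions running the product
    shared_pool_size: processors assigned to the shared pool
    shared_virtual: ASSUMED uncapped count for the shared pool, e.g. the sum of
        virtual processors defined to shared partitions running the product
    """
    # Dedicated partitions are always allocated in whole processors.
    if any(not isinstance(n, int) or n < 0 for n in dedicated):
        raise ValueError("dedicated partitions are allocated in whole processors")

    # Cap 1: never more shared-pool licenses than processors in the pool.
    shared = min(shared_virtual, shared_pool_size)

    # Cap 2: never more licenses in total than processors on the machine.
    return min(sum(dedicated) + shared, machine_processors)


if __name__ == "__main__":
    # 12-way machine, one 4-processor dedicated partition, 7-processor shared pool.
    print(licenses_required(12, [4], 7, 10))
```

Note how the shared-pool cap leaves the dedicated partition's licenses untouched, matching the note in the last ground rule.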

With that, as Chase explains it, Linux middleware pricing turns out to be relatively straightforward, determined by:

  • Processor Value Unit (PVU) rating for each kind of core
  • Any difference for different processor technologies (p, i, x, z, Sun, HP, AMD, etc.); notice that the z is just one of many choices, not handled differently from the others
  • Number of processor cores which must be licensed (z calls them IFLs)
  • Price per PVU (constant per product, not different based upon technology)

Then it becomes a case of doing the basic arithmetic. The formula: PVU rating per core x the # of cores required x the price ($) per PVU = your total cost. Given this formula it is to your advantage to plan your Linux use to minimize IFLs and cores. You can’t do anything about the cost per PVU.

Distributed PVUs are the basis for licensing middleware on IFLs and are determined by the type of machine processor. The zEC12, z196, and z10 are rated at 120 PVUs per IFL. All others are rated at 100 PVUs per IFL. For example, for any distributed middleware running on Linux on z, this works out to:

  • z114—1 IFL, 100 PVUs
  • z196—4 IFLs, 480 PVUs
  • zEC12—8 IFLs, 960 PVUs
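The three examples above can be reproduced with a few lines of Python. This is a sketch, not an IBM tool: the ratings are the ones quoted in this post, the names total_pvus and license_cost are hypothetical, and the price per PVU in the usage example is an invented figure used only to show the arithmetic.

```python
# PVU ratings per IFL as quoted in this post; everything else defaults to 100.
PVU_PER_IFL = {"zEC12": 120, "z196": 120, "z10": 120}
DEFAULT_PVU_PER_IFL = 100


def total_pvus(machine, ifls):
    """PVUs to license: per-IFL rating for the machine times the number of IFLs."""
    return PVU_PER_IFL.get(machine, DEFAULT_PVU_PER_IFL) * ifls


def license_cost(machine, ifls, price_per_pvu):
    """Total cost: PVUs to license times the product's price per PVU."""
    return total_pvus(machine, ifls) * price_per_pvu


if __name__ == "__main__":
    for machine, ifls in [("z114", 1), ("z196", 4), ("zEC12", 8)]:
        print(machine, ifls, total_pvus(machine, ifls))
    # The $50 per PVU price is made up purely for illustration.
    print(license_cost("z196", 4, 50.0))
```

Because the price per PVU is fixed per product, the only lever in this calculation is the number of IFLs you need to license.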

Also, distributed systems Linux middleware offerings are eligible for sub-capacity licensing. Specifically, sub-capacity licensing is available for all PVU-priced software offerings that run on:

  • UNIX (AIX, HP-UX, and Sun Solaris)
  • i5/OS, OS/400
  • Linux (System i, System p, System z)
  • x86 (VMware ESX Server, VMware GSX Server, Microsoft Virtual Server)

IBM’s virtualization technologies are also included in the Passport Advantage sub-capacity licensing offering, including LPAR, z/VM virtual machines in an LPAR, CPU Pooling support introduced in z/VM 6.3 APAR VM65418, and native z/VM (on machines which still support basic mode).

And in true z style, since this can seem more complicated than it should be, there are tools available to do the job. In fact, Chase doesn’t advise doing this without a tool. The current tool is the IBM License Metric Tool V9.0.1. You can find more details on it here.

If you are considering distributed Linux middleware software or are already wrestling with the pricing process, DancingDinosaur recommends you check out Chase’s links at the top of this piece. Good luck.

DancingDinosaur is Alan Radding. Follow DancingDinosaur on Twitter, @mainframeblog. You can check out more of my work at Technologywriter.com

