Posts Tagged ‘Hadoop Distributed File System (HDFS)’

IBM Introduces a Reference Architecture for On-Premise AI

June 22, 2018

This week IBM announced an AI infrastructure reference architecture for on-premises AI deployments. The architecture promises to address the challenges organizations face when experimenting with AI PoCs, growing them into multi-tenant production systems, and then expanding to enterprise scale while integrating with existing IT infrastructure.

The reference architecture includes, according to IBM, a set of integrated software tools built on optimized, accelerated hardware for the purpose of enabling organizations to jump-start AI and deep learning projects, speed time to model accuracy, and provide enterprise-grade security, interoperability, and support. IBM’s graphic above should give you the general picture.

Specifically, IBM’s AI reference architecture should support iterative, multi-stage, data-driven processes or workflows that entail specialized knowledge, skills, and, usually, a new compute and storage infrastructure. Still, these projects have many attributes that are familiar to traditional CIOs and IT departments.

The first of these is that the results are only as good as the data going in, and model development depends on having a lot of data in the format expected by the deep learning framework. Surprised? You have been hearing this for decades as GIGO (Garbage In, Garbage Out). The AI process is also iterative: repeatedly looping through data sets and tunings to develop more accurate models, then comparing the results against the original business or technical requirements to refine the approach. In this sense, the AI reference model is no different from IT 101, an intro course for wannabe IT folks.
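To make that loop concrete, here is a minimal sketch in Python using scikit-learn and a toy dataset; the target accuracy, the candidate tunings, and the model choice are all stand-ins for whatever a real project would actually use:

```python
# Minimal sketch of the iterative train-evaluate-tune loop described above.
# The dataset, model, and 0.95 "business requirement" are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

target_accuracy = 0.95          # stand-in for the business requirement
best_model, best_score = None, 0.0

# Loop through candidate tunings until the requirement is met (or we run out).
for c in (0.01, 0.1, 1.0, 10.0):
    model = make_pipeline(StandardScaler(), LogisticRegression(C=c))
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:
        best_model, best_score = model, score
    if best_score >= target_accuracy:
        break                   # good enough; otherwise keep tuning

print(f"best accuracy: {best_score:.3f}")
```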

But AI doesn’t stay simple for long. As the reference architecture puts it, AI is a sophisticated, complex process that requires specialized software and infrastructure. That’s where IBM’s PowerAI Platform comes in. Most organizations start with small pilot projects bound to a few systems and data sets but grow from there.

As projects grow beyond the first test systems, however, it is time to build out storage and networking infrastructure that can sustain growth and eventually support a larger organization.

The trickiest part of AI and the part that takes inspired genius to conceive, test, and train is the model. The accuracy and quality of a trained AI model are directly affected by the quality and quantity of data used for training. The data scientist needs to understand the problem they are trying to solve and then find the data needed to build a model that solves the problem.

Data for AI falls into a few broad sets: the data used to train and test the models, the data analyzed by the models, and archived data that may be reused. This data can come from many different sources such as traditional organizational data from ERP systems, databases, data lakes, sensors, collaborators and partners, public data, mobile apps, social media, and legacy data. It may be structured or unstructured and stored in many formats such as file, block, object, the Hadoop Distributed File System (HDFS), or something else.

Many AI projects begin as a big data problem. Regardless of how it starts, a large volume of data is needed, and it inevitably needs preparation, transformation, and manipulation. But it doesn’t stop there.

AI models require the training data to be in a specific format; each model has its own, usually different, format. Invariably the initial data is nowhere near those formats. Preparing the data is often one of the largest organizational challenges, not only in complexity but also in the amount of time it takes to transform the data into a format that can be analyzed. Many data scientists, notes IBM, claim that over 80% of their time is spent in this phase and only 20% on the actual process of data science. Data transformation and preparation is typically a highly manual, serial set of steps: identifying and connecting to data sources, extracting to a staging server, tagging the data, and using tools and scripts to manipulate it. Hadoop is often a significant source of this raw data, and Spark typically provides the analytics and transformation engines, used along with advanced AI data matching and traditional SQL scripts.
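As a rough illustration of that kind of Spark-based preparation step, here is a minimal PySpark sketch; the HDFS paths, column names, and labeling rule are all hypothetical:

```python
# Minimal PySpark sketch of the preparation flow described above: read raw
# data, clean and tag it, and write it out in a format a training framework
# can consume. Paths and column names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ai-data-prep").getOrCreate()

# Extract: raw data staged on HDFS (or any Hadoop-compatible file system).
raw = spark.read.csv("hdfs:///staging/raw_events.csv",
                     header=True, inferSchema=True)

# Transform: drop incomplete records, normalize a field, and tag each row.
prepared = (
    raw.dropna()
       .withColumn("amount", F.col("amount").cast("double"))
       .withColumn("label", F.when(F.col("status") == "failed", 1).otherwise(0))
)

# Load: write in a columnar format the training pipeline can read efficiently.
prepared.write.mode("overwrite").parquet("hdfs:///prepared/events.parquet")
```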

There are two other considerations in this phase: 1) data storage and access, and 2) speed of execution. For this (don’t be shocked) IBM recommends Spectrum Scale, which provides multi-protocol support with a native HDFS connector and can centralize and analyze data in place rather than wasting time copying and moving data. But you may have your own preferred platform.
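The “analyze in place” point boils down to skipping the copy step: the analytics job reads directly from whatever storage exposes a Hadoop-compatible interface, whether that is native HDFS or, as IBM suggests, Spectrum Scale behind its HDFS connector. A minimal sketch, with a hypothetical URI:

```python
# Sketch of reading data in place rather than staging a copy first.
# The path below is hypothetical; any Hadoop-compatible endpoint would do.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analyze-in-place").getOrCreate()

# No intermediate copy step: read straight from the shared file system.
events = spark.read.parquet("hdfs:///prepared/events.parquet")
events.groupBy("label").count().show()
```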

IBM’s reference architecture provides a place to start. A skilled IT group will eventually tweak it, making it their own.

DancingDinosaur is Alan Radding, a veteran information technology analyst, writer, and ghost-writer. Follow DancingDinosaur on Twitter, @mainframeblog. See more of his work at technologywriter.com and here.

IBM’s DeepFlash 150 Completes Its Flash Lineup for Now

July 29, 2016

Two years ago DancingDinosaur wrote about new IBM Flash storage for the mainframe. That was about the DS8870, featuring 6-nines (99.9999%) availability and real-time compression. Then this past May DancingDinosaur reported on another new IBM all-flash initiative, including the all-flash IBM DS8888 for the z, which also boasts 6-nines availability. Just this week IBM announced it is completing its flash lineup with the IBM DeepFlash 150, intended as a building block for SDS (software-defined storage) infrastructures.

IBM DeepFlash 150, courtesy of IBM

As IBM reports, the DeepFlash 150 does not use conventional solid-state drives (SSD). Instead, it relies on a systems-level approach that enables organizations to manage much larger data sets without having to manage individual SSD or disk drives. DeepFlash 150 comes complete with all the hardware necessary for enterprise and hyper-scale storage, including up to 64 purpose-engineered flash cards in a 3U chassis and 12-Gbps SAS connectors for up to eight host servers. The wide range of IBM Spectrum Storage and other SDS solutions available for DeepFlash 150 provides flash-optimized scale out and management along with large capacity for block, file and object storage.

The complication for z System shops is that you access the DeepFlash 150 through IBM Spectrum Scale. Apparently you can’t just plug the DeepFlash 150 into the z the way you would plug in the all-flash DS8888. IBM Spectrum Scale works with Linux on z Systems servers or IBM LinuxONE systems running RHEL or SLES. Check out the documentation here.

As IBM explains in the Redbook titled IBM Spectrum Scale (GPFS) for Linux on z Systems: IBM Spectrum Scale provides a highly available clustering solution that augments the strengths of Linux on z by helping the z data center control costs and achieve higher levels of quality of service. Spectrum Scale, based on IBM General Parallel File System (GPFS) technology, is a high-performance shared-disk file management solution that provides fast, reliable access to data from multiple nodes in a cluster environment. Spectrum Scale also allows data sharing in a mixed-platform environment, which can provide benefits in cloud or analytics environments by eliminating the need to transfer data across platforms. When it comes to the DeepFlash 150, IBM is thinking about hyperscale data centers.

Hyperscale data centers can’t absorb the costs of constructing, managing, maintaining, and cooling massive hyperscale environments that use conventional mechanical storage, according to IBM. Those costs are driving the search for storage with a smaller physical footprint, lower costs, greater density, and, of course, much higher performance.

Enter DeepFlash 150, which introduces what IBM considers breakthrough economics for active data sets. The basic DeepFlash 150 hardware platform is priced under $1/GB. For big data deployments IBM recommends IBM Spectrum Scale with DeepFlash 150, providing customers with the overlying storage services and functionality critical for optimization of their big data workloads.

But even at $1/GB, DeepFlash 150 isn’t going to come cheap. For starters, consider how many gigabytes are in the terabytes or petabytes you will want to install. You can do the math; even at $1/GB this is going to cost. Then you will need IBM Spectrum Scale. With DeepFlash 150 IBM did achieve extreme density of up to 170TB per rack unit, which adds up to a maximum of 7PB of flash in a single rack enclosure.
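As a quick back-of-the-envelope calculation using the figures above (the 42U rack is my assumption, and Spectrum Scale licensing and other costs are excluded):

```python
# Rough math on the list price and density figures quoted above:
# $1/GB for the basic hardware platform, 170TB per rack unit, ~7PB per rack.
price_per_gb = 1.0                  # USD, basic DeepFlash 150 hardware
tb_per_rack_unit = 170
rack_units = 42                     # assumed standard full-height rack

capacity_tb = tb_per_rack_unit * rack_units       # 7,140 TB, i.e. ~7 PB
capacity_gb = capacity_tb * 1000                  # decimal TB -> GB
hardware_cost = capacity_gb * price_per_gb

print(f"capacity: {capacity_tb / 1000:.1f} PB")   # ~7.1 PB
print(f"hardware alone: ${hardware_cost:,.0f}")   # ~$7.1 million
```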

IBM Spectrum Scale and the DeepFlash 150 are intended to support a wide range of file, object, and Hadoop Distributed File System (HDFS) analytics workloads. According to IBM, as a true SDS solution IBM Spectrum Scale can utilize any appropriate hardware and is designed specifically to maximize the benefits of hyperscale storage systems like DeepFlash 150. Using a scale-out architecture, IBM Spectrum Scale can add servers or multiple storage types and incorporate them automatically into a single managed resource to maximize performance, efficiency, and data protection.

Although DeepFlash 150 can be used with a private cloud, IBM seems to be thinking more in terms of hybrid clouds. To address today’s need for seamlessly integrating high-performance enterprise storage such as DeepFlash 150 with the nearly unlimited resources and capabilities of the cloud, IBM Spectrum Scale offers transparent cloud tiering to place data on cloud-based object storage or in a public cloud service. As IBM explains, the transparent cloud tiering feature of IBM Spectrum Scale can connect on-premises storage such as DeepFlash 150 directly to object storage or a commodity-priced cloud service. This allows enterprises to simultaneously leverage the economic, collaboration, and scale benefits of both on-premises and cloud storage while providing a single, powerful view of all data assets.

A TechTarget report on enterprise flash storage profiled 15 flash storage product lines. In general, the products claim maximum read IOPS ranging from 200,000 to 9 million, peak read throughput from 2.4 GBps to 46 GBps, and read latencies from 50 microseconds to 1 millisecond. The guide comes packed with a ton of caveats. And that’s why DancingDinosaur doesn’t think the DeepFlash 150 is the end of IBM’s flash efforts. Stay tuned.

DancingDinosaur is Alan Radding, a veteran information technology analyst and writer. Please follow DancingDinosaur on Twitter, @mainframeblog. See more of his IT writing at technologywriter.com and here.

 

