These days, organizations are creating and storing massive amounts of data, and in theory this data can be used to drive business decisions through application development, particularly with new techniques such as machine learning. Data is arguably the most important asset, and it is also probably the most difficult thing to manage. Well, excepting people.
Data is a tangled mess. It can be structured or unstructured, and it is increasingly scattered across different locations – in on-premises infrastructure, in a public cloud, on a mobile device. It is a challenge to move, thanks to the costs in everything from bandwidth to latency to infrastructure. It comes in a zillion different formats, sometimes chunks of data are missing, and usually it is unorganized and alarmingly often ungoverned. So for those organizations looking to leverage machine learning for research and application development, the challenges associated with managing and processing the data can be a hurdle.
“Every client is on journey towards AI, which means they have some level of sophistication around machine learning, some level of sophistication around analytics and, ultimately, the right data and information architecture to support those,” Rob Thomas, general manager of IBM Analytics, tells The Next Platform. “Think of those as kind of the building blocks for AI, which are often a deterrent or an obstacle to getting to AI if that is not in place. What we’ve identified as the biggest inhibitor is basically the integration – stitching things together, getting systems to talk to each other, getting the data architecture normalized. The biggest thing that holds up developers today is, ‘I can’t get the organization to get me the data I need, or they’ll only give me a subset, or I can only get access to the on-premises data, I can’t get access to the cloud data.’”
All of this falls close to IBM’s heart. The company has put down a big bet on AI, through its focus on cognitive computing and its efforts to drive that with its Watson technology. At the same time, Big Blue has ramped up efforts to build out its public and private cloud capabilities as it looks to better compete with Amazon Web Services and Microsoft Azure, and to make it easier for businesses and other organizations to collect and analyze the data they are creating. A recent example was the launch in the fall of the Integrated Analytics System, a platform to enable data scientists and developers to use advanced analytics with data regardless of its location – including private, public and hybrid clouds – and to move workloads among data stores. As we noted, it also uses machine learning techniques and data science tools to automate many of the tasks involved with data analytics. Soon after, the company unveiled its IBM Cloud Private, a software platform built on Kubernetes that enables enterprises to develop private clouds, embrace Docker containers, more easily move workloads between private and public clouds and accelerate application development.
This week, in the runup to its Think 2018 conference, IBM is expanding its capabilities around data management, artificial intelligence, and the cloud. Among the additions is its new Cloud Private for Data, a platform designed to enable businesses to more quickly take in and analyze large amounts of data coming in from such sources as IoT sensors and mobile devices. It is an application layer that runs on the Kubernetes-based IBM Cloud Private platform and leverages microservices. It is powered by a fast in-memory database that can ingest and analyze huge amounts of data. The database, built over the past two years, uses an open source Spark engine and the Apache Parquet data format, according to IBM’s Thomas.
“This is extreme high ingestion of data and you can do analytics on the fly,” Thomas says, noting that data traditionally is brought in and stored before it’s analyzed. “We do that high-speed ingestion, and we can do analytics on the fly. We can make sense of any type of event data or any event that triggers data, whether it’s telematics or IoT, we can make sense of that data in real time.”
The offering is designed to give organizations greater visibility into the data they have, such as whether it’s structured or unstructured, sensitive or not, and on-premises or in the cloud.
“Most data scientists are encumbered by the quality of data – or lack of quality – and we’ve used machine learning to automate that data quality process inside this,” Thomas says. “We do automatic data matching, we do automatic data preparation, so all those things that are manual steps today we’re doing with the magic of machine learning and software. It really changes your visibility, your ability to act on it, and once you do those things, you can start to build machine learning models and deploy them, and you can monitor the health of your model, meaning, how are they performing in your environment, where do you need to make more adjustments, where do you need more training data, where do you need more enterprise data.”
Cloud Private for Data is designed for on-premises private clouds, but has what Thomas calls “connective tissue” back to the IBM public cloud in three forms: machine learning connectivity, enabling users to build models in the private cloud and deploy them in the public cloud, and vice versa; metadata sharing between the private and public clouds, giving the user a common catalog; and data connectivity to more easily move data between the two environments. That connectivity is aimed at the IBM Cloud, but the presence of Kubernetes creates consistency among all clouds.
“The machine learning and data science world is playing out in open source, and that’s what we’ve continued to support and drive here,” Thomas says. “With things like the data science capabilities in here, you can use Python, you can use R, you can use Scala, you can use Java. You can have your choice of ML frameworks, anything you can imagine is there. We’ve built this around where the machine learning world is playing out, which is in open source.”
Extending the AI and data focus further, IBM at Think 2018 will introduce Deep Learning as a Service in Watson Studio, enabling organizations to use such frameworks as TensorFlow, Caffe and PyTorch as cloud-native services on IBM Cloud. Companies can consume deep learning as a service, paying only for the GPU resources they use, which makes deep learning and AI more widely accessible. IBM also is contributing the core of the deep learning service as an open-source project called Fabric for Deep Learning (FfDL).
IBM also wants to make it easier for data scientists and developers to use containers on bare metal nodes. Containers can run on bare metal infrastructure, but there is a tradeoff in performance and management. Now the vendor is enabling IBM Cloud Container Service – a fully managed container service based on Kubernetes – to run on bare metal nodes.
“Many of these data-intensive workloads, such as machine learning apps, require high levels of computing power that bare metal excels at delivering,” Jason McGee, IBM Fellow and vice president of IBM Cloud, explains in a post on the company blog. “Until now, running containers on bare metal required considerable configuration and constant management from developer teams. This limited their use on bare metal with complex apps in production, where the benefits of a managed service, such as automatic updating, intelligent scaling and built-in security, prevail. On IBM Cloud, Kubernetes can now fit into an organization’s cloud strategy no matter what that looks like; whether it’s building a completely cloud-native machine learning app, accessing servers directly to handle large data workloads or migrating data-heavy apps to the cloud.”