Sponsored Feature. Enterprises know they want to do machine learning, but they also know they can’t afford to think too long or too hard about it. They need to act, and they have specific business problems that they want to solve.
And they know instinctively and anecdotally from the experience of the hyperscalers and the HPC centers of the world that machine learning techniques can be utterly transformative in augmenting existing applications, replacing hand-coded applications, or creating whole new classes of applications that were not possible before.
They also have to decide if they want to run their AI workloads on-premise or on any one of a number of clouds where a lot of the software for creating models and training them is available as a service. And let’s acknowledge that a lot of those models were created by the public cloud giants for internal workloads long before they were peddled as a service.
Given the cornucopia of frameworks, models, libraries, and other parts of the development and runtime environment for machine learning, the options can be bewildering. We are well into the second decade of the machine learning revolution, and it is still not obvious how pervasive machine learning will be in the enterprise or whether this will be one workload that only the elite can run in their own datacenters.
Factors to consider include the exorbitant cost of the hardware to train neural networks, the tremendous amount of software and algorithm parameter tuning necessary to get a model to work, and the constant retraining that is par for the machine learning course. It also doesn’t help that machine learning expertise is in relatively short supply and in very high demand.
Picking a hardware platform for machine learning is relatively easy: enterprises will probably use clusters with CPU host nodes (possibly with built-in AI acceleration), GPUs, and custom ASICs. Each of these architectures offers different trade-offs in performance, general-purpose usability, and programmability, along with different power and latency constraints. Machine learning can run on any number of devices: CPUs with vector and matrix math accelerators either embedded in their cores or sitting alongside them in the same package, as well as GPUs, FPGAs, or custom ASICs that run the machine learning model to identify objects or speech, translate speech to text, or do more sophisticated natural language processing that synthesizes all kinds of media in a way that emulates some aspects of human behavior.
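As a small illustration of that hardware diversity, here is a minimal sketch, assuming PyTorch as the framework, that runs the same model code on a GPU when one is available and falls back to the CPU otherwise; the layer sizes and batch are arbitrary placeholders, and vendor-specific backends extend the same pattern to other accelerators.

```python
import torch

# Minimal sketch: pick an available accelerator and fall back to the CPU.
# The model dimensions and batch size are arbitrary placeholders.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 10).to(device)   # toy model moved to the chosen device
batch = torch.randn(64, 1024, device=device)   # synthetic input batch on the same device
logits = model(batch)                          # identical code path on CPU or GPU

print(logits.shape, "computed on", device)
```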
Coming up with a deployment model for machine learning training and inference applications is also relatively easy. Developers will no doubt want to containerize the training stack, using Kubernetes for example. Over the long haul, the application stack, including inference embedded in corporate applications, will increasingly move to Kubernetes as well.
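To make that deployment model concrete, the sketch below submits a containerized training run as a one-shot Kubernetes Job using the official kubernetes Python client; the image name, namespace, training command, and GPU request are hypothetical placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

# Describe a one-shot training Job; names, image, and command are placeholders.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="model-training"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/ml/trainer:latest",
                        command=["python", "train.py", "--epochs", "10"],
                        resources=client.V1ResourceRequirements(
                            # request a single GPU, if the cluster exposes them
                            limits={"nvidia.com/gpu": "1"}
                        ),
                    )
                ],
            )
        ),
    ),
)

# Submit the Job to a (hypothetical) ml-training namespace.
client.BatchV1Api().create_namespaced_job(namespace="ml-training", body=job)
```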
Everything in between the hardware and the Kubernetes containers is a bit tricky, and it will very likely remain that way for the foreseeable future, until a full AI stack emerges that works across diverse use cases and hardware/software combinations.
Frankly, we do not fully know what is needed in terms of an AI software stack at this stage of the game. Or at least no more than we knew what was needed from computational fluid dynamics and finite element analysis in the HPC arena back in the 1980s and early 1990s as these technologies were refined, applied, and democratized.
Ironically, it may be a little too early for a complete and portable AI stack to coalesce, even if that is something to be desired in the long run. It warrants some thought, though, because, as the Unix revolution showed, the only way this will ever happen is if enterprises demand it. Back then the touchpaper was lit when enterprises got sick and tired of expensive and proprietary systems. It eventually culminated in the ascendancy of Linux in the datacenter, but it took three decades to start and five decades to become normal.
In their own ways, many companies have already created vertically integrated machine learning platforms. These include the major compute engine suppliers like Intel, Nvidia, AMD, and Xilinx; the custom ASIC suppliers; big cloud providers like Amazon Web Services, Microsoft Azure, Google Cloud, and Alibaba Cloud; as well as some independent software development firms.
They may not have the full breadth of AI frameworks, models, and other tools like automated hyperparameter tuning (of which many are rightly skeptical today, just like people were with automatic database tuning decades ago) that can make AI more broadly applicable. But this is a good start – just like the proprietary systems of old, which variously shaped what a good stack for running commercial or technical applications should look like.
We may even be seeing history repeating itself. The complexity of AI tools and the amount of algorithm hand tuning required seem to warrant a utility computing approach, much as the early days of proprietary mainframes and minicomputers warranted the establishment of service bureaus to run these platforms and their applications for companies that lacked the capital or the expertise to do it themselves.
Once system and application expertise had developed over the course of a decade or so and the cost of systems came down, enterprises knew what they wanted to do with the machines and their applications, and they could justify making the capital investments to put systems on premise. A few decades later, the pendulum swung back to outsourcing for many customers looking to cut costs on mainframes and application support. Some customers dumped mainframes entirely for new platforms like Unix systems, partly to save money and partly to improve their systems.
Recessions have a wonderful way of focusing budgets and accelerating IT trends. So does intense competition, which is now being brought to bear on machine learning just as it was during the waves of back-office automation in the 1980s and 1990s and the rise of Internet technologies in the 1990s and 2000s.
We think a similar kind of pendulum swing will happen with enterprise AI stacks. The vast majority of enterprises will try AI applications out in the cloud, then move them on premise when the applications and costs warrant the investment. That may mean running a cloud provider’s infrastructure and its AI stack on site, or it may mean running a collection of AI frameworks and models woven together by the company itself or by a third party. Using versions of the frameworks and tools optimized for the target hardware as part of those workflows is an easy way for enterprises to get order-of-magnitude performance gains with minimal code changes. This is where the AI stacks of compute engine providers such as Intel and Nvidia come in.
AI will be part of every application. This is not some kind of whim. So the choice of AI platform is really important, and a tough call to make, even in 2021. Companies choose a database and the systems that run it on the understanding that both will last for decades. AI platforms will be the stickiest technology to hit the datacenter since the relational database. So being locked into a proprietary code base might prove a handicap going forward as new AI architectures continue to emerge.
That said, the kernels of independent AI stacks which could evolve into some kind of enterprise-grade AI stack are forming. Some providers are further down the road than others here, with some focusing on AI training and others on AI inference (and some on both).
This is by no means an exhaustive list, but the most significant emerging AI stacks from the hyperscalers and the clouds include:
- AWS SageMaker: Supports the MXNet, TensorFlow, Keras, and PyTorch frameworks, and has a Feature Store designed to work in real-time and batch modes that supports both training and inference workloads. SageMaker also includes 15 built-in algorithms for all kinds of workloads to help train models quickly, while the JumpStart feature has prebuilt applications and one-click deployment of more than 150 open source machine learning models. It includes automatic hyperparameter tuning for models using Bayesian or random search methods (see the sketch after this list) and interfaces with AWS Elastic Container Service or Elastic Kubernetes Service to scale training workloads.
- Microsoft Azure AI Platform: Supports the MXNet, TensorFlow, PyTorch, and Scikit-learn frameworks and uses Spark’s in-memory processing to accelerate performance. AI training scales out on the Azure Kubernetes Service container platform, and trained models are optimized for the ONNX Runtime, which serves as the machine learning inference engine.
- Google Vertex AI: A follow-on to the Google AI Platform, Vertex AI supports the TensorFlow, PyTorch, and Scikit-learn frameworks, and custom containers can be added to support other frameworks as needed. Training and comparison of AI models are done through AutoML. It also includes a Feature Store to share machine learning data across models and across training and inference workloads; Pipelines to build TensorFlow and Kubeflow pipelines that string together workflows as part of applications; and Vizier to optimize model hyperparameters. Workbench integrates Vertex AI with BigQuery, Dataproc, and Spark datastores and the applications that use them.
- In China, Alibaba’s Cloud Intelligence Brain and Baidu’s AI Cloud Machine Learning are available as full lifecycle AI platforms, but they do not yet have the sophistication and breadth of the tools that AWS, Microsoft, and Google offer.
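To make the hyperparameter tuning piece concrete, here is a minimal sketch using the SageMaker Python SDK and its Bayesian search strategy with a PyTorch training script; the IAM role, S3 path, entry point, metric regex, and parameter ranges are hypothetical placeholders, and the other cloud platforms expose comparable APIs.

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Estimator wrapping a user-supplied training script; the role, versions, and
# instance type are placeholders to be replaced with real account values.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",
    py_version="py38",
    hyperparameters={"epochs": 10},
)

# Bayesian search over learning rate and batch size, reading the objective
# metric from the training logs via a regular expression.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[{"Name": "validation:accuracy", "Regex": "val_acc=([0-9.]+)"}],
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-5, 1e-2),
        "batch_size": IntegerParameter(32, 256),
    },
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"training": "s3://example-bucket/training-data"})  # hypothetical S3 location
```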
All of these major cloud AI services include management of the complete AI lifecycle, one that each provider has already put into production at scale for its own internal use, with tools to collect and prepare data, to build models, to train and tune them, and to deploy them in production. All of them also provide integration with Jupyter Notebooks in some form or another, and many have AutoML features to automatically build, train, and tune models based on the dataset presented to them.
For enterprises that want to deploy their own AI stacks, there are plenty of places to start (where to start depends largely on the choice of AI compute engine). All of this software can be deployed in the public cloud and can be the foundation for a hybrid platform that extends from on premise out to the cloud if need be. As stated previously, updating parts of the workflow with frameworks and tools optimized for the organization’s compute engine leads to large performance gains for deployments from the edge to the cloud.
Candidates include:
- Nvidia AI Enterprise: Supports the TensorFlow and PyTorch frameworks as well as Nvidia’s own RAPIDS libraries for accelerating data science frameworks such as Spark. Includes the TensorRT inference runtime, the Triton inference server, and a slew of AI libraries, packaged to run in Kubernetes containers atop the VMware Tanzu platform. Can be deployed on premise, in the cloud, and at the edge, with fleet management services for edge use cases. Proprietary software components in the stack limit usage to Nvidia hardware.
- Intel AI: Supports several popular deep learning, classical machine learning, and big data analytics frameworks, including TensorFlow, PyTorch, Scikit-learn, XGBoost, Apache Spark, and others. Complementing the AI framework optimizations is a comprehensive portfolio of optimized libraries and tools for end-to-end data science and AI workflows (the oneAPI AI Analytics Toolkit), for deploying high-performance inference applications (the OpenVINO toolkit), and for scaling AI models to big data clusters (BigDL). Intel says its tools can be deployed across diverse AI hardware because they are built on the foundation of the oneAPI unified programming model; a sketch of this drop-in approach follows this list.
- Red Hat Open Data Hub: Combines the Red Hat OpenShift Kubernetes container controller with Ceph storage, AMQ Streams stream processing and a stack of open source machine learning frameworks and tools to create an integrated, open source AI stack.
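As a small example of the optimized-framework, minimal-code-changes idea referenced above, the sketch below assumes the Intel Extension for Scikit-learn (shipped with the oneAPI AI Analytics Toolkit) is installed; the dataset and model are synthetic placeholders, and the single patch_sklearn() call is the only change from a stock scikit-learn script.

```python
# Drop-in acceleration: patch scikit-learn before importing its estimators so
# that supported algorithms are routed to Intel's oneDAL-backed implementations.
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; a real workflow would load enterprise data here.
X, y = make_classification(n_samples=100_000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))
```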
It will be interesting to see how each of these AI stacks grows, and how others emerge, as machine learning and other forms of data analytics become ingrained in modern applications. This is just the beginning, though, and the way hyperscalers and compute engine providers create their own AI stacks will have a major influence on how independent stacks will have to develop.
Sponsored by Intel