Making AI Run At Any Scale But Not At All Costs

AI is arguably the most important kind of HPC in the world right now in terms of providing immediate results for immediate problems, particularly for enterprises with lots of data and a desire to make money in a new economy that no longer fits the models and forecasts drawn up before the coronavirus pandemic.

No surprise, then, that all of the tech vendors are trying to cash in, developing products and services aimed at making AI more accessible to their many enterprise customers, allowing them to take advantage of modern AI technologies without having to invest in expensive infrastructure or in the rare and expensive people who know how to build and run that infrastructure in practice.

At The Next Platform, we have been writing about such efforts for several years, as companies like Lenovo, Dell, Hewlett Packard Enterprise, and Cray (now part of HPE) have rolled out offerings designed to lower the bar of entry for organizations to technologies that have for a long time been the domain of large and well-resourced behemoths like Google, Amazon and Microsoft.

Most recently, HPE unveiled its Machine Learning Development System, pulling together accelerated hardware and software to create an AI stack that includes HPE systems, AMD chips, and Nvidia GPUs and is designed to enable enterprises to build and train AI models at scale. The effort was driven by HPE's acquisition last year of startup Determined AI.

Last month, Lenovo talked about its Lenovo Intelligence Computing Orchestrator, or LICO, an expansion of its Antilles HPC infrastructure management software tool to include AI workloads.

Most such initiatives are designed to help enterprises bypass the costs and complexities that come with AI training and inferencing, such as building and managing the infrastructure, particularly as it becomes more distributed as the models get larger and more compute is needed.

It was an issue a couple of graduate students, Robert Nishihara and Philipp Moritz, and a computer science professor, Ion Stoica, at the University of California Berkeley encountered while conducting research into AI and machine learning at the school’s RISELab.

AI is going to continue to grow in importance, Nishihara (left in the photo above) tells The Next Platform, “but the challenge is that AI is incredibly computationally intensive. You know about the scale often required to do AI, so if you’re doing AI, there’s this incredible need for scaling applications across many machines.”

While at the RISELab, they launched an open-source project called Ray, with the first lines of code written in early 2016. Ray was meant to help the group address the challenges they were having scaling their AI applications, but the three soon determined that the tools they were creating through the project could help companies facing similar problems.

They open sourced Ray as a Python framework for running distributed AI computing projects in the cloud. In 2019, Anyscale, a company dedicated to the development of Ray and to providing commercial-grade support for it, was launched. The framework includes a serverless compute API and associated libraries. Anyscale offers a cloud platform and managed service that delivers tools and support around Ray so that enterprises don't need to build or manage the underlying platform or develop deep infrastructure expertise.

“From your perspective as a developer and as the Ray user, you’re just writing Python,” says Nishihara, who is Anyscale’s chief executive officer. “You are basically writing a normal Python application. What Ray will do is help you work across the different machines in a cluster. It will handle the scaling. It will handle fault tolerance, like machine failures. It will handle moving the data around. Those are some of the kinds of problems that Ray solves.”
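To make that concrete, here is a minimal sketch of Ray's core task API, the kind of "normal Python" Nishihara describes: an ordinary function becomes a distributed task with a decorator, and Ray schedules the work across whatever cluster is available. The function and values are our own illustration, not Anyscale's.

```python
import ray

ray.init()  # start (or connect to) a Ray cluster; defaults to the local machine

@ray.remote
def square(x):
    # an ordinary Python function, turned into a distributed task by the decorator
    return x * x

# launch eight tasks in parallel; Ray schedules them across available workers,
# moves data between them, and retries tasks if a worker dies
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same code runs unchanged on a laptop or a multi-node cluster; only the resources behind ray.init() change, which is the portability Anyscale is selling.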

Stoica, co-founder, president, and executive chairman of Anyscale (center in the photo above), says the computing demands for machine learning applications are growing beyond what can be done on a single GPU node and outpacing the AI and infrastructure expertise and resources of enterprises.

“Several years ago, the largest model required one single GPU,” says Stoica, who also is co-founder and executive chairman at database vendor Databricks and still a professor at UC Berkeley. “Today, you may need thousands of GPUs. You also do need to scale the workloads, which is complex. You need to deal with the infrastructure, you need to manage the infrastructure. Also, for AI to be successful, you need to build these end-to-end applications and have better AI in existing applications. There are many AI workloads, you need to do the processing, you need to train them, you need to optimize the model, you need to serve the model.”

Anyscale says that compute requirements for AI applications are growing rapidly, doubling every 3.5 months. The myriad steps necessary for creating machine learning applications – from preprocessing and training to hyperparameter tuning, testing, and serving – now need to scale, Stoica says. Each step normally would need its own system in a distributed environment, and those systems have to be linked together.

“If you want to scale, it becomes extremely complicated,” Stoica says. “In some sense, we want to provide the same experience like if you’re developing, deploying, and managing an application on your laptop. We want to make it easy. We want to abstract away the infrastructure for the developers, for the users.”
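As one illustration of how Ray folds a pipeline step into that single experience, here is a hedged sketch of the hyperparameter-tuning step using Ray Tune's classic tune.run interface; the toy objective function and learning-rate grid are invented for the example.

```python
from ray import tune

def train_model(config):
    # stand-in for a real training loop; reports one score per trial
    score = -(config["lr"] - 0.1) ** 2
    tune.report(score=score)

# Tune launches one trial per configuration and runs them in parallel
# as Ray tasks across the cluster
analysis = tune.run(
    train_model,
    config={"lr": tune.grid_search([0.01, 0.1, 1.0])},
)
print(analysis.get_best_config(metric="score", mode="max"))
```

The point is the one Stoica makes: the tuning step that would otherwise need its own dedicated system becomes a library call on the same cluster that handles training and serving.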

Anyscale’s platform, which can run on such clouds as Amazon Web Services and Google Cloud Platform, offers enterprises a fully managed infrastructure and takes care of such tasks as serverless autoscaling, running and monitoring jobs, and tracking the costs related to jobs, clusters, and users. The platform also includes developer tools such as APIs and SDKs.

The startup, which has almost 100 employees, comes onto the $13 trillion AI market with a fair amount of money behind it. Between 2019 and 2021, the company raised $160 million, including $100 million in December 2021. At the same time, Ray is being embraced by thousands of organizations, including Uber, LinkedIn, Shopify, Visa, OpenAI, and Amazon.

Anyscale continues to drive Ray’s development, though the project does have an Apache license and Stoica says there are about 700 contributors. It has 18,000 GitHub stars.

When asked what the company will do with the cash it has raised, Nishihara points to what the company unveiled this week at its Ray Summit 2022 show in San Francisco. On the open-source side, Anyscale is introducing Ray 2.0, which includes Ray AI Runtime (AIR), a new runtime layer for machine learning applications and services that is cloud- and framework-agnostic and interoperates with such frameworks as PyTorch, TensorFlow, and Hugging Face, as well as the KubeRay toolkit, which improves the execution of Ray applications on Kubernetes.
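For a sense of what that framework interoperability looks like, here is a hedged sketch in the Ray 2.0 AIR style, where an existing PyTorch training loop is wrapped in a framework-specific trainer and scaled by configuration; the tiny linear model, random data, and hyperparameters are invented for illustration.

```python
import torch
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    # each worker runs this loop; AIR sets up the process group and placement
    model = prepare_model(torch.nn.Linear(10, 1))  # wraps model for data-parallel training
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        x = torch.randn(32, 10)          # invented random data for the sketch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        session.report({"loss": loss.item()})  # stream metrics back to the driver

# scaling the same loop from one worker to many is a ScalingConfig change
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"lr": 0.01, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4),
)
result = trainer.fit()
print(result.metrics)
```

Swapping PyTorch for TensorFlow or a Hugging Face trainer is meant to change the trainer class, not the surrounding orchestration code.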

Anyscale developed KubeRay in collaboration with IBM and ByteDance, with Nishihara saying that much of the latest iteration of Ray came from the company’s work with organizations that use Ray and give feedback. “The goal of Ray 2.0 is really distilling all of these lessons from working with these thought leaders, these companies, and putting that into a form where all the other companies can succeed with it,” Nishihara says.

On the commercial side, Anyscale is rolling out the next generation of its ML Workspace to help developers scale applications from prototype to production and debug them along the way. The workspace also integrates with such machine learning tools as Weights & Biases and Arize AI, letting developers scale applications while using top MLOps tools. In addition, Anyscale’s Enterprise-Ready Platform, now in preview, delivers security for cluster connectivity and enterprise-managed virtual private clouds as well as auditing, monitoring, and cost management tools.

Scaling AI applications “is really hard,” Nishihara says. “Building scalable applications really requires a lot of expertise and we want to get to the point where, if you know Python, that’s enough. You can build these scalable applications [and] you can succeed with AI. You can do the kinds of things Google does, but you just need to know Python. That’s what we’re trying to get to. We’re right at the start.”
