We have been convinced for many years that machine learning, the kind of artificial intelligence that actually works in practice, not in theory, would be a key element of the next platform. In fact, it might be the most important part of the stack. And therefore, those who control how we deploy machine learning will, to a large extent, control the nature of future applications and the systems that run them.
Machine learning is the killer app for the hyperscalers, just like modeling and simulation were for supercomputing centers decades ago. We believe we are only seeing the tip of the machine learning iceberg as Google, Facebook, Baidu, Amazon, Microsoft, and other titans of the Internet, which have enough data to make machine learning not only practical but necessary, build out their expertise and embody it in the development and production platforms that support their empires.
Google’s first machine learning platform, called DistBelief, was rolled out in 2011 and used to train deep neural networks using tens of thousands of CPU cores across thousands of servers. By its own admission, DistBelief was difficult to use and tied very tightly to Google’s own infrastructure, and so the company created a better, more generic machine learning platform called TensorFlow, which was unveiled and open sourced last November. The TensorFlow framework is used not only to train various kinds of machine learning algorithms, using either CPUs or hybrid CPU-GPU systems, but also to push finished machine learning models into production applications.
Inside Google, the adoption of machine learning algorithms is on a hockey stick rise, and it seems apparent to us that only the largest organizations in the world with the most experts and, more importantly, the most data, will be able to perfect algorithms. We expect that Google, Microsoft, Amazon, and others will try to make a business out of selling their machine learning expertise as services, and will be more than happy to store your data and provide processing against it for machine learning add-ons to applications. But machine learning is not just another way to make their platforms more sticky; it is the only way that correlations between certain kinds of data can be made. There are not enough people in the world to do instantaneous language translation or automatic photo identification at Internet speed, and soon machine learning will be applied to all kinds of processes that currently require people.
Google wants the world to think its way, which is why it is open sourcing tools such as Kubernetes for container management (inspired by its own Borg cluster manager), heavily supporting projects like Prometheus for systems monitoring (inspired by its own Borgmon tool), and opening up TensorFlow pretty much as it was being put into production rather than waiting. Google has learned from Facebook the benefit of opening up its technologies and letting the industry collaborate on them, rather than just showing off in research papers and letting the industry create a variant of a technology that is incompatible with Google, as happened with Hadoop and its variant of the MapReduce framework and Hadoop Distributed File System.
Facebook has its share of techies with deep insight into modern infrastructure and sophisticated application platforms, and it, too, is competing for mindshare and market share among the world’s best software and hardware engineers because it can only attract the best and brightest by showing that it is a place where the best and brightest work. The best way to do that, as far as Facebook is concerned, is to keep a steady beat of innovation coming out of its infrastructure and into the open source community.
Facebook doesn’t seem inclined to sell machine learning as a service or raw capacity to support it, but it certainly wants to embed machine learning deeply wherever it is appropriate in its social network to boost its revenues and profits and to get the industry to adopt its AI framework rather than that of Google or anyone else.
Telling people about its own machine learning platform, called FBLearner Flow, or just Flow for short, is the first step in what we expect will be the eventual opening up of the technology. Like so many things at the hyperscalers, Flow is about masking the underlying complexities of infrastructure and automating as many of the tasks of using that technology as possible, so that software engineers can focus on their code and algorithms and not on that infrastructure. This, according to Jeffrey Dunn, who worked at Microsoft on machine learning algorithms for click prediction before moving to Facebook five years ago, is the driving force behind Flow, which was created in 2014 to be a better workbench for AI than the prior systems in use at the social network.
“In some of our earliest work to leverage AI and ML – such as delivering the most relevant content to each person – we noticed that the largest improvements in accuracy often came from quick experiments, feature engineering, and model tuning rather than applying fundamentally different algorithms,” Dunn explains in a blog post unveiling the Flow tool. “An engineer may need to attempt hundreds of experiments before finding a successful new feature or set of hyperparameters. Traditional pipeline systems that we evaluated did not appear to be a good fit for our uses – most solutions did not provide a way to rerun pipelines with different inputs, mechanisms to explicitly capture outputs and/or side effects, visualization of outputs, and conditional steps for tasks like parameter sweeps.”
So Facebook took a step back, started from scratch, and created Flow so that machine learning algorithms could be created once and used by all software engineers at the company. About a quarter of them have embedded some AI in their applications to date, and usage is growing sharply. The uses include ranking News Feed stories, highlighting trending topics, filtering offensive content, and ranking search engine results.
Just as important as letting engineers consume machine learning algorithms more easily is the way that Flow helps automate training: it runs many slightly different variations of an algorithm across many machines, shows the outcomes, and lets engineers see which approaches worked and which didn’t. Seeing what doesn’t work is sometimes as important as finding what does.
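To make the idea of a parameter sweep concrete, here is a minimal sketch in Python of what running many slightly different variations and recording every outcome might look like. It is not Flow's API, which Facebook has not published; the function names `train_and_score` and `sweep`, the grid values, and the toy scoring formula are all hypothetical stand-ins, and a thread pool on one machine stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(learning_rate, depth):
    """Hypothetical stand-in for a real training run; returns a made-up score."""
    # Toy objective that peaks at learning_rate=0.1, depth=6 (illustrative only).
    return 1.0 - abs(learning_rate - 0.1) - 0.01 * abs(depth - 6)


def sweep(grid):
    """Run one experiment per parameter combination and keep every outcome."""
    names = sorted(grid)
    combos = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    # Experiments are independent, so they can run side by side.
    with ThreadPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(lambda params: train_and_score(**params), combos))
    # Keep the poor runs too, sorted best first, so they can be inspected.
    return sorted(zip(scores, combos), key=lambda pair: pair[0], reverse=True)


GRID = {"learning_rate": [0.01, 0.1, 0.5], "depth": [4, 6, 8]}
RESULTS = sweep(GRID)

for score, params in RESULTS:
    print(f"{score:.3f}  {params}")
```

The point of the sketch is the shape of the work rather than the numbers: each combination is an independent experiment, and the full table of results, including the combinations that scored badly, is what the engineer gets to look at.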
Like Borg Meets Eclipse For AI
Another important aspect of the Flow platform is that it supports workflows that employ different machine learning techniques, including neural networks, gradient boosted decision trees, LambdaMART, stochastic gradient descent, and logistic regression. Equally importantly, Facebook has put together scalable infrastructure on which to run these tools for training models, and a means of deploying them through API calls on its production infrastructure. We suspect that the company’s open source “Big Sur” GPU platform, which has been contributed to the Open Compute Project, is a big part of the Flow training hardware, but Facebook does not say.
What Dunn did say is that the Flow platform scales to run thousands of simultaneous experiments and that since its inception more than a million models have been trained. The back-end production prediction system that makes use of these models (and that presumably uses a mix of systems that employ CPUs or low end Tesla M4 GPU accelerators to run the finished routines to do photo identification, speech to text, and other algorithms) currently handles 6 million predictions per second across the four Facebook datacenters in production today. To give you a sense of how much work this Flow machine learning training system does, Dunn says that in April, a cluster with several thousand machines was able to do more than 500,000 unique workflows (which are runs of machine learning training using specific algorithms and datasets). The Flow system ingests trillions of datapoints from the Facebook application stack and elsewhere each day, trains thousands of models (it can do it offline or in real-time, as necessary), and deploys finished algorithms to the production Facebook server fleet, which is probably on the order of several hundred thousand machines at this point. (Maybe a half million boxes, or half what Google, Microsoft, and Amazon Web Services have.)
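For a rough sense of scale, the two headline figures can be turned into daily rates with a few lines of Python. This is back-of-envelope arithmetic only: it assumes the 6 million predictions per second is sustained around the clock, treats "more than 500,000" workflows as exactly 500,000 spread evenly over April's 30 days, and uses variable names of our own choosing.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_IN_APRIL = 30

predictions_per_second = 6_000_000  # figure quoted above
workflows_in_april = 500_000        # "more than 500,000", so a lower bound

# Assumption: the per-second rate holds for the whole day.
predictions_per_day = predictions_per_second * SECONDS_PER_DAY

# Assumption: workflows are spread evenly across the month.
workflows_per_day = workflows_in_april / DAYS_IN_APRIL
workflows_per_hour = workflows_per_day / 24

print(f"{predictions_per_day:,} predictions per day")
print(f"{workflows_per_day:,.0f} workflows per day")
print(f"{workflows_per_hour:,.0f} workflows per hour")
```

Under those assumptions, that works out to roughly 518 billion predictions a day on the production side, and something like 16,700 training workflows a day, or nearly 700 an hour, on the training cluster.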
Even with all of that iron, machine learning training does not scale well beyond a single node and scales pretty poorly across more than a few nodes, and therefore some of these training sessions can take days to complete. This is one reason why hyperscalers like Facebook, Google, and Baidu, which are big users of GPUs to accelerate machine learning, are so keen on the new “Pascal” Tesla P100 accelerators and their NVLink interconnect, which offer a 12X performance bump compared to systems employing earlier Maxwell-based Tesla M40 cards for machine learning work.
“We are focused on improving the efficiency of executing these experiments to ensure that the platform can scale to the growing demand while simultaneously keeping execution latency to a minimum,” says Dunn. “We’re exploring new solutions around data locality to co-locate computation with the source data, and we’re improving our understanding of resource requirements to pack as many experiments onto each machine as possible.”
Facebook has also created an overlay for Flow, called AutoML, which it has said very little about publicly except that it can leverage idle cycles on its server fleet to improve its machine learning models. A report in Wired suggests that AutoML is a bit more than that and describes it as a means of automatically cleaning datasets before processing and also helping engineers optimize the training of machine learning models by turning AI on Flow itself. The Facebook Applied Machine Learning team, which does production AI work for the social network, has 150 people on it creating the Flow AI platform and maintaining the machine learning workflows, but maybe in the future it won’t even need that many when the AI is driving the AI.
Facebook will still need people to surf the site and see ads, of course. So humans are not completely expendable.
It will be interesting to see what happens when Facebook open sources Flow and how it will co-exist with TensorFlow and other open source machine learning frameworks.