Cloudy Machine Learning For The Masses
April 10, 2015 Timothy Prickett Morgan
If there is one lesson that the big three public cloud providers teach, it is that there is no substitute for breadth and depth in software engineering. And ironically, as each one presents yet another new service that simplifies the life of IT operations, the rest of the world that then becomes dependent on these services and does not do its own thinking not only becomes a bit dumber, but also that much more dependent.
It’s a brilliant business model, if you really consider it. And now that the big three clouds – Amazon Web Services, Microsoft Azure, and Google Cloud Platform – all have machine learning services, culled from their own experiences in peddling products and pushing ads over many years, that are fairly inexpensive, it is hard to imagine that millions of companies, of all sizes and around the world, won’t eventually give these machine learning services a whirl. It sure beats trying to code it yourself and figuring out how to accelerate it with GPUs or FPGAs.
It is hard to draw a line between machine learning and predictive analytics, but all of these services lean a little more towards predictive analytics than they do towards the heavy-duty machine learning that is used to identify objects and people in images or video and write a description of what they are, or that powers the Siri and Cortana personal assistant services from Apple and Microsoft, or that controls a self-driving car. Machine learning sounds a lot cooler, to some ears at least, than does predictive analytics, and hence the term is bleeding over. No matter what you call it, the new machine learning services from Amazon, Google, and Microsoft are definitely going to give predictive analytics software companies like SAS Institute, IBM, and Oracle a run for the money. Once again, services based on data and compute time fees and appealing to ease of use are pitting themselves against tried-and-true, best-of-breed analytics software with decades of evolution and use.
Amazon is the latest to trot out its machine learning services, and did so at its AWS Summit in San Francisco this week. Amazon Machine Learning (AML) is based on the company’s own experiences with predictive analytics, which it has been dabbling with since it started out as an online bookseller in the dot-com era. Having invested untold sums in creating systems for supply chain management, fraud detection, and click prediction – this is a big one for online retailers – Amazon is now exposing its data visualization, machine learning modeling, and predictive analytics tools to the rest of the world through the AWS cloud. (Amazon uses machine learning to tell workers how to unpack a truck in the most expedient way possible to get the books into its warehouses and flowing back out to customers in other trucks, where ML is used to pack them.) All developers within Amazon have access to the ML stack and can embed it in their applications.
If AML seems a bit like giving away the online store (candy or otherwise), you can bet that whatever ML algorithms AWS is giving away, Amazon has kept some of the real jewels for itself. This is ever the way with the hyperscale titans.
AML starts with data, of course, and it is designed to train against and do predictions against datasets that are no more than 100 GB in size. The data can be resident in the Relational Database Service with a MySQL backend, the S3 object store, or the Redshift data warehousing service. The latter two offer customers petabyte-scale storage if they want it, and it stands to reason that as customers embrace Amazon Machine Learning and try to train better models against larger datasets – more data is better than tweaking algorithms to create better predictive models faster, after all – AWS will lift that 100 GB ceiling on dataset size. AWS says that the service does not actually pull data out of MySQL or Redshift, but rather uses the results of a query executed against those services. Any other data customers might want to use in their models can be stored in a CSV file and sucked into S3. AWS has data visualization tools to help show where fields are missing data in a dataset, and if 10 percent of the records in a dataset fail, then the machine learning service stops the model because the predictions generated won’t be any good.
The ML service goes through the data and builds machine learning models, and they can be fine-tuned given more or better data sources, by making multiple passes over the data or applying different levels of regularization to the data. (Exactly how this all works is a bit of a mystery, and so by intention.) The idea is to train a model on a dataset and then use it to make predictions based on new data streaming in. There is a batch API to have AML go through the whole dataset at one time and make predictions all at once, or you can use a real-time API to make predictions on demand for specific parts of data or specific predictions. AML can answer a prediction request in about 100 milliseconds, which the company says is fast enough for web, mobile, and desktop applications; the IP address endpoint where a model sits on the AWS cloud can drive about 200 transactions per second. Amazon says that the AML service can be used to make billions of predictions per day, in real time, and it knows this because a variant of the service that has been running inside the online retailer is making more than 50 billion predictions per week (product recommendations and so forth) for the Amazon retail business.
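To put the throughput figures in perspective, here is a back-of-the-envelope sketch in Python. It takes the 200 transactions per second per endpoint quoted above at face value and assumes, purely for illustration, that endpoints scale linearly and run flat out around the clock; Amazon has not said how the service actually scales, and the function names are made up for this example.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
ENDPOINT_TPS = 200  # per-endpoint rate quoted in the article


def predictions_per_day(endpoints=1, tps=ENDPOINT_TPS):
    """Daily ceiling if every endpoint ran at full rate all day (an idealization)."""
    return endpoints * tps * SECONDS_PER_DAY


def endpoints_needed(predictions, seconds, tps=ENDPOINT_TPS):
    """Smallest endpoint count that covers `predictions` within `seconds`."""
    return math.ceil(predictions / (tps * seconds))


# One endpoint, one day, at the quoted rate.
print(predictions_per_day())
# Endpoints implied by a billion predictions per day.
print(endpoints_needed(1_000_000_000, SECONDS_PER_DAY))
# Endpoints implied by the 50 billion per week cited for Amazon's retail business.
print(endpoints_needed(50_000_000_000, 7 * SECONDS_PER_DAY))
```

Under those assumptions a single endpoint tops out in the tens of millions of predictions per day, so the "billions per day" claim implies fanning a model out across many endpoints, or leaning on the batch API.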
One last thing: AML is sticky. You cannot export your machine learning models out of the AML service, and you cannot import any ML models created elsewhere into the AML service.
Target scenarios for AML are what you would expect: fraud detection, demand forecasting, predictive customer support, and web click prediction. Customer service is another area, and the AML service could be used to analyze customer feedback from emails, forums, and telephone support transcripts to recommend corrective actions to product engineering and service teams as well as to connect new customers with similar issues to the appropriate customer support technicians who know about the problem and how to solve it.
As for pricing, the AML service is pretty straightforward. You pay for the storage on S3, RDS, or Redshift for the data. It costs another 42 cents per hour for AML to chew on that data to make the model; obviously, the more passes you make, the more money you spend. There is a supplemental charge for predictions on top of that, with batch predictions costing 10 cents per 1,000 predictions (rounded up to the nearest 1,000) and real-time predictions costing 1/10,000th of a penny each plus an additional 1/1,000th of a penny for each 10 MB of reserved memory provisioned for the ML model as it is running. To make around 1 million batch predictions from a model that takes about 20 hours to build will cost just north of a hundred bucks.
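The arithmetic behind that last figure can be sketched in a few lines of Python. This is only an estimate built from the rates quoted above: it covers model building and batch predictions, leaves out storage and real-time fees, and uses function and constant names invented for the example.

```python
import math

# Rates quoted in the article, in US dollars.
TRAINING_RATE_PER_HOUR = 0.42
BATCH_RATE_PER_BLOCK = 0.10  # per block of 1,000 batch predictions
BATCH_BLOCK_SIZE = 1000


def aml_batch_cost(training_hours, predictions):
    """Estimate model-building plus batch prediction fees, excluding storage."""
    # Batch predictions are billed in blocks of 1,000, rounded up.
    blocks = math.ceil(predictions / BATCH_BLOCK_SIZE)
    return training_hours * TRAINING_RATE_PER_HOUR + blocks * BATCH_RATE_PER_BLOCK


# The scenario from the text: 20 hours of model building, 1 million predictions.
print(f"${aml_batch_cost(20, 1_000_000):.2f}")
```

The prediction fee dominates here: the 20 hours of model building account for well under a tenth of the total.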
Google Jumped Out First
Google aims its Prediction API service at similar targets as AML and Azure ML, and it has been available since the fall of 2011 as a part of its App Engine platform cloud.
The training dataset size for the Prediction API is capped at 2.5 GB, and files are loaded into the Google Storage service. Google says that it usually takes from minutes to a couple of hours to train a model. Once the model is trained and is running against new data, it takes on the order of 200 milliseconds to generate a prediction.
Google’s freebie service allows data scientists to do 100 predictions per day and train against a mere 5 MB of data per day; there is a lifetime cap of 20,000 predictions. For the paid Prediction API service, Google has a minimum $10 per month fee, which covers up to 10,000 predictions, and it costs 50 cents per 1,000 predictions after that. It costs 2/10ths of a cent per MB for training datasets, plus additional fees for streaming updates into the dataset. The for-fee service has a cap of 2 million predictions per day, and Google wants to be notified if data scientists want to get above 40,000 predictions per day. Charges do not reflect the Google Storage costs for live datasets.
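A rough monthly bill for the paid tier can be worked out the same way. The Python sketch below uses only the figures quoted above and makes two simplifying assumptions that Google's terms may not match: overage is charged pro rata rather than in whole blocks of 1,000, and streaming update and storage fees are ignored. The function name is hypothetical.

```python
MONTHLY_MINIMUM = 10.00       # dollars, covers the first 10,000 predictions
INCLUDED_PREDICTIONS = 10_000
OVERAGE_PER_1000 = 0.50       # dollars per 1,000 predictions beyond the included ones
TRAINING_PER_MB = 0.002       # dollars, i.e. 2/10ths of a cent per MB


def prediction_api_monthly_cost(predictions, training_mb):
    """Estimate a month's fees, excluding streaming updates and storage."""
    extra = max(0, predictions - INCLUDED_PREDICTIONS)
    # Assumption: overage is billed pro rata, not rounded up to whole thousands.
    overage = extra * OVERAGE_PER_1000 / 1000
    return MONTHLY_MINIMUM + overage + training_mb * TRAINING_PER_MB


# Example: 100,000 predictions in a month after training on 500 MB of data.
print(f"${prediction_api_monthly_cost(100_000, 500):.2f}")
```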
Microsoft Azure Imports ML Expertise From Amazon
When Microsoft wanted to build its own machine learning stack and expose it as a service on its Azure cloud, it went straight to the top and hired away Joseph Sirosh from Amazon in July 2013. (The commute didn’t change much for him, presumably.)
Microsoft’s Azure ML service went into beta last June, and includes many of the ML algorithms that the company uses to run its Bing search engine and Xbox gaming sites. Microsoft also allows for algorithms written in the open source R statistical language and in Python to be woven into the Azure ML stack, and developers can share the ML algorithms they create for free through a gallery and for a fee through a marketplace. The recent acquisition by Microsoft of Revolution Analytics, which has radically boosted the performance of the R statistics engine, no doubt will help bolster the Azure ML service.
Carnegie Mellon University is a customer, and is using Azure ML to do predictive maintenance on its facilities, and ThyssenKrupp is using the service to do a similar task on the elevators it installs in skyscrapers. Pier 1 Imports is another customer using Azure ML, and in this case it is using the service to do predictive modelling of customer purchases.
Microsoft is only peddling the Azure ML service out of its South Central US region at the moment. There is a free tier that comes with a maximum of 100 modules per experiment. (A module is an algorithm or a source of data or a data transformation operation in Azure ML speak.) The Machine Learning Studio tool that is part of the service can train on a dataset that is 10 GB or smaller, but predictive analytics can be run against a Hive data warehouse layer running on hosted HDInsight Hadoop services or against queries from the Azure SQL Database service. If you have a dataset larger than 10 GB, you can partition it and then run training sessions on the pieces and merge the results. The freebie version of the Azure ML service also caps out with a 1 hour maximum on the dataset training (what Microsoft calls an experiment) and has a maximum of 10 GB of storage space; it runs on a single node and with the staging API to the web throttled back.
The standard Azure ML service, which has a fee associated with it, has an unlimited number of modules, runs on multiple nodes, and doesn’t have API caps. Azure ML costs $9.99 per seat per month for data scientists, plus $1 per hour for model training and then $2 per compute hour to feed results out to APIs for application integration plus 50 cents per 1,000 API transactions. You have to pay for your larger dataset storage as well, of course, just like on Amazon Machine Learning.
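Because the Azure ML bill has four moving parts, a small Python sketch helps show how they add up. It uses only the rates quoted above, assumes API transactions are charged pro rata rather than in whole blocks of 1,000 (the article does not say), and leaves out dataset storage; the function and parameter names are invented for this example, as is the workload.

```python
SEAT_PER_MONTH = 9.99          # dollars per data scientist seat
TRAINING_PER_HOUR = 1.00       # dollars per hour of model training
API_COMPUTE_PER_HOUR = 2.00    # dollars per compute hour serving results to APIs
API_PER_1000_TRANSACTIONS = 0.50


def azure_ml_monthly_cost(seats, training_hours, api_compute_hours, api_transactions):
    """Estimate a month's standard-tier fees, excluding dataset storage."""
    return (
        seats * SEAT_PER_MONTH
        + training_hours * TRAINING_PER_HOUR
        + api_compute_hours * API_COMPUTE_PER_HOUR
        # Assumption: transactions are billed pro rata, not in whole thousands.
        + api_transactions * API_PER_1000_TRANSACTIONS / 1000
    )


# Hypothetical team: 2 seats, 10 training hours, 50 API compute hours,
# and 100,000 API transactions in the month.
print(f"${azure_ml_monthly_cost(2, 10, 50, 100_000):.2f}")
```

In a workload like this one, the API compute hours and transaction fees outweigh the seat and training charges, which is worth keeping in mind when comparing against the per-prediction pricing at Amazon and Google.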
IBM SoftLayer and Cognos/SPSS, your turn. SAS Institute already has its own SaaS analytics, but could partner to get wider exposure, particularly with any of the big public clouds and maybe even smaller players like Rackspace Hosting. That said, Rackspace increasingly likes open software, so a SAS partnership might not make sense. But grabbing the open source R tools and maybe Apache Mahout or Spark MLlib for Hadoop and crafting its own ML service might.