Mainstreaming Machine Learning: Emerging Solutions
November 1, 2016 Thomas Dinsmore
In the course of this three-part series on the challenges and opportunities for enterprise machine learning, we have worked to define the landscape and ecosystem for these workloads in large-scale business settings and have taken an in-depth look at some of the roadblocks on the path to more mainstream machine learning applications.
In this final part of the series, we turn from the problems to the ways the barriers can be removed, both by leveraging the technology ecosystem around machine learning and by addressing a more difficult problem: how to implement the human side of machine learning in an organization. We start with solutions on the technology side, looking at performance and workflow.
Reducing Cycle Time
Logically, if we want to reduce the cycle time for machine learning radically, we should attack the most time-consuming tasks. As we noted previously, data scientists spend most of their time collecting and cleaning data, so it makes sense to focus effort on simplifying and expediting this task.
The data requirements for machine learning rarely align with existing data warehousing workflows. Consequently, data scientists tend to build “just-in-time” ETL workflows for each project. Open source Apache Spark, which offers scalable data processing and connectors to a wide range of data platforms, is well suited to this task. Commercial data blending tools, such as Alteryx, Tamr, and Trifacta, offer graphical user interfaces and a visual workflow metaphor, which makes it easier for business users to check data lineage.
Time needed to train models is not necessarily the gating factor affecting overall cycle time. Nevertheless, data scientists care very much about speed and throughput for machine learning software. Model development is an evolutionary process, and machine learning models improve with iteration; a 10X improvement in training speed allows roughly 10X as many experiments in the same amount of time, which generally leads to a more accurate model.
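The arithmetic behind the experiment count can be sketched in a few lines. The figures below (a 40-hour weekly budget, a 4-hour baseline training run) are illustrative assumptions, not numbers from any benchmark, and the function name is hypothetical.

```python
def experiments_in_budget(budget_minutes: int, minutes_per_run: int) -> int:
    """Number of complete training runs that fit in a fixed time budget."""
    if minutes_per_run <= 0:
        raise ValueError("minutes_per_run must be positive")
    return budget_minutes // minutes_per_run


# Assumed figures: 40 hours of compute time per week, 4 hours per baseline run.
BUDGET_MINUTES = 40 * 60
baseline_runs = experiments_in_budget(BUDGET_MINUTES, minutes_per_run=240)
# The same budget with training that is 10X faster (24 minutes per run).
faster_runs = experiments_in_budget(BUDGET_MINUTES, minutes_per_run=24)

print(baseline_runs, faster_runs)
```

The sketch counts experiments only; how much accuracy each additional experiment buys depends on the problem and is not modeled here.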
Given the importance of high performance, there are surprisingly few distributed machine learning software tools. As a rule, machine learning algorithms are not embarrassingly parallel; consequently, commercial vendors and open source projects must invest time and effort to rebuild algorithms so that they distribute workload across clustered servers. Among open source projects, Apache Spark, H2O, and XGBoost all support distributed processing. SAS currently has three different distributed architectures: SAS High Performance Analytics, SAS LASR Server, and the recently introduced SAS Viya platform. IBM never marketed a distributed machine learning platform, but it has invested heavily in Apache Spark, and offers push-down integration to Spark from its commercially licensed products. Alpine Data, KNIME, and RapidMiner pursue a similar approach.
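To illustrate why many algorithms are not embarrassingly parallel, here is a minimal single-process sketch of data-parallel gradient descent. It is not how any of the products named above are implemented; the partition layout, function names, learning rate, and toy data are all assumptions made for illustration. Each "worker" can compute its partial gradient independently, but every iteration must combine all partial results before the next one can begin.

```python
from typing import List, Tuple

Partition = List[Tuple[float, float]]  # (x, y) pairs held by one hypothetical worker


def partial_gradient(partition: Partition, w: float) -> Tuple[float, int]:
    """Sum of squared-error gradients for one partition, plus its row count."""
    grad = sum(2.0 * (w * x - y) * x for x, y in partition)
    return grad, len(partition)


def train(partitions: List[Partition], steps: int = 200, lr: float = 0.05) -> float:
    """Fit y = w * x by gradient descent over partitioned data."""
    w = 0.0
    for _ in range(steps):
        # Map step: each partition's gradient depends only on its own rows.
        partials = [partial_gradient(p, w) for p in partitions]
        # Reduce step: the update needs every partial result, so workers
        # must synchronize once per iteration before w can change.
        total_grad = sum(g for g, _ in partials)
        total_rows = sum(n for _, n in partials)
        w -= lr * total_grad / total_rows
    return w


# Toy data generated from y = 2x, split across two partitions.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = train(partitions)
print(round(w, 6))
```

In a real cluster the reduce step becomes network communication on every iteration, which is the engineering effort the paragraph above refers to.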
While distributed processing on commodity servers provides sufficient performance for business applications, deep learning for applications like speech recognition or image classification requires truly massive processing power. GPU-accelerated machines are gaining acceptance; NVIDIA has delivered several of its DGX-1 supercomputers, which offer the power of 250 conventional servers in a 34x17x5 inch box. Scientists are also giving a hard look to acceleration with field-programmable gate arrays (FPGAs); however, while deep learning software support for GPUs is readily available, that is not so for FPGAs. Nevertheless, Chinese internet giant Baidu recently announced plans to invest in FPGAs to support machine learning in its data centers.
Collaboration among contributors is essential for successful machine learning projects. Continuous engagement between machine learning experts and business stakeholders ensures that the team solves the right problem; involvement of IT personnel ensures rapid deployment of the machine learning solution. Rather than making it easy for business users to build models, it makes far more sense to recognize the different roles and personas of contributors to projects and develop role-based user interfaces. Data scientists, for example, may prefer a programming API, while business stakeholders need the capability to visualize and understand the behavior of a predictive model and developers need a readily accessible scoring API. Alpine Data and DataRobot offer products that support this kind of collaboration.
Defined workflows are another key to reducing time to value for machine learning projects. Project contributors need predictable policies for business, IT, and legal approvals: whose approval is required, the type of information necessary and timelines for review. Organizations should treat machine learning projects like any other development project, with clear project plans, defined roles, schedules, and accountabilities.
While most machine learning software available today can export models as program code or Predictive Model Markup Language (PMML), many organizations still rely on paper-and-pencil specifications to build scoring programs from scratch. This process takes time, introduces the possibility of errors, and requires investment in unit testing and auditing.
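For readers unfamiliar with PMML, the following is a rough sketch of what an exported model looks like, built with the Python standard library for a simple linear model. The function name, field names, and coefficients are hypothetical, and the output has not been validated against the PMML schema; in practice the export would come from the machine learning tool itself.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def linear_model_to_pmml(intercept: float, coefficients: Dict[str, float], target: str) -> str:
    """Serialize a linear regression as a minimal PMML-style document."""
    pmml = ET.Element("PMML", {"version": "4.2", "xmlns": "http://www.dmg.org/PMML-4_2"})
    ET.SubElement(pmml, "Header", {"description": "Illustrative linear model"})

    # Data dictionary: every input field plus the target.
    fields = list(coefficients) + [target]
    data_dict = ET.SubElement(pmml, "DataDictionary", {"numberOfFields": str(len(fields))})
    for name in fields:
        ET.SubElement(data_dict, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})

    model = ET.SubElement(pmml, "RegressionModel", {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", {"name": name})
    ET.SubElement(schema, "MiningField", {"name": target, "usageType": "target"})

    # The regression table carries the fitted parameters themselves.
    table = ET.SubElement(model, "RegressionTable", {"intercept": str(intercept)})
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor", {"name": name, "coefficient": str(coef)})

    return ET.tostring(pmml, encoding="unicode")


# Hypothetical model: score = 1.5 + 0.5 * tenure - 0.25 * spend
doc = linear_model_to_pmml(1.5, {"tenure": 0.5, "spend": -0.25}, target="score")
print(doc)
```

Because the document is machine-readable, a scoring engine can consume it directly, which is what removes the hand-recoding step described above.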
Automated model export only works if the data model for the machine learning development environment aligns with the data model for the planned production environment; manual recoding is necessary when the organization does not enforce this alignment. The problem is attributable, in part, to the interdepartmental handoff; data scientists may see model deployment as “someone else’s responsibility.” More efficient team collaboration, project management, and a directive to rely on standards-based model transfer are the best remedy.
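One lightweight way to catch misalignment before deployment is to compare the fields a model expects with those the production environment supplies. The sketch below assumes each schema can be expressed as a mapping from field name to type name; the function and the example schemas are hypothetical.

```python
from typing import Dict, List


def schema_mismatches(dev: Dict[str, str], prod: Dict[str, str]) -> List[str]:
    """List fields the model needs that are absent or typed differently in production."""
    problems = []
    for field, dtype in sorted(dev.items()):
        if field not in prod:
            problems.append(f"{field}: missing in production")
        elif prod[field] != dtype:
            problems.append(f"{field}: {dtype} in development, {prod[field]} in production")
    return problems


# Hypothetical schemas for the development and production environments.
dev_schema = {"customer_id": "string", "tenure_months": "int", "avg_spend": "double"}
prod_schema = {"customer_id": "string", "tenure_months": "double"}

issues = schema_mismatches(dev_schema, prod_schema)
for issue in issues:
    print(issue)
```

An empty result suggests the exported model can be scored as-is; any entry points to a field that would otherwise force manual recoding.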
Treating machine learning projects as application development projects necessarily leads to consideration of agile development techniques. Data scientists may not always see cycle time as a problem, opting instead to build the best possible machine learning model in the first pass. Long and labor-intensive deployment cycles exacerbate this tendency; when implementation takes weeks or months, there is little opportunity for early feedback. Radically reducing time to deployment opens the possibility of a different approach to machine learning, one that relies on the rapid production of a “minimum viable model,” deployed quickly, then improved incrementally based on in-market feedback.
The Skill Shortage
There are two possible ways to address the shortage of data scientists. One is to make data scientists more productive so that teams can accomplish more work without hiring more people. The second is to make more people into data scientists. This second approach implies changing the nature of the work so that more business users can perform machine learning tasks. Gartner has popularized the idea of the Citizen Data Scientist, a business user who can use machine learning but for whom advanced analytics is not the primary job role.
The premise of the Citizen Data Scientist idea is that machine learning software is too difficult for business users to learn. Working data scientists tend to prefer machine learning tools with programming APIs for languages like R, Python, Scala and Java; the extended functionality, utility, power and flexibility of these tools tend to override “ease-of-use” considerations. Committed data scientists are motivated to learn these tools, and tend to do so early in their careers.
Make machine learning more accessible, with a graphical “drag and drop” interface, and you will expand the pool of users, mitigate the data scientist shortage and deploy machine learning throughout the organization. So goes the argument put forward by industry analysts and commercial vendors who market drag-and-drop machine learning software, including Alteryx, Angoss, RapidMiner, and Statistica.
While everyone agrees that “easy to use” is better than “hard to use,” there are good reasons to be skeptical about claims that “self-service” machine learning will drive enterprise transformation. Such tools have existed for years; Angoss launched its software in 1984, and Integral Solutions introduced the product now branded as IBM SPSS Modeler in 1994. Of course, design standards for end-user software are different today from those in force twenty years ago, but the products themselves have evolved. The point is that if there is a hidden pool of citizen data scientists in organizations waiting for the right software to materialize, it is a very well hidden pool.
More fundamentally, it does little good to provide business users with a “drag and drop” interface unless they know what to drag and where to drop it. It is much harder to teach machine learning methodology and best practices than skills in specific tools. Consequently, products like DataRobot, SAP BusinessObjects Predictive Analytics, and SAS Factory Miner, which support end-to-end automation of the machine learning workflow, have more potential to expand the user base because they enforce compliance with best practices and ensure consistently reliable results.
Summary
In Part Two, we identified two significant challenges for enterprise machine learning: the shortage of people with machine learning skills and the long time to value for machine learning projects.
There are two ways to address the lack of individuals familiar with machine learning: add more people or make the process more efficient. While industry analysts and commercial vendors suggest that organizations provide simpler machine learning tools to business users, some caution is advisable. Machine learning is a complex and powerful tool; inaccurate or biased models can do more harm than good. Business user tools that offer full workflow support and enforce best practices are superior to those that just provide an interface that is easy to use.
Many different factors contribute to long machine learning project timelines, and solutions should target tasks with the greatest impact on the overall schedule. Based on data scientist reports about how they use their time, data collection and cleaning is the binding constraint; better and more powerful data blending tools can help. Improved policies and practices may reduce the time needed to approve and deploy models. There are also software and hardware solutions that deliver improved performance and throughput for the machine learning process itself.
Thomas W. Dinsmore is an independent consultant and author, specializing in enterprise analytics. Thomas provides clients with intelligence about the analytics marketplace: competitor profiling, requirements assessment, product definition and communications.
Before launching his consultancy in 2015, Thomas served as an analytics expert for The Boston Consulting Group; Director of Product Management for Revolution Analytics (Microsoft); Solution Architect for IBM Big Data (Netezza), SAS and PriceWaterhouseCoopers. He has led or contributed to analytic solutions for more than five hundred clients across vertical markets and around the world, including AT&T, Banco Santander, Citibank, Dell, J.C.Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, and Vodafone. Thomas’ new book, Disruptive Analytics, published by Apress, is available on Amazon. He co-authored Modern Analytics Methodologies and Advanced Analytics Methodologies for FT Press and served as a reviewer for the Spark Cookbook.