Major Roadblocks on the Path to Machine Learning
October 31, 2016 Thomas Dinsmore
In part one of this series last week, we discussed the emerging ecosystem of machine learning applications and the promise they hold. But as with any growing application area (to be fair, machine learning is not new), there are bound to be some barriers.
Even in analytically sophisticated organizations, machine learning often operates in “silos of expertise.” For example, the financial crimes unit in a bank may use advanced techniques to detect money laundering; the credit risk team uses completely different and incompatible tools to predict loan defaults and set risk-based pricing; while treasury uses still other tools to predict cash flow. Meanwhile, customer service and branch operations do not use machine learning at all because they lack the critical mass of specialists and software.
Often these departmental teams do not collaborate with one another, which makes it difficult for the organization to establish standards for people, processes, and technology. The patchwork collection of software raises the total cost of ownership (TCO) of machine learning for the entire company. Moreover, siloed teams make it hard for executives outside of the “silos” to get started with machine learning.
To support digital transformation, machine learning must do three things well:
- Radically transform enterprise business processes: marketing, sales, finance, human resources, supply chain, and so forth.
- Support data, users and workload at an enterprise scale.
- Integrate with enterprise technology stacks.
The examples of Carolina Healthcare System, Cisco, and PayPal, presented in part one, illustrate the potential for machine learning to transform a business process. In many enterprises, this transformation is still in its early stages. From the platform architecture perspective, machine learning must integrate easily with the software platforms that support business processes, support many users with diverse backgrounds, and support many projects.
Scaling to enterprise data means many different things. The vision of an enterprise data warehouse that supports analytics across the firm eludes most organizations. As a practical matter, machine learning software must interface with many different data platforms and ingest data in diverse formats: structured, semi-structured and unstructured. It must work with data that is “tall” (many records) and “wide” (many columns), and use streaming data.
Finally, machine learning software must integrate with the organization’s preferred technology stack. That means, for example, compliance with security protocol; operability with preferred data platforms; and conformance with standards for operating systems, virtualization, and any other established technologies.
The Data Scientist Shortage
There is a widespread perception that enterprises face a severe shortage of data scientists. A McKinsey report projects a shortage of data scientists through 2018; VentureBeat, The Wall Street Journal, The Chicago Tribune and many others all note the scarcity. The Harvard Business Review suggests that you stop looking or lower your standards because real data scientists are unicorns.
The hiring challenge isn’t simply a matter of supply and demand. The McKinsey report, which is now five years old, predicted a shortage of executives who understand Big Data but projected a much smaller shortfall of people who work with data. Degree programs and MOOCs produce thousands of freshly trained data scientists each year. Organizations can send machine learning projects offshore to China and India, among other countries, where consulting firms employ large and growing teams of analysts with advanced degrees.
An absence of professional standards and professional certifications poses the greatest hiring challenge. While there is an effort underway to establish professional standards for data scientists, no widely accepted standard exists today; anyone can call themselves a data scientist. In the 2016 Data Science Salary Survey published by O’Reilly Media, 29% of respondents call themselves data scientists but report that they spend little or no time contributing to machine learning projects, and do not use standard machine learning tools.
There is also considerable uncertainty about the proper role of data scientists. While hiring managers seek out individuals with skills and experience in machine learning, the actual workload may be entirely different. In many organizations, the real role of people with the data scientist title is information retrieval: using query tools to secure data from data platforms so that business users can view it in Tableau or Excel. (SQL is the most popular tool cited in the O’Reilly survey.)
Such misunderstandings undermine the morale of a team and encourage attrition. A recent survey by Stack Overflow reveals that innovation and “building something significant” are key motivators for machine learning professionals, more so than for other disciplines. Placing an individual with machine learning skills in a “data broker” role only because he or she knows how to use SQL is a misuse of human resources.
Long Time to Value
According to a Gartner survey, executives responsible for advanced analytics say that it takes an average of 52 days to build a predictive model. (Gartner’s definition of advanced analytics also includes statistics, descriptive and predictive data mining, simulation, and optimization, but the survey findings are still pertinent.) Reported timelines vary from days to months. The same executives rate “speed of model development” as a top criterion in choosing an advanced analytics platform, second only to general ease of use.
Executives wonder: why does it take so long to build and deploy a predictive model? There are many reasons:
- Data is difficult to access.
- Data is dirty.
- Legacy machine learning tools do not scale to Big Data.
- Management approvals to deploy models are slow and bureaucratic.
- Organizations lack a defined process or technical standards for model deployment.
Most working data scientists spend very little time training machine learning models. In 2014, the New York Times reported that data scientists spend 50-80% of their time collecting and preparing data, according to interviews and expert estimates. Earlier this year, Gil Press reported in Forbes on a survey of data scientists conducted by CrowdFlower, in which respondents said they spend about 80% of their time collecting, cleaning and organizing data.
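To make “dirty data” concrete, here is a minimal sketch of the kind of check a data scientist might run before any modeling begins. It is an illustration only: the function name `profile`, the list of missing-value markers, and the record layout are assumptions made for this example, not part of any survey or tool cited above.

```python
from collections import Counter

# Assumed markers for "no value"; real sources often use others as well.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def profile(rows):
    """Summarize two common data-quality problems in a list of dict records:
    missing values per field and exact duplicate records."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip().lower() in MISSING_MARKERS:
                missing[field] += 1

    # Two records are duplicates when every field/value pair matches.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values() if count > 1)

    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}
```

Even a check this simple has to be repeated for every new source, and it only finds problems; fixing them, and confirming the fixes with the data owner, is where much of the reported time goes.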
Considering the investments in enterprise data warehousing, it seems surprising that data scientists must spend so much valuable time securing and cleaning data. There are two principal reasons for this. First, enterprise data warehouses tend to focus on business intelligence and performance management use cases. These use cases are the lowest-hanging fruit; they tend to have stable data requirements and a large population of prospective users. Machine learning projects, on the other hand, frequently work with data from sources that the IT organization does not support in the enterprise data warehouse.
Second, data is critical to the success of a machine learning project — “garbage in/garbage out.” Biased or invalid data produces biased or erroneous predictions; the data scientist is accountable for high-quality output, and cannot dismiss data issues as “someone else’s problem.” With increased social concern about bias in algorithms, we can expect that visibility into data lineage (which is already a fixture in regulated industries) will be a critical factor in the wider adoption of machine learning. This need for accountability necessarily means that data scientists seek control over the data pipeline.
Machine learning makes heavy demands on computing infrastructure, especially so with Big Data. Model development requires iterative testing and re-testing. Most legacy server-based machine learning software – any software developed before 2010 – is single-threaded; at best, a few products support single-machine multi-core parallel processing. (For example, there are more than 300 procedures included in SAS/STAT, a leading legacy software package for advanced analytics; of these, only 22 support multi-threaded processing.)
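The “iterative testing and re-testing” described above is one place where parallel processing pays off, because each test run is independent of the others. The sketch below illustrates the idea with k-fold cross-validation of a one-variable least-squares line, written with the Python standard library. The function names and the toy model are assumptions made for this example; it does not represent how any product mentioned here works.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import mean


def fit_and_score(fold):
    """Fit a least-squares line on the training rows and return the
    mean squared error on the held-out rows."""
    train, test = fold
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    intercept = my - slope * mx
    return mean((y - (intercept + slope * x)) ** 2 for x, y in test)


def make_folds(rows, k):
    """Build k (train, test) pairs; fold i holds out every k-th row."""
    return [
        ([r for j, r in enumerate(rows) if j % k != i],
         [r for j, r in enumerate(rows) if j % k == i])
        for i in range(k)
    ]


def cross_validate(rows, k=5, workers=1):
    """Score each fold, serially or across several processes."""
    folds = make_folds(rows, k)
    if workers == 1:
        # Single-threaded baseline: folds are scored one after another.
        return [fit_and_score(fold) for fold in folds]
    # Folds are independent, so they can be scored on separate cores.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_and_score, folds))
```

Calling `cross_validate(rows, k=5, workers=4)` from a script's `if __name__ == "__main__":` block spreads the five folds over four processes. With a toy model the overhead of starting processes outweighs the gain; the benefit appears when each fit takes seconds or minutes, which is the situation single-threaded software handles poorly.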
All of the leading data warehouse vendors include machine learning engines in their distributed databases. Teradata introduced the capability in 1989, followed by IBM in 1992, Microsoft in 2000 and Oracle in 2003; Netezza added machine learning in 2006, while Greenplum contributed to the project now branded as Apache MADlib; independent software vendor Fuzzy Logix introduced its machine learning library on multiple database platforms in 2007. Machine learning engines embedded in MPP databases offer some potential benefits, including reduced data movement, simplified deployment and the performance of an MPP platform.
In practice, however, few data scientists use in-database machine learning tools, for several reasons. First, we only reduce data movement if all of the data required for a machine learning project already resides in the database — a rare occurrence. Second, we only expedite deployment if the analytic database also supports the application that consumes the predictions; this is also rare.
Machine learning libraries embedded in MPP data warehouses also tend to lack features available in other alternatives, forcing the user to compromise or rely on custom coding. Finally, machine learning workloads – which tend to be “lumpy” and unpredictable – drive database administrators to distraction. Many organizations decline to deploy in-database machine learning or strictly curtail its use to avoid affecting finely tuned business intelligence workloads.
While there is little data available to document how much time organizations spend on the model review and approval process, anecdotal evidence suggests that it is significant. Accountable executives demand transparency from machine learning models that affect their business; no bank risk management executive will approve the use of a credit risk model without a thorough understanding of the model’s behavior and the processes used to build, test and validate it.
In regulated industries, such as banking, insurance, and healthcare, legal review is part of the approval process. In banks, for example, legal teams evaluate credit risk models to make sure that they have no explicit or implicit discriminatory effects and to check for other compliance issues.
Organizations with minimal machine learning experience tend to lack a defined process for model deployment. Without a defined process, every project is a custom project, so contributors must perform every task from scratch without the guidance provided by best practices and standard templates. That can take a long time; in some organizations, it takes six months or more to deploy a predictive model. In today’s fast-moving business environment, that can seem like forever for an executive with P&L accountability.
The Challenge of Enterprise Machine Learning
Breaking machine learning out of “silos of expertise” is a key objective for enterprise machine learning. Siloed departmental initiatives raise costs, impede investment and constrain digital transformation.
The shortage of skilled practitioners is the top issue cited by executives as limiting the wider deployment of machine learning. The skills gap is attributable in part to a lack of professional standards for data scientists and a lack of role clarity for contributors to machine learning projects. This skills gap can produce a vicious cycle for the organization, since hiring managers may struggle to justify hiring people dedicated to machine learning without a previous track record of successes.
Executives report long cycle times for machine learning projects, and cite this as a key issue. Machine learning projects take a long time to deliver value because data is dirty and hard to access; because legacy machine learning tools do not scale; because approvals to deploy models can be complicated and bureaucratic; and because many organizations lack defined processes and standards for model deployment.
In Part Three of this series, we will review some emerging solutions to these challenges.