Machine learning has moved from prototype to production across a wide range of business units at financial services giant Capital One due in part to a centralized approach to evaluating and rolling out new projects.
This is no easy task given the scale and scope of the enterprise, but according to Zachary Hanif, director of Capital One’s machine learning “center for excellence,” the trick is to define use cases early that touch as broad a base within the larger organization as possible and build outward. This is encapsulated in the philosophy Hanif spearheads: locating machine learning talent in one repository that can branch out and work with the experts across the many business divisions.
Hanif shared these and other lessons at the GPU Technology Conference (GTC18), where he described building a machine learning hub inside a large enterprise in which dedicated machine learning experts work with the different domain and departmental efforts to roll new services into production. While GPUs were not the topic of the talk, Hanif did say Capital One has quite a number of them along with standard CPU-based clusters. As at any other enterprise or academic center with a wide range of mission-critical R&D projects on the burner, resource contention is a constant struggle, especially when it comes to the rarer and more expensive GPUs.
“We wanted to deliver large-scale transformative projects in collaboration with many business groups, centralize R&D, and democratize machine learning. We opted to centralize as a strategy for larger change. We needed a central group to champion the needs of machine learning that could also bring in other company members as well as experts. In 2016 we created the center and now it serves all of Capital One,” Hanif says to preface the group’s lessons learned over the last two years.
To manage the complexity of many ongoing projects in various stages of development or decline, Hanif says there is great value in a rich reproducibility tracking platform that can trace nearly every aspect of each iteration of a new model. Such a platform ensures that results can be deployed accurately and efficiently, and it lets teams move back and forth within the lifecycle of a project to check status and restart from a set point if needed. He says that this platform, which teams inside his machine learning group built themselves, was key to the success of centralizing machine learning initiatives that were rooted in his group but branched out across the entire enterprise.
The experiment logging network Capital One built is key, and while it might not be practical for everyone, its necessity for a large enterprise planning multi-faceted machine learning deployments across many different business units is the real takeaway from the talk, which was aimed at other enterprise leaders trying to understand how to broadly implement AI and machine learning.
“If you can’t reproduce your results, it’s all for nothing. We built our experiment logging network to record experiments every time we run. It records all execution of models from the first time they’re stood up in their early exploration phase to the end. We want to record all of this because if we do that for our developers, we hit the goal of building meaningful models that have all info attached and can be run in champion/challenger mode inside our historical set to check reproducibility,” he explains. Capital One can then create and save results from these model validations, load artifacts and deploy them, and track all versioning and sourcing.
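Capital One has not published the internals of its experiment logging network, so the following is only a minimal sketch of the idea Hanif describes, written in Python under stated assumptions. Every name in it (ExperimentLog, fingerprint, champion_challenger, the toy threshold models, and the five-row historical set) is hypothetical and chosen for illustration. It records each model run with its parameters, a hash of the data it was scored on, and its metrics, then scores a champion and a challenger against the same historical set.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[float, int]          # (feature, label); hypothetical toy schema
Model = Callable[[float], int]   # a model maps a feature to a predicted label


@dataclass
class ExperimentRecord:
    """One logged run: what was run, on which data, with what result."""
    model_name: str
    params: Dict[str, float]
    data_fingerprint: str
    metrics: Dict[str, float]
    timestamp: float = field(default_factory=time.time)


def fingerprint(rows: Sequence[Row]) -> str:
    """Hash the evaluation data so a later rerun can prove it used the same set."""
    payload = json.dumps(list(rows), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ExperimentLog:
    """In-memory log; a real system would persist records to durable storage."""

    def __init__(self) -> None:
        self.records: List[ExperimentRecord] = []

    def record(self, model_name: str, params: Dict[str, float],
               rows: Sequence[Row], metrics: Dict[str, float]) -> ExperimentRecord:
        rec = ExperimentRecord(model_name, dict(params), fingerprint(rows), dict(metrics))
        self.records.append(rec)
        return rec

    def to_jsonl(self) -> str:
        """Serialize every record as one JSON object per line."""
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self.records)


def accuracy(model: Model, rows: Sequence[Row]) -> float:
    hits = sum(1 for x, y in rows if model(x) == y)
    return hits / len(rows)


def champion_challenger(log: ExperimentLog,
                        champion: Tuple[Model, Dict[str, float]],
                        challenger: Tuple[Model, Dict[str, float]],
                        rows: Sequence[Row]) -> str:
    """Score both models on the same historical set, log both runs, name the winner.

    Ties go to the champion because it is evaluated first.
    """
    scores: Dict[str, float] = {}
    for name, (model, params) in (("champion", champion), ("challenger", challenger)):
        scores[name] = accuracy(model, rows)
        log.record(name, params, rows, {"accuracy": scores[name]})
    return max(scores, key=scores.get)


def threshold_model(cutoff: float) -> Model:
    """Toy classifier: predict 1 when the feature is at or above the cutoff."""
    return lambda x: 1 if x >= cutoff else 0


# Hypothetical historical set and two candidate models.
historical = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 1)]
log = ExperimentLog()
winner = champion_challenger(
    log,
    champion=(threshold_model(0.5), {"cutoff": 0.5}),
    challenger=(threshold_model(0.42), {"cutoff": 0.42}),
    rows=historical,
)
print(winner)
print(log.to_jsonl())
```

Because both runs carry the same data fingerprint, anyone replaying the comparison later can confirm it was scored against the identical historical set, which is the property the quote above hinges on.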
“We’ve seen a big decrease in model development times since implementing this, around 60%,” Hanif says.
From an infrastructure point of view, the group manages its tracking, testing, and prototyping work on its own clusters with tools like Mesos and Kubernetes (Capital One uses both). As Hanif explains, “The models are just part of the story; the pipeline, data schemes and monitoring methods are just as important. We have had to look at problems of container management, server monitoring, and storage because these and other infrastructure components are critical to project success and we rely on cluster schedulers and management heavily to cope with resource contention we have among a lot of users.”
“GPUs come with special considerations and having middleware to make it easier to manage is really important. Resource allocation and scheduling tools are critical with more than sixty machine learning engineers vying for time on those systems.”
Aside from these ways to manage complexity, building out teams and creating balanced strategies that do not overload teams of domain experts or the internal machine learning group are other challenges that compound at scale.
“Machine learning is expensive; you have to choose wisely. You can either go big or go wide. In other words, focus on the transformative opportunities overall or a subset of these, but all of these efforts will be time consuming. Machine learning does not need to be in everything and indeed, it can harm more than help in some situations,” Hanif says.
He argues for similar balanced restraint when it comes to building a machine learning team to serve as a centralized hub.
“Machine learning projects are far more than just models. Finding talent is difficult; you want a mix of development and research. Degrees are not everything; experience really matters here. And while it may seem like you can never have enough headcount, just as with any large enterprise project, there are scalability concerns with onboarding and offboarding projects. The overhead of coordination and long-term commitments conflicting with shifting priorities can become a problem as you also contend with information loss due to project churn and drift.”