Much has changed at online retail giant Etsy since 2015, when we talked to the company’s senior VP of technical operations about its adherence to on-prem datacenters and a database-driven approach to handling exponential growth.
Continuous scaling of the user base, ever-growing data collection efforts to better cater to user searches (and to handle the seller side of the platform), and the possibilities of AI/ML pushed a rethink of infrastructure. In a short span, the company went from on-prem to full-scale Google Cloud Platform (GCP), with almost nothing left on on-site systems in 2019 and certainly nothing after the pandemic pushed the New York-based company even further outside its own walls.
While search and recommendation have always been a critical part of the company’s operations, the traditional way of doing things described in that 2015 piece, via complex database hacks, some of them homegrown, has given way to neural networks. Etsy’s team of ML-oriented data scientists and machine learning engineers now numbers well over 50, says Gaurav Anand, senior ML engineer at the company.
Anand tells us that the search and recommendation teams balance the cost of training against the end value for product teams, but they have managed to keep their training sweet spot between four and seven hours, usually with daily retraining for search in particular. He says the teams weigh the various hardware options in GCP when it’s time to train and pick whatever trains fastest within the right price frame, which in practice means all of their training runs happen on GPUs.
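The selection rule Anand describes (fastest option that still fits the price frame and the training window) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the option names, hourly prices, training times, and the budget figure below are invented for the example and are not Etsy's numbers or real GCP prices.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HardwareOption:
    name: str            # hypothetical label, not a real GCP machine type
    train_hours: float   # estimated wall-clock time for one training run
    hourly_cost: float   # assumed price per hour, in dollars

    @property
    def run_cost(self) -> float:
        # Cost of a single training run on this option.
        return self.train_hours * self.hourly_cost


def pick_hardware(
    options: Sequence[HardwareOption],
    max_hours: float,
    max_run_cost: float,
) -> Optional[HardwareOption]:
    """Return the fastest option that fits both the time window and the budget."""
    feasible = [
        o for o in options
        if o.train_hours <= max_hours and o.run_cost <= max_run_cost
    ]
    # None when nothing satisfies both constraints.
    return min(feasible, key=lambda o: o.train_hours, default=None)


# Made-up numbers, for illustration only.
options = [
    HardwareOption("gpu-small", train_hours=9.0, hourly_cost=1.0),
    HardwareOption("gpu-medium", train_hours=6.0, hourly_cost=2.5),
    HardwareOption("gpu-large", train_hours=4.5, hourly_cost=6.0),
    HardwareOption("accelerator-pod", train_hours=2.0, hourly_cost=40.0),
]

# Seven hours is the upper end of the window cited above; the budget is assumed.
choice = pick_hardware(options, max_hours=7.0, max_run_cost=30.0)
print(choice.name, choice.run_cost)
```

With these invented figures the fastest option overall is ruled out on price and the cheapest one on time, which is the shape of the tradeoff described here: the winner is whatever trains fastest inside the budget, not the fastest hardware available.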
When asked why nearly all production work at Etsy is not running on TPUs, which are available on GCP and are architecturally well aligned with the computer vision and language model-oriented work of the search teams, Anand says TPUs are better suited to bigger models, and bigger budgets.
“In terms of retraining, the tradeoff is being fresh or saving money. We’re always making these tradeoffs and the freshness versus training times and costs vary by model.” He adds, “There might be a use case or two here at Etsy on TPU but not in recommendations. Using GPUs versus TPUs with TensorFlow, so much is abstracted, there’s no need to make changes and you can just proceed without hesitation. So far, our training times are reasonable. TPUs are for scenarios where it takes weeks or months to train because there’s so much data. We can eventually move in that direction [larger models, longer training times] but there would have to be a clear benefit for the company.”
“If your data is huge and training is unreasonable, there’s a good argument for TPUs, especially since you can do hyperparameter optimization and experimentation with different models.”
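The freshness-versus-cost tradeoff in the quotes above comes down to simple arithmetic, sketched here in Python. The per-run cost, run length, and 30-day month are assumptions chosen for the example, and the staleness measure (one retraining interval plus one training run) is a simplification rather than anything Etsy has described.

```python
def monthly_training_cost(run_cost: float, retrain_every_days: int,
                          days_in_month: int = 30) -> float:
    """Total spend on retraining in one month at a fixed cadence."""
    runs = days_in_month // retrain_every_days  # whole runs that fit in the month
    return runs * run_cost


def worst_case_staleness_hours(train_hours: float, retrain_every_days: int) -> float:
    """Oldest the serving model's training data gets before a refresh lands.

    Simplified: one full retraining interval plus the time the run itself takes.
    """
    return retrain_every_days * 24 + train_hours


# Assumed figures: a 4.5-hour run that costs $27.
RUN_COST = 27.0
TRAIN_HOURS = 4.5

for cadence_days in (1, 7):
    cost = monthly_training_cost(RUN_COST, cadence_days)
    stale = worst_case_staleness_hours(TRAIN_HOURS, cadence_days)
    print(f"every {cadence_days} day(s): ${cost:.2f}/month, "
          f"up to {stale:.1f} h stale")
```

Under these assumptions, moving from daily to weekly retraining cuts the monthly bill to a fraction while letting the model age by roughly a week, which is why the right cadence varies by model.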
But here’s the thing about Etsy: they could make use of vast amounts of data in training, but their most pressing AI/ML problems aren’t solved by volume; they’re driven by specificity. Complicating matters further, the specific “thing” people search for could be esoteric, could be a trend that never reaches beyond a niche, and could, as a query, look a lot like something irrelevant.
In short, Etsy’s task is to deliver on the ultra-specific using bucket-loader AI/ML approaches that have to be very finely tuned and optimized, often for the unknown, emerging trend. Further, as a cost-constrained business, they have to do all of this in purposefully slim training/optimization/development runs so AI/ML directly translates to business sense.
A big part of what makes the site so popular from a buyer standpoint is its wide range of unique items that can be found quickly (and with a large number of results) from both highly specific queries and broad ones (e.g., filter by all things blue, no matter the category). Getting to the point of catering to both vague and specific queries and delivering the right results is difficult, certainly more so than just delivering a mass of results around a basic search on Amazon, Anand explains.
This is where Anand and team are toiling, and there aren’t necessarily established best practices or tools to do it. We are all familiar with the end product of recommendation engines and semantic or visual search, often from using large platforms (Amazon, for example) that can deliver quantity in search results but can fall short when the search is for something conceptual rather than a specific branded product. These big platforms aren’t designed for people who, for instance, like anything blue and just want to browse options.
What this means is that Anand and other teams at Etsy have to walk through the fire of multiple optimization efforts and failures, which are common when tweaking models that could alter user and seller experience on the platform.
Balancing training times and costs for this is an ongoing challenge, but he says that since their first models started rolling out in 2018, his AI/ML teams have learned some important lessons. These are applicable to far smaller companies with less data that are trying to make the same daily tradeoffs.
The big lesson is on the experimentation side: “having a philosophy in place is important and so is having a process to dissect and understand how and why something fails, which things inevitably do.” He adds that it’s also good to be tied as closely as possible to open source and ditch legacy in-house frameworks. “This space continues to evolve and every company has to keep moving with what new frameworks and libraries are being used. Keeping close to open source, with TensorFlow, for example, is a good start.”
Anand says that at Etsy, much is driven by product teams but he suggests AI/ML developers do what they do—be “entrepreneurial” about shopping new capabilities to product teams.
Almost as mythical as finding the right model to optimize for ethereal search queries and deliver the right recommendation is finding the cutoff line between a useful feature that took a lot of development investment and training time and its practical value to Etsy’s bottom line. This is the question we don’t have to ask when we talk to the hyperscale companies about how they look at new AI/ML capabilities; they simply roll them out. But they are, of course, the one-percenters.
Etsy’s balancing act of delivering specificity across a massive dataset (deciding what to pull when, how, and for whom) creates yet another balancing act—deciding when optimization ends in one arena and where and what to develop next.