Few companies have a better handle on the pros and cons of serverless than learning hub Khan Academy.
The non-profit, which offers a wide range of learning tools and content to 20 million engaged users per month, was one of the first customers for Google App Engine (a decade before we were graced with the word “serverless”). While there are some tradeoffs, and newer technologies could make a shift possible, they’re planning to stay the App Engine course.
Like so many education-centric companies, Khan Academy faced a rapid uptick in scale during the pandemic, with 3X growth in users almost overnight, according to CTO and VP of Engineering Marta Kosarchyn. All of this hit during a critical phase in an organization-wide migration from Python to Go as the base (along with GraphQL and other tools). One of the reasons Khan Academy could stay focused on the product instead of scrambling for scalability was the serverless approach they grew up on. And while there are some tradeoffs in terms of cost and flexibility, it let them focus on meeting demand rather than rushing to hire legions of infrastructure engineers, as other companies had to do in 2020.
In 2009, when Khan Academy made its debut, AWS EC2 was available, so they did have options. Containers came along later and offered a way around the lock-in of serverless, but it wasn’t until the last couple of years that Khan Academy deployed them, and then mostly to help onboard new developers quickly. The thing is, when something is working, especially for a product-focused non-profit with massive scale, change doesn’t come easy, nor is it always welcome.
Big transitions for a live product are difficult, but Kosarchyn, who has seen plenty of them in her 30 years (senior manager for R&D at HP Labs, director of product development at Intuit, etc.), says the shift away from Python to Go, made mostly for performance reasons, went smoothly. Much of this is because no one has to worry about infrastructure; auto-scaling, while not cheap, lets them stay focused.
“The thing about serverless is that you are hooked in that platform pretty tightly, that’s one of the tradeoffs. It’s not easy to change serverless and that’s part of it. Not to say we’d move off App Engine but it’s a consideration. The other thing is that it’s more expensive because of autoscaling but the beauty of running at the scale we do with a team of five engineers doing devops and eighty-five site engineers is huge, everyone else is developing code and product,” Kosarchyn explains.
“Auto-scaling is expensive but it’s how this all works, you deflate and there is some redundancy built in and you’re paying for the resource you use as you go up and down. The good news is, when you’re done and scale down, you don’t pay but you are paying for that bubble in the transition. It is, however, what makes you fully reliable so nothing goes away too quickly. Yes, it’s expensive and that’s a tradeoff but it’s better for us than having 4-5 times the manpower. Instead we’re using our engineering brains on product where it matters.”
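The “bubble” Kosarchyn describes can be pictured with a toy model. The sketch below is illustrative only and is not how App Engine actually schedules or bills: the function name, the 20 percent redundancy headroom, the 25 percent per-step scale-down limit, and the demand numbers are all assumptions made up for the example. It assumes capacity rises immediately with demand but is released gradually, so some paid-for capacity sits unused while the system deflates.

```python
def provisioned_capacity(demand, headroom=0.2, max_scale_down=0.25):
    """Toy autoscaler (hypothetical, not App Engine's real behavior).

    Scale-up is immediate, with `headroom` of extra capacity for redundancy.
    Scale-down releases at most `max_scale_down` of capacity per step.
    """
    capacity = []
    prev = 0.0
    for d in demand:
        target = d * (1 + headroom)
        if target >= prev:
            current = target  # scale up right away
        else:
            # deflate gradually instead of dropping straight to the target
            current = max(target, prev * (1 - max_scale_down))
        capacity.append(current)
        prev = current
    return capacity


# Made-up demand: a spike followed by a return to baseline.
demand = [100, 300, 300, 100, 100, 100]
capacity = provisioned_capacity(demand)

billed = sum(capacity)       # capacity paid for across all steps
used = sum(demand)           # capacity actually needed
overhead = billed - used     # headroom plus the scale-down "bubble"

print(capacity)
print(f"billed={billed:.1f} used={used} overhead={overhead:.1f}")
```

In this toy run, provisioned capacity stays above demand for several steps after the spike ends, which is the transition cost the quote refers to; the same slow release is also what keeps capacity from disappearing too quickly.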
At The Next Platform we are used to talking about everything in terms of node, core, or even instance counts to get a handle on what scalability costs for different use cases. Serverless changes how we talk about this, but on average, Kosarchyn says, their costs scale almost exactly linearly with usage. The good news is that this lets Khan Academy predict what growth will cost; the bad news is that those costs are high.
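To see why linear scaling makes growth predictable, here is a minimal sketch that fits a straight line to past usage and cost, then projects the bill for a 3X jump in users. The cost figures are invented, unitless placeholders and not Khan Academy’s numbers; the function names are likewise hypothetical.

```python
def fit_line(usage, cost):
    """Ordinary least-squares fit of cost = slope * usage + intercept."""
    n = len(usage)
    mean_x = sum(usage) / n
    mean_y = sum(cost) / n
    sxx = sum((x - mean_x) ** 2 for x in usage)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(usage, cost))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project_cost(slope, intercept, usage):
    """Projected cost at a given usage level under the fitted line."""
    return slope * usage + intercept


# Invented history: monthly users (millions) vs. cost (arbitrary units).
monthly_users = [10, 15, 20, 30]
monthly_cost = [52, 77, 102, 152]

slope, intercept = fit_line(monthly_users, monthly_cost)

# Project the cost if 20 million monthly users tripled almost overnight.
projected = project_cost(slope, intercept, 3 * 20)
print(f"slope={slope:.2f} intercept={intercept:.2f} projected={projected:.1f}")
```

If costs really do track usage this closely, a single slope is enough to budget for a surge, though a steep slope means the projected number will still be large.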
Even so, when asked what she might rearchitect if Google App Engine were no longer available (just a hypothetical), Kosarchyn says she would not move away from a serverless architecture for the kind of work Khan Academy does. “Will serverless serve us best forever? Maybe not. We do look at containers, we are building that to be cloud agnostic,” she adds.
For someone with a long career rooted in on-prem infrastructure, Kosarchyn said it took some time to get her head wrapped around not worrying about servers and the whole hardware side of things. Now the idea of going back to on-prem is anathema. With limited talent resources to pick from, having the best engineers focused on code and product is invaluable.