An online retail business for handmade and vintage goods might be a place where you would reasonably expect the IT department to have nearly free rein to create homegrown systems. Some engineers would bend metal to forge custom systems and others would hack open source code to make a beautiful and ornate work of data processing art.
While there is a certain amount of leading-edge experimentation at Etsy, the company has a conservative approach to IT that many might not have expected – and one that suits a fast-growing business that is trying to get and stay profitable.
Etsy, which is getting ready to go public in the next few months, gave The Next Platform an inside look at its own platform and John Allspaw, senior vice president of technical operations at the retailer, gave us some insight into the company’s philosophy about technology adoption. The attitude at Etsy would probably suit most large enterprises, which are a whole lot less rich and therefore a whole lot more risk averse than the hyperscalers and HPC institutions that often create the most advanced technologies these days.
Striking the right balance between tried-and-true tech and the stuff out on the bleeding edge is not easy. Many companies never quite get it and fail, while others are able to mask it with their gargantuan IT budgets. (Think of all the things governments and banks try.) Among the HPC and hyperscale crowd, the applications in the upper echelons are for all intents and purposes custom, and the machinery that runs them can be considered more or less bespoke, too. The trick for Etsy and any enterprises that might want to emulate it is to figure out when a technology is safe and when it makes sense to deploy it.
Born In A Brooklyn Living Room, Not On A Cloud
Etsy was founded in an apartment in the hipster and artsy Dumbo section of Brooklyn in June 2005, and had four employees, one web server, and one database server. All hell broke loose only two months later, when the site went viral, and the company has been steadily growing the size and sophistication of its operations, to the point where it now has multiple datacenters on both the East coast (for production workloads, in New Jersey) and on the West coast (for disaster recovery, in an undisclosed location).
The first thing you will notice is that Etsy has its own systems running in datacenters rather than running its applications on Amazon Web Services or another public cloud, like Pinterest does, for instance. Part of that is due to timing – Etsy was founded nine months before AWS launched EC2. While Etsy has used AWS for Elastic MapReduce and other data analytics work in the past, and it still uses the AWS S3 service for the “long tail” of image storage for the Etsy.com site, that is about the extent of it. If Etsy were starting up today, Allspaw tells The Next Platform, there is no question that it would start out using AWS infrastructure. But the real question is not where Etsy would start, but when it would move to its own datacenter.
“It doesn’t make sense to farm it out on a couple of fronts,” Allspaw explains. “One of them – and this just reveals a bias of my own – revolves around development pipelines. We would not gain from the standard advantage that you hear around clouds, which is spin up and spin down. Our usage is not conducive and is simply not spikey like that. There is not a huge amount of waste going on in our traffic patterns. Second, because we have on staff experts that can exploit the capabilities of bare metal, we are still convinced that we can be more efficient and more nimble with our own hardware rather than sharing it. It is not so much outsourcing a liability, but we are just going to be better than dedicated instances on EC2.”
The Etsy infrastructure is by no means hyperscale, but with around 3,000 servers, it is commensurate with a business that has 685 employees (about a third of them are engineers) and that booked $195.6 million in revenues against total gross sales of $1.93 billion in 2014. While no one expects Etsy to be the next Amazon or eBay, it has carved out its niche. The company hit a bad patch with business and technology transitions back in the late 2000s, and Allspaw was brought in to help fix some of the technical issues that Etsy had. Since that time, the company has been growing at a steady clip, reaching 1.4 million active sellers and 19.8 million active buyers as 2014 came to a close. Just a little over half of the revenue stream at Etsy comes from a 20 cent listing fee for items sold on the site and a 3.5 percent transaction fee that gets tacked on. The remaining money is for checkout services, listing promotion, shipping labels, and other services for sellers.
Getting A Little Too Creative
Back in 2007, when Etsy was still young, the company used Ubuntu Server Linux as its main platform, PostgreSQL as its database, and PHP and Python as its two programming languages. Most of the business logic for the site was put into stored procedures in the PostgreSQL database, which is not an uncommon practice in the enterprise although it does lock you pretty tightly into your database because stored procedures are not portable like external SQL queries are. Etsy looked at a number of different options to scale, including creating its own middleware abstraction layer between the database and the Web front end and rewriting the site in Python or Java. The company went for the first option.
This middleware, called Sprouter, short for stored procedure router, was written in Python and sat between the PHP application on the Web and the PostgreSQL database on the backend. The idea was to be able to cache calls from the Web and also to abstract the database to support sharding, allowing for database scale out as the site grew; the Sprouter layer could be scaled independently, too, as needed, and this is important because scaling the database is very difficult. Sprouter went from idea to production release in under a year and was released in the fall of 2008, and not six months later it was deprecated. While the abstraction was great, it was a complex piece of software for Etsy to maintain and it was tightly coupled to PostgreSQL. Moreover, the database sharding features that Etsy was counting on for PostgreSQL did not get to market, and that meant PostgreSQL was a single point of failure for the site. Deployment of the Etsy stack got more painful. A bunch of the technical team left at this time, and Etsy started with a clean slate and a new, more conservative attitude about how to build a more reliable and scalable retail site.
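To make the idea concrete, here is a minimal Python sketch of a stored procedure router of the general kind described above. It is not Sprouter's code, which is not shown here. The class name `ProcedureRouter`, the modulo sharding rule, and the fake shard backends are all hypothetical, and a real layer would cache only read-style calls and would talk to PostgreSQL instead of returning strings.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for one database shard: takes a stored procedure
# name plus its arguments and returns a result.
Backend = Callable[[str, Tuple], object]


class ProcedureRouter:
    """Toy middleware that caches procedure calls and picks a shard."""

    def __init__(self, shards: List[Backend]) -> None:
        self.shards = shards
        self.cache: Dict[Tuple[str, Tuple], object] = {}
        self.backend_calls = 0  # counts calls that actually reach a shard

    def shard_for(self, shard_key: int) -> int:
        # Assumed rule for illustration: modulo sharding on an integer key.
        return shard_key % len(self.shards)

    def call(self, proc: str, shard_key: int, *args) -> object:
        cache_key = (proc, (shard_key,) + args)
        if cache_key in self.cache:
            return self.cache[cache_key]  # answered without touching a shard
        backend = self.shards[self.shard_for(shard_key)]
        self.backend_calls += 1
        result = backend(proc, (shard_key,) + args)
        self.cache[cache_key] = result
        return result


def make_fake_shard(name: str) -> Backend:
    # Returns a fake backend that just labels the call with its shard name.
    def run(proc: str, args: Tuple) -> object:
        return f"{name}:{proc}{args}"
    return run


router = ProcedureRouter([make_fake_shard("shard0"), make_fake_shard("shard1")])
first = router.call("get_listing", 7, "en")
second = router.call("get_listing", 7, "en")  # served from the cache
```

Because the router owns both the cache and the shard map, the Web tier never needs to know how many databases sit behind it. That is the appeal of the abstraction, and also why the layer becomes one more complex, tightly coupled piece of software to maintain.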
The first thing the company did was beef up the physical database servers behind the site as much as possible to boost performance. The company also shifted to a continuous deployment methodology, and shifted from PostgreSQL to MySQL for its database backend.
“The largest part of the application is written in PHP, and with the exception of search, all of the synchronous bits of Etsy.com and what the native apps hit by way of the API is PHP,” says Allspaw. “For search, it is a mix between Solr and Elasticsearch. We use Scalding for the vast majority of our data analytics. [Scalding is a framework written in Scala.] The data sciences group uses some Fortran for linear and matrix algebra related to recommendations and personalization routines.”
Etsy has a familiar tiered Web architecture, and capacity planning happens on a per-cluster basis. There are clusters for Web, API, and database serving, plus others running Hadoop and other analytics.
Etsy uses Gearman, an application framework, to move workloads around the clusters. (Gearman, an anagram of manager, was created by Danga Interactive, the same Internet startup that created the popular Memcached in-memory caching layer that sits between web servers and databases all over the world. Yelp, Tumblr, and Craigslist are among the top users of the tool.)
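The job server pattern behind a tool like Gearman can be sketched in a few lines of Python. This is an in-process imitation only, assuming a single queue and synchronous workers. It does not use the Gearman protocol or any Gearman client library, and the names `ToyJobServer`, `register`, `submit`, and `run_pending` are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class ToyJobServer:
    """In-process imitation of a job server: clients submit named jobs,
    and workers register the functions they are able to run."""

    def __init__(self) -> None:
        self.queue: Deque[Tuple[str, str]] = deque()
        self.workers: Dict[str, Callable[[str], str]] = {}

    def register(self, function_name: str, handler: Callable[[str], str]) -> None:
        # A worker announces which named function it can perform.
        self.workers[function_name] = handler

    def submit(self, function_name: str, payload: str) -> None:
        # The client only names the function; it does not know or care
        # which worker ends up running it.
        self.queue.append((function_name, payload))

    def run_pending(self) -> List[str]:
        # Drain the queue in submission order, dispatching by function name.
        results = []
        while self.queue:
            function_name, payload = self.queue.popleft()
            results.append(self.workers[function_name](payload))
        return results


server = ToyJobServer()
server.register("reverse", lambda text: text[::-1])
server.register("shout", lambda text: text.upper())
server.submit("reverse", "etsy")
server.submit("shout", "etsy")
results = server.run_pending()
```

The point of the indirection is that the code submitting work is decoupled from the machines doing it, so a worker pool can be resized without touching the callers.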
“It is a clustered model, but the dynamics of how the cluster works doesn’t actually dictate that we need to do a lot of moving around. And that is a trade-off. What you will see, especially with cloud infrastructures where you want to move things around, is that you tend to build architectures and deployment pipelines that reflect that. And when you don’t – and this is my own view – then you tend to put your focus elsewhere. I don’t want to brag too much, but we are pretty good at capacity planning.” (Allspaw literally wrote a book on it.)
Chef handles system configuration at Etsy, but does not do application configuration. “Even though that separation can be deemed to be arbitrary, we like to keep a separation,” says Allspaw. “The PHP itself is deployed by Chef, but our PHP and Java code is put out by Deployinator, a framework that we open sourced a couple of years ago. We use Deployinator for every stack that is not OS or a service like MySQL or Gearman or Memcached and so forth, which is handled by Chef. Clickstream data and visitor logs are pushed into an event pipeline built around Kafka. We are a big fan of the tools that LinkedIn has been building on that front.”
For system and application monitoring, Etsy has chosen the usual suspects in the open source community, including Nagios, Ganglia, and Graphite.
As for hardware, Etsy is a “big fan” of Supermicro but also buys machines from Dell and Hewlett-Packard. The main compute engine at Etsy is a modular server that crams four nodes into a 2U rack enclosure, which is a popular option for many hyperscale and HPC workloads across many different vendors. Database servers tend to be a bit beefier rack machines, with eight or sixteen drives and sometimes SSDs. Etsy has an even beefier system that can be used for Hadoop and long term storage. “It is all relatively boring and we kind of like it that way,” quips Allspaw.
Along with the move to MySQL came a shift to the CentOS clone of Red Hat Enterprise Linux. “I don’t see us changing any time soon,” he says. “We have done a lot of thinking over the years, and just about every distribution has its upsides and downsides. I don’t think that we are particularly zealous about it, but we don’t want to have to think about it much and we have gone with the one that has the least amount of distractions for us.”
All of the code at Etsy runs on bare metal with a few exceptions. The developer environment uses Xen to carve up machines into virtual instances, with a dozen or so VMs running on a reasonably hefty machine. The automated testing environment has been virtualized with Linux containers (LXC) and does regression testing of new code before it gets pushed to production, and this is a key part of the continuous deployment process at Etsy, which pushes out somewhere between 40 and 60 code changes each business day. (Etsy does not push code after 6 PM eastern, which is healthy and humane.)
Etsy’s general philosophy can be summed up pretty simply. “We favor a small number of well-known tools. Then we don’t have to spend a lot of time chasing the most perfect tool. People say you should use the right tool for the job, but we think there should be a pragmatic limitation to that. A good carpenter does not have a thousand hammers.”
“If something doesn’t work in PHP, only then will we look at something else,” Allspaw continues. “We focus on what the product is, not what tech we are using for the problem that we are looking to solve. This means we do not have to repeat all the same questions from project to project. When anyone asks what programming language to use, it is either PHP or Java because then anybody at the company can contribute. You have to make a case for C or Fortran. Where are we going to store the data? In MySQL.”
Hip Hopping To Facebook’s PHP Engine
Facebook operates the largest PHP application in the world, bar none and hands down, and it has been frustrated by performance issues with PHP operating at its vast scale for so long that it created the HipHop Virtual Machine, or HHVM, as a replacement for the standard PHP engine. And the fact that Etsy has just replaced the PHP engine with HHVM for its API server stack is an indication that this technology is on the leading edge but certainly no longer on the bleeding edge.
HipHop is the result of many things that Facebook engineers have done since 2007 to try to goose the performance of PHP, including rewriting the Zend PHP engine and creating a PHP to C++ converter. HHVM was open sourced in December 2011, and has been in production at the social network for some time now. Wikipedia is moving to HHVM, and so are Box and Baidu. The technology is, at this point, a relatively safe bet, and Dan Miller, a software engineer at Etsy who has been experimenting with HHVM, echoes the sentiment of his boss that Etsy “likes to use boring technologies that are well understood and that fail well,” as he tells The Next Platform.
Miller went through some benchmark tests that Etsy has run pitting the standard PHP engine against HHVM in a recent blog post, and this experiment has actually gone into production. Those API servers are a key element of serving up Etsy pages to various kinds of mobile devices. Rather than having a mobile device with a particular aspect ratio and other settings hit just one API to be fed, Etsy breaks down APIs into pieces and then can compose a single API from them on the fly. This allows for more code reuse between Web and mobile client interfaces, but it also has the effect of fanning out the APIs as they hit the servers. And this means that as web traffic grows, API loads grow even faster.
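The fan-out effect is easy to see in a small Python sketch. The component endpoints below (`listing`, `shop`, `reviews`) and their return values are invented for illustration and are not Etsy's actual APIs. In a real deployment each would be a separate request to the API cluster.

```python
from typing import Callable, Dict, List

# Hypothetical component endpoints, each returning one piece of a page.
def listing(listing_id: int) -> Dict[str, object]:
    return {"id": listing_id, "title": "Hand-thrown mug"}

def shop(listing_id: int) -> Dict[str, object]:
    return {"shop": "ExampleShop"}

def reviews(listing_id: int) -> Dict[str, object]:
    return {"review_count": 12}

COMPONENTS: Dict[str, Callable[[int], Dict[str, object]]] = {
    "listing": listing,
    "shop": shop,
    "reviews": reviews,
}

call_log: List[str] = []  # records every internal API call made


def compose(listing_id: int, parts: List[str]) -> Dict[str, object]:
    """Build one client-facing response from several component calls."""
    response: Dict[str, object] = {}
    for name in parts:
        call_log.append(name)
        response.update(COMPONENTS[name](listing_id))
    return response


# One request from a mobile client turns into three internal calls.
mobile_page = compose(42, ["listing", "shop", "reviews"])
fan_out = len(call_log)
```

Each client sees a single composed API, but the servers behind it field one call per component, which is why load on the API tier can grow faster than page traffic.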
Miller tells The Next Platform that the API server farm comprises about 50 machines, and to keep up with growth Etsy would have to at least double it. In synthetic benchmarks on a single machine, HHVM scaled a lot better than the PHP 5.4 engine, but it is important to do your own tests on your own code. With HHVM only being used on the API servers, Etsy was able to render PHP pages about twice as fast as it could with the standard open source PHP engine, which means it did not have to upgrade those API servers. This kind of change is enough for Etsy to think about a shift in technology, particularly when Facebook is seeing a 3X to 6X performance improvement across its entire application stack using HHVM, according to Miller.
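The arithmetic behind that decision can be worked through in Python. This is a back-of-the-envelope sketch that assumes load spreads evenly and capacity scales linearly with machine count. The function `machines_needed` is hypothetical, and the growth and speedup factors are the rough figures quoted above, not measured values.

```python
import math


def machines_needed(current_machines: int, load_growth: float, speedup: float) -> int:
    """Machines required after load grows by `load_growth` and each
    machine gets `speedup` times faster (linear scaling assumed)."""
    return math.ceil(current_machines * load_growth / speedup)


# Roughly 50 API servers today, with load expected to at least double.
php_fleet = machines_needed(50, load_growth=2.0, speedup=1.0)   # standard PHP engine
hhvm_fleet = machines_needed(50, load_growth=2.0, speedup=2.0)  # about 2X with HHVM
```

Under these assumptions the standard engine would need about 100 machines to absorb a doubling of load, while a 2X speedup lets the existing 50 carry it.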
Etsy is looking at how HHVM might be deployed on the hundreds of machines in its worker cluster, the generic machines that run a lot of its code, including fraud detection. This fraud analysis has the highest amount of computation among the PHP applications running Etsy.com, says Miller, and is an obvious candidate for an HHVM test.
But don’t be so sure Etsy will go all-in with HHVM. As it turns out, Rasmus Lerdorf, who created the PHP scripting language and who gets his paycheck these days from Etsy, is working hard on PHP 7 with the open source PHP community. Miller says that the performance improvements with PHP 7 are on par with those from HHVM.
“We are in the great position of being able to play off these two technologies,” says Miller. “This is a great time to be running PHP at scale.”