Sometimes the best way to cope with scale is to keep things simple and do everything you can to avoid it. This is the approach that GitHub, the repository service for the popular Git source version control tool created by Linus Torvalds a decade ago, has taken as it has grown explosively and become one of the centers of gravity for open source software development.
Programmers are understandably picky about the tools they use to craft their code and share it with others so they, in turn, can also tweak and improve it. In a very real sense, software developers live in these systems and the way source version control systems work can either help or hinder the creative process of collaborators.
GitHub was founded back in 2007 by PJ Hyett, currently the company’s COO; Chris Wanstrath, its CEO; Tom Preston-Werner, its former CEO; and Scott Chacon, its CIO. They were all developing Ruby applications on the Rails framework and wanted a better way to collaborate, so they began building what would launch in 2008 as GitHub. They built it not so much as a business plan but as a tool they wanted for automating their own software development efforts.
As it turns out, GitHub is one of the largest Ruby on Rails applications in the world, reckons Sam Lambert, who is director of systems at GitHub and who spoke with The Next Platform a bit about the systems that underpin its eponymous source wrangler. Lambert is not at liberty to discuss how many lines of code make up GitHub, nor has the company released figures on how many lines of code are stored in the GitHub repository. But Lambert did give us some metrics about the growth in GitHub use and how the systems underneath it all keep 10 million programmers worldwide, working at some 60,000 organizations or by themselves, tweaking 26 million open source projects.
“It is basically quite a simple stack, which is something that is really important to us,” explains Lambert. “We have grown by trying to adopt as little as possible to keep that stack fairly simple.”
Being on the other side of the 2008 dividing line for startups (two years after Amazon Web Services launched its EC2 compute cloud), GitHub could have started out on the cloud and simply never invested in infrastructure in the first place. But, instead of doing that, the company’s founders and the software engineers they hired have crafted a software stack and a creative set of system management and software deployment tools employed through chat software that essentially runs the IT operations at GitHub.
The company has its own private repository on GitHub to develop GitHub, of course. While Lambert won’t disclose the size of the Ruby application that comprises GitHub, he tells The Next Platform that it has had a quarter million commits in the GitHub repository and has hundreds of people – not all of them work at GitHub, either – contributing those code changes and having them committed.
Repo Man
“GitHub was created initially by us, for us, and we are all basically software engineers and we want to use a good tool to do this,” says Lambert. “We use GitHub to build GitHub and it is something that we are into every day for managing everything. The human resources and the legal teams use GitHub for workflows, too. It is not just programmers. We are quite lucky in that we can go through our code in ways that other companies can’t necessarily do. You can hire developers to work on systems for ad serving, but if they don’t care about ads, they won’t be as engaged. All of our developers like using Git and the workflows around it, so we have that privilege of working on the tool we use every day.”
Starting at the bottom of the GitHub stack is the hardware, which consists of several hundred X86 servers spread across a few geographically distributed datacenters. (GitHub does not talk about where these are located, but Lambert did say that as GitHub expands its user base globally, it is considering standing up more datacenters in other regions of the world.)
“We run off-the-shelf machines from a standard vendor,” Lambert says, not naming names or configurations. “We do a lot to optimize how the software runs, but we are not at a scale that we do anything inappropriately bespoke when it comes to hardware. As we scale, we are trying to make our software become more fault tolerant and we are starting to replicate data onto hosts that are disposable and we don’t have to bother fixing a machine. You just destroy it and rebuild the data on another machine. This makes it cheaper to buy machines and therefore cheaper to scale.”
The hardware is apparently not all that interesting – particularly to software engineers, ahem – but Lambert was particularly excited about a homegrown deployment system, called GPanel and coded in Ruby, that hooks into the Puppet configuration tool and lets anyone at the company provision machines and deploy its software stack on them.
“This lets us deploy our software as if we were on a public cloud, but allows us to have all of the niceties of having our own hardware.”
The software foundation at GitHub is, of course, Linux, and while Lambert says that the company certainly has enough expertise to roll its own Linux, rather than do that it simply uses Canonical’s Ubuntu Server distribution. As for databases to store the Git code data and other aspects of the access control systems of the GitHub repository, GitHub relies on the MySQL relational database. GitHub maintains the Linux and MySQL software itself, as it does for both Ruby and Rails. GitHub employs some of the key maintainers in the Ruby and Rails communities, so it stands to reason that it would do its own support there, but as it turns out, GitHub has custom versions of both Ruby and Rails that are necessitated by the scale of its GitHub application.
Coming To A Fork In The Code
“The scale problem for us really is having a resilient data store in a highly available manner as the data comes in,” says Lambert. “It is about adapting Git to be scalable and usable, since it never had this in mind. Another way we scale is that GitHub is one of the largest Ruby on Rails programs out there – there are not many companies running Ruby at a bigger scale. We are keeping that lean and doing optimizations to keep it that way. We are not quite at the stage that Facebook is at with HipHop and what they are doing with PHP, but we do have people contributing to Ruby core to make it faster and leaner.”
GitHub tweaks the Ruby interpreter and also created its own garbage collection routines for it, but it is also keen on fixing Ruby and Rails bugs as fast as possible and getting code fixes into GitHub, the application, as well as out to the Ruby and Rails communities. (Ruby development is hosted on GitHub, as is that for Rails. MySQL development just moved over, and it took Oracle some time to do it.)
GitHub may be the machine by which developers fork code with reckless abandon – well, abandon at least – but forking is not something that GitHub takes lightly, oddly enough. Lambert explains:
“The reason we have kept GitHub as a Ruby on Rails application is that it is really easy and quick to pick up. People start to work on GitHub on their first day at the company. It takes a real need for us to build something bespoke and unusual because if we do, we lose all of those benefits of what the community is doing. This is what informs our database choice because MySQL is the one that everybody is using. If you have a problem with it, it is known and you don’t have some obscure failure that no one understands. There is no weird failure edge case that you cannot find the answer to because someone has encountered it.”
The infrastructure underpinning GitHub has web servers, proxy servers, authentication servers, and a bunch of other systems that perform analytics about the repository and the coders uploading commits to the millions of projects that live there, but the real heart of the machine is the repository itself. Most of this data is text, of course, and that doesn’t take up a lot of space compared to some of the richer photo, video, and audio media that stuffs the disk drives behind the Internet to the gills.
Oddly enough, GitHub does not use traditional data compression on this textual data that comprises the code, but has another way to save space that is its own. If a project gets forked, only the changes from the original are saved in the fork. (We presume that this method also allows you to easily figure out what changes were made at each fork and iteration over time.) If GitHub saved each change and each fork, it would very quickly have untold petabytes of data, and traditional data compression would slow the system down. As it turns out, even though it is accepting hundreds of gigabytes of new data each day from programmers, the whole GitHub repository is measured in hundreds of terabytes.
At some point, there will be enough cat photos on the Internet that all cat photos will be able to be derived from a master cat photo and stored in this forked changes manner. (We are kidding. Sort of.)
“There are a lot of companies out there that say they have terabyte or petabytes of data, and you ask them what data, and it is usually just junk,” says Lambert with a laugh. “This whole big data movement is companies just storing events – just crap, basically. We are very proud that we stay lean and stay optimized, and we are not just storing tons of useless data. Compared to our competitors, our storage-to-repository ratio shows we are very, very lean. We don’t store as much data as we might have to because we have some very smart stuff on the back end that lets us loosely clone and fork. We have a lot of Git, but not as much as we would have if we did not make the optimizations we did.”
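GitHub has not published how its loose cloning and forking works, so what follows is only a minimal sketch of the general idea, assuming a content-addressed object store that a repository and its forks share. Git itself names objects by a hash of their content, but the ObjectStore and Repo classes here are hypothetical and are not GitHub's code. The point is that a fork copies only a small table of pointers, and new bytes land on disk only when the fork changes something.

```python
import hashlib


class ObjectStore:
    """Hypothetical content-addressed blob store shared by a repo and its forks."""

    def __init__(self):
        self._blobs = {}  # SHA-1 hex digest -> file content

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


class Repo:
    """Maps file paths to object keys; the file content lives in the store."""

    def __init__(self, store: ObjectStore, files=None):
        self.store = store
        self.files = dict(files or {})

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = self.store.put(data)

    def read(self, path: str) -> bytes:
        return self.store.get(self.files[path])

    def fork(self) -> "Repo":
        # A fork copies the path-to-key table, not the blobs themselves.
        return Repo(self.store, self.files)


store = ObjectStore()
upstream = Repo(store)
upstream.write("README", b"hello world\n")
upstream.write("main.rb", b"puts 'hi'\n")
before = store.stored_bytes()

fork = upstream.fork()
after_fork = store.stored_bytes()  # forking added no file content

fork.write("main.rb", b"puts 'hi from the fork'\n")
after_change = store.stored_bytes()  # only the changed file was added

print(before, after_fork, after_change)
```

In this toy version, forking costs nothing in file storage and the upstream copy of main.rb is untouched by the fork's edit. A real system would also have to handle deleting objects that no repository points to any longer, which the sketch ignores.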
To keep up with the growth that GitHub has experienced, the company goes old school and simply overprovisions its hardware so it can quickly fire up new capacity as storage or compute needs dictate.
“We are always overprovisioning out ahead, and I wouldn’t say our growth is stressing us, but it is certainly under pressure,” Lambert qualifies without being specific about how fast its clusters are growing. “We have hundreds of gigabytes of new data coming in each day, and we are scaling rapidly in terms of user count and repository usage, but we have our infrastructure set up in a way that we can keep adding capacity and keep growing. This is something that we planned for well, and there is no slowdown in sight.”
If GitHub is like other hyperscalers, it has to grow its infrastructure less quickly than the usage metrics that are driving that infrastructure. It is simply not possible to scale up servers, storage, and people linearly, which is why there is so much engineering creativity among the hyperscalers.
Using the public GitHub repository is free, but any code that is stored on the public site can be grabbed and forked by anyone interested in doing it. GitHub offers private repositories on the site for a fee, which is how it plans to make money. Prices range from a low of $7 per month for a personal plan with five private repositories to $200 for a business plan aimed at teams of programmers sharing up to 125 private repositories. Companies that want to host GitHub internally for their own code development can license GitHub Enterprise, which costs $2,500 per year for a ten-seat package and which has the same look and feel as GitHub. GitHub Enterprise can be hosted on internal iron or on the Amazon Web Services or Microsoft Azure public clouds. At the moment, GitHub and GitHub Enterprise are maintained by the same support team, but if you develop something internally on GitHub Enterprise and you want to open source it on GitHub, there is no automated way to do that. But Lambert says watch this space.
Aside from the core Ruby on Rails application and the storage algorithm for putting Git code into the file servers, GitHub is working on a number of other applications. “Some technologies you just don’t pull it off the shelf because no one else is the largest code hoster in the world and there are very bespoke domain problems that we have,” says Lambert.
One focus area going forward is to provide a richer set of analytics about projects and the programmers working on them, since a lot of companies are using open source as a way to attract talent to their companies. It also stands to reason that GitHub will expand into new markets where documents with lots of changes and forks are part of the collaborative process. Just like teams inside of GitHub use the tool to keep track of stuff, architects, musicians, and other artisans are starting to use the tool and this could provide another wave of growth for GitHub.
GitHub raised $100 million in its first round of venture financing from Andreessen Horowitz back in July 2012, and in July of this year it raised another $250 million in its second round, with Sequoia Capital leading and Andreessen Horowitz, Thrive Capital, and Institutional Venture Partners all kicking in, too. The company is not public, but given its funding, it has a valuation of around $2 billion and has the cash to grow its base and expand its addressable market.
The ChatOps Culture And Distributed Development
An important innovation at GitHub that is not strictly part of the code but is definitely part of the company is Hubot, which is a chatbot interface for system management used by the company. This approach is commonly called ChatOps, a tip of the hat to the development-operations moniker with chat as the means of doing DevOps. Or, in the case of GitHub, just about everything at the company.
Soon after it was founded, GitHub created Hubot, which is integrated with GPanel and other system administration tools as well as other functions across the company. As it turns out, 65 percent of GitHub’s 325 employees work remotely, so having a meeting or training session in the office is not going to happen. Hubot is where everything gets done and where everyone can see what everyone else is doing. (Etsy and Box are using Hubot now, too.)
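To make the ChatOps idea a little more concrete, here is a minimal sketch of a chat command router. It is an illustration only: the ChatBot class, the bot name opsbot, and the deploy command are all hypothetical, and this is not Hubot's actual interface. What it tries to show is the property described above, namely that commands and their results land in the same shared transcript everyone in the room can read.

```python
import re


class ChatBot:
    """Hypothetical chat bot that routes messages addressed to it to handlers."""

    def __init__(self, name: str):
        self.name = name
        self.handlers = []    # list of (compiled pattern, handler function)
        self.transcript = []  # every line said in the room, visible to all

    def respond(self, pattern: str):
        """Decorator that registers a handler for commands matching pattern."""
        def register(func):
            self.handlers.append((re.compile(pattern), func))
            return func
        return register

    def say(self, user: str, message: str) -> None:
        self.transcript.append(f"{user}: {message}")
        prefix = self.name + " "
        if not message.startswith(prefix):
            return  # ordinary chatter, not addressed to the bot
        command = message[len(prefix):]
        for pattern, func in self.handlers:
            match = pattern.fullmatch(command)
            if match:
                reply = func(user, *match.groups())
                self.transcript.append(f"{self.name}: {reply}")
                return
        self.transcript.append(f"{self.name}: unknown command")


bot = ChatBot("opsbot")


@bot.respond(r"deploy (\S+) to (\S+)")
def deploy(user, app, env):
    # Placeholder: a real handler would call a deployment tool here.
    return f"{user} is deploying {app} to {env}"


bot.say("alice", "opsbot deploy website to production")
bot.say("bob", "looks good")

for line in bot.transcript:
    print(line)
```

Because the bot's reply is appended to the same transcript as the request, someone who joins later can scroll back and see who ran what, which is the instant context Lambert describes below.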
“This culture thing that you would miss, we have kind of replicated around our chatrooms,” says Lambert. “It is a fascinating way to work. When I first started, we didn’t have any training. I just went into the chatroom and watched what was happening and learned. The chatbot gives us instant context. In a lot of companies, when people make changes, they go off to their own computer and they have this really clunky way of explaining what they are doing. Here, if something is going on, you dive straight into the chatroom and that is where everyone is working. All of us are sending commands through the chatbot, everybody can see it, and there is basically nothing that you can’t do. We are one of the companies that pioneered this way of working and it is getting adopted more and more. It allows us to be extremely distributed and asynchronous without having anyone go into an office.”
Just like its millions of users.
I like the chatbot stuff, but slack in particular is creating an oral-history culture inside organizations: you either hear things when they are said or you don’t.
I assume github developers just hate RoR at this point, they have taken it well beyond its capabilities. Also assume the most senior devs there are probably trying to avoid RoR as much as possible.
For me, github is just a glorified LinkedIn page for hackers…you put up your public portfolio and look at the portfolios of others. For private use, I now prefer gitlab, which provides all of the features of github without the weird limitations of the closed model.