There are a lot of moving parts in a modern platform, and in this regard, it is no different from the platforms made a generation earlier. But a modern platform has a lot more automation and handles more dynamic workloads that pop into and out of existence on different parts of a cluster like quantum particles, and it takes a higher level of sophistication to monitor and manage the stack and the apps running on it.
Frustration with existing open source monitoring tools like Nagios and Ganglia is why the hyperscaler giants created their own tools (Google has Borgmon and Facebook has Claspin, just to name two) or have massively extended these or other tools to cope with their monitoring problems. You might be thinking that monitoring is boring, but it is the lifeblood of any IT organization that is creating microservices-style applications that can be distributed across thousands of bare-metal or virtualized server instances or possibly tens of thousands of containers. If you don’t gather telemetry on all of these bits, you can’t manage them, and therefore you can’t deliver consistent and predictable performance for end users. And then those users leave for an alternative that is possibly only a click away.
SoundCloud, an audio streaming service based in Berlin, Germany that was unhappy with its StatsD and Graphite monitoring tools, started looking around for an alternative five years ago. Lucky for them, Matt Proud, a software engineer who had worked at Google from 2006 through 2012, had stepped away from the search engine giant to start the Prometheus project. Proud was frustrated that the management tools available from the open source community did not keep time series data in a multi-dimensional format or offer an easy-to-use query language akin to SQL, and he started building his own monitoring tool, inspired by what he knew about the Borgmon companion to Google’s Borg cluster manager and job scheduler.
“I had been with Google from 2006 through 2012 and then re-joined them back in 2014,” Proud explains to The Next Platform. “My time at Google has been deeply impactful on my technical outlook and data-orientation. This served as motivation for founding the Prometheus project while I was away. It was impossible to be sufficiently data-oriented as a software engineer without a tool like this.”
Proud did a short stint at SoundCloud (he resigned as head of its technical infrastructure team in late 2013) but others at SoundCloud, notably Julius Volz, co-founder of the project, joined the Prometheus effort. In 2012, the Prometheus project was available under an Apache 2.0 license on GitHub, and it was put into production at SoundCloud a year later. Last January, the members of the Prometheus project made a lot of noise about what they were up to, and by last fall Prometheus was emerging as a contender as the monitoring tool of choice for a modern platform.
The news this week is that Prometheus is officially being incubated by the Cloud Native Computing Foundation, the organization that Google helped create with a bunch of IT industry giants to help steer the development of its Kubernetes container management system. Prometheus is the second official component of this open platform stack being championed by the CNCF, but it is important to realize that Prometheus is going to be one of those components that is wider than any particular platform.
“It is very important to be aware that CNCF is not about Kubernetes only,” says Alexis Richardson, chairman of the technical oversight committee for the organization, who is also founder and CEO of Weaveworks these days. “Prometheus is a great example in that it was written before Kubernetes existed, and was written for a world where a range of different types of applications would need to be monitored. So Prometheus is not just monitoring Kubernetes applications, but also those in Mesos, Docker, OpenStack, and other things, too. There is a wide range of stuff that is going to be out there, and I personally believe there will be more than three main platforms. And so it is a really general but powerful tool. We are trying to pick good tools that will be good for a number of use cases, not just one. It doesn’t have to be containers – it can be VMs, too.”
Like many tools that Google likes, Prometheus was written in Go, and the project has seen its popularity rise dramatically since its coming out party in January 2015. The Prometheus project has 33 repositories on GitHub, and as of January of this year, it had more than 200 contributors to the project who have handled over 800 of more than 1,100 issues with the software. By aligning with the CNCF, the project founders want to drive adoption of Prometheus and have a similar governance model to that of Kubernetes, and have legal assistance when necessary and the means to host events and such to bring the community together.
Richardson is no stranger to advanced architectures and the need for monitoring. Back in the late 1990s, he was a quant working on the fixed income derivatives trading desk at Goldman Sachs, and in 2002 he founded MetaLogic, a maker of a Java application server (which still lives on). He then went on to found Cohesive Networks, an early platform cloud provider, in 2006, and Rabbit Technologies, the maker of the RabbitMQ message queuing platform that is now part of Pivotal, in 2009. After a few years at VMware and Pivotal, Richardson founded Weaveworks, a Docker container monitoring and management platform, in 2014.
“I think that monitoring is incredibly important,” says Richardson. “And it is not just monitoring in a general sense, but how you do it. What is the form factor, what is the idiom, and there are many, many products out there already, much of it open source software, and some of them are still fantastic. In the annual surveys, Nagios keeps coming out on top. But you know what? It is pretty difficult to use.”
According to Richardson, Prometheus is based on a monitoring model with two main ideas. First, it is fundamentally a monitoring tool and it focuses on that and, second, it has a built-in time series database and query system. This lets people construct questions about the monitoring data in a natural way that is easy for developers, and it provides data that goes beyond the normal event-driven reporting that system admins are used to having stream at them. This is part of the evolving market for all software tools, which have analytics built into them from the get-go. Richardson says that the architecture for Prometheus is very flexible, and it was designed so it can be extended to use a larger datastore based on Cassandra or Riak or the Google BigTable or Amazon Web Services DynamoDB services on the cloud. The idea here is the same as any other big data proposition: You store everything so you can ask sophisticated questions later, in this case not to spur revenues but to do root cause and predictive analysis on system and application performance. These extensions are not done yet, but the groundwork has been done.
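To make the idea of a multi-dimensional time series store with a query layer a little more concrete, here is a minimal sketch in Python. It is not Prometheus code and does not use its query language. The `Series` class, the `select` and `per_second_increase` helpers, the `http_requests_total` metric name, and the label names are all hypothetical, chosen only to illustrate how labeled series can be filtered and aggregated after the fact.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """One time series: a metric name plus a set of label dimensions."""
    name: str
    labels: dict
    # (timestamp_in_seconds, value) pairs, assumed sorted by timestamp
    samples: list = field(default_factory=list)


def select(all_series, name, **matchers):
    """Return the series with this name whose labels equal every matcher."""
    return [
        s for s in all_series
        if s.name == name
        and all(s.labels.get(k) == v for k, v in matchers.items())
    ]


def per_second_increase(series, start, end):
    """Average per-second growth of a counter between start and end."""
    points = [(t, v) for t, v in series.samples if start <= t <= end]
    if len(points) < 2:
        return None  # not enough data in the window
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical data: one request counter, split by service and instance.
store = [
    Series("http_requests_total", {"service": "api", "instance": "a"},
           [(0, 0), (60, 120)]),
    Series("http_requests_total", {"service": "api", "instance": "b"},
           [(0, 0), (60, 60)]),
    Series("http_requests_total", {"service": "web", "instance": "c"},
           [(0, 0), (60, 600)]),
]

# "How many requests per second is the api service handling overall?"
api_series = select(store, "http_requests_total", service="api")
api_rate = sum(per_second_increase(s, 0, 60) for s in api_series)
print(api_rate)
```

Because every sample carries its labels, the same stored data can answer a different question later (per instance, per service, or across everything) without changing what was collected.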
“I think this shows that Prometheus is not only a good tool, but also is heading in the right direction,” he adds. “It is a bit more comprehensive than a lot of the other monitoring tools that are out there.”
(Weaveworks, his own company, contributes to the Prometheus effort and benefits from it and uses both Kubernetes and Prometheus as a back end for one of the services it sells.)
Earlier this year, Volz said that Prometheus would be deprecating its own graphical dashboard, called PromDash, and opting to make the popular Grafana tool its front end, which will simplify development efforts a bit.
Google is obviously interested in integrating Prometheus with Kubernetes, and is also using Prometheus internally in some capacity, according to Volz. CoreOS, which has a commercialized Kubernetes stack called Tectonic, has integrated Prometheus with its etcd distributed configuration management system, and Docker has integrations with its container tools, too. DigitalOcean, Boxever, KPMG, Outbrain, Ericsson, ShowMax, and the Financial Times have also declared they are using Prometheus.
But don’t get the idea that Prometheus will knock out Nagios or any other tool any time soon.
“The great thing about monitoring tools is how many of them there are,” Richardson says with a laugh. “I think the reason for that is that ultimately, monitoring is all about tuning about very specific ways of looking at the world. If you walk into the typical enterprise and ask them how many different monitoring tools they have, the answer is often 30, 40, or 50, and that is why there are other products that aggregate their data. Every couple of years, there is a new generation of tools aimed at new architectures. But I don’t think it is very likely that the old tools are going away, because they were built for a reason and they still solve that problem.”
One of the things that the creators of the Prometheus tool wanted was a scalable monitoring tool that didn’t require a complex cluster of its own to operate. You could shard the Prometheus services and spread them across a Hadoop cluster, for instance, but this creates its own management headache. Instead, Proud and Volz wanted a system that could ingest at least 800,000 samples per second out of a cluster and hold millions of series of time-ordered bits of data on a single machine.
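For a rough sense of scale, the 800,000 samples per second figure can be turned into a series count and a daily sample volume with some back-of-the-envelope arithmetic. The sketch below assumes every series is sampled exactly once per collection interval and picks a 15-second interval purely for illustration. Neither assumption comes from the project itself, and the function names are hypothetical.

```python
SAMPLES_PER_SECOND = 800_000   # ingest target quoted above
SCRAPE_INTERVAL_SECONDS = 15   # assumption for illustration only
SECONDS_PER_DAY = 86_400


def active_series(samples_per_second, interval_seconds):
    """Series count if each series yields one sample per interval."""
    return samples_per_second * interval_seconds


def samples_per_day(samples_per_second):
    """Total samples ingested over 24 hours at a steady rate."""
    return samples_per_second * SECONDS_PER_DAY


print(active_series(SAMPLES_PER_SECOND, SCRAPE_INTERVAL_SECONDS))  # 12000000
print(samples_per_day(SAMPLES_PER_SECOND))                         # 69120000000
```

Under those assumptions, the ingest target corresponds to about 12 million concurrently active series and roughly 69 billion samples per day, which is in line with the goal of holding millions of series on a single machine.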
If you want to use the Prometheus tool in production, one of the core committers of the project, Brian Brazil, who was a Google site reliability engineer for seven years working on the search engine giant’s core AdSense and AdWords advertising systems, has started up his own company, called Robust Perception, to provide commercial-grade support for Prometheus. Core committers Volz and Björn Rabenstein were also Google SREs before joining SoundCloud, so when they say Prometheus was inspired by Borgmon, they were actually using the Google tools and not working from some research paper.
By the way, the Prometheus software is not at the 1.0 release level yet, but it is coming this year with the external read and write storage that Richardson referenced above.