Getting Cloud Out Of A Fugue State
August 15, 2016 Timothy Prickett Morgan
The polyphonic weaving of a fugue in baroque music is a beautiful thing and an apt metaphor for how we want orchestration on cloud infrastructure to behave: harmoniously. Unfortunately, most cloudy infrastructure is in more of a fugue state, complete with multiple personalities and amnesia.
A startup founded by some architects and engineers from Amazon Web Services wants to get the metaphor, and therefore the tools, right, and it has just popped out of stealth mode under the apt name Fugue to do just that.
Programmers are in charge of some of the largest and most profitable technology companies in the world, and it is no surprise that the hyperscalers in particular have smashed systems as we know them and turned them into devices that can be programmed – meaning created, adapted, and destroyed – like an application. While there are plenty of tools to manage a cloud and both its stateful and stateless applications and data, there is not, as such, a true operating system for the cloud – at least not in the sense that Fugue co-founder Josh Stella thinks of it.
Among many programming gigs that Stella has had in his career, he was the lead application architect for the US Coast Guard for nine years before becoming a principal solutions architect for Amazon Web Services back in the summer of 2012. After spending some time helping customers architect their systems atop the AWS public cloud, Stella and his co-founders struck out to create a common operating system for AWS that would make it programmable, similar to what some hyperscalers and large enterprises have built individually for their own use because no such thing existed.
If you are running simple n-tier applications with a few dozen instances on a cloud like AWS, this is no big deal. But, as Stella learned from experience, once you start scaling up the applications as well as adding layers of complexity and interdependence between those layers, things can quickly get out of hand.
“I have seen that everyone who runs at scale on cloud runs into some real complexity issues,” Stella tells The Next Platform. “If you look at Netflix as the case study for running at scale on cloud, they put a large team with a great many engineers on the task of automating their very specific use cases. What I see elsewhere is everybody else cobbling together a sort of inferior form of what Netflix made for their specific use cases. We are trying to address the issue of operating at scale on cloud from first principles in a much more generalized way.”
If It Is Baroque, Fix It
The Fugue cloud management stack includes tools for managing deployment, operations during the lifecycle of virtual infrastructure and the applications that reside on it, monitoring everything as it is running, and killing it when it is no longer useful. Stella says his view of cloud is a bit different from that of others who have created tools to run applications on top of clouds like AWS, Microsoft Azure, or Google Cloud Platform. Like us, Stella says it is best to think of a public cloud as a giant distributed cluster and not a remote datacenter.
Of course, a cloud is a remote datacenter, or a collection of them to be precise, with availability zones and fault tolerant features for compute, networking, and storage. But a cloud is more than that, and it has very different kinds of interfaces than a local, discrete system or a collection of them stacked up in the corporate datacenter. Instead of being made of devices, which need to be configured, a cloud is really just a collection of API-fronted services that do compute, storage, networking, and so on.
Thus, architecturally speaking, the Fugue cloud management tool looks very much like an operating system, albeit one that manages APIs instead of devices and that, importantly, is not designed to replace Linux or Windows Server running atop clouds or as the foundation of those clouds. (Most public clouds are based on Linux, of course, with Microsoft accounting for the vast majority of cloud capacity running atop Windows Server.)
Others have laid claim to being the operating system for cloud – VMware with its software defined datacenter, Mesosphere with its Datacenter Operating System, and the stacks that are evolving around Google’s Kubernetes and Microsoft’s Azure. The hyperscalers have what we could consider a cloud operating system as well, with Google having Borg as its foundation, Facebook having Kobold and FBOSS, and Microsoft having Autopilot. (We don’t know what Amazon has internally, but it probably has some such layer as well.) Fugue is inspired by these but is its own thing, and for now at least, the company has no plans to open source its wares as has happened with Mesos and Kubernetes.
The Fugue cloud operating system has two main components. Fugue Conductor is roughly analogous to an operating system kernel. With an old-style system from the 1960s and 1970s, applications would address devices directly, and operating system kernels were added to play traffic cop and juggle access to devices on a discrete computer. Conductor does the same thing for APIs on the AWS cloud, and takes the management of cloud from what Stella calls “assembler code” level up to a higher level of abstraction that is akin to a programming language and a compiler for it. As most management tools do, Fugue has a domain-specific language, in this case called Ludwig after Beethoven, the master of the fugue, which is used to programmatically control access to resources on AWS.
The Conductor runs on an EC2 instance inside of a single AWS account, with a very small attack surface and a highly secure code base, according to Stella. Conductor currently requires a single m4.large EC2 instance, and you can run a pair of them in a high availability cluster; it also uses some DynamoDB NoSQL database services on AWS to store data. Add these up, and a Conductor instance costs around $300 to run, which is a pittance compared to the tens to hundreds of thousands of dollars a month – in some cases as much as millions – that large enterprises are spending on AWS these days.
Fugue is in a way much more fundamental than a container management system like Kubernetes or Mesos, and importantly, Fugue assumes that there will be more higher-level abstractions that will run atop the cloudy infrastructure (much as Google created Borg and then implemented it with Omega). Take, for instance, AWS Lambda serverless functions, for which Fugue was able to add support within a matter of weeks of their launch.
“We can do this quick turnaround on support because we express ourselves as a language,” Stella explains. “I am a programmer, and this is really important. I hate black boxes. I don’t want to have to call a software vendor to add a feature, I want to write code to give myself that feature. And so what we have chosen to do is embed a big chunk of the functionality in Fugue into libraries and you can decide how to do things differently. This is not a PaaS, even though it has all of the benefits of a PaaS in that you can go really fast and you can declare a whole virtual private cluster with all of the subnets and routing, and by the way all of this is all done very well by AWS and you can do this with just eight lines of code in Fugue. But if you want a different kind of network you can also change that and write your own function that will define a network in Ludwig, which uses the Hindley-Milner Type system – and don’t let that scare you, you do not have to become a functional programmer to use Fugue – we just get the benefits in the compiler. The point is, if it has an API, we can automate it and let you reason about it in code and declare it.”
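To make the library idea concrete, here is a minimal sketch in Python, not in Ludwig, whose syntax is not shown in this story. It assumes a hypothetical library function, declare_network, that expands a short declaration into subnets and a default route; the function name, the dictionary layout, and the zone names are all illustrative assumptions rather than anything Fugue ships.

```python
import ipaddress


def declare_network(cidr, zones, new_prefix=24):
    """Expand a one-line declaration into a network description.

    Hypothetical helper: carves one subnet per zone out of `cidr`
    and attaches a single default route.
    """
    network = ipaddress.ip_network(cidr)
    # subnets() yields consecutive blocks of the requested prefix length;
    # zip() pairs each zone with the next free block.
    subnets = [
        {"zone": zone, "cidr": str(block)}
        for zone, block in zip(zones, network.subnets(new_prefix=new_prefix))
    ]
    return {
        "cidr": str(network),
        "subnets": subnets,
        "routes": [{"destination": "0.0.0.0/0", "target": "internet-gateway"}],
    }


# The caller writes a short declaration; the library fills in the details.
net = declare_network("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for subnet in net["subnets"]:
    print(subnet["zone"], subnet["cidr"])
```

Someone who wants a different kind of network would, in this sketch, write their own variant of declare_network rather than ask a vendor for a feature, which is the spirit of what Stella describes.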
The one thing that Stella does not believe in is architecting a tool that aims for the lowest common denominator of functionality across public clouds, even if there is going to be a multi-cloud architecture in the future of most enterprises. The LCD approach did not work with Unix operating systems two decades ago, with the POSIX and Unix standards adopted to help application and file portability. All of the Unixes still remained distinct and tied very much to their architectures. And we believe, as does Stella, that this will be true of the public clouds and their inevitable private cloud offshoots.
The Fugue Composer and its Ludwig language are designed to work with any cloud, and while the company is starting with the AWS cloud, which has the lion’s share of the market right now, Microsoft Azure support is coming next and it is likely that Google Cloud Platform and then IBM SoftLayer and a few others will follow suit.
“The important thing is not that you have one language abstraction or data abstraction that defines things on multiple clouds,” says Stella. “The important thing is that you have one operating system that is running your processes, just like if you run Word on Windows it is a process. When you run an application, Fugue manages that as a process, just like an operating system kernel does, and so the day in, day out use of Fugue is to declare things and once things are running the Conductor is managing them as processes.”
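The "declare things, then manage them as processes" pattern Stella describes can be sketched as a comparison between declared state and running state. The Python below is a rough illustration under stated assumptions, not Fugue's implementation: the Resource shape, the plan function, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A declared or observed cloud resource (hypothetical shape)."""
    name: str
    kind: str
    size: str


def plan(declared, running):
    """Return the actions needed to make `running` match `declared`."""
    want = {r.name: r for r in declared}
    have = {r.name: r for r in running}
    actions = []
    # Declared but not running: create it.
    for name in sorted(want.keys() - have.keys()):
        actions.append(("create", name))
    # Running but drifted from the declaration: replace it.
    for name in sorted(want.keys() & have.keys()):
        if want[name] != have[name]:
            actions.append(("replace", name))
    # Running but no longer declared: destroy it.
    for name in sorted(have.keys() - want.keys()):
        actions.append(("destroy", name))
    return actions


declared = [
    Resource("web", "instance", "m4.large"),
    Resource("db", "table", "standard"),
]
running = [
    Resource("web", "instance", "t2.micro"),  # drifted from the declaration
    Resource("tmp", "instance", "t2.micro"),  # never declared
]

for action in plan(declared, running):
    print(action)
```

A kernel-like manager would run a comparison of this kind continuously, so that the declaration, not the last manual change, decides what is running.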
The black box issue is one problem, but another one is that templating systems such as Chef or Puppet, which store their documents in JSON or XML formats, are “human hostile” as Stella puts it. “These have all of the problems of programming and none of the benefits of programming,” he says with a laugh. By its nature, Fugue has to support a lot of different use cases compared to internal tools like Google Borg, Facebook Kobold, or Microsoft Autopilot or even Netflix riding on top of AWS with its Chaos Monkey layer, because the top techies at these companies can simply say to programmers that they have to do certain things in a specific way and that is that. A third party management tool has to be more flexible, as Google itself has learned by recasting parts of Borg as Kubernetes. Fugue has to work with lots of different models and approaches, just like Kubernetes does.
So a hospital chain, with HIPAA compliance issues, has a very different approach to using AWS and Azure compared to Silicon Valley startups trying to become the next Google or Facebook, to give just one example. Fugue is working with dozens of customers as its wares become generally available this month, and they have a range of use cases.
One is Fortune 500-class industrial giants that are heavily regulated and are interested in implementing “policy as code,” which allows infrastructure management libraries to embed policies that prevent these companies from breaking the rules and regulations imposed by themselves and by various governments around the world. A second is startups looking for an easier way to implement DevOps, given the scarcity of experts in this field and a slew of startups competing aggressively for site reliability engineers and other high level system administrators. This is a kind of full automation of infrastructure – think of it as an “infrastructure as code” model – that involves integrating Fugue with Jenkins or other application deployment frameworks. A third scenario is an evolving variant of infrastructure as code that is being called “immutable infrastructure” by Stella and others, which essentially makes application upgrading easier by not upgrading running code in the field but by always pushing out shiny new code and running it in production. (We will drill down into these immutable infrastructure concepts in a separate story.)
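As a rough illustration of “policy as code,” the Python sketch below checks a set of declared firewall rules against a policy before anything is deployed. It is a minimal sketch under assumptions: the rule dictionaries, the open_ssh_violations policy, and the validate helper are hypothetical and do not represent Fugue's libraries or Ludwig.

```python
def open_ssh_violations(rules):
    """Policy: no ingress rule may expose port 22 to the whole internet."""
    return [
        r["name"]
        for r in rules
        if r["cidr"] == "0.0.0.0/0" and r["from_port"] <= 22 <= r["to_port"]
    ]


def validate(rules, policies):
    """Run every policy over the declaration and collect the violations."""
    errors = []
    for policy in policies:
        for name in policy(rules):
            errors.append(f"{policy.__name__}: {name}")
    return errors


declared_rules = [
    {"name": "web-https", "cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"name": "admin-ssh", "cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
    {"name": "office-ssh", "cidr": "203.0.113.0/24", "from_port": 22, "to_port": 22},
]

errors = validate(declared_rules, [open_ssh_violations])
for error in errors:
    print(error)
```

Because the policy is ordinary code living in a library, it can be versioned, reviewed, and run automatically against every declaration, which is the appeal for heavily regulated companies.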
The Fugue software is available now, and the company is working on a pricing model that will be proportional in some fashion to the value it brings and to the size and scale of the AWS infrastructure, but Stella says that it will not simply be a percentage of the AWS spending each month. The Conductor orchestrator is designed to scale up to a large AWS account, but many large enterprises have multiple accounts and will need multiple Conductors, and at some point Fugue will have to federate them. The capacity of each Conductor is gated more by AWS, which will throttle back the API throughput if a user gets too out of hand, than it is by the performance of the EC2 instance running Fugue.
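API throttling of this kind is commonly handled on the client side by retrying with exponential backoff and jitter. The Python sketch below shows that general technique against a stand-in API; it is an assumption about how a tool might cope with throttling, not a description of how Conductor does it, and the Throttled exception, call_with_backoff, and make_flaky_api are hypothetical names.

```python
import random
import time


class Throttled(Exception):
    """Raised by the (hypothetical) client when the provider rejects a call."""


def call_with_backoff(call, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry `call` on Throttled, waiting longer (with jitter) each time."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def make_flaky_api(failures):
    """Build a stand-in API call that is throttled `failures` times, then succeeds."""
    state = {"remaining": failures}

    def describe_instances():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise Throttled("rate exceeded")
        return ["i-web", "i-db"]

    return describe_instances


# Record the waits instead of sleeping so the example runs instantly.
waits = []
result = call_with_backoff(make_flaky_api(2), sleep=waits.append)
print(result, len(waits))
```

The jitter spreads retries out so that many callers throttled at the same moment do not all retry at the same moment.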