When POSIX I/O Meets Exascale, Do the Old Rules Apply?
December 4, 2017 Dr. Rosemary Francis
We’ve all grown up in a world of digital filing cabinets. POSIX I/O has enabled code portability and extraordinary advances in computation, but it is limited by its design and the way it mirrors the paper offices that it has replaced.
The POSIX API and its implementations assume that we know roughly where our data is, that accessing it is reasonably quick and that all versions of the data are the same. As we move to exascale, we need to let go of this model and embrace a sea of data and a very different way of handling it.
In computer science, it is often said that there are only two ideas in the discipline: abstraction and caching. While this holds for a lot of sophisticated solutions, those ideas break down very quickly when moving to exascale.
In the Next Platform piece, “What’s So Bad about POSIX I/O?” the author gives his expert take on exactly why POSIX I/O is holding back performance today, such as how the semantics of the POSIX I/O standard in particular impose a real limitation. To add to those points, it is indeed high time we come up with a new design. POSIX I/O was designed at a time when storage was local. It was simple to implement systems where applications had a consistent view of what was on disk. In a world of distributed systems and exascale compute, we can no longer rely on the assumption that all programs are looking at the same view of the data.
As systems grow, the time taken to sync data increases and long distributed lock times will become the biggest bottlenecks. We need to move to a world where data tiering and access times form part of the system design from the start.
The design of the I/O needs to take as much engineering effort as the compute algorithms we are used to profiling. A highly optimised application can be dramatically slowed down by bad I/O design and silly mistakes, such as shared log files or small I/O operations. Often the application is fine, but the environment it is deployed into makes the wrong choices. We’ve seen start-up scripts try to open every file in the home directory in order to set a licence option.
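To make the cost of small I/O operations concrete, here is a minimal sketch in Python. The `CountingRaw` class and the `write_records` helper are hypothetical names invented for this illustration; the class simply stands in for a storage device and counts how many write requests reach it. The same 16-byte records are written once directly and once through a standard buffered writer.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a storage device that counts write requests."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        # Every call here represents one request sent to the storage layer.
        self.calls += 1
        self.data += bytes(b)
        return len(b)


def write_records(stream, count=1000, record=b"0123456789abcdef"):
    """Write many small records, as a chatty application might."""
    for _ in range(count):
        stream.write(record)
    stream.flush()


# Unbuffered: each small record becomes its own request.
unbuffered = CountingRaw()
write_records(unbuffered)

# Buffered: records are gathered in memory and sent in a few large requests.
backing = CountingRaw()
buffered = io.BufferedWriter(backing, buffer_size=8192)
write_records(buffered)

print("unbuffered requests:", unbuffered.calls)
print("buffered requests:", backing.calls)
```

The data that arrives is identical in both cases; only the number of requests differs. On a shared parallel file system, each of those requests may carry network and locking overhead, which is why this pattern can matter far more than it does on a local disk.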
No matter when and how we move away from the working methods of the past, there will inevitably be a period of transition where new layers in the I/O stack are used to migrate legacy code towards a more efficient mode of operation. New technologies such as burst buffers (e.g. IME from DDN) allow applications to continue to take storage for granted, while the heavy lifting is done in the back end to make that work.
At some point, however, we need to start thinking about how to migrate developers away from old ways of doing things. None of the short-term optimisations to accelerate I/O is a one-size-fits-all solution, so the need to design an I/O policy is simply postponed. No one has come up with a general-purpose file system for exascale, no matter what their branding says. Internally there are still a thousand trade-offs that try to guess what the best set-up will be.
We also need to move away from the old guarantees that POSIX provides. The POSIX API is too generous. Users don’t necessarily need the guarantees it provides, but systems engineers have to assume that they do. Good system design at a large scale rarely needs large amounts of storage with consistent concurrent reads and writes. Let’s get rid of those guarantees!
One of the hopes for the adoption of object storage is that developers start to think about the movement of data as part of the application design and not something they need to wrap their computation in. In the short term, most organizations are installing their object storage with a POSIX wrapper, but the effectiveness of these layers is limited to specific I/O patterns. This makes it a very short-term fix indeed.
Many I/O problems are caused by the storage design. For example, it is only possible to trawl the file system looking for a dependency if you have a directory tree structure. If you move to a way of structuring data that reflects the complexity of modern compute, you can do away with that kind of model and suddenly lazy ways of picking up resources disappear. Why are we still using the PATH variable for everything?
To get the best performance, users need to be able to adjust their I/O patterns for the system they are running on. So far, no one has proposed a truly adjustable API, one where developers do not need to know how big their reads and writes are or how long they will take.
Let’s consider the option of an Oliver Twist-style I/O policy where applications simply ask for “some more please”. That would allow system designers to configure I/O to be block aligned.
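As a rough illustration of what a “some more please” interface might look like, here is a minimal sketch in Python. The `MoreReader` class, its `more()` method, and the `block_size` and `blocks_per_read` parameters are all hypothetical names invented for this example, not an existing API. The point is only that the application never states a read size; the system side chooses a multiple of the block size.

```python
import io


class MoreReader:
    """Hypothetical reader: the caller asks for 'more', the system picks the size."""

    def __init__(self, raw, block_size=4096, blocks_per_read=4):
        self._raw = raw
        # The request size is a system-side decision, always a whole number
        # of blocks, so full reads starting at offset 0 stay block aligned.
        self._chunk = block_size * blocks_per_read

    def more(self):
        """Return the next chunk of data, or b'' when nothing is left."""
        return self._raw.read(self._chunk)

    def __iter__(self):
        while True:
            chunk = self.more()
            if not chunk:
                return
            yield chunk


# Usage: the application only ever asks for more.
source = io.BytesIO(bytes(10000))
total = 0
for chunk in MoreReader(source, block_size=4096, blocks_per_read=1):
    total += len(chunk)
print(total)
```

Because the chunk size lives in the reader rather than in the application, a system designer could tune it per file system or per storage tier without touching application code. An in-memory `BytesIO` is used here purely so the sketch is self-contained.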
One example of data storage thinking outside the box is the implementation of a log device at Facebook. It’s not simple and it’s not general, but it is scalable.
In this new data utopia, perhaps data can be read-only, write-only, or read-write with the latter being a special mode that has to be entered into explicitly. Ideally, read-write data should be local or in a database. Write-only data can be signed off when the writer has finished. This concept works well in a pipeline-style workflow. Meta-data is ready for a similar revolution.
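The write-only mode with an explicit sign-off can be sketched with ordinary files, under some loudly stated assumptions: the `WriteOnlyDataset` class and the `open_read_only` helper are hypothetical names invented for this illustration, and the sketch leans on a local file system where a rename within one directory is atomic. It is not a proposal for how an exascale store would implement the idea.

```python
import os
import tempfile


class WriteOnlyDataset:
    """Hypothetical write-only dataset that is invisible until signed off."""

    def __init__(self, final_path):
        self._final_path = final_path
        directory = os.path.dirname(os.path.abspath(final_path))
        # Write to a temporary file next to the final location so readers
        # never see partially written data under the final name.
        fd, self._tmp_path = tempfile.mkstemp(dir=directory)
        self._file = os.fdopen(fd, "wb")
        self._signed_off = False

    def write(self, data):
        if self._signed_off:
            raise RuntimeError("dataset already signed off")
        self._file.write(data)

    def sign_off(self):
        """Finish writing, mark the data read-only, and publish it."""
        self._file.flush()
        os.fsync(self._file.fileno())
        self._file.close()
        os.chmod(self._tmp_path, 0o444)
        os.replace(self._tmp_path, self._final_path)
        self._signed_off = True


def open_read_only(path):
    """Read-only access for the next stage of the pipeline."""
    return open(path, "rb")
```

In a pipeline-style workflow, one stage would write and call `sign_off()`, and the next stage would only ever call `open_read_only()` on the published name. Neither side needs consistent concurrent reads and writes on the same data.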
There are likely a million flaws in this simple design, but instead of pointing them out, let’s start designing the future and see what we can come up with! There are lessons we can learn from the semiconductor design industry. As Moore’s law has made our processor chip sets ever denser, the cost of cross-chip communication has risen. The move from increasingly complex single-core processors to a multi-core model brought challenges to the designers and the programmers, but at the same time it was a step in the direction that we could all see coming.
Returning to the idea of a sea of data, I want to have wider and wackier ideas presented. Let’s stop designing yet another API or caching layer and look to something completely different. Exascale compute is going to be able to do things that we currently cannot imagine so let’s stop filling data centres with digital offices and let’s start to tap into what a completely different universe could give us.
Dr. Rosemary Francis is CEO and founder of Ellexus, the I/O profiling company. Ellexus makes application profiling and monitoring tools that can be run on a live compute cluster to protect from rogue jobs and noisy neighbors, make cloud migration easy and allow a cluster to be scaled rapidly. The system- and storage-agnostic tools provide end-to-end visibility into exactly what applications and users are up to.