Operating Systems, Virtualization, And The Machine

I recall teaching a college class in operating systems when, well into the course and thinking I was getting the points across, a few of the students stopped the class and asked, “Yes, but what really is an operating system?” I was momentarily taken aback, but the question was absolutely fair and not really all that easy to answer.

OK folks, it’s your turn. Given what you have seen so far of The Machine from the first four articles in this six-part series (see the bottom of this story for links), what is an operating system? And in answering, how does such an operating system guarantee The Machine’s security and integrity? I’ll pause here and hum the Jeopardy! theme song while you mull this over and I decide how to proceed.

Recapping a few things we know about The Machine, all of which need managing:

DRAM memory on a node is only accessible from the processors on that node.
DRAM memory is more quickly accessed and is volatile.
Processor cache coherence is only maintained within a node, the cache lines having been tagged with local-scope real addresses.
Processor cache can hold data blocks from both local DRAM and Fabric Memory.
Processor cache is volatile.
All non-volatile fabric memory can be accessed by any processor in The Machine. Fabric memory can, therefore, be used for persistence, internode sharing, or both.
Processors on a node can only access those portions of fabric memory onto which a portion of those processors’ real address space has been mapped.

We know of and can picture variations on what we are going to describe next, but:

OS design typically assumes full cache coherence amongst all of the processors being managed.
All memory is accessible from all processors, even those residing on different ccNUMA nodes.
In virtualized systems, those supporting multiple OS instances, there typically exists a trusted hypervisor which also assumes the previous two points.
Each OS instance, within my background, is called a partition, a term we will be using.
The hypervisor does not trust the OSes.
Hypervisors work to isolate the partitions under their control from each other.

So what of The Machine?

At some level, all that an OS for The Machine needs to be is a package of code licensed by Hewlett Packard Enterprise that manages and abstracts all of the lower level aspects of the hardware while also providing all that you have come to expect from an OS. HPE could provide a single product and it would do its best to make the hardware of The Machine in its entirety appear as the single-system image that you would want it to be. Such an OS does seem to be in the future plans for The Machine. Indeed, if I understood it correctly, this presentation at HPE Discover in London by Martin Fink seems to suggest that such a single-system OS – called Carbon – may best be managed using something he called “Container OS”, with containers used for efficiently using and managing resources within a single OS. (If unfamiliar with containers, a good place to start is here, and to be fair here.) But The Machine is nonetheless different, for all of its wonderful attributes; prudence suggests that the architecture development of such a wonderful tool should take its time in determining all that is expected at the general availability of such an all-encompassing OS. (More on containers later.)

It happens, though, that there is a nearer term OS approach which would allow for The Machine’s hardware to be made available sooner.

Recall that one attribute assumed by OSes is full cache coherency as well as physical, if not actual, accessibility to all of the memory. Using an existing OS as we understand it – say a Linux derivative – would require that we limit the scope of such an OS instance – a partition – to a single node of The Machine. Each of The Machine’s nodes would have at least one such Linux-like partition, but those partitions would not span the processors or DRAM of multiple nodes.

Processor virtualization requires that the OSes residing on the same system be assured isolation from each other by default, just as though every OS was on its own hardware. The data owned by one partition and residing in DRAM memory may not be accessed by another, even one with physical access to the same DRAM. Serendipitously, The Machine does not allow a partition in one node any physical means of accessing the DRAM owned by a partition in another. In this sense, The Machine is like a distributed-memory cluster which, because its memory is distributed, also ensures isolation. Of course, it remains a hypervisor’s responsibility to assure this isolation for partitions sharing the same node.

Virtualization, as managed by a hypervisor, normally allows a system’s partitions to share the processors of the system, this done via hypervisor-controlled placement and time slicing. But, again, the hardware-enforced local access by each node’s processors to only local memory and the lack of cache coherency across nodal boundaries tends to make cross-node processor sharing impractical. The Machine could move partitions between nodes, but such movement is more than just temporarily having a partition use another node’s processors; it also requires that the partition’s entire state in memory (and cache) be moved to that new node.

As a quick outline, a few of what we are sure are many more additional hypervisor responsibilities in The Machine would include:

Providing to each partition a set of real address ranges which are backed by fabric memory.
Allowing partitions access to only those portions of fabric memory to which the partition has access rights.
Managing the physical location of objects – objects which can also be files – within fabric memory, given some token representing the object.
Securely managing both the inter-node and external networks. Not only are the nodes interconnected via the memory fabric, but inter-node communications are also possible via a more traditional communications network, the same hardware being used to communicate outside of The Machine.

As a side observation, you should be asking “Where does the hypervisor really reside?” In more traditional – read that cache-coherent shared-memory – virtualized systems, the hypervisor carves out some of the system’s memory for its own and it steals processor cycles from any of the available processors on an as-needed basis. We assume – but do not know – that The Machine’s hypervisor would follow suit. Notice again, though, that The Machine is not a fully cache-coherent shared-memory system; communication between each node’s portion of The Machine’s hypervisor is possible – with some effort – via the fabric memory, atomic storage, and of course, more traditional communications mechanisms. I can imagine that the design of such partially distributed management is not trivial.

Yes, the hypervisor could be completely distributed, sharing parts of processors and memory – persistent and otherwise – on every node. But I am getting hints from a number of sources that at least the global file manager is not. The files’ data can potentially reside anywhere in Fabric Memory, but the metadata, the description of the files and their locations, appears to be in a separate system, one at top of rack, quickly accessible to and from every other node of the system.

Memory Addressing And Security In HPE’s Future Machine

So there we have it, as detailed in the first four parts of this series, The Machine from Hewlett Packard Enterprise, with scads of compute power and petabytes of persistent memory, all physically accessible by any of the processors – all that data ripe for rapid picking. The folks at HPE have it right when they speak of the need for integrated, tight security throughout the system. Much of the enablement for such security arises from the various forms of addressing found in such a system.

We have already seen how the lowest level of addressing – real addressing – is protected on The Machine in ways not found on most systems. Each node’s processors can have their real address space extend well outside of its own node’s memory, well into fabric memory of any number of other nodes. But system integrity and data security require that each node’s real address space be allowed to extend to only that fabric memory containing data which that node has the right to access. Given an OS instance – a partition – per node, this also means that each partition has access to only allowed portions of memory. It is like a new level of hypervisor-managed security; only the hypervisor is trusted enough to provide the needed real-to-physical location mapping.

Still, most code doesn’t ever work directly with real addresses. Far and away most code works with virtual addresses which the processor hardware securely maps onto these real addresses. That notion has been around forever, but it is because of that mapping that the security understood from both processor virtualization and process isolation happens to work. We will talk about that in more detail shortly, but with many forms of processor virtualization each operating system and each partition is allowed by the hypervisor and the hardware to access only its own part of the physical memory. The hypervisor owns the real address space(s) and only allows a partition’s virtual address space to be mapped onto portions of the real address space reserved for that partition. System security and integrity are enabled by such address mapping.

Processes within each operating system are similarly isolated and therefore secure because of higher level addressing. Normally, even for the many processes within the same operating system, the data owned by one process is not accessible by another. Such addresses are effectively process-local; each is part of a range of addresses associated only with the program code executing on behalf of a process. When a user-level program is using an address, it is these types of addresses which are being used (and certainly not real addresses).

We intend to speak to this notion of addressing as it relates to The Machine in a moment. But, just to be sure that we are first all on the same page, and partly because we have not found a sufficient explanation elsewhere, allow us to quickly outline the relationship between this process-local addressing, real addressing, and physical memory. Largely because we find the terminology cleaner, we are going to use terms found in the Power processor architecture from IBM, but the concepts are applicable most everywhere.

As mentioned, user-level programs do not use real addresses. Instead, to ensure isolation between processes, each process is given its own process-local address space. The Power architecture calls this entire address space an effective address space. Again, each process – if you like, each program – is given its own effective address (or EA) space. For instance, you have undoubtedly heard of 32-bit or 64-bit systems. These typically also mean that the size of the effective address space is 32 bits (2³² = 4 Gibibytes) or 64 bits (2⁶⁴ = 16 Exbibytes, a lot bigger than physical memory). This is what the instructions of a program perceive; it is not an address into physical memory nor a real address. Being process-local, if a Process B attempts to use an EA_a value produced by a Process A, that EA_a value will typically mean something completely different (if it means anything at all). In effect, one process cannot normally address into the memory used by another. This is process isolation, whether those processes – those programs – reside in the same operating system partition or in different partitions.
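As a quick check of that arithmetic, here is a small Python sketch that computes the size of a 32-bit and a 64-bit effective address space in binary units. The helper name address_space_bytes is ours, made up purely for illustration.

```python
GIB = 2**30  # bytes in one gibibyte
EIB = 2**60  # bytes in one exbibyte


def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses in an address space of the given width."""
    return 2**bits


print(address_space_bytes(32) // GIB, "GiB")  # 32-bit EA space
print(address_space_bytes(64) // EIB, "EiB")  # 64-bit EA space
```

Either figure describes only what a program can name, not what is physically installed.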

These effective addresses should ultimately represent some data residing in memory; somehow when the processor’s instructions use an EA it means that memory is to be accessed. Even though EAs do isolate processes from each other, there is often good reason for multiple processes to access the same data at the same memory locations. This is certainly true for processes sharing the same partition, but it can also be true for processes in different partitions sharing the same physical memory. And this last – this shared physical memory – happens to be fabric memory in The Machine. (Even though different nodes can map portions of their real address space onto the same physical location in fabric memory, no user-level program uses real addresses; we need yet to describe a way for such programs to use their EAs to share fabric memory.)

To allow such inter-process sharing, the Power architecture includes a notion called a virtual address (VA). For two processes – each with their own process-local EAs – to share the same memory location, the operating system typically securely arranges for the hardware to translate each process’ EAs (which are different values) to the same VA value. In the same way that an effective address space is a contiguous address space for processes, the virtual address space tends to be a single contiguous address space for the whole of an OS, one scoped to that OS only.

If, though, inter-process sharing or not, a process attempts to use an EA value which was not mapped to a VA, this is often considered an addressing exception, perhaps even a violation of security. Upon such exceptions, the OS either securely determines what the legal mapping should have been, or blows this process away for attempting to violate the system’s security.
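To make the EA-to-VA step concrete, here is a toy Python model of two processes whose different EAs are mapped by the OS onto one shared VA, and of what happens when a process uses an EA that was never mapped. Everything in it (the Process class, os_share, AddressingException, the address values) is hypothetical, and it maps single addresses where real hardware works on much larger units; treat it as a sketch of the idea, not of the Power architecture’s actual mechanism.

```python
class AddressingException(Exception):
    """Raised when a process uses an EA that has no VA mapping."""


class Process:
    def __init__(self, name):
        self.name = name
        self.ea_to_va = {}  # process-local table: EA -> VA, filled in by the OS

    def translate(self, ea):
        try:
            return self.ea_to_va[ea]
        except KeyError:
            raise AddressingException(
                f"{self.name}: EA {ea:#x} is not mapped"
            ) from None


def os_share(va, proc_a, ea_a, proc_b, ea_b):
    """Hypothetical OS service: map two different EAs onto one shared VA."""
    proc_a.ea_to_va[ea_a] = va
    proc_b.ea_to_va[ea_b] = va


a, b = Process("A"), Process("B")
os_share(0x7000_0000, a, 0x1000, b, 0x9000)

# Different EA values in each process, same VA underneath.
assert a.translate(0x1000) == b.translate(0x9000)

try:
    b.translate(0x1000)  # A's EA value means nothing in B's EA space
except AddressingException as exc:
    print(exc)
```

In this sketch the OS alone fills in the tables; a process can only look addresses up, which is the property that isolation depends on.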

It is these virtual addresses that the processor hardware maps to real addresses (RAs), the OS having decided which virtual address pages are mapped to which real address pages. Although there are variations on this theme, for now think of a page as being 4096 (2¹²) contiguous bytes each starting on a 4096-byte boundary. This mapping allows pages in an OS’ contiguous virtual address space – an address space which is typically far bigger than a system’s physical memory – to be mapped onto arbitrary pages in a real address space. It’s useful to think – although not sufficiently true – of each byte of the real space as representing a unique byte in physical memory. This also tends to mean that there is a one-to-one relationship between each RA and a byte in physical memory. You have seen that this is not always the case for The Machine.
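The page arithmetic is easy to sketch. The following Python fragment splits a VA into a virtual page number and a byte offset, looks the page up in a made-up page table, and rebuilds the RA. The table contents and the name va_to_ra are assumptions for illustration only; real hardware uses far more elaborate structures.

```python
PAGE_SHIFT = 12               # 2**12 = 4096-byte pages, as in the text
PAGE_SIZE = 1 << PAGE_SHIFT
OFFSET_MASK = PAGE_SIZE - 1

# Hypothetical OS-maintained page table: virtual page number -> real page number.
# Adjacent virtual pages land on unrelated real pages.
page_table = {0x12345: 0x00042, 0x12346: 0x00007}


def va_to_ra(va: int) -> int:
    vpn, offset = va >> PAGE_SHIFT, va & OFFSET_MASK
    rpn = page_table[vpn]     # a missing entry here corresponds to a page fault
    return (rpn << PAGE_SHIFT) | offset


print(hex(va_to_ra(0x12345ABC)))  # 0x42abc
print(hex(va_to_ra(0x12346000)))  # 0x7000
```

Only the page number is translated; the low 12 bits, the offset within the page, pass through unchanged.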

On The Machine, real address spaces are scoped to a node:

Part of a node’s RA space refers to the node’s local DRAM.
Part of a node’s RA space refers to mapped locations in Fabric Memory, where that Fabric Memory can potentially reside on any node.
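One way to picture this two-part real address space is as a short list of windows, each covering a range of RAs and pointing either at local DRAM or at a region of fabric memory homed on some node. The Python sketch below is our own illustration; the Window and NodeRealAddressSpace names, the field layout, and the addresses are all assumptions, not a description of how The Machine’s hardware actually records these mappings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    ra_start: int    # first real address covered by this window
    length: int      # size of the window in bytes
    backing: str     # "dram" or "fabric"
    home_node: int   # node whose memory physically holds the bytes
    phys_start: int  # offset within that node's DRAM or fabric memory


class NodeRealAddressSpace:
    def __init__(self, node_id, windows):
        self.node_id = node_id
        self.windows = windows

    def resolve(self, ra):
        """Return (backing, home_node, physical_offset) for a real address."""
        for w in self.windows:
            if w.ra_start <= ra < w.ra_start + w.length:
                return (w.backing, w.home_node, w.phys_start + (ra - w.ra_start))
        # No window: this node was never given a mapping over that memory.
        raise PermissionError(f"node {self.node_id}: RA {ra:#x} is not mapped")


node0 = NodeRealAddressSpace(0, [
    Window(0x0000_0000, 0x1000_0000, "dram", 0, 0x0),
    Window(0x1000_0000, 0x1000_0000, "fabric", 3, 0x4000_0000),
])

print(node0.resolve(0x0000_2000))  # lands in node 0's own DRAM
print(node0.resolve(0x1000_0010))  # lands in fabric memory homed on node 3
```

An RA outside every window simply has nowhere to go, which is the low-level isolation the text describes.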

Where an unmapped EA results in an exception when used, an unmapped VA can identify an access violation, but it more likely – on today’s systems – represents a virtual page which is not at this moment in physical memory. It’s a page fault; the OS goes out and finds the needed page, often in backing persistent storage, and brings that page into DRAM. Recall that on most of today’s systems, the smaller DRAM is also a cache of data out on disk. What’s cool about The Machine in this context is that that very same data is already in memory, fabric memory. Certainly the contents of fabric memory could be paged into (i.e., directly copied into) a node’s DRAM and accessed from there, but often it’s just as straightforward simply to map a VA page onto an RA page of something already in fabric memory. That being the case, the delay associated with what today’s systems consider completely normal page faults would seem on The Machine to become non-existent. Too cool. And these mappings only occur if all parties essentially agree that this process has the right to access this data.

Each memory-accessing instruction of a program – and this includes the program’s instruction stream as well – completes a memory access by having the processor hardware translate each EA to a VA and then to an RA and ultimately to a physical memory location. All this has been set up and managed to enable the access and sharing of only that which the process and its OS are allowed to access. And, for reasons that we won’t get into here, the hardware handles this entire address translation process, once set up, very rapidly. And it is all an integrated part of system security. Level-upon-level of security.

Let’s pause here for a moment and consider two forms of virtualization and how these notions get used. In both, there is a requirement for isolation; how that isolation is being provided based on the above is what we will be covering.

First: One form of virtualization, one now generally well understood, allows multiple OS instances – partitions – to all share an SMP’s processors and memory (at least). Each partition perceives functionally that it is alone in executing on this system. The partition is guaranteed by a common hypervisor that it has some amount of physical memory that belongs exclusively to that partition; no other partition perceives – and so cannot change – that memory. This is typically accomplished by having the hypervisor take control of the system’s real address space. Each partition, with its own VA space, requests the hypervisor to map the partition’s VA pages onto RA pages owned only by that partition.
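A toy model of that arrangement, in Python: the hypervisor records which partition owns each real page and refuses any mapping request that targets a page the requester does not own. The Hypervisor class, its method names, and the partition labels are hypothetical, chosen only to illustrate the ownership check.

```python
class Hypervisor:
    """Toy model: owns every real page and records which partition may use it."""

    def __init__(self, total_real_pages):
        self.owner = {rpn: None for rpn in range(total_real_pages)}
        self.mappings = {}  # (partition, virtual page) -> real page

    def assign(self, partition, real_pages):
        for rpn in real_pages:
            if self.owner[rpn] is not None:
                raise PermissionError(
                    f"real page {rpn} already owned by {self.owner[rpn]}"
                )
            self.owner[rpn] = partition

    def map_page(self, partition, vpn, rpn):
        # The partition asks; only the hypervisor decides.
        if self.owner.get(rpn) != partition:
            raise PermissionError(f"{partition} may not map real page {rpn}")
        self.mappings[(partition, vpn)] = rpn


hv = Hypervisor(total_real_pages=8)
hv.assign("partition-A", range(0, 4))
hv.assign("partition-B", range(4, 8))

hv.map_page("partition-A", vpn=0x100, rpn=2)      # allowed: A owns page 2
try:
    hv.map_page("partition-A", vpn=0x101, rpn=5)  # page 5 belongs to B
except PermissionError as exc:
    print(exc)
```

Because a partition never touches the ownership table itself, it has no way to name, let alone change, another partition’s memory.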

Because of this, the partition does what the partition does, whether it is sharing the physical memory with other partitions or whether it is executing alone on a system, say as part of a distributed cluster. Either way its data is isolated. It is secure either because of a trusted hypervisor managing the one SMP’s physical memory or because – as in distributed memory clusters – the memory really is separate and isolated.

Notice that The Machine hardware has aspects of both of these. Given one OS instance – a partition – per node, the DRAM really is physically isolated. It is the fabric memory which is physically shareable. But because The Machine can carve up the fabric memory on a nodal basis – by mapping each node’s real address space uniquely over segments of fabric memory – even fabric memory can be perceived as isolated at this low level. But, multiple partitions could share the same node, in which case the isolation would be provided by such virtual-to-real address mapping.

Second: A more recent form of virtualization is that provided by containers. I’ll assume here the form of containers which share a single OS. Again, aside from the programming development advantages, much of the intent of containers is to provide an environment which the program(s) can perceive as isolated from any other container, even if they do share the same OS. They do this with the starting knowledge that ordinarily each process has its own private process-local address space (an EA space). As long as the OS does its part in assuring that each process maps only to private portions of an OS’ VA space (and that mapping to unique RA pages), there can be no sharing between the processes. Sets of processes – in many cases programs – reside in containers. If the processes residing in one container are never allowed to share data with the processes of another container, the containers too remain isolated from each other.

But it is not quite that simple, and this is good. Because containers share the same OS – and so potentially parts of the operating system’s total name space – there is the potential for – indeed, the desire for – sharing. Read-only sharing of the common operating system programs and other files is often good precisely because they are common and sharable. Unlike the previous form of virtualization where each partition is absolutely isolated and each partition has a complete OS image, sharing where desirable and well managed is possible. The common data and programs reside only once each in physical memory and so potentially at a single real address. Notice also that containers may have portions of their name space that are common between them, but it is also true that other portions of each container’s name space will be different. If a container – and so all of its processes – has no name for something, those processes will also not be provided an address (of any form) to that something.

So, for The Machine, let’s assume a single OS image residing there within fabric memory. This OS is somehow capable of distributed management of all of the processors and node-private volatile memory throughout the system. For performance reasons, let’s say that read-only copies or even node-private versions of portions of this OS are copied into each node’s DRAM. Additionally, though, this OS decides where, amongst The Machine’s many nodes, each container is to reside, providing the container’s processes resources from there. As before, a single container per node is guaranteed isolation of any data in its own local DRAM, because there is no physical means by which another container (on a different node) could access another node’s DRAM. Potential undesirable sharing, though, is possible for containers sharing the same node or in fabric memory. Because of The Machine’s real-to-physical mapping of fabric memory, containers on different nodes could only have shared data in fabric memory if the trusted global “hypervisor” (within the OS) had allowed such shared access; if something in fabric memory was private to a container, no other node would have been provided a real address mapping over that private data.

We could go on for quite a while, but we are sure you are seeing the essence of this. Addressing, appropriately controlled at the right level, can provide both the desired isolation and secure sharing.

Let’s look at another case in point, memory-mapped files, a concept used to speed access of the file’s contents. (We will also be looking in the next section at non-volatile heaps, a rather new concept.)

In typical file I/O, a file first resides in persistent storage, say out on disk. A program could, and does often, open a file, access on disk those portions of that file needed by the program, and in doing so map those accessed bytes into the process’ EA space, the OS’ VA space, and – after reading the file from disk – into the DRAM’s RA space. This is done repeatedly, with such repeated remapping taking time.

An often used performance enhancement on that notion, memory-mapped files, maps all or portions of the file into a process’ address space, whether accessed or not, whether in physical DRAM or not. Each byte of the file is known by a unique EA. As each EA is used, the accessed portion of the file is read off of disk and into the DRAM, or, if already address-mapped into DRAM, accessed directly from the DRAM itself. As long as the file remains open in the process, that unique mapping can continue. The file’s bytes can be repeatedly accessed using the same EAs whether those bytes are at that time in memory or not. If not, the disk space where the needed file bytes reside is accessed and brought into memory, under the corresponding VAs, remaining still under the relatively persistent EAs.
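For readers who have not used memory-mapped files, here is a minimal sketch using Python’s standard mmap module on a conventional system. The file contents and variable names are made up for the example; the point is only that, once the mapping exists, the program reaches the file’s bytes by indexing into memory rather than by issuing further read calls.

```python
import mmap
import os
import tempfile

# Create a small file to stand in for data "out on disk".
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello, fabric memory")
    path = f.name

try:
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            first = view[:5]    # bytes fetched on demand through the mapping
            again = view[7:13]  # same mapping, no further read() calls
finally:
    os.remove(path)

print(first, again)  # b'hello' b'fabric'
```

On today’s systems the operating system still pages those bytes from storage into DRAM behind the scenes; the mapping just hides that traffic from the program.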

That is the situation today. What of The Machine in the future? You likely know where we are leading with this. In today’s systems, the memory that we are talking about is the DRAM; the file was read into DRAM because the processor can only access the DRAM (and EAs are translated to DRAM-based RAs). In The Machine, that same file can be directly accessed from a processor without additional copying. All that is required is that the process doing the accessing have the right to do so. With that right, a memory-mapped file is enabled for access merely by setting up the addressing to it. Done and done.

But why stop there? In IBM i, Big Blue’s proprietary operating system for Power-based machines, which has a single-level storage architecture dating back to the late 1970s, every byte of the persistently held data on disk is known by a persistent virtual address. It does not matter whether the OS is active (powered on) or not; each byte on disk has its own virtual address; this is the same virtual address which is used to access that byte when it happens to also reside in DRAM and gets accessed by a processor. Although every byte on disk is known by its unique virtual address, you know that the processor can’t access that byte until it has been read into the DRAM. In this concept, the memory – the DRAM – is essentially a volatile cache of the persistently addressed data which resides on disk. The virtually addressed bytes that reside on disk in IBM i instead reside, in The Machine, in fabric memory. DRAM need not act as a cache – although it could – since the fabric memory can already be accessed by the processor, given the appropriate address mapping. How such a persistent and globally shared virtual address space is securely managed in the IBM i OS is a subject for another article; we will note, though, that it is done well via a form of capability addressing.