The Bits And Bytes Of The Machine’s Storage
January 25, 2016 Mark Funk
By now, as we have seen in other parts of this series, we have a pretty good sense for the basic topology of The Machine from Hewlett Packard Enterprise. In it are massive amounts of fabric memory, and any node and application can access it, no matter where they are executing. In there somewhere, though, is your file, your object, your database table. You know it’s yours and only you have the right to access it. So, what in The Machine ensures that only you get to access it, while still allowing you efficient access to it?
Folks designing The Machine speak absolutely correctly about the need to have integrated security and integrity throughout The Machine’s design. So let’s start by looking at one very low level aspect of that integrated security.
As you have seen, every processor on any node can access the contents of any byte of fabric memory, no matter the node on which it resides. You have also seen that only local processors can access a node’s DRAM. You might also know that in more standard symmetric multi-processor systems, or SMPs, every byte of the physical memory is known by a unique real address. A program’s instructions executing on a processor generate such real addresses, using that real address to uniquely request the data found there, and then work with that data.
Knowing that the DRAM is only locally accessible, you might picture both the volatile memory and the fabric memory as being mapped under a single real address space, as in the following figure:
In such a mapping, any processor could generate a real address into local DRAM and that address would only access the memory from its own node, and no other. The hardware guarantees that merely from the notion of private/local memory. However, with fabric memory being global, the whole of the fabric memory would be spread out across the remainder of the real address space, allowing every byte found there to be accessed with a unique real address, no matter the processor using that real address.
Yes, that would work, but that is not the mental picture to have for The Machine. Indeed, suppose it were, and any processor could have concurrent access to hundreds of petabytes of persistent memory. Just to keep it simple, let’s say The Machine’s fabric memory size someday becomes 256 pebibytes; that is 256 × 1024^5, or 2^8 × 2^50 = 2^58 bytes, requiring at least 58 bits to span this real address space. If a processor were to need to concurrently access all of this memory, the processor would need to be capable of supporting this full 58-bit address. For comparison, the Intel Xeon Phi supports a 40-bit physical address in 64-bit mode. It’s not that it can’t be done, but that is quite a jump. And from Keith Packard’s presentation we find that The Machine did not make that jump:
“In our current implementation, we are using an ARM-64 core; a multi-core processor. It has 48 bits of virtual addressing, but it has only 44 bits of physical addressing. . . . Out of that, we get 53 bits (real address), and that can address 8 petabytes of memory. . . and we translate those into actual memory fabric addresses, which are 75 bits for an address space potential of 32 zettabytes.”
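The arithmetic behind these figures is easy to check: n address bits span 2^n bytes, and a capacity of C bytes needs ceil(log2(C)) bits. Here is a minimal Python sketch of both directions; the helper names `address_bits_needed` and `span` are ours, purely for illustration, and sizes are reported in binary units (TiB, PiB, ZiB).

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]
PIB = 2 ** 50  # one pebibyte, in bytes


def address_bits_needed(capacity_bytes: int) -> int:
    """Smallest number of address bits that can name every byte."""
    if capacity_bytes < 1:
        raise ValueError("capacity must be at least one byte")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1
    return (capacity_bytes - 1).bit_length()


def span(bits: int) -> str:
    """Human-readable size of the space reachable with `bits` address bits."""
    size = 2 ** bits
    unit = 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size} {UNITS[unit]}"


print(address_bits_needed(256 * PIB))  # 58

# Bit widths taken from the quote above.
for label, bits in [("virtual", 48), ("physical", 44), ("real", 53), ("fabric", 75)]:
    print(f"{label:>8}: {bits} bits -> {span(bits)}")
```

The last two lines of output read 8 PiB for the 53-bit real address and 32 ZiB for the 75-bit fabric address, matching the quote when "petabytes" and "zettabytes" are taken as powers of two.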
Still, if such a huge global real address space were supported, any processor able to generate such a real address (which means any thread in any process in any OS) would have full access to the whole of The Machine’s memory at any moment. If this were the way that all programs actually accessed memory, system security and integrity would have a real problem. There are known ways of avoiding this, used on most systems and on The Machine as well (as we’ll see in a subsequent article in this series); done right, today’s systems tend to be quite secure. Even so, The Machine takes addressing-based security and integrity a step further, even at this low level of real addresses, as we will see next.
In the real addressing model used by The Machine, rather than real address space being global (as in the above figure), picture instead a real address space per node, or more to the point, one scoped only to the processors of each node. Said differently, the processors of any given node have their own private real address space. Part of that is, of course, used to access the node-local DRAM. But now also picture regions of each node’s real address space as being mapped securely by the hardware onto windows of various sizes into fabric memory, any part of fabric memory. The processors of each node could potentially generate arbitrary real addresses within their own real address space, but it is only those real-address regions securely mapped by the hardware onto physical memory that can actually be accessed. No mapping, no access. Even though the node’s real address space may be smaller than the whole of fabric memory, those portions of fabric memory needed for concurrent access are nonetheless accessible.
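To make "no mapping, no access" concrete, here is a minimal Python sketch of a per-node window table. Everything in it is an assumption made for illustration: the class and method names (`NodeMap`, `map_window`, `translate`) are hypothetical, the fabric address is treated as a flat integer, and in The Machine this table lives in hardware managed by trusted code, not in a Python list.

```python
from dataclasses import dataclass


class UnmappedAccess(Exception):
    """Raised when a real address falls outside every mapped window."""


@dataclass(frozen=True)
class Window:
    real_base: int    # first real address of the window (node-local view)
    size: int         # window length in bytes
    fabric_base: int  # where the window lands in fabric memory


class NodeMap:
    """Hypothetical per-node mapping table (hardware-held in the real system)."""

    def __init__(self):
        self._windows = []

    def map_window(self, real_base, size, fabric_base):
        # In the real system only trusted code could install a mapping.
        for w in self._windows:
            if real_base < w.real_base + w.size and w.real_base < real_base + size:
                raise ValueError("window overlaps an existing mapping")
        self._windows.append(Window(real_base, size, fabric_base))

    def translate(self, real_addr):
        for w in self._windows:
            if w.real_base <= real_addr < w.real_base + w.size:
                return w.fabric_base + (real_addr - w.real_base)
        raise UnmappedAccess(hex(real_addr))  # no mapping, no access


node_a = NodeMap()
# The fabric base deliberately lies beyond what a 53-bit real address could reach.
node_a.map_window(real_base=0x1000_0000, size=0x1000, fabric_base=5 << 60)
print(hex(node_a.translate(0x1000_0010)))
```

Any real address outside the single 4 KiB window, such as `0x2000_0000`, raises `UnmappedAccess`, even though the processor is perfectly able to generate it.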
For example, a file manager on your node wants access to some file residing in fabric memory, perhaps a single file spread out amongst a set of different regions on a number of different nodes. Your OS requests the right to access all of that file. Portions of the file are each known to reside physically on a particular set of nodes and, within those nodes, at particular regions within them. That fact alone does nothing for your program or the processor accessing it; the file is at well-defined locations in physical memory, but the processor proper can only generate real addresses. Said differently, the program and the processor could generate real addresses with the intent of accessing fabric memory, but such a real address is not a physical tuple like Node ID::Media Controller ID::DIMM::Offset, which is where the file’s bytes really reside.
To actually allow the access, your node’s hardware must be capable of translating the processor’s real address into such a physical representation. That real-to-physical region mapping is held and set securely in the hardware; your program knows nothing about it, only trusted code and the hardware do. Your processor generates the real address of the file as your OS perceives it and the hardware supporting the fabric memory translates that real address to the actual location(s) of your file.
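As a rough illustration of that translation, the Python sketch below maps one file, contiguous in the node’s real address space, onto extents scattered across two nodes. The tuple fields follow the Node ID::Media Controller ID::DIMM::Offset form used above; the names `PhysicalLocation`, `Extent` and `locate`, and all the addresses, are invented for the example and say nothing about the actual hardware format.

```python
from typing import NamedTuple


class PhysicalLocation(NamedTuple):
    node: int
    media_controller: int
    dimm: int
    offset: int

    def __str__(self):
        return f"{self.node}::{self.media_controller}::{self.dimm}::{self.offset:#x}"


class Extent(NamedTuple):
    real_base: int            # start of this piece as the node's processors see it
    size: int                 # length of the piece in bytes
    start: PhysicalLocation   # where the piece physically begins


def locate(extents, real_addr):
    """Return the physical tuple behind real_addr, or None if it is unmapped."""
    for ext in extents:
        delta = real_addr - ext.real_base
        if 0 <= delta < ext.size:
            return ext.start._replace(offset=ext.start.offset + delta)
    return None


# One file: adjacent in real address space, on different nodes physically.
file_extents = [
    Extent(0x4000_0000, 0x10_0000, PhysicalLocation(2, 0, 3, 0x0)),
    Extent(0x4010_0000, 0x10_0000, PhysicalLocation(7, 1, 0, 0x8000_0000)),
]

print(locate(file_extents, 0x4000_0040))  # 2::0::3::0x40
print(locate(file_extents, 0x4010_0000))  # 7::1::0::0x80000000
```

The program only ever sees the real addresses on the left; crossing from the first extent into the second changes the node, media controller and DIMM behind the scenes without the program noticing.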
Of course, persistent or not, the fabric memory is also just memory; slower, yes, but there is a lot of it. Redrawing the previous figure slightly more abstractly as below (it’s still the same four nodes), we can see that if your program needs more memory, it can ask for more; upon doing so, your program is provided a real address – a real address as understood by your node’s processors – and that real address has been mapped onto some physical region of fabric memory (which can include the local node’s fabric memory).
Additionally, the fabric memory may enable data persistence, but if it is only more “memory” that your program needs, your program need not manage it as persistent. As we saw earlier, blocks in fabric memory are cacheable, each block tagged in the cache using its real address. When such a block is flushed from the cache, its real address is provided to this mapping hardware, which in turn identifies where in fabric memory the flushed block is to be stored. If your object does not actually need to be persistent, then rather than explicitly forcing cached blocks out to memory, you can just allow such blocks to be written back to fabric memory sooner or later. Your program need not know when or even if; as with DRAM, the changed data can sooner or later make its way back to memory.
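The difference between "sooner or later" and an explicit flush can be sketched with a toy write-back cache in Python. This is only a model under stated assumptions: the `WriteBackCache` class, its two-line capacity, the oldest-first eviction policy and the `translate` callback standing in for the mapping hardware are all hypothetical, chosen to keep the example small.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy cache: blocks tagged by real address, written back on eviction or flush."""

    def __init__(self, backing, translate, capacity=2):
        self.backing = backing      # dict standing in for fabric memory
        self.translate = translate  # real address -> fabric address
        self.capacity = capacity
        self.lines = OrderedDict()  # real_addr -> (value, dirty)

    def _write_back(self, real_addr, value, dirty):
        # Only changed blocks need to travel back to fabric memory.
        if dirty:
            self.backing[self.translate(real_addr)] = value

    def store(self, real_addr, value):
        self.lines[real_addr] = (value, True)
        self.lines.move_to_end(real_addr)
        while len(self.lines) > self.capacity:
            victim, (val, dirty) = self.lines.popitem(last=False)
            self._write_back(victim, val, dirty)

    def flush(self, real_addr):
        """Explicitly force one block out, as persistence-aware code would."""
        if real_addr in self.lines:
            val, dirty = self.lines.pop(real_addr)
            self._write_back(real_addr, val, dirty)


fabric = {}
cache = WriteBackCache(fabric, translate=lambda ra: ra + 0x1000, capacity=2)
cache.store(0x10, "a")
cache.store(0x20, "b")
print(fabric)           # {}
cache.store(0x30, "c")  # evicts the oldest block, 0x10
print(fabric)           # {4112: 'a'}
cache.flush(0x30)       # explicit flush: "c" reaches fabric memory now
```

A program that only wants more memory never calls `flush`; its data drifts out through eviction. A program that needs the data to be in fabric memory at a known point calls it explicitly.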
Interestingly, as a separate point, even though every node shown here does have a processor (and its own DRAM), some nodes may be contributing only fabric memory to the system. While only the persistent memory of such nodes is being used, their processors and DRAM could conceivably be shut down, saving the cost of keeping them powered.
The Shared Versus Persistent Data Attributes Of Fabric Memory
As implied in the previous section, the topology of The Machine introduces an interesting side effect, perhaps even an anomaly, showing the two sides of fabric memory. The volatile DRAM is accessible only by the processors residing on the same node, so any sharing of it is possible only among the processors on that node. That is as far as that sharing goes. So if processor-based sharing is to occur amongst any of The Machine’s processors and OSes, it’s the non-volatile fabric memory that is being used for that purpose, not the volatile DRAM. Interestingly, for much of that sharing, the data being shared need not also be persistent.
See the point? The Machine’s non-volatile fabric memory is being used for essentially two separate reasons:
- For active data sharing, for data that does not need to be maintained as persistent, and
- Separately, as memory which truly is persistent, and – interestingly – is likely being shared as a result.
I did not actually say that the inter-node shared data does not need to be in fabric memory. It does. Inter-node sharing cannot count on cache coherence. In order for another node to see the data being shared, that shared data must be written out to fabric memory, and any copy of it must be invalidated from the caches of the nodes that want to see the changes.
Said differently, suppose processors of two nodes, Node A and B, want to share data. A processor on Node A has made the most recent change to the shared data, with the change residing still in the cache of that processor. If cache coherence spanned multiple nodes, a processor on Node B would be capable of seeing that changed data, even if still in Node A’s processor’s cache. But cache coherence does not span nodal boundaries. So if Node A’s processor wants to make the change visible to the processors on Node B, Node A’s processor must flush the changed cache lines back out to fabric memory. Additionally, in order for a processor on Node B to see this same data, that data block cannot then reside in a Node B processor’s cache; if it does, that block (unchanged) must also be flushed from that cache to allow a Node B processor to see the change. Seems complex, and it is important to enable inter-node sharing of changing data, but The Machine provides APIs to enable such sharing.
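The Node A and Node B sequence can be walked through with a small Python model. The article does not name The Machine’s sharing APIs, so the `NodeCache` class and its `load`, `store`, `flush` and `invalidate` methods below are hypothetical stand-ins; a plain dict plays the part of fabric memory, and each node’s cache is private, with no coherence between them.

```python
class NodeCache:
    """Toy per-node cache in front of shared fabric memory (no cross-node coherence)."""

    def __init__(self, fabric):
        self.fabric = fabric
        self.lines = {}     # addr -> cached value, possibly stale
        self.dirty = set()  # addrs changed locally and not yet written back

    def load(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.fabric.get(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self, addr):
        """Write a changed line back to fabric memory and drop it from the cache."""
        if addr in self.dirty:
            self.fabric[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

    def invalidate(self, addr):
        """Drop an unchanged line so the next load goes to fabric memory."""
        if addr not in self.dirty:
            self.lines.pop(addr, None)


fabric = {0x100: "old"}
node_a, node_b = NodeCache(fabric), NodeCache(fabric)

node_b.load(0x100)          # B caches "old"
node_a.store(0x100, "new")  # the change sits only in A's cache
print(node_b.load(0x100))   # old
node_a.flush(0x100)         # A pushes the change out to fabric memory
print(node_b.load(0x100))   # old (B still holds its stale copy)
node_b.invalidate(0x100)    # B drops the stale line
print(node_b.load(0x100))   # new
```

Both steps are needed: A’s flush alone leaves B reading its stale cached copy, and B’s invalidate alone would only re-read the old value from fabric memory.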
So, yes, we did need to make the shared data reside in fabric memory in order to allow it to be seen by another node, but we did not actually need it to be persistent. That item of data is in persistent fabric memory in order for it to be available to be seen by Node B and others, but actual persistence across failure is a bit more subtle than that. Once the changed cache lines have been successfully flushed, fabric memory holds the changed/shared data, and it will still be there after power cycling. But does enough information exist for the restarted system to find the changed data? It’s that which we’ll try to explain in the next section.