Striking Practical Computational Balance
February 26, 2018 Dr. James Cuff
The phenomenal complexity of computing is not decreasing. Charts of growth, investment and scale continue to follow an exponential curve.
But how is computational balance to be maintained with any level of objectivity under such extreme circumstances? How do we plan for this known, and yet highly unknown challenge of building balanced systems to operate at scale? The ever more bewildering set of options (e.g. price lists now have APIs) may, if not managed with utmost care, result in chaos and confusion.
This first in a series of articles will set some background and perspective on the critical importance of practical computational balance, which may itself sound like a straightforward topic for those managing any part of a complex pipeline in research HPC, hyperscale datacenters, or large-scale enterprises. Viewed as a set of individual challenges, however, achieving balance (not to mention defining it in this context) is deeply nuanced and worth describing as a starter.
So, with all of this in mind, let’s look at a recent historical example of this balance from life sciences and follow it through the present to see balance (and a lack thereof) in practical terms. Let’s start with one of the most challenging data analysis problems of the last several decades. As we know, the human genome is complex and, accordingly, the systems needed to annotate it were just as complex. Almost 20 years ago at the Sanger Institute and EBI, the computer systems needed to annotate the human genome were deployed. They were built out of an inordinately sophisticated set of fragile “component parts”. Large DRAM (192GB, and also physically large, then the size of three fridge freezers), GS320 SMP systems, loosely coupled 64-bit Alpha CPUs, ATM networks, proprietary HSG80 Fibre Channel storage systems with Memory Channel II interconnects providing highly available and clustered AdvFS storage, and a huge array of fickle RS232 serial concentrators were all deployed in concert. This sophisticated hardware was then coupled with completely untested, brand new ab-initio algorithms and novel software pipelines.
For all of this complexity, systematic component stress and fragility, balance eventually ensued. The human genome was successfully annotated, and it was not easy. Clustered, HA file systems ceased to be HA and crashed spectacularly. In-flight data moving from cluster nodes to databases was corrupted through mysterious and random “bit rot” introduced through complexity between the layers. Software counters were incorrectly set, and their fantastical off-by-ones resulted in phenomenally incorrect gene assignments. Add to this network devices that were taken offline, overwhelmed with traffic, and relational database tables that were accidentally initialized midway through critical steps of the analysis. Further, core pipeline software needed to be patched and poked at runtime, fresh bugs appeared each day from every conceivable corner of the system, and local data caches of software and novel data distribution algorithms needed to be implemented to moderate the load on the network and central file store. Adding further to the chaos, batch scheduling systems also simply tipped over, ceasing all pipeline and computing operations, under both extreme load and somewhat “novel” job granularity design and settings.
Does this start to allude to mounting complexity?
Consider as well that websites were designed and attached to live production databases, mid-flight, making data and results instantly and immediately accessible throughout the globe. One particular standalone system served as an open access relational database server, in a “DMZ” (before that term became an actual thing) where anyone could run their own queries against freshly computed results. That system crashed a lot. It was OK though: the data needed to be released “warts and all”, because the Bermuda Principles were a critical and core value of the public project. Nothing did what it was supposed to do at first. There were moments of extreme stress and confusion, coupled with lovely moments of great hilarity as each of the teams struggled together through their momentous tasks. It was moderately controlled computational chaos.
In short, there wasn’t an awful lot of balance.
Today, the paper describing the resulting computing solutions appears rather ad hoc. At that time, the project was under global pressure to deliver on time, under a landscape of competition between the public and private efforts. These efforts would define the very future framework of either unencumbered and freely available data or closed, patented and protected genetic information. It was a big deal, and there wasn’t enough time to delicately place a “finger on the scale” to ruminate deeply and achieve the most optimal computational balance. The systems and software clearly co-evolved rapidly and dramatically during the process. Overall, they seem somewhat discombobulated compared to modern, beautifully articulated and designed reference architectures.
History and experience have provided hard lessons in how to achieve balance. Many computational projects today mirror this identical enthusiasm and critically time sensitive race towards a goal. Maintaining computational balance remains both vital and challenging to achieve.
So what about today?
The physical computing used to annotate the human genome now exists only as a set of historical quirks. However, the resulting human genome product is now the critical platform that underpins and relates to everything new we continue to do in life sciences. Time continues to march on and new “post-genomic” technologies (e.g. CRISPR) and huge longitudinal public health cohorts are now considered normal.
Genomic data sets are huge, sensitive and complex. GDPR, as one example, is the largest and most wide ranging personal data regulation and control act in over twenty years, and it is now sweeping across systems and management teams in Europe. Highly secure and carefully managed compute throughout the entire stack is going to be needed simply to keep up with this newly introduced scale and required regulation. Platforms to sequence genomes are shrinking rapidly, and are able to produce personally identifiable data and results both faster and cheaper. For example, it is now in some cases becoming more cost effective to re-sequence than to build storage systems to store the original genetic sequence data. Gaining insight into complex data is difficult, and there are only a handful of algorithms available to computationally interrogate data at such a huge scale.
Machine learning and “AI techniques” have been proven to increase in sensitivity and specificity with larger, more detailed, but accurately presented, feature rich data sets. Machine learning techniques are being deployed to an ever increasing number of data sets and analytics tasks. Although the “black box” nature of these predictive algorithms receives continuous academic critique, the methods continue to deliver results that are “good enough”. Arguments have even been made against interpretable machine learning algorithms by stating that humans can’t explain their own actions either. Accordingly, these methods have become so ubiquitous that devices are now designed such that floating point calculations can run at reduced accuracy, but with significantly higher performance (see half-precision).
Training finite state systems through “experiences” to detect similar and dissimilar motifs is a subtle task: it requires that the algorithms be supplied with a steady stream of datasets containing sufficient, but appropriate, variety. Moving these huge data sets between devices is now a key underlying issue with scale in this field. It is this inherent movement, packing and unpacking of data that causes back pressure in the analytics systems. Databases to “relate” complex data objects are now needed at scale. Single instance databases are rare to find in modern scale-out computing architectures. The data has to be both rich in metadata features and, due to volume, highly distributed.
At the other end of the spectrum, data capture and generation devices are also in-silico systems. Low digit nanometer silicon fabrication yields significantly more dense “detectors”, while also providing more real estate for processor designers to add more transistors and features. With each fabrication die shrink, CCDs, sensors and MOS devices can therefore generate much more data per square millimeter of detection area. Performance improvements and advances in fabrication and chip design therefore directly affect the subsequent data output capabilities of detectors. Simply consider commercially available “megapixel” cameras and the subsequent amounts of flash card storage as two practical and visible examples of this symbiosis. Increases in image quality and detection capabilities directly drive increases in the capacity and capabilities of the storage.
The world is mobile. Advances in LTE and in high speed, reliable mobile and domestic network connectivity have driven a huge surge in “always on” devices. Computing data sets are now collected and generated at the edge, and at scale. Be it a laboratory microscope or a highly distributed voice recognition service, computing and data are big and distributed. Low power, extremely low cost connected devices (“things”) now form a complex mesh known as the “internet of things”. Data locality challenges are now driving serious conversations and decisions about where best to locate compute and storage. Advanced, high-speed commercial and academic wide area global networks are alleviating some of the pain of data movement, but data locality remains an unsolved problem.
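To make the half-precision trade-off concrete, here is a minimal sketch, not taken from the original article, that rounds ordinary Python floats to IEEE 754 half precision (16 bits) and back using only the standard library. The helper names `to_half` and `relative_error` are illustrative assumptions, and the sketch shows only the loss of accuracy, not the performance gain that dedicated hardware provides.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and back."""
    # The 'e' format code stores a float in 16 bits (binary16).
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x: float) -> float:
    """Relative error introduced by storing a non-zero x in half precision."""
    return abs(to_half(x) - x) / abs(x)


if __name__ == "__main__":
    # Half precision keeps an 11-bit significand, so 2049 cannot be
    # represented exactly and is rounded to a neighbouring value.
    for value in (1.0, 0.1, 3.14159265, 2049.0):
        print(f"{value!r:>12} -> {to_half(value)!r:<22} "
              f"relative error {relative_error(value):.2e}")
```

Values that need more than 11 significant bits lose their low-order digits, which is the accuracy being traded away; whether that is acceptable depends on how tolerant the algorithm is of such rounding.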
Complex server and operating system images are now containerized and aim to be “serverless” to reduce challenges with the provenance of analysis software and underlying core systems libraries. Complex analytical software is now devised and consumed through notebooks to further abstract the difficulty of the human/computer interaction. Software pipelines to schedule contained system images are now initiated and automatically updated directly from version control systems and “image depots”. Even batch scheduling and systems management at scale are now, in and of themselves, a difficult computing and management challenge. This is due to the vastly greater node counts that are now readily and publicly accessible. Simply locating two million individual workloads resting in a batch queue and then mapping them efficiently to tens of thousands of instant-on system images is not a trivial task. There is massive, potentially polynomial computation required simply to initiate and schedule the computation if it is not managed and designed with utmost care, and this delicate operation has to happen before any compute or storage interactions take place.
Internet based application programming interfaces have driven a new way to consume and interact with data and computing. Sophisticated controlled vocabularies and structured data interfaces such as XML and JSON allow serialization of data and metadata when interacting with local and remote API systems. However, parsing and deserializing such data structures rapidly at petabyte scales yet again becomes computationally nontrivial. The very structures developed to manage increasing data sophistication have become in and of themselves a critical issue and key decision point when designing high performance, bulk data transfers. This matters especially for wide area data transport and replication, where the sheer size of the data makes any degree of coherence and consistency hard to achieve.
Fewer and fewer humans now write bare metal inline assembly or native C, nor should they have to. High-level languages have proven critical to the success of a number of large scale scientific projects, and the human genome was no different. In that specific case, it was Perl (a huge amount of Perl, to be exact). In this world, abstraction and “software layering” is completely acceptable and absolutely required. However, computational balance must be maintained between the high level languages and underlying systems. There are now systems on top of systems, and software on top of software, on top of the systems on top of systems and, as one might imagine, abstracting system complexity drives advancement and further sophistication of methods. But there is always overhead in abstraction.
Vocational training for experts to work with and extend complex development and operational code for massively distributed, high node count systems has resulted in new job titles and roles. “DevOps” and “Site Reliability Engineer”, to name but two, aim to provide balance between discrete release, build and operating ethos. Development teams need to release changes, operations teams need to manage change. Balance is needed. Software, data and now HPC carpenters are being trained to manage scale and complexity. Software sustainability is an example of a new practice, and Research Software Engineering is another new and critically important role to balance the activities of scientists, researchers and software systems. This is a new type of human balance.
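As a rough illustration of why the mapping step needs care, the sketch below places jobs on nodes with a greedy "longest job first, least-loaded node next" heuristic backed by a min-heap. It is an assumption-laden toy, not how any particular batch scheduler works: the function names `assign_jobs` and `makespan` are hypothetical, runtimes are assumed to be known in advance, and priorities, dependencies, fair share and data locality are all ignored. The point is only that a heap keeps the cost near O(n log n) for n jobs, where a naive scan of every node for every job would cost O(n × m) for m nodes.

```python
import heapq
from typing import List, Sequence


def assign_jobs(runtimes: Sequence[float], node_count: int) -> List[List[int]]:
    """Greedily place each job on the currently least-loaded node.

    Returns, for each node, the list of job indices assigned to it.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    # Heap entries are (accumulated load, node id); the smallest load pops first.
    heap = [(0.0, node) for node in range(node_count)]
    heapq.heapify(heap)
    placement: List[List[int]] = [[] for _ in range(node_count)]
    # Longest jobs first, so that short jobs can fill the remaining gaps.
    order = sorted(range(len(runtimes)), key=lambda j: runtimes[j], reverse=True)
    for job in order:
        load, node = heapq.heappop(heap)
        placement[node].append(job)
        heapq.heappush(heap, (load + runtimes[job], node))
    return placement


def makespan(runtimes: Sequence[float], placement: List[List[int]]) -> float:
    """Total runtime of the busiest node under the given placement."""
    return max(sum(runtimes[j] for j in jobs) for jobs in placement)


if __name__ == "__main__":
    runtimes = [5, 3, 3, 2, 2, 1]  # hypothetical job runtimes, arbitrary units
    plan = assign_jobs(runtimes, node_count=2)
    print(plan, makespan(runtimes, plan))
```

Even this simplified version has to sort and touch every job before any real work starts, which is the scheduling overhead the paragraph above describes.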
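The cost of self-describing formats can be sketched by encoding the same records twice: once as JSON and once as a fixed-width binary layout. Everything specific here is an assumption made for illustration, including the record fields (`sample_id`, `position`, `score`), the 20-byte layout and the function names; the sketch compares encoded size only and makes no claim about parsing speed on any real system.

```python
import json
import struct
from typing import List, Tuple

# Hypothetical record: (sample_id, position, score).
Record = Tuple[int, float, float]

# Little-endian uint32 plus two float64 values: 20 bytes, no padding.
_RECORD = struct.Struct("<Idd")


def encode_json(records: List[Record]) -> bytes:
    """Self-describing encoding: field names are repeated in every record."""
    rows = [{"sample_id": s, "position": p, "score": q} for s, p, q in records]
    return json.dumps(rows).encode("utf-8")


def decode_json(payload: bytes) -> List[Record]:
    return [(r["sample_id"], r["position"], r["score"]) for r in json.loads(payload)]


def encode_binary(records: List[Record]) -> bytes:
    """Fixed-width encoding: the schema lives in the code, not in the data."""
    return b"".join(_RECORD.pack(*record) for record in records)


def decode_binary(payload: bytes) -> List[Record]:
    return list(_RECORD.iter_unpack(payload))


if __name__ == "__main__":
    records = [(i, i * 0.5, 1.0 / (i + 1)) for i in range(1000)]
    as_json, as_binary = encode_json(records), encode_binary(records)
    print(f"JSON: {len(as_json)} bytes, binary: {len(as_binary)} bytes")
```

The binary form is smaller because it drops the repeated field names and text digits, but it also drops the self-description that makes JSON and XML convenient at API boundaries, which is the design tension the paragraph above points to.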
The final delicate balancing act…
The tight and careful coupling of infrastructure, software and people has always been proven to be the only viable route to success at scale. Loose coupling of components results in failure, which is something the hyperscalers now know.
The massive cloud and social datacenter requirements have bent our very definition of scale in the last several years. Open computing “for purpose” systems were needed and designed, as were wide area software defined networks. Local area, “top of rack” interconnects needed to be rebalanced. Modern “mapreduce” and “data sharding” algorithms were developed to achieve distributed scale.
However, for all of these improvements, one single overarching concept remains. Advanced domain-specific pipelines, which appear in all fields of data analysis, still need to sit on top of advanced computing infrastructure.
The incessant desire for ever more performance-driven, larger capacity pipelines remains a constant. The individual component parts may well become a little faster and now computing can also be offloaded to high speed, bespoke devices such as GPGPU and FPGA. This certainly helps bend the performance curve in the right direction. However, worryingly because of massive underlying algorithm complexity, the software pipelines have to become increasingly abstract, and in turn, significantly slower. In short, even the offload engines themselves are introducing new bottlenecks to eliminate.
What we are learning is this: Traditional monolithic computing paradigms don’t scale. And this is not just a hardware problem either.
Per-node, node-locked and geolocation-locked software licensing schemes also do not scale. Even floating license schemes can rapidly overwhelm a central license server and cause it to fail instantly if not carefully managed.
Monolithic, single instance file systems and storage also obviously do not scale for the same reason. Storage today has to be distributed in order for it to scale. When the analysis pipeline starts to look like a self-inflicted internal distributed denial of service attack and causes operations to cease, there is a problem.
Most concerning is that software and underlying physical infrastructure, the two critical components for success, are by definition, and through no fault of their own, each heading in distinctly opposite directions with respect to performance. As the analysis hardware becomes faster, it enables development of more complex software. However, the software then simply becomes slower due to complexity. It is a vicious circle of matching requirements and demand. Distributing the workload increases performance, but has long been known to introduce new and fascinating challenges in coherence, consistency and scalability. The very systems used to distribute the new load now become the new bottlenecks (see the earlier statement on the critical issues concerning batch scheduling).
Points of computational balance are subtle and not at all obvious. Successful balance is always discovered in unexpected places, and how to achieve practical computational balance at scale will be discussed in detail in upcoming articles. There are millions upon millions of potential points of balance and, while not every possible balance point can ever be isolated or discussed at any length, we can at least begin to scratch the surface, perhaps deep enough to leave a permanent mark.
The next article in this series is entitled “Practical Computational Balance: Part Two – Balancing Data”. It will give examples of some practical methods where compute and data storage have managed to achieve a degree of balance even under the overwhelming pressure of confusing data size, complexity and scale.
Distinguished Technical Author, The Next Platform
James Cuff brings insight from the world of advanced computing following a twenty-year career in what he calls “practical supercomputing”. James initially supported the amazing teams who annotated multiple genomes at the Wellcome Trust Sanger Institute and the Broad Institute of Harvard and MIT.
Over the last decade, James built a research computing organization from scratch at Harvard. During his tenure, he designed and built a green datacenter, petascale parallel storage, low-latency networks and sophisticated, integrated computing platforms. However, more importantly he built and worked with phenomenal teams of people who supported our world’s most complex and advanced scientific research.
James was most recently the Assistant Dean and Distinguished Engineer for Research Computing at Harvard, and holds a degree in Chemistry from Manchester University and a doctorate in Molecular Biophysics with a focus on neural networks and protein structure prediction from Oxford University.