We talk about big money being spent on GPU-accelerated HPC and AI systems all the time here at The Next Platform, and we have been clear that we think another area where such acceleration will take off is with databases and related analytics, and particularly with data warehouses that have trillions of rows of data.
The US Air Force agrees, and is allocating up to $100 million in a contract to database maker Kinetica over five years to overhaul and consolidate the data warehouses that underpin the management of its assets – people, aircraft, and such. That is on par with the cost of an HPC system capable of hundreds of petaflops of double precision math, just to give you a sense of the economic scale.
To be precise, the United States Northern Command, or USNORTHCOM, and the North American Aerospace Defense Command (better known as NORAD) are both part of the US Department of Defense. The former is responsible for managing military assets in the United States, Canada, and Mexico and was created in the wake of the 9/11 terrorist attacks. The latter is a joint effort between the United States and Canada to protect the North American continent against aircraft and missile strikes. Both are headquartered at Peterson Air Force Base in Colorado Springs, and have an alternate command center in the Air Force Space Command’s nearby Cheyenne Mountain Complex, made famous in the movie WarGames as well as in the TV series Stargate SG-1.
The US Air Force and its partner, the Royal Canadian Air Force, have the same problems that a lot of enterprises have: They have a lot of moving things, some of them people, under their purview. But the situation is a bit different in that USNORTHCOM has to figure out where all of the military assets are and ascertain the nature of any anomaly or unidentified object. Sometimes, a blip streaming in from a radar station is a flock of geese, and sometimes it is a potential threat. All told, NORAD and US Northern Command have about 3 billion different signals streaming in from sensors of various kinds – radar, video, and so on. Various systems have been built over the years to track assets in new and unique ways, and these are often standalone systems. What this means is that, as the picture above shows, analysts working at NORAD and US Northern Command have to work across multiple screens with different data sets, and they have to drill down into the data from various sources to quickly figure out if something is a potential threat or not. This is probably mind-numbing, if vital, work.
Here is the other problem: The way the systems are currently architected – and the Air Force can’t tell us or they would have to kill us – the analysts have only a limited amount of live, streaming data at their fingertips to do such correlations. The data feeds are so big that the data has to be archived quickly.
In the 21st century, this is not the way to handle such data. In a modern world, all of these data sources would be consolidated into a single data warehouse that would stream the data in and use machine learning, other artificial intelligence techniques, and statistical analytics to automatically find correlations across a very large archive of historical data and then flag anomalies for the threat assessment analysts to consider further. As part of a broader “Pathfinder” digital modernization effort at the US Air Force, this is precisely the type of system that the Defense Innovation Unit, or DIU, prototyped over the past year and is rolling out into production.
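To make the stream-and-flag pattern concrete, here is a minimal sketch of the general idea. It is purely illustrative and assumes a simple rolling z-score on a single hypothetical feature (track speed); it is not Kinetica’s pipeline or the Air Force’s actual method.

```python
# Minimal sketch of the stream-and-flag pattern described above. The rolling
# z-score, the "track speed" feature, and the threshold are hypothetical
# stand-ins, not the actual machine learning pipeline.
from collections import deque
import statistics

history = deque(maxlen=100_000)   # rolling window standing in for the historical archive

def flag_anomaly(track_speed_knots: float, z_threshold: float = 4.0) -> bool:
    """Return True if a track's speed sits far outside recent historical norms."""
    is_anomaly = False
    if len(history) >= 100:       # wait for enough history before judging
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history) or 1.0
        is_anomaly = abs(track_speed_knots - mean) / stdev > z_threshold
    history.append(track_speed_knots)
    return is_anomaly

# Seed the window with a batch of ordinary observations, then stream new
# ones and surface only the outliers for an analyst to look at.
for speed in [450.0 + (i % 20) for i in range(500)]:
    flag_anomaly(speed)

for speed in (455.0, 462.0, 3_200.0):
    if flag_anomaly(speed):
        print(f"flag for analyst review: {speed} knots")
```

The point of the pattern, however it is actually implemented, is that the machine does the first pass over the full archive so the operator only sees the handful of tracks worth a second look.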
Unlike the Defense Advanced Research Projects Agency, the R&D arm of the US Department of Defense, and In-Q-Tel, the independent venture capital arm of the US Central Intelligence Agency, the DIU is not trying to create new technologies to solve problems a decade or so out. Rather, it keeps an eye on technologies from Silicon Valley and elsewhere that are already in production and can be deployed to solve current problems – and do so now.
Back in late 2019, as part of the Pathfinder program, DIU had a pitch week to help solve the streaming data warehouse problem for NORAD and US Northern Command, and 55 companies made pitches. Two companies were selected and given a year to build prototype systems to consolidate and integrate the various entity tracking and classification databases. We don’t know who the second finalist was, but we know that Kinetica, which has a long history of entity tracking work with several branches of the US government, won the deal. The prototype systems running the Kinetica Streaming Data Warehouse were a mix of CPU-only and GPU-accelerated instances on the GovCloud region set up by Amazon Web Services. Kinetica is best known as one of the originators of GPU-accelerated databases, but it has tweaked its code to run on the substantial vector engines now included with modern CPUs.
The magnitude of the problem is literally a military secret, but the raw sensors are stuffing billions of records per day into these databases, which works out to multiple trillions of rows per year – and the Air Force wants many, many years of data to which it can apply AI techniques for anomaly detection. And it wants all of this data to be live and queryable, rather than split between a small set of live data and a very large set of archived data.
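A quick back-of-the-envelope calculation shows why the live archive gets so big so fast. The ingest rate and retention window below are illustrative stand-ins, not the classified figures.

```python
# Back-of-the-envelope scale check (illustrative numbers, not the classified
# figures): a few billion records a day adds up to trillions of rows once
# several years of history are kept live for querying.
records_per_day = 3e9      # on the order of the ~3 billion signals cited earlier
years_retained = 5         # hypothetical retention window

total_rows = records_per_day * 365 * years_retained
print(f"{records_per_day * 365 / 1e12:.1f} trillion rows per year")   # ~1.1 trillion
print(f"{total_rows / 1e12:.1f} trillion rows kept live")             # ~5.5 trillion
```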
And, of course, there will be an AI/ML angle to the applications that run atop the new Kinetica system, according to Dan Nidess, AI/ML portfolio manager at DIU.
“Put yourself in the shoes of an operator who is tasked with observing the airspace in North America,” says Nidess. “Obviously, you care about what is in the air at any given time, and the first thing you can do for that operator is make it really, really clear what is and is not in the air. Moreover, over the last several decades, the number of different data sources that are available to operators has continued to grow, including more and more unclassified FAA radars as well as bespoke classified systems. The operator has to take out all of the noise and figure out what’s an actual entity, and of these thousands of entities figure out what is potentially a threat.”
The first problem that the Air Force had, according to Nidess, is that any given mission system could not ingest all of this telemetry, and that has meant making decisions about what to display and what not to display to the operators. And if they see something in one system, they have to look into another one to get another view of the situation and do their own correlations. Figure it out. Is this a bird, is this a plane, or is this Superman? Operators are spending more time trying to figure out what is a real entity and what is noise, and not enough time reckoning if the real stuff is an inbound bogey.
All of this is being fed by a mix of different classified on-premises systems that have various data sources. The prototype system used only unclassified data sources to test out the ideas, but the contract was transitioned from DIU to a production environment last September, which we told Nidess we suspected would be a mix of on-premises systems – perhaps even using AWS Outposts – and AWS GovCloud instances, or even instances running on other secure partitions of the public clouds, say at Google Cloud or Microsoft Azure.
The Kinetica system that the Air Force is installing will have petabytes of data – which is a lot for a database – and Amit Vij, co-founder and president of Kinetica, tells The Next Platform that with the setup costing on the order of one-tenth that of a Spark or Hadoop stack while delivering better functionality, the choice for the US Air Force was fairly easy. (We really want to know which other vendor made it to the final bakeoff.)
The exact size and configuration of the data warehouses based on the Kinetica stack are not known, but generally speaking, there should be many hundreds of nodes per environment and multiple environments. And it will expand from there as new data sources and new capabilities are added atop the streaming data warehouse. The Air Force is not disclosing the architecture specifically, but it is reasonable to expect that the system will use a mix of CPU instances with fat vector engines for some of the database processing and GPU instances for even peppier processing as well as visualization.
While $100 million sounds like a lot, it really isn’t if you do the math. The contract works out to $20 million per year over the five years. If you bought a single EC2 P3 instance – take the fat p3dn.24xlarge Cadillac instance, which has eight Tesla V100 GPUs with 256 GB of HBM2 memory cross-coupled with NVLink plus 96 vCPUs on two Xeon SP processors with 768 GB of main memory – it would cost about $84,500 to run it for a year under a three-year reserved instance type. That works out to about 240 nodes in total for that $20 million per year. Less beefy nodes cost a little less per unit of compute, but then you have the noisy neighbor problem, which we are sure NORAD and US Northern Command will not tolerate. Now, if the GPU nodes are a smaller portion of the total compute, then there will obviously be more nodes for the same money, but potentially a lot less compute and a lot more main memory.
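Here is that node count arithmetic spelled out. It assumes, hypothetically, that the entire budget went to p3dn.24xlarge instances at the reserved price cited above, which is certainly not how the actual system is configured.

```python
# Node count arithmetic, assuming (hypothetically) the whole budget bought
# p3dn.24xlarge instances at the reserved price cited above.
contract_total = 100_000_000        # dollars over five years
years = 5
cost_per_node_year = 84_500         # approx. annual cost of a three-year reserved p3dn.24xlarge

annual_budget = contract_total / years                 # $20 million per year
nodes = annual_budget / cost_per_node_year
print(f"~{nodes:.0f} p3dn.24xlarge nodes per year")    # roughly 237 nodes
```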
It would be interesting to see how this is really architected and what the price/performance is of running Kinetica’s database on CPUs versus GPUs. We’re going to have a chat with Vij and get an update on the Kinetica Streaming Data Warehouse and how you play these two compute engines off each other and run them in a hybrid mode. A lot has changed since we drilled down into the Kinetica database architecture way back in September 2016.
As someone who’s worked with Kinetica systems and big data, I’m a bit concerned. There are definitely issues with scalability and performance, and there are better alternatives out there.
The US government is usually a decade behind on stable tech. They should be helping push the envelope when possible.
That’s interesting. I’ve also been using Kinetica for use cases like this and have found it exemplary. In our experience it’s actually quite a bit ahead of the field; you should probably try again.
Kinetica is one of the few technologies out there (and we have tried many for our financial firm) that can actually scale to trillion-row datasets and produce fast results.
Who else has trillion-row capabilities with the fused value-add of real-time, geospatial, and operationalized AI/ML on an analytics platform?