Back in March, we introduced Rex Computing, a chip upstart taking aim at the efficiency of future exascale systems. The young company is now armed with $1.25 million to hire a few more engineers to move the Neo chips from concept to production, and it also has a sizable DARPA contract to match the early interest it found with select national labs in the U.S.
For background on the architecture and, to a lesser extent, the company’s founder (who is not yet twenty), see our initial overview of Rex Computing. Since that piece, founder Thomas Sohmers and his small team have been locking down the architecture with the aim of finishing the final verified RTL by the end of this year. Rex Computing will sample its first chips in the middle of next year and will move to full production silicon in mid-2017 using TSMC’s 28 nanometer process.
In addition to the young company’s first round of financing, Rex Computing has also secured close to $100,000 in DARPA funds. The full description can be found midway down this DARPA document under “Programming New Computers,” and the award has, according to Sohmers, been instrumental as the team starts down the verification and early tape-out process for the Neo chips. The funding is designed to target the automatic scratchpad memory tools, which, according to Sohmers, are the “difficult part and where this approach might succeed where others have failed is the static compilation analysis technology at runtime.”
As a reminder, this automated data placement is a key differentiator in the Neo design. From the user perspective, using Neo will feel like tapping a cache-based system, but without all the area and power overhead. Rex’s goal is to strip unnecessary complexity out of the on-processor memory system and put that work into the compiler instead. All of this happens at compile time, so it does not add complexity to the program itself. The compiler understands where data will need to be at different points in the program and places it there, instead of leaving it in DRAM and letting the chip’s memory management units fetch it, when needed, in an inefficient big handful that includes data that will likely never be used.
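To make the idea concrete, here is a minimal C sketch of the scratchpad model (purely illustrative; Rex has not published its API, and the explicit copy below is written by hand where Neo’s compiler would insert it automatically):

```c
/* Illustrative only -- not Rex's actual API or ISA. On a cache-based
   chip, a loop over DRAM data pulls it through the cache hierarchy
   implicitly. In a scratchpad design, the data a loop will need is
   staged into fast local memory first, and only that data moves. */
#include <stddef.h>
#include <string.h>

#define TILE 256
static double scratch[TILE];   /* stand-in for an on-core scratchpad */

double sum_tile(const double *dram_buf, size_t n) {
    /* The "compiler-inserted" copy: move exactly the data that will
       be used, in one predictable transfer. */
    size_t count = n < TILE ? n : TILE;
    memcpy(scratch, dram_buf, count * sizeof(double));

    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += scratch[i];     /* all operands now sit in local memory */
    return sum;
}
```

Because the copy is explicit and sized exactly, there is no speculative line fill and no tag or coherence machinery burning energy on data that is never touched.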
“It takes 4200 picojoules to move 64 bits from DRAM to registers while it only takes 100 picojoules to do a double-precision floating point operation. It’s over 40x more energy to move the data than to actually operate on it. What most people would assume is that most of that 4200 picojoules is being used in going off-chip, but in reality, about 60% of that energy usage is being consumed by the on-chip cache hierarchy because of all of the extra gates and wires on the chip that the electrons go through. We are removing that 60%.”
Rex Computing is working under a tight schedule with a very small team. Sohmers tells The Next Platform that they are hiring engineers to bring the company to a total of seven people. They have already created the instruction set architecture and the basic core chip design, but as the year moves on, they will push through functional verification and ensure that the ISA is optimized for the applications they are targeting and free from logical inconsistencies and other potential problems.
At the same time, they are taking the functional, logical idea of their architecture and implementing it in actual hardware via the UC Berkeley-developed Chisel hardware description language. “There is a traditional flow then where we have RTL engineers writing the RTL based on what our functional model is. We will then take that, hand it over to the VLSI engineers to do the physical part using EDA tools and start placing the gates and components of the chip on the physical space.” He says the software tools available for this part of the process are abysmal: “it’s very time consuming, even if they can do some of the things automatically.” From this point, the small team will pass it over to a verification engineer to run on an FPGA or on their C++ functional simulator, then put the physical design through a circuit-level simulator like SPICE or tools from Cadence and Synopsys.
This is all very ambitious. “When Intel does this, they have 300 or more people on many teams over 18 months. We’re doing it with five on a tight schedule,” Sohmers said. Although this is a qualified comparison, since Haswell chips, for instance, are far larger and not on the same playing field functionally as Neo, the point stands: the small team is in for a sleepless 2015. But when you’re young, driven, and set with a potential market for a product—what’s a little lost sleep?
Sohmers says Rex Computing has had to change its public-facing approach since we spoke earlier in the year to include other markets beyond supercomputing. While there is interest from national labs, including Sandia, where Jim Ang, who runs the new architectures group, is working with Sohmers and his small team on modeling tools for the Neo architecture, Sohmers has to be more public about potential telco, embedded, and other use cases.
“One thing we’ve had to do to get funding is to pivot our public face by not using the word supercomputing or HPC so much. It’s basically a cursed word in the Silicon Valley investor community.”
He says that while there are plenty of investors who understand the value of high performance computing technologies, the term HPC is still problematic. The investor community wants companies that can target much larger markets, and even investors who see that HPC has something to offer tend to balk at the label—an interesting note for young companies looking for funds. “In this age of social networks and messaging apps being the big thing in Silicon Valley, it’s almost impossible to get funded if you’re pitching something for the big iron systems,” he explains.
Even with funding, this is a risky venture. “The cost for us going to TSMC and getting 100 chips back is, after you include the packaging and just getting the dies to our door, around $250,000. They sell them in blocks with shared costs of the mask among other companies, which is how we’re getting our first prototypes made.” Before that, the other big cost is EDA tools: a single seat for Cadence or Synopsys software runs into the several hundreds of thousands of dollars, even when you’re a startup, he says.
The Unum approach by John Gustafson promises even better energy savings, among other things. At ISC we heard that someone in Singapore (if I remember correctly) is doing unum hardware prototypes in FPGA. Addison Snell should have the contact.
Can you do some research on that? It should be interesting.
Yes, we are researching Unum (as part of our DARPA SBIR) along with John Gustafson (who is one of our advisors).
Rex’s goal is to remove unnecessary complexity … and put that into the compiler instead. All of this happens at compile time
This reminds me of what Intel & HP tried with the (failed) Itanium. Compiler writers just aren’t smart enough, and targeting a compiler to a *specific* hardware implementation makes upgrades pretty difficult.
That’s part of the reason why we have our own compiler engineers 🙂
LLVM is finally a mature enough compiler that we are able to make these additions and, with some particulars of our architecture that we are not talking about publicly yet, are able to make VLIW actually work this time. While I know that has been said before, we have very good reason to believe it… all I will say is that when you can make hardware-based guarantees about the time it takes to move things in memory, your compiler has a lot more information to make good decisions with. The other big thing is that our architecture is a hell of a lot simpler (very small pipeline, simple in-order execution units, etc.), which makes those scheduling predictions much, much cleaner and easier, even before you factor in our very nice memory and network-on-chip systems.
Sounds like an interesting approach. I hope your team is successful.
Interesting approach. Love to see innovative ideas coming to fruition. But, why a new ISA? Wouldn’t it be safer to use an existing ISA, like POWER or ARM, even if it’s a subset? Or, does this approach require a specific ISA design?
Couple of reasons…
1. ARMv8 sucks, and I say that as a moderate fan of ARMv7. I’ve lately been calling it “x86 v2” in that they have started to become just another bloated, out-of-order (which I say is a very bad thing if you have real ILP and DLP), microcoded core. The ISA is getting humongous a la x86 (although they haven’t added the kitchen sink quite yet like Intel has), and all of that complexity (and ARM’s requirement that you be fully compliant) really restricts efficiency when trying to target performance. Plus, their floating point performance (and with ARMv8 you are forced to use Neon or their mandatory floating point “extension”) is abysmal. On top of all of that, you have to hand ARM a fat wad of cash and in many cases a royalty for an architecture I don’t even like. Instead, we spent a good 12 months before we raised any money and made one we think is significantly better for the tasks we are going after.
2. POWER, while a decent general purpose architecture, is still way too bulky. I also didn’t want to jump on the bandwagon of an architecture that I think will finally get a nail in its coffin in ~5 years. When it comes to “open” POWER, it is “open” in name only: in order to actually make use of the ISA and design your own cores, you have to join their foundation at the deepest (most expensive) level, which they do not want anyone but their friends to be part of. Beyond all of that, even if we were invited into their club and we forget about the technical limitations, all of their press since the announcement has been about benevolent IBM “opening” an architecture, which is complete bullshit and really turns me off.
3. What should be asked is why we didn’t go with a truly open architecture, such as RISC-V (http://riscv.org). While I find it personally interesting and a cool project for spawning off things like lowRISC, it really is not optimized for being both power efficient and performant, and I think their instruction encoding and general format kind of sucks (for our needs). I do still like the ASPIRE lab, as they gave us Chisel, the HDL that we use.
With all of that being said, we came to the conclusion that it was not all that difficult to define our own ISA that actually checked all the boxes we needed, and have it not really matter for end users. All we have to do is build the backend for LLVM, and anything you can run on x86, ARM, power, etc. that compiles with Clang or another LLVM frontend works… non optimized by default, but that’s where our heavy lifting (and DARPA grant focus) comes in, which is the optimization for our scratchpad memory system.
An interesting project, though to me it’s not obvious why a new processor architecture is required. Maybe it will help avoid high licensing costs, but it might be difficult to design around the numerous patents in the area of processor design. Obviously, a startup company has to be quite careful not to leak too many details, but the article really leaves a large number of questions open.
One other thing that confuses me is the DARPA grant, which specifically funds the development of scratchpad memory (SPM) performance and energy consumption optimizations. These are well known in the embedded systems area for more than ten years (see, e.g., R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, P. Marwedel: “Scratchpad memory: design alternative for cache on-chip memory in embedded systems”. CODES 2002: 73-78); for HPC applications, similar instruction and data SPM allocation optimizations might work quite well due to the fact that the application is known beforehand.
The article mentions “static compilation analysis technology at runtime”, which seems to be a contradiction in itself. The common definition of static analysis is analysis taking place at (offline) compile time, so this sounds like a sort of JIT or ahead-of-time compilation technique. It would be interesting to hear more about this…
Read my response above which covers the ISA questions.
In regards to the DARPA grant, you are totally correct that scratchpad memory has existed for a long time… it predates hardware-managed caches, of course; we just didn’t call them scratchpads back then. Our uniqueness is in what we are not saying, which is some real magic melding the scratchpads, the network on chip, and the compiler. As for the “static compilation analysis,” it’s a misquote in the article… what it should say is that we are doing static analysis techniques at compile time, with some additional runtime profiling that you can see in the software development chart on our website. We aren’t publicly talking about it in a lot of detail just yet, but we will by the end of the year. Our techniques can be used for both JIT and AOT compilation; we have been mostly focusing on AOT as of late, but have not excluded JIT from our plans. Make sure to sign up for our mailing list on our website to get that info as soon as it comes out.
Some may say that we have too many architectures. Some may say we don’t have enough. And while the details would go completely over my head, I understand enough of what you are describing to see a nice evolutionary jump in handling one of the most power-hungry components of a general purpose CPU by attacking it on two fronts. By making the software more efficient (compiler efficiency), you can make the CPU more efficient. I always thought that the compiler speed wars of the ’80s and ’90s were one of the dumbest things our industry ever did (other than the original i8088 decision). I really don’t care how long it takes to compile – I care about how efficient the resulting object code is. For DARPA, when you have to carry everything on your back, every watt counts. Getting 40, 50, 60% more efficiency out of the CPU means 40, 50, 60% longer battery life, or less weight for our boots-on-the-ground to carry.
Good luck to you and your team.
And how is your approach different from the Epiphany architecture? Your pitch sounds very much like Olofsson’s. OK, he has gcc instead of LLVM, but the arch seems similar.
Targeting Apple instead of IBM, smart choice…