Talking Databases With Hadoop Creator Doug Cutting

The list of technologies that have been created because of the limitations of traditional, enterprise-grade relational databases is quite long, and yet these tried and true databases persist at the heart of the modern enterprise. But they are increasingly sharing space in the datacenter with myriad databases and data stores that try to get around performance or scale limitations to solve particularly pesky problems.

Doug Cutting, the creator of the open source Hadoop platform that mimicked Google’s eponymous file system and MapReduce data chunking and chewing system, ran up against such limitations while working at Yahoo, and did something about it that, in the end, helped spawn a new industry and solve some big problems. As chief architect at Cloudera, which is the largest commercial distributor of the Hadoop platform and soon to be even larger once it acquires rival Hortonworks, Cutting is well aware of where the Hadoop platform fits today and where it can be expanded in the future. The Next Platform recently had a chat with Cutting about the possibilities.

Timothy Prickett Morgan: I am dying to know what happens next with Hadoop. Platforms come and go. What do you do for an encore after Hadoop? I mean, we are The Next Platform, so it is kind of in the job title to think about what follows.

I just spent a week at the Supercomputing 2018 conference, where the hot topic of conversation was a confluence of traditional simulation and modeling with machine learning. The machine learning is either part of the workflow or embedded in it as part of the ensembles, whether that means generating training sets from simulation and propagating that back into the simulation to figure out its next step, or using machine learning to prepare datasets and choose algorithms, or portions of algorithms, to run.

Good platforms are malleable enough to absorb new technologies, and Hadoop, which started out as a batch processing engine, is a good example as Spark in-memory processing and stream processing were added with machine learning on top and as traditional SQL database queries were supported and as different file systems were rolled in. All kinds of things have come and gone into the Hadoop platform. I don’t even know if you can call it Hadoop anymore. It is intriguing to consider how these platforms might further converge, or maybe they diverge or stay different forever until a whole new thing comes along, like Hadoop, when you did it. It will be interesting to see if there can ever be just one analytics platform, but if you watch the database market, the evidence suggests that every time you think you’ve got the database that can do everything, researchers with new databases spring up like mushrooms. It’s really hard to get that kind of a uniformity of platform.

Doug Cutting: The way that I think about it, we are seeing more and more capabilities added. I think of it as this loosely coupled ecosystem of primarily open source projects that benefit from interoperating as new things come along. They interoperate with the older things, and some of the older things become no longer very interesting when something better comes along. Some things retain their utility.

So it’s mostly the case that people are getting more tools and at a faster and faster pace than they have in the past because the innovation in open source is driven more directly by users. Many of the new technologies come out of people’s frustration with the available tools and they see how they can add a new tool that builds on existing things and gives them what they need. And so you’re not necessarily having solutions that are instigated and promoted by vendors, so we get this rapid generation of new tools that somebody found useful enough to create. They put it out there and other people can try it and see who else finds it useful. When there are a lot of folks who find it useful, then open source vendors like ourselves begin to support it. We’ve seen this happen with Spark, with Kafka, with a number of things now. As for the Hadoop stack, I think it is less likely to be fundamentally disrupted. We have got this new style of ecosystem going, and I think that was the fundamental disruption. It was simply that it was no longer everything built around a relational database management system that was tightly controlled by a couple of vendors.

TPM: The question I have is this: Will we have a proliferation of platforms, a kind of Cambrian explosion? People can assemble different platforms from open source parts. The Cloudera stack itself has so many things in it.

Doug Cutting: The reason is simple: There isn’t one problem. There are lots of different problems that people need to solve. I think this was one of the reasons that the relational database was so ripe for disruption. I had this argument with someone the other day that the relational database didn’t solve all these problems very effectively. And so I think that most of these open source projects really fill valid needs. Companies have some problems that really require streaming. Companies have some problems that really require batch. Companies have things that require machine learning. Companies require SQL or interactive full text search. Those all take different technical solutions and it’s nice when they can share datasets, and they can share as much as is reasonable.

But they are not going to share everything because they are fundamentally different kinds of tasks. I also think that trying to coalesce everything into a single platform would set you up for failure and that eventually there would be something that someone would need to do which wouldn’t fit into that platform and people would have to reject it and move on to some different thing. At the rate we’re seeing new kinds of problems and new kinds of applications show up, I don’t think we ought to be standardizing on a single platform. I think it’s premature. What we are seeing is this rapid evolution, and having all these different options really fuels the productivity of folks. So it is a little messy, but I mean I think we are changing quickly, trying lots of things. Does that make sense?

TPM: It makes perfect sense. The number of relational databases in the enterprise is large, but they’re not particularly big in size, meaning capacity, individually. You are talking about hundreds of gigabytes or tens of terabytes for most databases. This is not huge. Most data warehouses aren’t particularly large, but companies have a lot of them lying around. Is there a crossover point at which the SQL layers on top of a platform like Hadoop, such as the Impala layer in the Cloudera distribution of Hadoop, are good enough that they can do that transaction processing work and customers can get off of Oracle or DB2 or SQL Server?

Doug Cutting: There are folks doing that. I know a couple of years ago the New York Stock Exchange said it had pulled out its last Netezza data warehouse and was replacing everything with the Cloudera stack. And you know Amazon talks about how it is pulling Oracle out and gradually using its own in-house databases. They are not using Hadoop but they’re using similarly architected systems. I definitely see this happening, but it’s not easy. The New York Stock Exchange said it was hard, and the fact that it’s taken Amazon so long shows that it’s hard.

Part of it is that you get a culture, an organization built around a particular system and managing it; all of the people have all this knowhow and process relating to the platform, and if you switch the technology platform it’s not usually just as easy as, you know, silently switching and not telling anybody and everything works the same. A lot of people’s jobs change, a lot of people’s duties change along with the system architectures, maybe for the better in the long term, but it is still difficult. Change is hard and even if you are going to consolidate, for example, a bunch of different databases onto a single platform, you may not entirely replicate the old architecture on the new system. You may not want to entirely replicate the list of who had access to what and how it was administered and all those sorts of things. And there are lots of little bits of the actual process that I think are embedded in that discussion. If a certain person was the only person who could add data or add columns or do whatever, then they had a particular journey and a particular way of doing that and this is going to change and they may or may not be the person who is doing that particular step in the process anymore anyway.

TPM: The problem gets even hairier, right, because if you look at the way that database vendors – and I think IBM and Oracle are the most guilty of this – used stored procedures and triggers, which embed key parts of what would normally be your application in the database itself, they have an even tougher time moving because there are no standards for triggers and stored procedures.

But if you guys could actually skin Oracle or DB2 or SQL Server and get reasonable performance, that would be interesting indeed. And would the transaction processing on these databases even be 10 percent of the aggregate workload? If there’s one place where I think convergence is possible, it is across these databases, data stores and data warehouses. I just know that the stickiest thing is the database, and I wonder whether the multitudes of the enterprise – not the New York Stock Exchange with lots of techies to figure things out, but the tens of millions of other companies or tens of thousands of large enterprises – could get to this point if the technology gets there. I don’t think we are there yet. Correct me by all means if I’m wrong, but these SQL layers don’t have anything close to the same performance as some of the relational databases for transaction processing. For analytics they certainly do. And they’re not compatible enough with those relational databases, either.

Doug Cutting: I think you are right. Certainly in transaction processing, the features are not all there. In analytics, I think they are, but not in 100 percent compatible ways, so moving things is, as we just talked about, work. But here’s the question: Do people predominantly need new transaction systems, or do we need new analytics systems? Customers tell us they need new analytics systems predominantly. That said, a company would like to save on all of the expense of these relational databases.

TPM: That’s my point. If you did it right, you could use the money you saved from all of that Oracle licensing and DB2 licensing and SQL Server licensing – and think about how expensive DB2 and CICS are on the mainframe – and use all of that money to fund a brand new analytics system that can also absorb that transaction processing workload. I think that’s the trick. Maybe.

Doug Cutting: I hesitate to prognosticate too much or too far out about this kind of thing. Cloudera is now ten years old, and ten years ago we went after customers who were few and bold and adventurous and they were trying to do things that they couldn’t do any other way. And over time, we were able to add more and more functions to replace some things they could do on older platforms with things that scale better and interoperate better. But it’s still much easier for the customer and for us if customers are creating these new systems for new workloads than it is to lift up and replace things compatibly. That is sort of a thankless job.

We’re doing that more and more until we work our way across. And the hardest thing is going to be transaction processing. It’s also, in some sense, the thing with the least demand, and it is not where the growth in data systems predominates. Maybe it’s a way for companies to reach some cost savings and simplification. So that’s a good thing. And I think over time that will be the sort of thing that we can move towards. We have only really started to challenge traditional data warehouses directly in the last year or so, where we can entirely replace existing data warehouse technology.

TPM: What’s the difference now? We have been talking about that as a possibility for a long time and Impala has been out for longer than a year. So what is the deciding factor in the last year that made that possible?

Doug Cutting: It takes a while. It takes the maturity of Impala, the maturity of the whole toolset, and we finally feel like we’ve got a complete set of features that folks need to implement a data warehouse and we don’t have to say you can only move 90 percent of the data warehouse workload. We now think we can support all of it. And we don’t think we, as an open source community, are anywhere close to that with transaction processing systems. There are some kinds of transactions I think our system could replace, but there are also a lot that can’t.

TPM: What is the cost delta when you do a replacement of, say, a Teradata setup? I realize that I am being vague, but Mike Olson used to talk about how the differential cost of storage alone helps pay for the whole Hadoop stack and then some. Does the cost savings over a traditional data warehouse coerce enterprises at this point, or is something different driving adoption? How much of it is cost savings and how much of it is just having something that’s more modern and scalable or whatever?

Doug Cutting: It is different in each case, but generally I think there is substantial cost savings. Oftentimes we have seen 10X lower costs. I talked to one customer who said it was over 100X cheaper, which was pretty amazing. So that can be a big deal. But there is also scalability, which is another side of costs. If it’s 10X cheaper then you could store 5X more stuff and still halve your costs. So you could have five years of data in your data warehouse instead of one, or five months instead of one month, and so on. This lets you do more, lets you understand more. Or conversely or maybe complementarily, if you have got new product lines, new sources of data, like from mobile products or some new IoT thing, you can pour that into the data warehouse. If you look at what it would cost with the previous technology, it would be outrageous. Now you can store that stuff and take advantage of it, too, in your analytics. So I think as businesses change, they are becoming more digital, they are generating more and more data and so their needs for a data warehouse are more intensive and they don’t want the percentage of their IT costs for paying for that data warehouse to keep rising. They’d like to keep those down.
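The arithmetic behind the "10X cheaper, 5X more data, half the cost" remark can be checked with a small sketch. It assumes, as a simplification not stated in the interview, that total cost scales linearly with the volume of data stored; the function name `relative_cost` and its parameters are hypothetical and used only for illustration.

```python
def relative_cost(unit_cost_reduction: float, data_growth: float) -> float:
    """Return the new total cost as a fraction of the old total cost.

    Assumes cost is proportional to data volume (an illustrative
    simplification, not a claim about any real pricing model).
    """
    return data_growth / unit_cost_reduction


# 10X cheaper per unit stored, 5X more data kept in the warehouse.
ratio = relative_cost(unit_cost_reduction=10.0, data_growth=5.0)
print(f"New cost is {ratio:.0%} of the old cost")
```

Under that linear assumption, the same function shows the break-even point: keeping 10X more data on a platform that is 10X cheaper per unit would cost the same as before.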

TPM: What is it that you need from the industry to keep that cost curve coming down? There is going to be such a cornucopia of open source projects that you can always plug new things into the Hadoop platform. You know, it’s almost like the Borg – whatever is useful, I can use. And that’s a good thing about it. But you might also need changes in processing, memory, flash, disk, and interconnects. So what are the limiting factors, as far as you are concerned, on the ability of this Hadoop platform to scale in terms of capability and to keep that cost curve coming down? As far as I can tell, Moore’s Law is drying up. Processors are going to start getting more expensive, not less expensive. You know memory and flash are already more expensive than they were two years ago, although the prices are abating a little bit. But I think that the memory makers like having high profits and I don’t think they are so eager to build fabs to the extent that they cause a crash in their own market. So, what things are absolutely out of your control but are on your wish list?

Doug Cutting: That is a good question. We have always been consumers of hardware, taking advantage of whatever is there and engineering systems to leverage those economies.

I believe that even if everything were to stay the same from the hardware and software standpoint, most companies – this sort of goes back to what we talked about earlier – are not making the most of the data they have. They could invest at current levels with current software and hardware systems and build better software to advance their businesses for decades. We are under-automated, and the rate at which we have gotten software tools and hardware availability has vastly outstripped our ability to create the industry-specific systems that really take advantage of it.

If you look at most of what industry does, they are using systems that were built decades ago that are very inefficient and they are not really operating in the streamlined ways that you see a brand new company like an Uber or a Google or a Tesla doing. Some of these companies have built themselves with digital systems at their heart from the outset, and they are a little closer to running themselves optimally. But most existing Fortune 500 companies, even smaller companies, are nowhere near that. There are phenomenal efficiencies to be gained. You talked about having 500 or 5,000 different databases in a big company – that’s horribly inefficient. And probably the functions that these databases are doing are themselves inefficient. I think there’s a lot of opportunity, even without new hardware advances. I also think Moore’s Law is dead, long live Moore’s Law – we’re going to continue to see hardware advance.
