In the IT world, solving one gnarly problem can lead to greater complexities down the road. Each incremental step can make things a bit better but can also just as easily expose discontinuities in scale between different parts of a platform.
Storage techies have had to wrestle with this issue for the past decade, as a deluge of structured, unstructured, and semi-structured data has needed to be collected, stored, accessed, processed, and analyzed.
When data first started getting big, companies used data warehouses to store structured data, such as credit card, sales, and demographic information, which doesn’t change all that often, sits in a standardized format with rows and columns, and is easily accessible. Then came the rise of unstructured data, which accounts for almost 90 percent of all of today’s data, does not fit easily and searchably into a spreadsheet, and includes everything from images and videos to emails to telemetry data from IoT devices and other machine-generated information. In response, enterprises constructed data lakes, which can store more data and do so at a much lower cost than traditional block and file storage while also supporting complex queries against that data.
However, despite their benefits, data lakes come with their own issues, and we have written quite a bit about how vendors like Dell EMC, Hewlett Packard Enterprise, Pure Storage, and Hitachi Vantara are trying to help organizations get control of unstructured data. As the amount of data grows, so does the complexity, particularly when AI and machine learning are thrown into the mix. Concurrency, performance, data reliability, and data management all remain problematic.
Vinoth Chandar saw all this as a senior staff engineer and manager at fast-growing Uber from 2014 to 2019.
“We had a warehouse which had all the different advanced transaction capabilities and whatnot, but we couldn’t store all of our data in that,” Chandar tells The Next Platform. “We had a lake which can actually store a lot of the data and maybe even do a lot of high-scale data processing, but the data management of these features weren’t there.”
For example, Uber had a central database holding much of the ride-sharing company’s data regarding trips and, given the rapidly changing nature of the trips due to such issues as weather, Uber needed to be able to quickly replicate trips that are happening to its warehouse and leverage dashboards and queries on top of it.
“There are thousands of people running Uber who are operating in cities need that data to be able to make decisions on the ground,” he says. “We had volumes on the trips data that we could no longer fit in the warehouse that we were using, at least not in any cost-effective way. We needed to return to a lake, but the lake had zero ability to do updates. I couldn’t take trip changes in an upstream database and apply it to the lake directly. The lake can only store files and then I can write files and read files. There’s no intelligence to be able to absorb updates.”
At Uber, engineers were looking for the performance and speed of databases and the scale of data lakes to enable them to run their AI and machine learning workloads and reach near real-time capabilities for everything from estimated arrival times for vehicles to food recommendations. In 2016, Chandar created an open-source tool called Hudi to address such problems by essentially bringing database and data warehouse functionalities to data lakes, meshing speed with scale. That helped to mark out an area in the data storage and management space for what are now known as lakehouses, which can manage structured, unstructured and semi-structured data and can run on cloud storage offerings, all built on an open architecture.
A number of vendors are developing lakehouse software, from Amazon Web Services and Snowflake to Databricks and Microsoft Azure. With Hudi, data lakes gained such database features as transactions, updates and indexing. In 2019, Uber donated Hudi to the Apache Software Foundation, where the project has grown seven-fold over two years to almost a million monthly downloads, with an array of organizations making contributions. The technology has been adopted by such major enterprises as AWS, Walmart, GE Aviation, and Disney+Hotstar.
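To make "updates and indexing" on a lake a little more concrete, here is a minimal Python sketch of the general idea: files are treated as immutable, an index maps each record key to the file that holds it, and an upsert rewrites only the files that contain changed keys. This is a toy illustration and not Hudi's API or implementation; the class name `ToyLakeTable`, the `trip_id` key, and the file naming scheme are all assumptions made for the example.

```python
from typing import Dict, List

Record = Dict[str, object]


class ToyLakeTable:
    """Hypothetical lake table: immutable 'files' plus a key -> file index."""

    def __init__(self) -> None:
        self.files: Dict[str, List[Record]] = {}  # file name -> records
        self.index: Dict[str, str] = {}           # record key -> file name
        self.version = 0                          # bumped once per upsert
        self.next_file_id = 0                     # used to name new files

    def upsert(self, batch: List[Record], key: str = "trip_id") -> List[str]:
        """Apply inserts and updates; return the names of the files written."""
        self.version += 1
        updates: Dict[str, Dict[str, Record]] = {}  # file -> {key: new record}
        inserts: Dict[str, Record] = {}             # key -> new record
        for rec in batch:
            k = str(rec[key])
            if k in self.index:
                # The index tells us which file holds the old version.
                updates.setdefault(self.index[k], {})[k] = rec
            else:
                inserts[k] = rec

        written: List[str] = []
        # Rewrite only the files that contain updated keys (copy-on-write).
        for old_name, changed in updates.items():
            merged = [changed.get(str(r[key]), r) for r in self.files[old_name]]
            new_name = f"{old_name.split('@')[0]}@v{self.version}"
            del self.files[old_name]
            self.files[new_name] = merged
            for r in merged:
                self.index[str(r[key])] = new_name
            written.append(new_name)

        # Brand-new keys go into a fresh file.
        if inserts:
            new_name = f"part-{self.next_file_id}@v{self.version}"
            self.next_file_id += 1
            self.files[new_name] = list(inserts.values())
            for k in inserts:
                self.index[k] = new_name
            written.append(new_name)
        return written

    def read(self) -> List[Record]:
        """Return the current version of every record."""
        return [r for records in self.files.values() for r in records]


table = ToyLakeTable()
table.upsert([{"trip_id": 1, "status": "requested"},
              {"trip_id": 2, "status": "requested"}])
written = table.upsert([{"trip_id": 1, "status": "completed"},
                        {"trip_id": 3, "status": "requested"}])
```

In this sketch the second call rewrites the one file that holds trip 1 and adds one new file for trip 3, rather than rewriting the whole table. A real system also has to handle concurrent writers, failures partway through a write, and far larger indexes, none of which is modeled here.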
And there’s the rub. For organizations with the financial means and deep resources, adopting Hudi to help create databases and data lakes that can be converged into lakehouses is within reach. However, that leaves out the other companies that have fewer resources, less money and a much smaller and less-skilled IT staff. It can take organizations six months to a year to build and train a team on myriad open-source technologies.
“There’s been steadily growing a community around Hudi for four years now,” says Chandar, who still leads the project. “But the core problem of using this technology foundation is it’s still too hard to build data lakes. There are no good managed data lake-as-a-service offerings out there. We routinely find that people come to projects like Hudi to be able to make their lives better or reduce costs or other solutions that they’re using to kick start their data science efforts, but it takes them a while.”
After leaving Uber, Chandar spent more than a year working at Confluent and mapping out his next venture. This month, his new company, Onehouse, emerged from stealth and also announced $8 million in seed funding from venture capital firms Greylock Partners and Addition. Onehouse, with just more than a dozen employees, offers cloud-native managed lakehouse services built atop the Hudi project. Enterprises can use the services to automatically ingest, manage and optimize their data for faster processing, and the services work with any open source query engine (such as Trino, formerly known as Presto), data format, or table format.
Onehouse’s services enable organizations of any size to build data lakes within minutes, save money and retain control of their data. Chandar sees Hudi’s incremental storage and processing and its incremental pipeline features pushing aside traditional batch processing.
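The contrast between incremental and batch processing can be sketched in a few lines of Python. The example below is a toy model, not Hudi's API: it assumes a hypothetical timeline of commits, each carrying the records that changed, and compares rebuilding a derived table from scratch with applying only the commits that arrived after a saved checkpoint. The function names and the `trip_id` and `fare` fields are invented for illustration.

```python
from typing import Dict, List, Tuple

# A commit is (commit_time, records changed in that commit); assumed shape.
Commit = Tuple[int, List[dict]]


def batch_rebuild(timeline: List[Commit]) -> Dict[int, float]:
    """Batch style: recompute the derived table from every commit."""
    derived: Dict[int, float] = {}
    for _, records in timeline:
        for rec in records:
            derived[rec["trip_id"]] = rec["fare"]
    return derived


def incremental_apply(derived: Dict[int, float],
                      timeline: List[Commit],
                      checkpoint: int) -> Tuple[int, int]:
    """Incremental style: apply only commits newer than the checkpoint.

    Assumes the timeline is ordered by commit time.
    Returns (new_checkpoint, number_of_records_processed).
    """
    latest = checkpoint
    processed = 0
    for commit_time, records in timeline:
        if commit_time <= checkpoint:
            continue  # already reflected in the derived table
        for rec in records:
            derived[rec["trip_id"]] = rec["fare"]
            processed += 1
        latest = max(latest, commit_time)
    return latest, processed


timeline: List[Commit] = [
    (1, [{"trip_id": 1, "fare": 10.0}, {"trip_id": 2, "fare": 20.0}]),
    (2, [{"trip_id": 1, "fare": 12.5}]),
]
derived: Dict[int, float] = {}
checkpoint, first_run = incremental_apply(derived, timeline, 0)

# A new commit lands; the next run touches only that commit's records.
timeline.append((3, [{"trip_id": 3, "fare": 7.0}]))
checkpoint, second_run = incremental_apply(derived, timeline, checkpoint)
```

Both approaches end with the same derived table, but the incremental run after the third commit processes one record while a batch rebuild would reprocess all four. That difference grows with the size of the table, which is the intuition behind replacing periodic batch jobs with incremental pipelines.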
“We are building a cloud SaaS product,” Chandar says. “Our customers sign up and then they link their cloud accounts to our control plane. None of the data really moves; the customer cloud accounts exist within their storage buckets. We launch services and manage services that can do data management capabilities. We bring it to their data.”
Enterprises connect the services to their cloud accounts and decide what data to organize, and the Onehouse services create a lakehouse environment. They also automate tasks like clustering, caching and scaling metadata. The goal this year is to get initial pilots up and running and to build out the services.
“On a very high level, fully managed data lakes are just getting started,” he says. “The product category is something we are defining here. We hope to be the query-engine-neutral data plane that can manage all of the organization’s data and do a really good job of keeping that interoperable with the new engines that come up and improving performance on all things. We’re not offering a query engine as a service, but we want to squarely focus on creating a new category where we’re focusing on managing the data really well and learning from all the pain points of the last decade around data lakes. The lack of data management is what’s really stopping companies from adopting lakes early on and with a lot more success.”